Vajra (diamond) is a powerful weapon symbolizing indestructibility and the thunderbolt. Here, Vajra is a tool, written in Python, which helps you scale your Hadoop cluster based on demand. CDH is Cloudera's 100% open source platform distribution, including Apache Hadoop, built specifically to meet enterprise demands. Cloudera Manager is a cluster management tool for CDH clusters: it provides an end-to-end solution for provisioning, configuring, managing, and monitoring cluster nodes.
- We can create infrastructure using cloud provisioning tools and provision a cluster easily through Cloudera Manager's web interface or API, but no framework addresses the scalability of cluster nodes. Recently, Cloudera Director addressed the scalability problem in CDH clusters, but it is a bit difficult to use in some respects: we have to maintain consistency between Cloudera Manager and Cloudera Director.
- Cloud vendors offer managed Hadoop services: AWS offers Elastic MapReduce (EMR), Microsoft Azure offers HDInsight, and OpenStack has launched a project called Sahara (Data Processing as a Service) to address Hadoop use cases.
- Here, I have documented the steps to create a Hadoop cluster on top of AWS and configure autoscaling of worker nodes and edge nodes based on demand.
- I have also included scripts to clean up your Hadoop cluster hourly, or at an interval of your choice. A single user can spoil the entire Hadoop cluster if the cluster is not configured correctly, or if a privileged user executes long-running MapReduce or Spark jobs.
Hardware and Software Requirements:
- AWS account with sufficient quota to launch EC2 instances and other resources.
- Provision your cluster resources based on your architecture pattern.
- Install the packages and manage their configuration through Cloudera Manager.
- Create two host templates in Cloudera Manager, say WN (worker nodes) and GN (gateway or edge nodes), with the necessary roles. These templates will be used by the autoscaling group while creating VMs.
- The worker node host template should have roles such as NodeManager and DataNode, plus the Spark gateway configuration if you run Spark on a YARN cluster. The gateway node host template should have roles such as Hue and the client configuration for all Hadoop components: HDFS, MapReduce, Spark, Hive, and Impala.
- Create a repo in CodeCommit or GitHub and keep the code there.
- Prepare a golden image for your gateway and worker nodes.
- Create a launch configuration/launch template from the above golden image and configure the other deployment-related settings. Use the contents of cdh_autoscaling_scaleout_gn.sh and cdh_autoscaling_scaleout_wn.sh as user data for the gateway node and worker node respectively.
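The scale-out scripts themselves aren't reproduced here, but their first-boot job can be sketched in Python. This is a minimal sketch, assuming the Cloudera Manager agent package is already baked into the golden image; the CM hostname is a placeholder.

```python
# Sketch of the first-boot work a scale-out user-data script performs.
# The CM hostname below is an illustrative placeholder; the agent config
# path is the stock location for the Cloudera Manager agent package.
import subprocess

def agent_setup_commands(cm_server):
    """Commands that point the stock CM agent at the manager and restart it."""
    return [
        ["sed", "-i",
         "s/^server_host=.*/server_host={}/".format(cm_server),
         "/etc/cloudera-scm-agent/config.ini"],
        ["service", "cloudera-scm-agent", "restart"],
    ]

def bootstrap(cm_server="cm.internal.example"):
    """Run on first boot; once the agent heartbeats, CM sees a new host."""
    for cmd in agent_setup_commands(cm_server):
        subprocess.check_call(cmd)
```

Once the agent heartbeats, the host appears in Cloudera Manager and the host template can be applied to it.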
- Create autoscaling groups of on-demand instances and spot instances for worker node autoscaling. Worker nodes can withstand EC2 instance failure: jobs will be rescheduled to a healthy host in the cluster, and we can decommission the roles and migrate the required data before the nodes are terminated.
- Create a Lambda function and deploy the cleanup script and autoscaling script.
- Create a CloudWatch event to invoke the Lambda function to perform cluster health checks.
- Configure CloudWatch alarms to perform scale-in and scale-out of your cluster nodes.
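As a sketch of that alarm wiring with boto3: the alarm below would fire a scale-out policy when a custom "available YARN memory" metric stays low. The namespace, metric name, and threshold are assumptions for illustration, not values fixed by the tool.

```python
# Sketch: a CloudWatch alarm that fires a scale-out policy when the
# cluster's available YARN memory (a custom metric) stays low for two
# consecutive one-minute periods. Names and threshold are assumptions.

def scale_out_alarm_params(scaling_policy_arn, threshold_mb=4096):
    """Arguments for cloudwatch.put_metric_alarm(**params)."""
    return {
        "AlarmName": "cdh-scale-out-low-memory",
        "Namespace": "CDH/Cluster",          # assumed custom namespace
        "MetricName": "availableMB",          # assumed metric pushed by the Lambda
        "Statistic": "Average",
        "Period": 60,                         # the Lambda pushes every minute
        "EvaluationPeriods": 2,               # require two consecutive breaches
        "Threshold": threshold_mb,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [scaling_policy_arn],
    }

def create_alarm(scaling_policy_arn):
    """Run once at setup; boto3 is imported here to keep import side-effect free."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(
        **scale_out_alarm_params(scaling_policy_arn))
```

A mirror-image alarm with `GreaterThanThreshold` on pending jobs or used memory would drive the scale-in policy.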
How Vajra works:
- Vajra uses the Cloudera Manager REST API and AWS autoscaling to provision edge nodes and worker nodes dynamically. The YARN cluster exposes a REST API to track metrics such as total memory, available memory, total cores, available cores, the number of MapReduce/Spark jobs running, the number of jobs pending, and the resources consumed by an individual job. Our Lambda function computes the available memory, available cores, and the number of running and pending jobs, and updates these metrics in CloudWatch at one-minute intervals. The Lambda function is invoked by a CloudWatch event.
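A minimal sketch of that computation: the JSON field names below come from the YARN ResourceManager REST API (`/ws/v1/cluster/metrics`), while the ResourceManager address and the CloudWatch namespace are assumptions.

```python
# Sketch of the Lambda's metric computation. Field names follow the YARN
# ResourceManager REST API response from /ws/v1/cluster/metrics.
import json
from urllib.request import urlopen

def summarize_cluster_metrics(payload):
    """Reduce the RM's clusterMetrics JSON to the values worth alarming on."""
    m = payload["clusterMetrics"]
    return {
        "availableMB": m["availableMB"],
        "availableVirtualCores": m["availableVirtualCores"],
        "appsRunning": m["appsRunning"],
        "appsPending": m["appsPending"],
    }

def handler(event, context):
    """Lambda entry point, invoked by the CloudWatch event rule."""
    # Assumption: the RM web address; real code would read it from config.
    rm = "http://resourcemanager.internal:8088"
    payload = json.load(urlopen(rm + "/ws/v1/cluster/metrics"))
    summary = summarize_cluster_metrics(payload)
    import boto3
    boto3.client("cloudwatch").put_metric_data(
        Namespace="CDH/Cluster",  # assumed custom namespace
        MetricData=[{"MetricName": k, "Value": float(v)}
                    for k, v in summary.items()])
    return summary
```

The alarms in the next steps watch exactly these datapoints.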
- We can access the metrics from the Lambda function through API Gateway.
- CloudWatch alarms define the thresholds that trigger scale-in and scale-out of EC2 instances based on the above metrics, and we can define the maximum number of instances in an autoscaling group.
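Whatever the alarms request, the group's bounds win. A tiny helper makes that explicit; the group name and limits below are placeholders.

```python
def clamp_desired_capacity(desired, min_size, max_size):
    """Keep any scaling decision inside the autoscaling group's bounds."""
    return max(min_size, min(desired, max_size))

def set_capacity(desired):
    """Apply a clamped capacity to the worker-node group (names are placeholders)."""
    import boto3
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName="cdh-worker-nodes",
        DesiredCapacity=clamp_desired_capacity(desired, 2, 20))
```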
- User data plays a major role in injecting the bootstrap script when instances are launched by autoscaling. This script installs the packages and registers the instance as a host in the CDH cluster. Once it is successfully registered, we apply a host template to assign the required roles to the host.
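Applying a host template through the CM REST API looks roughly like this. The endpoint follows Cloudera Manager's `applyHostTemplate` command; the cluster name, API version, and credentials are assumptions, and the template name reuses the WN example from the setup steps.

```python
# Sketch: assigning a host template's roles to a freshly registered host via
# the Cloudera Manager REST API. API version and credentials are assumptions.

def apply_host_template_request(cm_host, cluster, template, host_ids,
                                api_version="v19"):
    """Build the URL and body for the applyHostTemplate command."""
    url = ("http://{}:7180/api/{}/clusters/{}/hostTemplates/{}"
           "/commands/applyHostTemplate?startRoles=true").format(
               cm_host, api_version, cluster, template)
    body = {"items": [{"hostId": h} for h in host_ids]}
    return url, body

def apply_template(host_ids):
    """POST the command; host, cluster, and credentials are placeholders."""
    import requests
    url, body = apply_host_template_request(
        "cm.internal", "Cluster1", "WN", host_ids)
    requests.post(url, json=body, auth=("admin", "admin")).raise_for_status()
```

With `startRoles=true`, the newly assigned roles come up as soon as the template is applied.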
- We can refresh the cluster configuration through the API without restarting any existing roles, and deploy the latest client configuration to the nodes.
- Autoscaling instances have a lifecycle, which lets us perform some activity on an instance before launch and before termination. The scale-in script runs every minute to check the lifecycle status of the host. Whenever autoscaling performs a scale-in activity, the lifecycle status of the EC2 instance changes; at that point we decommission the host and copy its data to S3 or other shared storage.
- We then stop the roles and delete the host from the cluster, and refresh the configuration to remove the stale configs.
- Finally, we stop the instance programmatically, and the autoscaling group terminates it once its lifecycle state changes from Terminating:Wait to Terminating:Proceed.
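The scale-in loop can be sketched as follows. The lifecycle state strings are the ones AWS Auto Scaling reports; the hook and group names are placeholders.

```python
def should_decommission(lifecycle_state):
    """Act only while the termination hook is holding the instance open."""
    return lifecycle_state == "Terminating:Wait"

def complete_termination(instance_id):
    """Tell autoscaling to proceed once decommissioning and the data copy
    are done; hook and group names are illustrative placeholders."""
    import boto3
    boto3.client("autoscaling").complete_lifecycle_action(
        LifecycleHookName="cdh-terminate-hook",
        AutoScalingGroupName="cdh-worker-nodes",
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id)
```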
- While scaling edge nodes, we push the system memory usage to CloudWatch for autoscaling.
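That push can be sketched from `/proc/meminfo`; the metric name and namespace below are assumptions.

```python
def used_memory_percent(meminfo_text):
    """Compute used-memory percent from the contents of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key] = int(rest.split()[0])  # sizes are reported in kB
    total, available = values["MemTotal"], values["MemAvailable"]
    return 100.0 * (total - available) / total

def push_memory_metric(instance_id):
    """Push the datapoint the edge-node scaling alarm watches (names assumed)."""
    import boto3
    with open("/proc/meminfo") as f:
        percent = used_memory_percent(f.read())
    boto3.client("cloudwatch").put_metric_data(
        Namespace="CDH/EdgeNodes",
        MetricData=[{"MetricName": "MemoryUtilization",
                     "Dimensions": [{"Name": "InstanceId",
                                     "Value": instance_id}],
                     "Unit": "Percent", "Value": percent}])
```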
- Worker nodes are resilient to node termination, so we can add them to an autoscaling group of spot instances, which leverages the autoscaling lifecycle hook.
About ELB Custom Health Checker:
- The AWS Application Load Balancer performs health checks on the configured traffic port to verify the availability of the service. An instance is marked healthy when the health check passes.
- The Application Load Balancer routes traffic only to healthy instances. However, we might have a scenario where the application is running but system resources become a bottleneck as multiple users access the application, and the ELB still routes traffic to that instance.
- In another scenario, the Application Load Balancer routes traffic randomly when there is no healthy instance in the target group.
- This example illustrates the above scenarios: we have an application running on EC2 where all users run spark-shell, so memory is consumed rapidly. At some point the system resources are exhausted, and any new user landing on that host experiences performance problems.
- The custom health checker addresses this issue. It is written in Python (Flask web framework) and exposes a REST API reporting the health status of Hue, the web server, JupyterHub, etc. System memory, disk availability, and service status are checked: if all three are OK, the service returns a 200 HTTP response code; otherwise it returns 500.
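The decision itself is small enough to sketch. The thresholds below are illustrative assumptions, and the real checker wraps this logic in a Flask route so the ELB can call it over HTTP.

```python
def health_status(mem_free_percent, disk_free_percent, service_up,
                  mem_floor=10.0, disk_floor=15.0):
    """Return the HTTP code the ELB health check should see.
    The floors are illustrative thresholds, not values fixed by the tool."""
    healthy = (service_up
               and mem_free_percent >= mem_floor
               and disk_free_percent >= disk_floor)
    return 200 if healthy else 500

def disk_free_percent(path="/"):
    """One of the three inputs, measured with the standard library."""
    import shutil
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total
```

Returning 500 while memory or disk is exhausted makes the ELB drain traffic away from an instance that is technically up but practically unusable.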