This is a small project I created to solve a test task for a DevOps position. The goal was to create a simple yet manageable way to deploy, configure and maintain a small 3-node cluster running Hadoop and HBase.
I decided to use chef-solo as the deployment/configuration engine (this removes the need for any dedicated server or management nodes) and a simple Flask-based script server called "the Henchman", which provides a REST API to update configuration, run Chef and manage services.

Hadoop was set up on 3 nodes:
- 1st node: NameNode, Secondary NameNode, DataNode, ResourceManager, NodeManager
- 2nd node: DataNode, NodeManager
- 3rd node: DataNode, NodeManager

HBase was set up on 3 nodes:
- 1st node: Master, Zookeeper
- 2nd node: Backup Master, Zookeeper, RegionServer
- 3rd node: Zookeeper, RegionServer
Used for reference:
- http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
- http://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm
- https://hbase.apache.org/book.html#quickstart
The Henchman is a small server-like script that exposes a REST API for host management. Right now there are 4 types of calls (example invocations follow the list):
- http://{address}/file/{somefile} - GET - fetches a file from the Henchman working directory (for example, a configuration or recipe file)
- http://{address}/api/run_chef - POST - serves two purposes: updating the config file (attr.json) and running Chef with the chosen steps. Send a POST request like this:
{
    "some_configuration":{
        "some_key":"some_value",
        "some_keys":{
            "some_hosts":["host01","host02","host03"]
        }
    },
    "run_list": [
        "recipe[some_cookbook::some_recipe]"
    ]
}
- http://{address}/api/get_config - GET - returns the current config file (attr.json)
- http://{address}/api/{hadoop,hbase}/{stop_datanode,start_hbase,etc} - GET - triggers a service action
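For illustration, here is roughly how each call type could be invoked with curl. The host test-01 and port 5000 are taken from the examples later in this README; hdfs-site.xml and stop_datanode stand in for whatever file or action you actually need:

# Fetch a file from the Henchman working directory
curl test-01:5000/file/hdfs-site.xml

# Update the config file and run the chosen recipes
curl -X POST test-01:5000/api/run_chef -d @attrs.json --header "Content-Type: application/json"

# Read back the current config
curl test-01:5000/api/get_config

# Trigger a service action
curl test-01:5000/api/hadoop/stop_datanode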
Example: let's say we want to change something in our configuration, for example the HDFS replication factor from 1 to 3. With the Henchman this is easy. All we need to do is:
- Update the configuration variables
- Re-generate hdfs-site.xml
- Restart HDFS
Here is our configuration file:
{
    "deploy":{
        "directory":"/home/develop/deploy"
    },
    "hadoop":{
        "version":"hadoop-2.6.0",
        "user":"hadoop",
        "namenode_port":"9000",
        "hadoop_hosts":{
            "master":"test-01",
            "datanodes":["test-01","test-02","test-03"]
        },
        "replication_factor":"3",
        "data_folder":"/data",
        "yarn":{
            "resource_manager":{
                "host":"test-01",
                "resource_manager_port":"8050",
                "scheduler_port":"8035",
                "resource-tracker_port":"8025",
                "mapreduce_job_tracker_port":"5431"
            }
        }
    },
    "hbase":{
        "user":"hbase",
        "version":"hbase-1.0.3",
        "hbase_hosts":{
            "zk_nodes":["test-01","test-02","test-03"],
            "regionserver_nodes":["test-02","test-03"],
            "backup_master_nodes":["test-02"]
        }
    },
    "run_list": [
        "recipe[hadoop::generate_hdfs-site_xml]"
    ]
}
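The edit itself can be done by hand, or scripted with jq; a minimal sketch (the temp-file dance is just one way to rewrite the file in place):

# Bump the replication factor and write the result back to attrs.json
./jq '.hadoop.replication_factor = "3"' attrs.json > attrs.json.tmp && mv attrs.json.tmp attrs.json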
We have changed the replication factor and saved the file as attrs.json. After that, we POST it to all our nodes:
# Extract the datanode hostnames from the config, one per line
HOSTS=$(./jq -r '.hadoop.hadoop_hosts.datanodes[]' attrs.json)
for HOST in $HOSTS
do
    curl -vX POST http://$HOST:5000/api/run_chef -d @attrs.json --header "Content-Type: application/json"
done
Then, we restart HDFS:
curl -i test-01:5000/api/hadoop/stop_dfs; curl -i test-01:5000/api/hadoop/start_dfs
And it's done.
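To double-check the result, one could pull the regenerated file back through the /file/ endpoint (assuming the recipe leaves hdfs-site.xml in the Henchman working directory, which is an assumption on my part):

# Fetch the regenerated hdfs-site.xml and show the replication property
curl -s test-01:5000/file/hdfs-site.xml | grep -A1 dfs.replication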
TODO:
- Daemonize this!!
- Build RPM package
- Add authorization
- Add healthcheck feature (process status monitoring, system metrics)
- Rework API commands to something more convenient and reliable, maybe wrap them with Chef
- Add UI and DB for historical data
- Rework cookbooks, add better passwordless ssh management (https://supermarket.chef.io/cookbooks/ssh-keys)
Right now the project is in a very crude and unfinished state; it was thrown together in a couple of nights, so there are lots of things to be improved.