
Thursday, September 25, 2014

Use Boto to Start an Elastic Map-Reduce Cluster with Hive and Impala Installed

I spent all of yesterday beating my head against the Boto documentation (or lack thereof). Boto is a popular (the?) tool for using Amazon Web Services (AWS) from Python. The heavily used parts of AWS have good documentation, while the rest suffer from a lack of explanation.

The task I wanted to accomplish was:

  1. Use Boto to start an Elastic MapReduce cluster of machines.
  2. Install Hive and Impala on the machines.
  3. Use Spot instances for the core nodes.
Below is sample code to accomplish these tasks. I spent a great deal of time combing through the source code for Boto; you may need to do the same.

This code is for Boto 2.32.1:


import boto.emr
from boto.emr import BootstrapAction
from boto.emr.step import InstallHiveStep
from boto.emr.instance_group import InstanceGroup

# Placeholder values -- substitute your own.
cluster_name = "my-cluster"
master_instance_type = "m1.large"
slave_instance_type = "m1.large"
num_instances = 4
bidprice = "0.08"  # spot bid price (USD), passed as a string

conn = boto.emr.connect_to_region("us-east-1")

# Install Hive as a step; install Impala via a bootstrap action.
hive_step = InstallHiveStep()
bootstrap_impala = BootstrapAction(
    "impala",
    "s3://elasticmapreduce/libs/impala/setup-impala",
    ["--base-path", "s3://elasticmapreduce", "--impala-version", "latest"])

# One on-demand master node plus spot-priced core nodes.
instance_groups = [
    InstanceGroup(1, "MASTER", master_instance_type, "ON_DEMAND", "mastername"),
    InstanceGroup(num_instances, "CORE", slave_instance_type, "SPOT",
                  "slavename", bidprice=bidprice)]

jobid = conn.run_jobflow(
    cluster_name,
    log_uri="s3n://log_bucket",
    ec2_keyname="YOUR EC2 KEYPAIR NAME",
    availability_zone="us-east-1e",
    instance_groups=instance_groups,
    keep_alive=True,
    enable_debugging=True,
    hadoop_version="2.4.0",
    ami_version="3.1.0",
    visible_to_all_users=True,
    steps=[hive_step],
    bootstrap_actions=[bootstrap_impala])
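
Once run_jobflow returns, the cluster takes several minutes to provision. As a quick sanity check, here is a minimal polling sketch reusing the conn and jobid from above (the interval and the set of terminal states are my choices, not anything Boto prescribes). With keep_alive=True, a healthy cluster should settle into the WAITING state once the bootstrap actions and the Hive step finish:

import time

# Poll the jobflow until it reaches a steady state. With keep_alive=True,
# a healthy cluster ends up in WAITING after bootstrapping and steps finish.
while True:
    state = conn.describe_jobflow(jobid).state
    print(state)
    if state in ("WAITING", "COMPLETED", "FAILED", "TERMINATED"):
        break
    time.sleep(30)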