I spent all of yesterday beating my head against the Boto documentation (or lack thereof). Boto is a popular (the?) tool for using Amazon Web Services (AWS) with Python. The heavily used parts of AWS are well documented in Boto, while the rest get very little explanation.
The task I wanted to accomplish was:
- Use Boto to start an Elastic MapReduce (EMR) cluster of machines.
- Install Hive and Impala on the machines.
- Use Spot instances for the core nodes.
Below is sample code to accomplish these tasks. I spent a great deal of time combing through the Boto source code to get it working. You may need to do the same.
This code is for Boto 2.32.1:
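import boto.emr
from boto.emr import BootstrapAction
from boto.emr.instance_group import InstanceGroup
from boto.emr.step import InstallHiveStep

# Placeholder values; replace these with your own settings.
cluster_name = "my-cluster"
master_instance_type = "m1.large"
slave_instance_type = "m1.large"
num_instances = 2    # number of core nodes
bidprice = "0.10"    # maximum Spot bid, in USD per instance-hour

conn = boto.emr.connect_to_region('us-east-1')

# Hive is installed as a job flow step, Impala as a bootstrap action.
hive_step = InstallHiveStep()
bootstrap_impala = BootstrapAction("impala", "s3://elasticmapreduce/libs/impala/setup-impala",
                                   ["--base-path", "s3://elasticmapreduce", "--impala-version", "latest"])

# One on-demand master node, plus core nodes requested on the Spot market.
instance_groups = [InstanceGroup(1, "MASTER", master_instance_type, "ON_DEMAND", "mastername"),
                   InstanceGroup(num_instances, "CORE", slave_instance_type, "SPOT", "slavename", bidprice=bidprice)]

# Pass real booleans: a string such as "False" would still count as true.
jobid = conn.run_jobflow(cluster_name, log_uri="s3n://log_bucket",
                         ec2_keyname="YOUR EC2 KEYPAIR NAME",
                         availability_zone="us-east-1e",
                         instance_groups=instance_groups,
                         num_instances=str(num_instances),
                         keep_alive=True,
                         enable_debugging=True,
                         hadoop_version="2.4.0",
                         ami_version="3.1.0",
                         visible_to_all_users=True,
                         steps=[hive_step],
                         bootstrap_actions=[bootstrap_impala])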
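Once run_jobflow returns a job flow ID, the cluster still takes a while to boot, so you will probably want to wait until it is ready before connecting to it. Here is a minimal sketch of a polling loop. The helper name wait_for_state, its parameters, and the timeout values are my own invention and not part of Boto. It takes any function that returns the current state string, so it can be tried without touching AWS. The state names are the ones I believe the EMR job flow API reports; check them against what your cluster actually returns.

```python
import time

# States after which the cluster will never become ready (assumed names).
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def wait_for_state(get_state, target="WAITING", poll_seconds=30,
                   timeout_seconds=1800, sleep=time.sleep):
    """Poll get_state() until it returns `target`.

    get_state is any zero-argument callable returning the current state
    string. Returns the distinct states seen, in order. Raises RuntimeError
    if a terminal state is reached first or the timeout elapses.
    """
    seen = []
    waited = 0
    while True:
        state = get_state()
        # Record each state only when it changes.
        if not seen or seen[-1] != state:
            seen.append(state)
        if state == target:
            return seen
        if state in TERMINAL_STATES:
            raise RuntimeError("cluster ended in state %s" % state)
        if waited >= timeout_seconds:
            raise RuntimeError("timed out waiting for %s" % target)
        sleep(poll_seconds)
        waited += poll_seconds
```

With the connection from the code above, something like wait_for_state(lambda: conn.describe_jobflow(jobid).state) should work, assuming describe_jobflow behaves in your Boto version the way it does in mine. Because keep_alive is set, the cluster should sit in WAITING once the Hive step finishes, which is why that is the default target.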