Hive is great for batch-mode processing of large amounts of data, and for pulling data from S3 into HDFS on Hadoop. Impala then allows you to do fast(er) queries on that data. Both are 1-click installs using Amazon's EMR console (or command line).
The difficulty now is that I'm writing queries at the command line and don't have a particularly elegant way of plotting or poking at the results. Wouldn't it be great if I could get the results into an IPython Notebook and plot them there? Two problems: 1) getting the results into Python and 2) getting access to a Notebook server that's running on the EMR cluster.
I now have two solutions to these two problems:
1) Install Anaconda on the EMR cluster:
remote$> wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh
remote$> bash Anaconda-2.1.0-Linux-x86_64.sh
Now log out of your SSH connection and reconnect using the command
local$> ssh -i ~/yourkeyfile.pem -L 8889:localhost:8888 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
The -L flag forwards port 8889 on your local machine to port 8888 (the Notebook server's default port) on the cluster. Then start the server on the remote side:
remote$> ipython notebook --no-browser
Point your local browser at http://localhost:8889 and you're talking to the Notebook server running on the cluster.
2) The Impyla project allows you to connect to a running Impala server, make queries, and spit the output into a Pandas dataframe. You can install Impyla using the command
remote$> pip install impyla
Inside your IPython Notebook you should be able to execute something like
from impala.util import as_pandas
from impala.dbapi import connect

# Connect to the Impala daemon; 21050 is its default client port
conn = connect(host='localhost', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM some_table')

# Pull the full result set into a Pandas dataframe
df = as_pandas(cursor)
and now you have a magical dataframe with Overweight Data that you can plot and otherwise poke at.
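From here, plotting is just Pandas. A minimal sketch of what that looks like, where `category` and `value` are hypothetical stand-ins for whatever columns your query actually returns (in the Notebook you'd also run %matplotlib inline first):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; in the Notebook use %matplotlib inline instead
import pandas as pd

# Stand-in for the dataframe returned by as_pandas(cursor);
# the column names here are made up for illustration.
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'value': [1, 2, 3, 4]})

# Aggregate in Pandas rather than re-querying Impala for every plot
totals = df.groupby('category')['value'].sum()
totals.plot(kind='bar')
```

Doing the group-by on the dataframe is handy for interactive poking; for genuinely large result sets you'd push the aggregation into the Impala query itself and pull back only the summary.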