Thursday, October 2, 2014

Query Data from Impala on Amazon EMR into Python, Pandas and IPython Notebook

I've been envious of tools such as Hue that allow for an easy way to execute SQL-like queries on Hive or Impala and then immediately plot the results. Installing Hue on EMR has thus-far thwarted me (if you know how, I'm all ears), so I needed a better way.

Hive is great for doing batch-mode processing of a lot of data, and pulling data from S3 into the Hadoop HDFS. Impala then allows you do to fast(er) queries on that data. Both are 1-click installed using Amazon's EMR console (or command line).

The difficulty now is that I'm writing queries at the command-line and don't have a particularly elegant way of plotting or poking at the results. Wouldn't it be great if I could get the results into an IPython Notebook and plot there? Two problems: 1) getting the results into Python and 2) getting access to a Notebook server that's running on the EMR cluster.

I now have two solutions to these two problems:

1) Install Anaconda on the EMR cluster:

remote$> wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.shremote$> bash Anaconda-2.1.0-Linux-x86_64.sh

now log out of your SSH connection and reconnect using the command

local$> ssh -i ~/yourkeyfile.pem -L 8889:localhost:8888 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

This starts a port forwarding SSH connection that connects http://localhost:8889 on your local machine to http://localhost:8888 on the remote machine (which is where the notebook will run). Now start the remote IPython Notebook server using the command

$> ipython notebook --browswer=none

You should now be able to navigate to http://localhost:8889 on your local machine and see the notebook server running on your EMR machine! Ok, now what about getting the Impala data into the notebook?

2) The Impyla project allows you to connect to a running Impala sever, make queries, and spit the output into a Pandas dataframe. You can install Impyla using the command

remote$> pip install impyla

Inside your IPython Notebook you should be able to execute something like

from impala.util import as_pandas 
from impala.dbapi import connect 
conn = connect(host='localhost', port=21050) 
cursor = conn.cursor() 
cursor.execute('SELECT * FROM some_table') 
df = as_pandas(cursor)

and now you have a magical dataframe with Overweight Data that you can plot and otherwise poke at.

1 comment:

  1. William,

    I saw a post you made on Slashdot (http://stackoverflow.com/questions/20042478/matplotlib-1-3-1-plotmatrix1-2-3-runtimeerror-maximum-recursion-depth) where you were asking how to patch Anaconda to the latest version of matplotlib.

    I am in the same boat. What I did to fix the problem was go and see the patch that fixed the infinite recursion here (https://github.com/dopplershift/matplotlib/commit/cee4ba990c7e209561e4deec75452e9dc97c5a30) and notice that it is only about 8 lines of code. Then, within Anaconda, I opened the file:
    And just edited the file. In order to make it stick, you need to then remove units.pyc from the directory so that python's cache manager is forced to recompile the altered source.

    I hope this helps you. I couldn't post on Slashdot because I don't have enough reputation.