Hive is great for doing batch-mode processing of a lot of data, and pulling data from S3 into the Hadoop HDFS. Impala then allows you do to fast(er) queries on that data. Both are 1-click installed using Amazon's EMR console (or command line).
The difficulty now is that I'm writing queries at the command-line and don't have a particularly elegant way of plotting or poking at the results. Wouldn't it be great if I could get the results into an IPython Notebook and plot there? Two problems: 1) getting the results into Python and 2) getting access to a Notebook server that's running on the EMR cluster.
I now have two solutions to these two problems:
1) Install Anaconda on the EMR cluster:
remote$> wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.shremote$> bash Anaconda-2.1.0-Linux-x86_64.sh
now log out of your SSH connection and reconnect using the command
local$> ssh -i ~/yourkeyfile.pem -L 8889:localhost:8888 [email protected]
This starts a port forwarding SSH connection that connects http://localhost:8889 on your local machine to http://localhost:8888 on the remote machine (which is where the notebook will run). Now start the remote IPython Notebook server using the command
$> ipython notebook --browswer=none
You should now be able to navigate to http://localhost:8889 on your local machine and see the notebook server running on your EMR machine! Ok, now what about getting the Impala data into the notebook?
2) The Impyla project allows you to connect to a running Impala sever, make queries, and spit the output into a Pandas dataframe. You can install Impyla using the command
remote$> pip install impyla
Inside your IPython Notebook you should be able to execute something like
from impala.util import as_pandas
from impala.dbapi import connect
conn = connect(host='localhost', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM some_table')
df = as_pandas(cursor)
and now you have a magical dataframe with Overweight Data that you can plot and otherwise poke at.
William,
ReplyDeleteI saw a post you made on Slashdot (http://stackoverflow.com/questions/20042478/matplotlib-1-3-1-plotmatrix1-2-3-runtimeerror-maximum-recursion-depth) where you were asking how to patch Anaconda to the latest version of matplotlib.
I am in the same boat. What I did to fix the problem was go and see the patch that fixed the infinite recursion here (https://github.com/dopplershift/matplotlib/commit/cee4ba990c7e209561e4deec75452e9dc97c5a30) and notice that it is only about 8 lines of code. Then, within Anaconda, I opened the file:
/Lib/site-packages/matplotlib/units.py
And just edited the file. In order to make it stick, you need to then remove units.pyc from the directory so that python's cache manager is forced to recompile the altered source.
I hope this helps you. I couldn't post on Slashdot because I don't have enough reputation.