
Thursday, October 2, 2014

Query Data from Impala on Amazon EMR into Python, Pandas and IPython Notebook

I've been envious of tools such as Hue that allow for an easy way to execute SQL-like queries on Hive or Impala and then immediately plot the results. Installing Hue on EMR has thus far thwarted me (if you know how, I'm all ears), so I needed another way.

Hive is great for doing batch-mode processing of a lot of data, and pulling data from S3 into the Hadoop HDFS. Impala then allows you to do fast(er) queries on that data. Both are 1-click installed using Amazon's EMR console (or command line).

The difficulty now is that I'm writing queries at the command-line and don't have a particularly elegant way of plotting or poking at the results. Wouldn't it be great if I could get the results into an IPython Notebook and plot there? Two problems: 1) getting the results into Python and 2) getting access to a Notebook server that's running on the EMR cluster.

I now have two solutions to these two problems:

1) Install Anaconda on the EMR cluster:

remote$> wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh
remote$> bash Anaconda-2.1.0-Linux-x86_64.sh

now log out of your SSH connection and reconnect using the command


local$> ssh -i ~/yourkeyfile.pem -L 8889:localhost:8888 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

This starts a port forwarding SSH connection that connects http://localhost:8889 on your local machine to http://localhost:8888 on the remote machine (which is where the notebook will run). Now start the remote IPython Notebook server using the command

$> ipython notebook --no-browser

You should now be able to navigate to http://localhost:8889 on your local machine and see the notebook server running on your EMR machine! Ok, now what about getting the Impala data into the notebook?

2) The Impyla project allows you to connect to a running Impala server, make queries, and spit the output into a Pandas dataframe. You can install Impyla using the command

remote$> pip install impyla

Inside your IPython Notebook you should be able to execute something like

from impala.util import as_pandas
from impala.dbapi import connect

conn = connect(host='localhost', port=21050)  # 21050 is Impala's default HiveServer2 port
cursor = conn.cursor()
cursor.execute('SELECT * FROM some_table')
df = as_pandas(cursor)   # dump the result set into a Pandas dataframe

and now you have a magical dataframe with Overweight Data that you can plot and otherwise poke at.

Thursday, September 25, 2014

Use Boto to Start an Elastic Map-Reduce Cluster with Hive and Impala Installed

I spent all of yesterday beating my head against the Boto documentation (or lack thereof). Boto is a popular (the?) tool for using Amazon Web Services (AWS) with Python. The parts of AWS that are used quite a bit have good documentation, while the rest suffer for explanation.

The task I wanted to accomplish was:

  1. Use Boto to start an elastic mapreduce cluster of machines.
  2. Install Hive and Impala on the machines.
  3. Use Spot instances for the core nodes.

Below is sample code to accomplish these tasks. I spent a great deal of time combing through the source code for Boto. You may need to do the same.

This code is for Boto 2.32.1:


import boto.emr
from boto.emr.step import InstallHiveStep
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.instance_group import InstanceGroup

# Fill these in for your own cluster
cluster_name = "my-cluster"
master_instance_type = "m1.large"
slave_instance_type = "m1.large"
num_instances = 4      # number of core nodes
bidprice = "0.10"      # spot bid price, in dollars

conn = boto.emr.connect_to_region('us-east-1')
hive_step = InstallHiveStep()
bootstrap_impala = BootstrapAction(
    "impala",
    "s3://elasticmapreduce/libs/impala/setup-impala",
    ["--base-path", "s3://elasticmapreduce", "--impala-version", "latest"])
instance_groups = [
    InstanceGroup(1, "MASTER", master_instance_type, "ON_DEMAND", "mastername"),
    InstanceGroup(num_instances, "CORE", slave_instance_type, "SPOT",
                  "slavename", bidprice=bidprice)]
jobid = conn.run_jobflow(
    cluster_name,
    log_uri="s3n://log_bucket",
    ec2_keyname="YOUR EC2 KEYPAIR NAME",
    availability_zone="us-east-1e",
    instance_groups=instance_groups,
    keep_alive=True,
    enable_debugging=True,
    hadoop_version="2.4.0",
    ami_version="3.1.0",
    visible_to_all_users=True,
    steps=[hive_step],
    bootstrap_actions=[bootstrap_impala])

Wednesday, August 20, 2014

Use Asciinema to Record and Share Terminal Sessions

I discovered a cool tool today by listening to Jeroen Janssens's talk on Data Science at the Command Line. The tool is Asciinema, a terminal plugin that lets you record your terminal session, save it and then share it - with copyable text! It supports both OSX and Linux.

Here's an example from their website:

Can't wait to use this for this blog!

Monday, August 18, 2014

Automatically Partition Data in Hive using S3 Buckets

Did you know that if you are processing data stored in S3 using Hive, you can have Hive automatically partition the data (a logical separation) by encoding key=value pairs into the S3 path names? For instance, if you have time-based data, and you store it in paths like this:

/root_path_to_buckets/date=20140801
/root_path_to_buckets/date=20140802
/root_path_to_buckets/date=20140803
/root_path_to_buckets/...

And you build a table in Hive, like


CREATE EXTERNAL TABLE time_data(
   value STRING,
   value2 INT,
   value3 STRING,
   ...
)
PARTITIONED BY(date STRING)
LOCATION 's3n://root_path_to_buckets/'

Hive will automatically know that your data is logically separated by dates. You'll first need to refresh the partition list by calling the command:

ALTER TABLE time_data RECOVER PARTITIONS;

After that, you can check to see if the partitions have taken using the SHOW command:

SHOW PARTITIONS time_data;

Now when you run a SELECT command, Hive will only load the data needed. This saves a tremendous amount of downloading and processing time. Example:

SELECT value, value2 FROM time_data WHERE date > '20140802'

This will only load 1/3 of the data (since 20140801 and 20140802 are excluded).
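As an aside, the key=value path convention is easy to work with from Python too. Here's a minimal sketch (the partition_value helper is my own, not part of Hive or any AWS tool) that pulls a partition value out of a path like the ones above:

```python
def partition_value(path, key):
    """Return the value of a key=value partition component in a path, or None."""
    for part in path.strip("/").split("/"):
        if "=" in part:
            k, v = part.split("=", 1)
            if k == key:
                return v
    return None

print(partition_value("/root_path_to_buckets/date=20140801", "date"))  # -> 20140801
```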

Thursday, July 31, 2014

Buying a Car With Data Science!

Forget Big Data, here's how I used small data and stupid simple tools to make buying a car easier for me to stomach.

When your family's size increases past the ⌊E[# of children]⌋ in America, you need a bigger vehicle. Given that we now have a family with 3 children, and that the 3rd child will need a carseat in a few months, it was time to buy a family van.

I'm a big fan of staying out of debt, so I had a fixed budget with which to acquire a vehicle - a used vehicle. A great place to start looking is the AutoTempest search engine, which aggregates listings from lots of different sites. The problem is that it's difficult to know how much you should pay for any given vehicle. If you're buying new you can check something like TrueCar, and there are resources like Kelley Blue Book, NADA and Edmunds, but from past used-buying experience those services tended to underestimate the actual costs of most vehicles I've bought. And while some folks love to negotiate, I find it difficult without any "ground truth" to base my asking price on.


I toyed around with the idea of doing a full-fledged scraping of data but it just wasn't worth the time, since I was under a deadline. Instead I took the path of least resistance and asked my wife to pull together basic info on 20 vehicles matching our criteria - year, mileage and asking price. Together we stuck it into a Google Document and plotted the results:


To my surprise and delight, there seemed to be two distinct collections of data - an upper and a lower priced section. Since Google Docs doesn't provide an easy way to add a regression line, I moved over to Excel and added those in:





The data points highlighted in green were ones that I was considering. Suddenly I had isolated two vehicles that were "overpriced" and ripe for easy negotiation. I also chose the highest-priced - and lowest-mileage - vehicle and asked for a significantly lower price on a whim.

The additional data I'd gathered gave me confidence when negotiating. I contacted 2 dealerships with my data and the price I wanted to pay (slightly under the lower regression line). Ultimately the 1st person I'd talked to accepted my offer, which I knew was good from my data, and I didn't have to worry about whether or not I should keep negotiating.

What my data DIDN'T do for me:
  • The data didn't impress anyone
  • The data didn't magically make people accept my offers
  • The data didn't make buying a car easy
What my data DID do for me:
  • Took away the awful feeling of not knowing.
  • Gave me confidence when negotiating offers.
  • Let me quickly see which vehicles I should pursue and which to not focus on.
Now I'm driving a baller ... van. Cool indeed.

Python Pandas Group by Column A and Sum Contents of Column B

Here's something that I can never remember how to do in Pandas: group by 1 column (e.g. Account ID) and sum another column (e.g. purchase price). So, here's an example for our reference:

data.groupby(by=['account_ID'])['purchases'].sum()

Simple, but not worth re-Googling each time!
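For the sake of having something runnable to paste from, here's the same pattern on a small made-up dataset:

```python
import pandas as pd

# Made-up accounts and purchase amounts
data = pd.DataFrame({
    "account_ID": ["a1", "a2", "a1", "a3", "a2"],
    "purchases":  [10.0, 5.0, 2.5, 7.0, 3.0],
})

# Group by account and sum each account's purchases
totals = data.groupby(by=["account_ID"])["purchases"].sum()
print(totals)  # a1 -> 12.5, a2 -> 8.0, a3 -> 7.0
```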

Thursday, July 24, 2014

OSCON Wednesday Recap

Ended up in Paul Frankwick's talk on "Build Your Own Exobrain" a bit late, but it worked out well. "Think of it as IFTTT, but free, open source, and you keep control of your privacy." Had a great conversation with Paul afterwards regarding stochastic time and mood tracking. Hopefully we'll get some stochastic triggers added to Exobrain, along with Python support, soon.

Next I listened to Adam Gibson give a talk on his deep learning library, DeepLearning4j. He's clearly a bright guy and is very passionate about his project. I'd previously watched him give a similar talk at the Hadoop Summit along with Josh Patterson. I spent some time talking with him after the session and trading machine learning stories. He nearly inspired me to learn Scala and start hacking on deeplearning4j - it sounds like a fabulous platform with all the possible moving pieces you could want for building a deep learning pipeline.

Afterwards I went to Harrison Mebane's talk on spotting Caltrain cars using a Raspberry Pi outfitted with microphones and a camera. It looked like a neat project incorporating data, sensors and hardware.

Next, on a whim, I went to Tobias Zander's talk on web security. I know very little about security so I was fascinated by all of the interesting ways to compromise a system he showcased. He showed how clever hackers can learn all sorts of information in non-obvious ways. He also royally trolled the audience by using a Facebook compromise to gather people's Facebook profile pictures after they visited his website during the talk. 

Finally, I went to a lovely talk by Tim Bell on CERN's IT infrastructure and how they went agile. It was a fascinating talk that dove into the complexities of such a massive system. The difficulties - political, scientific and technological - are enormous. When the video is posted it's well worth your time to go and watch.

Wednesday, July 23, 2014

OSCON Tuesday Recap

Excellent set of Keynotes. Especially enjoyed the one from Planet Labs - inspiring work to photograph the entire globe, every day.

Next was a talk on building an open Thermostat from Zach Supalla at Spark.io, the makers of an Internet connected microcontroller and cloud infrastructure. Zach says building hardware is hard, but it's easy to get noticed - if you build anything remotely cool you'll be on Engadget with no problem.

A talk on Functional Thinking by Neal Ford, who is a great speaker, was informative but wasn't exactly something I can apply to my work right now. I at least caught up on some of the nomenclature and can use it as a jumping-off point for future learning. Apparently all major languages are adding functional programming support these days (Python?).

Ruth Suehle gave a tremendously fun talk on Raspberry Pi hacks - it also turns out she lives in my city and knows a bunch of people that I do. Go figure! She inspired me to go buy a Pi and do something other than a) leave it in a box or b) put XBMC on it. I'm thinking a weather station would be a fun project to build.

Tim Berglund gave a (packed!) talk on "Graph Theory you Need to Know". Tim is a good speaker, but the talk struggled a bit with needing to pack in lots of definitions (not Tim's fault). I never knew how easy it is to take an adjacency matrix and compute the number of N-length paths to other nodes - just multiply the matrix by itself! Also neat to see a quick example of going from the graph to a Markov chain with probabilities.
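To make the path-counting trick concrete, here's a tiny sketch with a made-up three-node graph (nothing from Tim's slides, just NumPy):

```python
import numpy as np

# Adjacency matrix for a small directed graph: edges 0->1, 1->2, 0->2
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])

# Entry (i, j) of A^N counts the N-step paths from node i to node j
A2 = np.linalg.matrix_power(A, 2)
print(A2[0, 2])  # -> 1: the single 2-step path 0 -> 1 -> 2
```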

Ethan Dereszynski and Eric Butler from Webtrends showed off their (beautiful!) real-time system for observing and predicting user behavior on a website. It uses Kafka/Storm to train and classify user behavior using an HMM - the dashboard can show you, in real time, individual users on your site and the probability that they'll take some action. You can then serve them ads or coupons based on how likely they are to buy/leave/etc. I want to talk to these guys more, because I'm trying to solve a similar problem at Distil.

Finally, my talk on the Fourier Transform, FFT, and How to Use It went smashingly well. I hit perfect timing, saw lots of mesmerized faces and had plenty of questions afterwards. The slides are up and the code will be uploaded soon. 

Sunday, July 20, 2014

Welcome OSCON Visitors!

Welcome OSCON visitors. If you're here because of my talk on Time Series Analysis and the Fourier Transform, here is a link to the presentation in PowerPoint form - more formats coming soon. I'll be uploading code later on today.


Thursday, July 3, 2014

How do I Read Python Error Messages?

One of the complaints I've heard about Python from Matlab developers is that the error messages can be cryptic. 

Here's an example of an error I recently got while running some Python code (see Error Message below). When compared with a Matlab-style error, this can be a bit overwhelming. Here are a few tips:
  1. Start at the bottom. That will be where the actual error message is. In this case it's "ValueError: to_rgba: Invalid rgba arg "['#eeeeee']" need more than 1 value to unpack". Notice that the argument in question is inside of square brackets - that means it's a list with 1 item in it. Perhaps that's telling.
  2. Find where the code you wrote is breaking. Often the stack of errors is deep inside some other module or code, but what you wrote is actually passing in bad data. The 2nd block of errors starting with "<ipython-input-2-2a6f1bb6961e>" is code that we wrote. Looks like the offending line is "color = colors_bmh[ y[ix].astype(int) ]". Somehow we're setting the array of "color" to some bad values.
  3. Look at the specific error message, "ValueError: ... need more than 1 value to unpack". This means that the code was expecting 2 or more values to be returned but only 1 was. 
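That "values to unpack" complaint can be reproduced in isolation - this tiny snippet has nothing to do with the plotting code above, it just triggers the same class of error:

```python
# Unpacking fails when the sequence has fewer values than the targets expect.
raised = False
try:
    r, g, b, a = ["#eeeeee"]  # a one-item list where four values are expected
except ValueError:
    raised = True  # Python 2 phrased this as "need more than 1 value to unpack"
print(raised)  # -> True
```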
It actually took me a while to debug the code. In this case, the point I made under Step 1 above - the square brackets indicating a list around the argument - was the key. The color array was defined this way:

color = colors_bmh[ y[ix].astype(int) ]

The problem with that statement is that if you look at the array shape (np.shape(color)), it's 2-dimensional - (100,1) - whereas the function was expecting a list or a 1-dimensional array. Changing the code to this fixed it:

color = colors_bmh[ y[ix].astype(int) ].ravel()

The ravel() method flattens the array down to 1-D - here it plays the same role as Matlab's squeeze() function. Hope that helps!
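To see the shape behavior in isolation:

```python
import numpy as np

color = np.zeros((100, 1))        # 2-D column array, like the failing case
flat = color.ravel()              # ravel() flattens to 1-D
print(color.shape, flat.shape)    # -> (100, 1) (100,)

# np.squeeze() is the closest analogue of Matlab's squeeze(): it drops
# singleton dimensions, which here gives the same (100,) result.
print(np.squeeze(color).shape)    # -> (100,)
```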

Error Message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-808bb77074e6> in <module>()
      3 p = np.random.randn(len(y))
      4 
----> 5 separation_plot(p,y)

<ipython-input-2-2a6f1bb6961e> in separation_plot(p, y, **kwargs)
     29         bars = ax.bar( np.arange(n), np.ones(n), width=1., 
     30                 color = colors_bmh[ y[ix].astype(int) ],
---> 31                 edgecolor = 'none')
     32         ax.plot( np.arange(n), p[ix,i], "k", 
     33                 linewidth = 3.,drawstyle="steps-post" )

/Users/william/anaconda/lib/python2.7/site-packages/matplotlib/axes.pyc in bar(self, left, height, width, bottom, **kwargs)
   4976             color = [None] * nbars
   4977         else:
-> 4978             color = list(mcolors.colorConverter.to_rgba_array(color))
   4979             if len(color) == 0:  # until to_rgba_array is changed
   4980                 color = [[0, 0, 0, 0]]

/Users/william/anaconda/lib/python2.7/site-packages/matplotlib/colors.pyc in to_rgba_array(self, c, alpha)
    409             result = np.zeros((nc, 4), dtype=np.float)
    410             for i, cc in enumerate(c):
--> 411                 result[i] = self.to_rgba(cc, alpha)
    412             return result
    413 

/Users/william/anaconda/lib/python2.7/site-packages/matplotlib/colors.pyc in to_rgba(self, arg, alpha)
    363         except (TypeError, ValueError) as exc:
    364             raise ValueError(
--> 365                 'to_rgba: Invalid rgba arg "%s"\n%s' % (str(arg), exc))
    366 
    367     def to_rgba_array(self, c, alpha=None):

ValueError: to_rgba: Invalid rgba arg "['#eeeeee']"
need more than 1 value to unpack

Tuesday, July 1, 2014

Volunteer Work to Boost Data Science Skills

I came across an insightful Reddit comment in regard to doing volunteer work with data science. It's encouraging to see folks putting their skills to good use and should be illuminating for those trying to "break into" the field.

Here's vmsmith's answer on how he volunteered to boost his data science skills:

Well, I volunteered to be the data manager for my state delegate. That wasn't too demanding or in-depth, but she did have a lot of data on constituents and donors that needed to be munged and put into consistent formats, and it was a chance to start writing small Python programs to work with .csv files.
Second, I volunteered at the research library at a nearby university, and ended up writing a pretty comprehensive report on data management. This led to two things: (1) a paid consulting offer at another university to help them get a data management plan in place, and (2) another volunteer gig actually developing a data management application in Python that's modeled on the Open Archival Information System (OAIS).
Third, I offered to do data munging for a physicist I know who's doing a hyperspectral imagery project for a government research activity, and needed someone to munge all of the geospatial data to create integrated .kmz files for each experimental session.
Finally, I volunteered to be on NIST's Big Data Reference Architecture Working Group.
All of those things (1) increased my skill and knowledge levels, (2) provided decent resume bullets, and (3) developed references and networking contacts. 

Thursday, June 26, 2014

How to Put Data into Amazon S3 From OSX

Amazon's S3 storage can be a great option for cheaply storing data "in the cloud". The question is: how do you get data in and out of S3? There are several options. If you need a GUI application for OSX you might try Forklift. For a command line option, install the s3cmd program. If you have Homebrew installed you can do this via the command:

brew install s3cmd

Now run s3cmd --configure to set it up. Enter your AWS Access Key and Secret Key (which you can find in your control panel) and pick an encryption password (used during transfers). I don't use a GPG program, the HTTPS protocol, or an HTTP proxy server.

Next, s3cmd will offer to test the connection. Hit "Y". You should see the output:
`Success. Your access key and secret key worked fine :-)`

Now hit "Y" again to save the settings.

Great! Now you've got s3cmd set up and you can move files into and out of s3!

To move a file into s3, run the command:

s3cmd put path/to/filename/filename.file s3://path/to/s3/bucket/

The file can also be accessed via HTTP using the following format:

http(s)://<bucket>.s3.amazonaws.com/<object>
http(s)://s3.amazonaws.com/<bucket>/<object>
via AWS Support Forums
You can also navigate to the file in the S3 console, hit the properties and you'll see the HTTP link listed.

To pull down a file from s3, run the command:

s3cmd get s3://path/to/s3/bucket/filename.file


Monday, June 9, 2014

Introduction to IPython.Parallel and Distributed Model Selection

At PyCON 2013 Olivier Grisel presented a tutorial on Advanced Scikit-Learn. One of the topics was parallel computation and model training; that section starts at 1:03 in the video. The coverage of memory-mapping large files using joblib and NumPy is priceless.


The data and notebooks for the talk can be checked out here. Grisel also covered using StarCluster to distribute computation (very) easily among many EC2 machines. I can't wait to give it a try!

Great talk and well worth the watch.

Thursday, May 29, 2014

Plotting a Logarithmic Y-Axis from a Pandas Histogram

Note to self: How to plot a histogram from Pandas that has a logarithmic y-axis. By using the "bottom" argument, you can make sure the bars actually show up. Ordinarily a "bottom" of 0 will result in no bars.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

df = pd.DataFrame(np.random.randn(1000, 1))
df.hist(bins=200, bottom=1)
plt.semilogy()



Wednesday, May 28, 2014

Sleep Quality Over Time

Every morning I log how I felt my sleep was, on a scale of 1 to 5: a "3" is "meh" sleep, a "1" is awful, and a "5" is something I've never experienced. I've been doing this daily since September of 2013 - so about 6 months of data at this point.

Here's a plot of my sleep ratings smoothed with a 7-day running average - a 7-day boxcar filter, as it were. The raw data is difficult to visualize since it's a heavily quantized value.
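For reference, here's how such a running average can be computed with pandas - the ratings below are random stand-ins, not my actual sleep data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
ratings = pd.Series(np.random.randint(1, 6, size=180))  # ~6 months of 1-5 ratings

# 7-day boxcar / running average; the first 6 days lack a full window and are NaN
smoothed = ratings.rolling(window=7).mean()
```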


It seems incredible to me that there's such a strong periodicity to my perceived sleep quality. Any ideas why?

And in case someone thinks the window is the problem, here's the data with weekly averages plotted.


Tuesday, May 27, 2014

IPython Extension of the Week: ipy_alerts

Have you ever had an IPython cell that was taking a **long** time to finish? Maybe you were importing data or training some model? If you've ever wished you could get an alert when the process finished, here's the extension for you: ipy_alerts from Trent Hauck at Zulily.

The cell magic will send an email once the cell has finished running. Login credentials can be provided in the cell or via a config file.

Tuesday, May 20, 2014

IPython Extension of the Week: IPyCache

The IPython "extension of the week" is the IPyChache extention from Cyrille Rossant (rossant). The extension defines a cell magic (%%cache) that will cache the results of a cell into a persistent pickel file. When the cell is re-run the cached values of the variables are loaded instead of having them recomputed.

Here's an example notebook showing the operation.

The extension also has options for forcing the load of variables or forcing the rewrite of the cache.

Go check out IPyCache and let me know how it works! Thanks rossant!

Yann LeCun on Protection Against Privacy Invasion from AI

The best protections against privacy invasion through abusive data mining (whether it uses AI or mundane ML) are strong democratic institutions, an independent judiciary, and low levels of corruption. - Yann LeCun 

Monday, May 19, 2014

Auto-start IPython Notebook Server on Amazon EC2 Instance

Creating an IPython Notebook server on Amazon EC2 is a relatively straightforward process. I simply followed the instructions here. The only catch was that each time I created a new instance I had to SSH into the machine and manually start up the notebook server. That's irritating. I'd rather have them start automatically.

I added the following to my rc.local file:

ipython notebook --profile=nbserver > /tmp/ipynb.out 2>&1

But, when I rebooted the machine and looked at the log file it said, "/etc/rc.local: ipython: not found". Hmmm.

The problem is that the rc.local file runs as 'root', not as 'ubuntu', which is where I installed Anaconda and set up the notebook server. So I tried running the command as the 'ubuntu' user instead:

su ubuntu -c 'ipython notebook --profile=nbserver > /tmp/ipynb.out 2>&1'

That also failed. I think this is because .bashrc is not loaded - and that's where Anaconda is added to the user's PATH.

This didn't work either:

su ubuntu -c 'export PATH=/home/ubuntu/anaconda/bin:$PATH'

su ubuntu -c 'ipython notebook --profile=nbserver > /tmp/ipynb.out 2>&1 &'

In hindsight that makes sense: each su -c invocation runs in its own shell, so the PATH exported by the first command doesn't persist into the second.

Here's the final solution. It involves two steps.

1) Add the following line to your rc.local file:

su ubuntu -c 'bash /home/ubuntu/startup_ipynb.sh'

2) Create a script called "startup_ipynb.sh" that contains these lines:

export PATH=/home/ubuntu/anaconda/bin:$PATH

ipython notebook --notebook-dir=/writeable/path/ --profile=nbserver > /tmp/ipynb.out 2>&1 &

This will run those two lines as the 'ubuntu' user and start up the notebook server. 

Tuesday, February 25, 2014

A Brief Introduction to Compressed Sensing with Scikit-Learn

This is based on the blog post by Jaidev Deshpande. I tried to expand on and explain his work, though he deserves most of the credit for what's below. You can download the IPython Notebook I used for this post here.

Suppose we have a time signal. This signal is not known, but we do know something about it - namely that it's sparse in some domain. Maybe that means it's only made up of certain frequencies, so when you look at the signal in the frequency domain it's mostly zeros, with a few non-zero values intermixed.

Let's call this unknown signal \(X\). Now, even though we don't know the whole signal, we can still make observations of it, or samples. Sampling could be expensive, time consuming or not technologically feasible, which is why we want to limit the number of observations or samples to as small a number as possible.

Furthermore, when we make observations of the signal we aren't going to do it in a repeating pattern (e.g. every 10th sample). No, instead we're going to randomly sample the data.

Let's try this in Python:

%matplotlib inline
from matplotlib.pyplot import plot, show, figure, title
import matplotlib as plt   # note: plt is the matplotlib package here, hence plt.pyplot below
import numpy as np

from scipy.fftpack import dct, idct
Fs = 40e3  # Sample rate
duration = 1./8
N_samps = np.floor(duration*Fs)
M = 250    # Number of compressed "basis" functions - we're going from N_samps to M samps.
f1 = 200
f2 = 3950

print "Compression ratio {0}".format(M/N_samps)
t = np.linspace(0, duration, N_samps)

Compression ratio 0.05

X = np.sin(2*np.pi*f1*t) + np.sin(2*np.pi*f2*t)

figure(figsize=[10,4])
plot(t, X)
title('Original signal')
plt.pyplot.xlabel('Time (s)')
plt.pyplot.ylabel('X(t)')

Now we're going to choose random points in time at which to sample the data.

yi = np.random.randint(0, N_samps, (M,))
yi = np.sort(yi)
Y = X[yi]

figure(figsize=[10,4])
plot(t, X, 'b', t[yi], Y, 'r.')
title('Original Signal with Random Sampling Points')
plt.pyplot.xlabel('Time (s)')
plt.pyplot.ylabel('X(t) and X(random sample)')

Now we know two things: our samples, \(X(t_{samps})\), and the sample times, \(t_{samps}\). Using these we can transform our samples into the sparse domain. In this case we've chosen the Discrete Cosine domain, but it could just as easily have been the Discrete Fourier domain. In fact, you might want to try both just to see the differences.

We're computing the DCT for each of the possible sample points (N_samps) and then keeping only the rows corresponding to the times we actually sampled. \(A\) is now the set of DCT basis functions evaluated at our random sample times. It is literally a matrix where each column corresponds to a cosine function at a different frequency and each row corresponds to one of our randomly chosen sample times. In other words, \(A\) represents what those samples look like in the other domain.

D = dct(np.eye(N_samps))   # Construct the DCT basis functions for each of the frequencies
A = D[yi]                  # Downsample based on our random sampling
np.shape(A)                # (250, 5000)

This means we now know \(A\), based on the sample times, and \(Y\), our sampled data. We also know that the original data \(X\) was sparse in the DCT domain. We'll call the sparse DCT version of \(X\), \(\tilde X\). We are now trying to find the \(\tilde X\) that makes the equation \(A\tilde X = Y\) true: \(A\) is the translation from the DCT domain to our random time samples, \(\tilde X\) is in the DCT domain and is presumed sparse, and \(Y\) is a time-domain signal. In other words, we're trying to minimize \(\|Y - A\tilde X\|\) in some fashion.

We can use the LASSO to find the sparse solution. It's solving:

\[ c \, ||Y - A\tilde X||^2_2 + \alpha \, ||\tilde X||_1 \]

where \(c\) is some constant based on the number of samples, and \(||\tilde X||_1\) is the \(\ell_1\) norm, which pushes the solution \(\tilde X\) toward sparsity.

from sklearn.linear_model import Lasso   # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

lasso = Lasso(alpha=0.01)
lasso.fit(A, Y)

plot(lasso.coef_)

sparseness = np.sum(lasso.coef_ == 0)/N_samps
print "Solution is {0}% sparse".format(100.*sparseness)

Solution is 96.22% sparse

The plot above shows how many DCT coefficients in the final solution are non-zero. Remember, this solution is what we approximate \(X\) to be in the DCT domain. You can see that there are very very few coefficients that are not zero. To find out what we estimate the true signal to be in the time domain, \(\hat X\), we take the Inverse Discrete Cosine Transform (IDCT) of the LASSO solution.

Xhat = idct(lasso.coef_)

figure()
plot(t, Xhat)
title('Reconstructed signal')
figure()
plot(t, Xhat - X)
title('Error delta')

The plots above show the reconstructed signal in the time domain and the error between the reconstruction, \(\hat X\), and the original signal, \(X\). You can see how the error tends to go up at the beginning and end of the signal. I presume this is due to the fact that we don't have unlimited data and are taking a window of fixed length.

figure(figsize=[12,6])
plot(t, Xhat)
plot(t, X)
title("Original and reconstructed signal plotted together (green is orig)")

Don't forget that this reconstruction is achieved using only a small fraction of the total samples. The compression ratio is:

print "Signal is reconstructed using only {0}% of the data!".format(100.*M/N_samps)

Signal is reconstructed using only 5.0% of the data!

It's a little hard to tell what's going on in the figure above. A better way to look at things is in the frequency domain, since that's where the signal is sparse.

f = np.linspace(-Fs/2, Fs/2, N_samps)
figure(figsize=[20,10])
plot(f, np.log10(np.fft.fftshift(np.abs(np.fft.fft(Xhat)))))
title("Frequency spectrum of original signal & reconstruction")
plot(f, np.log10(np.fft.fftshift(np.abs(np.fft.fft(X)))), 'r')
plt.pyplot.xlabel('Frequency -Fs/2 to Fs/2')
plt.pyplot.ylabel('Power (dB)')