
Thursday, July 31, 2014

Buying a Car With Data Science!

Forget Big Data, here's how I used small data and stupid simple tools to make buying a car easier for me to stomach.

When your family's size increases past the ⌊E[number of children]⌋ in America, you need a bigger vehicle. Given that we now have 3 children, and that the 3rd will need a car seat in a few months, it was time to buy a family van.

I'm a big fan of staying out of debt, so I had a fixed budget with which to acquire a vehicle - a used vehicle. A great place to start looking is the AutoTempest search engine, which aggregates listings from lots of different sites. The problem is that it's difficult to know how much you should pay for any given vehicle. If you're buying new you can check something like TrueCar, and there are resources like Kelley Blue Book, NADA and Edmunds, but from past used-buying experience those services tended to underestimate the actual prices of most vehicles I've bought. And while some folks love to negotiate, I find it difficult without any "ground truth" to base my asking price on.


I toyed around with the idea of doing a full-fledged scraping of the data, but it just wasn't worth the time since I was under a deadline. Instead I took the path of least resistance and asked my wife to pull together basic info on 20 vehicles matching our criteria: year, mileage and asking price. Together we stuck it into a Google Document and plotted the results:


To my surprise and delight, there seemed to be two distinct collections of data - an upper and a lower priced tier. Since Google Docs doesn't provide an easy way to add regression lines, I moved over to Excel and added those in:

The data points highlighted in green were ones I was considering. Suddenly I had isolated two vehicles that were overpriced and ripe for easy negotiation. I also chose the highest-priced (and lowest-mileage) vehicle and asked for a significantly lower price on a whim.

The additional data I'd gathered gave me confidence when negotiating. I contacted 2 dealerships with my data and the price I wanted to pay (slightly under the lower regression line). Ultimately the 1st person I'd talked to accepted my offer, which I knew was good from my data, so I didn't have to worry about whether or not I should keep negotiating.
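
If you'd rather skip Excel entirely, here's roughly how the same scatter-plus-regression-lines analysis looks in Python (a minimal sketch: the mileages, prices and the cutoff between the two tiers below are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Toy data: mileage (thousands of miles) and asking price for each listing.
mileage = np.array([45, 60, 72, 80, 95, 50, 65, 85, 100, 110], dtype=float)
price = 1000 * np.array([24, 22, 21, 20, 18.5, 19, 17, 15, 13, 12])

# Split the listings into an upper and a lower price tier
# (here with an eyeballed cutoff, much like reading the chart by eye).
upper = price >= 18000
lower = ~upper

# Fit a separate least-squares line to each tier.
fit_hi = np.polyfit(mileage[upper], price[upper], deg=1)
fit_lo = np.polyfit(mileage[lower], price[lower], deg=1)

xs = np.linspace(mileage.min(), mileage.max(), 100)
plt.scatter(mileage, price)
plt.plot(xs, np.polyval(fit_hi, xs), label='upper tier')
plt.plot(xs, np.polyval(fit_lo, xs), label='lower tier')
plt.xlabel('Mileage (thousands)')
plt.ylabel('Asking price ($)')
plt.legend()
plt.show()

A listing sitting well above its tier's line is "overpriced" and a good candidate for negotiation; an offer slightly under the lower line is an aggressive but data-backed starting point.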

What my data DIDN'T do for me:
  • The data didn't impress anyone
  • The data didn't magically make people accept my offers
  • The data didn't make buying a car easy
What my data DID do for me:
  • Took away the awful feeling of not knowing.
  • Gave me confidence when negotiating offers.
  • Let me quickly see which vehicles to pursue and which to ignore.
Now I'm driving a baller ... van. Cool indeed.

Python Pandas Group by Column A and Sum Contents of Column B

Here's something that I can never remember how to do in Pandas: group by 1 column (e.g. Account ID) and sum another column (e.g. purchase price). So, here's an example for our reference:

data.groupby(by=['account_ID'])['purchases'].sum()
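
For reference, here's a tiny self-contained example of what that line produces (the account IDs and purchase amounts are made up for illustration):

import pandas as pd

data = pd.DataFrame({
    'account_ID': ['A1', 'A2', 'A1', 'A3', 'A2'],
    'purchases':  [10.0, 5.0, 2.5, 7.0, 1.0],
})

# One row per account, with that account's purchases summed.
totals = data.groupby(by=['account_ID'])['purchases'].sum()
print(totals)
# account_ID
# A1    12.5
# A2     6.0
# A3     7.0
# Name: purchases, dtype: float64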

Simple, but not worth re-Googling each time!

Thursday, July 24, 2014

OSCON Wednesday Recap

Ended up in Paul Frankwick's talk on "Build Your Own Exobrain" a bit late, but it worked out well. "Think of it as IFTTT, but free, open source, and you keep control of your privacy." Had a great conversation with Paul afterwards regarding stochastic time and mood tracking. Hopefully we'll get some stochastic triggers added to Exobrain, along with Python support, soon.

Next I listened to Adam Gibson give a talk on his deep learning library, DeepLearning4j. He's clearly a bright guy and is very passionate about his project. I'd previously watched him give a similar talk at the Hadoop Summit along with Josh Patterson. I spent some time talking with him after the session and trading machine learning stories. He nearly inspired me to learn Scala and start hacking on DeepLearning4j - it sounds like a fabulous platform with all the possible moving pieces you could want for building a deep learning pipeline.

Afterwards I went to Harrison Mebane's talk on spotting Caltrain cars using a Raspberry Pi outfitted with microphones and a camera. It looked like a neat project incorporating data, sensors and hardware.

Next, on a whim, I went to Tobias Zander's talk on web security. I know very little about security, so I was fascinated by all of the interesting ways he showcased to compromise a system. He showed how clever hackers can learn all sorts of information in non-obvious ways. He also royally trolled the audience by using a Facebook compromise to gather people's profile pictures after they visited his website during the talk.

Finally, I went to a lovely talk by Tim Bell on CERN's IT infrastructure and how they went agile. It was a fascinating talk that dove into the complexities of such a massive system. The difficulties, political, scientific and technological, are enormous. When the video is posted it's well worth your time to go and watch.

Wednesday, July 23, 2014

OSCON Tuesday Recap

Excellent set of keynotes. Especially enjoyed the one from Planet Labs - inspiring work to photograph the entire globe, every day.

Next was a talk on building an open thermostat from Zach Supalla at Spark.io, the makers of an Internet-connected microcontroller and its cloud infrastructure. Zach says building hardware is hard, but it's easy to get noticed - if you build anything remotely cool you'll be on Engadget with no problem.

A talk on Functional Thinking by Neal Ford, who is a great speaker, was informative but wasn't exactly something that I can apply to my work right now. I at least caught up on some of the nomenclature and can use it as a jumping-off point for future learning. Apparently all major languages are adding functional programming support these days (Python?).

Ruth Suehle gave a tremendously fun talk on Raspberry Pi hacks - it also turns out she lives in my city and knows a bunch of the same people I do. Go figure! She inspired me to go buy a Pi and do something other than a) leave it in a box or b) put XBMC on it. I'm thinking a weather station would be a fun project to build.

Tim Berglund gave a (packed!) talk on "Graph Theory You Need to Know". Tim is a good speaker, but the talk struggled a bit with needing to pack in lots of definitions (not Tim's fault). I never knew how easy it was to take an adjacency matrix and count the N-length paths between nodes - just raise the matrix to the Nth power! Also neat to see a quick example of going from the graph to a Markov chain with probabilities.
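
To make that concrete, here's a quick NumPy sketch (my own toy graph, not one from Tim's talk): entry (i, j) of the Nth power of the adjacency matrix counts the length-N paths from node i to node j.

import numpy as np

# Adjacency matrix for a small directed graph: 0 -> 1, 1 -> 2, 0 -> 2.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])

# Entry (i, j) of A^n counts the paths of length n from i to j.
A2 = np.linalg.matrix_power(A, 2)
print(A2)
# [[0 0 1]
#  [0 0 0]
#  [0 0 0]]
# Exactly one 2-step path from node 0 to node 2 (via node 1).
# Row-normalizing the nonzero rows would turn A into the transition
# matrix of a Markov chain, as in Tim's probability example.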

Ethan Dereszynski and Eric Butler from Webtrends showed off their (beautiful!) real-time system for observing and predicting user behavior on a website. It uses Kafka and Storm to train and classify user behavior using an HMM - the dashboard can show you, in real time, individual users on your site and the probability that they'll take some action. You can then serve them ads or coupons based on how likely they are to buy/leave/etc. Want to talk to these guys more, because I'm trying to solve a similar problem at Distil.

Finally, my talk on the Fourier Transform, FFT, and How to Use It went smashingly well. I hit perfect timing, saw lots of mesmerized faces and had plenty of questions afterwards. The slides are up and the code will be uploaded soon. 

Sunday, July 20, 2014

Welcome OSCON Visitors!

Welcome, OSCON visitors! If you're here because of my talk on Time Series Analysis and the Fourier Transform, here is a link to the presentation in PowerPoint form - more formats coming soon. I'll be uploading code later today.

Thursday, July 3, 2014

How do I Read Python Error Messages?

One of the complaints I've heard about Python from Matlab developers is that the error messages can be cryptic. 

Here's an example of an error I recently got while running some Python code (see Error Message below). Compared with a Matlab-style error, this can be a bit overwhelming. Here are a few tips:
  1. Start at the bottom. That will be where the actual error message is. In this case it's "ValueError: to_rgba: Invalid rgba arg "['#eeeeee']" need more than 1 value to unpack". Notice that the argument in question is inside of square brackets - that means it's a list with 1 item in it. Perhaps that's telling.
  2. Find where the code you wrote is breaking. Often the stack of errors is deep inside some other module or code, but what you wrote is actually passing in bad data. The 2nd block of errors starting with "<ipython-input-2-2a6f1bb6961e>" is code that we wrote. Looks like the offending line is "color = colors_bmh[ y[ix].astype(int) ]". Somehow we're setting the array of "color" to some bad values.
  3. Look at the specific error message, "ValueError: ... need more than 1 value to unpack". This means that the code was expecting 2 or more values to be returned but only 1 was. 
It actually took me a while to debug the code. In this case, the point I made under Step 1 above about the square brackets (indicating a list) around the argument was the key. The colors array was defined this way:

color = colors_bmh[ y[ix].astype(int) ]

The problem with that statement is that if you look at the array's shape (np.shape(color)), it's 2-dimensional, (100, 1), whereas the function was expecting a list or a 1-dimensional array. Changing the code to this fixed it:

color = colors_bmh[ y[ix].astype(int) ].ravel()

The ravel() method flattens an array to 1-D; for a (100, 1) array that has the same effect as dropping the singleton dimension, much like Matlab's squeeze() function (NumPy has a squeeze() of its own, too). Hope that helps!
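
A quick way to see the difference at a Python prompt (a minimal sketch of the shape problem above):

import numpy as np

color = np.array([['#eeeeee']] * 3)  # shape (3, 1): a column of strings
print(color.shape)              # (3, 1)
print(color.ravel().shape)      # (3,)  - flattened to 1-D
print(np.squeeze(color).shape)  # (3,)  - same result here, since the
                                # only extra dimension has size 1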

Error Message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-808bb77074e6> in <module>()
      3 p = np.random.randn(len(y))
      4 
----> 5 separation_plot(p,y)

<ipython-input-2-2a6f1bb6961e> in separation_plot(p, y, **kwargs)
     29         bars = ax.bar( np.arange(n), np.ones(n), width=1., 
     30                 color = colors_bmh[ y[ix].astype(int) ],
---> 31                 edgecolor = 'none')
     32         ax.plot( np.arange(n), p[ix,i], "k", 
     33                 linewidth = 3.,drawstyle="steps-post" )

/Users/william/anaconda/lib/python2.7/site-packages/matplotlib/axes.pyc in bar(self, left, height, width, bottom, **kwargs)
   4976             color = [None] * nbars
   4977         else:
-> 4978             color = list(mcolors.colorConverter.to_rgba_array(color))
   4979             if len(color) == 0:  # until to_rgba_array is changed
   4980                 color = [[0, 0, 0, 0]]

/Users/william/anaconda/lib/python2.7/site-packages/matplotlib/colors.pyc in to_rgba_array(self, c, alpha)
    409             result = np.zeros((nc, 4), dtype=np.float)
    410             for i, cc in enumerate(c):
--> 411                 result[i] = self.to_rgba(cc, alpha)
    412             return result
    413 

/Users/william/anaconda/lib/python2.7/site-packages/matplotlib/colors.pyc in to_rgba(self, arg, alpha)
    363         except (TypeError, ValueError) as exc:
    364             raise ValueError(
--> 365                 'to_rgba: Invalid rgba arg "%s"\n%s' % (str(arg), exc))
    366 
    367     def to_rgba_array(self, c, alpha=None):

ValueError: to_rgba: Invalid rgba arg "['#eeeeee']"
need more than 1 value to unpack

Tuesday, July 1, 2014

Volunteer Work to Boost Data Science Skills

I came across an insightful Reddit comment about doing volunteer work in data science. It's encouraging to see folks putting their skills to good use, and it should be illuminating for those trying to "break into" the field.

Here's vmsmith's answer on how he volunteered to boost his data science skills:

Well, I volunteered to be the data manager for my state delegate. That wasn't too demanding or in-depth, but she did have a lot of data on constituents and donors that needed to be munged and put into consistent formats, and it was a chance to start writing small Python programs to work with .csv files.
Second, I volunteered at the research library at a nearby university, and ended up writing a pretty comprehensive report on data management. This led to two things: (1) a paid consulting offer at another university to help them get a data management plan in place, and (2) another volunteer gig actually developing a data management application in Python that's modeled on the Open Archival Information System (OAIS).
Third, I offered to do data munging for a physicist I know who's doing a hyperspectral imagery project for a government research activity, and needed someone to munge all of the geospatial data to create integrated .kmz files for each experimental session.
Finally, I volunteered to be on NIST's Big Data Reference Architecture Working Group.
All of those things (1) increased my skill and knowledge levels, (2) provided decent resume bullets, and (3) developed references and networking contacts.