Gallamine's Scientific Computing Blog
A blog about scientific computing with Python and Matlab. See the work of an engineer and data scientist in practice. By William Cox.

<b>Setting the Background Color of Matplotlib Images</b> (2021-02-01)

<p>I ran into an issue where viewing Matplotlib images in a dark-mode browser or editor wasn't showing the axes of the plots. To fix this you can run one of the following at the top of your notebook or script:</p>

<pre>
import matplotlib as mpl
mpl.rcParams['figure.facecolor'] = 'white'

# or

import matplotlib.pyplot as plt
plt.style.use({'figure.facecolor': 'white'})
</pre>

<p>Supposedly you can also put `figure.facecolor: white` in your matplotlibrc file, but I haven't gotten that to work yet.</p>

<p>Now the plots all look like this:</p>

<img alt="" height="238" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqs5dFVkssIPsI720EvK0dnPfnPILCP1rOHwmO8LFLKP3rL8r2CwAmB9Vd05qMf5TjV9SpOs9nF0QaxCU-WcoM69vlC53rgmth0hdGjRIyMXS9bGSo6A0YsSSouNzqYHzRi-BO6babHAU-/" />

<b>Git cleanup remotes</b> (2020-08-12)

<p>Running `<a href="https://git-scm.com/docs/git-fetch#Documentation/git-fetch.txt---prune">git fetch origin --prune</a>` will remove unused remote branches.
This happens frequently when using GitHub, since it will automatically delete a remote branch that has been merged.</p>

<b>Renaming a Column in Pandas to One That Already Exists Can Break Things</b> (2019-06-19)

Today I wrestled with an irritating issue where I had a perfectly fine DataFrame, renamed some columns, and suddenly the thing was just broken. It turns out that in Pandas (v. 0.24.1), when you rename a column to an already existing column, <b>it just breaks.</b> Try this example:
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: Menlo; font-size: 9pt;"><span style="color: #cc7832;">import </span>pandas <span style="color: #cc7832;">as </span>pd
df = pd.DataFrame({<span style="color: #a5c261;">"colA"</span>: [<span style="color: #6897bb;">1</span><span style="color: #cc7832;">,</span><span style="color: #6897bb;">2</span>]<span style="color: #cc7832;">, </span><span style="color: #a5c261;">"colB"</span>: [<span style="color: #6897bb;">3</span><span style="color: #cc7832;">,</span><span style="color: #6897bb;">4</span>]})
df = df.rename(<span style="color: #aa4926;">columns</span>={<span style="color: #a5c261;">"colA"</span>: <span style="color: #a5c261;">"colB"</span>})
df.colB.unique()</pre>
<br />
Instead of printing "[1,2]" as you'd expect, instead it throws an <span style="font-family: Courier New, Courier, monospace;">AttributeError: 'DataFrame' object has no attribute 'unique'</span>. Other than that, the dataframe appears to be fine. Calling <span style="font-family: Courier New, Courier, monospace;">df.columns</span> show that there are now two columns with identical names. When you try and access that column name Pandas returns <b>both</b> in a DataFrame, rather than a single Series object for the one column. Since a DataFrame object doesn't have the <span style="font-family: Courier New, Courier, monospace;">unique()</span> function, that's why we get the error above.William Coxhttp://www.blogger.com/profile/15211955821510632709noreply@blogger.com0tag:blogger.com,1999:blog-4412045055998742403.post-27723527725760595192019-04-10T09:20:00.000-07:002019-04-10T09:24:33.921-07:00Get index of Pandas Series row when column matches certain valueSay you have a Pandas DataFrame that looks like:<br />
<b>Get index of Pandas Series row when column matches certain value</b> (2019-04-10)

Say you have a Pandas DataFrame that looks like:
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})</span></blockquote>
<br />
If you do a GroupBy operation on a specific column of the DataFrame, Pandas returns a Series object, like:
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">df3.groupby(['X'])['Y'].sum()<br />X<br />A 4<br />B 6<br />Name: Y, dtype: int64</span></blockquote>
<br />
Now if we want to find out which groups had a specific aggregate value - say, which groups had a sum == 4 - we can do something like:
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">>>> df3.groupby(['X'])['Y'].sum().eq(4)<br />X<br />A True<br />B False<br />Name: Y, dtype: bool</span></blockquote>
<br />
<br />
Now the question is, how do we get the *index* name where the row equals 4? (In this example we want `A`, since its value is `True` in the Series.)
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">>>> groupings = df3.groupby(['X'])['Y'].sum().eq(4)<br />>>> groupings.index[groupings == True]<br />Index([u'A'], dtype='object', name=u'X')</span></blockquote>
<br />
<br />
PS. `groupings.index[groupings is True]` doesn't work, even though PEP8 checkers will warn you to switch to it: `groupings is True` is an identity check against the Series object itself, which is simply False - the groupings object isn't truthy that way. The syntax `groupings.index[groupings.eq(True)]` is an alternative.
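For reference, here's the whole dance as runnable code. A small sketch based on standard pandas boolean indexing: you can also mask the index with the boolean Series directly, no comparison needed.

<pre>
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
groupings = df3.groupby(['X'])['Y'].sum().eq(4)

# A boolean Series works directly as a mask on the index:
print(groupings.index[groupings])    # Index(['A'], dtype='object', name='X')

# Equivalent: filter the Series first, then take its index.
print(groupings[groupings].index)
</pre>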
<b>Python Doesn't Require Commas in Lists .. Sorta</b> (2019-04-09)

Today I helped a colleague with a subtle Python bug. We have a system that queries for data given a list of IDs. The list of IDs looked like this:
<blockquote class="tr_bq">
ids = [<br /> '7d38c515-d543-4186-a6a6-e46d4e356a81' # location 1<br /> 'f384fc68-3030-473f-95a8-52d5fee6cfd4' # location 2<br /> 'b27fef7f-9e5d-4af5-8596-a6949dd257a5' # location 3<br />]</blockquote>
<br />
It took us an unfortunate amount of time to realize we were missing commas in that list. Python will blissfully concatenate adjacent string literals inside a list for you.
<br />
<blockquote class="tr_bq">
bad_list = ['a' 'b' 'c']<br />bad_list[0] == 'abc'<br />True</blockquote>
<br />
This is because:
<blockquote class="tr_bq">
"a""b" == "ab"<br />True</blockquote>
I'm failing to come up with a helpful example of where this behavior is useful, though.
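For what it's worth, the usual justification I've seen for the feature is splitting one long string literal across short lines, something like this sketch:

<pre>
# Adjacent literals fuse at compile time, which keeps long strings
# inside line-length limits without '+' or join():
message = (
    "This request failed because the upstream service timed out. "
    "Retry with exponential backoff before paging anyone."
)
print(message)
</pre>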
<b>Running a Function on Dask Workers At Startup</b> (2019-03-22)

I made the mistake of thinking that <span style="font-family: Courier New, Courier, monospace;">client.run()</span> would run a function on each Dask worker <i>even if the worker hadn't started yet</i>. It won't: workers that come online later never run it. Instead you'll need to take advantage of the <span style="font-family: Courier New, Courier, monospace;">register_worker_callbacks()</span> function, which registers a function to run on every worker at startup. It looks like this:
<span style="font-family: Courier New, Courier, monospace;">cluster = LocalCluster()</span><br />
<span style="font-family: Courier New, Courier, monospace;">client = Client(cluster)</span><br />
<span style="font-family: Courier New, Courier, monospace;">client.register_worker_callbacks(setup=your_function_name)</span><br />
<br />
I found this by <a href="https://github.com/dask/distributed/pull/2201/files#diff-5e9e1d3b3446423c741a9a2d406703e3R1243">looking through the tests for this function's pull request</a>. It's otherwise undocumented.

<b>Salary Progression (in Tech)</b> (2019-03-15)

This was an <a href="https://georgestocker.com/2019/03/14/my-salary-progression-in-tech/">insightful article</a> on one person's salary progression in technology. It averages out to roughly an $11,000 increase per year over the 15-year career so far, which matches pretty well with other folks I've talked to. An <a href="https://news.ycombinator.com/item?id=19393688">accompanying Hacker News post</a> has more datapoints. Stock options / RSUs tend to skew things and lend a high degree of variability to compensation; at higher levels it isn't uncommon for a large portion of your total compensation to be in RSUs.
<br />
Of course, where you live causes wild variation in the value of your income. Additionally, if you're working for dope cash but miserable, it's hard to see that as a net positive.

<b>New Job For 2019</b> (2019-03-14)

I'm now working as a Senior Machine Learning Engineer for Grubhub on the order volume forecasting team. We own predictive time series models that produce forecasts the business uses to schedule drivers.
<br />
The job search took about 3 months and involved at least 5 outright rejections, a lot of non-answers, and 2 offers.

<b>CountMinSketch In Python</b> (2017-06-08)

Thanks to my friend <a href="https://twitter.com/cdubhland">Chris</a> I've been pondering some of the <a href="https://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft">work from Misha Bilenko</a> at Microsoft. This led me down the path of investigating the CountMinSketch algorithm for tracking counts of entities in a stream. To help me learn I wrote a <a href="https://github.com/gallamine/countminsketch">Python implementation of CountMinSketch</a>.
<br />
You can use it like so:
<br />
<span style="font-family: Courier New, Courier, monospace;">from countminsketch.countminsketch import CountMinSketch</span><div>
<span style="font-family: Courier New, Courier, monospace;">d = 10</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">w = 100</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">cms = CountMinSketch(d=10, w=100)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">cms.add('test_value')</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">print("Count of elements is:")</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">print(cms.query('test_value'))</span></div>
<b>Reading Files with Encoding Errors Into Pandas</b> (2017-05-16)

I found myself in a situation where I needed to read a file into Pandas that had mixed character encoding. Pandas does not handle this situation: it requires a fixed encoding and throws an error when it encounters a bad line. Practically, this means the way you interpret the file's bytes differs from line to line. In my case, <b>most</b> of the lines were UTF-8 while some were in other encodings.
<br />
Character encoding is a particularly confusing problem (for me) so it took a while to figure out a workaround. I discovered that base Python provides <a href="https://docs.python.org/3/library/stdtypes.html#bytes.decode">different error handling modes</a> when decoding bytes into strings. The default, "strict" (which Pandas uses), throws a <a href="https://docs.python.org/3/library/exceptions.html#UnicodeError">UnicodeError</a> when a bad line is found. Other options include "ignore" and different varieties of replacement. For my case, I wanted the "backslashreplace" style, which converts bytes that can't be decoded into their backslash-escaped equivalents. For example, a stray 0xe9 byte that isn't valid UTF-8 becomes the literal four-character sequence \xe9 in the resulting Python string. Python also allows you to <a href="https://docs.python.org/3/library/codecs.html#codecs.register_error">register a custom error handler if you so desire</a>. If you wanted to be really fancy, you could use a custom error handler to guess other encoding types using <a href="https://ftfy.readthedocs.io/en/latest/#a-note-on-encoding-detection">FTFY</a> or <a href="http://chardet.readthedocs.io/en/latest/usage.html">chardet</a>.
<br />
Unfortunately Pandas' read_csv() method doesn't support the non-strict error handling, so I needed a way to decode the bytes before Pandas accessed them. My final solution was to wrap my file in an <a href="https://docs.python.org/3/library/io.html#io.TextIOWrapper">io.TextIOWrapper</a>, which let me specify the error handling and pass the wrapper directly to Pandas' read_csv() method.
<br />
<b>Example</b>:<br />
<blockquote class="tr_bq">
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;"> import gzip</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">import io</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">import pandas pd</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;"><br /></span>
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">gz = gzip.open('./logs/nasty.log.gz', 'r')</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace') </span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">df = pd.read_csv(decoder_wrapper, sep='\t')</span></blockquote>
Figuring all that out took about two days.

<b>How to Add A Hive Step to a Running Cluster on EMR</b> (2016-01-28)

Put the file on S3:
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">s3cmd put temp_1_load_logs_20160126.sql s3://my-bucket/</span><br />
<br />
Add the step:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">aws emr add-steps --cluster-id j-XXXXXXX --steps Type=Hive,Name="load logs",Args=[-f,s3://my-bucket/temp_1_load_logs_20160126.sql]</span>William Coxhttp://www.blogger.com/profile/15211955821510632709noreply@blogger.com0tag:blogger.com,1999:blog-4412045055998742403.post-77503699804505452382015-11-18T09:06:00.002-08:002015-11-18T09:06:25.120-08:00Recursively Find all the Files and Sizes of a Bucket on S3Say I want to recurse through a S3 bucket, find all the file sizes and sum them up? Easy:<br />
<span style="background-color: #2c67c8; color: white; font-family: Menlo; font-size: 18px;"><br /></span>
<span style="background-color: #2c67c8; color: white; font-family: Menlo; font-size: 18px;">s3cmd ls s3://your-s3-bucket/ --recursive | awk -F' ' '{s +=$3} END {print s}'</span><span style="background-color: #2c67c8; color: white; font-family: Menlo; font-size: 18px;"> </span><br />
The output of <span style="font-family: Courier New, Courier, monospace;">s3cmd ls</span> looks like:

<pre>
2015-11-15 12:22   4482528   s3://bucket/-4878692415071619643--6245724311294558574_479343588_data.0
2015-11-15 12:34  34398163   s3://bucket/-6827273792407145391--2667978502585357890_1957252193_data.0
2015-11-15 12:46   4558355   s3://bucket/2184012989583635362-3242759126742622102_1630577622_data.0
2015-11-15 12:59  13297607   s3://bucket/6147240539106964522-4824521201578762651_240049741_data.0
</pre>

So you want to split on whitespace, take the size (the third field), and sum the values. That's what awk -F' ' '{s += $3}' does (the -F ' ' sets the field separator to whitespace), and END {print s} prints the running sum once all lines have been read.
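If you'd rather stay in Python, a rough equivalent using boto3 looks like this (a sketch; it assumes boto3 is installed and AWS credentials are configured):

<pre>
import boto3

s3 = boto3.resource("s3")
total_bytes = sum(obj.size for obj in s3.Bucket("your-s3-bucket").objects.all())
print(total_bytes)
</pre>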
<b>Turn a {key, value} Python Dictionary into a Pandas DataFrame</b> (2015-07-20)

Quick solution to a problem I had today. I had a dictionary of {key: value} pairs that I wanted in a DataFrame. My solution:
<br />
<span style="font-family: Courier New, Courier, monospace;">import pandas as pd</span><br />
<span style="font-family: Courier New, Courier, monospace;">pd.DataFrame([[key,value] for key,value in python_dict.iteritems()],columns=["key_col","val_col"])</span>William Coxhttp://www.blogger.com/profile/15211955821510632709noreply@blogger.com2tag:blogger.com,1999:blog-4412045055998742403.post-59209134853556994852015-05-05T05:50:00.002-07:002015-05-05T05:50:28.960-07:00A Day in the Life of a Data Scientist (Part 1)<i>Here is a log of my day in all of it's pain and glory. It's not necessarily typical in its length or futility. Then again, there are worse days.</i><br />
<br />
<br />
8:30AM - Start Amazon EMR cluster in preparation for product beta test beginning next week. Eat breakfast while system is bootstrapping.<br />
<br />
9 AM - Email. Reading JIRA cards. Reading Spark documentation.<br />
<br />
10AM - Remember 10:30 AM meeting. Context switch.<br />
<br />
10:20AM - Meeting canceled. Context switch. Start looking at running a Spark cluster on EC2.<br />
<br />
10:30AM - Previously started cluster is operational now. Transfer files and begin the booting process. Process takes approximately 1.5 hrs to finish. After that the system should be monitor-only.<br />
<br />
10:35 AM - Try various spark cluster configurations that don't work. AWS spot pricing is the worst.<br />
<br />
11AM - Think, "if I was a real data scientist I'd probably be reading a paper right now." Don't read paper.<br />
<br />
12PM - Witty repartee on Twitter:<br />
<blockquote class="twitter-tweet" lang="en">
<div dir="ltr" lang="en">
<a href="https://twitter.com/benhamner">@benhamner</a> <a href="https://twitter.com/kaggle">@kaggle</a> building a model and then putting it into production only to see it negatively influence hundreds of customers is worse.</div>
— William Cox ن (@gallamine) <a href="https://twitter.com/gallamine/status/595262980725551104">May 4, 2015</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
<br />
12:15 PM - Go eat lunch. Sit on porch. Talk with my children and wife.<br />
<br />
1:05 PM - return. Try a different Spark cluster configuration. Monitor progress on the ML system started earlier.<br />
<br />
1:10 PM - Think, "I need to appear smart". Read description of <a href="https://en.wikipedia.org/wiki/Medcouple">Medcouple algorithm</a>.<br />
<br />
1:20 PM - Spark cluster running. Try logging in. Try running local IPython notebook to connect.<br />
More twittering:<br />
<blockquote class="twitter-tweet" data-partner="tweetdeck">
<div dir="ltr" lang="en">
Same RT <a href="https://twitter.com/gallamine">@gallamine</a>: I don’t always use Spark, but when I do … I have to use <a href="https://twitter.com/tdhopper">@tdhopper</a>’s slides to remember anything: <a href="http://t.co/RbQhr4BOVq">http://t.co/RbQhr4BOVq</a></div>
— Tim Hopper (@tdhopper) <a href="https://twitter.com/tdhopper/status/595278968514859008">May 4, 2015</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
<br />
1:50 PM<br />
Cluster connection error. Apparently a known issue with PySpark and using a standalone cluster. Try to fix.<br />
<br />
Install Anaconda on the cluster itself. Start a notebook server on the cluster and use this trick to forward the browser:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">ssh -i ~/key.pem -L 8889:localhost:8888 root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com</span><br />
<br />
More configuration errors. Can't load data from S3.<br />
<br />
2:40 PM - Still flailing.<br />
<br />
2:50 PM - Hate spark. Hate life. Start EMR cluster.<br />
<br />
3:00 PM - Coffee.<br />
<br />
3:10 PM - 2nd cluster still not started.<br />
<br />
3:25 PM - Bid price too low. Try different zone.<br />
<br />
3:40 PM - No capacity. Try different machine.<br />
<br />
4:00 PM - Answer data question on Slack.<br />
<br />
4:10 PM - So. Much. "Provisioning".<br />
<br />
4:14 PM - Write data queries hoping cluster will provision. Make some educated guesses as to which fields in the data will be useful.<br />
<br />
4:40 PM - Still no cluster. Try one last configuration on EMR and hope it works.<br />
<br />
4:50 PM - Switch to different task. Fix bug in bash script doing process auditing.<br />
<br />
4:56 PM - NOW my cluster starts! Context switch again.<br />
<br />
5:00 PM - Log into cluster. Start Hive query to batch 3 days of browser signature data.<br />
<br />
5:01 PM - While MR is loading data onto the cluster, switch to previous data. Load into a Google Docs spreadsheet for visual poking.<br />
<br />
5:02 PM - Query finished! Tables empty. Debugging ... oh, external table location was wrong. Fix that. Restart query.<br />
<br />
5:09 PM - Google model drift in random forest, because, why not. Hole in the literature. Make mental note.<br />
<br />
5:10 PM - Back to Python for parsing data.<br />
<br />
5:40 PM - Hive query finishes.<br />
<br />
5:50 PM - Fight with Hive syntax for extracting tuples from JSON strings.<br />
<br />
6:00 PM - Deal with a resume that was emailed to me. Add to hiring pipeline.<br />
<br />
6:05 PM - Finish query. Pull into Google docs for plotting.<br />
<br />
6:27 PM - Success! Useful data. Now I need dinner. Shutting down the cluster (but I worked so hard for it!)<br />
<br />
Conclusion - It seems we have some anomalous behavior with screen resolutions on our network. The first chart is the top 100 screen resolutions of OS X devices. The bottom chart is all the OS X screen resolutions in 3 days of data. Looks fishy.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin7N-RGCsqON8OcvS9iIn37DMIb7JMo3Al0AWR1BjepCrvl5oYqcfoITPveIvxOS4KPfZA2YK-Je22pJUCcF6I7ytVAQZZW-6vMSdhAjlXvvID6bWde2ZmcE6QdYyNv6pVvJ3f5Rhkqn6x/s1600/image-19.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin7N-RGCsqON8OcvS9iIn37DMIb7JMo3Al0AWR1BjepCrvl5oYqcfoITPveIvxOS4KPfZA2YK-Je22pJUCcF6I7ytVAQZZW-6vMSdhAjlXvvID6bWde2ZmcE6QdYyNv6pVvJ3f5Rhkqn6x/s1600/image-19.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyFDjsSPvHLlqNEcx8eT_0DhtMYcRWtz37rjXgu_LSovB0jF4MSR21VANak93F08FXm7obAZDHBMIyeAq2YJtA6qt5P9fS81ljp8PMdHMc1oRIhK51gzhEbUTeD9cbYcDsVObsREuI5AzM/s1600/image-20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyFDjsSPvHLlqNEcx8eT_0DhtMYcRWtz37rjXgu_LSovB0jF4MSR21VANak93F08FXm7obAZDHBMIyeAq2YJtA6qt5P9fS81ljp8PMdHMc1oRIhK51gzhEbUTeD9cbYcDsVObsREuI5AzM/s640/image-20.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
The folks with non-standard Apple-device screen resolutions are likely candidates for investigation of fraud.<br />
<br />
<br />
<br />
<b>Remote Work + Data Science</b> (2015-04-18)

I've been working as a remote data scientist for nearly a year now. Our team (of two!) is fully distributed and we're in the process of adding another data scientist. Finding other remote data science jobs is pretty difficult, so I decided to start another blog to champion the idea of remote data science and track jobs that fit that description. Please visit <a href="http://www.remotedatascience.com/">www.RemoteDataScience.com</a> and let me know what you think!

<b>Linux Date Injection into Hive</b> (2015-04-14)

This week I found myself needing to generate a table in Hive that used today's date in the output location. Basically I was running a daily report and wanted it to automatically send the output to the appropriate bucket on S3.<br />
<br />
To accomplish this, I used a combination of embedded Linux commands and Hive variables.<br />
<br />
First, in your Hive query, you need to turn on variable substitution:<br />
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;"><span style="color: #cc7832; font-weight: bold;">set </span>hive.variable.substitute=<span style="color: #cc7832; font-weight: bold;">true</span><span style="color: #cc7832;">;</span></pre>
<br />
Next, in your Hive query you can have an expression substituted for the variable value. For instance, you can create a table like this:
<br />
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;"><span style="color: #cc7832; font-weight: bold;">CREATE EXTERNAL TABLE </span>IF <span style="color: #cc7832; font-weight: bold;">NOT EXISTS </span>my_table
(
values <span style="color: #cc7832; font-weight: bold;">STRING</span><span style="color: #cc7832;">,</span><span style="color: #cc7832; font-weight: bold;">
</span>)
ROW FORMAT DELIMITED FIELDS TERMINATED <span style="color: #cc7832; font-weight: bold;">BY </span><span style="color: #a5c261; font-weight: bold;">'\t'</span></pre>
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;">LINES TERMINATED <span style="color: #cc7832; font-weight: bold;">BY </span><span style="color: #a5c261; font-weight: bold;">'\n'</span></pre>
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;">LOCATION <span style="color: #a5c261; font-weight: bold;">'s3://mybucket/${hiveconf:DATE_VARIABLE}'</span><span style="color: #cc7832;">;</span></pre>
<br />
The Hive syntax for a variable is <span style="font-family: Courier New, Courier, monospace;">${hiveconf:VARNAME}</span>. When calling Hive, you can supply a variable using the <span style="font-family: Courier New, Courier, monospace;">-hiveconf VARNAME=VALUE</span> syntax. For instance:
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;">hive -hiveconf DATE_VARIABLE=$(date +y=%Y/m=%m/d=%d/h=%H/) -f query.sql</pre>
<br />
Notice that the value of the variable in the above command is <span style="font-family: Courier New, Courier, monospace;">$(date +y=%Y/m=%m/d=%d/h=%H/)</span>. This is the syntax for telling Linux to execute the command inside the <span style="font-family: Courier New, Courier, monospace;">$( )</span> and return the value. You can also use backticks ( <span style="font-family: Courier New, Courier, monospace;">` `</span> ) instead of <span style="font-family: Courier New, Courier, monospace;">$( )</span>. Essentially the date command runs, returns a date string like <span style="font-family: Courier New, Courier, monospace;">y=2015/m=04/d=13/h=09/</span>, and assigns it to the Hive variable. That variable is then substituted into the Hive query to build a custom table location.
<br />
Super handy.

<b>Ham Technician's License</b> (2015-04-05)

After about 15 years of it being on my "to do" list, I finally took, and passed, the ham Technician's license exam. After 10 years of EE education it wasn't all that difficult. I did read the excellent <a href="http://www.kb6nu.com/study-guides/">guide from KB6NU</a> to get me up to speed on the regulation aspects and the "ham lingo" I didn't know. I'm not sure what I'll do with it, but it's nice to know I have more spectrum and transmit power accessible for when I figure it out!

<b>Query Data from Impala on Amazon EMR into Python, Pandas and IPython Notebook</b> (2014-10-02)

I've been envious of tools such as Hue that allow for an easy way to execute SQL-like queries on Hive or Impala and then immediately plot the results. Installing Hue on EMR has thus far thwarted me (if you know how, I'm all ears), so I needed a better way.
<br />
Hive is great for doing batch-mode processing of a lot of data, and for pulling data from S3 into the Hadoop HDFS. Impala then allows you to do fast(er) queries on that data. Both are 1-click installs using Amazon's EMR console (or command line).<br />
<br />
The difficulty now is that I'm writing queries at the command-line and don't have a particularly elegant way of plotting or poking at the results. Wouldn't it be great if I could get the results into an IPython Notebook and plot there? Two problems: 1) getting the results into Python and 2) getting access to a Notebook server that's running on the EMR cluster.<br />
<br />
I now have two solutions to these two problems:<br />
<br />
1)<a href="http://continuum.io/downloads#all"> Install Anaconda </a>on the EMR cluster:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">remote$> wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh</span><span style="font-family: 'Courier New', Courier, monospace;">remote</span><span style="font-family: Courier New, Courier, monospace;">$> bash Anaconda-2.1.0-Linux-x86_64.sh</span></blockquote>
<br />
Now log out of your SSH connection and reconnect using the command:<br />
<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">local$> ssh -i ~/yourkeyfile.pem -L 8889:localhost:8888 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com</span></blockquote>
<div class="p1">
<br /></div>
<div class="p1">
This starts a port forwarding SSH connection that connects <b>http://localhost:8889</b> on your local machine to <b>http://localhost:8888</b> on the remote machine (which is where the notebook will run). Now start the remote IPython Notebook server using the command</div>
<div class="p1">
<br /></div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">$> ipython notebook --browswer=none</span></blockquote>
<div class="p1">
<br /></div>
<div class="p1">
You should now be able to navigate to http://localhost:8889 on your local machine and see the notebook server running on your EMR machine! Ok, now what about getting the Impala data into the notebook?</div>
<br />
2) The <a href="https://github.com/cloudera/impyla">Impyla project</a> allows you to connect to a running Impala sever, make queries, and spit the output into a Pandas dataframe. You can install Impyla using the command<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace;">remote</span>$> pip install impyla</blockquote>
<br />
Inside your IPython Notebook you should be able to execute something like<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">from impala.util import as_pandas</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">from impala.dbapi import connect</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">conn = connect(host='localhost', port=21050)</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">cursor = conn.cursor()</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">cursor.execute('SELECT * FROM some_table')</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">df = as_pandas(cursor)</span></blockquote>
<br />
and now you have a magical dataframe with Overweight Data that you can plot and otherwise poke at.

<b>Use Boto to Start an Elastic Map-Reduce Cluster with Hive and Impala Installed</b> (2014-09-25)
I spent all of yesterday beating my head against the Boto documentation (or lack thereof). Boto is a popular (the?) tool for using Amazon Web Services (AWS) with Python. The parts of AWS that are used quite a bit have good documentation, while the rest suffer for explanation.
<br />
The task I wanted to accomplish was:<br />
<br />
<ol>
<li>Use Boto to start an elastic mapreduce cluster of machines.</li>
<li>Install Hive and Impala on the machines.</li>
<li>Use Spot instances for the core nodes.</li>
</ol>
<div>
Below is sample code to accomplish these tasks. I spent a great deal of time combing through the <a href="https://github.com/boto/boto/tree/master">source code for Boto</a>. You may need to do the same.</div>
<div>
<br /></div>
<div>
This code is for Boto 2.32.1:</div>
<div>
<br /></div>
<br />
<pre>
import boto.emr
from boto.emr.connection import EmrConnection
from boto.emr.step import InstallHiveStep
from boto.emr import BootstrapAction
from boto.emr.instance_group import InstanceGroup

# Assumes cluster_name, master_instance_type, slave_instance_type,
# num_instances and bidprice are defined elsewhere in your script.
conn = boto.emr.connect_to_region('us-east-1')
hive_step = InstallHiveStep()
bootstrap_impala = BootstrapAction(
    "impala",
    "s3://elasticmapreduce/libs/impala/setup-impala",
    ["--base-path", "s3://elasticmapreduce", "--impala-version", "latest"])
instance_groups = [
    InstanceGroup(1, "MASTER", master_instance_type, "ON_DEMAND", "mastername"),
    InstanceGroup(num_instances, "CORE", slave_instance_type, "SPOT",
                  "slavename", bidprice=bidprice)]
jobid = conn.run_jobflow(
    cluster_name,
    log_uri="s3n://log_bucket",
    ec2_keyname="YOUR EC2 KEYPAIR NAME",
    availability_zone="us-east-1e",
    instance_groups=instance_groups,
    num_instances=str(num_instances),
    keep_alive="True",
    enable_debugging="True",
    hadoop_version="2.4.0",
    ami_version="3.1.0",
    visible_to_all_users="True",
    steps=[hive_step],
    bootstrap_actions=[bootstrap_impala])
</pre>
<b>Use Asciinema to Record and Share Terminal Sessions</b> (2014-08-20)

I discovered a cool tool today by listening to <a href="http://jeroenjanssens.com/">Jeroen Janssens'</a> <a href="http://event.on24.com/eventRegistration/EventLobbyServlet?target=lobby.jsp&eventid=798721&sessionid=1&key=5BB1A35E851FFB763CBF3CA5423725C0&eventuserid=102528696">talk on Data Science at the Command Line</a>. The tool is <a href="https://asciinema.org/">Asciinema</a>, a terminal plugin that lets you record your terminal session, save it, and then share it - with copyable text! It supports both OS X and Linux.
<br />
Here's an example from their website:<br />
<br />
<script async="" id="asciicast-10214" src="https://asciinema.org/a/10214.js" type="text/javascript"></script>
Can't wait to use this for this blog!

<b>Automatically Partition Data in Hive using S3 Buckets</b> (2014-08-18)

Did you know that if you are processing data stored in S3 using Hive, you can have Hive automatically partition the data (a logical separation) by encoding the S3 path names as <span style="font-family: Courier New, Courier, monospace;">key=value</span> pairs? For instance, if you have time-based data and you store it in paths like this:
<br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">date=20140801</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">date=20140802</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">date=20140803</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">...</span><br />
<br />
And you build a table in Hive, like<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">CREATE EXTERNAL TABLE time_data(</span><br />
<span style="font-family: Courier New, Courier, monospace;"> value STRING,</span><br />
<span style="font-family: Courier New, Courier, monospace;"> value2 INT,</span><br />
<span style="font-family: Courier New, Courier, monospace;"> value3 STRING,</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ...</span><br />
<span style="font-family: Courier New, Courier, monospace;">)</span><br />
<span style="font-family: Courier New, Courier, monospace;">PARTITIONED BY(date STRING)</span><br />
<span style="font-family: Courier New, Courier, monospace;">LOCATION s3n://root_path_to_buckets/</span><br />
<br />
Hive will automatically know that your data is logically separated by dates. Usually this requires you to refresh the partition list by calling the command:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">ALTER TABLE time_data RECOVER PARTITIONS;</span><br />
<br />
After that, you can check to see if the partitions have taken using the SHOW command:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">SHOW PARTITIONS time_data;</span><br />
<br />
Now when you run a SELECT command, Hive will only load the data needed. This saves a tremendous amount of downloading and processing time. Example:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">SELECT value, value2 FROM time_data WHERE date > "20140802"</span><br />
<br />
This will only load 1/3 of the data (since 20140801 and 20140802 are excluded).

<b>Buying a Car With Data Science!</b> (2014-07-31)

<i>Forget Big Data, here's how I used small data and stupid simple tools to make buying a car easier for me to stomach.</i>
<br />
<div>
When your family's size increases past the ⌊E[# of children]⌋ in America, you need a bigger vehicle. Given that we now have a family of 3 children, and that the 3rd child will need a carseat in a few months, it was time to buy a family van. <br />
<br />
I'm a big fan of <a href="http://www.ynab.com/">staying</a> <a href="http://www.daveramsey.com/">out</a> <a href="http://www.clarkhoward.com/">of</a> <a href="https://www.biblegateway.com/passage/?search=Proverbs+22%3A7">debt</a> and so I had a fixed budget with which to acquire a vehicle - a used vehicle. A great place to start looking is with the <a href="http://www.autotempest.com/">AutoTempest</a> search engine, which aggregates data from lots of different sites. The problem is that it's difficult to know how much you should pay for any given vehicle. If you're buying new you can check something like <a href="http://www.truecar.com/">TrueCar</a> and there's resources like Kelly Blue Book, NADA and Edmunds but from past used buying experience those services tended to underestimate the actual costs with most vehicles I've bought, and while some folks love to negotiate, I find it difficult without any "ground truth" to base my asking price off of.<br />
<br />
<br />
I toyed around with the idea of doing a full-fledged scraping of the data, but it just wasn't worth the time since I was under a deadline. Instead I took the path of least resistance and asked my wife to pull together basic info on 20 vehicles matching our criteria - year, mileage and asking price. Together we stuck it into a Google Document and plotted the results:<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhouRrlVF5d0Fk18-nCcKjDiDl9EgFtVSEH_Dt8If3PQfjIwxzr9bfO8uqK0TsBr9xcNEM34gX1NwH6ewAN-suf1ZJ-AYgNmooV1YPfhHwV3bbLy-p3AHdgjthajw3OP6YaTGnKYwydxwrr/s1600/image.png" /><br />
<br />
<br />
To my surprise and delight, there seemed to be two distinct collections of data - an upper and a lower priced section. Since Google Docs doesn't provide any easy way to put in a regression line, I moved over to Excel and added those in:<br />
<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3Bn9DLsQIUtTr7rANVGkHahDsfw9wdd5ONc5mC1qN5HSNa1HHKMicuQR9RKHaY46ubrDg2px8ytvL4CvQiv0NDgDBs5krBNAJIJZ6mexOuycdgSJBCpI5QoysIAe7FDLNRaV-R3zNNSuQ/s1600/car_regression.png"><img border="0" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3Bn9DLsQIUtTr7rANVGkHahDsfw9wdd5ONc5mC1qN5HSNa1HHKMicuQR9RKHaY46ubrDg2px8ytvL4CvQiv0NDgDBs5krBNAJIJZ6mexOuycdgSJBCpI5QoysIAe7FDLNRaV-R3zNNSuQ/s1600/car_regression.png" width="640" /></a><br />
<br />
<br />
The data points highlighted in green were ones that I was considering. Suddenly I had isolated two vehicles that were "overpriced" and ripe for easy negotiation. I also chose the highest-price, lowest-mileage vehicle and asked for a significantly lower price on a whim.</div>
<div>
<br /></div>
<div>
The additional data I'd gathered gave me confidence when negotiating. I contacted 2 dealerships with my data and the price I wanted to pay (slightly under the lower price regression). Ultimately the 1st person I'd talked to accepted my offer, which I knew was good from my data, and I didn't have to worry about whether or not I should keep negotiating.</div>
<div>
<br /></div>
<div>
What my data DIDN'T do for me:</div>
<div>
<ul>
<li>The data didn't impress anyone</li>
<li>The data didn't magically make people accept my offers</li>
<li>The data didn't make buying a car easy</li>
</ul>
<div>
What my data DID do for me:</div>
</div>
<div>
<ul>
<li>Took away the awful feeling of not knowing.</li>
<li>Gave me confidence when negotiating offers.</li>
<li>Let me quickly see which vehicles I should pursue and which to not focus on.</li>
</ul>
<div>
Now I'm driving a baller ... van. Cool indeed.</div>
</div>
<b>Python Pandas Group by Column A and Sum Contents of Column B</b> (2014-07-31)

Here's something that I can never remember how to do in Pandas: group by 1 column (e.g. Account ID) and sum another column (e.g. purchase price). So, here's an example for our reference:
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">data.groupby(by=['account_ID'])['purchases'].sum()</span></blockquote>
<br />
Simple, but not worth re-Googling each time!

<b>OSCON Wednesday Recap</b> (2014-07-24)

Ended up in Paul Frankwick's talk on "Build Your Own Exobrain" a bit late, but it worked out well. "Think of it as IFTTT, but free, open source, and you keep control of your privacy." Had a great conversation with Paul afterwards regarding stochastic time and mood tracking. Hopefully we'll get some stochastic triggers added to Exobrain, along with Python support, soon.<br /><br />Next I listened to Adam Gibson give a <a href="http://www.oscon.com/oscon2014/public/schedule/detail/33709">talk on his deep learning library, DeepLearning4j</a>. He's clearly a bright guy and is very passionate about his project. I'd previously watched him give a similar talk at the <a href="http://www.slideshare.net/jpatanooga/hadoop-summit-2014-san-jose-introduction-to-deep-learning-on-hadoop">Hadoop Summit along with Josh Patterson</a>. I spent some time talking with him after the session and trading machine learning stories. He nearly inspired me to learn Scala and start hacking on deeplearning4j - it sounds like a fabulous platform with all the possible moving pieces you could want for building a deep learning pipeline. <div>
<br /></div>
<div>
Afterwards I went to <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34414">Harrison Mebane's talk on spotting Caltrain cars using a Raspberry Pi outfitted with microphones and a camera</a>. It looked like a neat project incorporating data, sensors and hardware.<br /><br />Next, on a whim, I went to <a href="http://www.oscon.com/oscon2014/public/schedule/detail/33485">Tobias Zander's talk on web security</a>. I know very little about security so I was fascinated by all of the interesting ways to compromise a system he showcased. He showed how clever hackers can learn all sorts of information in non-obvious ways. He also royally trolled the audience by using a Facebook compromise to gather people's Facebook profile pictures after they visited his website during the talk. </div>
<div>
<br /></div>
<div>
Finally, I went to a lovely talk by <a href="http://www.oscon.com/oscon2014/public/schedule/detail/33997">Tim Bell on CERN's IT infrastructure and how they went agile</a>. It was a fascinating talk that dove into the complexities of such a massive system. The difficulties - political, scientific and technological - are enormous. When the video is posted it's well worth your time to go and watch.</div>
<b>OSCON Tuesday Recap</b> (2014-07-23)

Excellent set of keynotes. I especially enjoyed the one from <a href="http://www.planet.com/">Planet Labs</a> - inspiring work to photograph the entire globe, every day.
<br />
Next was a talk on <a href="http://www.oscon.com/oscon2014/public/schedule/detail/35390">building an open Thermostat</a> from Zach Supalla at Spark.io, the makers of an Internet connected microcontroller and cloud infrastructure. Zach says building hardware is hard, but it's easy to get noticed - if you build anything remotely cool you'll be on Engadget with no problem.<br />
<br />
A talk on <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34192">Functional Thinking by Rob Ford</a>, who is a great speaker, was informative but wasn't exactly something that I can apply to my work right now. I at least caught up on some of the nomenclature and can use it as a jumping-off point for future learning. Apparently all major languages are adding functional programming support these days (Python?).<br />
<br />
Ruth Suehle gave a <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34018">tremendously fun talk on Raspberry Pi hacks</a> - it also turns out she lives in my city and knows a bunch of the people I do. Go figure! She inspired me to go buy a Pi and do something other than a) leave it in a box or b) put XBMC on it. I'm thinking a weather station would be a fun project to build.<br />
<br />
Tim Berglund gave a (packed!) talk on <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34756">"Graph Theory you Need to Know"</a>. Tim is a good speaker, but the talk struggled a bit with needing to pack in lots of definitions (not Tim's fault). I never knew how easy it is to count the N-length paths between nodes from the adjacency matrix - just raise the matrix to the Nth power! Also neat to see a quick example of going from the graph to a Markov chain with probabilities.<br />
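A quick numpy illustration of that trick (my example, not Tim's):

<pre>
import numpy as np

# Adjacency matrix of a 3-node graph: node 0 connects to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

# Entry (i, j) of A**n counts the walks of length n from i to j.
print(np.linalg.matrix_power(A, 2))
</pre>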
<br />Ethan Dereszynski and Eric Butler from Webtrends showed off their (beautiful!) <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34809">realtime system for observing and predicting user behavior on a website</a>. It uses Kafka/Storm to train and classify user behavior using an HMM - the dashboard can show you, in real time, individual users on your site and the probability that they'll take some action. You can then serve them ads or coupons based on how likely they are to buy/leave/etc. I want to talk to these guys more, because I'm trying to solve a similar problem at <a href="http://www.distilnetworks.com/">Distil</a>.<div>
<br /></div>
<div>
Finally, my<a href="http://www.oscon.com/oscon2014/public/schedule/detail/34164"> talk on the Fourier Transform, FFT, and How to Use It </a>went smashingly well. I hit perfect timing, saw lots of mesmerized faces and had plenty of questions afterwards. The slides are up and the code will be uploaded soon. </div>