Thursday, June 8, 2017

CountMinSketch In Python

Thanks to my friend Chris I've been pondering some of the work from Misha Bilenko at Microsoft. This led me down the path of investigating the CountMinSketch algorithm for tracking counts of entities in a stream. To help me learn I wrote a Python implementation of CountMinSketch.

You can use it like so:

from countminsketch.countminsketch import CountMinSketch

d = 10   # number of hash rows in the sketch
w = 100  # number of counters per row
cms = CountMinSketch(d=d, w=w)
# after adding stream elements, query the sketch for an element's estimated count
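To make the idea concrete, here is a minimal, self-contained toy version of the data structure itself. This is an illustrative sketch, not my library's actual implementation, and the class and method names below are my own:

```python
import hashlib

# An illustrative toy Count-Min Sketch (not the library's actual code):
# d hash rows of w counters each; estimates can over-count, never under.
class SimpleCMS:
    def __init__(self, d, w):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _buckets(self, item):
        # one (row, column) pair per hash row, seeded by the row index
        for row in range(self.d):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.w

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def count(self, item):
        # the minimum across rows limits inflation from hash collisions
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = SimpleCMS(d=10, w=100)
for word in ["cat", "cat", "dog"]:
    cms.add(word)
print(cms.count("cat"))  # 2 (an upper bound; collisions could inflate it)
```

Because the counters only ever increment, the minimum across rows is an upper bound on the true count: hash collisions can inflate an estimate, but never deflate it.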

Tuesday, May 16, 2017

Reading Files with Encoding Errors Into Pandas

I found myself in a situation where I needed to read a file with mixed character encodings into Pandas. Pandas does not handle this situation: it requires a single, fixed encoding and throws an error when encountering a bad line. Practically, this means you have a file of bytes where the way you should interpret those bytes differs from line to line. In my case, most of the lines were UTF-8 while some were in other encodings.

Character encoding is a particularly confusing problem (for me), so it took a while to figure out a workaround. I discovered that base Python provides different error-handling modes when decoding bytes into strings. The default, "strict" (which Pandas uses), throws a UnicodeDecodeError when a bad byte sequence is found. Other options include "ignore" and several varieties of replacement. For my case, I wanted to use the "backslashreplace" style, which converts bytes that aren't valid UTF-8 into backslash-escaped byte sequences. For example, the Shift-JIS bytes for the characters "グケゲ" would come through in my Python string as "\x83O\x83P\x83Q" instead of raising an error. Python also allows you to register a custom error handler if you so desire. If you wanted to be really fancy, you could use a custom error handler to guess other encodings using FTFY or chardet.
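For instance, a custom handler can be registered under a name of your choosing via codecs.register_error and then used anywhere an errors= argument is accepted. The handler name and replacement policy below are just illustrative:

```python
import codecs

# Custom decode-error handler: log the bad bytes, substitute "?",
# and resume decoding after them. (Name and policy are illustrative.)
def log_and_replace(err):
    bad = err.object[err.start:err.end]  # the offending byte(s)
    print(f"bad bytes {bad!r} at offset {err.start}")
    return ("?", err.end)  # (replacement string, resume position)

codecs.register_error("logreplace", log_and_replace)

print(b"abc\x9ddef".decode("utf-8", errors="logreplace"))  # abc?def
# the built-in "backslashreplace" handler, for comparison:
print(b"\x83O\x83P\x83Q".decode("utf-8", errors="backslashreplace"))
```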

Unfortunately, Pandas' read_csv() method doesn't support the non-strict error handling, so I needed a way to decode the bytes before Pandas saw them. My final solution was to wrap my file in an io.TextIOWrapper, which allowed me to specify the error handling and pass the wrapper directly to Pandas' read_csv() method.

import gzip
import io
import pandas as pd

# open the gzipped log in binary mode; the wrapper handles decoding
gz = gzip.open('./logs/nasty.log.gz', 'rb')
decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace')
df = pd.read_csv(decoder_wrapper, sep='\t')
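A self-contained toy version shows the effect without Pandas; the file name and contents here are made up:

```python
import gzip
import io
import os
import tempfile

# Build a hypothetical gzipped log: a header, a UTF-8 row, a Latin-1 row
path = os.path.join(tempfile.mkdtemp(), "mixed.log.gz")
with gzip.open(path, "wb") as f:
    f.write("col_a\tcol_b\n".encode("utf-8"))
    f.write("ok\tcafé\n".encode("utf-8"))
    f.write("bad\tcafé\n".encode("latin-1"))

# Wrapping the binary stream lets us choose the decode error handling
with gzip.open(path, "rb") as gz:
    wrapper = io.TextIOWrapper(gz, encoding="utf-8", errors="backslashreplace")
    lines = wrapper.readlines()

print(lines[2])  # the Latin-1 é shows up as \xe9 instead of raising an error
```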
Figuring all that out took about two days.

Thursday, January 28, 2016

How to Add A Hive Step to a Running Cluster on EMR

Put the file on S3:

s3cmd put temp_1_load_logs_20160126.sql s3://my-bucket/

Add the step:

aws emr add-steps --cluster-id j-XXXXXXX --steps Type=Hive,Name="load logs",Args=[-f,s3://my-bucket/temp_1_load_logs_20160126.sql]

Wednesday, November 18, 2015

Recursively Find all the Files and Sizes of a Bucket on S3

Say I want to recurse through an S3 bucket, find all the file sizes, and sum them up. Easy:

s3cmd ls s3://your-s3-bucket/ --recursive | awk -F' ' '{s +=$3} END {print s}' 

The output of s3cmd ls looks like:

2015-11-15 12:22   4482528   s3://bucket/-4878692415071619643--6245724311294558574_479343588_data.0
2015-11-15 12:34  34398163   s3://bucket/-6827273792407145391--2667978502585357890_1957252193_data.0
2015-11-15 12:46   4558355   s3://bucket/2184012989583635362-3242759126742622102_1630577622_data.0
2015-11-15 12:59  13297607   s3://bucket/6147240539106964522-4824521201578762651_240049741_data.0

So you want to split on whitespace, take the size field (the 3rd field), and sum it across all the lines. That's what awk -F' ' '{s += $3}' does (-F' ' sets the field separator to a space, and $3 is the third field). The END {print s} block prints out the running sum once all lines have been processed.
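For clarity, here is the same summation done in Python on a couple of the sample lines above:

```python
# Two lines in the same format as the s3cmd ls output above
listing = """2015-11-15 12:22   4482528   s3://bucket/-4878692415071619643--6245724311294558574_479343588_data.0
2015-11-15 12:34  34398163   s3://bucket/-6827273792407145391--2667978502585357890_1957252193_data.0"""

# split() with no argument splits on runs of whitespace, like awk's default
total = sum(int(line.split()[2]) for line in listing.splitlines())
print(total)  # 38880691
```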

Monday, July 20, 2015

Turn a {key, value} Python Dictionary into a Pandas DataFrame

Quick solution to a problem I had today. I had a dictionary of {key: value} pairs that I wanted to turn into a DataFrame. My solution:

import pandas as pd
# .items() works in both Python 2 and 3 (dict.iteritems() is Python 2 only)
pd.DataFrame([[key, value] for key, value in python_dict.items()], columns=["key_col", "val_col"])

Tuesday, May 5, 2015

A Day in the Life of a Data Scientist (Part 1)

Here is a log of my day in all of its pain and glory. It's not necessarily typical in its length or futility. Then again, there are worse days.

8:30AM - Start Amazon EMR cluster in preparation for product beta test beginning next week. Eat breakfast while system is bootstrapping.

9 AM - Email. Reading JIRA cards. Reading Spark documentation.

10AM - Remember 10:30 AM meeting. Context switch.

10:20AM - Meeting canceled. Context switch. Start looking at running a Spark cluster on EC2.

10:30AM - Previously started cluster is operational now. Transfer files and begin the booting process. Process takes approximately 1.5 hrs to finish. After that the system should be monitor-only.

10:35 AM - Try various spark cluster configurations that don't work. AWS spot pricing is the worst.

11AM - Think, "if I was a real data scientist I'd probably be reading a paper right now." Don't read paper.

12PM - Witty repartee on Twitter.

12:15 PM - Go eat lunch. Sit on porch. Talk with my children and wife.

1:05 PM - Return. Try a different Spark cluster configuration. Monitor progress on the ML system started earlier.

1:10 PM - Think, "I need to appear smart". Read description of Medcouple algorithm.

1:20 PM - Spark cluster running. Try logging in. Try running a local IPython notebook to connect. More twittering.

1:50 PM - Cluster connection error. Apparently a known issue with PySpark and using a standalone cluster. Try to fix.

Install Anaconda on the cluster itself. Start a notebook server on the cluster and use this trick to forward the browser:

ssh -i ~/key.pem -L 8889:localhost:8888 root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

More configuration errors. Can't load data from S3.

2:40 PM - Still flailing.

2:50 PM - Hate spark. Hate life. Start EMR cluster.

3:00 PM - Coffee.

3:10 PM - 2nd cluster still not started.

3:25 PM - Bid price too low. Try different zone.

3:40 PM - No capacity. Try different machine.

4:00 PM - Answer data question on Slack.

4:10 PM - So. Much. "Provisioning".

4:14 PM - Write data queries hoping cluster will provision. Make some educated guesses as to which fields in the data will be useful.

4:40 PM - Still no cluster. Try one last configuration on EMR and hope it works.

4:50 PM - Switch to different task. Fix bug in bash script doing process auditing.

4:56 PM - NOW my cluster starts! Context switch again.

5:00 PM - Log into cluster. Start Hive query to batch 3 days of browser signature data.

5:01 PM - While MR is loading data onto the cluster, switch to previous data. Load into a Google Docs spreadsheet for visual poking.

5:02 PM - Query finished! Tables empty. Debugging ... oh, external table location was wrong. Fix that. Restart query.

5:09 PM - Google model drift in random forest, because, why not. Hole in the literature. Make mental note.

5:10 PM - Back to Python for parsing data.

5:40 PM - Hive query finishes.

5:50 PM - Fight with Hive syntax for extracting tuples from JSON strings.

6:00 PM - Deal with a resume that was emailed to me. Add to hiring pipeline.

6:05 PM - Finish query. Pull into Google docs for plotting.

6:27 PM - Success! Useful data. Now I need dinner. Shutting down the cluster (but I worked so hard for it!)

Conclusion - It seems we have some anomalous behavior with screen resolutions on our network. The top chart is the top 100 screen resolutions of OS X devices. The bottom chart is all the OS X screen resolutions in 3 days of data. Looks fishy.

The folks with non-standard Apple-device screen resolutions are likely candidates for investigation of fraud.

Saturday, April 18, 2015

Remote Work + Data Science

I've been working as a remote data scientist for nearly a year now. Our team (of two!) is fully distributed and we're in the process of adding another data scientist. Finding other remote data science jobs is pretty difficult so I decided to start another blog to champion the idea of remote data science and track jobs that fit that description. Please visit www.RemoteDataScience.com and let me know what you think!