
## Wednesday, November 18, 2015

### Recursively Find all the Files and Sizes of a Bucket on S3

Say I want to recurse through an S3 bucket, find all the file sizes, and sum them up. Easy:

```shell
s3cmd ls s3://your-s3-bucket/ --recursive | awk -F' ' '{s += $3} END {print s}'
```

The output of `s3cmd ls` looks like:

```
2015-11-15 12:22   4482528  s3://bucket/-4878692415071619643--6245724311294558574_479343588_data.0
2015-11-15 12:34  34398163  s3://bucket/-6827273792407145391--2667978502585357890_1957252193_data.0
2015-11-15 12:46   4558355  s3://bucket/2184012989583635362-3242759126742622102_1630577622_data.0
2015-11-15 12:59  13297607  s3://bucket/6147240539106964522-4824521201578762651_240049741_data.0
```

So you want to split each line on whitespace, take the size field (the 3rd field), and sum the values. That's what `awk -F' ' '{s += $3}'` does (the `-F' '` sets the field separator, and `$3` is the size field). The `END {print s}` prints out the running sum at the end.

## Monday, July 20, 2015

### Turn a {key, value} Python Dictionary into a Pandas DataFrame

Quick solution to a problem I had today: I had a dictionary of {key: value} pairs that I wanted to turn into a DataFrame. My solution:

```python
import pandas as pd

pd.DataFrame([[key, value] for key, value in python_dict.iteritems()],
             columns=["key_col", "val_col"])
```
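Under Python 3, where `iteritems()` no longer exists, a minimal sketch of the same idea uses `items()` instead; the dictionary here is hypothetical sample data:

```python
import pandas as pd

python_dict = {"a": 1, "b": 2, "c": 3}  # hypothetical sample data

# items() replaces Python 2's iteritems(); list() materializes the view
# into the list of (key, value) pairs the DataFrame constructor expects.
df = pd.DataFrame(list(python_dict.items()), columns=["key_col", "val_col"])
print(df)
```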

## Tuesday, May 5, 2015

### A Day in the Life of a Data Scientist (Part 1)

Here is a log of my day in all of its pain and glory. It's not necessarily typical in its length or futility. Then again, there are worse days.

8:30AM - Start Amazon EMR cluster in preparation for product beta test beginning next week. Eat breakfast while system is bootstrapping.

10AM - Remember 10:30 AM meeting. Context switch.

10:20AM - Meeting canceled. Context switch. Start looking at running a Spark cluster on EC2.

10:30AM - Previously started cluster is operational now. Transfer files and begin the booting process. Process takes approximately 1.5 hrs to finish. After that the system should be monitor-only.

10:35 AM - Try various Spark cluster configurations that don't work. AWS spot pricing is the worst.

11AM - Think, "if I was a real data scientist I'd probably be reading a paper right now." Don't read paper.

12PM - Witty repartee on Twitter.

12:15 PM - Go eat lunch. Sit on porch. Talk with my children and wife.

1:05 PM - Return. Try a different Spark cluster configuration. Monitor progress on the ML system started earlier.

1:10 PM - Think, "I need to appear smart". Read description of Medcouple algorithm.

1:20 PM - Spark cluster running. Try logging in. Try running local IPython notebook to connect.

1:50 PM - Cluster connection error. Apparently a known issue with PySpark and a standalone cluster. Try to fix.

Install Anaconda on the cluster itself. Start a notebook server on the cluster and use this trick to forward to the local browser:

```shell
ssh -i ~/key.pem -L 8889:localhost:8888 [email protected]
```

More configuration errors. Can't load data from S3.

2:40 PM - Still flailing.

2:50 PM - Hate Spark. Hate life. Start EMR cluster.

3:00 PM - Coffee.

3:10 PM - 2nd cluster still not started.

3:25 PM - Bid price too low. Try different zone.

3:40 PM - No capacity. Try different machine.

4:00 PM - Answer data question on Slack.

4:10 PM - So. Much. "Provisioning".

4:14 PM - Write data queries hoping cluster will provision. Make some educated guesses as to which fields in the data will be useful.

4:40 PM - Still no cluster. Try one last configuration on EMR and hope it works.

4:50 PM - Switch to different task. Fix bug in bash script doing process auditing.

4:56 PM - NOW my cluster starts! Context switch again.

5:00 PM - Log into cluster. Start Hive query to batch 3 days of browser signature data.

5:02 PM - Query finished! Tables empty. Debugging ... oh, external table location was wrong. Fix that. Restart query.

5:09 PM - Google model drift in random forest, because, why not. Hole in the literature. Make mental note.

5:10 PM - Back to Python for parsing data.

5:40 PM - Hive query finishes.

5:50 PM - Fight with Hive syntax for extracting tuples from JSON strings.

6:00 PM - Deal with a resume that was emailed to me. Add to hiring pipeline.

6:05 PM - Finish query. Pull into Google docs for plotting.

6:27 PM - Success! Useful data. Now I need dinner. Shutting down the cluster (but I worked so hard for it!)

Conclusion - It seems we have some anomalous behavior with screen resolutions on our network. The first chart is the top 100 screen resolutions of OS X devices. The second chart is all the OS X screen resolutions in 3 days of data. Looks fishy.

The folks with non-standard Apple-device screen resolutions are likely candidates for investigation of fraud.

## Saturday, April 18, 2015

### Remote Work + Data Science

I've been working as a remote data scientist for nearly a year now. Our team (of two!) is fully distributed and we're in the process of adding another data scientist. Finding other remote data science jobs is pretty difficult so I decided to start another blog to champion the idea of remote data science and track jobs that fit that description. Please visit www.RemoteDataScience.com and let me know what you think!

## Tuesday, April 14, 2015

### Linux Date Injection into Hive

This week I found myself needing to generate a table in Hive that used today's date in the output location. Basically I was running a daily report and wanted it to automatically send the output to the appropriate bucket on S3.

To accomplish this, I used a combination of embedded Linux commands and Hive variables.

First, in your Hive query, you need to turn on variable substitution:

```sql
set hive.variable.substitute=true;
```

Next, in your Hive query you can have an expression substituted for the variable value. For instance, you can create a table like this:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_table
(
  values STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/${hiveconf:DATE_VARIABLE}';
```

The Hive syntax for a variable is `${hiveconf:VARNAME}`. When calling Hive, you can give it a variable using the `-hiveconf VARNAME=VALUE` syntax. For instance:

```shell
hive -hiveconf DATE_VARIABLE=$(date +y=%Y/m=%m/d=%d/h=%H/) -f query.sql
```

Notice that the value of the variable in the above command is `$(date +y=%Y/m=%m/d=%d/h=%H/)`. This is the syntax for telling Linux to execute the command inside the `$( )` and return its output. You can also use backticks (`` ` ``) instead of `$( )`. Essentially the date command will run, returning a date string like `y=2015/m=04/d=13/h=14/`, and assign that to the Hive variable. That variable will then be substituted in the Hive query and build a custom table location.
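The same path format can be previewed with a quick Python sketch; a fixed, hypothetical timestamp stands in for the current time that `date` would use:

```python
from datetime import datetime

# Hypothetical fixed timestamp standing in for the current time.
ts = datetime(2015, 4, 13, 14, 0)

# Mirrors the shell format string: date +y=%Y/m=%m/d=%d/h=%H/
path = ts.strftime("y=%Y/m=%m/d=%d/h=%H/")
print(path)  # → y=2015/m=04/d=13/h=14/
```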

Super handy.