
Wednesday, November 18, 2015

Recursively Find all the Files and Sizes of a Bucket on S3

Say you want to recurse through an S3 bucket, find all the file sizes, and sum them up. Easy:

s3cmd ls s3://your-s3-bucket/ --recursive | awk -F' ' '{s +=$3} END {print s}' 


The output of s3cmd ls looks like:

2015-11-15 12:22   4482528   s3://bucket/-4878692415071619643--6245724311294558574_479343588_data.0
2015-11-15 12:34  34398163   s3://bucket/-6827273792407145391--2667978502585357890_1957252193_data.0
2015-11-15 12:46   4558355   s3://bucket/2184012989583635362-3242759126742622102_1630577622_data.0
2015-11-15 12:59  13297607   s3://bucket/6147240539106964522-4824521201578762651_240049741_data.0


So you want to split on whitespace, take the size field (the 3rd field), and add it to a running sum. That's what the '{s += $3}' block does (the -F' ' splits on whitespace), and END {print s} prints out the sum at the end.
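If you'd rather do the same thing in Python, here's a minimal sketch using boto3 (assuming boto3 is installed and your AWS credentials are configured; "your-s3-bucket" is a placeholder bucket name):

import boto3

# Sum the sizes (in bytes) of every object in the bucket.
s3 = boto3.client("s3")
total = 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="your-s3-bucket"):
    for obj in page.get("Contents", []):
        total += obj["Size"]
print(total)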

Monday, July 20, 2015

Turn a {key, value} Python Dictionary into a Pandas DataFrame

Quick solution to a problem I had today. I had a dictionary of {key: value} pairs that I wanted to turn into a DataFrame. My solution:

import pandas as pd
pd.DataFrame([[key,value] for key,value in python_dict.iteritems()],columns=["key_col","val_col"])
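Note that iteritems() is Python 2 only; a minimal Python 3 variant of the same idea (with an example dictionary) would be:

import pandas as pd

python_dict = {"a": 1, "b": 2}  # example data
df = pd.DataFrame(list(python_dict.items()), columns=["key_col", "val_col"])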

Tuesday, May 5, 2015

A Day in the Life of a Data Scientist (Part 1)

Here is a log of my day in all of its pain and glory. It's not necessarily typical in its length or futility. Then again, there are worse days.


8:30AM - Start Amazon EMR cluster in preparation for product beta test beginning next week. Eat breakfast while system is bootstrapping.

9 AM - Email. Reading JIRA cards. Reading Spark documentation.

10AM - Remember 10:30 AM meeting. Context switch.

10:20AM - Meeting canceled. Context switch. Start looking at running a Spark cluster on EC2.

10:30AM - Previously started cluster is operational now. Transfer files and begin the booting process. Process takes approximately 1.5 hrs to finish. After that the system should be monitor-only.

10:35 AM - Try various spark cluster configurations that don't work. AWS spot pricing is the worst.

11AM - Think, "if I was a real data scientist I'd probably be reading a paper right now." Don't read paper.

12PM - Witty repartee on Twitter.

12:15 PM - Go eat lunch. Sit on porch. Talk with my children and wife.

1:05 PM - Return. Try a different Spark cluster configuration. Monitor progress on the ML system started earlier.

1:10 PM - Think, "I need to appear smart". Read description of Medcouple algorithm.

1:20 PM - Spark cluster running. Try logging in. Try running local IPython notebook to connect. More twittering.

1:50 PM - Cluster connection error. Apparently a known issue with PySpark and standalone clusters. Try to fix.

Install Anaconda on the cluster itself. Start a notebook server on the cluster and use this trick to forward it to my local browser:

ssh -i ~/key.pem -L 8889:localhost:8888 [email protected]
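With that tunnel in place, the notebook server listening on port 8888 on the cluster should be reachable locally at http://localhost:8889.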

More configuration errors. Can't load data from S3.

2:40 PM - Still flailing.

2:50 PM - Hate spark. Hate life. Start EMR cluster.

3:00 PM - Coffee.

3:10 PM - 2nd cluster still not started.

3:25 PM - Bid price too low. Try different zone.

3:40 PM - No capacity. Try different machine.

4:00 PM - Answer data question on Slack.

4:10 PM - So. Much. "Provisioning".

4:14 PM - Write data queries hoping cluster will provision. Make some educated guesses as to which fields in the data will be useful.

4:40 PM - Still no cluster. Try one last configuration on EMR and hope it works.

4:50 PM - Switch to different task. Fix bug in bash script doing process auditing.

4:56 PM - NOW my cluster starts! Context switch again.

5:00 PM - Log into cluster. Start Hive query to batch 3 days of browser signature data.

5:01 PM - While MR is loading data onto the cluster, switch to previous data. Load into a Google Docs spreadsheet for visual poking.

5:02 PM - Query finished! Tables empty. Debugging ... oh, external table location was wrong. Fix that. Restart query.

5:09 PM - Google model drift in random forest, because, why not. Hole in the literature. Make mental note.

5:10 PM - Back to Python for parsing data.

5:40 PM - Hive query finishes.

5:50 PM - Fight with Hive syntax for extracting tuples from JSON strings.

6:00 PM - Deal with a resume that was emailed to me. Add to hiring pipeline.

6:05 PM - Finish query. Pull into Google docs for plotting.

6:27 PM - Success! Useful data. Now I need dinner. Shutting down the cluster (but I worked so hard for it!)

Conclusion - It seems we have some anomalous behavior with screen resolutions on our network. The first chart is the top 100 screen resolutions of OS X devices. The bottom chart is all the OS X screen resolutions in 3 days of data. Looks fishy.



The folks with non-standard Apple-device screen resolutions are likely candidates for investigation of fraud.




Saturday, April 18, 2015

Remote Work + Data Science

I've been working as a remote data scientist for nearly a year now. Our team (of two!) is fully distributed and we're in the process of adding another data scientist. Finding other remote data science jobs is pretty difficult so I decided to start another blog to champion the idea of remote data science and track jobs that fit that description. Please visit www.RemoteDataScience.com and let me know what you think!

Tuesday, April 14, 2015

Linux Date Injection into Hive

This week I found myself needing to generate a table in Hive that used today's date in the output location. Basically I was running a daily report and wanted it to automatically send the output to the appropriate location on S3.

To accomplish this, I used a combination of embedded Linux commands and Hive variables.

First, in your Hive query, you need to turn on variable substitution:

set hive.variable.substitute=true;

Next, in your Hive query you can have the variable's value substituted wherever you reference it. For instance, you can create a table like this:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table
(
    values STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/${hiveconf:DATE_VARIABLE}';

The Hive syntax for a variable is ${hiveconf:VARNAME}. When calling Hive, you can give it a value by using the -hiveconf VARNAME=VALUE syntax. For instance:

hive -hiveconf DATE_VARIABLE=$(date +y=%Y/m=%m/d=%d/h=%H/) -f query.sql

Notice that the value of the variable in the above command is $(date +y=%Y/m=%m/d=%d/h=%H/). This is the syntax for telling the shell to execute the command inside the $( ) and return the value. You can also use backticks (` `) instead of $( ). Essentially the date command will run, returning a date string like y=2015/m=04/d=13/h=09/, and that gets assigned to the Hive variable. That variable will then be substituted into the Hive query to build a custom table location.
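If you prefer to drive this from a script instead of the shell, a rough Python sketch of the same idea (assuming the hive CLI is on your PATH, and reusing the query.sql and DATE_VARIABLE names from above) might look like:

import subprocess
from datetime import datetime

# Build the same partition-style date string, e.g. y=2015/m=04/d=13/h=09/
date_variable = datetime.now().strftime("y=%Y/m=%m/d=%d/h=%H/")

# Equivalent to: hive -hiveconf DATE_VARIABLE=<date string> -f query.sql
subprocess.call(["hive", "-hiveconf", "DATE_VARIABLE=" + date_variable, "-f", "query.sql"])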

Super handy.

Sunday, April 5, 2015

Ham Technician's License

After about 15 years of it being on my "to do" list, I finally took, and passed, the ham Technician's license exam. After 10 years of EE education it wasn't all that difficult. I did read the excellent guide from KB6NU (http://www.kb6nu.com/study-guides/) to get me up to speed on the regulation aspects and the "ham lingo" I didn't know.
I'm not sure what I'll do with it, but it's nice to know I have more spectrum and transmit power accessible for when I figure it out!