Tuesday, May 5, 2015

A Day in the Life of a Data Scientist (Part 1)

Here is a log of my day in all of it's pain and glory. It's not necessarily typical in its length or futility. Then again, there are worse days.

8:30AM - Start Amazon EMR cluster in preparation for product beta test beginning next week. Eat breakfast while system is bootstrapping.

9 AM - Email. Reading JIRA cards. Reading Spark documentation.

10AM - Remember 10:30 AM meeting. Context switch.

10:20AM - Meeting canceled. Context switch. Start looking at running a Spark cluster on EC2.

10:30AM - Previously started cluster is operational now. Transfer files and begin the booting process. Process takes approximately 1.5 hrs to finish. After that the system should be monitor-only.

10:35 AM - Try various spark cluster configurations that don't work. AWS spot pricing is the worst.

11AM - Think, "if I was a real data scientist I'd probably be reading a paper right now." Don't read paper.

12PM - Witty repartee on Twitter:

12:15 PM - Go eat lunch. Sit on porch. Talk with my children and wife.

1:05 PM - return. Try a different Spark cluster configuration. Monitor system progresss on ML system started earlier.

1:10 PM - Think, "I need to appear smart". Read description of Medcouple algorithm.

1:20 PM - Spark cluster running. Try logging in. Try running local IPython notebook to connect.
More twittering:

1:50 PM
Cluster connection error. Apparently a known issue with PySpark and using a standalone cluster. Try to fix.

Install Anaconda on cluster itself. Start notebook server on cluster and use this trick to forward browswer:

ssh -i ~/key.pem -L 8889:localhost:8888 root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

More configuration errors. Can't load data from S3.

2:40 PM - Still flailing.

2:50 PM - Hate spark. Hate life. Start EMR cluster.

3:00 PM - Coffee.

3:10 PM - 2nd cluster still not started.

3:25 PM - Bid price too low. Try different zone.

3:40 PM - No capacity. Try different machine.

4:00 PM - Answer data question on Slack.

4:10 PM - So. Much. "Provisioning".

4:14 PM - Write data queries hoping cluster will provision. Make some educated guesses as to which fields in the data will be useful.

4:40 PM - Still no cluster. Try one last configuration on EMR and hope it works.

4:50 PM - Switch to different task. Fix bug in bash script doing process auditing.

4:56 PM - NOW my cluster starts! Context switch again.

5:00 PM - Log into cluster. Start Hive query to batch 3 days of browser signature data.

5:01 PM - While MR is loading data onto the cluster, switch to previous data. Load into a Google Docs spreadsheet for visual poking.

5:02 PM - Query finished! Tables empty. Debugging ... oh, external table location was wrong. Fix that. Restart query.

5:09 PM - Google model drift in random forest, because, why not. Hole in the literature. Make mental note.

5:10 PM - Back to Python for parsing data.

5:40 PM - Hive query finishes.

5:50 PM - Fight with Hive syntax for extracting tuples from JSON strings.

6:00 PM - Deal with a resume that was emailed to me. Add to hiring pipeline.

6:05 PM - Finish query. Pull into Google docs for plotting.

6:27 PM - Success! Useful data. Now I need dinner. Shutting down the cluster (but I worked so hard for it!)

Conclusion - It seems we have some anomalous behavior with screen resolutions on our network. The first chart is the top 100 screen resolutions of OS X devices. The bottom chart is all the OS X screen resolutions in 3 days of data. Look fishy.

The folks with non-standard Apple-device screen resolutions are likely candidates for investigation of fraud.