8:30 AM - Start Amazon EMR cluster in preparation for product beta test beginning next week. Eat breakfast while the system is bootstrapping.
9:00 AM - Email. Read JIRA cards. Read Spark documentation.
10:00 AM - Remember 10:30 AM meeting. Context switch.
10:20 AM - Meeting canceled. Context switch. Start looking at running a Spark cluster on EC2.
10:30 AM - Previously started cluster is operational now. Transfer files and begin the booting process. Process takes approximately 1.5 hours to finish. After that the system should be monitor-only.
10:35 AM - Try various Spark cluster configurations that don't work. AWS spot pricing is the worst.
11:00 AM - Think, "if I were a real data scientist I'd probably be reading a paper right now." Don't read paper.
12:00 PM - Witty repartee on Twitter:
@benhamner @kaggle building a model and then putting it into production only to see it negatively influence hundreds of customers is worse.
— William Cox ن (@gallamine) May 4, 2015
12:15 PM - Go eat lunch. Sit on porch. Talk with my children and wife.
1:05 PM - Return. Try a different Spark cluster configuration. Monitor progress of the ML system started earlier.
1:10 PM - Think, "I need to appear smart". Read description of Medcouple algorithm.
1:20 PM - Spark cluster running. Try logging in. Try connecting from a local IPython notebook.
More twittering:
Same RT @gallamine: I don’t always use Spark, but when I do … I have to use @tdhopper’s slides to remember anything: http://t.co/RbQhr4BOVq
— Tim Hopper (@tdhopper) May 4, 2015
1:50 PM - Cluster connection error. Apparently a known issue with PySpark and using a standalone cluster. Try to fix.
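For the curious, the failing attempt looked roughly like the sketch below; the master URL and app name are placeholders, not the real cluster.

    # A minimal sketch of the attempt: point a local SparkContext at the
    # standalone master running on EC2. Master URL is a placeholder.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:7077")
            .setAppName("notebook-connection-test"))
    sc = SparkContext(conf=conf)

    # Smoke test -- this is where the connection error surfaces.
    print(sc.parallelize(range(1000)).count())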
Install Anaconda on the cluster itself. Start a notebook server on the cluster and use this trick to forward the notebook to my local browser:
ssh -i ~/key.pem -L 8889:localhost:8888 [email protected]
More configuration errors. Can't load data from S3.
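The failing read was roughly of this shape, continuing with the sc from the sketch above; bucket, path, and credentials are all placeholders.

    # Hand s3n credentials to the underlying Hadoop configuration, then
    # try reading the data as text. All names here are placeholders.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    raw = sc.textFile("s3n://some-bucket/browser-signatures/2015-05-04/")
    print(raw.take(5))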
2:40 PM - Still flailing.
2:50 PM - Hate Spark. Hate life. Start a second EMR cluster.
3:00 PM - Coffee.
3:10 PM - 2nd cluster still not started.
3:25 PM - Spot bid price too low. Try a different availability zone.
3:40 PM - No capacity. Try a different instance type.
4:00 PM - Answer data question on Slack.
4:10 PM - So. Much. "Provisioning".
4:14 PM - Write data queries, hoping the cluster will provision. Make some educated guesses as to which fields in the data will be useful.
4:40 PM - Still no cluster. Try one last configuration on EMR and hope it works.
4:50 PM - Switch to different task. Fix bug in bash script doing process auditing.
4:56 PM - NOW my cluster starts! Context switch again.
5:00 PM - Log into cluster. Start Hive query to batch 3 days of browser signature data.
5:01 PM - While MapReduce is loading data onto the cluster, switch to the previous data. Load it into a Google Docs spreadsheet for visual poking.
5:02 PM - Query finished! Tables empty. Debugging ... oh, external table location was wrong. Fix that. Restart query.
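(The fix itself is a one-liner, something like the sketch below; the table name and S3 path are invented for illustration.)

    import subprocess

    # Repoint the external table at the S3 prefix that actually holds
    # the data, then rerun the query. Names here are invented.
    fix = """
    ALTER TABLE browser_signatures
    SET LOCATION 's3n://some-bucket/browser-signatures/';
    """
    subprocess.check_call(["hive", "-e", fix])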
5:09 PM - Google "model drift in random forests", because, why not. Hole in the literature. Make mental note.
5:10 PM - Back to Python for parsing data.
5:40 PM - Hive query finishes.
5:50 PM - Fight with Hive syntax for extracting tuples from JSON strings.
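One pattern for this is json_tuple in a LATERAL VIEW. I can't promise this is verbatim what I ran, but it's the general shape; a sketch with invented table and field names, run through the Hive CLI:

    import subprocess

    # json_tuple pulls several fields out of a JSON string column in one
    # pass. Table and field names below are invented for illustration.
    query = """
    SELECT t.os, t.resolution, COUNT(*) AS n
    FROM browser_signatures b
    LATERAL VIEW json_tuple(b.raw_json, 'os', 'screen_resolution') t AS os, resolution
    GROUP BY t.os, t.resolution;
    """
    subprocess.check_call(["hive", "-e", query])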
6:00 PM - Deal with a resume that was emailed to me. Add to hiring pipeline.
6:05 PM - Finish query. Pull into Google docs for plotting.
6:27 PM - Success! Useful data. Now I need dinner. Shut down the cluster (but I worked so hard for it!).
Conclusion - It seems we have some anomalous behavior with screen resolutions on our network. The top chart shows the top 100 screen resolutions of OS X devices. The bottom chart shows all the OS X screen resolutions in 3 days of data. Looks fishy.