Gallamine's Scientific Computing Blog
A blog about scientific computing with Python and Matlab. See the work of an engineer and data scientist in practice. By William Cox.

<b>Setting the Background Color of Matplotlib Images</b> (2021-02-01)

<p>I ran into an issue where viewing Matplotlib images in a dark-mode browser or editor wasn't showing the axes of the plots. To fix this you can run one of the following at the top of your notebook or script:</p>

<pre>
import matplotlib as mpl
mpl.rcParams['figure.facecolor'] = 'white'

# or

import matplotlib.pyplot as plt
plt.style.use({'figure.facecolor': 'white'})
</pre>

<p>Supposedly you can also put `figure.facecolor: white` in your matplotlibrc file, but I haven't gotten that to work yet.</p>

<p>Now the plots all look like this:</p>

<img alt="" height="238" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqs5dFVkssIPsI720EvK0dnPfnPILCP1rOHwmO8LFLKP3rL8r2CwAmB9Vd05qMf5TjV9SpOs9nF0QaxCU-WcoM69vlC53rgmth0hdGjRIyMXS9bGSo6A0YsSSouNzqYHzRi-BO6babHAU-/" />

<b>Git cleanup remotes</b> (2020-08-12)

<p>Running `<a href="https://git-scm.com/docs/git-fetch#Documentation/git-fetch.txt---prune">git fetch origin --prune</a>` will remove unused remote branches.
This happens frequently when using GitHub, since it will automatically delete a remote branch that has been merged.</p>

<b>Renaming a Column in Pandas to One That Already Exists Can Break Things</b> (2019-06-19)

Today I wrestled with an irritating issue where I had a perfectly fine DataFrame, renamed some columns, and suddenly the thing was just broken. It turns out that in Pandas (v. 0.24.1), when you rename a column to an already existing column, <b>it just breaks.</b> Try this example:
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: Menlo; font-size: 9pt;"><span style="color: #cc7832;">import </span>pandas <span style="color: #cc7832;">as </span>pd
df = pd.DataFrame({<span style="color: #a5c261;">"colA"</span>: [<span style="color: #6897bb;">1</span><span style="color: #cc7832;">,</span><span style="color: #6897bb;">2</span>]<span style="color: #cc7832;">, </span><span style="color: #a5c261;">"colB"</span>: [<span style="color: #6897bb;">3</span><span style="color: #cc7832;">,</span><span style="color: #6897bb;">4</span>]})
df = df.rename(<span style="color: #aa4926;">columns</span>={<span style="color: #a5c261;">"colA"</span>: <span style="color: #a5c261;">"colB"</span>})
df.colB.unique()</pre>
<br />
Instead of printing "[1,2]" as you'd expect, instead it throws an <span style="font-family: Courier New, Courier, monospace;">AttributeError: 'DataFrame' object has no attribute 'unique'</span>. Other than that, the dataframe appears to be fine. Calling <span style="font-family: Courier New, Courier, monospace;">df.columns</span> show that there are now two columns with identical names. When you try and access that column name Pandas returns <b>both</b> in a DataFrame, rather than a single Series object for the one column. Since a DataFrame object doesn't have the <span style="font-family: Courier New, Courier, monospace;">unique()</span> function, that's why we get the error above.William Coxhttp://www.blogger.com/profile/15211955821510632709noreply@blogger.com0tag:blogger.com,1999:blog-4412045055998742403.post-27723527725760595192019-04-10T09:20:00.000-07:002019-04-10T09:24:33.921-07:00Get index of Pandas Series row when column matches certain valueSay you have a Pandas DataFrame that looks like:<br />
<b>Get index of Pandas Series row when column matches certain value</b> (2019-04-10)

Say you have a Pandas DataFrame that looks like:
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})</span></blockquote>
<br />
If you do a GroupBy operation on a specific column of the DataFrame, Pandas returns a Series object, like:
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">df3.groupby(['X'])['Y'].sum()<br />X<br />A 4<br />B 6<br />Name: Y, dtype: int64</span></blockquote>
<br />
Now if we want to find out which groups had a specific aggregate value - say, which groups had a sum == 4 - we can do something like:
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">>>> df3.groupby(['X'])['Y'].sum().eq(4)<br />X<br />A True<br />B False<br />Name: Y, dtype: bool</span></blockquote>
<br />
<br />
Now the question is, how do we get the *index* name where the row equals 4? (In this example we want `A`, since its value is `True` in the Series.)
<br />
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">>>> groupings = df3.groupby(['X'])['Y'].sum().eq(4)<br />>>> groupings.index[groupings == True]<br />Index([u'A'], dtype='object', name=u'X')</span></blockquote>
<br />
<br />
PS. `groupings.index[groupings is True]` doesn't work, even though PEP8 checkers will warn you to switch to it: `groupings is True` is an identity check against the Series object itself, which is simply False - the groupings object isn't truthy that way. The syntax `groupings.index[groupings.eq(True)]` is an alternative.
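For reference, here's the whole dance as runnable code. A small sketch based on standard pandas boolean indexing: you can also mask the index with the boolean Series directly, no comparison needed.

<pre>
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
groupings = df3.groupby(['X'])['Y'].sum().eq(4)

# A boolean Series works directly as a mask on the index:
print(groupings.index[groupings])    # Index(['A'], dtype='object', name='X')

# Equivalent: filter the Series first, then take its index.
print(groupings[groupings].index)
</pre>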
<b>Python Doesn't Require Commas in Lists .. Sorta</b> (2019-04-09)

Today I helped a colleague with a subtle Python bug. We have a system that queries for data given a list of IDs. The list of IDs looked like this:
<blockquote class="tr_bq">
ids = [<br /> '7d38c515-d543-4186-a6a6-e46d4e356a81' # location 1<br /> 'f384fc68-3030-473f-95a8-52d5fee6cfd4' # location 2<br /> 'b27fef7f-9e5d-4af5-8596-a6949dd257a5' # location 3<br />]</blockquote>
<br />
It took us an unfortunate amount of time to realize we were missing commas in that list. Python will blissfully concatenate adjacent string literals inside a list for you.
<br />
<blockquote class="tr_bq">
bad_list = ['a' 'b' 'c']<br />bad_list[0] == 'abc'<br />True</blockquote>
<br />
This is because:
<blockquote class="tr_bq">
"a""b" == "ab"<br />True</blockquote>
I'm failing to come up with a helpful example of where this behavior is useful, though.
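For what it's worth, the usual justification I've seen for the feature is splitting one long string literal across short lines, something like this sketch:

<pre>
# Adjacent literals fuse at compile time, which keeps long strings
# inside line-length limits without '+' or join():
message = (
    "This request failed because the upstream service timed out. "
    "Retry with exponential backoff before paging anyone."
)
print(message)
</pre>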
<b>Running a Function on Dask Workers At Startup</b> (2019-03-22)

I made the mistake of thinking that <span style="font-family: Courier New, Courier, monospace;">client.run()</span> would run a function on each Dask worker <i>even if the worker hadn't started yet</i>. It won't: workers that come online later never run it. Instead you'll need to take advantage of the <span style="font-family: Courier New, Courier, monospace;">register_worker_callbacks()</span> function, which registers a function to run on every worker at startup. It looks like this:
<span style="font-family: Courier New, Courier, monospace;">cluster = LocalCluster()</span><br />
<span style="font-family: Courier New, Courier, monospace;">client = Client(cluster)</span><br />
<span style="font-family: Courier New, Courier, monospace;">client.register_worker_callbacks(setup=your_function_name)</span><br />
<br />
I found this by <a href="https://github.com/dask/distributed/pull/2201/files#diff-5e9e1d3b3446423c741a9a2d406703e3R1243">looking through the tests for this function's pull request</a>. It's otherwise undocumented.

<b>Salary Progression (in Tech)</b> (2019-03-15)

This was an <a href="https://georgestocker.com/2019/03/14/my-salary-progression-in-tech/">insightful article</a> on one person's salary progression in technology. It averages out to roughly an $11,000 increase per year over the 15-year career so far, which matches pretty well with other folks I've talked to. An <a href="https://news.ycombinator.com/item?id=19393688">accompanying Hacker News post</a> has more datapoints. Stock options / RSUs tend to skew things and lend a high degree of variability to compensation; at higher levels it isn't uncommon for a large portion of your total compensation to be in RSUs.
<br />
Of course, where you live causes wild variation in the value of your income. Additionally, if you're working for dope cash but miserable, it's hard to see that as a net positive.

<b>New Job For 2019</b> (2019-03-14)

I'm now working as a Senior Machine Learning Engineer for Grubhub on the order volume forecasting team. We own predictive time series models that produce forecasts the business uses to schedule drivers.
<br />
The job search took about 3 months and involved at least 5 outright rejections, a lot of non-answers, and 2 offers.

<b>CountMinSketch In Python</b> (2017-06-08)

Thanks to my friend <a href="https://twitter.com/cdubhland">Chris</a> I've been pondering some of the <a href="https://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft">work from Misha Bilenko</a> at Microsoft. This led me down the path of investigating the CountMinSketch algorithm for tracking counts of entities in a stream. To help me learn I wrote a <a href="https://github.com/gallamine/countminsketch">Python implementation of CountMinSketch</a>.
<br />
You can use it like so:
<br />
<span style="font-family: Courier New, Courier, monospace;">from countminsketch.countminsketch import CountMinSketch</span><div>
<span style="font-family: Courier New, Courier, monospace;">d = 10</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">w = 100</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">cms = CountMinSketch(d=10, w=100)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">cms.add('test_value')</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">print("Count of elements is:")</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">print(cms.query('test_value'))</span></div>
<b>Reading Files with Encoding Errors Into Pandas</b> (2017-05-16)

I found myself in a situation where I needed to read a file into Pandas that had mixed character encoding. Pandas does not handle this situation: it requires a fixed encoding and throws an error when it encounters a bad line. Practically, this means the way you interpret the file's bytes differs from line to line. In my case, <b>most</b> of the lines were UTF-8 while some were in other encodings.
<br />
Character encoding is a particularly confusing problem (for me) so it took a while to figure out a workaround. I discovered that base Python provides <a href="https://docs.python.org/3/library/stdtypes.html#bytes.decode">different error handling modes</a> when decoding bytes into strings. The default, "strict" (which Pandas uses), throws a <a href="https://docs.python.org/3/library/exceptions.html#UnicodeError">UnicodeError</a> when a bad line is found. Other options include "ignore" and different varieties of replacement. For my case, I wanted the "backslashreplace" style, which converts bytes that can't be decoded into their backslash-escaped equivalents. For example, a stray 0xe9 byte that isn't valid UTF-8 becomes the literal four-character sequence \xe9 in the resulting Python string. Python also allows you to <a href="https://docs.python.org/3/library/codecs.html#codecs.register_error">register a custom error handler if you so desire</a>. If you wanted to be really fancy, you could use a custom error handler to guess other encoding types using <a href="https://ftfy.readthedocs.io/en/latest/#a-note-on-encoding-detection">FTFY</a> or <a href="http://chardet.readthedocs.io/en/latest/usage.html">chardet</a>.
<br />
Unfortunately Pandas' read_csv() method doesn't support the non-strict error handling, so I needed a way to decode the bytes before Pandas accessed them. My final solution was to wrap my file in an <a href="https://docs.python.org/3/library/io.html#io.TextIOWrapper">io.TextIOWrapper</a>, which let me specify the error handling and pass the wrapper directly to Pandas' read_csv() method.
<br />
<b>Example</b>:<br />
<blockquote class="tr_bq">
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;"> import gzip</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">import io</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">import pandas pd</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;"><br /></span>
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">gz = gzip.open('./logs/nasty.log.gz', 'r')</span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace') </span><br />
<span style="background-color: #f3f3f3; font-family: Courier New, Courier, monospace; font-size: large;">df = pd.read_csv(decoder_wrapper, sep='\t')</span></blockquote>
Figuring all that out took about two days.

<b>How to Add A Hive Step to a Running Cluster on EMR</b> (2016-01-28)

Put the file on S3:
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">s3cmd put temp_1_load_logs_20160126.sql s3://my-bucket/</span><br />
<br />
Add the step:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">aws emr add-steps --cluster-id j-XXXXXXX --steps Type=Hive,Name="load logs",Args=[-f,s3://my-bucket/temp_1_load_logs_20160126.sql]</span>William Coxhttp://www.blogger.com/profile/15211955821510632709noreply@blogger.com0tag:blogger.com,1999:blog-4412045055998742403.post-77503699804505452382015-11-18T09:06:00.002-08:002015-11-18T09:06:25.120-08:00Recursively Find all the Files and Sizes of a Bucket on S3Say I want to recurse through a S3 bucket, find all the file sizes and sum them up? Easy:<br />
<span style="background-color: #2c67c8; color: white; font-family: Menlo; font-size: 18px;"><br /></span>
<span style="background-color: #2c67c8; color: white; font-family: Menlo; font-size: 18px;">s3cmd ls s3://your-s3-bucket/ --recursive | awk -F' ' '{s +=$3} END {print s}'</span><span style="background-color: #2c67c8; color: white; font-family: Menlo; font-size: 18px;"> </span><br />
The output of <span style="font-family: Courier New, Courier, monospace;">s3cmd ls</span> looks like:

<pre>
2015-11-15 12:22   4482528   s3://bucket/-4878692415071619643--6245724311294558574_479343588_data.0
2015-11-15 12:34  34398163   s3://bucket/-6827273792407145391--2667978502585357890_1957252193_data.0
2015-11-15 12:46   4558355   s3://bucket/2184012989583635362-3242759126742622102_1630577622_data.0
2015-11-15 12:59  13297607   s3://bucket/6147240539106964522-4824521201578762651_240049741_data.0
</pre>

So you want to split on whitespace, take the size (the third field), and sum the values. That's what awk -F' ' '{s += $3}' does (the -F ' ' sets the field separator to whitespace), and END {print s} prints the running sum once all lines have been read.
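If you'd rather stay in Python, a rough equivalent using boto3 looks like this (a sketch; it assumes boto3 is installed and AWS credentials are configured):

<pre>
import boto3

s3 = boto3.resource("s3")
total_bytes = sum(obj.size for obj in s3.Bucket("your-s3-bucket").objects.all())
print(total_bytes)
</pre>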
<b>Turn a {key, value} Python Dictionary into a Pandas DataFrame</b> (2015-07-20)

Quick solution to a problem I had today. I had a dictionary of {key: value} pairs that I wanted in a DataFrame. My solution:
<br />
<span style="font-family: Courier New, Courier, monospace;">import pandas as pd</span><br />
<span style="font-family: Courier New, Courier, monospace;">pd.DataFrame([[key,value] for key,value in python_dict.iteritems()],columns=["key_col","val_col"])</span>William Coxhttp://www.blogger.com/profile/15211955821510632709noreply@blogger.com2tag:blogger.com,1999:blog-4412045055998742403.post-59209134853556994852015-05-05T05:50:00.002-07:002015-05-05T05:50:28.960-07:00A Day in the Life of a Data Scientist (Part 1)<i>Here is a log of my day in all of it's pain and glory. It's not necessarily typical in its length or futility. Then again, there are worse days.</i><br />
<br />
<br />
8:30AM - Start Amazon EMR cluster in preparation for product beta test beginning next week. Eat breakfast while system is bootstrapping.<br />
<br />
9 AM - Email. Reading JIRA cards. Reading Spark documentation.<br />
<br />
10AM - Remember 10:30 AM meeting. Context switch.<br />
<br />
10:20AM - Meeting canceled. Context switch. Start looking at running a Spark cluster on EC2.<br />
<br />
10:30AM - Previously started cluster is operational now. Transfer files and begin the booting process. Process takes approximately 1.5 hrs to finish. After that the system should be monitor-only.<br />
<br />
10:35 AM - Try various spark cluster configurations that don't work. AWS spot pricing is the worst.<br />
<br />
11AM - Think, "if I was a real data scientist I'd probably be reading a paper right now." Don't read paper.<br />
<br />
12PM - Witty repartee on Twitter:<br />
<blockquote class="twitter-tweet" lang="en">
<div dir="ltr" lang="en">
<a href="https://twitter.com/benhamner">@benhamner</a> <a href="https://twitter.com/kaggle">@kaggle</a> building a model and then putting it into production only to see it negatively influence hundreds of customers is worse.</div>
— William Cox ن (@gallamine) <a href="https://twitter.com/gallamine/status/595262980725551104">May 4, 2015</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
<br />
12:15 PM - Go eat lunch. Sit on porch. Talk with my children and wife.<br />
<br />
1:05 PM - return. Try a different Spark cluster configuration. Monitor progress on the ML system started earlier.<br />
<br />
1:10 PM - Think, "I need to appear smart". Read description of <a href="https://en.wikipedia.org/wiki/Medcouple">Medcouple algorithm</a>.<br />
<br />
1:20 PM - Spark cluster running. Try logging in. Try running local IPython notebook to connect.<br />
More twittering:<br />
<blockquote class="twitter-tweet" data-partner="tweetdeck">
<div dir="ltr" lang="en">
Same RT <a href="https://twitter.com/gallamine">@gallamine</a>: I don’t always use Spark, but when I do … I have to use <a href="https://twitter.com/tdhopper">@tdhopper</a>’s slides to remember anything: <a href="http://t.co/RbQhr4BOVq">http://t.co/RbQhr4BOVq</a></div>
— Tim Hopper (@tdhopper) <a href="https://twitter.com/tdhopper/status/595278968514859008">May 4, 2015</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
<br />
1:50 PM<br />
Cluster connection error. Apparently a known issue with PySpark and using a standalone cluster. Try to fix.<br />
<br />
Install Anaconda on the cluster itself. Start a notebook server on the cluster and use this trick to forward the browser:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">ssh -i ~/key.pem -L 8889:localhost:8888 root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com</span><br />
<br />
More configuration errors. Can't load data from S3.<br />
<br />
2:40 PM - Still flailing.<br />
<br />
2:50 PM - Hate spark. Hate life. Start EMR cluster.<br />
<br />
3:00 PM - Coffee.<br />
<br />
3:10 PM - 2nd cluster still not started.<br />
<br />
3:25 PM - Bid price too low. Try different zone.<br />
<br />
3:40 PM - No capacity. Try different machine.<br />
<br />
4:00 PM - Answer data question on Slack.<br />
<br />
4:10 PM - So. Much. "Provisioning".<br />
<br />
4:14 PM - Write data queries hoping cluster will provision. Make some educated guesses as to which fields in the data will be useful.<br />
<br />
4:40 PM - Still no cluster. Try one last configuration on EMR and hope it works.<br />
<br />
4:50 PM - Switch to different task. Fix bug in bash script doing process auditing.<br />
<br />
4:56 PM - NOW my cluster starts! Context switch again.<br />
<br />
5:00 PM - Log into cluster. Start Hive query to batch 3 days of browser signature data.<br />
<br />
5:01 PM - While MR is loading data onto the cluster, switch to previous data. Load into a Google Docs spreadsheet for visual poking.<br />
<br />
5:02 PM - Query finished! Tables empty. Debugging ... oh, external table location was wrong. Fix that. Restart query.<br />
<br />
5:09 PM - Google model drift in random forest, because, why not. Hole in the literature. Make mental note.<br />
<br />
5:10 PM - Back to Python for parsing data.<br />
<br />
5:40 PM - Hive query finishes.<br />
<br />
5:50 PM - Fight with Hive syntax for extracting tuples from JSON strings.<br />
<br />
6:00 PM - Deal with a resume that was emailed to me. Add to hiring pipeline.<br />
<br />
6:05 PM - Finish query. Pull into Google docs for plotting.<br />
<br />
6:27 PM - Success! Useful data. Now I need dinner. Shutting down the cluster (but I worked so hard for it!)<br />
<br />
Conclusion - It seems we have some anomalous behavior with screen resolutions on our network. The first chart is the top 100 screen resolutions of OS X devices. The bottom chart is all the OS X screen resolutions in 3 days of data. Looks fishy.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin7N-RGCsqON8OcvS9iIn37DMIb7JMo3Al0AWR1BjepCrvl5oYqcfoITPveIvxOS4KPfZA2YK-Je22pJUCcF6I7ytVAQZZW-6vMSdhAjlXvvID6bWde2ZmcE6QdYyNv6pVvJ3f5Rhkqn6x/s1600/image-19.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin7N-RGCsqON8OcvS9iIn37DMIb7JMo3Al0AWR1BjepCrvl5oYqcfoITPveIvxOS4KPfZA2YK-Je22pJUCcF6I7ytVAQZZW-6vMSdhAjlXvvID6bWde2ZmcE6QdYyNv6pVvJ3f5Rhkqn6x/s1600/image-19.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyFDjsSPvHLlqNEcx8eT_0DhtMYcRWtz37rjXgu_LSovB0jF4MSR21VANak93F08FXm7obAZDHBMIyeAq2YJtA6qt5P9fS81ljp8PMdHMc1oRIhK51gzhEbUTeD9cbYcDsVObsREuI5AzM/s1600/image-20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyFDjsSPvHLlqNEcx8eT_0DhtMYcRWtz37rjXgu_LSovB0jF4MSR21VANak93F08FXm7obAZDHBMIyeAq2YJtA6qt5P9fS81ljp8PMdHMc1oRIhK51gzhEbUTeD9cbYcDsVObsREuI5AzM/s640/image-20.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
The folks with non-standard Apple-device screen resolutions are likely candidates for investigation of fraud.<br />
<br />
<br />
<br />
<b>Remote Work + Data Science</b> (2015-04-18)

I've been working as a remote data scientist for nearly a year now. Our team (of two!) is fully distributed and we're in the process of adding another data scientist. Finding other remote data science jobs is pretty difficult, so I decided to start another blog to champion the idea of remote data science and track jobs that fit that description. Please visit <a href="http://www.remotedatascience.com/">www.RemoteDataScience.com</a> and let me know what you think!

<b>Linux Date Injection into Hive</b> (2015-04-14)

This week I found myself needing to generate a table in Hive that used today's date in the output location. Basically I was running a daily report and wanted it to automatically send the output to the appropriate bucket on S3.<br />
<br />
To accomplish this, I used a combination of embedded Linux commands and Hive variables.<br />
<br />
First, in your Hive query, you need to turn on variable substitution:<br />
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;"><span style="color: #cc7832; font-weight: bold;">set </span>hive.variable.substitute=<span style="color: #cc7832; font-weight: bold;">true</span><span style="color: #cc7832;">;</span></pre>
<br />
Next, in your Hive query you can have an expression substituted for the variable value. For instance, you can create a table like this:
<br />
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;"><span style="color: #cc7832; font-weight: bold;">CREATE EXTERNAL TABLE </span>IF <span style="color: #cc7832; font-weight: bold;">NOT EXISTS </span>my_table
(
values <span style="color: #cc7832; font-weight: bold;">STRING</span><span style="color: #cc7832;">,</span><span style="color: #cc7832; font-weight: bold;">
</span>)
ROW FORMAT DELIMITED FIELDS TERMINATED <span style="color: #cc7832; font-weight: bold;">BY </span><span style="color: #a5c261; font-weight: bold;">'\t'</span></pre>
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;">LINES TERMINATED <span style="color: #cc7832; font-weight: bold;">BY </span><span style="color: #a5c261; font-weight: bold;">'\n'</span></pre>
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;">LOCATION <span style="color: #a5c261; font-weight: bold;">'s3://mybucket/${hiveconf:DATE_VARIABLE}'</span><span style="color: #cc7832;">;</span></pre>
<br />
The Hive syntax for a variable is <span style="font-family: Courier New, Courier, monospace;">${hiveconf:VARNAME}</span>. When calling Hive, you can supply a variable using the <span style="font-family: Courier New, Courier, monospace;">-hiveconf VARNAME=VALUE</span> syntax. For instance:
<br />
<pre style="background-color: #2b2b2b; color: #a9b7c6; font-family: 'Menlo'; font-size: 12pt;">hive -hiveconf DATE_VARIABLE=$(date +y=%Y/m=%m/d=%d/h=%H/) -f query.sql</pre>
<br />
Notice that the value of the variable in the above command is <span style="font-family: Courier New, Courier, monospace;">$(date +y=%Y/m=%m/d=%d/h=%H/)</span>. This is the syntax for telling Linux to execute the command inside the <span style="font-family: Courier New, Courier, monospace;">$( )</span> and return the value. You can also use backticks ( <span style="font-family: Courier New, Courier, monospace;">` `</span> ) instead of <span style="font-family: Courier New, Courier, monospace;">$( )</span>. Essentially the date command runs, returns a date string like <span style="font-family: Courier New, Courier, monospace;">y=2015/m=04/d=13/h=09/</span>, and assigns it to the Hive variable. That variable is then substituted into the Hive query to build a custom table location.
<br />
Super handy.

<b>Ham Technician's License</b> (2015-04-05)

After about 15 years of it being on my "to do" list, I finally took, and passed, the ham Technician's license exam. After 10 years of EE education it wasn't all that difficult. I did read the excellent <a href="http://www.kb6nu.com/study-guides/">guide from KB6NU</a> to get me up to speed on the regulation aspects and the "ham lingo" I didn't know. I'm not sure what I'll do with it, but it's nice to know I have more spectrum and transmit power accessible for when I figure it out!

<b>Query Data from Impala on Amazon EMR into Python, Pandas and IPython Notebook</b> (2014-10-02)

I've been envious of tools such as Hue that allow for an easy way to execute SQL-like queries on Hive or Impala and then immediately plot the results. Installing Hue on EMR has thus far thwarted me (if you know how, I'm all ears), so I needed a better way.
<br />
Hive is great for doing batch-mode processing of a lot of data, and for pulling data from S3 into the Hadoop HDFS. Impala then allows you to do fast(er) queries on that data. Both are 1-click installs using Amazon's EMR console (or command line).<br />
<br />
The difficulty now is that I'm writing queries at the command-line and don't have a particularly elegant way of plotting or poking at the results. Wouldn't it be great if I could get the results into an IPython Notebook and plot there? Two problems: 1) getting the results into Python and 2) getting access to a Notebook server that's running on the EMR cluster.<br />
<br />
I now have two solutions to these two problems:<br />
<br />
1)<a href="http://continuum.io/downloads#all"> Install Anaconda </a>on the EMR cluster:<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">remote$> wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.1.0-Linux-x86_64.sh</span><span style="font-family: 'Courier New', Courier, monospace;">remote</span><span style="font-family: Courier New, Courier, monospace;">$> bash Anaconda-2.1.0-Linux-x86_64.sh</span></blockquote>
<br />
Now log out of your SSH connection and reconnect using the command:<br />
<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">local$> ssh -i ~/yourkeyfile.pem -L 8889:localhost:8888 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com</span></blockquote>
<div class="p1">
<br /></div>
<div class="p1">
This starts a port forwarding SSH connection that connects <b>http://localhost:8889</b> on your local machine to <b>http://localhost:8888</b> on the remote machine (which is where the notebook will run). Now start the remote IPython Notebook server using the command</div>
<div class="p1">
<br /></div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">$> ipython notebook --browswer=none</span></blockquote>
<div class="p1">
<br /></div>
<div class="p1">
You should now be able to navigate to http://localhost:8889 on your local machine and see the notebook server running on your EMR machine! Ok, now what about getting the Impala data into the notebook?</div>
<br />
2) The <a href="https://github.com/cloudera/impyla">Impyla project</a> allows you to connect to a running Impala sever, make queries, and spit the output into a Pandas dataframe. You can install Impyla using the command<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace;">remote</span>$> pip install impyla</blockquote>
<br />
Inside your IPython Notebook you should be able to execute something like<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">from impala.util import as_pandas</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">from impala.dbapi import connect</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">conn = connect(host='localhost', port=21050)</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">cursor = conn.cursor()</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">cursor.execute('SELECT * FROM some_table')</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"></span><span style="font-family: Courier New, Courier, monospace;">df = as_pandas(cursor)</span></blockquote>
<br />
and now you have a magical dataframe with Overweight Data that you can plot and otherwise poke at.

<b>Use Boto to Start an Elastic Map-Reduce Cluster with Hive and Impala Installed</b> (2014-09-25)
I spent all of yesterday beating my head against the Boto documentation (or lack thereof). Boto is a popular (the?) tool for using Amazon Web Services (AWS) with Python. The parts of AWS that are used quite a bit have good documentation, while the rest suffer for explanation.
<br />
The task I wanted to accomplish was:<br />
<br />
<ol>
<li>Use Boto to start an elastic mapreduce cluster of machines.</li>
<li>Install Hive and Impala on the machines.</li>
<li>Use Spot instances for the core nodes.</li>
</ol>
<div>
Below is sample code to accomplish these tasks. I spent a great deal of time combing through the <a href="https://github.com/boto/boto/tree/master">source code for Boto</a>. You may need to do the same.</div>
<div>
<br /></div>
<div>
This code is for Boto 2.32.1:</div>
<div>
<br /></div>
<br />
<pre>
import boto.emr
from boto.emr.connection import EmrConnection
from boto.emr.step import InstallHiveStep
from boto.emr import BootstrapAction
from boto.emr.instance_group import InstanceGroup

# Assumes cluster_name, master_instance_type, slave_instance_type,
# num_instances and bidprice are defined elsewhere in your script.
conn = boto.emr.connect_to_region('us-east-1')
hive_step = InstallHiveStep()
bootstrap_impala = BootstrapAction(
    "impala",
    "s3://elasticmapreduce/libs/impala/setup-impala",
    ["--base-path", "s3://elasticmapreduce", "--impala-version", "latest"])
instance_groups = [
    InstanceGroup(1, "MASTER", master_instance_type, "ON_DEMAND", "mastername"),
    InstanceGroup(num_instances, "CORE", slave_instance_type, "SPOT",
                  "slavename", bidprice=bidprice)]
jobid = conn.run_jobflow(
    cluster_name,
    log_uri="s3n://log_bucket",
    ec2_keyname="YOUR EC2 KEYPAIR NAME",
    availability_zone="us-east-1e",
    instance_groups=instance_groups,
    num_instances=str(num_instances),
    keep_alive="True",
    enable_debugging="True",
    hadoop_version="2.4.0",
    ami_version="3.1.0",
    visible_to_all_users="True",
    steps=[hive_step],
    bootstrap_actions=[bootstrap_impala])
</pre>
<b>Use Asciinema to Record and Share Terminal Sessions</b> (2014-08-20)

I discovered a cool tool today by listening to <a href="http://jeroenjanssens.com/">Jeroen Janssens'</a> <a href="http://event.on24.com/eventRegistration/EventLobbyServlet?target=lobby.jsp&eventid=798721&sessionid=1&key=5BB1A35E851FFB763CBF3CA5423725C0&eventuserid=102528696">talk on Data Science at the Command Line</a>. The tool is <a href="https://asciinema.org/">Asciinema</a>, a terminal plugin that lets you record your terminal session, save it, and then share it - with copyable text! It supports both OS X and Linux.
<br />
Here's an example from their website:<br />
<br />
<script async="" id="asciicast-10214" src="https://asciinema.org/a/10214.js" type="text/javascript"></script>
Can't wait to use this for this blog!

<b>Automatically Partition Data in Hive using S3 Buckets</b> (2014-08-18)

Did you know that if you are processing data stored in S3 using Hive, you can have Hive automatically partition the data (a logical separation) by encoding the S3 path names as <span style="font-family: Courier New, Courier, monospace;">key=value</span> pairs? For instance, if you have time-based data and you store it in paths like this:
<br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">date=20140801</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">date=20140802</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">date=20140803</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">/root_path_to_buckets/</span><span style="font-family: Courier New, Courier, monospace;">...</span><br />
<br />
And you build a table in Hive, like<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">CREATE EXTERNAL TABLE time_data(</span><br />
<span style="font-family: Courier New, Courier, monospace;"> value STRING,</span><br />
<span style="font-family: Courier New, Courier, monospace;"> value2 INT,</span><br />
<span style="font-family: Courier New, Courier, monospace;"> value3 STRING,</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ...</span><br />
<span style="font-family: Courier New, Courier, monospace;">)</span><br />
<span style="font-family: Courier New, Courier, monospace;">PARTITIONED BY(date STRING)</span><br />
<span style="font-family: Courier New, Courier, monospace;">LOCATION s3n://root_path_to_buckets/</span><br />
<br />
Hive will automatically know that your data is logically separated by dates. Usually this requires you to refresh the partition list by calling the command:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">ALTER TABLE time_data RECOVER PARTITIONS;</span><br />
<br />
After that, you can check to see if the partitions have taken using the SHOW command:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">SHOW PARTITIONS time_data;</span><br />
<br />
Now when you run a SELECT command, Hive will only load the data needed. This saves a tremendous amount of downloading and processing time. Example:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">SELECT value, value2 FROM time_data WHERE date > "20140802"</span><br />
<br />
This will only load 1/3 of the data (since 20140801 and 20140802 are excluded).

<b>Buying a Car With Data Science!</b> (2014-07-31)

<i>Forget Big Data, here's how I used small data and stupid simple tools to make buying a car easier for me to stomach.</i>
<br />
<div>
When your family's size increases past the ⌊E[# of children]⌋ in America, you need a bigger vehicle. Given that we now have a family of 3 children, and that the 3rd child will need a carseat in a few months, it was time to buy a family van. <br />
<br />
I'm a big fan of <a href="http://www.ynab.com/">staying</a> <a href="http://www.daveramsey.com/">out</a> <a href="http://www.clarkhoward.com/">of</a> <a href="https://www.biblegateway.com/passage/?search=Proverbs+22%3A7">debt</a> and so I had a fixed budget with which to acquire a vehicle - a used vehicle. A great place to start looking is with the <a href="http://www.autotempest.com/">AutoTempest</a> search engine, which aggregates data from lots of different sites. The problem is that it's difficult to know how much you should pay for any given vehicle. If you're buying new you can check something like <a href="http://www.truecar.com/">TrueCar</a> and there's resources like Kelly Blue Book, NADA and Edmunds but from past used buying experience those services tended to underestimate the actual costs with most vehicles I've bought, and while some folks love to negotiate, I find it difficult without any "ground truth" to base my asking price off of.<br />
<br />
<br />
I toyed around with the idea of doing a full-fledged scraping of the data, but it just wasn't worth the time since I was under a deadline. Instead I took the path of least resistance and asked my wife to pull together basic info on 20 vehicles matching our criteria - year, mileage and asking price. Together we stuck it into a Google Document and plotted the results:<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhouRrlVF5d0Fk18-nCcKjDiDl9EgFtVSEH_Dt8If3PQfjIwxzr9bfO8uqK0TsBr9xcNEM34gX1NwH6ewAN-suf1ZJ-AYgNmooV1YPfhHwV3bbLy-p3AHdgjthajw3OP6YaTGnKYwydxwrr/s1600/image.png" /><br />
<br />
<br />
To my surprise and delight, there seemed to be two distinct collections of data - an upper and a lower priced section. Since Google Docs doesn't provide any easy way to put in a regression line, I moved over to Excel and added those in:<br />
<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3Bn9DLsQIUtTr7rANVGkHahDsfw9wdd5ONc5mC1qN5HSNa1HHKMicuQR9RKHaY46ubrDg2px8ytvL4CvQiv0NDgDBs5krBNAJIJZ6mexOuycdgSJBCpI5QoysIAe7FDLNRaV-R3zNNSuQ/s1600/car_regression.png"><img border="0" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3Bn9DLsQIUtTr7rANVGkHahDsfw9wdd5ONc5mC1qN5HSNa1HHKMicuQR9RKHaY46ubrDg2px8ytvL4CvQiv0NDgDBs5krBNAJIJZ6mexOuycdgSJBCpI5QoysIAe7FDLNRaV-R3zNNSuQ/s1600/car_regression.png" width="640" /></a><br />
<br />
<br />
The data points highlighted in green were ones that I was considering. Suddenly I had isolated two vehicles that were "overpriced" and ripe for easy negotiation. I also chose the highest-price, lowest-mileage vehicle and asked for a significantly lower price on a whim.</div>
<div>
<br /></div>
<div>
The additional data I'd gathered gave me confidence when negotiating. I contacted 2 dealerships with my data and the price I wanted to pay (slightly under the lower price regression). Ultimately the 1st person I'd talked to accepted my offer, which I knew was good from my data, and I didn't have to worry about whether or not I should keep negotiating.</div>
<div>
<br /></div>
<div>
What my data DIDN'T do for me:</div>
<div>
<ul>
<li>The data didn't impress anyone</li>
<li>The data didn't magically make people accept my offers</li>
<li>The data didn't make buying a car easy</li>
</ul>
<div>
What my data DID do for me:</div>
</div>
<div>
<ul>
<li>Took away the awful feeling of not knowing.</li>
<li>Gave me confidence when negotiating offers.</li>
<li>Let me quickly see which vehicles I should pursue and which to not focus on.</li>
</ul>
<div>
Now I'm driving a baller ... van. Cool indeed.</div>
</div>
<b>Python Pandas Group by Column A and Sum Contents of Column B</b> (2014-07-31)

Here's something that I can never remember how to do in Pandas: group by 1 column (e.g. Account ID) and sum another column (e.g. purchase price). So, here's an example for our reference:
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">data.groupby(by=['account_ID'])['purchases'].sum()</span></blockquote>
<br />
Simple, but not worth re-Googling each time!

<b>OSCON Wednesday Recap</b> (2014-07-24)

Ended up in Paul Frankwick's talk on "Build Your Own Exobrain" a bit late, but it worked out well. "Think of it as IFTTT, but free, open source, and you keep control of your privacy." Had a great conversation with Paul afterwards regarding stochastic time and mood tracking. Hopefully we'll get some stochastic triggers added to Exobrain, along with Python support, soon.<br /><br />Next I listened to Adam Gibson give a <a href="http://www.oscon.com/oscon2014/public/schedule/detail/33709">talk on his deep learning library, DeepLearning4j</a>. He's clearly a bright guy and is very passionate about his project. I'd previously watched him give a similar talk at the <a href="http://www.slideshare.net/jpatanooga/hadoop-summit-2014-san-jose-introduction-to-deep-learning-on-hadoop">Hadoop Summit along with Josh Patterson</a>. I spent some time talking with him after the session and trading machine learning stories. He nearly inspired me to learn Scala and start hacking on deeplearning4j - it sounds like a fabulous platform with all the possible moving pieces you could want for building a deep learning pipeline. <div>
<br /></div>
<div>
Afterwards I went to <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34414">Harrison Mebane's talk on spotting Caltrain cars using a Raspberry Pi outfitted with microphones and a camera</a>. It looked like a neat project incorporating data, sensors and hardware.<br /><br />Next, on a whim, I went to <a href="http://www.oscon.com/oscon2014/public/schedule/detail/33485">Tobias Zander's talk on web security</a>. I know very little about security so I was fascinated by all of the interesting ways to compromise a system he showcased. He showed how clever hackers can learn all sorts of information in non-obvious ways. He also royally trolled the audience by using a Facebook compromise to gather people's Facebook profile pictures after they visited his website during the talk. </div>
<div>
<br /></div>
<div>
Finally, I went to a lovely talk by <a href="http://www.oscon.com/oscon2014/public/schedule/detail/33997">Tim Bell on CERN's IT infrastructure and how they went agile</a>. It was a fascinating talk that dove into the complexities of such a massive system. The difficulties - political, scientific and technological - are enormous. When the video is posted it's well worth your time to go and watch.</div>
<b>OSCON Tuesday Recap</b> (2014-07-23)

Excellent set of keynotes. I especially enjoyed the one from <a href="http://www.planet.com/">Planet Labs</a> - inspiring work to photograph the entire globe, every day.
<br />
Next was a talk on <a href="http://www.oscon.com/oscon2014/public/schedule/detail/35390">building an open Thermostat</a> from Zach Supalla at Spark.io, the makers of an Internet connected microcontroller and cloud infrastructure. Zach says building hardware is hard, but it's easy to get noticed - if you build anything remotely cool you'll be on Engadget with no problem.<br />
<br />
A talk on <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34192">Functional Thinking by Rob Ford</a>, who is a great speaker, was informative but wasn't exactly something that I can apply to my work right now. I at least caught up on some of the nomenclature and can use it as a jumping-off point for future learning. Apparently all major languages are adding functional programming support these days (Python?).<br />
<br />
Ruth Suehle gave a <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34018">tremendously fun talk on Raspberry Pi hacks</a> - it also turns out she lives in my city and knows a bunch of the people I do. Go figure! She inspired me to go buy a Pi and do something other than a) leave it in a box or b) put XBMC on it. I'm thinking a weather station would be a fun project to build.<br />
<br />
Tim Berglund gave a (packed!) talk on <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34756">"Graph Theory you Need to Know"</a>. Tim is a good speaker, but the talk struggled a bit with needing to pack in lots of definitions (not Tim's fault). I never knew how easy it is to count the N-length paths between nodes from the adjacency matrix - just raise the matrix to the Nth power! Also neat to see a quick example of going from the graph to a Markov chain with probabilities.<br />
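A quick numpy illustration of that trick (my example, not Tim's):

<pre>
import numpy as np

# Adjacency matrix of a 3-node graph: node 0 connects to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

# Entry (i, j) of A**n counts the walks of length n from i to j.
print(np.linalg.matrix_power(A, 2))
</pre>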
<br />Ethan Dereszynski and Eric Butler from Webtrends showed off their (beautiful!) <a href="http://www.oscon.com/oscon2014/public/schedule/detail/34809">realtime system for observing and predicting user behavior on a website</a>. It uses Kafka/Storm to train and classify user behavior using an HMM - the dashboard can show you, in real time, individual users on your site and the probability that they'll take some action. You can then serve them ads or coupons based on how likely they are to buy/leave/etc. I want to talk to these guys more, because I'm trying to solve a similar problem at <a href="http://www.distilnetworks.com/">Distil</a>.<div>
<br /></div>
<div>
Finally, my<a href="http://www.oscon.com/oscon2014/public/schedule/detail/34164"> talk on the Fourier Transform, FFT, and How to Use It </a>went smashingly well. I hit perfect timing, saw lots of mesmerized faces and had plenty of questions afterwards. The slides are up and the code will be uploaded soon. </div>