-->

Wednesday, June 19, 2019

Renaming a Column in Pandas to One That Already Exists Can Break Things

Today I wrestled with an irritating issue: I had a perfectly fine DataFrame, renamed some columns, and suddenly the thing was broken. It turns out that in Pandas (v. 0.24.1), renaming a column to a name that another column already has goes through without complaint, and things break later. Try this example:

import pandas as pd

df = pd.DataFrame({"colA": [1,2], "colB": [3,4]})
df = df.rename(columns={"colA": "colB"})

df.colB.unique()

You might expect this to print "[1,2]", but instead it throws AttributeError: 'DataFrame' object has no attribute 'unique'. Otherwise the DataFrame appears to be fine. Calling df.columns shows that there are now two columns with identical names. When you access that column name, Pandas returns both columns in a DataFrame rather than a single Series for the one column. A DataFrame doesn't have a unique() method, which is why we get the error above.
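
A minimal sketch of how the duplicate could be spotted before it causes trouble, assuming a reasonably recent Pandas. The variable names (renamed, selection, has_duplicates) are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"colA": [1, 2], "colB": [3, 4]})
renamed = df.rename(columns={"colA": "colB"})

# Both columns now share the label "colB".
print(list(renamed.columns))

# Selecting the duplicated label returns a DataFrame, not a Series.
selection = renamed["colB"]
print(type(selection).__name__)

# Index.duplicated() flags repeated labels, so the problem can be caught early.
has_duplicates = bool(renamed.columns.duplicated().any())
print(has_duplicates)

# Positional access still reaches a single column as a Series.
first_col_values = renamed.iloc[:, 0].unique().tolist()
print(first_col_values)
```

Checking columns.duplicated() right after a rename is one cheap way to fail fast; whether that belongs in your pipeline is a judgment call.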

Wednesday, April 10, 2019

Get index of Pandas Series row when column matches certain value

Say you have a Pandas DataFrame that looks like:

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

If you group by one column, select another, and aggregate it, Pandas returns a Series indexed by the group labels. For example:

>>> df3.groupby(['X'])['Y'].sum()
X
A    4
B    6
Name: Y, dtype: int64

Now if we want to find out which groups had a specific aggregate value, say which groups had a sum == 4, we can do something like:

>>> df3.groupby(['X'])['Y'].sum().eq(4)
X
A     True
B    False
Name: Y, dtype: bool


Now the question is: how do we get the *index* label where the sum equals 4? (In this example we want `A`, since its value is `True` in the Series.)

>>> groupings = df3.groupby(['X'])['Y'].sum().eq(4)
>>> groupings.index[groupings == True]
Index([u'A'], dtype='object', name=u'X')


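PS. groupings.index[groupings is True] doesn't work, even though PEP8 checkers will warn you to switch to it. `is` tests object identity, and the groupings Series is not the True singleton, so the expression is just a plain False rather than an element-wise mask. The syntax groupings.index[groupings.eq(True)] is an alternative, and since groupings is already a boolean Series, groupings.index[groupings] works too.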
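
Putting the pieces together, here is a small sketch of the lookup on Python 3. The names sums, identity_check, matching and matching_alt are made up for this example:

```python
import pandas as pd

df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
sums = df3.groupby("X")["Y"].sum()

# Boolean Series: True for groups whose sum equals 4.
groupings = sums.eq(4)

# `groupings is True` is an identity test and yields a plain False.
identity_check = groupings is True

# The boolean Series itself can serve as the mask on its own index.
matching = groupings.index[groupings].tolist()

# Equivalent: filter the sums directly, then read the index.
matching_alt = sums[sums == 4].index.tolist()

print(identity_check, matching, matching_alt)
```

Filtering the sums first and then reading .index skips the intermediate boolean Series, which may read more naturally.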

Tuesday, April 9, 2019

Python Doesn't Require Commas in Lists .. Sorta

Today I helped a colleague with a subtle Python bug. We have a system that queries for data given a list of IDs. The list of IDs looked like this:

ids = [
    '7d38c515-d543-4186-a6a6-e46d4e356a81' # location 1
    'f384fc68-3030-473f-95a8-52d5fee6cfd4' # location 2
    'b27fef7f-9e5d-4af5-8596-a6949dd257a5' # location 3
]

It took us an unfortunate amount of time to realize we were missing commas in that list. Python will blissfully concatenate adjacent string literals for you, inside a list or anywhere else.

bad_list = ['a' 'b' 'c']
bad_list[0] == 'abc'
True

This is because adjacent string literals are joined into a single string:

"a""b" == "ab"
True

I'm failing to come up with a helpful example of where this behavior is useful though.
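
One place this does seem to get used is splitting a long string literal across several lines. The sketch below shows that next to the missing-comma failure mode; the message text and the "id-N" values are placeholders, not the real IDs from our system:

```python
# Adjacent string literals are joined by the compiler, which allows a long
# string to be broken across lines without `+` or backslashes.
message = (
    "Query failed: no rows were returned for the requested IDs. "
    "Check that every ID is separated by a comma."
)

# The same rule silently merges list items when the commas are missing.
missing_commas = [
    "id-1"  # location 1
    "id-2"  # location 2
    "id-3"  # location 3
]
with_commas = [
    "id-1",  # location 1
    "id-2",  # location 2
    "id-3",  # location 3
]

print(len(missing_commas), missing_commas[0])
print(len(with_commas))
```

A quick len() check against the number of IDs you expect would have saved us some time here.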

Friday, March 22, 2019

Running a Function on Dask Workers At Startup

I made the mistake of thinking that client.run() would run a function on every Dask worker, including workers that hadn't started yet. It only runs on the workers that are connected at the time of the call, so workers that come online later never run the function. Instead you'll need the client's register_worker_callbacks() method, which registers a function to run on all of the workers at startup. It looks like this:

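from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
# your_function_name is a placeholder for your own setup function
client.register_worker_callbacks(setup=your_function_name)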

I found this by looking through the tests for this function's pull request. It's otherwise undocumented.

Friday, March 15, 2019

Salary Progression (in Tech)

This was an insightful article on one person's salary progression in technology. It averages out to an increase of something like $11,000 per year over a 15-year career so far. This matches pretty well with other folks I've talked to. An accompanying Hacker News post has more data points. Stock options / RSUs tend to skew things and lend a high degree of variability to compensation. At higher levels it isn't uncommon for a large portion of your total compensation to be in RSUs.

Of course, where you live causes a wild variation in the value of your income. Additionally, if you're working for dope cash but miserable it's hard to see that as a net positive.

Thursday, March 14, 2019

New Job For 2019

I'm now working as a Senior Machine Learning Engineer for Grubhub on the order volume forecasting team. We own predictive time series models that produce forecasts the business uses to schedule drivers.

The job search took about 3 months and involved at least 5 outright rejections, a lot of non-answers, and 2 offers.

Thursday, June 8, 2017

CountMinSketch In Python

Thanks to my friend Chris I've been pondering some of the work from Misha Bilenko at Microsoft. This led me down the path of investigating the CountMinSketch algorithm for tracking counts of entities in a stream. To help me learn I wrote a Python implementation of CountMinSketch.

You can use it like so:

from countminsketch.countminsketch import CountMinSketch

d = 10   # depth
w = 100  # width
cms = CountMinSketch(d=d, w=w)
cms.add('test_value')
print("Count of elements is:")
print(cms.query('test_value'))
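
For anyone curious what is going on underneath, here is a rough standard-library sketch of the general count-min idea. It is not the code from my package: TinyCountMinSketch, the salted SHA-256 hashing, and the variable names are all assumptions made for illustration.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative count-min sketch: d rows of w counters each."""

    def __init__(self, d, w):
        self.d = d
        self.w = w
        self.table = [[0] * w for _ in range(d)]

    def _columns(self, item):
        # One column per row, derived by salting the hash with the row number.
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.w

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions can only inflate a counter, so the smallest one is the
        # tightest estimate available.
        return min(self.table[row][col] for row, col in self._columns(item))


sketch = TinyCountMinSketch(d=10, w=100)
for _ in range(3):
    sketch.add("test_value")
estimate = sketch.query("test_value")
print(estimate)
```

Estimates from a structure like this can overcount when items collide, but they never undercount, which is why query() takes the minimum across rows.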