Wednesday, August 12, 2020

Git cleanup remotes

Running `git fetch origin --prune` removes local remote-tracking branches whose counterparts no longer exist on the remote. These pile up quickly when using GitHub, since it can automatically delete a branch once its pull request has been merged.

Wednesday, June 19, 2019

Renaming a Column in Pandas to One That Already Exists Can Break Things

Today I wrestled with an irritating issue where I had a perfectly fine DataFrame, renamed some columns, and suddenly the thing was just broken. It turns out that in Pandas (v. 0.24.1), renaming a column to a name that already exists leaves you with two columns sharing that name, and things break from there. Try this example:

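import pandas as pd

df = pd.DataFrame({"colA": [1, 2], "colB": [3, 4]})
df = df.rename(columns={"colA": "colB"})  # now there are two "colB" columns
print(df["colB"].unique())  # the kind of call that triggers the error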


Instead of printing "[1,2]" as you'd expect, it throws an AttributeError: 'DataFrame' object has no attribute 'unique'. Other than that, the DataFrame appears to be fine. Calling df.columns shows that there are now two columns with identical names. When you access that column name, Pandas returns both in a DataFrame rather than a single Series for the one column. A DataFrame doesn't have a unique() method, which is why we get the error above.
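
One way to guard against this is to check for repeated column labels before relying on a column being a Series. The snippet below is only a sketch of that idea, not what I did at the time; the variable names are made up for illustration, and dropping the later duplicate is just one possible fix.

```python
import pandas as pd

df = pd.DataFrame({"colA": [1, 2], "colB": [3, 4]})
renamed = df.rename(columns={"colA": "colB"})

# Selecting a duplicated label returns a DataFrame, not a Series.
selection = renamed["colB"]
is_frame = isinstance(selection, pd.DataFrame)

# Index.duplicated() flags every repeat of a label after its first occurrence.
has_duplicates = bool(renamed.columns.duplicated().any())

# One possible fix: keep only the first column for each label.
deduped = renamed.loc[:, ~renamed.columns.duplicated()]
values = deduped["colB"].unique()
```

After the cleanup, deduped["colB"] is a Series again, so unique() works. Whether keeping the first column is the right choice depends on which data you actually wanted.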

Wednesday, April 10, 2019

Get index of Pandas Series row when column matches certain value

Say you have a Pandas DataFrame that looks like:

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

If you group by one column and aggregate another (here, df3.groupby(['X'])['Y'].sum()), Pandas returns a Series indexed by the group keys, like:

A    4
B    6
Name: Y, dtype: int64

Now if we want to find out which groups had a specific aggregate value, say which groups had a sum == 4, we can do something like:

>>> df3.groupby(['X'])['Y'].sum().eq(4)
A     True
B    False
Name: Y, dtype: bool

Now the question is, how do we get the *index* label of the row that equals 4? (In this example we want `A`, since its value is `True` in the Series.)

>>> groupings = df3.groupby(['X'])['Y'].sum().eq(4)
>>> groupings.index[groupings == True]
Index([u'A'], dtype='object', name=u'X')

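PS. groupings.index[groupings is True] doesn't work, even though PEP8 checkers will warn you to switch to it. `is` checks the identity of the whole Series object, so `groupings is True` evaluates to a single False rather than an element-wise mask. The syntax groupings.index[groupings.eq(True)] is an alternative, and since groupings is already boolean, groupings.index[groupings] works too.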
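
To make the difference concrete, here is a small Python 3 sketch using the same df3. The names identity_check and matching_labels are just for illustration.

```python
import pandas as pd

df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
groupings = df3.groupby(["X"])["Y"].sum().eq(4)

# `is` compares object identity, so this is one plain bool, not a mask.
identity_check = groupings is True

# The Series is already boolean, so it can be used as the mask directly.
matching_labels = groupings.index[groupings].tolist()
```

Here identity_check is a single False, while matching_labels holds the group labels whose sum is 4.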

Tuesday, April 9, 2019

Python Doesn't Require Commas in Lists .. Sorta

Today I helped a colleague with a subtle Python bug. We have a system that queries for data given a list of IDs. The list of IDs looked like this:

ids = [
    '7d38c515-d543-4186-a6a6-e46d4e356a81' # location 1
    'f384fc68-3030-473f-95a8-52d5fee6cfd4' # location 2
    'b27fef7f-9e5d-4af5-8596-a6949dd257a5' # location 3
]

It took us an unfortunate amount of time to realize we were missing commas in that list. Python will blissfully concatenate adjacent string literals for you, including inside a list.

bad_list = ['a' 'b' 'c']
bad_list[0] == 'abc'

This is because adjacent string literals are concatenated:
"a""b" == "ab"
I'm failing to come up with a helpful example of where this behavior is useful though.
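
For what it's worth, the one place I have seen this behavior used on purpose is splitting a long string literal across several lines. The sketch below shows that next to the accidental case; the IDs and the message are placeholders, not the real values from our system.

```python
# Missing commas: the two literals are merged into a single element.
ids_missing_commas = [
    'id-1'  # location 1
    'id-2'  # location 2
]

# With commas (a trailing comma is fine), each ID is its own element.
ids_with_commas = [
    'id-1',  # location 1
    'id-2',  # location 2
]

# Intentional use: wrap a long literal across lines inside parentheses.
long_message = (
    "No data was returned for the requested locations; "
    "check that every ID is separated by a comma."
)
```

The first list ends up with one element, 'id-1id-2', and the second with two. Keeping a trailing comma after every element makes the mistake harder to introduce when lines get added or reordered.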

Friday, March 22, 2019

Running a Function on Dask Workers At Startup

I made the mistake of thinking that client.run() would run a function on every Dask worker, including workers that hadn't started yet. It only runs on the workers that are connected when you call it, so the function won't run on workers that come online later. Instead, you'll need to use the client's register_worker_callbacks() method, which registers a function to run on every worker at startup. It looks like this:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

I found this by looking through the tests for this function's pull request. It's otherwise undocumented.
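
Here is a fuller sketch of how the registration might look, assuming a dask.distributed version where Client.register_worker_callbacks(setup=...) is available. The setup function, the WORKER_READY variable, and the helper names are all hypothetical stand-ins for whatever your workers actually need at startup.

```python
import os

from dask.distributed import Client, LocalCluster


def mark_worker_ready():
    # Hypothetical setup step; replace with whatever each worker needs.
    os.environ["WORKER_READY"] = "1"


def worker_is_ready():
    # Small helper so client.run() can report whether setup happened.
    return os.environ.get("WORKER_READY") == "1"


def start_client():
    # processes=False keeps the worker in this process to keep the sketch small.
    cluster = LocalCluster(n_workers=1, processes=False, dashboard_address=None)
    client = Client(cluster)
    # Runs on the workers connected now and on any worker that joins later.
    client.register_worker_callbacks(setup=mark_worker_ready)
    return cluster, client
```

Calling start_client() and then client.run(worker_is_ready) should report True for each worker. client.run() is still useful for one-off checks like this; it just isn't the right tool for startup work.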

Friday, March 15, 2019

Salary Progression (in Tech)

This was an insightful article on one person's salary progression in technology. It averages out to an increase of something like $11,000 per year over the 15-year career so far. This matches pretty well with other folks I've talked to. An accompanying Hacker News post has more data points. Stock options / RSUs tend to skew things and lend a high degree of variability to compensation. At higher levels it isn't uncommon for a large portion of your total compensation to be in RSUs.

Of course, where you live causes a wild variation in the value of your income. Additionally, if you're working for dope cash but miserable it's hard to see that as a net positive.

Thursday, March 14, 2019

New Job For 2019

I'm now working as a Senior Machine Learning Engineer for Grubhub on the order volume forecasting team. We own predictive time series models that produce forecasts the business uses to schedule drivers.

The job search took about 3 months and involved at least 5 outright rejections, a lot of non-answers, and 2 offers.