Wednesday, April 10, 2019

Get index of Pandas Series row when column matches certain value

Say you have a Pandas DataFrame that looks like:

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

If you do a GroupBy aggregation on a specific column of the DataFrame, such as df3.groupby(['X'])['Y'].sum(), Pandas returns a Series object like:

A    4
B    6
Name: Y, dtype: int64

Now if we want to find out which groups had a specific aggregate value, say which groups had a sum == 4, we can do something like:

>>> df3.groupby(['X'])['Y'].sum().eq(4)
A     True
B    False
Name: Y, dtype: bool

Now the question is, how do we get the *index* name where the row equals 4? (In this example we want `A` since its value is `True` in the Series.)

>>> groupings = df3.groupby(['X'])['Y'].sum().eq(4)
>>> groupings.index[groupings == True]
Index([u'A'], dtype='object', name=u'X')

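PS. groupings.index[groupings is True] doesn't work, even though PEP8 checkers will warn you to switch to it. The `is` operator tests whether the groupings object itself is the True singleton, so it evaluates to a single False rather than an element-wise mask. The syntax groupings.index[groupings.eq(True)] is an alternative that keeps the checkers quiet, and because groupings is already a boolean Series, groupings.index[groupings] works too.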
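Here is a minimal sketch that pulls the steps above into one script. The variable names (sums, mask, matching, and so on) are my own for illustration, and it assumes a reasonably recent Pandas on Python 3.

```python
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

# Sum Y within each group of X; the result is a Series indexed by X.
sums = df3.groupby('X')['Y'].sum()

# Element-wise comparison gives a boolean Series with the same index.
mask = sums.eq(4)

# Two equivalent ways to recover the index labels where the mask is True.
matching = sums.index[mask].tolist()
also_matching = mask[mask].index.tolist()

# An identity check compares the Series object itself to True,
# so it yields one plain bool instead of a mask.
identity_check = mask is True
```

Both matching and also_matching hold the group labels whose sum is 4, while identity_check is a single bool, which is why it can't be used as a row filter.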

Tuesday, April 9, 2019

Python Doesn't Require Commas in Lists .. Sorta

Today I helped a colleague with a subtle Python bug. We have a system that queries for data given a list of IDs. The list of IDs looked like this:

ids = [
    '7d38c515-d543-4186-a6a6-e46d4e356a81' # location 1
    'f384fc68-3030-473f-95a8-52d5fee6cfd4' # location 2
    'b27fef7f-9e5d-4af5-8596-a6949dd257a5' # location 3
]

It took us an unfortunate amount of time to realize we were missing commas in that list. Python will blissfully concatenate adjacent string literals for you, inside of a list or anywhere else.

bad_list = ['a' 'b' 'c']
bad_list[0] == 'abc'

This is because adjacent string literals are joined into a single string:
"a" "b" == "ab"
I'm failing to come up with a helpful example of where this behavior is useful though.
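A quick way to catch this kind of bug is to check the length of the list. The sketch below uses short made-up IDs in place of the real UUIDs.

```python
# Missing commas: the three literals are joined into one string.
ids_missing_commas = [
    'id-1'  # location 1
    'id-2'  # location 2
    'id-3'  # location 3
]

# With commas: three separate elements, as intended.
ids_with_commas = [
    'id-1',  # location 1
    'id-2',  # location 2
    'id-3',  # location 3
]

missing_count = len(ids_missing_commas)
correct_count = len(ids_with_commas)
```

The comments between the literals don't prevent the concatenation, which is part of what made the original list so easy to misread.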

Friday, March 22, 2019

Running a Function on Dask Workers At Startup

I made the mistake of thinking that client.run() would run a function on every Dask worker, including workers that hadn't started yet. It only runs on the workers that are connected at the time of the call, so the function never runs on workers that come online later. Instead you'll need to use the client's register_worker_callbacks() method, which registers a function to run on all of the workers at startup. It looks like this:

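from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
client.register_worker_callbacks(setup=worker_setup)  # worker_setup: placeholder for your own function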

I found this by looking through the tests for this function's pull request. It's otherwise undocumented.

Friday, March 15, 2019

Salary Progression (in Tech)

This was an insightful article on one person's salary progression in technology. It averages out to something like an $11,000 increase per year over the 15-year career so far. This matches pretty well with other folks I've talked to. An accompanying Hacker News post has more data points. Stock options / RSUs tend to skew things and lend a high degree of variability to compensation. At higher levels it isn't uncommon for a large portion of your total compensation to be in RSUs.

Of course, where you live causes a wild variation in the value of your income. Additionally, if you're working for dope cash but are miserable, it's hard to see that as a net positive.

Thursday, March 14, 2019

New Job For 2019

I'm now working as a Senior Machine Learning Engineer for Grubhub on the order volume forecasting team. We own predictive time series models that produce forecasts the business uses to schedule drivers.

The job search took about 3 months and involved at least 5 outright rejections, a lot of non-answers, and 2 offers.

Thursday, June 8, 2017

CountMinSketch In Python

Thanks to my friend Chris I've been pondering some of the work from Misha Bilenko at Microsoft. This led me down the path of investigating the CountMinSketch algorithm for tracking counts of entities in a stream. To help me learn I wrote a Python implementation of CountMinSketch.

You can use it like so:

from countminsketch.countminsketch import CountMinSketch
d = 10
w = 100
cms = CountMinSketch(d=d, w=w)
print("Count of elements is:")
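For readers who haven't seen the algorithm, here is a minimal, self-contained sketch of the general Count-Min Sketch idea. It is not the implementation from the package above: the class name TinyCountMinSketch, the add() and query() methods, and the use of salted SHA-256 hashes are all assumptions made for illustration.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch; not the API of the package above."""

    def __init__(self, d, w):
        self.d = d  # number of rows, one hash function per row
        self.w = w  # number of counters in each row
        self.table = [[0] * w for _ in range(d)]

    def _cells(self, item):
        # Derive one column per row by salting the hash with the row number.
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.w

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = TinyCountMinSketch(d=10, w=100)
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)

estimate_a = sketch.query("a")
estimate_unseen = sketch.query("never-added")
```

The estimate for an item is always at least its true count. Making w larger reduces collisions within a row, and making d larger reduces the chance that every row collides at once.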

Tuesday, May 16, 2017

Reading Files with Encoding Errors Into Pandas

I found myself in a situation where I needed to read a file with mixed character encodings into Pandas. Pandas does not handle this situation; it requires a single fixed encoding and throws an error when it encounters a bad line. Practically, mixed encoding means the way you have to interpret the file's bytes differs from line to line. In my case, most of the lines were UTF-8 while some were in other encodings.

Character encoding is a particularly confusing problem (for me), so it took a while to figure out a workaround. I discovered that base Python provides several error handling modes when decoding bytes into strings. The default, "strict" (which Pandas uses), raises a UnicodeDecodeError when it hits a byte it can't decode. Other options include "ignore" and different varieties of replacement. For my case, I wanted to use the "backslashreplace" style, which converts bytes that aren't valid UTF-8 into backslash-escaped hex sequences. For example, a stray 0x83 byte would get turned into the literal text "\x83" in my Python string. Python also allows you to register a custom error handler if you so desire. If you wanted to be really fancy, you could use a custom error handler to guess other encoding types using FTFY or chardet.
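To make the difference between the error handlers concrete, here is a small sketch using only the standard library. The choice of Shift-JIS as the "other" encoding is just an assumption for the demo; my actual file had a mix of encodings I never fully identified.

```python
# Encode some katakana as Shift-JIS so the bytes are not valid UTF-8.
raw = "グケゲ".encode("shift_jis")

# "strict" is the default and raises on the first undecodable byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops the undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

# "replace" swaps each undecodable byte for U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# "backslashreplace" keeps the byte values as literal \xNN text.
escaped = raw.decode("utf-8", errors="backslashreplace")
```

Only backslashreplace preserves enough information to recover the original bytes later, which is why I preferred it over ignore and replace.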

Unfortunately, Pandas' read_csv() function doesn't support non-strict error handling, so I needed a way to decode the bytes before Pandas accessed them. My final solution was to wrap my file in an io.TextIOWrapper, which allowed me to specify the error handling, and to pass the wrapper directly to Pandas' read_csv() function.

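import gzip
import io
import pandas as pd

# Open the compressed log as raw bytes; decoding happens in the wrapper below.
gz = gzip.open('./logs/nasty.log.gz', 'rb')
# Decode as UTF-8, escaping any undecodable bytes instead of raising.
decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace')
df = pd.read_csv(decoder_wrapper, sep='\t')

Figuring all that out took about two days.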
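If you want to try the wrapper trick without a log file on disk, the sketch below swaps the gzip file for an in-memory io.BytesIO buffer. The column names and the sample rows are made up for the demo.

```python
import io

import pandas as pd

# Two tab-separated rows; the second contains a byte (0x83) that is not valid UTF-8.
raw = io.BytesIO(b"id\tname\n1\tok\n2\tbad\x83O\n")

# Decode on the fly, escaping undecodable bytes instead of raising.
wrapper = io.TextIOWrapper(raw, encoding="utf-8", errors="backslashreplace")

df = pd.read_csv(wrapper, sep="\t")
bad_name = df["name"].iloc[1]
```

Both rows load, and the bad byte survives in bad_name as escaped text that you can search for or clean up afterwards.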