Thursday, June 8, 2017

CountMinSketch In Python

Thanks to my friend Chris, I've been pondering some of the work from Misha Bilenko at Microsoft. This led me down the path of investigating the CountMinSketch algorithm for tracking counts of entities in a stream. To help me learn, I wrote a Python implementation of CountMinSketch.

You can use it like so:

from countminsketch.countminsketch import CountMinSketch
d = 10   # depth: number of hash rows in the sketch
w = 100  # width: number of counters per row
cms = CountMinSketch(d=d, w=w)
print("Count of elements is:")
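
For readers who want to see the idea behind the data structure, here is a minimal sketch of a Count-Min Sketch. It is not the implementation from the package above: the class name TinyCountMinSketch, its add() and estimate() methods, and the salted SHA-256 hashing are all assumptions made for illustration. Only the d (rows) and w (counters per row) parameters mirror the snippet above.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch: d hash rows, each w counters wide."""

    def __init__(self, d, w):
        self.d = d
        self.w = w
        self.table = [[0] * w for _ in range(d)]

    def _columns(self, item):
        # One column index per row, derived from a row-salted hash of the item.
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._columns(item))


sketch = TinyCountMinSketch(d=10, w=100)
for word in ["apple", "banana", "apple", "cherry", "apple"]:
    sketch.add(word)

apple_estimate = sketch.estimate("apple")
print("Estimated count of 'apple':", apple_estimate)
```

The estimate can be higher than the true count when items collide in every row, but it is never lower. A larger w makes collisions rarer, and a larger d makes it less likely that every row collides at once.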

Tuesday, May 16, 2017

Reading Files with Encoding Errors Into Pandas

I found myself in a situation where I needed to read a file with mixed character encodings into Pandas. Pandas does not handle this situation: it requires a single fixed encoding and throws an error when it encounters a line it cannot decode. In practice, a mixed-encoding file is one where the correct way to interpret the bytes differs from line to line. In my case, most of the lines were UTF-8, while some used other encodings.

Character encoding is a particularly confusing problem (for me), so it took a while to figure out a workaround. I discovered that base Python provides several error-handling modes for decoding bytes into strings. The default, "strict" (which Pandas uses), raises a UnicodeDecodeError when it hits bytes it cannot decode. Other options include "ignore" and several varieties of replacement. For my case, I wanted to use the "backslashreplace" style, which converts bytes that are not valid UTF-8 into backslash-escaped sequences. For example, a stray Latin-1 byte 0xE9 ("é") would get turned into the literal text "\xe9" in my Python string. Python also allows you to register a custom error handler if you so desire. If you wanted to be really fancy, you could use a custom error handler to guess other encodings using ftfy or chardet.
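
To make the difference between the error handlers concrete, here is a small sketch that decodes the same bytes with each one. The sample bytes and the latin1_fallback handler are made up for illustration. A real fallback would more likely guess the encoding than assume Latin-1.

```python
import codecs

# One valid UTF-8 line followed by a line containing a stray Latin-1 byte (0xE9).
raw = "naïve\n".encode("utf-8") + b"caf\xe9 au lait\n"

try:
    raw.decode("utf-8")  # errors="strict" is the default
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

ignored = raw.decode("utf-8", errors="ignore")             # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")           # inserts U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")   # inserts the text \xe9


def latin1_fallback(error):
    """Hypothetical handler: reinterpret the undecodable bytes as Latin-1."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    # Return the replacement text and the position where decoding resumes.
    return bad_bytes.decode("latin-1"), error.end


codecs.register_error("latin1_fallback", latin1_fallback)
recovered = raw.decode("utf-8", errors="latin1_fallback")

print(strict_result)
print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
print(repr(recovered))
```

Unlike "ignore", "backslashreplace" keeps a visible record of every byte that failed to decode, which makes it easier to find and repair the bad lines later.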

Unfortunately, Pandas' read_csv() function doesn't support non-strict error handling, so I needed a way to decode the bytes before Pandas accessed them. My final solution was to wrap my file in an io.TextIOWrapper, which let me specify the error handling, and to pass the wrapper directly to Pandas' read_csv() function.

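import gzip
import io
import pandas as pd

# Open the gzipped log in binary mode so the wrapper below controls decoding.
gz = gzip.open('./logs/nasty.log.gz', 'rb')
# Decode as UTF-8, turning undecodable bytes into backslash escapes instead of raising.
decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace')
# read_csv accepts a file-like object, so Pandas only ever sees decoded text.
df = pd.read_csv(decoder_wrapper, sep='\t')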
Figuring all that out took about two days.
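
To try the wrapper approach without a real log file, here is a self-contained sketch that builds a tiny gzipped TSV in memory. The sample rows are invented. It also uses the standard library's csv module in place of Pandas so it runs anywhere, which is a substitution for illustration only; the same wrapper object is what gets passed to read_csv() above.

```python
import csv
import gzip
import io

# A small tab-separated payload: the first data row is valid UTF-8,
# the second contains a Latin-1 byte (0xE9) that is not valid UTF-8.
payload = b"user\tcity\nana\tS\xc3\xa3o Paulo\nbob\tMontr\xe9al\n"
compressed = io.BytesIO(gzip.compress(payload))

# gzip.open accepts an existing file object as well as a filename.
gz = gzip.open(compressed, "rb")

# Closing the wrapper also closes the underlying gzip stream.
with io.TextIOWrapper(
    gz, encoding="utf-8", errors="backslashreplace", newline=""
) as text:
    rows = list(csv.reader(text, delimiter="\t"))

for row in rows:
    print(row)
```

The valid UTF-8 row decodes normally, while the bad byte in the last row survives as the literal text \xe9 instead of aborting the whole read.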