Tuesday, May 16, 2017

Reading Files with Encoding Errors Into Pandas

I found myself in a situation where I needed to read a file into Pandas that had mixed character encoding. Pandas does not handle this situation and instead requires a fixed encoding and throws an error when encountering a bad line. Practically this means if you have a file containing bytes the way you interpret those bytes differs from line to line. In my case, most of the lines were utf-8 while some were of other varieties of encodings.

Character encoding is a particularly confusing problem (for me) so it took a while to figure out a workaround to the issue. I discovered that base Python provides different error handling when decoding bytes into Strings. The default, "strict" (which Pandas uses) throws UnicodeError when a bad line is found. Other options include "ignore" and different varieties of replacement. For my case, I wanted to us the "backslashreplace" style, which converts non-UTF-8 characters into their backslash escaped byte sequences. For example, the Unicode characters "グケゲ" would get turned into "\x30b0\x30b1\30b2" in my Python string. Python also allows you to register a custom error handler if you so desire. If you wanted to be really fancy, you could use a custom error handler to guess other encoding types using FTFY or chardet.

Unfortunately Pandas read_csv() method doesn't support using the non-strict error handling, so I needed a way to decode the bytes before Pandas accessed them. My final solution was to wrap my file in a io.TextIOWrapper class while then allowed me to specify the error handling and to pass it directly to pandas read_cv() method.

 import gzip
import io
import pandas pd

gz = gzip.open('./logs/nasty.log.gz', 'r')
decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace') 
df = pd.read_csv(decoder_wrapper, sep='\t')
Figuring all that out took about two days.

No comments:

Post a Comment