
Tuesday, May 16, 2017

Reading Files with Encoding Errors Into Pandas

I found myself in a situation where I needed to read a file with mixed character encodings into Pandas. Pandas does not handle this situation: it requires a single fixed encoding and throws an error when it encounters a line it cannot decode. Practically, this means the file is a sequence of bytes whose correct interpretation differs from line to line. In my case, most of the lines were UTF-8, while some were in various other encodings.
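To see which lines are the troublemakers before deciding how to handle them, one option is to try a strict decode line by line. This is only a sketch: the sample bytes and the helper name find_undecodable_lines are made up for illustration and are not part of my original workflow.

```python
# Hypothetical sample: three tab-separated lines, the second encoded as Latin-1.
raw = (
    "id\tname\n".encode("utf-8")
    + "1\tcafé\n".encode("latin-1")
    + "2\tnaïve\n".encode("utf-8")
)


def find_undecodable_lines(data, encoding="utf-8"):
    """Return (line_number, raw_bytes) for each line that fails strict decoding."""
    bad = []
    for number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode(encoding)  # strict error handling is the default
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


bad_lines = find_undecodable_lines(raw)
print(bad_lines)
```

For a real file you would read the bytes with open(path, 'rb') instead of building them by hand.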

Character encoding is a particularly confusing problem (for me), so it took a while to figure out a workaround. I discovered that base Python provides several error-handling modes for decoding bytes into strings. The default, "strict" (which Pandas uses), raises a UnicodeDecodeError when it hits bytes it cannot decode. Other options include "ignore" and different varieties of replacement. For my case, I wanted to use the "backslashreplace" style, which converts bytes that are not valid UTF-8 into backslash-escaped byte sequences. For example, a Latin-1 encoded "é" (the single byte 0xE9) would show up as the four characters "\xe9" in my Python string. Python also allows you to register a custom error handler if you so desire. If you wanted to be really fancy, you could use a custom error handler to guess other encodings using ftfy or chardet.
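Here is a rough comparison of those error handlers on a single bad byte, plus a minimal custom handler. The sample string and the handler name latin1_fallback are assumptions for illustration; the handler simply reinterprets bad bytes as Latin-1, which is a much cruder guess than what ftfy or chardet would make.

```python
import codecs

# Hypothetical sample: "café au lait" encoded as Latin-1, so the é is the
# single byte 0xE9, which is not valid UTF-8 in this position.
raw = "café au lait".encode("latin-1")

try:
    raw.decode("utf-8")  # errors="strict" is the default
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = f"failed at byte {exc.start}"

ignored = raw.decode("utf-8", errors="ignore")             # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")           # inserts U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")   # inserts a \xNN escape


def latin1_fallback(error):
    """Hypothetical custom handler: reinterpret undecodable bytes as Latin-1."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad_bytes.decode("latin-1"), error.end


codecs.register_error("latin1_fallback", latin1_fallback)
recovered = raw.decode("utf-8", errors="latin1_fallback")

for label, value in [("strict", strict_result), ("ignore", ignored),
                     ("replace", replaced), ("backslashreplace", escaped),
                     ("latin1_fallback", recovered)]:
    print(label, repr(value))
```

The nice thing about "backslashreplace" is that it loses nothing: the original byte values are still visible in the string, so you can go back and fix them later.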

Unfortunately, Pandas' read_csv() function did not support non-strict error handling at the time, so I needed a way to decode the bytes before Pandas accessed them. My final solution was to wrap my file in an io.TextIOWrapper, which allowed me to specify the error handling, and to pass the wrapper directly to Pandas' read_csv().

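Example:

import gzip
import io
import pandas as pd

# Open the gzipped log in binary mode so the bytes reach the wrapper undecoded.
gz = gzip.open('./logs/nasty.log.gz', 'rb')
# Decode as UTF-8, escaping any invalid bytes instead of raising an error.
decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace')
df = pd.read_csv(decoder_wrapper, sep='\t')

Figuring all that out took about two days.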
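If you want to try the wrapper trick without a nasty log file on hand, the sketch below builds a tiny gzipped sample in memory. The sample bytes are invented, and csv.reader stands in for read_csv() only so the snippet runs with the standard library alone.

```python
import csv
import gzip
import io

# Hypothetical log contents: the second data row holds a Latin-1 byte (0xE9).
raw = b"id\tname\n1\tplain\n2\tcaf\xe9\n"

# Compress into an in-memory buffer so no file on disk is needed.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz_out:
    gz_out.write(raw)
buffer.seek(0)

with gzip.GzipFile(fileobj=buffer, mode="rb") as gz_in:
    # Same wrapper as in the example above: decode as UTF-8 and escape
    # invalid bytes instead of raising UnicodeDecodeError.
    text = io.TextIOWrapper(
        gz_in, encoding="utf-8", errors="backslashreplace", newline=""
    )
    rows = list(csv.reader(text, delimiter="\t"))

print(rows)
```

The same text object could be handed to pd.read_csv(text, sep='\t') in place of csv.reader.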
