Character encoding is a particularly confusing problem (for me) so it took a while to figure out a workaround to the issue. I discovered that base Python provides different error handling when decoding bytes into Strings. The default, "strict" (which Pandas uses) throws UnicodeError when a bad line is found. Other options include "ignore" and different varieties of replacement. For my case, I wanted to us the "backslashreplace" style, which converts non-UTF-8 characters into their backslash escaped byte sequences. For example, the Unicode characters "グケゲ" would get turned into "\x30b0\x30b1\30b2" in my Python string. Python also allows you to register a custom error handler if you so desire. If you wanted to be really fancy, you could use a custom error handler to guess other encoding types using FTFY or chardet.
Unfortunately Pandas read_csv() method doesn't support using the non-strict error handling, so I needed a way to decode the bytes before Pandas accessed them. My final solution was to wrap my file in a io.TextIOWrapper class while then allowed me to specify the error handling and to pass it directly to pandas read_cv() method.
Example:
import gzipFiguring all that out took about two days.
import io
import pandas pd
gz = gzip.open('./logs/nasty.log.gz', 'r')
decoder_wrapper = io.TextIOWrapper(gz, encoding='utf-8', errors='backslashreplace')
df = pd.read_csv(decoder_wrapper, sep='\t')
No comments:
Post a Comment