-->

Wednesday, August 20, 2014

Use Asciinema to Record and Share Terminal Sessions

I discovered a cool tool today by listening to Jeroen Janssen's talk on Data Science at the Command Line. The tool is Asciinema, which is a terminal plugin that lets you record you terminal session, save it and then share - with copyable text! It supports both OSX and Linux.

Here's an example from their website:

Can't wait to use this for this blog!

Monday, August 18, 2014

Automatically Partition Data in Hive using S3 Buckets

Did you know that if you are processing data stored in S3 using Hive, you can have Hive automatically partition the data (logical separation) by encoding the S3 bucket names using a key=value pair? For instance, if you have time-based data, and you store it in buckets like this:

/root_path_to_buckets/date=20140801
/root_path_to_buckets/date=20140802
/root_path_to_buckets/date=20140803
/root_path_to_buckets/...

And you build a table in Hive, like


CREATE EXTERNAL TABLE time_data(
   value STRING,
   value2 INT,
   value3 STRING,
   ...
)
PARTITIONED BY(date STRING)
LOCATION s3n://root_path_to_buckets/

Hive will automatically know that your data is logically separated by dates. Usually this requires you to refresh the partition list by calling the command:

ALTER TABLE time_data RECOVER PARTITIONS;

After that, you can check to see if the partitions have taken using the SHOW command:

SHOW PARTITIONS time_data;

Now when you run a SELECT command, Hive will only load the data needed. This saves a tremendous amount of downloading and processing time. Example:

SELECT value, value2 FROM time_data WHERE date > "20140802"

This will only load 1/3 of the data (since 20140801 and 20140802 are excluded).