I discovered a cool tool today by listening to Jeroen Janssens's talk on Data Science at the Command Line. The tool is Asciinema, a terminal session recorder that lets you record your terminal session, save it, and then share it - with copyable text! It supports both OS X and Linux.
Here's an example from their website:
Can't wait to use this for this blog!
A blog about scientific computing with Python and Matlab. See the work of an engineer and data scientist in practice.
Wednesday, August 20, 2014
Monday, August 18, 2014
Automatically Partition Data in Hive using S3 Buckets
Did you know that if you are processing data stored in S3 using Hive, you can have Hive automatically partition the data (a logical separation) by encoding the S3 path names as key=value pairs? For instance, if you have time-based data and you store it under paths like this:
/root_path_to_buckets/date=20140801
/root_path_to_buckets/date=20140802
/root_path_to_buckets/date=20140803
/root_path_to_buckets/...
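Generating that layout programmatically is just string formatting on the date. Here is a minimal Python sketch; the function name and root path are illustrative, not part of any Hive or S3 API:

```python
from datetime import date, timedelta

def partition_prefix(root, day):
    """Build an S3 key prefix that Hive can read as a date partition."""
    return "%s/date=%s" % (root, day.strftime("%Y%m%d"))

# Recreate the three example prefixes above.
start = date(2014, 8, 1)
prefixes = [partition_prefix("root_path_to_buckets", start + timedelta(days=i))
            for i in range(3)]
for p in prefixes:
    print(p)
```

When you write your data, you would upload each day's files under its matching prefix.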
And you build an external table in Hive like this:
CREATE EXTERNAL TABLE time_data(
value STRING,
value2 INT,
value3 STRING,
...
)
PARTITIONED BY(date STRING)
LOCATION 's3n://root_path_to_buckets/';
Hive will automatically know that your data is logically separated by dates. You usually need to refresh the partition list first by running:
ALTER TABLE time_data RECOVER PARTITIONS;
(RECOVER PARTITIONS is Amazon EMR Hive syntax; on stock Hive the equivalent is MSCK REPAIR TABLE time_data;.)
After that, you can check that the partitions were picked up using the SHOW command:
SHOW PARTITIONS time_data;
Now when you run a SELECT command, Hive will only load the data needed. This saves a tremendous amount of downloading and processing time. Example:
SELECT value, value2 FROM time_data WHERE date > "20140802";
This will only load 1/3 of the data (since 20140801 and 20140802 are excluded).
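The pruning step is essentially a string comparison on the encoded values: Hive only lists and reads the prefixes whose key=value segment satisfies the predicate. A minimal Python sketch of that idea (this is an illustration, not Hive's actual implementation; all names are made up):

```python
def prune_partitions(prefixes, key, predicate):
    """Keep only the prefixes whose trailing key=value segment passes the predicate."""
    kept = []
    for prefix in prefixes:
        # Pull the value out of the trailing "key=value" path segment.
        segment = prefix.rstrip("/").rsplit("/", 1)[-1]
        k, _, v = segment.partition("=")
        if k == key and predicate(v):
            kept.append(prefix)
    return kept

prefixes = [
    "root_path_to_buckets/date=20140801",
    "root_path_to_buckets/date=20140802",
    "root_path_to_buckets/date=20140803",
]
# Mirrors WHERE date > "20140802": only the 20140803 partition is read.
print(prune_partitions(prefixes, "date", lambda v: v > "20140802"))
```

Because the dates are zero-padded YYYYMMDD strings, lexicographic comparison matches chronological order, which is what makes the string predicate in the WHERE clause work.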