Monday, August 18, 2014

Automatically Partition Data in Hive using S3 Buckets

Did you know that if you are processing data stored in S3 with Hive, you can have Hive automatically partition the data (a logical separation) by encoding key=value pairs in the S3 key prefixes (the "folder" names) under your bucket? For instance, if you have time-based data and you store it under prefixes like this:

/root_path_to_buckets/date=20140801
/root_path_to_buckets/date=20140802
/root_path_to_buckets/date=20140803
/root_path_to_buckets/...

And you build an external table in Hive like this:

CREATE EXTERNAL TABLE time_data(
   value STRING,
   value2 INT,
   value3 STRING,
   ...
)
PARTITIONED BY (`date` STRING)
LOCATION 's3n://root_path_to_buckets/';

Hive will understand that your data is logically separated by date. (Note the backticks around `date` in the DDL above; date is a reserved word in recent versions of Hive.) The partitions are not registered in the metastore automatically, though: you first have to refresh the partition list with this command:

ALTER TABLE time_data RECOVER PARTITIONS;
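
Note that RECOVER PARTITIONS is an extension in the Hive build that ships with Amazon EMR. If you are running stock Apache Hive, the equivalent command (as far as I know) is:

MSCK REPAIR TABLE time_data;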

After that, you can check to see if the partitions have taken using the SHOW command:

SHOW PARTITIONS time_data;
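
With the three example prefixes above, the output should look something like this:

date=20140801
date=20140802
date=20140803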

Now when you run a SELECT that filters on the partition column, Hive prunes the partition list and reads only the matching data from S3, which can save a tremendous amount of download and processing time. Example:

SELECT value, value2 FROM time_data WHERE `date` > "20140802";

This reads only one third of the example data, since the date=20140801 and date=20140802 partitions are skipped entirely.
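
If you only need to register a single new prefix (say, when the next day's data lands), you can also add one partition by hand instead of rescanning everything. A minimal sketch, assuming a hypothetical date=20140804 prefix:

ALTER TABLE time_data ADD PARTITION (`date`='20140804')
LOCATION 's3n://root_path_to_buckets/date=20140804/';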
