Monday, August 18, 2014

Automatically Partition Data in Hive using S3 Buckets

Did you know that if you are processing data stored in S3 with Hive, you can have Hive automatically partition the data (a logical separation) by encoding key=value pairs in the S3 key prefixes (the "folder" names) under your bucket? For instance, if you have time-based data and you store it under prefixes like this:

/root_path_to_buckets/date=20140801
/root_path_to_buckets/date=20140802
/root_path_to_buckets/date=20140803
/root_path_to_buckets/...

And you build an external table in Hive like this:

CREATE EXTERNAL TABLE time_data(
   value STRING,
   value2 INT,
   value3 STRING,
   ...
)
PARTITIONED BY (`date` STRING)
LOCATION 's3n://root_path_to_buckets/';

Hive will understand that your data is logically separated by date. (Note the backticks around `date` in the DDL above; date is a reserved word in recent versions of Hive.) The partitions are not registered in the metastore automatically, though: you first have to refresh the partition list with this command:

ALTER TABLE time_data RECOVER PARTITIONS;
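
Note that RECOVER PARTITIONS is an extension in the Hive build that ships with Amazon EMR. If you are running stock Apache Hive, the equivalent command (as far as I know) is:

MSCK REPAIR TABLE time_data;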

After that, you can check to see if the partitions have taken using the SHOW command:

SHOW PARTITIONS time_data;
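
With the three example prefixes above, the output should look something like this:

date=20140801
date=20140802
date=20140803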

Now when you run a SELECT that filters on the partition column, Hive prunes the partition list and reads only the matching data from S3, which can save a tremendous amount of download and processing time. Example:

SELECT value, value2 FROM time_data WHERE `date` > "20140802";

This reads only one third of the example data, since the date=20140801 and date=20140802 partitions are skipped entirely.
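
If you only need to register a single new prefix (say, when the next day's data lands), you can also add one partition by hand instead of rescanning everything. A minimal sketch, assuming a hypothetical date=20140804 prefix:

ALTER TABLE time_data ADD PARTITION (`date`='20140804')
LOCATION 's3n://root_path_to_buckets/date=20140804/';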
