I discovered a cool tool today by listening to Jeroen Janssens's talk on Data Science at the Command Line. The tool is Asciinema, a terminal session recorder that lets you record your terminal session, save it, and then share it - with copyable text! It supports both OS X and Linux.
Here's an example from their website:
Can't wait to use this for this blog!
A blog about scientific computing with Python and Matlab. See the work of an engineer and data scientist in practice.
Wednesday, August 20, 2014
Monday, August 18, 2014
Automatically Partition Data in Hive using S3 Buckets
Did you know that if you are processing data stored in S3 using Hive, you can have Hive automatically partition the data (a logical separation) by encoding the S3 path names as key=value pairs? For instance, if you have time-based data and you store it under paths like this:
/root_path_to_buckets/date=20140801
/root_path_to_buckets/date=20140802
/root_path_to_buckets/date=20140803
/root_path_to_buckets/...
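Generating that layout programmatically is just string formatting on the date. Here is a minimal Python sketch; the function name and root path are illustrative, not part of any Hive or S3 API:

```python
from datetime import date, timedelta

def partition_prefix(root, day):
    """Build an S3 key prefix that Hive can read as a date partition."""
    return "%s/date=%s" % (root, day.strftime("%Y%m%d"))

# Recreate the three example prefixes above.
start = date(2014, 8, 1)
prefixes = [partition_prefix("root_path_to_buckets", start + timedelta(days=i))
            for i in range(3)]
for p in prefixes:
    print(p)
```

When you write your data, you would upload each day's files under its matching prefix.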
And you build an external table in Hive like this:
CREATE EXTERNAL TABLE time_data(
value STRING,
value2 INT,
value3 STRING,
...
)
PARTITIONED BY(date STRING)
LOCATION 's3n://root_path_to_buckets/';
Hive will automatically know that your data is logically separated by dates. You usually need to refresh the partition list first by running:
ALTER TABLE time_data RECOVER PARTITIONS;
(RECOVER PARTITIONS is Amazon EMR Hive syntax; on stock Hive the equivalent is MSCK REPAIR TABLE time_data;.)
After that, you can check that the partitions were picked up using the SHOW command:
SHOW PARTITIONS time_data;
Now when you run a SELECT command, Hive will only load the data needed. This saves a tremendous amount of downloading and processing time. Example:
SELECT value, value2 FROM time_data WHERE date > "20140802";
This will only load 1/3 of the data (since 20140801 and 20140802 are excluded).
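The pruning step is essentially a string comparison on the encoded values: Hive only lists and reads the prefixes whose key=value segment satisfies the predicate. A minimal Python sketch of that idea (this is an illustration, not Hive's actual implementation; all names are made up):

```python
def prune_partitions(prefixes, key, predicate):
    """Keep only the prefixes whose trailing key=value segment passes the predicate."""
    kept = []
    for prefix in prefixes:
        # Pull the value out of the trailing "key=value" path segment.
        segment = prefix.rstrip("/").rsplit("/", 1)[-1]
        k, _, v = segment.partition("=")
        if k == key and predicate(v):
            kept.append(prefix)
    return kept

prefixes = [
    "root_path_to_buckets/date=20140801",
    "root_path_to_buckets/date=20140802",
    "root_path_to_buckets/date=20140803",
]
# Mirrors WHERE date > "20140802": only the 20140803 partition is read.
print(prune_partitions(prefixes, "date", lambda v: v > "20140802"))
```

Because the dates are zero-padded YYYYMMDD strings, lexicographic comparison matches chronological order, which is what makes the string predicate in the WHERE clause work.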