Thursday, June 26, 2014

How to Put Data into Amazon S3 From OSX

Amazon's S3 storage can be a great option for cheaply storing data "in the cloud". The question is how to get data in and out of S3. There are several options. If you need a GUI application for OSX, you might try Forklift. For a command-line option, install the s3cmd program. If you have Homebrew installed, you can do this with the command:

brew install s3cmd

Now run s3cmd --configure to set it up.
Enter your AWS Access Key and Secret Key (which you can find in your AWS control panel). Pick an encryption password (which is used during the transfer). I don't use a GPG program, the HTTPS protocol, or an HTTP proxy server.

The next option is to test the connection. Hit "Y". You should see the output:
`Success. Your access key and secret key worked fine :-)`

Now hit "Y" again to save the settings.

Great! Now you've got s3cmd set up and you can move files into and out of S3!
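
Before moving any files, a quick sanity check can confirm the setup. The sketch below assumes s3cmd saved its settings to its default location, ~/.s3cfg. The function name s3cmd_ready is made up for illustration and is not part of s3cmd.

```shell
# Hypothetical helper: reports whether s3cmd looks ready to use.
# Assumes the default config location (~/.s3cfg) written by `s3cmd --configure`.
s3cmd_ready() {
  if ! command -v s3cmd >/dev/null 2>&1; then
    echo "s3cmd is not on the PATH (try: brew install s3cmd)" >&2
    return 1
  fi
  if [ ! -f "$HOME/.s3cfg" ]; then
    echo "no config found at $HOME/.s3cfg (try: s3cmd --configure)" >&2
    return 1
  fi
  echo "s3cmd is installed and configured"
}
```

Once this reports success, running s3cmd ls should list your buckets.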

To move a file into S3, run the command:

s3cmd put path/to/filename/filename.file s3://path/to/s3/bucket/

The file can also be accessed via HTTP, using the URL format described in the AWS Support Forums.
You can also navigate to the file in the S3 console and open its properties, where you'll see the HTTP link listed.
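
The HTTP link can also be put together by hand. This is a minimal sketch that assumes the common virtual-hosted URL style (the bucket name as a subdomain of s3.amazonaws.com) and a key with no characters that need URL-encoding. The function name and the bucket name are made up for illustration.

```shell
# Hypothetical helper: prints the HTTP URL for an object, given a bucket and a key.
# Assumes the virtual-hosted URL style; the key is used as-is (no URL-encoding).
s3_http_url() {
  printf 'http://%s.s3.amazonaws.com/%s\n' "$1" "$2"
}

# Example with a made-up bucket name:
s3_http_url my-example-bucket data/filename.file
# prints http://my-example-bucket.s3.amazonaws.com/data/filename.file
```

The link only works in a browser if the object is publicly readable. One way to get that is to upload with s3cmd's --acl-public flag, e.g. s3cmd put --acl-public filename.file s3://my-example-bucket/data/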

To pull down a file from S3, run the command:

s3cmd get s3://path/to/s3/bucket/filename.file
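
To check that a file survives the trip in both directions, the put and get commands can be combined. The sketch below is one possible way to do it, not a built-in s3cmd feature: the function name s3_roundtrip is hypothetical, and it assumes the destination is given as a bucket path ending in a slash.

```shell
# Hypothetical helper: uploads a file, downloads it again, and compares the copies.
# Usage: s3_roundtrip path/to/file s3://bucket/prefix/   (the trailing slash matters)
s3_roundtrip() {
  src="$1"
  dest="$2"
  name="$(basename "$src")"
  tmpdir="$(mktemp -d)" || return 1

  # Upload, then download to a scratch directory.
  s3cmd put "$src" "$dest" || return 1
  # `s3cmd get` accepts an optional local path as its second argument.
  s3cmd get "${dest}${name}" "$tmpdir/$name" || return 1

  # Compare the original and the downloaded copy byte for byte.
  if cmp -s "$src" "$tmpdir/$name"; then
    echo "round trip OK: $name"
  else
    echo "downloaded copy differs from $src" >&2
    return 1
  fi
}
```

For example, s3_roundtrip notes.txt s3://my-example-bucket/backups/ (both names made up) uploads notes.txt, fetches it back into a temporary directory, and reports whether the two copies match.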

Monday, June 9, 2014

Introduction to IPython.Parallel and Distributed Model Selection

At PyCon 2013, Olivier Grisel presented a tutorial on Advanced Scikit-Learn. One of the topics was parallel computation and model training, which starts at 1:03 in the video. The coverage of memory mapping large files with joblib and NumPy is especially valuable.

The data and notebooks for the talk can be checked out here. Grisel also covered using StarCluster to distribute computation (very) easily among many EC2 machines. I can't wait to give it a try!

Great talk and well worth the watch.