Using EMR (Hadoop) for Massive S3 Transfers

Jordan Graft, over 1 year ago

When you need to move terabytes of files from one S3 bucket to another, EMR can be the right tool for the job. The initial mapping of the bucket takes a while to complete, after which the copying itself is handed off to the task and worker instances.

  1. From the EMR Console
    1. Select Create new Job Flow.
    2. Name it -> Select Hive Program -> Continue
    3. Select Start an Interactive Hive Session -> Continue
    4. Change the core instance count to 0 -> Continue
    5. Select a keypair -> Continue -> Continue -> Create Job Flow.
  2. Once the cluster has been created, SSH in as the hadoop user
  3. Install & configure s3cmd
  4. Download the s3distcp.jar file and start the job. The s3a notation is the newest of the three formats for referencing an S3 bucket (the others being s3 and s3n).
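
The last step does not show the actual command, so here is a rough sketch of how the job might be started from the master node. It is only an illustration, not the exact invocation used in the original setup: the bucket names and the jar location are placeholders, and the helper function build_s3distcp_command is made up for this example. The --src and --dest options are the S3DistCp source and destination arguments.

```python
import shlex


def build_s3distcp_command(src_bucket, dest_bucket,
                           jar_path="s3distcp.jar", scheme="s3a"):
    """Build the argv list for running S3DistCp through `hadoop jar`.

    Hypothetical helper: bucket names and jar_path are placeholders
    and must be replaced with your own values.
    """
    # Only the three notations mentioned in the post are accepted.
    if scheme not in ("s3", "s3n", "s3a"):
        raise ValueError(f"unsupported scheme: {scheme}")
    return [
        "hadoop", "jar", jar_path,
        "--src", f"{scheme}://{src_bucket}/",
        "--dest", f"{scheme}://{dest_bucket}/",
    ]


if __name__ == "__main__":
    # Placeholder bucket names, for illustration only.
    cmd = build_s3distcp_command("example-source-bucket",
                                 "example-dest-bucket")
    # Print the command so it can be reviewed before it is run.
    print(shlex.join(cmd))
```

Printing the command instead of executing it lets you check the bucket names before kicking off a multi-terabyte copy. Once it looks right, the same list can be passed to subprocess.run on the master node.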