This is a DRAFT. Please send comments to #data-coord on Slack.

Genomeark AWS CLI Primer

For complete documentation and information about the AWS CLI, please visit Amazon’s AWS website. This primer assumes the program is installed on the machine where your data resides. If it is not, please install it yourself or ask your system administrator for help. The following topics are briefly covered here: command structure, man(ual) pages, relevant aws options, useful aws s3 subcommands, and doing a dry run.

After reviewing this primer, please visit the Easy Step-by-Step Guide.

Command Structure

The AWS CLI program is called by the command aws, followed by options and a subcommand:

aws [options] subcommand ...

We’ll be using only the s3 subcommand, which in turn has its own subcommands and options:

aws [options] s3 subcommand [more options] ...

The aws s3 subcommands most commonly used for interacting with Genomeark are ls, cp, rm, mv, and sync.

Man(ual) Pages

The AWS CLI has extensive documentation online and built into the software as man(ual) pages. You can type “help” after any subcommand to trigger displaying the manual:

# show the man page for the aws command
aws help

# show the man page for the s3 subcommand
aws s3 help

# show the man page for the cp sub-subcommand
#    substitute "cp" with any other s3 subcommand
aws s3 cp help

Relevant aws Options

Two options available to the aws command are critical: --no-sign-request and --profile. These options tell aws what kind of permissions you are using for the command. They also help Amazon know whom to bill for certain operations, but you can ignore that for everything we’re talking about here.

The --no-sign-request flag makes your command anonymous. It is safe to use for operations that only download information (e.g., directory listings with ls) or files (e.g., via cp or sync). Operations that require write permissions to the bucket (e.g., uploading data, or mv and rm, which delete objects) cannot be done anonymously. In those cases, use --profile and provide it with the name of the AWS “profile” associated with both the Genomeark bucket and your account. If you followed our credentialing instructions, you would provide the string “genomeark”: --profile=genomeark.

If your Genomeark credentials are set to the “default” profile in your ~/.aws/credentials file, you can omit both of these options and aws will behave as if you provided --profile=default. Note that --no-sign-request and --profile are mutually exclusive.
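
As a rough sketch of the rule above, the hypothetical helper below (auth_option is not part of the AWS CLI) prints the option that matches the kind of operation: anonymous for reads, a named profile for writes. The bucket name and file names are placeholders, and the commands are echoed rather than run.

```shell
# Hypothetical helper, not part of the AWS CLI: print the aws option that
# matches the kind of operation you are about to run.
#   read  -> anonymous access is enough
#   write -> a credentialed profile is required (defaults to "genomeark")
auth_option() {
    case "$1" in
        read)  printf '%s\n' "--no-sign-request" ;;
        write) printf '%s\n' "--profile=${2:-genomeark}" ;;
        *)     printf 'unknown mode: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Placeholder bucket name (an assumption); substitute the bucket you were given.
bucket="my-bucket"

# Echo (rather than run) two commands to show where the option belongs:
# after "aws" and before "s3".
echo "aws $(auth_option read) s3 ls s3://${bucket}/"
echo "aws $(auth_option write) s3 cp local.txt s3://${bucket}/path/to/somewhere/"
```

Passing a second argument, as in auth_option write default, selects a different profile name.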

Useful aws s3 subcommands

The core aws s3 subcommands for uploading data to Genomeark are ls (list contents of a directory), cp (copy files), rm (remove files), mv (move files), and sync (sync two directories).

aws s3 ls

This command is analogous to using Unix-like operating systems’ ls to view directories and files in a traditional file system. You can view the manual for full details (via aws s3 ls help), but the most common usage looks like this:

aws --no-sign-request s3 ls [--recursive] S3-URI
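
A recursive listing can be post-processed with standard tools. The sketch below assumes the usual aws s3 ls --recursive output layout of date, time, size in bytes, and key on each line; total_bytes is a hypothetical helper name, and the listing lines in the demonstration are made up.

```shell
# Hypothetical helper: sum the size column (third field) of
# "aws s3 ls --recursive" output read from standard input.
# Assumes each line looks like: DATE TIME SIZE KEY
total_bytes() {
    awk '{ sum += $3 } END { printf "%.0f\n", sum }'
}

# Typical use (needs network access, so it is left commented out):
#   aws --no-sign-request s3 ls --recursive S3-URI | total_bytes

# Self-contained demonstration on two made-up listing lines.
printf '%s\n' \
    '2024-01-01 12:00:00       1024 path/to/a.txt' \
    '2024-01-01 12:00:05       2048 path/to/b.txt' | total_bytes
```

The ls subcommand also accepts --summarize and --human-readable, which report totals directly; see aws s3 ls help for details.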

aws s3 cp

This command is analogous to Unix-like operating systems’ cp (or, more accurately, scp) to copy files from one location to another, with either or both locations being remote. By default, it copies only a single file, but the --recursive option can be supplied to copy everything in a directory, including subdirectories. The general usage looks like this:

aws --profile=genomeark s3 cp [--recursive] source destination

One or both of “source” and “destination” must be an S3-URI, i.e., it cannot copy a local file to another local location, but it can copy (a) a local file to the S3 bucket, (b) a file in the S3 bucket to a local location, or (c) a file on the S3 bucket to another location in the S3 bucket. You can view the manual for full details via aws s3 cp help.
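
The source/destination rule can be sketched as a small check. The helpers below are hypothetical (they are not part of the AWS CLI) and only inspect the arguments; they do not contact S3. The example paths are reused from elsewhere in this primer.

```shell
# Hypothetical helpers illustrating the "at least one S3-URI" rule.

# Succeed if the argument starts with the s3:// scheme.
is_s3_uri() {
    case "$1" in
        s3://*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Succeed if the source ($1) or the destination ($2) is an S3-URI.
check_endpoints() {
    if is_s3_uri "$1" || is_s3_uri "$2"; then
        return 0
    fi
    echo "error: neither '$1' nor '$2' is an S3-URI" >&2
    return 1
}

# Describe the direction of a copy: upload, download, or s3-to-s3.
copy_direction() {
    check_endpoints "$1" "$2" || return 1
    if is_s3_uri "$1" && is_s3_uri "$2"; then
        echo "s3-to-s3"
    elif is_s3_uri "$2"; then
        echo "upload"
    else
        echo "download"
    fi
}

# A local file copied to the bucket is case (a), an upload.
copy_direction myHifi/movie.hifi_reads.bam s3://genomeark-upload/path/to/somewhere/
```

The same check applies unchanged to the source and destination arguments of mv and sync.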

aws s3 rm

This command is analogous to Unix-like operating systems’ rm to remove/delete files. By default, it removes only a single file, but the --recursive option can be supplied to delete everything in a directory, including subdirectories. You can view the manual for full details via aws s3 rm help. The general usage looks like this:

aws --profile=genomeark s3 rm [--recursive] S3-URI

aws s3 mv

This command is used to move files. While it is roughly analogous to using mv on files in a traditional file system, it is implemented as a copy operation followed by a remove operation rather than as a rename operation. In a traditional file system (on a single physical drive), mv operations are very fast because they simply update what amounts to metadata. In S3 object storage, every byte of the file(s) must be read from disk and written to a new location. The following are functionally identical:

# mv
aws --profile=${profile} s3 mv s3://${bucket}/path/to/file1.txt s3://${bucket}/other/path/to/file2.txt

# cp + rm
aws --profile=${profile} s3 cp s3://${bucket}/path/to/file1.txt s3://${bucket}/other/path/to/file2.txt
aws --profile=${profile} s3 rm s3://${bucket}/path/to/file1.txt

Accordingly, this command is used similarly to the aws s3 cp command:

aws --profile=genomeark s3 mv [--recursive] source destination

One or both of “source” and “destination” must be an S3-URI, i.e., it cannot move a local file to another local location, but it can move (a) a local file to the S3 bucket (the local copy will be deleted), (b) a file in the S3 bucket to a local location (the copy on the S3 bucket will be deleted), or (c) a file on the S3 bucket to another location in the S3 bucket. You can view the manual for full details via aws s3 mv help.

aws s3 sync

This command is analogous to using rsync to make a destination directory mimic a source directory on traditional (possibly remote) file systems. The typical command looks like this:

aws --profile=genomeark s3 sync [options] source destination

One or both of “source” and “destination” must be an S3-URI, i.e., it cannot sync a local directory to another local location, but it can sync (a) a local directory to the S3 bucket, (b) a directory on the S3 bucket to a local location, or (c) a directory on the S3 bucket to another location in the S3 bucket. You can view the manual for full details via aws s3 sync help. Common options are --include and --exclude.

This command will probably be the most useful when uploading data to Genomeark because the entire upload process can be handled by a single command if you mirror expected directory structure and filenames on your machine. See the Easy Step-by-Step Guide for an example.

Caveat emptor: the --include and --exclude options that filter files in an aws s3 sync operation behave very differently from the same options in rsync. Compare the “Use of Exclude and Include Filters” section of aws s3 sync help with the “FILTER RULES” section of man rsync.
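
To make the filter behavior concrete: in aws s3 sync every file is included by default, and when several filters are given, the ones that appear later on the command line take precedence. The following is a minimal sketch, assuming you want to preview an upload of only BAM files. The function name and the '*.bam' pattern are illustrative; the paths are reused from the examples in this primer.

```shell
# Hypothetical wrapper: preview a sync that uploads only *.bam files.
# --exclude='*' first drops everything, then the later --include='*.bam'
# re-admits the BAM files because later filters take precedence.
sync_only_bam() {
    # $1 = local source directory, $2 = destination S3-URI
    aws --profile=genomeark s3 sync --dryrun \
        --exclude='*' --include='*.bam' \
        "$1" "$2"
}

# Example call (commented out because it needs credentials):
#   sync_only_bam allMyStuff/ s3://genomeark-upload/some/path/
```

Reversing the order (--include='*.bam' followed by --exclude='*') would exclude every file, because the later --exclude would override the earlier --include.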

Doing a Dryrun

Each of these aws s3 operations can have the --dryrun option supplied to show what would happen had you run the command for real. This can be really helpful to test that your commands will do what you think, especially when operating on many files (e.g., with sync, cp --recursive, or rm --recursive).

# general case
aws [options] s3 subcommand --dryrun ...

# specific examples
aws --profile=genomeark s3 cp --dryrun myHifi/movie.hifi_reads.bam s3://genomeark-upload/path/to/somewhere/
aws --profile=genomeark s3 rm --dryrun --recursive s3://genomeark-upload/path/to/everythingIuploaded/
aws --profile=genomeark s3 sync --dryrun --exclude='*.sh' allMyStuff/ s3://genomeark-upload/some/path/
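
Dry-run output can also be summarized before committing to the real command. The sketch below assumes each planned action is printed on its own line beginning with "(dryrun)", followed by the action name and a colon (for example "(dryrun) upload: ..."). summarize_dryrun is a hypothetical helper, and the sample lines are made up.

```shell
# Hypothetical helper: count planned actions per type in --dryrun output
# read from standard input. Lines not starting with "(dryrun)" are ignored.
summarize_dryrun() {
    awk '$1 == "(dryrun)" { sub(/:$/, "", $2); count[$2]++ }
         END { for (action in count) print action, count[action] }' | sort
}

# Typical use (needs credentials, so it is left commented out):
#   aws --profile=genomeark s3 sync --dryrun allMyStuff/ s3://genomeark-upload/some/path/ | summarize_dryrun

# Self-contained demonstration on made-up dry-run lines.
printf '%s\n' \
    '(dryrun) upload: allMyStuff/a.bam to s3://genomeark-upload/some/path/a.bam' \
    '(dryrun) upload: allMyStuff/b.bam to s3://genomeark-upload/some/path/b.bam' \
    '(dryrun) delete: s3://genomeark-upload/some/path/old.txt' | summarize_dryrun
```

An unexpected action type or count in the summary is a cue to inspect the full dry-run output before running the command for real.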