This is a DRAFT. Please send comments to #data-coord on Slack.

Genomeark Data Preparation

All the Genomeark data is stored on AWS S3 in the “genomeark” bucket. This bucket is periodically crawled to populate the information on the Genomeark website. An S3 bucket can store any arbitrary file, thus Genomeark can (and often does) store more files than are strictly necessary for data sharing and the website, e.g., intermediate files generated during the assembly process. However, the core files that Genomeark stores are the sequencing data and the assemblies. In order to get your sequencing data and/or assemblies into Genomeark and ensure your project is represented correctly on the website, you will need the following:

an image of your species for the website (if not yet present)
a Taxon Identifier (TaxID) from the NCBI Taxonomy database
a Tree of Life Identifier (ToLID) for your specimen(s)
certain metadata for the metadata.yaml file
access to the Genomeark AWS S3 bucket
an understanding of how the S3 bucket is organized
familiarity with the AWS-CLI
your data prepared for upload

If you have any questions after reading this guide, please reach out for help.

A picture for the website

If you are the first to upload data for your species of interest, congratulations! To keep the website looking nice, we need a picture of your species. If you are not the first for your species but you see the entry on the website is missing an image (a picture of a landscape with a magnifying glass overlaid), you have an opportunity to be a good citizen (and make your favorite image famous).

Image requirements:

JPG (please rename .jpeg to .jpg)
Square 500x500 px at 72ppi
Licensable (e.g., under CC BY) or permission for use provided

If you do not have a picture, you can search for one online (e.g., with Google); with the correct search parameters, only licensable images will be displayed. Here is an example search for a Siamang (Symphalangus syndactylus):

Genus="Symphalangus"
species="syndactylus"
echo 'https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=&as_epq='"${Genus}+${species}"'&as_oq=&as_eq=&imgsz=&imgar=&imgc=&imgcolor=&imgtype=&cr=&as_sitesearch=&safe=active&as_filetype=&tbs=sur%3Acl'

Simply copy/paste the resulting link into your browser.

If the image is not square, you can crop it or add a solid color (preferably black) above and below the image such that it becomes square. If the original image is not at least 500px wide or at least 72ppi, it is not suitable. Larger images (made square, if needed) can be scaled down in size and/or resolution to match 500x500 px at 72ppi. If you need help doing this, please reach out.
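The arithmetic behind the squaring step can be sketched in shell. This is only an illustration: the function name square_padding is hypothetical, the left/right case for tall images is an assumption that extends the above/below padding described here, and the sketch computes pixel counts without editing any image.

```shell
# Hypothetical helper: report how many pixels of solid color to add so that
# a WIDTH x HEIGHT image becomes square, or reject it if it is too narrow.
square_padding() {
  width=$1
  height=$2
  if [ "$width" -lt 500 ]; then
    echo "unsuitable: narrower than 500px"
    return 1
  fi
  if [ "$width" -ge "$height" ]; then
    # Wide image: split the missing height above and below.
    pad=$((width - height))
    echo "side=${width} top=$((pad / 2)) bottom=$((pad - pad / 2)) left=0 right=0"
  else
    # Tall image (assumption): pad left and right instead.
    pad=$((height - width))
    echo "side=${height} top=0 bottom=0 left=$((pad / 2)) right=$((pad - pad / 2))"
  fi
}

square_padding 800 600   # prints: side=800 top=100 bottom=100 left=0 right=0
```

The padded square (800x800 in this example) would then be scaled down to 500x500 px.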

Any image you take off the web must be licensable, or you must obtain written permission (e.g., by email). Examples of valid licenses: public domain, CC0, CC BY and its variants (e.g., CC BY-SA), GNU GPL, etc. For CC licenses, NC (non-commercial) variants are okay; ND (non-derivative) variants are not okay because we change the size of the image dynamically on the website, which is technically a derivative. Please note the license type, including the version, and send that information in when you share the image file. Also note the author/artist/copyright holder and the location from which you sourced the image. Most images come from Wikimedia or Flickr. The username of the uploader is sufficient, but if you can easily track down the actual name of the person, that is preferred. Please also provide the link to the page with the original image.

Taxon Identifier

The NCBI Taxonomy database has a numerical identifier for every species from which it hosts sequencing data. Most species are already in the database. However, if yours is not, or if your organism is a hybrid (e.g., between breeds or between species of cattle), you can request a new Taxon Identifier.

Tree of Life Identifier (ToLID)

Data is stored on Genomeark in “directories” named by ToLID (quoted because S3 has no true directories, only key prefixes that behave like them), so your data cannot be added until you have a ToLID for each discrete individual (or pool of individuals; see more on pools below). Creating a new ToLID requires an NCBI Taxon Identifier, so request one first if none exists for your organism. To create a new ToLID, visit the ToLID website and follow the instructions. You will need to create an account. Note that you cannot control the number associated with your ToLID; numbers are assigned on a first-come, first-served basis, beginning with 1.

To create a new ToLID, you will need to provide the scientific name (Genus & species) of your organism, the Taxon ID for your organism, and an internal identifier. If you have more than one internal identifier, you may provide several, separated by a sensible character (e.g., /, ;, or ,). If the organism is named (e.g., “Jim” the Gorilla), please include this as an identifier.

ToLIDs for Hybrids and Pools

Creating a ToLID for a hybrid is usually straightforward once you have a Taxon ID. Just read the instructions on the ToLID website and ask for help if needed. For pools of individuals (e.g., a population of fruit flies), you may treat the entire pool as if it were one individual for the purpose of generating the ToLID. A common reason for needing to sequence a pool of individuals is to get enough genetic material from a small organism, such as an insect. If you used multiple pools or a subset of a pool at different time points, each would require a separate ToLID. This is different from larger organisms, where samples taken from a single individual but from different tissues and/or at different developmental stages all share the same ToLID.

Other Metadata

Some additional metadata must be added to a YAML file on Genomeark. There is one such file per species (or cross, etc.). The fields in these files are not presently standardized, but certain fields are required so that the website is populated correctly. Additional arbitrary fields can be added, and we strongly encourage adding as much information as possible for your future convenience when creating BioSamples and submitting reads and assemblies to NCBI (or another INSDC partner).

These files are stored in a GitHub repository. If a YAML file for your species already exists, please add your information to it instead of creating a new one from scratch. You can share your file via email or Slack or you can make a pull request with your changes on the GitHub repository.

The following is a template containing the minimum requirements:

---
species:
  short_name: cGenSpe
  name: Genus species
  common_name: common
  taxon_id: ####
  order:
    name: Order
  family:
    name: Family
  individuals:
  -
      short_name: cGenSpe1
      subspecies:                       # <-- only needed if applicable
        name: Genus species subspecies  # <-- only needed if applicable
        common_name: common name        # <-- only needed if applicable
        taxon_id: ######                # <-- only needed if applicable
  genome_size: ###########
  genome_size_method: method
  project: [ project1, project2, etc. ]

Note that the genome_size must be an integer. Suffixes (e.g., K, Mb, etc.) are not supported. If you do not know the genome size, try looking it up on GoaT.
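Before sharing a metadata file, it can help to confirm that genome_size really is a bare integer. The following is a minimal sketch, not an official validator: the function names are hypothetical, and it assumes the file contains a single "genome_size:" line laid out as in the template.

```shell
# Hypothetical helper: succeed only if the argument is a plain non-negative integer.
is_plain_integer() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: extract the genome_size value with a simple text match.
# "genome_size_method:" is not matched because the colon must follow "genome_size".
genome_size_of() {
  sed -n 's/^ *genome_size: *\([^ #]*\).*$/\1/p' "$1"
}

# Report whether the genome_size in a metadata YAML file is usable.
check_genome_size() {
  value=$(genome_size_of "$1")
  if is_plain_integer "$value"; then
    echo "ok: genome_size=${value}"
  else
    echo "error: genome_size '${value}' is not an integer" >&2
    return 1
  fi
}

# Usage (the file name is a placeholder):
#   check_genome_size metadata.yaml
```

A value such as 3.1Gb, or an unfilled placeholder, is reported as an error.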

The following are good examples you can emulate:

Here is a more complete template you can use:

---
species:
  short_name: cGenSpe
  name: Genus species
  common_name: common
  taxon_id: ####
  order:
    name: Order
  family:
    name: Family
  individuals:
  -
      short_name: cGenSpe1
      name: name
      strain: strain-name
      subspecies:
        name: Genus species subspecies
        common_name: common name
        taxon_id: ######
      alt_ids:
      - alt1
      - alt2
      sex: male
      birth_date: date
      birth_location: location
      birth_type: type (e.g., wild vs captive)
      description: >
        A longer description;
      provider: Name or inst, other name or inst, yet another name or inst
      father: null
      mother: null
      samples:
      -
          biosample_id: biosample
          tissue: tissue-type
          dev_stage: stage
          treatment: treatment
          storage_condition: condition
          collection_date: date
          collected_by: name or inst
          source_id: source_identifier
          lat_lon: coordinates

  -
      short_name: cGenSpe2
      name: name
      biosample_id: biosample
      strain: strain-name
      subspecies:
        name: Genus species subspecies
        common_name: common name
        taxon_id: ######
      alt_ids:
      - alt1
      - alt2
      sex: female
      birth_date: date
      birth_location: location
      birth_type: type (e.g., wild vs captive)
      description: >
        A longer description;
      provider: Name or inst, other name or inst, yet another name or inst
      father: null
      mother: null
  -
      short_name: cGenSpe3
      name: name
      biosample_id: biosample
      strain: strain-name
      subspecies:
        name: Genus species subspecies
        common_name: common name
        taxon_id: ######
      alt_ids:
      - alt1
      - alt2
      sex: (fe)male
      birth_date: date
      birth_location: location
      birth_type: type (e.g., wild vs captive)
      description: >
        A longer description;
      provider: Name or inst, other name or inst, yet another name or inst
      father: cGenSpe1
      mother: cGenSpe2
  genome_size: ###########
  genome_size_method: null
  project: [ project1, project2, etc. ]

Write access to the AWS S3 Bucket

In order to get your data onto the AWS S3 bucket, you’ll need write access. Please see our AWS Credentials Guide and then reach out to get access. Generally speaking, we will grant write access to a supplementary bucket (genomeark-upload). Once you’ve uploaded your data there, notify us so we can move it to the primary bucket (genomeark).

You will be required to upload the data in a specific structure. If a mistake is made, you can remove the offending file(s) using aws s3 rm; however, we urge caution, especially with the use of the --recursive flag (akin to rm -r on Linux). We strongly encourage you to use the --dryrun flag to test your commands before executing them since a mistake could result in deleting the data others are uploading too. Please see the section on bucket structure to learn how your data must be organized.
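As a cautious illustration of previewing a removal, here is a small sketch. The AWS_CMD variable, the preview_remove function, and the prefix guard are all hypothetical conveniences invented for this example; by default the sketch only prints the command so you can read it before running anything.

```shell
# Print commands instead of running them unless AWS_CMD is set to "aws".
# AWS_CMD is a hypothetical variable used only in this sketch.
AWS_CMD="${AWS_CMD:-echo aws}"

# Hypothetical helper: preview a recursive removal below your own upload
# directory. --dryrun makes the AWS CLI list what would be deleted without
# deleting anything.
preview_remove() {
  prefix=$1
  case "$prefix" in
    s3://genomeark-upload/incoming/?*) ;;
    *)
      echo "refusing: not below s3://genomeark-upload/incoming/" >&2
      return 1
      ;;
  esac
  $AWS_CMD s3 rm "$prefix" --recursive --dryrun
}

preview_remove "s3://genomeark-upload/incoming/1234--T2T-Hippogriff_Upload/transcriptomic_data/"
```

Only after the dry-run output lists exactly the files you expect should you repeat the command without --dryrun.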

Bucket Structure

Please see this description of the bucket’s structure. Please also note the following items:

PacBio HiFi Data

Please see these notes about kinetics and methylation tags in the hifi_reads.bam file.

We also encourage you to upload the subreads BAM files because they are helpful for calling bases with DeepConsensus later. This is especially important if DeepConsensus has never been run for your dataset; however, having the option to re-call bases with an updated version of DeepConsensus in the future is also nice.

BAM/CRAM vs FASTQ

Many groups provide gzipped FASTQs in addition to the BAMs for convenience, especially during the analysis phase of a project, even though this wastes space. In some cases, it doesn’t matter which format is provided. However, some data types encode extra information in BAM/CRAM files that isn’t available in FASTQ (e.g., methylation data in PacBio HiFi or ONT data). In such cases, we prefer BAM/CRAM and leave it up to you whether to also include FASTQ. Please gzip (or, preferably, bgzip) all FASTQ files.
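A quick way to check FASTQ compression before uploading might look like the sketch below. The function names are hypothetical, and the sketch assumes FASTQ files end in .fq or .fastq (plus .gz once compressed) and that GNU find and gzip are available.

```shell
# Hypothetical helper: list FASTQ files that are not yet compressed.
# -L follows symbolic links, in case the upload directory is built from links.
list_uncompressed_fastq() {
  find -L "$1" -type f \( -name '*.fq' -o -name '*.fastq' \) | sort
}

# Hypothetical helper: succeed only if every compressed FASTQ passes gzip's
# integrity test. bgzip output is also valid gzip, so the same test applies.
check_compressed_fastq() {
  find -L "$1" -type f \( -name '*.fq.gz' -o -name '*.fastq.gz' \) -exec gzip -t {} +
}

# Usage (the directory name is a placeholder):
#   list_uncompressed_fastq ./upload_staging
#   check_compressed_fastq ./upload_staging && echo "all compressed FASTQs are intact"
```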

AWS Command-line Interface

The only practical way to get large data files and large data sets into the Genomeark AWS S3 bucket is via the AWS CLI (Command-line Interface). You can find documentation and other information about the software on Amazon’s AWS website. For the purposes of getting your data on Genomeark, you can safely ignore most of the AWS CLI. See our brief AWS CLI primer for the most relevant commands and a few examples. If you need assistance, please reach out.

Prepare your data for upload

Technically, you need not do any data preparation. You may copy every needed file (using aws s3 cp) one at a time specifying the location and name of the file on the S3 bucket. This requires no up-front work, but it results in a time-consuming, error-prone copy procedure. In most cases, the best approach is to organize your data in advance and copy it all in one go (using aws s3 sync). This requires some up-front work, but makes the copy procedure simple. In some cases, it may make more sense to split the data transfer into multiple sync processes to (a) reduce transfer times via additional parallelization and/or (b) copy large subsets of the data from disparate locations. In this documentation, however, we’ll assume it will be done all at once.
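For the multi-process case, one possible arrangement is to start one sync per top-level subdirectory. This is a sketch under stated assumptions rather than a recommended recipe: AWS_CMD and sync_in_parallel are hypothetical names, commands are only printed by default, and files sitting directly in the top-level directory are not covered.

```shell
# Print commands instead of running them unless AWS_CMD is set to "aws".
# AWS_CMD is a hypothetical variable used only in this sketch.
AWS_CMD="${AWS_CMD:-echo aws}"

# Hypothetical helper: start one "aws s3 sync" per top-level subdirectory of
# LOCAL_DIR, all in the background, then wait for every one to finish.
# Exit codes of the individual jobs are not collected, so check their output.
sync_in_parallel() {
  local_dir=${1%/}
  dest=${2%/}
  for sub in "$local_dir"/*/; do
    [ -d "$sub" ] || continue
    name=$(basename "$sub")
    $AWS_CMD s3 sync "$sub" "$dest/$name/" &
  done
  wait
}

# Usage (both arguments are placeholders):
#   sync_in_parallel ./upload_staging s3://genomeark-upload/incoming/1234--T2T-Hippogriff_Upload/
```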

The following is a description of the general process and a few miscellaneous guidelines.

Data Preparation & Upload Process

These steps outline the data preparation and upload process:

Prepare a local directory with copies of your data with filenames and subdirectories matching exactly how it is expected to appear on Genomeark.
- Instead of creating copies of your data, you can create links to your existing files using ln -s.
- In most cases, this will mean creating and populating a directory that will correspond to the /species/{Genus}_{species}/{ToLID}/ directory on Genomeark.
- Do not create empty directories on Genomeark. For example, if you are uploading genomic DNA data and no transcriptome data, you should not create /species/{Genus}_{species}/{ToLID}/transcriptomic_data/.
Generate a UUID (Universally Unique IDentifier) and choose a descriptive name for your dataset.
- You will upload your data to the genomeark-upload bucket into /incoming/ in a new directory named with your new UUID and descriptive name, separated by --. For example, if your UUID were “1234” (it will be longer than that) and your descriptive name were “T2T-Hippogriff_Upload”, the directory name would be /incoming/1234--T2T-Hippogriff_Upload/.
- Generating a UUID is easy. You can find online tools or run uuidgen on the command line.
- Your descriptive name can be any string (no whitespace), though try to keep it short. The following are examples:
  - fCarIgn1_seq_data
  - stonefly_HiFi-ONT
  - JHU-Maize-Verkko-asm
Use aws s3 sync to mirror the local directory onto the S3 bucket in the aforementioned directory (named using the UUID and descriptive name). The destination S3-URI will be: s3://genomeark-upload/incoming/1234--T2T-Hippogriff_Upload/.
Notify us that your data is ready. We will move your data to the primary bucket. Note, this means that the copy you uploaded will no longer exist in the genomeark-upload bucket.
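The steps above can be sketched in shell as follows. Everything here is illustrative: the function names, the AWS_CMD variable, and the ./upload_staging directory are hypothetical, and reproducing the species/{Genus}_{species}/{ToLID}/ hierarchy inside the staging directory is an assumption you should confirm before uploading. By default the sketch only prints the sync command.

```shell
# Print commands instead of running them unless AWS_CMD is set to "aws".
# AWS_CMD is a hypothetical variable used only in this sketch.
AWS_CMD="${AWS_CMD:-echo aws}"

# Step 1: local path mirroring /species/{Genus}_{species}/{ToLID}/.
species_dir() {   # usage: species_dir STAGING GENUS SPECIES TOLID
  echo "${1%/}/species/${2}_${3}/${4}"
}

# Step 2: generate a UUID (uuidgen if present, otherwise a Linux-only fallback).
new_uuid() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen
  else
    cat /proc/sys/kernel/random/uuid
  fi
}

# Step 2: join the UUID and descriptive name with "--"; reject whitespace.
upload_uri() {    # usage: upload_uri UUID NAME
  tab=$(printf '\t')
  case "$2" in
    *" "*|*"$tab"*)
      echo "descriptive name must not contain whitespace" >&2
      return 1
      ;;
  esac
  echo "s3://genomeark-upload/incoming/${1}--${2}/"
}

# Step 3: mirror the staging directory. --dryrun only lists what would be
# copied; drop it once the listing looks right.
sync_preview() {  # usage: sync_preview STAGING URI
  $AWS_CMD s3 sync "${1%/}/" "$2" --dryrun
}

# Populate the staging directory first, for example with links:
#   mkdir -p "$(species_dir ./upload_staging Genus species cGenSpe1)"
#   ln -s /path/to/existing/file "$(species_dir ./upload_staging Genus species cGenSpe1)/"
uri=$(upload_uri "$(new_uuid)" "cGenSpe1_seq_data")
sync_preview "./upload_staging" "$uri"
```

Keep a note of the generated UUID and descriptive name, since you will need them when you notify us.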

Please see this Easy Step-by-Step Guide.

Misc. Data Preparation Guidelines

Please note the following guidelines:

It’s good practice to have a README file in each major directory describing the files and naming scheme and providing a history of the data.
It’s also good practice to have a file (e.g., files.md5) with MD5 sums for all data files. Having a single file with MD5 sums per directory is preferred over having a *.md5 file per file (e.g., file1.fq.gz, file1.fq.gz.md5, file2.bam, file2.bam.md5, etc.).
Please upload (b)gzip-compressed data whenever applicable, especially if the files are large. Specifically, all FASTA and FASTQ files should be gzipped, preferably bgzipped (for convenient indexing; upload the indexes if you generated them). Upload BAM or CRAM files instead of SAM. Bgzip VCF files.
Whenever possible, it’s strongly preferred to have uniformity among similar files. For example, say you upload 3 cells of PacBio HiFi data. Try to name them consistently (e.g., ${movie}.hifi_reads.bam instead of ${movie_1}.hifi_reads.bam, cell2.hifi-reads.bam, & ${species}.ccs.bam). Similarly, if one file has methylation tags, another has kinetics tags, and the last has neither, this is confusing. At a minimum, such discrepancies should be noted in an accompanying README. Ideally, all files would be prepared identically such that they share the same types of information; if that means regenerating the HiFi BAM file to add kinetics and/or re-calling methylation with Primrose, please consider doing so. These examples are for HiFi data, but the same principles apply for other data types too.
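For the single MD5 file per directory suggested above, a minimal sketch could look like this. The function names are hypothetical, and the sketch assumes GNU coreutils md5sum is installed (macOS ships a different md5 tool); it covers only the files directly inside the given directory.

```shell
# Hypothetical helper: write one files.md5 for the files directly inside DIR.
# [ -f ] also accepts symbolic links that point at regular files.
write_md5_manifest() {
  (
    cd "$1" || exit 1
    : > files.md5.tmp
    for f in *; do
      [ -f "$f" ] || continue
      case "$f" in
        files.md5|files.md5.tmp) continue ;;
      esac
      md5sum "$f" >> files.md5.tmp
    done
    mv files.md5.tmp files.md5
  )
}

# Hypothetical helper: re-check every file listed in DIR/files.md5.
verify_md5_manifest() {
  ( cd "$1" && md5sum -c files.md5 >/dev/null )
}

# Usage (the directory name is a placeholder):
#   write_md5_manifest ./upload_staging/some_directory
#   verify_md5_manifest ./upload_staging/some_directory && echo "checksums match"
```

Running the verification again after the upload, against a fresh download, is a simple way to confirm that nothing was corrupted in transit.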