This is a DRAFT. Please send comments to #data-coord on Slack.

HiFi Reads BAM tags

We strongly encourage you to include kinetics tags in your hifi_reads.bam file. This can be done automatically with the correct switch being toggled when the run is performed. The purpose of retaining the kinetics tags is to enable calling methylation in the future. Accordingly, we also encourage you to call methylation (e.g., with Primrose), which will create a new BAM file with ML and MM tags. If your core is willing and able to call methylation for you, it can be done automatically with the correct run setup in SMRT Link. In such a case, we recommend toggling the switch to keep the kinetics tags in the resultant BAM file. In this situation, the final BAM file will have both kinetics tags and methylation tags. Why keep the kinetics tags in the final BAM file if you already have methylation tags? If some future method (i.e., an update to Primrose or a new tool) improves the methylation calling, you will have the option to re-call methylation if you retained the kinetics tags.

From PacBio’s file format documentation, you can learn more about kinetics tags and methylation tags for HiFi reads. In the SMRT Link User Guide (PDF) (v11.1 from Nov. 2022 is the most recent at the time of this writing), you can find information about how to configure a run to include kinetics tags in the HiFi BAM, call methylation with Primrose, and retain kinetics tags from the original HiFi BAM in the resulting HiFi BAM with methylation information.

Depending on your setup and whether you or your core ran ccs and/or Primrose, it is possible that you will have multiple nearly-identical HiFi BAMs, differing only by which tags are present (e.g., {movie}.hifi_reads.bam, {movie}.hifi_reads.with-kinetics.bam, and {movie}.hifi_reads.5mC.bam). Assuming your final HiFi BAM has both kinetics and methylation tags, you can save space by deleting the other variants with (a) no kinetics or methylation tags, (b) only kinetics tags, or (c) only methylation tags.

Whatever you do, please document it in the README.

Examples

Least Efficient Process

Here is an example of how to use the most disk space and take the most time to get all the tags (i.e., what not to do, if possible):

# 1. do the sequencing run w/o the toggle activated to keep the kinetics tags
# in the hifi bam. For most people, this step is done by the sequencing core.
> ls
  # {movie}.subreads.bam
  # {movie}.subreads.bam.pbi
  # {movie}.hifi_reads.bam              # <-- no kinetics tags!
  # and some other files

# 2. Re-run CCS to include kinetics tags
> ccs --hifi-kinetics [other-options] {movie}.subreads.bam {movie}.hifi_reads.with-kinetics.bam
> ls
  # {movie}.subreads.bam
  # {movie}.subreads.bam.pbi
  # {movie}.hifi_reads.bam
  # {movie}.hifi_reads.with-kinetics.bam # <-- new! Has kinetics tags
  # and some other files

# 3. Run Primrose
> primrose {movie}.hifi_reads.with-kinetics.bam {movie}.hifi_reads.5mC.bam
> ls
  # {movie}.subreads.bam
  # {movie}.subreads.bam.pbi
  # {movie}.hifi_reads.bam
  # {movie}.hifi_reads.with-kinetics.bam
  # {movie}.hifi_reads.5mC.bam           # <-- new! Has methylation tags, but no kinetics tags
  # and some other files

# 4. Observe that the following files are identical except for which tags are/aren't present:
# {movie}.hifi_reads.bam                 # <-- no tags
# {movie}.hifi_reads.with-kinetics.bam   # <-- kinetics tags only
# {movie}.hifi_reads.5mC.bam             # <-- methylation tags only

More Efficient Process

Instead, you could do the following:

# 1. Do the sequencing run with the toggle activated to keep the kinetics tags
# in the hifi bam (effectively merges steps #1 & #2 above into a single step on
# the instrument or other connected computational resource). For most people,
# this step is done by the sequencing core.
> ls
  # {movie}.subreads.bam
  # {movie}.subreads.bam.pbi
  # {movie}.hifi_reads.bam             # <-- w/ kinetics tags this time!
  # and some other files

# 2. Run primrose, passing through the kinetics tags
> primrose --keep-kinetics {movie}.hifi_reads.with-kinetics.bam {movie}.hifi_reads.5mC.bam
> ls
  # {movie}.subreads.bam
  # {movie}.subreads.bam.pbi
  # {movie}.hifi_reads.bam             # <-- Has kinetics tags
  # {movie}.hifi_reads.5mC.bam         # <-- new! Has methylation and kinetics tags
  # and some other files

# 3. Remove version w/o all tags, possibly renaming the final file
> rm {movie}.hifi_reads.bam
# and/or
> mv {movie}.hifi_reads.5mC.bam {movie}.hifi_reads.bam
> ls
  # {movie}.subreads.bam
  # {movie}.subreads.bam.pbi
  # {movie}.hifi_reads[.5mC].bam     # <-- Has methylation and kinetics tags
  # and some other files

# 4. Record what tags are available in each file in the README
> vim README

This assumes your core is (a) willing to configure the run to add the kinetics to the HiFi BAM (increases file size, but not computational load) and (b) unwilling/unable to run Primrose for you on the instrument or other connected computational resource (note that being unwilling to do this is probably a computational or other resource constraint, not unhelpfulness). If your core is able to call methylation (via Primrose) for you, ask them to configure the run to keep the kinetics tags in the Primrose output BAM. The commands run by SMRT Link “under the hood” will match steps #1 & #2 in the provided example; thus, having your core call methylation for you is not more computationally efficient than you doing it yourself, but it is a lot more convenient if you have the option. Note that they might still provide you with two HiFi BAMs (i.e., one with kinetics tags and another with both kinetics tags and methylation tags), but you can delete the unneeded version.

Which tags do my files have?

“Too late! My core already finished the run and sent me the files.” If you’re wondering which tags are in your HiFi BAM file(s), you can check with samtools. You can usually see which commands were run in the BAM header via @PG tags. The most definitive way to identify which tags your file has is by looking at the individual reads. If you find you are missing desired information that you cannot generate yourself from the existing data, you can always ask your core if they still have the needed prerequisites from your sequencing run. You might get lucky!

Checking the BAM Header

You can check the BAM header via the following command:

samtools head some_file.bam

Search for the @PG header(s), usually found near the end. Look for the ccs command and check whether the --hifi-kinetics or --all-kinetics flags were supplied; if so, your file probably has kinetics tags (unless methylation was also called and the kinetics tags were not retained in that step). To determine if methylation was called, look for the primrose command. The --keep-kinetics flag will tell you that the kinetics tags were retained alonside the methylation tags.

Checking the Reads

The most definitive way to determine which tags your file has is by checking the actual reads. We recommend looking at a subset of this output to avoid getting many gigabytes worth of data sent to your output file or the screen. It can also be helpful to replace tabs with newlines to avoid side-scrolling through long columns.

samtools view some_file.bam | head -n 1 | tr '\t' '\n' | less -S

Search for the MM and ML tags to determine if you have methylation information. To determine whether you have kinetics information, search for either of the following tag sets: (a) pw and ip for single-stranded reads or (b) fi, ri, fp, and rp otherwise.

Misc. Notes

All this information applies to HiFi sequencing in the Sequel II era. Things may be different for HiFi reads generated from a Revio instrument. Consider the principles underlying the discussion on this page and consult with your sequencing provider during the planning stage of your project.