Skip to content

covtobed

Andrea Telatin edited this page Mar 2, 2020 · 10 revisions

What does covtobed do

covtobed reads a sorted BAM file (index is not required, and indeed the BAM can be streamed) and outputs the coverage intended as sequence coverage. The goal is to detect covered and uncovered regions, and thus the CIGAR is not parsed: deletions from the sequences implies that the deleted bases are not covered in the specific sample, but are covered by the specific sequencing. The importance of the deletion is the goal of variant callers, while for the purpose of detecting uncaptured regions, it's better to consider deletions as covered.

For example: if a target enrichment kit aims at capturing a gene, we are interested in understanding if the capture step is effective and we want to detect uncovered regions, not biological deletions.

Implementation

The code is object-oriented, including an Input class handling the reading, parsing and filtering of alignments and an Output class handling coverage filtering and writing in different formats. The main algorithm is based on a priority queue from the standard library and is both fast and memory efficient. covtobed relies on libbamtools for BAM file parsing, and "cpp-optparse" for command-line option parsing.

covtobed does not parse the CIGAR to evaluate, and thus deletions are not counted as lack of coverage.

When parsing overlappig paired-ends the coverage is counted twice in the overlap using the default settings, but - as expected - will be only counted once in physical coverage mode. This is because the standard mode will count how many times a base has been read.

Details on coverage calculation

Each mapped read will increment the coverage on the covered region (i. e. deletions are counted as covered). The program supports a mapping quality filter, but no user-defined filters on SAM flags are currently supported, and in the spirit of the tool the filter should be placed upstream as a samtools view -f FLAG | covtobed. Using the -a flag will ignore non-primary alignments, alignments failing vendor QC and PCR duplicates. mosdepth will perform a filter on these flags by default.

When calculating the physical coverage the whole region between the two pairs, including the two mapped reads, is counted once. If an overlap between the two reads exists, it's still counted once.

If a reference chromosome has no alignments at all, it will be printed as being uncovered. The reference chromosomes are printed in the same order as they appear in the SAM header (as it happens with mosdepth, while bedtools will print the empty chromosome(s) at the end).

Features

  • Stream from and to pipes, like:
bwa mem genome.fa reads.fq | samtools view -bS | \
  covtobed --max-cov 1 | bedtools merge > low_coverage.bed
covtobed --output-strands  panel.bam > coverage.bed
covtobed --output-strands  tradis.bam  | \
  find_peaks.py --forward > upregulated_regions.bed
covtobed --physical-coverage contigs.bam > contigs_physcov.bed
Clone this wiki locally