Exon coverage

../_images/exonCoverage.png

Exon Coverage module in Sequence Miner

Example use case

The Exon coverage query reports sequence read coverage and additional attributes for each exon. One or more subjects can be selected, and results are reported separately for each subject. The analysis is based on a pileup from the original BAM files and is presented in terms of the following:

  • The average depth of coverage over each exon, on a per subject basis

  • The fraction of each exon with coverage below specific thresholds; specifically the fraction of each exon with coverage less than 5, less than 10, and so on up to 30

Exonic regions are defined by Ensembl or RefGenes (both are obtained from the VEP and Ensembl release shown in the version.txt file in the ref directory, or the Version Information link in CSA). Selection between the Ensembl and RefGene reference gene sets is determined by the gene reference set input argument.

Approximating the coverage per exon based on segment coverage

When BAM files for a sample are imported into the WuXi NextCODE system, a segment coverage file ( segment_cov.gord ) is generated with the total read depth for each base. This depth is binned to the nearest interval to reduce file size and improve query speed. Therefore, the results from this report builder are slightly approximated.

Limiting the input list of subjects to maintain a reasonable output report size

The output report can be quite large, with a total number of rows proportionate to the number of input PNs. In order to avoid “out of memory” issues on the local computer, the number of input subjects is limited to 20 or fewer. In order to analyze a larger number of subjects, the report can be run multiple times on batches of 20.

Description of the algorithm

../_images/exonCoverage-usecase.png

Interpreting the output

Each row of output lists attributes and quality statistics per exon for each gene of interest in a single subject (PN). These values include the genomic coordinates of the exon, the exome size (length of the exon in bp), average depth over the exon, Ensembl gene and transcript annotation, the biotype of the exon, and a breakdown of the proportion of the exon with different degrees of coverage. The metrics used to flag regions of low coverage are lt5, lt10, lt15, lt20, lt25, and lt30. Each metric indicates a given coverage threshold, with values between between 0 and 1. For instance, a value of “0.05” for lt10 indicates that 5% of the exon is covered by fewer than 10 sequence reads, and therefore 95% of the exon has at least 10X coverage.

Input parameters

Basic parameters

Input

Description

Values

genome_range

All (whole genome) / selected region (from an open Genome Browswer window)

Inputs are defined by any open Genome Browser window

subjects

Subjects may be selected individually or as a list

A grid with the first column named “PN” must be open in order to select a list of subjects

gene_list

Filter to include only variants in the gene list

The list should be open in a grid tab containing a single column of gene symbols with a header labeled “gene_sybmol”

exon_reference

Choose whether to limit to coding exons

Select Coding only or Coding and UTR

gene_reference_set

Choose between Ensembl or RefGene reference sets

Select Ensembl or RefGene

Advanced parameters

Input

Description

Values

long_running_query

This query will run as a long running job if set to “Yes”

Yes or No

running_time

Time in hours until long running job will be cancelled

1, 2, 4, 8 hours

ref_path

Path to the reference data directory

Default: ref

Column descriptions

Output column descriptions

Group

Column name

Description

Basic

Chrom

Chromosome

pn

The subject ID (identifier)

Gene

biotype

Biological class of gene as annotated by Ensembl

stable_id

Ensembl stable ID for the gene

symbol

Based on HGNC when it exists, otherwise it is the Ensembl internal alias

Other columns

avg_depth

The average read depth of the specific exon

Exon_size

The number of base pairs comprising the given exon

exon

Ensembl exon ID

exonend

Coordinate for the last base of the exon

exonstart

Coordinate for the first base of the exon

lt10

The fraction of the exome with sequence read coverage less than 10X

lt15

The fraction of the exome with sequence read coverage less than 15X

lt20

The fraction of the exome with sequence read coverage less than 20X

lt25

The fraction of the exome with sequence read coverage less than 25X

lt30

The fraction of the exome with sequence read coverage less than 30X

lt5

The fraction of the exome with sequence read coverage less than 5X

set_Transcript_stable_id

Comma-separated list of Ensembl transcript IDs related to the specific exon

strand

Plus/minus of the DNA strand encoding the exon sequence

Transcript_biotype

Biological class of transcript