Sample QC and statistics

The Sample QC and statistics report calculates variation statistics for autosomal chromosomes and gene statistics for autosomes and sex chromosomes. The QC statistics are reported for selected samples as compared to the distribution of the QC attribute values for all samples in the project.

Figure: Sample QC and Statistics module in Sequence Miner

Example use case

The user wishes to generate a detailed report of variant and gene QC statistics for a list of selected subjects.

Description of the algorithm

For each subject loaded into the WuXi NextCODE platform, several files are generated, including BAM files, VCF files, per-base coverage files, a good coverage locus file, and VEP scores for variants. This report builder draws from all of these files to generate a detailed QC report.

In addition to the Default view perspective, there are five other Perspective views focusing on exome and gene coverage, and on a breakdown of variant attributes per sample. Frequencies, distributions, and additional statistical analyses are generated for attributes.

Interpreting the output

The Default view perspective reports all QC statistics and displays results for nine analyses:

  • SNP vs InDel Analysis: The fraction of variants identified as “SNP” or “InDel”; the proportions of SNPs and InDels are expected to be > 0.96 and > 0.03, respectively.

  • dbSNP Analysis: The fraction of variants that were found in dbSNP; the typical proportions of “absent” and “present” dbSNP variants are expected to be < 0.1 and > 99.9, respectively.

  • exomeCoverage: Coverage for exons in the exome of a given sample is determined by mapping the transcript coordinates from the reference gene set (Ensembl or Refgene) to a per-base coverage file. The coverage across the exome is reported as a value between 0 and 1 that indicates the proportion of the exome that has less than the given coverage value (e.g., exlt10 = fraction with less than 10X coverage). This is intended to flag samples with a higher than usual proportion of the exome with low coverage (e.g., a value of 0.8 for exlt10 indicates that 80% of the exome for that sample has less than 10X coverage).

  • frequency Analysis: The fraction of variants found in each of the following categories: “veryrare” (AF < 0.1%), “rare” (AF 0.1% - 1%), “medium” (AF 1% - 5%), and “common” (AF > 5%).

  • geneCoverage: Coverage for each gene is determined by mapping the per-base coverage file for each sample to the coordinates of the genes in the selected reference set (Ensembl or Refgene). For each gene, the percentage of the gene that has a minimum level of coverage (either 10X or 20X) is calculated. The report then gives, as a value between 0 and 1, the proportion of all genes in the sample that have 85%, 90%, or 95% of the gene covered at 10X or 20X.

  • impact Analysis: The fraction of variants found in each of the following VEP impact categories:
    • HIGH impact (typically expected to contain less than approximately 0.02 of variants)

    • MODERATE impact (typically expected to contain approximately 0.40-0.45 of variants)

    • LOW impact (typically expected to contain approximately 0.40-0.45 of variants)

    • LOWEST impact (typically expected to contain approximately 0.10-0.15 of variants)

  • quality Analysis: Variant call quality, either “LowQual” or “PASS”; the proportions of LowQual and PASS variants are expected to be < 0.1 and > 99.9, respectively.

  • transition transversion Analysis: A transition refers to a purine to purine substitution (A->G) or pyrimidine to pyrimidine substitution (C->T). A transversion refers to a purine to pyrimidine substitution (A->C, A->T, G->C, or G->T) or vice versa. The transition-to-transversion calculation provides a readout of the total number of transitions called versus the total number of transversions called.

  • zygosity Analysis: The fraction of variants that are heterozygous (het) and homozygous (hom); expected to be approximately 0.6 and 0.4, respectively
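Three of the calculations described in the list above can be illustrated with a minimal Python sketch. This is not the platform's implementation: the function names, the in-memory input shapes (a list of per-base depths, a list of ref/alt pairs), and the handling of values that fall exactly on a category boundary are all assumptions made for illustration.

```python
from typing import Iterable, Sequence, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def fraction_below(depths: Sequence[int], threshold: int) -> float:
    """Fraction of bases with coverage strictly below `threshold` (e.g. exlt10)."""
    if not depths:
        raise ValueError("at least one base is required")
    return sum(1 for d in depths if d < threshold) / len(depths)


def frequency_category(af: float) -> str:
    """Bin an allele frequency given as a fraction (0.001 means 0.1%).

    Assumption: a value exactly on a boundary goes to the lower category.
    """
    if af < 0.001:
        return "veryrare"
    if af <= 0.01:
        return "rare"
    if af <= 0.05:
        return "medium"
    return "common"


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    bases = {ref, alt}
    if ref == alt or not bases <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-base substitution: {ref}->{alt}")
    return bases <= PURINES or bases <= PYRIMIDINES


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    ti = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions: ratio is undefined")
    return ti / tv


# Two of four bases are below 10X, so exlt10 would be 0.5 for this toy input.
print(fraction_below([5, 12, 30, 8], 10))                  # 0.5
# Two transitions (A->G, C->T) against one transversion (A->C).
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))   # 2.0
```

In the report itself these values are derived from the per-base coverage and VCF files of each sample; the sketch only shows the arithmetic behind the reported fractions and ratios.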

The attribute measurement for each subject is compared to its distribution across the other input SUBJECTS. The position of the measurement within that distribution is expressed as a Z-score, the number of standard deviations from the mean value for a given attribute. The Z-score value is displayed in the all_z_proportion column. The corresponding rank (LOW, NORMAL, or HIGH) of the attribute value for each SUBJECT, compared to the distribution of all the input SUBJECTS selected for this analysis, is displayed in the all_InDistribution column.

  • If the Z-score (the value in the all_z_proportion column, that is, the number of standard deviations away from the mean) is < -2.0 (more than 2 standard deviations BELOW the mean), the attribute is categorized as falling at the LOW end of the distribution of all the input SUBJECTS (designated in the all_InDistribution column).

  • If the Z-score is > +2.0 (more than 2 standard deviations ABOVE the mean), the attribute is categorized as falling at the HIGH end of the distribution of all the input SUBJECTS.

  • If the Z-score is between -2.0 and +2.0 (within 2 standard deviations of the mean), the attribute is categorized as falling within the NORMAL range of the distribution of all the input SUBJECTS.
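The Z-score rules above can be sketched in a few lines of Python. The sketch rests on several assumptions that the text does not settle: the mean and standard deviation are taken over all input subjects (including the subject being scored), the population standard deviation is used, and a Z-score of exactly ±2.0 counts as NORMAL. The subject identifiers and values are made up.

```python
from statistics import mean, pstdev
from typing import Dict


def z_scores(values: Dict[str, float]) -> Dict[str, float]:
    """Z-score of each subject's attribute value against all input subjects."""
    mu = mean(values.values())
    sigma = pstdev(values.values())  # assumption: population standard deviation
    if sigma == 0:
        # All subjects share the same value; no subject deviates from the mean.
        return {pn: 0.0 for pn in values}
    return {pn: (v - mu) / sigma for pn, v in values.items()}


def in_distribution(z: float, cutoff: float = 2.0) -> str:
    """Map a Z-score to the LOW / NORMAL / HIGH label."""
    if z < -cutoff:
        return "LOW"
    if z > cutoff:
        return "HIGH"
    return "NORMAL"


# Hypothetical het fractions: nine subjects at 0.60 and one outlier at 0.90.
het_fraction = {f"PN{i}": 0.60 for i in range(1, 10)}
het_fraction["PN10"] = 0.90

for pn, z in z_scores(het_fraction).items():
    print(pn, round(z, 2), in_distribution(z))
# PN10 sits about 3 standard deviations above the mean and is labelled HIGH;
# the other nine subjects are labelled NORMAL.
```

With only a handful of subjects the standard deviation is dominated by any single outlier, so the labels are most informative when a reasonably large set of subjects is selected.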

Column descriptions

Report output columns and descriptions

| Group | Column name | Description |
| --- | --- | --- |
| Basic | Chrom | |
| Basic | PN | |
| All | avg_avg_depth | Across all samples in the analysis, the average of the per-sample averages of the 85%, 90%, and 95% gene depth coverage at greater than 10 and 20 reads |
| All | avg_depth | Per sample, the average of the 85%, 90%, and 95% gene depth coverage at greater than 10 and 20 reads |
| All | avg_proportion | For each gene depth measurement category, the average value across all selected SUBJECTS (only visible in the GeneStatGraphs perspective) |
| All | Color | “Green” if the value falls within two standard deviations of the mean; “Orange” if the value falls between 2 and 3 standard deviations from the mean; “Red” if the value is 3 or more standard deviations away from the mean. Only visible in the GeneStatTable perspective |
| All | exomeSize | Total exome size of the sample (reported in base pairs) |
| All | InDistribution | Displays where the proportion value of the sample falls in the distribution of the values for all samples in the project: Low is more than two standard deviations below the mean or in the lowest 5%; High is more than two standard deviations above the mean or in the highest 5%; Norm is any value within two standard deviations of the mean |
| All | lowOReqRankFromTop | Samples are ranked proportionally on a scale of 0 to 1 (each step in rank is 1 / number of samples), with 0 being the highest rank, assigned to the sample with the highest attribute value |
| All | lowOReqRankFromBottom | Samples are ranked proportionally on a scale of 0 to 1, with 0 being the highest rank, assigned to the sample with the lowest attribute value |
| All | numberOfGenes | Total number of genes found in the sample for the “All” or “Candidate” genes designation |
| All | PNcount | The total count of samples in the analysis |
| All | proportion | Percentage of total variations having the designated value in the attribute column for the analysis performed (value between 0 and 100) |
| All | rank_perc_FromTop | Samples are ranked in order based on their attribute values (e.g., 1 - 21 for 21 samples), with 1 being the highest rank, assigned to the sample with the highest attribute value |
| All | rank_perc_FromBottom | Samples are ranked in order based on their attribute values, with 1 being the highest rank, assigned to the sample with the lowest attribute value |
| All | ratio | The ratio of the indicated numerator (e.g., SNPs, transitions, heterozygous) to the denominator (e.g., InDels, transversions, homozygous) |
| All | std_avg_depth | Standard deviation of the average coverage at all exome and gene coverage values across all SUBJECTS selected for the analysis |
| All | std_proportion | For each gene depth measurement category, the standard deviation of the average value across all selected SUBJECTS (only visible in the GeneStatGraphs perspective) |
| All | totalVars | Total variants contributing to the ratio calculation |
| All | z_proportion | Z-score value |
| Other columns | analysis | |
| Other columns | Attribute | |
| Other columns | bpStart | |
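As an illustration of the Color and rank columns described above, the following Python sketch maps a Z-score to a color term and derives both an ordinal rank (as in rank_perc_FromTop) and a proportional rank (as in lowOReqRankFromTop). It is a sketch under assumptions, not the report's actual logic: the function names are hypothetical, a distance of exactly 2 standard deviations is treated as “Green”, the proportional rank is taken to be (ordinal rank - 1) / number of samples, and ties are broken by input order.

```python
from typing import Dict, Tuple


def color_for_z(z: float) -> str:
    """Color term for a value that lies |z| standard deviations from the mean."""
    distance = abs(z)
    if distance >= 3.0:
        return "Red"
    if distance > 2.0:
        return "Orange"
    return "Green"


def ranks_from_top(values: Dict[str, float]) -> Dict[str, Tuple[int, float]]:
    """Return {sample: (ordinal rank, proportional rank)}, highest value first.

    The ordinal rank runs from 1 to N. The proportional rank starts at 0 for
    the highest value and increases in steps of 1 / N (an assumed reading of
    the column description).
    """
    ordered = sorted(values, key=values.get, reverse=True)
    n = len(ordered)
    return {pn: (i + 1, i / n) for i, pn in enumerate(ordered)}


# Hypothetical attribute values for four samples.
coverage = {"PN1": 0.4, "PN2": 0.7, "PN3": 0.5, "PN4": 0.6}
print(ranks_from_top(coverage))
# {'PN2': (1, 0.0), 'PN4': (2, 0.25), 'PN3': (3, 0.5), 'PN1': (4, 0.75)}
print(color_for_z(-2.4))  # Orange
```

The FromBottom variants would follow the same pattern with the sort order reversed, so that the sample with the lowest attribute value receives rank 1 (or 0 on the proportional scale).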

Perspective views

In addition to the Default view perspective, results are displayed in five other perspectives focusing on exome and gene coverage, and on a breakdown of variant attributes per sample. Frequencies, distributions, and additional statistical analyses are generated for attributes.

Perspectives

| Perspective | Description |
| --- | --- |
| Default view | Reports all QC statistics and displays results for the 9 analyses described above |
| ExomeCovTable | Reports the coverage statistics for exons |
| GeneStatGraphs | Reports gene coverage statistics |
| GeneStatTable | Reports gene coverage statistics |
| Ratios | Includes 3 categories of ratios listed in the analysis column (other analyses are filtered out): zygosity Analysis, SNP vs InDel Analysis, and transition transversion Analysis |
| VariantStat | Includes 6 categories of variant statistics listed in the analysis column: zygosity Analysis, SNP vs InDel Analysis, dbSNP Analysis, frequency Analysis, impact Analysis, and quality Analysis |