Variant QC and statistics

The Variant QC and statistics report builder generates a quality control report for sequence data in selected subjects. Quality parameters returned include coverage per variant, variant genotype, and allele frequencies across subjects, as well as Hardy-Weinberg equilibrium (HWE) p-values and -log p-values per variant.

[Figure: Variant QC and Statistics module in Sequence Miner]

Example use case

The user has a cohort of 300 cases and controls and wishes to evaluate overall data quality and to identify variants not in HWE.

Interpreting the output

Given a list of subjects, this dialog summarizes the following for each variant in these subjects:

  • The number of subjects failing QC at this locus (FailQc_All)

  • The number of subjects failing QC but having depth of coverage greater than the user-defined threshold at this variant locus (FailQc_NonDepth)

  • The number of subjects homozygous for this variant (sum_hom)

  • The number of subjects heterozygous for this variant (sum_het)

  • PNcount - The greater number of subjects calculated under the following two conditions:
    1. The number of subjects carrying this variant with GL likelihood score and call ratio meeting the user-defined thresholds

    2. The number of subjects for which the variant was called as het or hom, which tests GL likelihood score, call ratio, AND read depth

If the user chooses the option to include a Hardy Weinberg Equilibrium calculation, then the following columns are included in the output:

  • Chi-square value

  • Hardy Weinberg Equilibrium p-val

  • Hardy Weinberg Equilibrium log p-val

  • The frequency of hom subjects

  • The frequency of het subjects

  • The frequency of variant alleles among all alleles in the subjects

The sum_het and sum_hom columns report the number of heterozygous and homozygous carriers among the input subjects.

QC is categorized as failing for a given variant in a subject when any of the following parameters fail to meet the user-defined thresholds:

  • variant GL score (genotype likelihood score)

  • minimum read depth

  • call ratios for the het or hom call

The sum_FailQc_All column contains the number of selected subjects for which the variant fails the user-defined QC (quality) thresholds.

The sum_FailQc_NonDepth column reports the number of subjects that fail QC due to reasons other than read depth being less than the user-defined read depth threshold.
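The two failure counts can be illustrated with a minimal sketch. The record layout (`Call`), the threshold container (`Thresholds`), and the function name `summarize_qc` are hypothetical and are not part of the product. The sketch also assumes that a value passes when it is greater than or equal to its threshold, and that FailQc_NonDepth counts subjects whose depth passes but who fail on GL score or call ratio.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One subject's call at a variant locus (hypothetical record layout)."""
    gl_score: float    # genotype likelihood score
    depth: int         # read depth at the locus
    call_ratio: float  # call ratio for the het or hom call


@dataclass
class Thresholds:
    """User-defined QC thresholds (hypothetical container)."""
    min_gl_score: float
    min_depth: int
    min_call_ratio: float


def summarize_qc(calls, t):
    """Return (sum_FailQc_All, sum_FailQc_NonDepth) for one variant."""
    fail_all = 0
    fail_non_depth = 0
    for c in calls:
        # Assumption: ">=" is used for every threshold comparison.
        depth_ok = c.depth >= t.min_depth
        other_ok = (c.gl_score >= t.min_gl_score
                    and c.call_ratio >= t.min_call_ratio)
        if not (depth_ok and other_ok):
            fail_all += 1          # fails at least one threshold
        if depth_ok and not other_ok:
            fail_non_depth += 1    # fails, but not because of depth
    return fail_all, fail_non_depth
```

Under these assumptions sum_FailQc_NonDepth can never exceed sum_FailQc_All, and the difference between the two is the number of subjects with insufficient read depth.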

If “yes” is selected in the calculate_Hardy_Weinberg_Equilibrium (calcHWE) field, the output includes several additional columns. The chi-square test statistic (chisq column) is calculated from the heterozygous frequency, homozygous frequency, and total allele frequency, with one degree of freedom. The corresponding chi-square p-value (pVal column) is returned along with -log(p-value) (logPval column). The -log(p-value) can then be plotted in the Genome Browser as a Manhattan plot across the genome.
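For orientation, a standard one-degree-of-freedom HWE chi-square test can be sketched as follows. This is not the product's implementation. It assumes that all frequencies are taken over the total number of input subjects, that subjects who are neither het nor hom are counted as homozygous reference, and that the logarithm is base 10 (the documentation does not state the base). The function name `hwe_chi_square` is hypothetical.

```python
import math


def hwe_chi_square(sum_het, sum_hom, n_subjects):
    """Return (chisq, p_val, log_p) for one variant.

    Assumption: subjects that are neither het nor hom carry two
    reference alleles, so n_ref = n_subjects - sum_het - sum_hom.
    """
    n_ref = n_subjects - sum_het - sum_hom
    if n_subjects <= 0 or n_ref < 0 or sum_het < 0 or sum_hom < 0:
        raise ValueError("inconsistent genotype counts")

    # Variant allele frequency among all 2N alleles.
    q = (2 * sum_hom + sum_het) / (2 * n_subjects)
    p = 1.0 - q

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (p * p * n_subjects, 2 * p * q * n_subjects, q * q * n_subjects)
    observed = (n_ref, sum_het, sum_hom)

    # Pearson chi-square; classes with zero expectation are skipped
    # (this only happens when the site is monomorphic).
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square distribution with 1 degree of
    # freedom has the closed form erfc(sqrt(x / 2)).
    p_val = math.erfc(math.sqrt(chisq / 2))

    # Assumption: base-10 logarithm, as is usual for Manhattan plots.
    log_p = -math.log10(p_val) if p_val > 0 else math.inf
    return chisq, p_val, log_p
```

With genotype counts that match HWE exactly (for example 25 reference, 50 het, and 25 hom subjects out of 100) the statistic is 0 and the p-value is 1. A site with no heterozygotes and equal numbers of both homozygotes gives a large statistic and a very small p-value.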

Note

Expected counts for the HWE calculation are determined from the allele frequencies in the input subjects. Therefore, select the HWE calculation only when the number of samples is large. Otherwise, Fisher’s exact test is the recommended method for measuring the distribution of the heterozygotes and homozygotes.

Column descriptions

Report output columns and descriptions

Group         | Column          | Description
Basic         | Call            |
Basic         | Chrom           |
Basic         | POS             |
Basic         | Reference       |
sum           | FailQc_All      | The number of samples that fail at least one of the user-defined QC thresholds at this variant locus
sum           | FailQc_NonDepth | The number of samples that fail the user-defined QC thresholds other than read depth at this variant locus
sum           | het             | The number of samples heterozygous for the variant
sum           | hom             | The number of samples homozygous for the variant
Other columns | max_af          |
Other columns | PNcount         | The number of samples with good coverage (depth meeting the user-defined threshold) at this locus; if the variant is present, it meets the user-defined thresholds for variant GL score (genotype likelihood score) and call ratios for the het or hom calls

Additional columns

If the calculate_Hardy_Weinberg_Equilibrium (calcHWE) option is set to “yes”, the following columns are added:

Additional output columns and descriptions

Group         | Column     | Description
HWE           | alleleFreq | (2 * sum_hom + sum_het) * 0.5 / (total number of samples containing this variant); the frequency of the variant
HWE           | chisq      | Chi-square test statistic for Hardy-Weinberg equilibrium, calculated from hetFreq and homFreq
HWE           | homFreq    | sum_hom / (total number of samples); the ratio of the number of samples containing this variant in a homozygous state to the total number of samples
HWE           | logPval    | -log(pVal), calculated from pVal
HWE           | pVal       | The p-value for Hardy-Weinberg equilibrium, calculated from the chi-square test statistic (chisq) with 1 degree of freedom
Other columns | hetFreq    | sum_het / (total number of samples); the ratio of the number of samples containing this variant in a heterozygous state to the total number of samples