Variant association ExAC (WES)¶

Figure (../_images/variantAssociation_exac.png): Variant association ExAC (WES) module in Sequence Miner

Example use case¶

The Variant Association Ref Population report builder performs a single-variant case-control association analysis, using publicly available whole-exome sequencing data from the Exome Aggregation Consortium (ExAC) as controls.

The analysis is a Fisher’s exact test based on the presence or absence of each variant in cases and controls. The ExAC reference population database for allele counts serves as a proxy for controls. The allele count information for each variant, as counted in the case group and the reference population, is used to create 2x2 contingency tables.
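The test described above can be sketched in plain Python. This is an illustrative sketch only, not the tool's implementation: the function name, the tie tolerance, and the example counts are assumptions. It sums the hypergeometric probabilities of all tables that are no more probable than the observed one, which is the usual definition of the two-sided p-value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell,
        # with the row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every table that is at most as probable as the observed one.
    # The small relative tolerance (an assumption) guards against float ties.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical counts: 4 variant alleles among 200 case alleles,
# 10 variant alleles among 100000 reference population alleles.
p_value = fisher_exact_two_sided(4, 196, 10, 99990)
```

Exact integer binomials keep the sketch dependency-free, but they become slow for very large tables; a statistics library would normally be used in practice.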

Deviating from matched populations¶

This analysis assumes that the case group and the reference population are equivalent. Differences between the case and control groups can lead to incorrect results. Please note the following considerations:

In this analysis, the case and control groups will not be from the same wet lab experiment. Differences in sample preparation and sequencing can invalidate the comparison between these cohorts.
Case and control groups also differ in their analysis steps. The variants in the case group and the control group have been called and filtered using different techniques. Quality filtering is applied to the variants in the case group, while for the reference population the tool uses only variants that have a PASS value in the VCF-derived “FILTER” column; non-PASS variants are excluded from the analysis.
If there is a substantial difference in sample size between the case and control groups (unbalanced groups), this will inflate the type I error rate.
The tool only takes into account variants that are present at least once in the case group.

Estimating the control total allele number for variants that are absent in the reference population¶

When the variant identified in the case group is not present in the reference file, the control total allele number for the variant in question is estimated from the reference data file. A variant present in the case group but absent in the reference file will be annotated as follows:

Control allele count (CTRL_alleles) is 0, because the variant is absent from the controls.
Control total allele number (CTRL_total_alleles) is estimated from the nearest ‘PASS’ variant in the reference population table. If there is no ‘PASS’ variant within 50 bp of the missing variant, the position is assumed to have poor coverage in the reference file and the variant is excluded from the analysis.
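The nearest-variant lookup described above might look like the following sketch. The function and argument names are hypothetical. The inclusive 50 bp boundary and the preference for the upstream neighbour on ties are assumptions, since the text does not specify either.

```python
from bisect import bisect_left

MAX_DISTANCE_BP = 50  # window stated in the text


def estimate_ctrl_total_alleles(pos, pass_positions, pass_allele_numbers):
    """Return the allele number of the nearest PASS reference variant, or None.

    pass_positions: sorted positions of PASS variants on one chromosome
    pass_allele_numbers: total allele number at each of those positions
    """
    i = bisect_left(pass_positions, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(pass_positions)]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda j: abs(pass_positions[j] - pos))
    if abs(pass_positions[nearest] - pos) > MAX_DISTANCE_BP:
        return None  # treated as poorly covered, so the variant is excluded
    return pass_allele_numbers[nearest]
```

Returning None stands in for "exclude this variant"; a caller would drop such variants before building the contingency tables.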

Selecting an appropriate subpopulation¶

The control population is defined by the reference population input parameter. The reference population should be selected to match the case group ethnicity as closely as possible. By default, the reference population parameter is set to “All”, corresponding to the entire ExAC database except the subpopulation “Others (OTH)”. The following options are available for subpopulations:

African/African American (AFR)
Admixed American (AMR)
East Asian (EAS)
Finnish (FIN)
Non-Finnish European (NFE)
South Asian (SAS)

A brief summary of these subpopulations is provided by the ExAC project: http://exac.broadinstitute.org/faq.
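As a rough illustration of how the reference population parameter could select control counts, the sketch below sums per-subpopulation allele counts. The record layout is an assumption: the `AC_<POP>` and `AN_<POP>` keys mirror the per-population fields of the ExAC VCF, but the tool's internal reference table may be organised differently. Treating “All” as the sum of the six listed subpopulations is likewise an assumption based on the description above.

```python
SUBPOPULATIONS = ("AFR", "AMR", "EAS", "FIN", "NFE", "SAS")  # OTH is not included


def control_counts(record, reference_population="All"):
    """Return (variant allele count, total allele number) for the chosen controls."""
    if reference_population == "All":
        pops = SUBPOPULATIONS
    elif reference_population in SUBPOPULATIONS:
        pops = (reference_population,)
    else:
        raise ValueError(f"unknown reference population: {reference_population}")
    allele_count = sum(record[f"AC_{p}"] for p in pops)
    allele_number = sum(record[f"AN_{p}"] for p in pops)
    return allele_count, allele_number
```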

Note

Phenotypic information is not available for individuals in the reference population. These individuals may carry the phenotype or genetic condition under consideration in your case-control analysis.

Description of the algorithm¶

../_images/variantAssociation_exac_02.png

All variants identified in the cases are filtered according to the input parameters. For each variant that passes the filters, the following counts are determined:

The number of variant alleles
The number of homozygous individuals
The total number of alleles with good coverage
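These counts can be derived from per-individual genotypes, as in the sketch below. The genotype encoding is an assumption made for illustration: each individual is represented by the number of variant alleles carried (0, 1, or 2, assuming diploid calls), with None for individuals with low coverage or a poor-quality call. The dictionary keys loosely follow the output column names.

```python
def summarize_genotypes(genotypes):
    """Summarize variant counts for one variant across a group of individuals."""
    called = [g for g in genotypes if g is not None]
    return {
        "alleles": sum(called),                      # variant alleles
        "het": sum(1 for g in called if g == 1),     # heterozygous individuals
        "hom": sum(1 for g in called if g == 2),     # homozygous individuals
        "total_PNs": len(called),                    # individuals with good coverage
        "total_alleles": 2 * len(called),            # alleles with good coverage
        "unknown": len(genotypes) - len(called),     # low coverage or poor call
    }
```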

For each variant, these counts are used to define 2x2 contingency tables for three models: dominant, recessive, and multiplicative. For each contingency table, calculations are performed to determine the odds ratio and 2-tailed Fisher’s exact test p-value.

Contingency tables are constructed as follows. In general, the “A” and “B” values are obtained from the Case data, while the “C” and “D” values are derived from the selected reference population.

Generic 2x2 contingency table¶
	With variant	Without variant
Case alleles	A	B
Control alleles	C	D

Multiplicative model¶
	With variant	Without variant
Case alleles	`case_alleles`	`case_absent_alleles`
Control alleles	`ctrl_alleles`	`ctrl_absent_alleles`

Dominant model¶
	With variant	Without variant
Cases	`case_present`	`case_absent`
Controls	`ctrl_present`	`ctrl_absent`

Recessive model¶
	Homozygous for variant	Not homozygous for variant
Cases	`case_hom`	`case_absent_hom`
Controls	`ctrl_hom`	`ctrl_absent_hom`
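The three tables can be assembled from a handful of per-group counts, as sketched below. The input keys (`alleles`, `het`, `hom`, `total_PNs`, `total_alleles`) and the function name are hypothetical. The derived cells assume biallelic sites, so that reference alleles equal total alleles minus variant alleles, and carriers equal heterozygous plus homozygous individuals.

```python
def contingency_tables(case, ctrl):
    """Return {model: ((A, B), (C, D))} for the three models."""

    def rows(counts):
        present = counts["het"] + counts["hom"]  # individuals carrying the variant
        return {
            "multiplicative": (counts["alleles"],
                               counts["total_alleles"] - counts["alleles"]),
            "dominant": (present, counts["total_PNs"] - present),
            "recessive": (counts["hom"], counts["total_PNs"] - counts["hom"]),
        }

    case_rows, ctrl_rows = rows(case), rows(ctrl)
    return {model: (case_rows[model], ctrl_rows[model]) for model in case_rows}
```

Each returned pair of rows maps directly onto the A, B, C, and D cells of the generic table.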

The 2-tailed Fisher’s exact test is performed for each model according to the contingency tables above. Similarly, the odds ratio (OR) for each model is calculated as follows, using a general formula which corresponds to the Generic 2x2 contingency table:

Odds Ratio = (A/B) / (C/D)

Multiplicative Odds Ratio = (case_alleles/case_absent_alleles) / (ctrl_alleles/ctrl_absent_alleles)

Dominant Odds Ratio = (case_present/case_absent) / (ctrl_present/ctrl_absent)

Recessive Odds Ratio = (case_hom/case_absent_hom) / (ctrl_hom/ctrl_absent_hom)

The output includes an OR for each variant for each of the three models. There are three exceptions to this OR formula:

When B = 0 and D ≠ 0, the Odds Ratio value will be 1000
When C = 0, the Odds Ratio value will be 1000
When D = 0, the Odds Ratio value will be NaN
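The formula and its exceptions can be combined into one small function, sketched below. The function name is hypothetical. The text does not say which rule wins when C and D are both 0; this sketch checks D first and returns NaN in that case, which is an assumption.

```python
import math

OR_CAP = 1000.0  # value reported in place of an infinite odds ratio


def odds_ratio(a, b, c, d):
    """Odds ratio (A/B) / (C/D) with the documented special cases."""
    if d == 0:
        return math.nan  # assumption: D = 0 takes precedence over C = 0
    if b == 0 or c == 0:
        return OR_CAP
    return (a / b) / (c / d)
```

For example, `odds_ratio(4, 6, 14, 1986)` evaluates (4/6) / (14/1986), while `odds_ratio(3, 2, 0, 5)` returns the capped value because no control carries the variant.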

Interpreting the output¶

A 2-tailed Fisher’s exact test p-value is returned for each model as a measure of statistical significance. An odds ratio (OR) > 1 indicates that the variant is more common in cases than in controls.

For each variant position, the number of individuals in the reference population can be considered (given in the CTRL_total_PNs column). If the number is below a desired threshold, the variant can be flagged or excluded from analysis.
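A post-processing step of this kind could be sketched as follows, assuming the report has been exported to a list of dictionaries keyed by column name. The function name, the `low_ctrl_coverage` flag, and any particular threshold value are hypothetical; the text leaves the threshold to the analyst.

```python
def flag_low_control_coverage(rows, min_ctrl_pns):
    """Return copies of the rows with a flag for low reference population coverage."""
    return [
        dict(row, low_ctrl_coverage=row["CTRL_total_PNs"] < min_ctrl_pns)
        for row in rows
    ]
```

Flagging rather than dropping keeps the original rows available, so the threshold can be revisited later without rerunning the report.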

Column descriptions¶

Report output columns and descriptions¶
Group	Column	Description
Basic	Chrom, Pos, Reference, Call	Basic variant information
CASE	CASE_absent	The number of cases in which the variant allele is absent
	CASE_absent_alleles	The total number of reference alleles in cases
	CASE_absent_hom	The number of cases that are not homozygous for the variant allele
	CASE_alleles	The total number of variant alleles in cases
	CASE_het	The number of cases that are heterozygous for the variant allele
	CASE_hom	The number of cases that are homozygous for the variant allele
	CASE_present	The number of cases in which the variant allele is present
	CASE_prop_alleles	The proportion of variant alleles / total alleles with good coverage in the case group
	CASE_prop_het	The proportion of individuals that are heterozygous for the variant allele / total number of individuals with good coverage in the case group
	CASE_prop_hom	The proportion of individuals that are homozygous for the variant allele / total number of individuals with good coverage in the case group
	CASE_total_alleles	The total number of alleles with good coverage in the case group
	CASE_total_PNs	The total number of individuals with good coverage in the case group
	CASE_unknown	The number of cases with low coverage or a poor quality call at this position
CTRL	CTRL_absent	The number of controls in which the variant allele is absent
	CTRL_absent_alleles	The total number of reference alleles in controls
	CTRL_absent_hom	The number of controls that are not homozygous for the variant allele
	CTRL_alleles	The total number of variant alleles in controls
	CTRL_het	The number of controls that are heterozygous for the variant allele
	CTRL_hom	The number of controls that are homozygous for the variant allele
	CTRL_present	The number of controls in which the variant allele is present
	CTRL_prop_alleles	The proportion of variant alleles / total alleles with good coverage in the control group
	CTRL_prop_het	The proportion of individuals that are heterozygous for the variant allele / total number of individuals with good coverage in the control group
	CTRL_prop_hom	The proportion of individuals that are homozygous for the variant allele / total number of individuals with good coverage in the control group
	CTRL_total_alleles	The total number of alleles with good coverage in the control group
	CTRL_total_PNs	The total number of individuals with good coverage in the control group
	CTRL_unknown	The number of controls with low coverage or a poor quality call at this position
MAX	MAX_AF	Maximum reported allele frequency across the population surveys from 1000GP3, EVS, ExAC, Kyoto, GONL, and DeCODE
	Max_Impact	Classification of the level of severity of the transcript consequence type assigned by VEP
OR	OR_dom	Odds ratio as calculated from the 2x2 contingency table for the dominant model: Dominant Odds Ratio = (case_present/case_absent) / (ctrl_present/ctrl_absent)
	OR_mm	Odds ratio as calculated from the 2x2 contingency table for the multiplicative model: Multiplicative Odds Ratio = (case_alleles/case_absent_alleles) / (ctrl_alleles/ctrl_absent_alleles)
	OR_rec	Odds ratio as calculated from the 2x2 contingency table for the recessive model: Recessive Odds Ratio = (case_hom/case_absent_hom) / (ctrl_hom/ctrl_absent_hom)
pVal	pVal_dom	Measure of statistical significance of the difference between the distribution of carriers in the case group versus the reference population
	pVal_mm	Measure of statistical significance of the difference between the distribution of variant alleles in the case group versus the reference population
	pVal_rec	Measure of statistical significance of the difference between the distribution of homozygous carriers in the case group versus the reference population
Other columns	Gene_symbol	HUGO gene symbol associated with this variant

Perspective views¶

Perspectives subtabs focus on subsets of the columns in the Default view.

Perspectives¶
Perspective	Description
Default view
Dominant Model
Multiplicative Model
Recessive Model