Annotate variants

The Annotate variants report builder annotates variants from a set of subjects with VEP summary information, gene-disease information, gene ontology codes, paralogs, etc.

The dialog generates a table of variant and gene annotations for variants in the selected subjects that meet user-defined criteria. The annotation may also be limited to variants within a set of genes of interest or within a specific genomic range.

The following annotations are added to each variant in addition to ACMG category scores:

  • Gene annotation from the Clinical Genomic Database (CGD)

  • Gene annotation from the European Commission project database (EuroGenetest)

  • Gene biotype, gene ID, gene-related pathways, gene paralogs, etc.

  • GO term ID and description

  • Genotype of each variant, call copies, call ratios, read depth, etc.

  • Information from public databases, such as related diseases, gene related diseases, variant related diseases, etc.

  • OMIM annotation of each related gene

  • VEP annotation of each variant, such as max_consequence, max_impact, max_score etc.

Figure: Annotate Variants in Sequence Miner

Example use case

The user wishes to annotate variants in selected subjects that meet user-defined criteria with information such as pathways, paralogs, known diseases involving each gene and clinically relevant information (from the CGD). The user also wants to identify heterozygous and homozygous variants in the subject(s).

Description of the algorithm

This query creates a table of variants for each input subject (PN) that meet the user’s filtering criteria (e.g., call quality, maximum impact, consequence and allele frequency of the variant). The table is then joined to annotation tables.
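The filter-then-join pattern described above can be sketched in a few lines of Python. This is only an illustration, not the query the report builder runs: the record layout, the field names (`call_quality`, `max_impact`, `gene`), the thresholds, and the placeholder annotation values are all assumptions.

```python
def filter_variants(variants, min_call_quality, allowed_impacts):
    """Keep the variants that pass the (hypothetical) user thresholds."""
    return [
        v for v in variants
        if v["call_quality"] >= min_call_quality
        and v["max_impact"] in allowed_impacts
    ]


def join_annotations(variants, gene_annotations):
    """Left-join each variant to a gene-level annotation table keyed by gene."""
    return [{**v, **gene_annotations.get(v["gene"], {})} for v in variants]


# Illustrative records only; none of these values come from a real data set.
variants = [
    {"PN": "PN1", "gene": "GENE_A", "call_quality": 60, "max_impact": "HIGH"},
    {"PN": "PN1", "gene": "GENE_B", "call_quality": 10, "max_impact": "MODERATE"},
]
gene_annotations = {"GENE_A": {"Pathways": "pathway_x"}}

annotated = join_annotations(
    filter_variants(variants, min_call_quality=20, allowed_impacts={"HIGH", "MODERATE"}),
    gene_annotations,
)
```

A left join is used so that a variant without a matching annotation row is kept rather than dropped; whether the report builder behaves the same way is not stated here.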

Variants are annotated with CAT scores. The DIAG ACMG CAT scores are calculated as follows:

  • If the allele frequency (AF) of the variant exceeds the user’s designated maxAF threshold (e.g., AF > 0.01), then the variant is CAT4.

  • If the variant is not CAT4 and if MaxClinImpact is “pathogenic” (in the ref/clinical_variants.gorz file), then the variant is CAT1.

  • If the variant is neither CAT4 nor CAT1 and the max_impact is “HIGH” (from the source/anno/vep_v3-4-2/vep_single_wes.gord file), then the variant is CAT2.

  • If the variant is not CAT1, CAT2, or CAT4, then:

      • if max_consequence is “missense_variant” and max_score >= 0.9, the variant is CAT3A;

      • if max_impact is “LOW”, or max_impact is “MODERATE” and the variant is not CAT3A, the variant is CAT3B.
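The rules above amount to an ordered decision cascade, which can be sketched in Python as follows. This is a minimal sketch, not the query itself: the function and parameter names are hypothetical, the string comparisons are assumed to be exact, CAT3A is assumed to be tested before CAT3B, and the result for variants that match no rule is not described in this section, so the sketch returns `None` for them.

```python
def acmg_category(af, max_af, max_clin_impact, max_impact,
                  max_consequence, max_score):
    """Return the category for one variant, following the rule order above."""
    if af > max_af:
        return "CAT4"
    if max_clin_impact == "pathogenic":
        return "CAT1"
    if max_impact == "HIGH":
        return "CAT2"
    is_cat3a = (
        max_consequence == "missense_variant"
        and max_score is not None
        and max_score >= 0.9
    )
    if is_cat3a:
        return "CAT3A"
    if max_impact == "LOW" or (max_impact == "MODERATE" and not is_cat3a):
        return "CAT3B"
    return None  # Assumption: the behavior for other variants is not documented here.


# A common variant is CAT4 even when it is also annotated as pathogenic.
common = acmg_category(0.05, 0.01, "pathogenic", "HIGH", "stop_gained", None)
# A rare missense variant with a high score falls through to CAT3A.
missense = acmg_category(0.001, 0.01, "", "MODERATE", "missense_variant", 0.95)
```

Because the checks run in order, a variant is assigned only the first category it matches.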

Interpreting the output

The output table includes multiple annotation columns that are grouped by category and can be viewed in either the Default view or the ACMG perspective view. Output columns and perspectives are described below.

Column descriptions

Basic columns and descriptions

| Group | Column | Description |
|---|---|---|
| Basic | Call | The actual called sequence (variant), found by replacing the part of the reference sequence denoted by Pos and Reference with the sequence in the Call column |
| Basic | Chrom | The chromosome of the variant represented as chr1, chr2, …, chr22, chrXY, chrX, chrY, chrM |
| Basic | hetORhom | The zygosity of the call, either “het” or “hom” |
| Basic | PN | The patient number (identifier) |
| Basic | Pos | The (first) base pair position of the sequence variant, i.e., the position of the first nucleotide in the Reference column |
| Basic | Reference | Sequence from the reference build, the first base starting at the base pair position in the Pos column |
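How Pos, Reference, and Call relate can be illustrated with a short Python sketch that substitutes the called sequence into a stretch of reference sequence. The function name, the 1-based coordinate convention, and the toy sequence are assumptions made for the example.

```python
def apply_call(reference_seq, seq_start, pos, reference, call):
    """Replace `reference` at position `pos` with `call`.

    `reference_seq` is a stretch of the reference build whose first base
    sits at position `seq_start` (same coordinate system as `pos`).
    """
    offset = pos - seq_start
    if offset < 0 or reference_seq[offset:offset + len(reference)] != reference:
        raise ValueError("Reference does not match the sequence at Pos")
    return reference_seq[:offset] + call + reference_seq[offset + len(reference):]


# Toy sequence starting at position 100.
snv = apply_call("ACGTAC", 100, 102, "G", "T")        # substitution
deletion = apply_call("ACGTAC", 100, 102, "GT", "G")  # deletion of one base
```

The check against the reference sequence guards against applying a call at the wrong position.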

CGD columns and descriptions

The CGD (Clinical Genomic Database) columns provide information for variants based on the manually curated database of variants associated with known clinically significant conditions and available interventions.

| Group | Column | Description |
|---|---|---|
| CGD | AGE_GROUP | Pediatric: less than 18 years of age; Adult: at least 18 years of age |
| CGD | COMMENTS | Any additional observations noted by curators |
| CGD | CONDITION | Conditions also resulting from mutations in the same gene but may otherwise be placed in the “General” Intervention category |
| CGD | INHERITANCE | Pattern of inheritance the variant is known to follow: AD - autosomal dominant; AR - autosomal recessive; BG - blood group; Digenic - a condition resulting from simultaneous mutations in different genes; Maternal - maternal mitochondrial inheritance; XL - X-linked (because X-linked conditions can frequently have manifestations in both genetic sexes, X-linked conditions are not designated as dominant or recessive) |
| CGD | INTERVENTION CATEGORIES | Organ systems for which specific and additional interventions may be beneficial |
| CGD | INTERVENTION RATIONALE | Description of the intervention and its benefit |
| CGD | MANIFESTATION CATEGORIES | Organ systems affected by mutations in corresponding genes; recognition of involved organ systems may help guide supportive care |
| CGD | REFERENCES | CGD: Clinical Genomic Database by NHGRI; PubMed ID of the reference from which the information was taken |

COMM columns and descriptions

The COMM columns provide variant annotation (comments) added to CSA or Sequence Miner by users.

| Group | Column | Description |
|---|---|---|
| COMM | CLINICAL_SIGNIFICANCE | The clinical significance (e.g., pathogenic, benign, unknown significance, drug-response, risk factor, etc.) of the variant as annotated (commented) by users; if the same variant has several comments, this cell will contain a set of values |
| COMM | MODE_OF_INHERITANCE | The user-annotated (commented) mode of inheritance of the variant; if the same variant has several comments, this cell will contain a set of values |
| COMM | TEXT | The description (comment) component for the user annotation of the variant |

EuroGenetest columns and descriptions

The EuroGenetest columns are derived from a European Commission project database containing European genetic testing information for particular genes, variants, and diseases.

| Group | Column | Description |
|---|---|---|
| EuroGenetest | Diseases | Diseases associated with a variant derived from the European Commission project database |
| EuroGenetest | NoOfDiseases | Number of diseases associated with a variant derived from the European Commission project database |
| EuroGenetest | NoOfpanels | Number of gene panels associated with a variant derived from the European Commission project database |
| EuroGenetest | panels | EuroGenetest panels associated with a variant derived from the European Commission project database |

Gene columns and descriptions

The Gene columns provide information based on the candidate gene in which a variant is found. When possible, the HUGO Gene Nomenclature Committee (HGNC) gene symbol is provided. Columns list gene annotations for the variants identified, including gene biotype, gene ID, gene-related pathways, gene paralogs, etc.

| Group | Column | Description |
|---|---|---|
| Gene | Aliases | The aliases of the given gene |
| Gene | Biotype | Biological class of the gene as annotated by VEP |
| Gene | cdsEnd | cDNA end position as annotated by VEP |
| Gene | cdsStart | cDNA start position as annotated by VEP |
| Gene | Description | Description of the gene, i.e., full gene name |
| Gene | gene_stable_id | Ensembl stable ID for the gene |
| Gene | Paralogs | The paralogs of the given gene |
| Gene | Pathways | The pathway(s) in which the gene is found, as listed for Ensembl genes in the ref/ensgenes/ensgenes_gene2pathway.mmap file |
| Gene | Strand | The transcription strand for the gene (+/-) |
| Gene | Symbol | The HGNC symbol when it exists; otherwise the Ensembl internal alias |

GO columns and descriptions

The GO columns provide a functional annotation of the gene product in which the variant is found. Columns list Gene Ontology (GO) annotations for the gene, including the GO term ID and term description.

| Group | Column | Description |
|---|---|---|
| GO | Descriptions | Gene ontology category descriptions |
| GO | IDs | Gene ontology identifiers |

GT columns and descriptions

The GT (genotype) columns provide quality control information for the variant call based on the sequence read depth and quality. These scores are based on the Genome Analysis Toolkit (GATK) measures. Columns list genotype information derived from the VCF, including the variant call, call copies, call ratio, call quality, and read depth.

| Group | Column | Description |
|---|---|---|
| GT | CallCopies | Because the focus is only on variations from the reference, CallCopies refers to how many copies of the variation exist in a subject. A CallCopies value of “2” therefore corresponds to a homozygous variant, whereas a CallCopies value of “1” corresponds to a heterozygous variant. |
| GT | CallRatio | Proportion of reads containing the variant call; expected to be close to 0.5 for heterozygous calls and close to 1 for homozygous calls |
| GT | Depth | The number of reads covering the variant call |
| GT | FILTER | Quality parameter based on the ratio between gt-quality and depth, showing whether the call is considered LowQual (not usable) or PASS; this remains a crude quality measure |
| GT | GL_Call | A statistical measure indicating the likelihood that the call is wrong; the scale has been converted to use only integers, so the higher the number, the less likely it is that the call is wrong |
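The relationship between CallCopies, zygosity, and CallRatio can be sketched as below. The helper names are hypothetical, and the sketch assumes that the “het”/“hom” labels correspond directly to CallCopies values of 1 and 2 and that CallRatio is the count of variant-supporting reads divided by Depth.

```python
def zygosity(call_copies):
    """Map a CallCopies value to the 'het'/'hom' label."""
    if call_copies == 2:
        return "hom"
    if call_copies == 1:
        return "het"
    raise ValueError(f"Unexpected CallCopies value: {call_copies}")


def call_ratio(variant_reads, depth):
    """Proportion of reads that contain the variant call."""
    if depth <= 0:
        raise ValueError("Depth must be positive")
    return variant_reads / depth


# 12 of 24 reads carry the variant: consistent with a heterozygous call.
ratio = call_ratio(12, 24)
label = zygosity(1)
```

A ratio far from 0.5 for a “het” call, or far from 1 for a “hom” call, may be worth a closer look at the underlying reads.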

KNOWN columns and descriptions

The KNOWN columns provide publicly available information about the candidate gene and/or variant as annotated by ClinVar, HGMD, and OMIM. Columns list publicly known clinical annotations derived from ClinVar, OMIM, and HGMD Professional for the variant and gene, including related diseases and predicted clinical impact.

| Group | Column | Description |
|---|---|---|
| KNOWN | gene_diseases | Diseases known to be associated with the gene as annotated in ClinVar, HGMD, and OMIM |
| KNOWN | gene_lists | Gene lists that include the gene in which the variant is found, as recorded in the ref/ensgenes/ensgenes_disease.map file |
| KNOWN | InACMG | A Boolean column (“true” or “false”) indicating whether the gene is in the ACMG recommended list of genes for incidental findings and reporting |
| KNOWN | var_diseases | Diseases known to be associated with the variant as annotated by ClinVar, HGMD, and OMIM |

OMIM columns and descriptions

The OMIM columns provide the OMIM-designated identification for a particular gene and related disease description.

| Group | Column | Description |
|---|---|---|
| OMIM | Descriptions | OMIM disease descriptions for the gene |
| OMIM | IDs | The OMIM ID of the gene |

VEP columns and descriptions

The VEP columns provide functional annotations for variants based on the Ensembl Variant Effect Predictor (VEP). Columns list VEP annotation for each variant, including the max_consequence, max_impact, max_score, and transcript information.

| Group | Column | Description |
|---|---|---|
| VEP | Amino_Acids | The amino acid with and without variant, separated by a “/” (provided only if the variant affects the protein-coding sequence), otherwise “.” |
| VEP | max_consequence | Consequence type reported for this variant having the greatest impact |
| VEP | Max_Impact | Classification of the level of severity of the transcript consequence type assigned by VEP |
| VEP | Max_Score | Maximum score for the variant as observed in dbNSFP [Score = max((1 - Sift_score), Polyphen2_HDIV_score, Polyphen2_HVAR_score)] |
| VEP | Protein_Position | Position of the amino acid in the protein sequence (only if the variant falls within a coding sequence); a value is given for each corresponding transcript specified in the CDS position field |
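The Max_Score formula can be written out as a small Python function. SIFT scores are inverted (1 - score) so that, like the PolyPhen-2 scores, higher values indicate a more damaging prediction. The function name and the handling of missing scores (skipped, with `None` returned when no score is available) are assumptions, not documented behavior.

```python
def max_score(sift_score, polyphen2_hdiv_score, polyphen2_hvar_score):
    """Score = max((1 - Sift_score), Polyphen2_HDIV_score, Polyphen2_HVAR_score)."""
    candidates = []
    if sift_score is not None:
        candidates.append(1 - sift_score)  # invert so that higher means more damaging
    if polyphen2_hdiv_score is not None:
        candidates.append(polyphen2_hdiv_score)
    if polyphen2_hvar_score is not None:
        candidates.append(polyphen2_hvar_score)
    return max(candidates) if candidates else None


# SIFT 0.02 inverts to roughly 0.98, which exceeds both PolyPhen-2 scores.
score = max_score(0.02, 0.85, 0.60)
```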

Other columns and descriptions

| Group | Column | Description |
|---|---|---|
| Other columns | dbSNP_rsIDs | The dbSNP identifier |
| Other columns | DIAG_ACMGCat | Categorization of the sequence variants according to the ACMG scheme |
| Other columns | formatZip | VCF genotype fields |
| Other columns | FS | Fisher’s exact test of read strand bias; if the reference reads are balanced between forward and reverse strands, then the alternate reads should be as well |
| Other columns | max_af | Maximum reported allele frequency (1000GP3, EVS, EXAC, Kyoto, GONL) |

Perspective views

Perspective subtabs show subsets of the columns in the Default view.

Perspectives

| Perspective | Description |
|---|---|
| ACMG | Displays only CAT1 and CAT2 variants for which the KNOWN InACMG column is “true”. The following annotation columns are displayed in this perspective: DIAG_ACMGCat, max_consequence, KNOWN_gene_diseases, and KNOWN_var_diseases. |
| Default view | Displays all columns |
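The ACMG perspective behaves like a row filter plus a column selection over the Default view. The sketch below illustrates that idea only; the dictionary keys, the “CAT1”/“CAT2” value encoding, and the case-insensitive comparison of InACMG are assumptions rather than the application's actual implementation.

```python
ACMG_COLUMNS = ("DIAG_ACMGCat", "max_consequence",
                "KNOWN_gene_diseases", "KNOWN_var_diseases")


def acmg_perspective(rows):
    """Keep CAT1/CAT2 rows in ACMG genes and show only the ACMG columns."""
    return [
        {col: row.get(col) for col in ACMG_COLUMNS}
        for row in rows
        if row.get("DIAG_ACMGCat") in ("CAT1", "CAT2")
        and str(row.get("InACMG")).lower() == "true"
    ]


# Illustrative rows; the values are placeholders.
rows = [
    {"DIAG_ACMGCat": "CAT1", "InACMG": "true", "max_consequence": "stop_gained",
     "KNOWN_gene_diseases": "disease_x", "KNOWN_var_diseases": "disease_x", "Depth": 40},
    {"DIAG_ACMGCat": "CAT3B", "InACMG": "true", "max_consequence": "synonymous_variant",
     "KNOWN_gene_diseases": "", "KNOWN_var_diseases": "", "Depth": 35},
    {"DIAG_ACMGCat": "CAT2", "InACMG": "false", "max_consequence": "frameshift_variant",
     "KNOWN_gene_diseases": "", "KNOWN_var_diseases": "", "Depth": 28},
]
view = acmg_perspective(rows)
```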