Reference Data

The ref folder (accessible from the File Explorer in Sequence Miner) contains a collection of reference files, curated from a variety of databases and combined into GOR format.

GORdb stores data indexed by genomic locus coordinates, thereby facilitating real-time retrieval and integration of clinical sequence data with annotation from public, proprietary and custom resources. The raw sequence data and annotation data are stored separately, so changes to either (e.g., the addition of new sample data to the database or the release of a new reference package) can be handled without rewriting per-sample annotation files. As reference data is updated, data joins are performed on the fly using the most up-to-date reference data.
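The idea of keeping sample data and annotation separate and joining them at query time can be sketched in a few lines. This is an illustration in plain Python, not GOR query syntax; the function name `annotate`, the record fields, and the gene labels are all hypothetical, and a real join would normally also match on the reference and alternate alleles.

```python
from typing import Dict, List, Tuple

Locus = Tuple[str, int]  # (chromosome, 1-based position); hypothetical key layout


def annotate(variants: List[dict], reference: Dict[Locus, dict]) -> List[dict]:
    """Join sample variants to reference annotation by locus at query time.

    The input variants are never modified, so a new reference build only
    changes the result of the join, not the stored sample data.
    """
    annotated = []
    for variant in variants:
        locus = (variant["chrom"], variant["pos"])
        # Variants without a matching reference entry pass through unannotated.
        annotated.append({**variant, **reference.get(locus, {})})
    return annotated


# Hypothetical per-sample variants (stored once, never rewritten).
sample_variants = [
    {"chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 67890, "ref": "C", "alt": "T"},
]

# Two hypothetical reference builds; the second adds an annotation.
reference_v1 = {("chr1", 12345): {"gene": "GENE_A"}}
reference_v2 = {
    ("chr1", 12345): {"gene": "GENE_A"},
    ("chr2", 67890): {"gene": "GENE_B"},
}

before = annotate(sample_variants, reference_v1)
after = annotate(sample_variants, reference_v2)
```

With the first build the second variant comes back unannotated; with the second build the same stored variant picks up its new annotation, without any change to `sample_variants`.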

A full list of reference data sources is shown in the Reference data sources table later on this page.

Each Clinical Sequence Analyzer (CSA) instance is deployed with its own build of the reference data and each CSA project gets a build setting when it is created. Studies within each project can reference different major-minor versions of the Reference Data build.

The reference data that is deployed with your instance of CSA may depend on your subscriptions to various banks of genomic data.

Reference data files

The following table provides a brief overview of selected reference files and directories:

Reference data in the Sequence Miner

Reference file

Description

ref/cancer/*

Directory contains cancer-related files with clinically actionable information, commercial cancer panels, TCGA cancer-related genes, and CGD genes

ref/dbsnp/*

Directory contains variants from the dbSNP database

ref/deepCODE/*

Directory contains variants with scores from the deepCODE algorithm

ref/disgenes/*

Directory contains disease-related gene map files, including the ACMG minimum panel, CGD panel, immuno-related disease panel, the Kingsmore childhood panel, etc.

ref/disvariants/*

Directory contains variants from ClinVar and HGMD

ref/encode/*

Directory contains positions with associated ENCODE data

ref/ensgenes/*

Directory contains Ensembl gene-related information: exons, transcripts, pathways, Gene Ontology (GO), paralogs, etc.

ref/hgmd/*

Directory contains HGMD-related files, including variant details with clinical information and URL links

ref/refgenes/*

Directory contains RefSeq gene-related information: exons, transcripts, etc.

ref/regulation/*

Directory contains regulation-related files from ENCODE, etc.

ref/repeats/*

Directory contains files that identify regions of simple repeats

ref/variants/*

Directory contains variant information including data from population studies (EVS, ExAC, 1000 Genomes, etc.)

ref/cancer_variants.gorz

Variants from the COSMIC and NCI-60 databases

ref/clinical_genes.gorz

Disease genes based on variants from HGMD, ClinVar, and OMIM

ref/clinical_variants.gorz

Clinical variants from HGMD, ClinVar, and OMIM

ref/genes.gorz

Ensembl gene list with one entry per gene symbol

ref/rgenes.gorz

RefSeq gene list

ref/1000G.gorz

Variants with allele frequencies from the 1000 Genomes

ref/dbnsfp.gorz

Variants and annotations from the dbNSFP database

ref/evs_anno.gorz + ref/evs_freq

Variants with allele frequencies from the EVS database

ref/exac.gorz

Variants with allele frequencies from the ExAC database

ref/freq_max.gorz

Combined variants from EVS, 1000 Genomes, the Japanese Ancestry population from the Kyoto Consortium, the deCODE population survey of Iceland, Genomes of the Netherlands (GoNL), and ExAC

ref/jpt_freq.gorz

Variants with allele frequencies from the Japanese Ancestry population from the Kyoto Consortium

ref/version.txt

The versions of the databases included in the ref folder

Reference data sources

The following table contains a comprehensive list of all sources of the reference data. To find out the exact versions included in your installation of CSA, please refer to the version.txt file in the ref folder or to the release notes for your installation.

Reference data sources

Resource

Description

RefSeq

The RefSeq collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.

Ensembl

The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.

dbSNP

The NCBI dbSNP database integrates most germline variants and assigns them accession numbers in the form of RS IDs. The allele frequency information in the database comes directly from the 1000 Genomes Project, but dbSNP also contains variants from many other sources.

1000 Genomes

The 1000 Genomes Project has sequenced over 1000 individuals from 14 populations by combining whole genome sequencing and whole exome sequencing.

dbNSFP

The Non-Synonymous Functional Predictions database annotates SNPs in the human genome with functional predictions.

EVS/ESP

The Exome Variant Server (EVS) contains exome sequencing variants from the NHLBI Exome Sequencing Project (ESP). The ESP6500 dataset comprises over 6,500 exomes, including healthy controls and individuals with specific diseases. The goal of the ESP dataset is to release the frequency counts of specific variants without regard to phenotype.

ExAC

The Exome Aggregation Consortium collected data from unrelated individual exomes sequenced as part of various disease-specific and population genetic studies.

COSMIC

The Catalogue Of Somatic Mutations In Cancer project stores somatic mutation information and related details for human cancers.

TCGA

The Cancer Genome Atlas database provides somatic mutations and related disease information for a list of specific cancers.

HGMD

The Human Gene Mutation Database represents an attempt to collate known (published) gene lesions responsible for human inherited disease.

CGD

Clinical Genomic Database is a manually curated database of conditions with known genetic causes, focusing on medically significant genetic data with available interventions.

OMIM

Online Mendelian Inheritance in Man is a continuously updated catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression.

ClinVar

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

ACMG

American College of Medical Genetics and Genomics has recommended sets of genes for reporting incidental findings in clinical exome and genome sequencing.

Population allele frequencies

There are several population surveys included in Clinical Sequence Analyzer (CSA). The following table summarizes the number of samples used to generate the allele frequencies for each database.

Database                            Number of exomes/genomes
1000 Genomes                        2,577
EVS (ESP)                           6,503
ExAC                                60,706
gnomAD                              138,632
Genome of the Netherlands (GoNL)    769
Kyoto Japanese                      1,208
Icelandic                           2,636
SUM                                 213,031

Population survey data for a selected variant is also displayed in the References panel of the Variant Curation window in CSA when the Population data category of evidence is selected for scoring.

_images/variantScoring_popRef.png

The tables in the population data references panel include the following information:

  • Pop - Population

  • Freq - Allele frequency

  • AC/AN - Allele count/allele number (the number of times the variant allele is observed in a given population / the total number of alleles called at that site, whether variant or reference)
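The relationship between the three columns above is a simple ratio: Freq is AC divided by AN. The sketch below shows that calculation; the function name `allele_frequency` and the example counts are hypothetical and are not taken from any of the databases described on this page.

```python
def allele_frequency(ac: int, an: int) -> float:
    """Return the allele frequency for an allele count (AC) and allele number (AN)."""
    if an <= 0:
        raise ValueError("AN must be positive: no alleles were called at this site")
    if not 0 <= ac <= an:
        raise ValueError("AC must be between 0 and AN")
    return ac / an


# Hypothetical example: the variant allele is seen 3 times among 1,200 called alleles.
freq = allele_frequency(ac=3, an=1200)
print(f"{freq:.4f}")  # prints 0.0025
```

Because AN counts alleles rather than individuals, a fully genotyped autosomal site in 600 diploid samples gives AN = 1,200.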

Following are brief descriptions of each available database.

ExAC

The ExAC table provides information from the following populations:

  • AFR - African/African American

  • AMR - Latino

  • EAS - East Asian

  • FIN - Finnish

  • NFE - Non-Finnish European

  • SAS - South Asian

Information provided by ExAC (Exome Aggregation Consortium) about population survey sizes is shown in the following table:

ExAC population survey sizes

Population                        Male samples    Female samples    Total
African/African American (AFR)    1,888           3,315             5,203
Latino (AMR)                      2,254           3,535             5,789
East Asian (EAS)                  2,016           2,311             4,327
Finnish (FIN)                     2,084           1,223             3,307
Non-Finnish European (NFE)        18,740          14,630            33,370
South Asian (SAS)                 6,387           1,869             8,256
Other (OTH)                       275             179               454
TOTAL                             33,644          27,062            60,706

For more information, visit the ExAC Browser and ExAC FAQ.

gnomAD

The gnomAD table provides information from the following populations:

  • AFR - African/African American

  • AMR - Latino

  • ASJ - Ashkenazi Jewish

  • EAS - East Asian

  • FIN - Finnish

  • NFE - Non-Finnish European

  • SAS - South Asian

  • TOTAL - Total variant allele frequency in the combined populations
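A combined frequency such as the TOTAL entry above is normally obtained by pooling allele counts across populations rather than by averaging the per-population frequencies. The sketch below assumes that pooling approach; the function name `pooled_frequency` and all counts are hypothetical illustration values, not figures from gnomAD.

```python
from typing import Dict, Tuple


def pooled_frequency(counts: Dict[str, Tuple[int, int]]) -> float:
    """Pool (AC, AN) pairs across populations and return total AC / total AN."""
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    if total_an == 0:
        return 0.0  # no alleles called in any population
    return total_ac / total_an


# Hypothetical per-population (AC, AN) counts for one variant.
counts = {"AFR": (2, 1000), "NFE": (10, 4000), "EAS": (0, 1000)}

total_freq = pooled_frequency(counts)  # 12 / 6000
mean_of_freqs = sum(ac / an for ac, an in counts.values()) / len(counts)
```

Here the pooled value is 12/6000 = 0.002, while the unweighted mean of the three population frequencies is 0.0015. The two differ because pooling weights each population by the number of alleles it contributes.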

The gnomAD database (Genome Aggregation Database) includes 123,136 exome samples and 15,496 whole genome samples. Information provided by gnomAD about the populations included in the database is shown in the following table:

gnomAD population survey sizes

Population                        Exomes     Genomes    Total
African/African American (AFR)    7,652      4,368      12,020
Latino (AMR)                      16,791     419        17,210
Ashkenazi Jewish (ASJ)            4,925      151        5,076
East Asian (EAS)                  8,624      811        9,435
Finnish (FIN)                     11,150     1,747      12,897
Non-Finnish European (NFE)        55,860     7,509      63,369
South Asian (SAS)                 15,391     0          15,391
Other (OTH)                       2,743      491        3,234
TOTAL                             123,136    15,496     138,632

The first release of gnomAD was known as ExAC (see ExAC) and contained exome data only.

For more information, visit http://gnomad.broadinstitute.org.

1000 Genomes

The 1000 Genomes table provides information from the following populations:

  • AFR - Total African Ancestry population

  • AMR - Total Americas Ancestry population

  • EAS - Total East Asian Ancestry population

  • EUR - Total European Ancestry population

  • SAS - South Asian Ancestry population

  • TOTAL - Total variant allele frequency in the combined populations

Information provided by the 1000 Genomes Project about population survey sizes is shown in the following table:

1000GP3 population survey size

Population                         Total
Total African Ancestry (AFR)       691
Total Americas Ancestry (AMR)      355
Total East Asian Ancestry (EAS)    523
Total European Ancestry (EUR)      514
Total South Asian Ancestry (SAS)   494
TOTAL                              2,577

For more information, visit 1000 Genomes Project phase 3.

EVS

The EVS table provides information from the following populations, derived from the NHLBI Exome Sequencing Project (ESP):

  • AFAM - African American population

  • EUAM - European American population

For more information, visit http://evs.gs.washington.edu/EVS/

Genome of Iceland

The Genome of Iceland table provides the following information from the Icelandic population:

  • TOTAL - Total allele frequency in the population

The Icelandic population dataset contains whole genome sequences from 2,636 individuals from Iceland. This project was carried out by deCODE genetics (Gudbjartsson et al., 2015).

For more information, visit https://www.ncbi.nlm.nih.gov/pubmed/25807286.

Genome of the Netherlands

The Genome of the Netherlands table provides the following information from the Dutch population:

  • TOTAL - Total allele frequency in the population

The Genome of the Netherlands (GoNL) dataset includes 769 samples (The Genome of the Netherlands Consortium, 2014).

For more information, visit http://www.nlgenome.nl.

Human Genetic Variation Database (Kyoto - Japan)

The Human Genetic Variation Database (HGVD) table provides the following information from the Japanese population:

  • TOTAL - Total allele frequency in the population

The Kyoto Japanese population dataset includes 1,208 samples (Higasa et al., 2016).

For more information, visit http://www.hgvd.genome.med.kyoto-u.ac.jp/about.html.

Rotterdam Study Exome Sequencing

The Rotterdam Study Exome Sequencing table provides the following information from the Rotterdam Study, a prospective cohort study in Rotterdam, the Netherlands, ongoing since 1990:

  • TOTAL - Total allele frequency in the population

For more information, visit https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2071967/.