Gene ontology association¶

The Gene ontology association report builder calculates the statistical significance of the degree of association between Gene Ontology (GO) database annotations for a list of genes of interest and categories in the GO database. Genes of interest may include, for instance:

Genes identified by a case-control variant association test
Genes identified by a case-control gene association test

../_images/geneOntologyAssociation.png — Gene Ontology Association module in Sequence Miner¶

Example use case¶

The user wishes to identify functional annotation terms for genes identified in a variant association study of cardiomyopathy. The user first conducts a genome-wide Variant association analysis to identify genes carrying more variants in cases compared to controls. After selecting the genes of interest based on a threshold statistical significance score (e.g., a nominal Fisher’s exact test pvalue threshold of < 0.05), the user saves this list as a grid and then selects it in the TestGeneList field.

The user then creates a BackgroundGeneList grid from the entire list of genes queried in the Variant association analysis (e.g., the cardiomyopathy gene panel), and selects this grid in the BackgroundGeneList field.

Description of the algorithm¶

The GO database is comprised of three ontologies:

Biological processes (BP)
Cellular components (CC)
Molecular functions (MF)

Each ontology is based on a layered and highly structured shared vocabulary describing GO categories.

The user provides a list of genes of interest (the TestGeneList grid) as well as a BackgroundGeneList grid. A BackgroundGeneList grid may include the genes with no variants that meet the threshold statistical significance score.

GO annotations for genes of interest are extracted from the Ensembl database. The levels of annotation for a given gene are dependent on the depth of knowledge about that gene. Given a set of GO-annotated genes of interest, GOseq calculates the probability (a Wallenius hypergeometric test p-value) for the observed degree of over-representation of each GO annotation term (across all GO categories) describing the genes of interest.

Interpreting the output¶

The output displays GO annotation terms with the corresponding p-values. Sort the output by p-value and focus on terms with the lowest p-values to identify candidates for further analysis.

Column descriptions¶

Report output columns and descriptions¶
Column	Description
chrom
CorrectedGOPvalue	Adjusted (Benjamini,Hochberg) p value
gene_start
GOAccession	GO Id
GOClass	The class of the nested GO term - BP, MF, or CC
GOListHitsGeneIDs	List of genes mapped to that GO term
GOPvalue	P value
GOTerm	GO Term definition
ListHits	Number of differentially mutated genes for that GO term (GOseq Node)
ListSize	Number of differentially mutated for the corresponding process class - BP, MF, or CC
PopHits	Total number of background genes with that GO Term (Goseq node).
PopSize	Total number of background genes with the corresponding GOseq process class - BP, MF, or CC