Cohort analysis¶
Sequence Miner is a research tool for integrated analysis of patient genotype and phenotype data. Clinical and other phenotype data can be imported into Sequence Miner through the metadata query language NOR (Non-Ordered Relational). Data fields can be sorted and filtered, allowing the user to quickly classify patients into categories (e.g., affected and unaffected). The NOR-defined subsets can then be used as input lists for the genomic analysis applications in Sequence Miner, called report builders, which apply the GOR (Genomically Ordered Relational) query language.
For analysis of cohort groups, there are two main report builders:
Variant association (variant-based association test)
Gene association (Gene Aggregate Variant Analysis, or GAVA, gene-based association test).
In the Sequence Miner Reports tab, you can select CaseControl in the Category filter to see additional report builders that use cases and controls as input arguments.
This guide focuses on a specific use-case of setting up a case and control group, analyzing the cohort using variant-based or gene-based association test report builders, and visualizing results in the Genome Browser.
The workflow for cohort analysis in Sequence Miner is shown below:
Defining a cohort group¶
The first step is to define the groups that will be compared. A cohort group can be set up by filtering on features such as an identifier in the sample name or a qualitative or quantitative attribute.
Selecting a cohort from the project manifest¶
These features can be found in the All.rep project manifest file that was created upon sample import. The All.rep file contains the sample ID (named “PN”), gender, ethnicity, project_id, and CSA study_roles. To access this file, open the File Explorer, select the SubjectReports folder, and double-click on the file name.
Columns can be sorted by clicking on the column name (an arrow indicates if the column is sorted in ascending or descending order). To filter on a column, right-click on the column name and select Filter on Column.
Once the filter is selected, type the search term(s) into the Filter Text field and select the check box next to the sample (or feature attribute) desired for inclusion. To apply the filter, click Apply.
To create a “case” or “control” group from the filtered list, highlight the PN column, right-click on the highlighted cells, and select Open in new Grid as Cases/Controls.
The new grid with the PN list opens in a new Sequence Miner window labeled “Cases” or “Controls”, depending on the selection.
Selecting a cohort group from a metadata table¶
You can also define a cohort group from a grid containing phenotypes or clinical data for individuals with corresponding genomic data. The leftmost column of this metadata file must contain the list of sample IDs and should be named “PN”. For the filtered “case” or “control” list to be used for querying the genotypic data, the sample IDs in the PN column must be identical to those assigned to the genotype files, e.g., BAM and VCF.
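The two requirements above (a leftmost column named “PN” and sample IDs that match the genotype files exactly) can be checked before import. The following is a minimal sketch run outside Sequence Miner, assuming the metadata is tab-delimited text; the function name, the sample table, and the set of genotype IDs are hypothetical and only illustrate the check.

```python
import csv
import io

def check_pn_column(table_text, genotype_ids):
    """Return the PNs in the table that have no exact match among the genotype IDs."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    # The leftmost column must hold the sample IDs and be named PN.
    if header[0] != "PN":
        raise ValueError("leftmost column must be named PN, found %r" % header[0])
    pns = [row[0] for row in reader if row]
    # Matching is exact, so differences in case or whitespace count as mismatches.
    return [pn for pn in pns if pn not in genotype_ids]

# Hypothetical metadata table and genotype sample IDs (assumptions for illustration).
metadata = (
    "PN\tage\tstatus\n"
    "S001\t54\taffected\n"
    "S002\t61\tunaffected\n"
    "s003\t47\taffected\n"
)
genotype_ids = {"S001", "S002", "S003"}

missing = check_pn_column(metadata, genotype_ids)
print(missing)  # ['s003']
```

Here the lower-case “s003” is reported because it is not identical to the genotype ID “S003”.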
The metadata file can be stored in a folder in the Sequence Miner File Explorer. In the example below, the file is named OvarCancerPhenotypes.rep and is stored in the user_data folder. To open this file, select the file and double-click on the file name. To expand or collapse the column list in the Columns panel, click the arrow at the top of the panel.
Table columns can be sorted by clicking the column header. A triangle appears next to the column name to indicate ascending order (triangle pointing up) or descending order (triangle pointing down). Click the arrow to toggle the sort order.
Table columns can also be filtered by right-clicking the column header. For columns containing numerical values, the distribution of the values in the column is plotted. Cutoffs can be defined by setting different comparators and entering a custom value, a percentage class of the samples, or a range of values. For columns containing string values, each distinct value is displayed alongside a count of the samples with that value. You can select which entries (one or more) to include in the filtering. In both cases, once the class of values has been selected, click Apply to save the filtering parameters and apply them to the table.
To select a group of filtered PNs, highlight the rows containing the sample IDs in the PN column, right-click on the highlighted rows, and select Open in new Grid as Cases/Controls. The new grid opens in a new Sequence Miner window labeled “Cases” or “Controls”, depending on the selection.
The filters applied to the table are displayed in the filter panel below the table. Here, the NOR query that defines the filters is displayed and can be copied to a text file to save for future reference. After applying a filter, select Exclude Selections in the filter dialog box to return to the filtered column and retrieve the remaining sample IDs. This group can then serve as a control group, for example.
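The logic of this section (select the samples matching a filter as one group, then use the remaining samples as the other group) can be sketched outside Sequence Miner as follows. This is an illustration only, not the NOR query that Sequence Miner generates; the column name `diagnosis`, the values, and the sample IDs are hypothetical.

```python
import csv
import io

def split_cases_controls(table_text, column, selected_values):
    """Split PNs into those matching the filter (cases) and the remainder (controls)."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    cases, controls = [], []
    for row in reader:
        if row[column] in selected_values:
            cases.append(row["PN"])
        else:
            # Mirrors retrieving the remaining sample IDs after excluding the selection.
            controls.append(row["PN"])
    return cases, controls

# Hypothetical phenotype table (assumption for illustration).
phenotypes = (
    "PN\tage\tdiagnosis\n"
    "S001\t54\tovarian_cancer\n"
    "S002\t61\tnone\n"
    "S003\t47\tovarian_cancer\n"
    "S004\t58\tnone\n"
)

cases, controls = split_cases_controls(phenotypes, "diagnosis", {"ovarian_cancer"})
print(cases)     # ['S001', 'S003']
print(controls)  # ['S002', 'S004']
```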
Running a cohort analysis with report builders¶
Once the case and control groups are defined, the next step is to identify the correct application for analysis. Sequence Miner provides a collection of analysis applications called report builders, which are categorized based on the type of analysis they perform.
To view the options for analysis, open the Reports tab by selecting the Report Builder icon in the toolbar on the left-hand side of the Sequence Miner window.
Select CaseControl in the Category filter to see the report builders that accept a “case” and a “control” list for cohort analysis. These report builders include a variant-based and a gene-based association test.
To run a cohort analysis, select either the variant-based association test (the Variant association report builder) or gene-based association test (the GAVA report builder).
Variant-based association test (Case_Ctrl Analysis)¶
Select the Variant association report builder by clicking on it in the Reports menu grid.
With the case and control grids (tables) open in the browser, click the menu icon in the CASEs field. A pop-up window appears listing the available lists for selection. Select the case grid for input and click Apply. Repeat these steps in the CTRLs field to select the control group as input.
In the remaining fields, select the filtering parameters to apply to the analysis. For example, to filter variants by VEP consequence, click inside the VEP_consequence field. A pop-up window opens which lists the VEP consequence categories and corresponding maximum impact (HIGH=LoF vars, MODERATE=missense vars/inframe indels, LOW=synonymous vars, LOWEST=TFBS vars). Additional filter parameters are described in the panel on the left-hand side of the input fields.
After defining the parameters, initiate the analysis by clicking Create Report.
When the analysis is complete, a new table window labeled, for instance, “Variant_Association_1” (in the case of a variant-based association test) opens in a new Sequence Miner window. The output table includes the chromosome, position, reference allele, alternate allele, and gene symbol for significant variants, as well as a p-value.
The Variant association analysis generates counts for cases and controls with and without a given variant, as well as odds ratios (OR) and p-values for the dominant (dom), recessive (rec), and multiplicative (mm) models. For the GAVA analysis, additional columns include the chi-squared p-value and a permuted p-value. The analysis and the outputs of these queries are described in more detail in the Report Builders section.
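To make the dominant and recessive columns concrete, the sketch below computes an odds ratio and a p-value from carrier counts. It assumes that the dominant model counts samples with at least one alternate allele and the recessive model counts samples homozygous for the alternate allele, and it uses a Pearson chi-square test with one degree of freedom. The source does not state which test the report builder applies, so the choice of test, the function name, and the genotype counts are all assumptions.

```python
import math

def association_2x2(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Odds ratio and Pearson chi-square p-value (1 df) for a 2x2 carrier table."""
    a = case_carriers               # cases with the variant
    b = case_total - case_carriers  # cases without the variant
    c = ctrl_carriers               # controls with the variant
    d = ctrl_total - ctrl_carriers  # controls without the variant
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, p_value

# Hypothetical genotype counts per group: (hom_ref, het, hom_alt).
cases = (60, 30, 10)
ctrls = (80, 18, 2)

# Dominant model: carriers have one or two alternate alleles.
dom_or, dom_p = association_2x2(cases[1] + cases[2], sum(cases),
                                ctrls[1] + ctrls[2], sum(ctrls))
# Recessive model: carriers are homozygous for the alternate allele.
rec_or, rec_p = association_2x2(cases[2], sum(cases), ctrls[2], sum(ctrls))

print("dominant  OR=%.2f p=%.4f" % (dom_or, dom_p))
print("recessive OR=%.2f p=%.4f" % (rec_or, rec_p))
```

For small counts, an exact test is usually preferred over the chi-square approximation.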
To analyze the results, sort and filter columns by clicking or right-clicking on the column header respectively. Apply filtering by defining cutoffs for different numerical values based on the distribution of the column data or by free text searching for string values.
Variants (or genes) of interest can be further annotated with drill-in reports. To view available drill-in report annotations, right-click on the row(s) of interest, select Drill in Reports, and choose from VEP, dbNSFP, dbSNP, HGMD, ClinVar, 1000G, and EVS annotations. This opens a pop-up window with the listed variant(s) and/or gene(s) with the selected annotations.
Additionally, the individual carriers for a given variant (or variants in a gene of interest) can be identified by selecting VarCaseCtrlCarriers for variants or GeneCaseCtrlCarriers for genes.
Selecting VarCaseCtrlCarriers opens a new window listing the selected variant with the associated PN and PNtype (Case/Control). To create a table with the list of PN carriers, highlight the PNs in the PN column, right-click, and select Open in new Grid. A new grid opens in the Sequence Miner window with the selected PNs, which can then be queried for associated phenotypes, further genomic analysis, or for confirmation of the variant in the aligned reads.
Gene-based association test (GAVA)¶
Select the Gene association (GAVA) report builder by clicking on it in the Reports menu grid.
With the case and control grids (tables) open in the Sequence Miner window, click the CASEs field to open a drop-down list of available open grids. Repeat these steps for the CTRLs field to select the control group as input.
Gene variants can be filtered by a number of parameters such as penetrance, inheritance model, and VEP consequence. In the remaining fields, select the filtering parameters to apply to the analysis. For example, to filter variants by VEP consequence, click inside the VEP_consequence field. A pop-up window opens which lists the VEP consequence categories and corresponding maximum impact (HIGH=LoF vars, MODERATE=missense vars/inframe indels, LOW=synonymous vars, LOWEST=TFBS vars). Additional filters are described in the panel on the left-hand side of the input fields.
After defining the parameters, initiate the analysis by clicking Create Report.
When the analysis is complete, a new table window labeled, for instance, “Gene_Association_1” opens in a new Sequence Miner window. The table includes one row per gene, with columns for chromosome, gene_start, gene_end, gene_symbol, and the GAVA results: a chi-square p-value (of the log-likelihood ratio), a permuted p-value (permutation chi-square p-value), the number of iterations used to calculate the permuted p-value, and the number of samples with low coverage for the given gene. The analysis and the outputs of this query are described in more detail in the Report Builders section.
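The idea behind a permuted p-value and its iteration count can be illustrated with a generic label-permutation test. This sketch is not the GAVA implementation: it replaces the log-likelihood ratio statistic with a simpler one (the absolute difference in carrier fraction between cases and controls), and the function name, data, and iteration count are hypothetical.

```python
import random

def permuted_p_value(burden, is_case, n_iter=1000, seed=0):
    """Permutation p-value for the difference in carrier fraction between groups."""
    rng = random.Random(seed)

    def stat(labels):
        case_vals = [b for b, c in zip(burden, labels) if c]
        ctrl_vals = [b for b, c in zip(burden, labels) if not c]
        return abs(sum(case_vals) / len(case_vals) - sum(ctrl_vals) / len(ctrl_vals))

    observed = stat(is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_iter):
        # Shuffle case/control labels while keeping each sample's burden fixed.
        rng.shuffle(labels)
        if stat(labels) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)

# Hypothetical per-sample indicator: 1 if the sample carries a qualifying variant in the gene.
burden = [1] * 15 + [0] * 5 + [1] * 3 + [0] * 17
is_case = [True] * 20 + [False] * 20

p = permuted_p_value(burden, is_case, n_iter=1000, seed=0)
print(p)
```

With this estimator, the smallest reportable p-value is 1 / (n_iter + 1), which is why the number of iterations matters when interpreting a permuted p-value.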
Sort and filter the resulting genes by clicking or right-clicking the column header respectively. Apply filtering by defining cutoffs for different numerical values such as the permutation chi-square p-value or by free text searching for string values of genes of interest in the gene_symbol column.
Genes of interest can be further annotated with drill-in reports. To view available drill-in report annotations, highlight and right-click on the row(s) of interest, select Drill in Reports to open a drop-down list of optional annotations, and choose from gene Pathways or gene Clinical disease gene annotations. The individual carriers for variants in the gene(s) of interest can also be identified by selecting GeneCarriers in the Drill in Reports drop-down list. Once a drill-in report is selected, a new pop-up window opens displaying the selected gene(s) with additional columns for the selected annotations.
Saving and exporting results¶
Results from the analysis can be saved by clicking the Save icon in the toolbar. Users have write permissions for the user_data folder. Select the user_data folder and provide a file name. Extensions are assigned based on data type: genomic output tables are saved as tab-delimited text files with a .gor extension, and grid files are saved as tab-delimited text files with a .rep extension. To export a file from Sequence Miner, select the file in the user_data folder, right-click on the file name, select Copy, and then paste it into a folder in your computer’s home directory.
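Because exported tables are tab-delimited text, they can be post-processed with ordinary scripting tools. The sketch below, run outside Sequence Miner, keeps the rows at or below a p-value cutoff and sorts them in ascending order. The column names (including `p_perm`), the gene symbols, and the values are hypothetical placeholders, not the actual output columns.

```python
import csv
import io

def top_hits(table_text, p_column, cutoff):
    """Return rows with a p-value at or below the cutoff, sorted by ascending p-value."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    rows = [r for r in reader if float(r[p_column]) <= cutoff]
    return sorted(rows, key=lambda r: float(r[p_column]))

# Hypothetical exported gene-level table (assumption for illustration).
exported = (
    "Chrom\tgene_start\tgene_end\tgene_symbol\tp_perm\n"
    "chr1\t1000\t5000\tGENE_A\t0.20\n"
    "chr2\t2000\t9000\tGENE_B\t0.001\n"
    "chr3\t3000\t7000\tGENE_C\t0.03\n"
)

hits = top_hits(exported, "p_perm", 0.05)
print([r["gene_symbol"] for r in hits])  # ['GENE_B', 'GENE_C']
```

For a file on disk, pass the file contents to the same function, for example by reading them with `open(path).read()`.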
Visualizing results in the Genome Browser¶
Variants (or genes) of interest can be confirmed in the aligned reads through the Genome Browser tool in Sequence Miner. The Genome Browser enables the visualization of any query results and raw data files, such as BAMs and VCFs, that contain a genomic coordinate.
To view, for example, the BAM files of a selected group of individuals, select the PNs of interest from a grid window. For example, select the new grid created from the variant carriers analysis. Next, highlight the list of PNs and right-click to select View Genome Tracks, or click the Genome Browser icon in the toolbar. You will be prompted to select a browser template file such as BAM (bam_tracks.gbt), coverage (cov_track.gbt), or VCF (var_tracks.gbt). In this example, select bam_tracks.gbt to confirm the presence of the variant in the aligned reads for the PN carriers.
Once the Genome Browser track has loaded, return to the results of the variant-based (Variant association) or gene-based (GAVA) association test. Highlight the row with the variant or gene of interest and click the Synchronize icon in the toolbar to center the browser window around the selected genomic coordinate.