4.1. Statistical Analysis of Genomic Association
A statistical analysis designed to detect significantly mutated gene pairs relies on the calculation of a contingency table. A gene g in a patient can assume three possible states:
w: the gene is mutated in neither normal nor tumor tissue;
s: the gene is mutated only in the tumor tissue (somatic mutation);
b: the gene is mutated in both normal and tumor tissues (germline mutation).
In this study, we focus on the patients holding the following relevant gene-pair mutations (RGPMs):
$(s, s)$: both genes present a somatic mutation, being mutated only in tumor tissue;
$(s, b)$, $(b, s)$: one gene carries a germline mutation (in both normal and tumor tissues), and the second gene presents a somatic variant (only in tumor tissue).
Considering the three possible gene states $i, j \in \{w, s, b\}$, the following notations can be adopted for calculating the contingency table:
$O_{ij}$ is the number of observed subjects in the cell $(i, j)$;
$E_{ij}$ is the expected number of subjects in the cell $(i, j)$;
$R_i$ is the row marginal frequency;
$C_j$ is the column marginal frequency, and
N is the total number of subjects.
For each pair of genes $g_1$ (rows) and $g_2$ (columns), we focus on the elements of the contingency table accounting for the three case studies mentioned above, $O_{ss}$, $O_{sb}$ and $O_{bs}$:
| w | s | b |
w | $O_{ww}$ | $O_{ws}$ | $O_{wb}$ |
s | $O_{sw}$ | $O_{ss}$ | $O_{sb}$ |
b | $O_{bw}$ | $O_{bs}$ | $O_{bb}$ |
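As an illustration of how such a table can be assembled, the following minimal sketch counts the state pairs of two genes across patients and derives the expected counts under independence from the marginals. The function names, the row/column order (w, s, b) and the example state lists are assumptions made for this sketch and are not taken from the study.

```python
from collections import Counter

STATES = ("w", "s", "b")  # assumed order: wild type, somatic, germline


def contingency_table(states_g1, states_g2):
    """Return the 3x3 table of observed counts (rows: gene 1, columns: gene 2)."""
    if len(states_g1) != len(states_g2):
        raise ValueError("both genes need exactly one state per patient")
    counts = Counter(zip(states_g1, states_g2))
    return [[counts[(i, j)] for j in STATES] for i in STATES]


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical per-patient states for two genes (illustrative values only).
g1 = ["w", "s", "s", "b", "w", "b", "s", "w"]
g2 = ["w", "s", "b", "s", "s", "w", "s", "w"]
observed = contingency_table(g1, g2)
expected = expected_counts(observed)
```

The three cells of interest are then `observed[1][1]`, `observed[1][2]` and `observed[2][1]` under the assumed ordering.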
To identify potential gene interactions, an independence test on the contingency table could be conducted. However, our specific interest lies in testing the dependency exhibited by the three pre-defined elements. For this purpose, we performed a test by considering contingency tables with identical marginal frequencies while fixing the three cells, aiming to find gene pairs with minimum dependency.
A 3 × 3 contingency table with fixed marginal frequencies has 4 degrees of freedom, so fixing the values of the three cells leaves only one degree of freedom. Consequently, our strategy consists of finding the optimal cell value that minimizes the G-test statistic. This is a minimization problem over a single variable subject to the non-negativity constraints of all the table cells.
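If we define $x$ as the new value of $O_{bb}$, then the other five cell values can be calculated as a function of $x$, and the $G$ statistic of the following table can be a function of $x$:
| w | s | b |
w | $x + k_{ww}$ | $O_{ws}$ | $k_{wb} - x$ |
s | $O_{sw}$ | $O_{ss}$ | $O_{sb}$ |
b | $k_{bw} - x$ | $O_{bs}$ | $x$ |
where $k_{ww} = O_{ww} - O_{bb}$, $k_{wb} = O_{wb} + O_{bb}$ and $k_{bw} = O_{bw} + O_{bb}$ are the constant values fixed by the marginal constraints.
Notice that under the null hypothesis of independence, the values $E_{ij} = R_i C_j / N$ are defined only on the marginals and therefore do not depend on $x$. Additionally, the values of $O_{ws}$ and $O_{sw}$ are constant when the other two cells in the same row or column are constant.
The cells must be positive or zero and satisfy the following positivity constraints:
$$x \ge 0, \qquad x + k_{ww} \ge 0, \qquad k_{wb} - x \ge 0, \qquad k_{bw} - x \ge 0.$$
The $G$ statistic can be defined as follows:
$$G(x) = 2 \sum_{i,j} O_{ij}(x) \ln \frac{O_{ij}(x)}{E_{ij}}.$$
After some simplifications, its differential can be expressed as:
$$\frac{dG}{dx} = 2 \ln \frac{x \, (x + k_{ww})}{(k_{wb} - x)(k_{bw} - x)}.$$
Thus, $dG/dx$ is null when
$$\hat{x} = \frac{k_{wb} \, k_{bw}}{k_{ww} + k_{wb} + k_{bw}}.$$
The second differential,
$$\frac{d^2 G}{dx^2} = 2 \left( \frac{1}{x} + \frac{1}{x + k_{ww}} + \frac{1}{k_{wb} - x} + \frac{1}{k_{bw} - x} \right),$$
has four terms, all of which are positive under the positivity constraints. Consequently, the function is strictly convex, and the value $\hat{x}$ corresponds to a minimum.
The denominator does not present any singularities because $k_{ww} + k_{wb} = O_{ww} + O_{wb} \ge 0$ and $k_{bw} = O_{bw} + O_{bb} \ge 0$. If the denominator in the equation for $\hat{x}$ is 0, all four corners of the table are 0, and the remaining table values are fixed. In such cases, we set $\hat{x} = 0$.
The positivity constraint $\hat{x} \ge 0$ is satisfied because both the numerator and the denominator are non-negative. Furthermore, since $k_{ww} + k_{bw} = O_{ww} + O_{bw} \ge 0$, $\hat{x} \le k_{bw}$, and symmetrically, $\hat{x} \le k_{wb}$.
For $\hat{x} + k_{ww} \ge 0$, the following conditions are valid. If $k_{ww} \ge 0$, then $\hat{x} + k_{ww} \ge 0$.
If $k_{ww} < 0$, then $\hat{x} + k_{ww} \ge 0$, but in the latter case, the proof is more technical. First, we observe that $k_{ww} + k_{wb} \ge 0$; thus, $k_{wb} \ge -k_{ww}$, and $k_{wb} > 0$. Similarly, $k_{bw} \ge -k_{ww} > 0$. If we fix $k_{ww}$ and define the function
$$f(u, v) = \frac{u v}{u + v + k_{ww}} + k_{ww},$$
then
$$\nabla f = \left( \frac{v \, (v + k_{ww})}{(u + v + k_{ww})^2}, \; \frac{u \, (u + k_{ww})}{(u + v + k_{ww})^2} \right).$$
The gradient is null when $u = v = -k_{ww}$, and it has only non-negative components in the quadrant $u \ge -k_{ww}$, $v \ge -k_{ww}$. Therefore, $f$ has a minimum at $(-k_{ww}, -k_{ww})$ with a value of $f = 0$ and $\hat{x} + k_{ww} = f(k_{wb}, k_{bw}) \ge 0$.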
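The minimization can be checked numerically. The sketch below is a minimal illustration, assuming a 3 × 3 table ordered w, s, b in which the three cells (s, s), (s, b), (b, s) and all marginals are kept, so that only the four corner cells move together. The closed form used for the new (b, b) value is the stationary point obtained by setting the derivative of G to zero under these assumptions; the function names and the example counts are hypothetical.

```python
import math

W, S, B = 0, 1, 2  # assumed row/column order: w, s, b


def g_statistic(table):
    """G = 2 * sum(O * ln(O / E)) with E = row total * column total / N."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # empty cells contribute zero
                g += obs * math.log(obs * n / (rows[i] * cols[j]))
    return 2.0 * g


def min_g_table(table):
    """Return (x, new_table), where x is the (b, b) value that minimizes G."""
    corners = table[W][W] + table[W][B] + table[B][W] + table[B][B]
    if corners == 0:
        x = 0.0  # all four corners are empty, nothing can move
    else:
        x = (table[W][B] + table[B][B]) * (table[B][W] + table[B][B]) / corners
    shift = x - table[B][B]
    new = [[float(v) for v in row] for row in table]
    new[B][B] += shift  # the diagonal corners move together ...
    new[W][W] += shift
    new[W][B] -= shift  # ... and the anti-diagonal corners compensate
    new[B][W] -= shift
    return x, new


# Hypothetical observed counts (illustrative only).
example = [[50, 5, 2], [4, 6, 3], [1, 2, 7]]
x_hat, balanced = min_g_table(example)
g_min = g_statistic(balanced)
```

In general the optimal value is not an integer, so the resulting table is a real-valued table that shares the marginals and the three fixed cells with the observed one.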
In summary, we have a value of $x$ that generates a new valid contingency table with a minimum $G$ statistic for any table. We take this minimum statistic as a measure of dependency of the three relevant cells we have focused on. To calculate the p-value, we generate random permutations of subjects for one of the genes and calculate the values of the minimum statistic for ∼10 billion tables. The whole process takes ∼10 h on a 2.6 GHz computer with 16 processors.
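The permutation scheme can be sketched as follows. This is a small pure-Python illustration, not the implementation used in the study: the state order (w, s, b), the function names, the example data, the random seed and the (hits + 1) / (permutations + 1) estimator are all assumptions of the sketch, and the scale reported above (about 10 billion tables) would require a far more optimized, parallel implementation.

```python
import math
import random
from collections import Counter

STATES = ("w", "s", "b")  # assumed row/column order


def min_g(states_g1, states_g2):
    """Minimum G over tables sharing the margins and the (s,s), (s,b), (b,s) cells."""
    counts = Counter(zip(states_g1, states_g2))
    t = [[float(counts[(i, j)]) for j in STATES] for i in STATES]
    corners = t[0][0] + t[0][2] + t[2][0] + t[2][2]
    x = (t[0][2] + t[2][2]) * (t[2][0] + t[2][2]) / corners if corners else 0.0
    shift = x - t[2][2]
    t[0][0] += shift
    t[2][2] += shift
    t[0][2] -= shift
    t[2][0] -= shift
    rows = [sum(r) for r in t]
    cols = [sum(c) for c in zip(*t)]
    n = sum(rows)
    g = 0.0
    for i in range(3):
        for j in range(3):
            if t[i][j] > 0:
                g += t[i][j] * math.log(t[i][j] * n / (rows[i] * cols[j]))
    return 2.0 * g


def permutation_p_value(states_g1, states_g2, n_permutations=1000, seed=0):
    """Share of shuffled tables whose minimum G reaches the observed one."""
    rng = random.Random(seed)
    observed = min_g(states_g1, states_g2)
    shuffled = list(states_g2)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute the subjects of one gene only
        if min_g(states_g1, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical states for 12 patients (illustrative only).
g1 = ["s", "s", "s", "b", "b", "w", "w", "w", "w", "s", "b", "w"]
g2 = ["s", "s", "b", "s", "s", "w", "w", "w", "b", "s", "s", "w"]
p_value = permutation_p_value(g1, g2, n_permutations=200, seed=1)
```

Shuffling the states of a single gene keeps both marginal distributions intact while breaking any association between the two genes, which is what the null hypothesis requires.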
The mutation dependency of two genes can be related to observing more or fewer subjects than expected in a cell. We assume that tumors are caused by the accumulation of mutations. Thus, we retain statistically significant gene pairs where at least one of the three cells has more observations than expected.
We select the gene pairs that pass the Bonferroni correction with the FWER (family-wise error rate) Q of 0.05 and a total number of experiments equal to the number of considered contingency tables.
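The selection rule amounts to comparing each p-value with the error rate divided by the number of tests. A minimal sketch, in which the function name, the gene labels and the p-values are hypothetical:

```python
def bonferroni_select(p_values, n_tests=None, alpha=0.05):
    """Keep the gene pairs whose p-value is at most alpha / n_tests."""
    if n_tests is None:
        n_tests = len(p_values)  # one test per considered contingency table
    threshold = alpha / n_tests
    return {pair: p for pair, p in p_values.items() if p <= threshold}


# Hypothetical p-values for three gene pairs (illustrative only).
p_values = {
    ("GENE_A", "GENE_B"): 1e-9,
    ("GENE_A", "GENE_C"): 0.02,
    ("GENE_B", "GENE_C"): 0.01,
}
selected = bonferroni_select(p_values)
```

With three tests the threshold is 0.05 / 3, so only pairs with a p-value below roughly 0.0167 are retained.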
4.2. Survival Analysis
We analysed the survival data available at the TCGA to select possible clinically relevant epistatic gene pairs. We used the overall survival (OS) time rather than the disease-free interval (DFI) time because the latter is related to cancer recurrence and would therefore be more appropriate for a study on treatment effectiveness. Since our study refers to the overall malignancy of the mutations, we considered the OS time.
For each gene pair, we categorized the patients into two groups based on their gene mutation state pair $(i, j)$ and compared their OS times. The first group consisted of patients holding one of the three RGPMs $(s, s)$, $(s, b)$, and $(b, s)$. The second group comprised individuals who had either a germline or somatic variant in only one gene of the pair. The mutation states of these individuals, corresponding to the elements $O_{ws}$, $O_{wb}$, $O_{sw}$, and $O_{bw}$ of the contingency table, are referred to as background single-gene mutations (BSGMs).
To estimate the survival function, we used the Kaplan–Meier estimate and the Cox hazard model available in the lifelines [28] package for Python.
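For readers unfamiliar with the estimator, the following dependency-free sketch shows what a Kaplan–Meier curve computes from OS times and censoring flags. It is not the lifelines API and not the study's code; the function name and the example data are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an observed death.

    durations: follow-up time per patient; events: 1 if the death was
    observed, 0 if the patient was censored at that time.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Hypothetical OS times (months) and event flags for one patient group.
os_time = [5, 8, 8, 12, 15]
observed_death = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(os_time, observed_death)
```

Computing one such curve per group (RGPM carriers versus BSGM carriers) gives the two survival functions that are compared for each gene pair.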
Of the gene pairs that showed significant gene associations, we selected the ones that showed a significant difference in the survival expectancy between double-mutated and non-double-mutated, with a Bonferroni FWER q-value of 0.05.
For each gene pair, we also considered the two subject groups defined above to check for dependencies with clinical variables, such as pathological tumor stage, gender, and age at initial pathological diagnosis. We used logistic regression to find associations between the group and the clinical variables. There were 10 levels for the tumor stage that were modelled numerically; the “discrepancy” labels were also changed to “Not Available”.
Algorithm Overview
The algorithm can be summarized as follows (Algorithm 1):
Algorithm 1: Cancer Epistatic Genes Finder (CEG-Finder).
For each pair of genes:
1. Calculate the contingency table for the values $O_{ij}$
2. If $O_{ij} > E_{ij}$ at some of the three cells $(s, s)$, $(s, b)$, $(b, s)$:
- 2.1. Calculate the optimal $x$ and the new contingency table
- 2.2. Perform the G-test for the new table and calculate the p-value
Select the pairs that pass the Bonferroni FWER control
Apply the survival analysis and select gene pairs with significantly lower OS time
4.3. Dataset Composition and Processing
We collected two datasets, one consisting of samples of colon adenocarcinoma (COAD) and the other of samples of lung adenocarcinoma (LUAD), both of which were released by the TCGA consortium. For each patient, only one pair of samples was considered. The two datasets consist of 422 and 405 unique paired samples (normal/tumor) for COAD and LUAD, respectively. The VCF files associated with each sample were obtained considering the GRCh37 and GRCh38 reference human genomes [41] for COAD and LUAD, respectively. The annotation of mutations was performed using Annovar [42].
We collected the variants affecting a protein sequence (VAPs) in each sample: non-synonymous single-nucleotide variants, frameshift deletions, frameshift insertions, stop-gain, stop-loss, non-frameshift deletions and non-frameshift insertions. We assumed that any VAP can impair gene function.
The considered VAPs were either labelled as “PASS” in the “Filter” column of the VCF or met specific criteria for the read depth (DP) and for the reads supporting the alternative allele. In detail, we filtered out the VAPs for which the DP was lower than 10 reads and for which the fraction of reads supporting the alternative allele (reads supporting the alternative allele divided by read depth) was lower than 10%. Additionally, we excluded from the analysis the VAPs with an excess of mutated cases in tumor tissue with respect to normal tissue because they usually correspond to sequencing artefacts. To reduce the impact of these artefacts, we retained only the VAPs whose number of somatic cases exceeds the number of cases with a germline mutation by three subjects or fewer.
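The read-level criteria can be expressed as a small predicate. The sketch below is one possible reading of the rule, assuming that a call not labelled “PASS” is retained only when it has at least 10 reads and at least 10% of them support the alternative allele; the function name and parameter names are hypothetical, and the subject-level artefact filter is not covered.

```python
def keep_vap(filter_field, depth, alt_reads, min_depth=10, min_alt_fraction=0.10):
    """Decide whether a variant call is retained (assumed reading of the rule)."""
    if filter_field == "PASS":
        return True  # calls flagged PASS in the VCF are always kept
    if depth < min_depth:
        return False  # too few reads covering the locus
    # Fraction of reads supporting the alternative allele.
    return alt_reads / depth >= min_alt_fraction


# Hypothetical calls: (FILTER value, read depth, alternative-allele reads).
calls = [("PASS", 6, 1), (".", 8, 4), (".", 40, 2), (".", 40, 5)]
kept = [call for call in calls if keep_vap(*call)]
```

The two thresholds are exposed as parameters so that a stricter or looser reading of the criteria can be tested without changing the function.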
A gene is considered mutated with respect to the reference genome if it contains any type of mutation at any locus. To reduce the number of genes tested, we filtered out all the genes that were mutated in the tumor tissue of less than 5% of the subjects.
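The frequency filter can be illustrated with a short sketch, assuming that mutated subjects are stored as one set of subject identifiers per gene; the function name, gene labels and identifiers are hypothetical.

```python
def frequently_mutated_genes(tumor_mutations, n_subjects, min_fraction=0.05):
    """Return the genes mutated in tumor tissue in at least min_fraction of subjects.

    tumor_mutations maps a gene name to the set of subject ids carrying
    any mutation of that gene in the tumor sample.
    """
    return sorted(
        gene
        for gene, subjects in tumor_mutations.items()
        if len(subjects) / n_subjects >= min_fraction
    )


# Hypothetical cohort of 40 subjects (illustrative only).
tumor_mutations = {
    "GENE_A": {"P01", "P02", "P03"},  # 7.5% of subjects, kept
    "GENE_B": {"P04"},                # 2.5% of subjects, filtered out
    "GENE_C": {"P01", "P05", "P06", "P07"},
}
tested_genes = frequently_mutated_genes(tumor_mutations, n_subjects=40)
```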
In addition, all pseudogenes, all genes associated with olfactory receptors, and the macro-gene TTN (titin) were eliminated.
To identify pairs of genes that are not individually associated with cancer, we excluded from our analysis all the driver genes for lung or colon adenocarcinoma reported by Intogen [43].
Among the 422 patients affected by COAD, the OS time was available for 403 of them, while the information on 93 individuals was censored. In the case of the LUAD cohort, which comprises 405 subjects, the OS time is available for 392 patients, while data for 115 individuals are censored.