Comparative Analysis of Gene Correlation Networks of Breast Cancer Patients Based on Mutations in TP53

Breast cancer is one of the most prevalent cancers in females, with more than 450,000 deaths each year worldwide. Among the subtypes of breast cancer, basal-like breast cancer, also known as triple-negative breast cancer, shows the lowest survival rate and does not have effective treatments yet. Somatic mutations in the TP53 gene frequently occur across all breast cancer subtypes, but comparative analysis of gene correlations with respect to mutations in TP53 has not been done so far. The primary goal of this study is to identify gene correlations in two groups of breast cancer patients and to derive potential prognostic gene pairs for breast cancer. We partitioned breast cancer patients into two groups: one group with a mutated TP53 gene (mTP53) and the other with a wild-type TP53 gene (wtTP53). For every gene pair, we computed the hazard ratio using the Cox proportional hazard model and constructed gene correlation networks (GCNs) enriched with prognostic information. Our GCN is more informative than typical GCNs in the sense that it indicates the type of correlation between genes, the concordance index, and the prognostic type of a gene. Comparative analysis of correlation patterns and survival time of the two groups revealed several interesting findings. First, we found several new gene pairs with opposite correlations in the two GCNs and the difference in their correlation patterns was the most prominent in the basal-like subtype of breast cancer. Second, we obtained potential prognostic genes for breast cancer patients with a wild-type TP53 gene. From a comparative analysis of GCNs of mTP53 and wtTP53, we found several gene pairs that show significantly different correlation patterns in the basal-like breast cancer subtype and obtained prognostic genes for patients with a wild-type TP53 gene. The GCNs and prognostic genes identified in this study will be informative for the prognosis of survival and for selecting a drug target for breast cancer, in particular for basal-like breast cancer. To the best of our knowledge, this is the first attempt to construct GCNs for breast cancer patients with or without mutations in the TP53 gene and to find prognostic genes accordingly.


Introduction
Breast cancer is the most common cancer in women worldwide and the second most common cancer overall. More than 1,300,000 patients and 450,000 deaths occur worldwide each year due to breast cancer. The disease is heterogeneous and classified into several subtypes, including luminal, HER2-enriched, and basal-like subtypes, based on the expression of estrogen receptor (ER), progesterone receptor (PR), and the receptor tyrosine kinase ErbB-2 (HER2). The effect of treatments is different in different subtypes, so each subtype of breast cancer requires a different treatment. The HER2-enriched subtype of breast cancer shows the highest survival rate of HER2-targeted treatments [1,2]. In contrast, the basal-like subtype of breast cancer, also known as triple-negative breast cancer (TNBC), shows the lowest survival rate and does not have effective treatments yet [3,4]. Recent studies reported that the TNBC subtype is much more difficult to treat than the other subtypes of breast cancer due to the lack of experimental data [5][6][7] and the interaction between CTL (cytotoxic T lymphocytes) and TNBC cells [8].
Several genes associated with breast cancer risk have been identified. Perhaps the most well-known breast cancer genes are BRCA1 and BRCA2. These genes are involved in DNA repair, and mutations in them are associated with a significant increase in the risk of breast cancer and ovarian cancer. Other genes such as CHEK2 and ATM also function in DNA repair, but their role in breast cancer development is not fully understood.
Many studies for finding prognostic genes for cancer have focused on identifying individual genes that are differentially expressed in cancer cells and normal cells. For example, Gov and Arga [9] identified a prognostic gene module by comparing gene expression levels in epithelial cells from the ovarian tumor and healthy samples. Other studies by Shi et al. [10], Clarke et al. [11], Yang et al. [12], and Paci et al. [13] attempted to find prognostic genes by analyzing network modules in gene correlation networks using the R package WGCNA [14].
The primary focus of this study is to construct gene correlation networks for the prognosis of breast cancer. In an effort to find potential prognostic gene pairs for breast cancer, we partitioned the patients with breast cancer into two groups, one with a widetype TP53 gene (hereafter called wtTP53) and the other with somatic mutations in the TP53 gene (called mTP53), and constructed gene correlation networks for the two groups. There are several reasons that we selected the TP53 gene for classifying gene expression samples. First, TP53 mutations occur frequently in all breast cancer subtypes. Second, TP53 mutations are known to be associated with poor prognosis in breast cancer [15,16] and other cancers [17].
Our approach is different from the other works [10][11][12] in several ways: (1) We divided breast cancer patients into two groups according to the existence of somatic mutations in TP53, whereas the other works treated breast cancer patients as a whole or divided them into predicted molecular subtypes. (2) They built weighted gene correlation networks using the R package WGCNA, but we did not use the WGCNA package to construct correlation networks. (3) They attempted to find prognostic genes by analyzing network modules found by WGCNA, but we selected genes with significantly different correlations in two groups and performed the log-rank test with different combinations of the genes to find potential prognostic gene pairs. The rest of this paper presents our approach to constructing GCNs for two groups, the results of a comparative analysis of GCNs, and the best prognostic gene pairs found for the two groups. To the best of our knowledge, this is the first attempt to construct GCNs for breast cancer patients with or without mutations in the TP53 gene and to obtain prognostic gene pairs accordingly. One important finding of our study is that the prognostic power of a gene pair can be substantially different depending on the existence of a mutant TP53 gene and that a gene pair is better than a single gene in predicting the survival time of patients.

Materials and Methods
This section discusses our approach to constructing gene correlation networks (GCNs) and obtaining prognostic gene pairs for breast cancer patients. The schematic overview of our approach is shown in Figure 1.

Gene Expression Data of Breast Cancer Patients and Grouping of Patients
We obtained gene expression data of the primary tumor samples of 1050 breast cancer patients in the TCGA-BRCA project [18] from the NIH GDC Data Portal (https://portal. gdc.cancer.gov/repository accessed on 2 December 2019). The number of patients tested for somatic mutations was 750, so the remaining 300 patients were excluded. As a result, a total of 750 TCGA-BRCA patients were analyzed in our study (Supplementary Material S1).  Table 1 shows the number of breast cancer patients and their subtypes based on the PAM50 gene expression assay [19]. A cancer is called estrogen-receptor-positive (or ER+) if it has receptors for estrogen. Similarly, a cancer is called progesterone receptorpositive (PR+) if it has receptors for progesterone. Among the breast cancer subtypes, the normal-like, luminal A, luminal B, and HER2-enriched subtypes have favorable clinical outcomes. The majority of luminal A is ER+/PR+ and HER2-, whereas around 30% of luminal-B are ER+/PR+ and HER2+ [20,21]. On the other hand, the basal-like subtype is hormone receptor-negative (ER-, PR-, and HER2-) and is also called triple-negative breast cancer (TNBC). Table 1. The number and proportion (in parentheses) of breast cancer cases from TCGA and their subtypes based on the PAM50 classification [19]. The proportion represents the ratio of the cases to the entire cases of the same subtype. mTP53: breast cancer patients with somatic mutations in the TP53 gene. wtTP53: breast cancer patients with a wild-type TP53 gene. Regardless of the subtypes, we partitioned the 750 breast cancer patients into two groups: a group of 496 patients with a wild-type TP53 gene (called wtTP53 in this paper) and another group of 254 patients with somatic mutations in the TP53 gene (called mTP53). The reason that we divided the patients into two groups with respect to the existence of somatic mutations in the TP53 gene is that TP53 mutations occur in all breast cancer subtypes and TP53 mutations are known to be associated with poor prognosis in breast cancer [15,16].

Selecting Core Genes
We obtained a total of 516 genes associated with BRCA (74 elite and 442 related genes) from MalaCards database [22]. What we call core genes are comprised of the elite genes of MalaCards and the genes that are potentially associated with breast cancer. We first obtained 74 elite genes of breast cancer from MalaCards, which are defined as those with strong evidence for association with the disease. For comparing core genes and PAM50 [19], we selected an additional 50 genes of PAM50.

The Prognostic Power of Core Genes
According to the Genomic Data Commons (GDC) (https://gdc.cancer.gov/clinicaldata-elements accessed 2 December 2019), the TCGA clinical data can have three properties for survival analysis: 1.
days_to_last_follow_up: time interval from the date of initial pathologic diagnosis to the date of the last follow-up, represented as a calculated number of days 2.
days_to_death: the number of days from the date of the initial pathologic diagnosis to the date of death for the case in the investigation 3.
vital_status: the state of being living or deceased for cases that are part of the investigation In the above three properties, we used the first two properties (days_to_last_follow_up and the days_to_death), which correspond to overall survival. The overall survival time represents the time interval from the date of initial pathologic diagnosis to the date of either death or last follow-up.
We evaluated the prognostic power of core genes by the Cox proportional hazard model [23]. For each pair of core genes in wtTP53 and mTP53, we computed a gene pair of the hazard ratio of the Cox model in the data.

Constructing Gene Correlation Networks
For every pair of core genes in wtTP53 and mTP53, we computed the Pearson correlation coefficients (PCC) between their expression levels by Equation (1). In the equation, N is the number of patients andx is the mean of x.
We then constructed gene correlation networks (GCNs) for wtTP53 and mTP53. In GCNs, PCCs were used as the weights of the edges between genes. From the comparative analysis of GCNs for wtTP53 and mTP53, we selected gene pairs that satisfy three criteria: (1) PCCs of the gene pair have the p-value of PCC < 0.05 in n normal samples of GTEx breast, (2) we added a single sample of the patient to the n normal samples and selected those gene pairs with a p-value of PCC < 0.05 after recomputing PCC with n + 1 samples, (3) PCCs of the gene pair from two GCNs (n normal samples and n + 1 samples) should be significantly different between the normal samples and the patient sample by Equation (2).

Finding Prognostic Genes for Two Groups of Patients
We performed the log-rank test [24] separately for the two groups. In each of the wtTP53 and mTP53 groups, we clustered the patients further into two clusters by hierarchical clustering and performed the log-rank test for gene pairs of core genes with the p-value of PCC < 0.05 to examine the difference in their survival time and obtained the p-value of the test. It should be noted that patients can be clustered differently depending on the gene pairs used for clustering. For each group, we selected the gene pairs which has the lowest p-value in the log-rank test. The p-value was adjusted using the Benjamini-Hochberg procedure, which consists of the following steps to control the false discovery rate at level α.

2.
Find the rank j for which p j ≤ j m α.
p j ≤ j m α in the second step can be transformed to p j m j ≤ α, so min(1, p j m j ) was used as an adjusted p-value.
To find the gene pairs for prognosis of survival time of the wtTP53 and mTP53 groups, we computed n( k 2 ) ∆PCCs for k core genes in n tumor samples.

Prognostic Genes and Gen Pairs for Two Patient Groups
There were 105 differentially expressed genes more than twice in one group than the other with an adjusted p-value < 0.05. Figure 2A shows a volcano plot of differentially expressed genes in two groups, wtTP53 and mTP3. Among the 516 core genes, 321 genes with an adjusted p-value < 0.01 are shown in the volcano plot (Supplementary Material S2).
To find prognostic gene pairs for the wtTP53 and mTP53 groups, we computed 496 · ( 516 2 ) = 65,903,520 ∆PCCs for wtTP53, 261 · ( 516 2 ) = 34,679,070 ∆PCCs for mTP53 (see Table 1 for the number of tumor samples). PCCs from the 100 normal samples and 100 + 1 samples groups were compared using the R package cocor [28]. We selected only those gene pairs with ∆PCC from the normal samples greater than the average of ∆PCCs of gene pairs. There were a total of 46,878 distinct gene pairs in two groups of wtTP53 and mTP53. The gene pairs are available in Supplementary Material S4. Table 2 shows 10 gene pairs with the highest hazard ratios in patients with wtTP53, and Table 3 shows 9 gene pairs with the highest hazard ratios in patients with mTP53. Figure 3 shows the Kaplan-Meier plots [29] of prognostic gene pairs for wtTP53 and mTP53. For comparative purposes, the Kaplan-Meier plots of the same prognostic genes are displayed for mTP53 or wtTP53 as well. For example, an adjusted p-value [30] of the log-rank test of the gene pair MAPK10_PTK6 < 0.05 in wtTP53 (the left plot of Figure 3A) but an adjusted p-value of the log-rank test of the gene pair > 0.05 in mTP53 (the right plot of Figure 3A). On the other hand, an adjusted p-value of the log-rank test of the gene pair KDM5B_ST14 < 0.05 in mTP53 (the left plot of Figure 3B) but an adjusted p-value of the log-rank test of the gene pair > 0.05 in wtTP53 (the right plot of Figure 3B). The results of the prognostic analysis suggest that prognostic gene pairs for breast cancer can be significantly different depending on the existence of a mutant TP53 gene. The log-rank test and the Cox PH model of the gene pairs are available in Supplementary Material S5. Volcano plot for comparing differentially expressed genes in two groups, wtTP53 and mTP3. The horizontal axis represents the fold change (FC) between the two groups on a log2 scale, and the vertical axis shows the negative logarithm to the base 10 of p values from the t-test. A gene with a higher expression level in mTP53 than in wtTP53 has a positive FC and is shown as a red dot. A gene with a lower expression level in mTP53 than in wtTP53 has a negative FC and is shown as a blue dot. Top 10 genes with low adjusted p-values by Benjamini-Hochberg procedure are labeled with their names. (B) GO circle plot and GO terms for genes with significant FC. Genes with higher expression levels in mTP53 than in wtTP53 are shown as red dots, and genes with lower expression levels in mTP53 are shown as blue dots. The z-score is the number of overexpressed genes minus the number of underexpressed genes divided by the square root of the count.

A Comparison of TCGA-BRCA and an Independent Validation Cohort
For the independent validation cohort, we obtained gene expression data (Dataset ID: EGAD00010000210) of 997 breast cancer patients in the METABRIC [31] from the European Genome-phenome Archive (EGA). There are 721 breast cancer patients with wtTP53, 99 patients with mTP53, and 177 patients having no information about the mutation type of the TP53 gene. There are seventeen patients in wtTP53 without the survival time, so we selected 704 patients in wtTP53 of METABRIC.
For comparative purposes, we computed 704 · ( 516 2 ) = 93,540,480 ∆PCCs for wtTP53 in the METABRIC, 99 · ( 516 2 ) = 13,154,130 ∆PCCs for mTP53 in the METABRIC. Figure 4 shows that the highest hazard ratio of gene pair MAPK10_PTK6 for the prognosis of survival of wtTP5 in TCGA-BRCA is shown a similar prognosis of survival of mTP53 in METABRIC. On the other hand, the highest hazard ratio of gene pair KDM5B_ST14 for mTP53 in TCGA-BRCA has no significant p-value for the log-rank test for the prognosis of survival of mTP53 in METABRIC. The log-rank test and the Cox PH model of the gene pairs of METABRIC are available in Supplementary Material S6.

Comparison of Gene Correlations with Gene Expressions in Classifying Subtypes of Breast Cancer
Classification of subtypes of breast cancer is not the focus of our study, but we investigated whether PCCs and ∆PCCs of gene pairs are useful in classifying subtypes of breast cancer as well. We constructed a 4-class (basal-like, HER2-enriched, luminal A, and luminal B) classifier with linear regression in Keras (https://keras.io accessed on 2 December 2019) on TensorFlow (https://www.tensorflow.org accessed on 2 December 2019). The classifier consists of 9 layers (including input and output layers). Default values were used for parameters β1 and β2 in the Adam optimizer (https://keras.io/api/optimizers/adam accessed on 2 December 2019). For classification, we additionally computed 757 · ( 50 2 ) = 927,325 ∆PCCs (among 757 ∆PCCs, 496 are for wtTP53 and 261 are for mTP53). 70% of the dataset was used for training the classifier, and the remaining 30% was used for testing the classifier.
For comparative purposes, we constructed two classification models which use different features. As for features, one model used PCCs and ∆PCCs of gene pairs of 50 genes in PAM50. The other model used gene expression levels of 50 genes in PAM50 for features. As shown in Table 4, the model using PCCs and ∆PCCs of gene pairs exhibited better performance than the model using gene expressions of genes in all subtypes. In particular, the model with PCCs and ∆PCCs was consistently better than the other model in F-score and MCC. These results indicate that PCCs and ∆PCCs of gene pairs can be useful in classifying subtypes of breast cancer as well.

Discussion
Breast cancer is one of the most heterogeneous cancers with many subtypes, and different subtypes require different treatments. A subtype such as basal-like breast cancer, also known as triple-negative breast cancer (TNBC), shows the lowest survival rate and does not have targeted treatments yet. Due to the lack of biomarkers that can distinguish TNBC from other subtypes and definitive targets, there are few therapeutic interventions for TNBC [32] and chemotherapy has been the primary treatment in the past decades [33].
In an effort to find potential prognostic biomarkers for breast cancer, we analyzed gene expression data of 750 breast cancer patients. For the two groups of patients, we constructed gene correlation networks (GCNs) enriched with prognostic information. Our GCN is more informative than typical GCNs in the sense that it indicates the type of correlation between genes, the concordance index, and the prognostic type of a gene (Supplementary Material S2).
We performed principal component analysis (PCA) of ∆PCCs of gene pairs in 750 breast cancer patients using the scikit-learn package (https://scikit-learn.org/ accessed on 2 December 2019). It is interesting to note that ∆PCCs of gene pairs in the basal-like subtype (shown in purple dots in Figure 5) are very different from those in the other subtypes (luminal A, luminal B, and HER2-enriched subtypes) of breast cancer. The R package WGCNA is perhaps the most widely used software for analyzing inter-gene correlations from gene expression data. With the same gene expression data of 455 core genes used in our work, we analyzed gene correlation and constructed a gene correlation network using WGCNA. As shown in Supplementary Material S7, four modules were found by WGCNA, and the turquoise color module contains many gene pairs specific to wtTP53. No modules specific to mTP53 were found by WGCNA. A comparison of our GCN for wtTP53 and the GCN found by WGCNA shows that the two GCNs have 94 genes and 77 gene pairs in common. However, more than 59% (136 out of 230) of the total genes and 95% (1673 out of 1750) of the total gene pairs are found in one GCN only. The 108 genes and 489 gene pairs in our GCN could not be found by WGCNA. WGCNA is useful for automatically finding modules from gene expression data and analyzing them but is not convenient for comparing user-specified groups. As mentioned earlier, a comparison of the patients with wtTP53 to the patients with mTP53 cannot be made using WGCNA because WGCNA cannot find a module specific to wtTP53 or mTP53.
To examine the sensitivity of potential prognostic gene pairs to possible outliers in gene expressions, we randomly removed 10% of breast cancer samples and performed the survival analysis with the remaining 90% of the samples. Due to the randomness of selected samples, the survival analysis was repeated three times in each of the log-rank test and Cox proportional hazard (Cox PH) models. In patients with wtTP53, eight potential prognostic gene pairs out of the ten prognostic gene pairs shown in Table 2 remained significant. In the log-rank test with 90% of patients with mTP53, there were no gene pairs with the adjusted p-value (FDR) < 0.05. However, in the Cox proportional hazard model, five gene pairs out of the nine prognostic gene pairs (Table 3) were significant (that is, their p-values of the Cox PH model < 0.0001). One reason for finding fewer potential prognostic gene pairs in 90% of the patients with mTP53 than in 90% of the patients with wtTP53 seems to be the smaller number of patients with mTP53. Initially, the number of patients with wtTP53 was 496, whereas the number of patients with mTP53 was 254, which is about half of the patients with wtTP53. With more patients available for the survival analysis, we expect the potential prognostic gene pairs to be more stable and less sensitive to outliers. Details are available in Supplementary Material S8.
We also tried Spearman's rank correlation coefficient (SCC) to find potential prognostic gene pairs. Ten potential prognostic gene pairs derived with SCC in patients with wtTP53 are given in Supplementary Material S9. Among the ten potential prognostic gene pairs, AKIP1_SUSD2 is the only one common to the potential prognostic gene pairs derived with PCC (Supplementary Material S5). In patients with mTP53, 16 gene pairs were found as potential prognostic gene pairs from the survival analysis with SCC. There were no prognostic gene pairs common to the 16 gene pairs derived with SCC and the nine prognostic gene pairs derived with PCC (Table 3). These results indicate that potential prognostic gene pairs can be changed depending on the types of gene correlations, but the way of deriving potential prognostic gene pairs still works with different types of correlations.
The main contribution of our work is that we performed a comparative analysis of two groups of breast cancer patients based on mutations in the TP53 gene and constructed gene correlation networks for the two groups. We also derived potential prognostic gene pairs from the networks and observed that the two groups do not share potential prognostic genes. As for another contribution of our work, we discovered that the prognosis of breast cancer patients is significantly different depending on the existence of the mutant TP53 gene and that gene pairs are more prognostic of survival than single genes in patients with a wild-type TP53 gene.
Although the results of our approach are promising in finding potential prognostic gene pairs in breast cancer, there are a few limitations. First, potential prognostic gene pairs can be changed depending on the types of correlations used for deriving gene pairs. Second, potential prognostic gene pairs may be sensitive to outliers in gene expressions when there are not enough samples.

Conclusions
In this study, we partitioned the 750 breast cancer patients into two groups, one with a wild-type TP53 gene (wtTP53) and the other with somatic mutations in the TP53 gene (mTP53), and constructed gene correlation networks (GCNs) enriched with prognostic information. Comparative analysis of the two GCNs revealed several interesting findings. First, many genes show different expression levels in the two patient groups. Second, but more importantly, there are several gene pairs in the two GCNs and their difference in correlation patterns is the most prominent in the basal-like subtype of breast cancer.
From the survival analysis, we identified three prognostic gene pairs {MAPK10_PTK6, ECT2_HIF1A-AS2, HIF1A-AS2_KIF15} for the wtTP53 group, and another three prognostic genes {KDM5B_ST14, NAT2_PBOV1, KIT_RHOBTB2} for the mTP53 group. These results suggest that the prognosis of breast cancer patients is significantly different depending on the existence of the mutant TP53 gene and that gene pairs are more prognostic of survival than single genes in patients with a wild-type TP53 gene.
The enriched GCNs and prognostic gene pairs identified in this study will be informative for the prognosis of survival and for selecting a drug target for breast cancer, in particular for the basal-like subtype of breast cancer. The GCNs and data are available at http://bclab.inha.ac.kr/brca accessed on 2 December 2019.