Discovering Common miRNA Signatures Underlying Female-Specific Cancers via a Machine Learning Approach Driven by the Cancer Hallmark ERBB

Big data processing, using omics data integration and machine learning (ML) methods, drive efforts to discover diagnostic and prognostic biomarkers for clinical decision making. Previously, we used the TCGA database for gene expression profiling of breast, ovary, and endometrial cancers, and identified a top-scoring network centered on the ERBB2 gene, which plays a crucial role in carcinogenesis in the three estrogen-dependent tumors. Here, we focused on microRNA expression signature similarity, asking whether they could target the ERBB family. We applied an ML approach on integrated TCGA miRNA profiling of breast, endometrium, and ovarian cancer to identify common miRNA signatures differentiating tumor and normal conditions. Using the ML-based algorithm and the miRTarBase database, we found 205 features and 158 miRNAs targeting ERBB isoforms, respectively. By merging the results of both databases and ranking each feature according to the weighted Support Vector Machine model, we prioritized 42 features, with accuracy (0.98), AUC (0.93–95% CI 0.917–0.94), sensitivity (0.85), and specificity (0.99), indicating their diagnostic capability to discriminate between the two conditions. In vitro validations by qRT-PCR experiments, using model and parental cell lines for each tumor type showed that five miRNAs (hsa-mir-323a-3p, hsa-mir-323b-3p, hsa-mir-331-3p, hsa-mir-381-3p, and hsa-mir-1301-3p) had expressed trend concordance between breast, ovarian, and endometrium cancer cell lines compared with normal lines, confirming our in silico predictions. This shows that an integrated computational approach combined with biological knowledge, could identify expression signatures as potential diagnostic biomarkers common to multiple tumors.


Introduction
Breast Cancer (BC) is the most common cancer in women. Among gynecological tumors, uterine endometrial corpus cancer (UCEC) is the most frequent in developed countries, while ovarian cancer (OV) is the most lethal [1,2]. Early detection of these two gynecological cancers is challenging as effective screening methods are lacking, whilst the heterogeneity of breast cancer strongly impacts its prognostic and clinical outcomes [3,4].
Hence, the use of multidisciplinary approaches to exploit genotype-phenotype relationships could aid diagnosis and treatments by identifying novel, non-invasively detectable biomarkers, and through the discovery of novel molecular targets [5]. Nowadays, omics data integration and machine learning (ML) methods are at the forefront of big model on integrated miRNA omics and network data, and were able to predict common BC, UCEC, and OV deregulated miRNAs. Some of these are potentially able to target the ERBB family. To address these issues, we used the TCGA and miRTarBase databases to construct a weighted Support Vector Machine (SVM) learning model. We validated in vitro the most promising common miRNA signatures, using cancer and parental cell lines for each tumor type. Collectively, our results highlight common deregulated expression signatures for the estrogen-dependent cancers (BC, UCEC, and OV) analyzed here.
We processed miRNA profiles of 2255 female samples, including primary tumor tissues (1096 TCGA-BRCA, 545 TCGA-UCEC, and 490 TCGA-OV)) and solid tissue normal for TCGA-BRCA (104 cases) and TCGA-UCEC (33 cases). Ovarian miRNA-seq data were not available for solid tissue normal samples (see Table S1). From 2255 samples, we removed 13 male samples from the BRCA dataset and filtered out the vial/plate duplications using the mean as a unique value (resulting in 2229 total samples).
From a total of 1882 miRNAs, we selected all miRNAs with non-zero values for all tumor and normal samples (n = 1683). We retained only those miRNAs expressed as counts-per-million (CPM) above 0.5 in at least 90% of the samples (n = 309) and performed a global data normalization via the upper quartile method (UQUA). The final dataset was composed of 2229 (2093 tumor/136 normal) samples and 309 miRNAs. We retrieved, from miRTarBase [Homo sapiens (hsa ) miRNA, release 7.0] [38], the experimentally validated ERBB2, ERBB3, and ERBB4 miRNA target interaction (MTI). We examined the UALCAN portal [39] for ERBB2 and ERBB3 expression profiling in TCGA-BRCA and TCGA-UCEC cohorts, respectively. We split the dataset into training (70%, 1562 samples) and test sets (30%, 667 samples) for the classification step. Thus, we preserved the proportion of cancer types in the data distribution (in the split data, we had the same percentage of cancer types as in the full dataset, training dataset, and test dataset). Then, we performed a feature selection using the Recursive Feature Elimination (RFE) method (caret v6.0-88), which resulted in a selection of 205 miRNAs. We selected the top 15 features using variable importance analysis and all features found in both the feature selection output and the ERBB miRNA target dataset from our previous study, for a total of 42 miRNAs. Given the unbalanced dataset (2093 tumor/136 normal), we performed an artificial generation of new training samples for the minority class, using the SMOTE method [40]. We used the resulting training set as an input of a Caret Support Vector Machine [41], a classification algorithm widely used in transcriptome classification studies [42][43][44] that maximizes the width of the margin between the classes and, at the same time, minimizes the empirical errors. In the training phase, we used k-fold cross-validation (k=10), a grid approach for parameter tuning (Cost, Weight, and Sigma) and a 70-30% splitting rule for the training and test set definitions.

Integrated Computational Approaches
Our previous study integrated TCGA breast, ovary, and uterine endometrial corpus gene expression profiling, to identify a top-scoring PPI network centered on the ERBB2 gene [11]. The ERBB family is involved in the tumor biology of several solid tumors, including the three female hormone-related cancers [12]. The ErbB receptor tyrosine kinases exist as homodimers and heterodimers composed of ERBB2, ERBB3, and ERBB4 isoforms [48]. In the present study, using a TCGA dataset, we found that ERBB2 and ERBB3 were overexpressed in breast and endometrial cancer tissues compared with normal counterparts ( Figure S1). As the TCGA dataset lacks expression data from normal ovarian tissues, we could not compare ERBB2 and ERBB3 expression between cancer and normal ovarian tissues.
We attempted to identify common miRNA signatures across the three hormone-related cancers and their putative relationships with the cancer hallmark, ERBB isoforms. Figure 1 displays the whole workflow of the present study.
To this end, we queried two bioinformatics resources: (i) the TCGA, a genomic database, for retrieving miRNA expression profiling of large BRCA, UCEC, and OV cohorts, and (ii) miRTarBase, an experimentally validated microRNA target interactions database, for finding miRNAs targeting ERBB isoforms independently from tumor type. To identify miRNAs associated with the three female cancers, we leveraged a model based on the normalized counts of the miRNA.
Due to feature selection on the full dataset, our model identified 205 features (TCGA-BRCA, TCGA-UCEC, and TCGA-OV miRNA, Table S2). On these selected features we applied a Support Vector Machine (SVM) algorithm to execute a Tumor vs. Normal binary classification. We selected this machine learning method, based on its wide usage in the field of transcriptomic and miRNA classification works, as reported in the literature [43,44,[49][50][51]. Other methods (i.e., Logistic regression, Boosted Logistic Regression, Regression with LASSO penalty, Elastic Net, Random Forest, Neural Networks using Model Averaging (avNNet, and the "nnet" package)) produced similar results in AUC and other metrics, compared with SVM (the AUC obtained from the tested ML methods ranging from 0.911 to 0,97 compared to an AUC of 0.931 by SVM) as reported in Table S3. Random Forest and avNNet performance achieved better results in terms of AUC and F1 score. However, we did not obtain deep insight into the systematic appraisal of model performance because it is beyond the scope of the study.
Using the miRTarBase, we found 158 experimentally validated miRNAs able to target ERBB isoforms: 91 targeting the ERBB2 gene, 57 targeting ERBB3, and 10 targeting ERBB4 (Table S4). The intersection of the 205 features ranked according to model importance weighting, with the 158 miRNA targeting ERBB isoforms, resulted in 28 overlapping miRNAs from the 205 selected features. Upon testing model performance, based on different miRNA combinations, we kept the top 15 miRNAs governing model performance (Table S5), plus the 28 miRNAs that targeted ERBB isoforms (i.e., 27 univocal, since hsa-mir-145 was present in both sets of miRNAs). Thereby, our joint analyses identified 42 features, ordered by importance weighting (Table 1), as potential expression signature similarity associated with the three female-specific cancers.  To this end, we queried two bioinformatics resources: (i) the TCGA, a genomic database, for retrieving miRNA expression profiling of large BRCA, UCEC, and OV cohorts, and (ii) miRTarBase, an experimentally validated microRNA target interactions database, for finding miRNAs targeting ERBB isoforms independently from tumor type. To identify

miRNA Validation in the Cancer Cell Line Encyclopedia
Our in silico approaches predicted 42 miRNAs able to discriminate tumors from normal tissues, and some of these were potentially able to target ERBB isoforms. To validate our in silico predictions and select the in vitro model systems for each of the three tumors, we utilized the cancer cell line encyclopedia (CCLE), a large collection of expression and genetic data for human cancer cell models [45]. The CCLE dataset has previously been used to mirror TCGA samples and cell line genomic profiling in OV cancer [52] and BC [53], respectively. However, the CCLE dataset, unlike the TCGA dataset, does not provide miRNA expression data for normal cell lines. We observed that most of the identified miRNAs (39 out of 42) were expressed across a variety of breast, ovary, and endometrium cancer cell lines (Figure 2), except for hsa-miR-139, -miR-337, and -miR-3127. Moreover, unsupervised hierarchical clustering analysis highlighted BRCA, OV, and UCEC cell line clusters along heatmap columns. We then performed a variability data analysis, based on the standard deviation (SD) and interquartile range (IQR) for each miRNA, together with the overall breast, ovary, and endometrium cancer cell lines. In Moreover, unsupervised hierarchical clustering analysis highlighted BRCA, OV, and UCEC cell line clusters along heatmap columns. We then performed a variability data analysis, based on the standard deviation (SD) and interquartile range (IQR) for each miRNA, together with the overall breast, ovary, and endometrium cancer cell lines. In detail, we evaluated SD and IQR distributions, and selected only miRNAs with SD and IQR values less than or equal to the first quartile distribution, to capture signatures with the lowest variability in expression profile. These filtering criteria identified nine pivot miRNAs for further internal validation; these are reported in bold in Table 1.
To select miRNAs for in vitro validation, we used the following criteria: where none of the queried databases specified the miRNA isoform, we decided to analyze both the 3p and 5p isoforms; otherwise, where at least one of the queried databases established the miRNA isoform, we chose to analyze only that one (details in Table S6). In this way, we selected 14 miRNA mature sequences, as reported in Table 2.
Indeed, the expression levels of miR-323a-3p, -323b-3p, and -381-3p were greatly decreased in breast, ovarian, and endometrial cancer cell lines, independent of tumor type. In the same way, miR-331-3p and mir-1301-3p expression increased in cancer cell lines compared with normal lines, although we found small but significant changes in miR-331-3p in the MCF7 vs. MCF10A cell lines, and miR-1301-3p in the EFO21 vs. OCE1 cell lines.  Hence, we wondered whether these five miRNAs could retain their expression trends in TCGA primary tumors. Due to the fact that the TCGA ovarian cohort did not include normal tissues, comparative analysis of miRNA expression levels between tumor and normal tissues was conducted only for TCGA-BRCA and TCGA-UCEC cohorts. Our results show that the TCGA findings matched those obtained by in vitro validation, except for miR-323b-3p, which was downregulated in BC cell lines, while being upregulated in BC tissues ( Figure 4A), and unchanged in UCEC tissues ( Figure 4B).   Taken together, these results highlight that these five miRNAs are similarly deregulated in all three estrogen-dependent cancers.

Discussion
In our previous study, we designed an integrated approach of TCGA-BRCA, -OV, and -UCEC gene expression for identifying expression signature similarity in estrogendependent cancers. We highlighted a leading protein-protein interaction (PPI) network underlying breast, endometrial, and ovarian cancers centered on the oncogene Erb-B2 Receptor Tyrosine Kinase 2 (ERBB2) [11].
ERBB2, also commonly referred to as HER2, plays an essential role in human pathobiology. Four isoforms of the ErbB family have been identified: EGFR (ErbB1, HER1), ErbB2 (HER2), ErbB3 (HER3), and ErbB4 (HER4). Regulation of ERBB activity depends on its homo-or hetero-dimerization. Generally, ERBB2 is involved in mammary gland development during puberty, proliferation, and differentiation, and it is a key player in female-specific malignancies [12]. Identification of its functional pathways and networks could be essential in determining the potential ERBB family regulatory mechanisms in female-specific cancers. Hence, we performed an integrated bioinformatics analysis by combining in silico miRNAs discovered via a machine learning approach from TCGA-BRCA, -OV, and -UCEC datasets and the miRTarBase knowledge for in silico identification of miRNAs targeting ERBB isoforms. In the field of transcriptomic and miRNA classification, many studies adopted the SVM machine learning method because it outperformed the others [43,44,[49][50][51]. We also chose an SVM approach, since other methods (i.e., logistic regression, Boosted Logistic Regression, Quantile Regression with LASSO penalty, Elastic Net, Random Forest, avNNet, and the "nnet" package) produced relatively similar results for AUC and other metrics compared with SVM. Our computational integrated pipeline predicted 42 miRNAs commonly deregulated in the three estrogen-dependent tumors, and some of these were potentially able to target ERBB family members.
To test our in silico predictions, we utilized the publicly available CCLE dataset, including breast, ovarian, and uterine endometrial corpus model systems, which allowed us to establish a set of mature miRNA sequences and a panel of cell lines. We selected similar miRNA expression patterns across the breast, ovarian, and uterine endometrial corpus cell lines. By clustering analysis and miRNA variation analysis, we prioritized a set of 14 pivotal miRNAs for in vitro validation across the breast (MCF7 and TD47), the ovarian (ES2 and EFO21), and the endometrium (MFE-280 and EN) cancer cell lines vs. parental cell lines (MCF10-A, OCE1, and HESC). Interestingly, 5 (miR-331-3p, miR-381-3p, miR-323a-3p, miR-323b-3p, and miR-1301-3p) out of 14 miRNAs showed a similar expression trend across the three tumor types compared with their parental cell lines. Furthermore, four out of five miRNAs (miR-323a-3p, miR-323b-5p, miR-331-3p, and miR-381-3p) retained their expression trend in TCGA primary tumors.
Recently, several studies investigated the miRNA regulatory mechanisms underlying estrogen-dependent cancers. In 2019, Liolios et al. [17] found eight key player miR-NAs that are commonly deregulated in female reproductive system cancers. In addition, the circulating miR-765 was shown to promote endometrial cancer development via the ERβ/miR-765/PLP2/Notch axis [20]. Again, Ritter et al. demonstrated that the deregulated expression of miR-484 and miR-23a in serum and urine human samples was linked to endometrial and ovarian cancers [23].
The growing demand for identifying miRNAs associated with specific diseases gave rise to a common bioinformatic approach; the fusion of various public bioinformatic resources. To this end, recent studies have shown that machine learning approaches can significantly improve the analysis of next-generation sequencing expression data to find new biological patterns and genes, or groups of genes, involved in several pathologies [7]. In fact, a previous study has utilized unsupervised and supervised machine learning techniques on gene expression data with variable success rates [7]. Other studies compared the capacity of supervised methods, such as Support Vector Machines (SVM), Decision Trees, and Random Forest classifiers to perform disease/healthy sample classification tasks [31]. For instance, Ha et al. [56] proposed a novel approach for inferring novel disease-miRNA associations through a machine learning method based on matrix factorization, and used various miRNA databases, such as HMDD and miR2Disease. Machine learning is also applied to discriminate cancer subtypes. Indeed, Ali et al. [33] used a Neighborhood Component Analysis (NCA) and Long Short-Term Memory (LSTM), a type of Recurrent Neural Network, to classify a given miRNA sample into kidney cancer subtypes, using the miRNA quantitative read counts data provided by the TCGA. The results showed a subset of 35 miRNAs that could group kidney cancer miRNAs into five subtypes. The current study aimed to identify common miRNA signatures across the three estrogen-dependent cancers via a machine learning approach, and discovering hidden relationships between the ERBB family and its regulatory miRNAs.
Our weighted support vector machine approach, through feature selection, prioritized miRNAs already known in female malignancies, in agreement with previous findings; for instance, the eight key player miRNAs found by , miR-205, miR-34a, miR-92a, miR-101) [17], and novel miRNAs such as miR-1301 and miR-1296. This latter miRNA had low model importance upon feature selection but, importantly, was one of the miRNAs targeting ERBB isoforms, and satisfied our statistical criteria upon clustering analysis on the CCLE dataset. Indeed, recent functional findings have demonstrated miR-1296-5p involvement in the regulation of breast cancer cell progression by targeting the ERBB2/mTORC1 signaling pathway [57] and tumor suppressor roles in triple-negative breast cancer [58][59][60].
Tumor-specific miRNA expression differences have been proven in vitro to discriminate breast, ovarian, and endometrial cancer cell types [22]. Consistent with Hirschfeld et al. [22], we found 16 out of our 42 features of which mir-200c, mir-100, mir-200b were within our top 15 features for model importance. Riaz et al. used the miRNA expression profiling of 51 human breast cancer cell lines to reveal subtype and driver mutation-specific miRNAs [21]. In line with literature data, we also used model cell systems for in vitro validations, upon pivotal miRNA selection using a public cancer cell line dataset collection. Our in vitro results demonstrated that five miRNAs (miR-331-3p, miR-381-3p, miR-323a-3p, miR-323b-3p, and miR-1301-3p), had a similar expression trend across cancerous breast, ovarian, and uterine endometrial corpus cells compared with their parental cell lines. Furthermore, four out of five miRNAs (miR-323a-3p, miR-323b-5p, miR-331-3p, and miR-381-3p) retained their expression trend in TCGA primary tumors. In agreement with previously published data [62], two of these miRNAs (miR-323a-3p and miR-331-3p) would be able to target ERBB isoforms. miR-331 expression was deregulated in breast cancer [63], and directly targeted HER2 in breast [62], prostate [64], gastric [65], and colorectal [66] cancer cell lines. Interestingly, miR-381-3p upregulation suppresses BC progression by inhibiting epithelial-mesenchymal transition [67]. This result is in line with our data, since we found that miR-381-3p expression was strongly downregulated in BC cell lines, and in ovarian and endometrial cancer cells.
Similarly, a recent study reported that ectopic expression of miR-323a-3p inhibited the malignant behavior of BC cell lines and inhibited tumor growth [68]. In our experiments, we found that the expression level of miR-323a-3p was strongly inhibited in breast, ovarian, and endometrial cancer cell lines. We additionally found that miR-323b-3p was downregulated in all cancer cell lines. It has recently emerged that its downregulation in plasma is associated with lymph node metastases in early breast cancer patients [69]. Regarding miR-1301-3p, Peng et al. reported its downregulation in BC cell lines and tissues [70]. In contrast, our findings show an increased expression level for this miRNA in all assessed cancer cell lines, and these results agree with TCGA dataset results. Hence, our integrated analysis indicated 42 miRNA features characteristic of breast and two gynecologic tumors that are able to discriminate between tumor and normal tissues. In vitro experiments suggest that five miRNAs might be tumor signatures for the three female cancers.
Nevertheless, the present study has some limitations. The first is the lack of ML model validation on the dataset including ovarian normal counterparts, since TGCA-OV lacks these miRNA-seq data. Moreover, the data are unbalanced (2093 tumor/136 normal) and, even though we performed an artificial generation of samples through the SMOTE technique, this method does not consider neighboring values, and thus can increase the overlapping of classes and introduce additional noise [71]. Moreover, we could not find any other large public datasets of breast, ovarian, or endometrial miRNA expression profiling; therefore, we could not test the whole methodological workflow on external datasets to generalize the performance of the classifier. The third limitation is the lack of ex vivo validation, which certainly would have made the ML results more robust.
Indeed, further investigations will focus on the potential implications of the functional roles of the miRNAs and the ERBB family for future translational applications. One future perspective is to test whether the identified miRNAs are also deregulated in the plasma of BC, OV, and UCEC patients, as, if so, this information could be used for diagnostic and prognostic purposes.

Conclusions
In summary, this study aimed to illustrate the potential of a machine learning approach, combined with ERBB knowledge, to dissect similarities in miRNA expression in estrogen-sensitive female-specific tumors. Using publicly available datasets, we propose an integrated computational framework based on a binary classification model and driven by the cancer hallmark ERBB to discover common miRNA signatures characterizing breast, endometrial, and ovarian tumors as possible diagnostic biomarkers, which would have potential for future research in clinical practice. Indeed, further validation would be critical to ascertain the utility and specificity of these miRNAs in diagnostics. Meanwhile, from a plethora of miRNAs across three clinically distinct tumor entities, our analysis framework identified five miRNAs as potential biomarkers for the three cancer types. As a future perspective, it would be interesting to assess the miRNA regulatory network around the ERBB family.