COSMONET: An R Package for Survival Analysis Using Screening-Network Methods

: Identifying relevant genomic features that can act as prognostic markers for building predictive survival models is one of the central themes in medical research, affecting the future of personalized medicine and omics technologies. However, the high dimension of genome-wide omic data, the strong correlation among the features, and the low sample size signiﬁcantly increase the complexity of cancer survival analysis, demanding the development of speciﬁc statistical methods and software. Here, we present a novel R package, COSMONET (COx Survival Methods based On NETworks), that provides a complete workﬂow from the pre-processing of omics data to the selection of gene signatures and prediction of survival outcomes. In particular, COSMONET implements (i) three different screening approaches to reduce the initial dimension of the data from a high-dimensional space p to a moderate scale d , (ii) a network-penalized Cox regression algorithm to identify the gene signature, (iii) several approaches to determine an optimal cut-off on the prognostic index ( PI ) to separate high- and low-risk patients, and (iv) a prediction step for patients’ risk class based on the evaluation of PIs . Moreover, COSMONET provides functions for data pre-processing, visualization, survival prediction, and gene enrichment analysis. We illustrate COSMONET through a step-by-step R vignette using two cancer datasets. connections of breast cancer biomarkers (such as TGFB1 and MAPK8) that are not identiﬁed by the BMD or DAD approaches, ( B , C ), respectively. COSMONET generates the KEGG pathway networks as part of a fully interactive dashboard, which also contains the full networks comprising all the gene signatures, including the isolated genes. The full dashboards are available in the Supplementary Materials.


Introduction
In the last several years, technological advances in the high-throughput quantitative analysis of omics data have provided ample opportunities to investigate the onset and progression mechanisms of several complex diseases, including cancer. International collaborations in large projects, such as The Cancer Genome Atlas (TCGA), which constitutes the core of The Genomic Data Commons (GDC Data Portal) [1,2], the European Genome-Phenome Archive (EGA) [3], and the Gene Expression Omnibus (GEO) [4,5], among many others, have contributed to profiling large tumor sets for different omics layers (genomic, epigenomic, transcriptomic, metabolomics, and proteomic data).
The availability of such a huge volume of cancer omics data has favored the development of novel computational and statistical methods for personalized therapeutic strategies, improving the ability to diagnose, treat, and predict cancer progression. In particular, researchers have devoted significant efforts to developing computational methods to cope with the curse of data dimensionality and variables' correlation, which constitute the two main challenges faced when working with survival and omics data. Such methods are /ultra high-dimensional space to a moderate-dimensional space where the penalization approaches might be more effective. We showed the advantages of merging screening techniques with network-based Cox models in [34,35]. In conclusion, although few tools are already available, there is still a lack of versatile packages that combine screening approaches with penalized regression methods based on network analysis.
Here, we present COSMONET (COx Survival Methods based On NETworks), a novel R package inspired by the methodology proposed in [34,35]. The novelty of the statistical model behind COSMONET is the combination of screening techniques (i.e., the transformation of data from a high-dimensional space into a low-dimensional space) and network-penalized Cox regression approaches for the selection of significant biomarkers. By combining the biological knowledge related to the disease under investigation and the statistical information derived by the data, the package allows the user to identify new potential biomarkers. The package implements a novel pipeline that selects a subset of genes associated with cancer survival and predicts individual patients' risk by evaluating a prognostic index (PI) on the selected gene signature. In addition, COSMONET offers a complete workflow that starts from the pre-processing and normalization steps of omics data and trains the model to identify potential biomarkers, computes PIs, and determines the optimal cut-off to separate high-from low-risk patients. Then, it allows the evaluation of the PIs on a new set of patients (test set) and predicts the patient-specific survival risk. The package also includes advanced visualization options to investigate the data using survival curves, heat-maps, gene pathway networks, Venn diagrams, and correlation plots. Moreover, COSMONET has a complete and straightforward step-by-step guide, with a detailed vignette, making the proposed methodology ready to be used within the biomedical community. Overall, it provides to the scientific collectivity a comprehensive set of easy-to-use functions to take full benefit of the increasingly available shared and heterogeneous cancer data. COSMONET is publicly available and it can be accessed at http://bioinfo.na.iac.cnr.it/cosmonet/ (accessed on 13 November 2021).
In the following sections, we describe the workflow, the statistical methodology, and the functions available in COSMONET. We also illustrate the package's capabilities using two cancer datasets (downloaded from the GEO database and GDC data portal).

Materials and Methods
COSMONET input data are of the form {(X i , Y i , δ i )}, for i = 1, . . . , n. Here, X i = (X i1 , . . . , X ip ) T is the omic profile of the ith patient over p genes. Y i = min(t i , c i ) is the response variable composed of the survival time t i (i.e., the time until endpoint or last follow-up) and the censoring time c i , and δ i is the censoring indicator I(t i < c i ) (i.e., a 0/1 variable, where 0 indicates that the ith patient was censored at time t i and 1 that the ith patient had an event at time t i ). Specifically, COSMONET takes as input data two numeric matrices X T ∈ R n T ×p , the training set T, and X D ∈ R n D ×p , the testing set D, and the relative survival information (t T , δ T ) and (t D , δ D ), respectively. Here, n T and n D denotes the number of samples in the two sets of data and p is the number of covariates. The omic data can represent gene expression profiles as measured by microarrays or RNA-seq technologies, or any other numeric genome-wide feature that can be associated with genes or proteins to form numeric matrices. In our examples, we will use gene expression data since they are the most popular choice. Figure 1 summarizes the steps implemented in COSMONET. More precisely, it consists of two phases: (i) the training phase and (ii) the testing phase.
(i) The training phase uses X T to identify prognostic markers, computes PIs, and defines a cut-off PI * for defining patient risk classes. The core of the training phase consists of the following steps: • a screening approach to provide an essential dimensional reduction step that allows a transition from the high-dimensionality p to a moderate scale d < p using biological information, data-related information, or combining both pieces of knowledge; • a network-penalized Cox regression method to model observed survival times through genome-wide omic profiles (while accounting for coordinated genes functioning in the form of biological pathways or networks) and identify gene signatures; • a procedure to evaluate the PIs and determine an optimal cut-off to separate low-from high-risk patients; • a subnetwork analysis based on gene signatures to visualize new potential genes and biological pathways.
(ii) The testing phase performs survival prediction to evaluate prognostic genes on X D by using the parameters tuned in the training phase (regression coefficients, gene signatures, and the optimal cut-off for the PIs). This phase uses the log-rank test to compare the Kaplan-Meier curves of the patients in the high-and low-risk groups.
We describe the training and testing phases in detail in the following sections. Note that, when only a single dataset is available, the functionSplitData() can perform a random split in the training and testing sets before starting the training phase.

Training Phase
In the training phase, COSMONET performs an (i) optional pre-processing and data normalization step, (ii) variable screening (offering three different approaches plus the possibility to give as the input parameter a list defined by the user), (iii) network construction for the regression analysis (offering two different possibilities), (iv) network-penalized Cox regression analysis, and (v) determination of an optimal cut-off PI * for the PIs. COSMONET includes pathway analysis and heat-map visualizations in the training phase to facilitate the interpretation of the results.
The following sections report the details of each step and Table 1 reports the options available in the package for the screening procedure and the network construction that a user can apply during the process of the training phase. COSMONET can perform an optional pre-processing phase (see Figure 1a) consisting of normalization between samples and between different datasets (here denoted between sample normalization), when necessary. It does not perform the so-called within-sample normalization since such a procedure strongly depends on the type of high-throughput platform used. Therefore, individual samples must be already normalized according to the specific technology. COSMONET selects adaptively the optimal cut-off PI * T on X T . (i) COSMONET computes PI on X D using the regression coefficient and the PI * T and assigns each patient in the testing set into the high-/low-risk group depending on the comparison between PIs and the estimated threshold PI * T , (j) COSMONET performs the log-rank test to compare the survival curves between the patients in the high-risk and low-risk groups. (k) COSMONET plots Kaplan-Meier curves.

Testing Phase
The function NormalizeBetweenData() implements the between-sample normalization. This function uses quantile normalization to make the distributions of the training and testing sets the same across samples. The function first normalizes the training set using the quantile normalization and obtains the normalized training dataset used for the training phase. Then, it normalizes the testing dataset to the normalized training dataset sampleby-sample. It adds one column (i.e., a sample) of the testing set to the training set and normalizes the dataset with n T + 1 samples, repeating the normalization for all n D samples in the testing set. Then, it takes all the normalized test columns and builds the normalized testing set. The between-sample normalization is necessary to remove unwanted variability, improve the models' performance and stability, and make the two datasets comparable. The proposed approach has the advantage that the normalization training set is entirely independent of the testing set. Moreover, the test set samples are normalized individually to the training, thus allowing the system to incorporate new samples with a study's progress. The user can omit the pre-processing step if the datasets were already normalized using other procedures.

Screening Techniques
The variable screening approaches (Figure 1b) reduce the number of variables p to a moderate dimension d < p. To this purpose, let {x j , j ∈ I} be the subset of the screened variables. Denote d = |{x j , j ∈ I}| its cardinality. COSMONET includes three different screening approaches.

•
Biomedical-driven (BMD) screening. The subset {x j , j ∈ I} of the screened variables consists of those variables that are known to be associated with the type of cancer under investigation [34]. Typically, this knowledge is derived from the literature or external databases. In [34], we considered the Human Experimental/Functional Mapper (HEFaIMp) database [41]. Recently, the HumanBase database [42]) replaced HEFaIMp. HumanBase uses posterior probabilities (PPs) to identify a significant relationship between a set of genes and the disease of interest. Therefore, to select the biologically relevant gene set I BMD , COSMONET uses the information provided by HumanBase to (i) decreasingly rank the p genes (connected to the disease of interest) based on PPs and (ii) select the top genes. The user can perform the latter step by using as a threshold score th BMD a cut-off on PPs, i.e., the value of the reduced dimension d selecting the top d genes (the genes with the highest PPs). • Data-driven (DAD) screening. The subset {x j , j ∈ I} of the screened variables is obtained by using only information from the data, as in [43]. In particular, it uses component-wise estimators that are computed very efficiently and do not suffer from the numerical instability associated with ultrahigh-dimensional estimation problems, as follows. Let M * = 1 ≤ i ≤ d : β * i = 0 be the true sparse Cox model, where β * = (β * 0 , . . . , β * d ) T denotes the true value of the parameter vector and β * 0 = 0. The maximum marginal likelihood estimator (MMLE) β M j , for j = 1, . . . , d, is defined in Cox models as the maximizer of the log-partial likelihood with a single covariate: where R(t i ) is the risk set. This procedure (implemented in the MarginalCoxRanking() function) provides the marginal regression coefficients of each feature and the p-values associated with the univariate models. Then, to select the optimal threshold d and optimize data prediction, COSMONET allows the user to rank the genes according to the three options below.
a. Magnitude of the marginal regression coefficients. This approach selects the dtop-ranked covariates, i.e., the d genes with the largest marginal coefficients in absolute value. Typically, d is equal to n/logn [43]. b. p-value. This option identifies as d-genes all the genes that have p-values < 0.05, regardless of the magnitude of their marginal regression coefficients.
c. Magnitude of the marginal regression coefficients and p-values. This approach first orders the genes according to the largest marginal regression coefficients in absolute value; then, it selects only those genes that have p-values < 0.05. • Biomedical-driven + data-driven (BMD + DAD) screening. The subset {x j , j ∈ I} of the screened variables is obtained by combining the biomedical information and the data-driven knowledge. To this purpose, COSMONET takes the union of the BMD and DAD sets of genes. By using BMD + DAD screening, COSMONET explores the best model that can sufficiently explain the data in the most parsimonious way to (i) make use of available information, (ii) identify new markers that the BMD screening ignores, and (iii) improve the ability to make precise prognosis, diagnosis, and treatments.
The function ScreeningMethods() implements the three methodologies discussed above. Moreover, COSMONET also allows the user to import an external list of genes as the screened variables (obtained from the user's clinical experience or the literature).

Gene Network Construction
The set of screened variables {x j , j ∈ I} is used to build the networks that will be included in the penalized Cox regression model (Figure 1c,d). In particular, COSMONET implements the following network construction procedure: • Functional linkage (FL) network. COSMONET builds S = S ij (i,j)∈I BMD from the Hu-manBase tool [42,44]). Each element in the S matrix represents the PP that two genes are functionally related. The higher the probability, the stronger the functional relation between the genes in the disease of interest. COSMONET completes the S matrix by setting to zero the weights of all genes that are not identified by the functional linkage map. The function CreateNetworkMatrix() builds matrix S.
Moreover, COSMONET allows the user to import a molecular network S that matches the set of screened genes {x j , j ∈ I}. Note that the network matrix S must be an adjacency matrix with zero diagonal and non-negative off-diagonal.

Network-Penalized Cox Regression Algorithm
The next step consists in the application of a penalized algorithm using the set of screened variables {x j , j ∈ I} and the a priori network information (Figure 1b,d).
Given the Cox penalized partial log-likelihood function COSMONET considers the following penalized problem: with where ρ, γ > 0 are two regularization parameters. The penalty function consists of two terms. The first term is a 0 -norm that enforces sparsity in the solution. The second term Γ(·) is a Laplacian matrix constraint that gives smoothness among connected genomic variables or regression coefficients in the network. More precisely, the graph Laplacian regularization Γ(β I ) is equal to β T I Lβ I with L = D − S ∈ R p×p (graph Laplacian), where D is the degree matrix and S ∈ R p×p is the adjacency matrix.
To solve the optimization problem proposed in Equation (2), COSMONET uses an efficient optimization, and the second stage performs hard thresholding. Indeed, the objective function in Equation (2) is reformulated by introducing a surrogate parameter θ I of β I and bounding the difference between them by a smooth convex function, which guarantees the convergence of the proximal operator. The following formula gives the Lagrangian form of APM-L 0 : where Φ j (·) is a convex function such that Φ j (0) = 0 and Φ j (|x|) ≥ 0, and α ∈ (0, 1]. To minimize Equation (4) for a given λ, ρ and α, APM-L 0 updates all parameters using the following algorithm: where θ 0 I j is an initial value. Equation (6) is minimized component-wise, and its solution is given byθ i.e.,θ I is obtained by hard thresholding theβ I parameters obtained in the first step.
COSMONET sets α = 0.5 as the default parameter, and chooses λ and ρ iteratively by k-fold cross-validation (k = 5 as default parameter). Our package allows the user to select either the value of λ that gives the minimum average cross-validated error or the value of λ that gives the most regularized model such that the cross-validated error is within one standard error of the optimal. The latter choice is inspired by the glmnet package [46], and we modified the APM-L 0 package for such a purpose (the modified version is incorporated into COSMONET).
Overall, this step selects a subset of genes (i.e.,β I = (β 1 , . . . ,β d ) T = 0 I , gene signature in Figure 1e), which is used in the next step to determine PIs and predict patients' risk groups.

Prognostic Index (PI) and Survival Analysis
By using the regression coefficientsβ I = 0 I , for each patient i (i = 1, . . . , n T ) in the training set T, COSMONET computes PI T i , defined as PI T i = x I iβI , where x I i is the vector of screened gene expression values associated with the i-th patient (Figure 1g). To determine the optimal cut-off PI T, * for dividing the patients into low-risk (LR) and high-risk (HR) groups, COSMONET offers three different procedures ( Figure 1h).

•
Adaptive-based approach: each patient i (i = 1, ..., n T ) in T is placed in the high-risk (or low-risk) group if PI T i is above (or below) the q γ -quantile, where The procedure is repeated in an adaptive manner for each q γ , and a log-rank test for each q γ -quantile is used to compare the Kaplan-Meier survival curve between the two risk groups. The optimal cut-off PI T, * is the value that corresponds to the best separation of the two groups, i.e., the q γ -quantile related to the lowest non-zero p-value resulting from the log-rank test.
• Median-based approach: each patient i (i = 1, . . . , n T ) in T is set in the high-risk (or low-risk) group if PI T i is above (or below) the q γ -quantile with γ equal to 0.50. The optimal cut-off PI T, * is the value that corresponds to the median of the PIs.
• Survival-based approach: each patient in T is allocated to the high-risk (or low-risk) group by using an outcome-oriented method that provides a cut-point corresponding to the most significant relationship with the survival. COSMONET uses the surv_cutpoint() function from the survminer package.
The function SelectOptimalCutoff() computes the optimal cut-off PI T, * and the p-value corresponding to the Kaplan-Meier (KM) survival curves on T. The KM curves provide a useful visual representation for assessing the effect of cancer progression over time for different groups of patients.
The function CosmonetTraining() performs the complete procedure (variable selection and survival analysis on T).

Network and Pathway Analysis
COSMONET provides an interactive visualization to investigate the gene signaturê β I = 0 I in terms of pathway analysis ( Figure 1f). It uses pathway information from the KEGG database [31] to create sub-networks of the gene signature identified by the regression algorithm ( Figure 1e). Specifically, COSMONET generates a dashboard with two networks: (i) a sub-network of the not-isolated genes, i.e., genes that share at least one pathway with another selected gene, and (ii) a complete network displaying the full gene signature selected by the network regression method. Each vertex in the network is a gene. An edge between two genes indicates that they belong to the same KEGG pathway. The color of the nodes/genes allows identification of the functional link between the genes and the disease of interest according to the HumanBase database. Specifically, COSMONET automatically retrieves from HumanBase a list of genes ranked according to the PP of being associated with the disease under investigation (identified through the disease ID according to the Disease Ontology (DO) Project [47]). The signature genes among the top 500 genes in the list are displayed in red (and labeled as Mapped Up); the other genes in the list are shown in blue (and labeled as Mapped Down). Any gene signature not recorded in the HumanBase ranking is shown in white (and labeled as Not Mapped).
The complete networks can be useful to gauge an understanding of the type of genes selected as gene signatures by the different approaches (mapped up, mapped down, or not mapped genes). Specifically, the networks allow the detection of any not mapped isolated genes (white nodes, i.e., genes that are not associated with cancer under study by HumanBase), enabling the identification of new potential biomarkers.
The function GenerateNetwork() implements the pipeline for visualizing the gene pathway network. This function can also be used as a standalone tool for visualizing the network given a list of genes provided by the user. COSMONET includes information from the KEGG database in an internal file (KEGGrepository.RData) that is used by the GenerateNetwork() function to automatically map the list of genes to the corresponding KEGG pathways, and create a gene-by-pathway adjacency matrix. This matrix is then used to create the final gene pathway network.

Heat-Maps
To facilitate the interpretation of the results, COSMONET includes the heat-mapSurvGroup() function. This function first orders patients according to PIs (from the highest to the smallest). Then, it divides them into two risk groups (i.e., low-risk and high-risk classes) using the optimal cut-off PI * . Finally, it depicts the expression values for each gene of the signature as a Z-score. In this way, the function provides a practical way to identify genes whose upregulation leads to a poorer prognosis (i.e., those in red in the high-risk group) and those whose downregulation leads to a poorer prognosis (i.e., those depicted in green in the high-risk group). Therefore, the heat-map in clinical studies aids in visualizing and pointing out the gene signature (and up/down regulation of key genes), assessing its reproducibility between the training and testing sets (and other datasets), and evaluating the similarity between observations or clusters of patients, in a single figure.

Testing Phase
The testing phase works on an external dataset, or the testing set resulting from the initial splitting into training and testing sets, as explained in Section 2.1.1. In the testing phase, survival analysis is performed to assess the prediction accuracy (Figure 1i-k).

Survival Analysis
To perform survival analysis, COSMONET computes PIs for each patient j (j = 1, . . . , n D ) in the test set as PI D j = x I jβI , where x I j is the vector of screened gene expression values associated with the j-th patient andβ I = 0 I is the gene signature derived from the training phase ( Figure 1i). The optimal cut-off PI T, * selected on the training set T is used on the validation set to split the patients into high-risk and low-risk groups. Each j-th patient is designated as a high-(or low-) risk of death if PI D j is above (or below) the optimal cut-off PI T, * . COSMONET uses the log-rank test to calculate the statistical significance level (i.e., computing p-values; Figure 1j) and the KM curves to further analyze the results (Figure 1k). A good separation between high-and low-risk survival curves and a significant p-value (i.e., p-value < 0.05) indicate that the prognostic classifiers selected in the validation set can identify good prognosis patients. In other words, the low-risk class has a significantly better prognosis than the high-risk class. The function ValidationTest() implements the procedure to perform survival analysis using the testing set. The function CosmonetTesting() performs the complete prediction procedure. The KM curves allow us to visualize the survival probability depending on the assigned risk class and evaluate the difference. The larger is the difference between the two curves, the better is the gene signature as a prognostic biomarker.

Correlation Analysis
The function CorrPlot() allows the user to analyze and visualize the correlation coefficient R (and its p-value) between results obtained using two pairs of PIs, such as those generated using different screening techniques on the same dataset. This enables the comparison of risk assignments based on the two gene signatures and allows us to classify the patients at the individual level: low-risk, high-risk, and not consistently identified (usually, they are borderline). From a clinical point of view, we suggest that the group of patients not consistently recognized might need further investigation to establish cancer risk factors. The function allows the user to select different methods for computing correlation coefficients, such as Pearson, Kendall, or Spearman. The benefit of using correlation analysis is to assess the overall concordance between the PI indexes obtained using different methods, and to easily identify patients that are consistently assigned to one class or the other, from patients whose assignment might depend on the chosen method. In the latter case, the final assignment might be done using additional information.

R Implementation
COSMONET is a novel R-package that implements all the steps described in Figure 1. It is available at http://bioinfo.na.iac.cnr.it/cosmonet/ (accessed on 13 November 2021), as a source code, with data examples and a detailed vignette. The main function Cosmonet() can be used to run the full pipeline. However, the user can also run the two phases individually by using the CosmonetTraining() and CosmonetTesting() functions. Table 2 describes all the functions available in the package.
Select optimal cut-off PI T, * on training set T; generate Kaplan-Meier curves resulting from the log-rank test, distribution plot of PI T . GenerateNetwork() Generate a biological network based on KEGG pathways to investigate the signature genes.

HeatmapSurvGroup()
Plot hierarchical clustering heat-map of signature genes of low-and high-risk prognostic survival groups using training T and testing D set.
i-k CosmonetTesting() Make prediction on data.

ValidationTest()
Return test validation, Kaplan-Meier curves resulting from the log-rank test on D using the signature genesβ I = 0 and the optimal cut-off PI T, * , distribution plot of PI T .

d-k
Cosmonet() Used to run the full pipeline including both CosmonetTraining() and CosmonetTesting().
Utility function CorrPlot() Create a scatter plot including the correlation coefficient, p-value and linear regression line between the prognostic indices PIs. VennPlot() Display three-set Venn diagram plots between the number of patients at low-risk (or high-risk) obtained for each screening.

Real Data Examples
To illustrate the features and performance of COSMONET, we used two gene expression data examples in breast and lung cancer studies. Table 3 shows the details of the datasets. We used two independent gene expression datasets on breast cancer available from the GEO database [4]. We used GSE2034 as the training set T to build the model, while we used GSE2990 as the testing set D to validate the model. The training set T consists of gene expression profiles from the total frozen RNA of 286 lymph-node-negative breast cancer patients [48]. The testing set D contains gene expression profiles of 189 invasive breast carcinomas [49]. The median survival time (RFS) in T was 86 months, and the censoring proportion was 62.59%, while the median survival time (RFS) in D was 77 months and the censoring proportion was 63.49%.
We used the RMA and preprocessCore Bioconductor packages for the pre-processing steps (see Section 2.1.1). After the within-arrays normalization and the reduction to the same p-dimensional feature space (for a total of 13229 genes), we applied the function NormalizeBetweenData() to perform the normalization between datasets (see Supplementary Figure S1A).

Example 2: Lung Cancer Dataset from an RNA-Seq Case Study
We considered TCGA-LUAD (lung adenocarcinoma) gene expression data from the GDC Data Portal [1]. The data were obtained from the Illumina HiSeq platform and included 492 lung cancer patients (for a total of 19988 genes). We used the data already pre-processed and normalized (gene-level, RPKM), which are available in the LinkedOmics portal [50]. The median survival time (OS) in T was 701 days, and the censoring proportion was 63.82%, while the median survival time (OS) in D was 624.5 days, and the censoring proportion was 63.82%.
To analyze the dataset, we first randomly split the dataset into 50% train (246 samples) and 50% test (246 samples) and then applied the between normalization by using the functions splitData() and NormalizeBetweenData(), respectively (see Supplementary Figure S1B).

Results
This section aims to illustrate the functionalities of COSMONET using microarray and RNA-Seq data examples. The accompanying on-line vignette, available at http://bioi nfo.na.iac.cnr.it/cosmonet/, provides a step-by-step analysis of the following examples (accessed on 14 November 2021). Users can also access the pre-processed data and the repositories containing the prior biological information. COSMONET is based on the Disease Ontology (DO) Project [47] to identify the disease ID for the disease of interest (e.g., breast cancer DOID:1612 and lung cancer DOID:1324). The normalization between the training set T and the testing set D ( Table 2, step (a)) was performed as described in Sections 2.5.1 and 2.5.2, respectively (see Supplementary Figure S1).

Results Using the BMD Screening
We applied the BMD screening procedure by intersecting the 500 genes with the highest PPs associated with breast cancer (according to HumanBase) with the 13229 genes measured on the gene expression microarrays (Table 2, step (b)). As a result, the screening retained 437 BMD genes (i.e., |I BMD | = 437). In the training phase, COSMONET identified 40 BMD genes as potential breast cancer signature genes (with α = 0.5), which we used to compute PI T,i i = 1, . . . , n T and the optimal cut-off PI T, * BMD (see Table S1). Here, for illustrative purposes, we used the median-based procedure (q γ -quantile with γ = 0.50), i.e., n T LR = n T HR = 143 (Table 2, steps (d)-(h)). The on-line vignette provides additional results and plots on the training set T.
In the testing phase, the estimation of PI D,i , i = 1, . . . , n D led to n D LR = 144 low-risk patients and n D HR = 43 high-risk patients ( Table 2, steps (i)-(k)). The on-line vignette provides additional results and plots on the testing set D. Table 4 summarizes the results. Table 4. Breast cancer data. Summary of BMD, DAD, and BMD + DAD screening network Cox regression methods. The table is divided into two sides corresponding to the training and testing phases. In the training phase are indicated the type of screening, the number of screened genes I, the signature genes (β I = 0), and the number of patients at low and high risk (LR and HR). In the testing phase, we report the p-values for the log-rank test and the number of patients in the highand low-risk groups.

Training Phase
Testing Phase

Results Using the DAD Screening
We also applied the DAD screening by ordering the marginal Cox regression coefficients in absolute value and choosing the d = 50 top genes (i.e., |I DAD | = 50) as screened DAD genes (Table 2, step (b)). Note that such a choice is suggested by d equal to n/logn , with n = 286. COSMONET identified 32 DAD genes (i.e., withβ I DAD = 0) as possible breast cancer signature genes (see Table S1). Similarly to the BMD case, we computed PI T,i and the optimal cut-off PI T, * DAD . In this case, for expository purposes, we used the survivalbased approach. The number of patients at low risk was n T LR = 177, while the number of patients at high risk was n T HR = 109 (Table 2, steps (d)-(h)). The on-line vignette provides additional results and plots on the training set T.
In the testing phase, PI D,i , i = 1, . . . , n D , we separated the patients into two prognostic groups of cardinality, n D LR = 113 and n D HR = 74 (Table 2, steps (i)-(k)). The on-line vignette provides additional results and plots on the testing set D. Table 4 summarizes the results.

Interpreting Gene Signatures
To better understand and investigate the gene signatures, i.e.,β I = 0, obtained by applying one of the screening network methods, we used the heat.map function (i.e., the function heat-mapSurvGroups() in Table 2, steps (d)-(h)). Figure S3 (panel A) reports the heat-maps generated using the BMD + DAD gene signature (β I BMD = 0).

Correlation Results
We performed a correlation analysis (based on Spearman rank correlation) on PIs obtained using the different screening approaches (PI BMD , PI DAD and PI BMD+DAD ) on the training T and testing D set. As an illustrative example, Figure 3 (panel A) shows a moderate positive correlation R = 0.49 between PI D BMD and PI D DAD , and a strong correlation R = 0.70 between PI D BMD and PI D BMD+DAD . Both correlations are strongly statistically significant (p-value < 0.05). Points in Figure 3 depicted in red or in blue are consistently assigned to the same risk group, regardless of the screening method; those in grey (in the border of the two classes) suggested that further information might be required before making a clinical decision.
3.1.6. Pathway Analysis Figure 4 reports the KEGG pathway networks of the not-isolated genes selected by the three screening methods. It is worth noting that the BMD + DAD screening approach (Figure 4 (panel A)) can detect pathway-based relationships between genes that are already known to be biologically associated with the disease (red genes, mapped up) and genes that are not known to be highly associated with breast cancer (blue genes, mapped down). For example, the link between the mapped up gene CDKN2A and the mapped down gene DCC illustrates that both genes can play an important role in breast cancer as part of the same KEGG pathway (i.e., KEGG pathways in cancer). In fact, several studies have shown the relevance of CDKN2A in breast cancer, but only recently, DCC has been associated with an increased risk of various cancers [9]. Figure 4 (panel A and B) show that the BMD + DAD and BMD models led to the selection of similar KEGG pathways (they share 13 pathways). Among them, the cytokinecytokine receptor interaction pathway was identified by the three models. Indeed, cytokine receptors are deeply involved in cancer regulation stem cells through complex interactions with the tumor microenvironment and represent attractive targets for therapeutic development [51,52]. Both the BMD + DAD and BMD models (Figure 4 (panel A and B), respectively) identified three critical genes in breast cancer, i.e., VEGFC, PGF, and GAB1. However, the mapped down genes identified by the BMD + DAD model (Figure 4 (panel A)) allow the detection of potential new gene signatures and cancer pathways that are still unexplored. The interactive networks dashboard is available in Appendix A.1).

Results Using the BMD Screening
In this analysis, we selected the top 500 genes associated with lung cancer, downloaded from the HumanBase database (the disease ID is DOID:1324). The BMD screening selected 482 genes (i.e., |I BMD | = 482) as the intersection of the 500 genes with the 19988 genes measured using the RNA-seq technology, i.e., Illumina HiSeq (Table 2, step (b)). Then, applying the penalized network regression, COSMONET identified 25 BMD genes (i.e., witĥ β I BMD = 0 and α = 0.1) as potential lung cancer signatures (see Table S2). We used these genes to compute PI T,i BMD with i = 1, . . . , n T and select the optimal cut-off. Using the median-based approach, we obtained n T LR = n T HR = 123 (Table 2, steps (d)-(h)). The on-line vignette provides additional results and plots on the training set T.
In the testing phase, we obtained two prognostic groups, n D LR = 125 and n D HR = 121 ( Table 2, steps (i)-(k)). The on-line vignette provides additional results and plots on the testing set D. Table 5 summarizes the results.   Red nodes represent the mapped up genes, i.e., signatures that are among the top 500 genes biologically known to be associated with breast cancer according to HumanBase. Blue nodes represent the mapped down genes, i.e., signatures that are biologically known to be associated with breast cancer, but not among the top 500 genes according to HumanBase. Any gene that is not associated with breast cancer will be displayed as a white node (none were selected in the not-isolated gene signatures shown in the figure). The BMD + DAD network (A) shows that merging biological and data-driven information allows the identification of pathway-based connections of breast cancer biomarkers (such as TGFB1 and MAPK8) that are not identified by the BMD or DAD approaches, (B,C), respectively. COSMONET generates the KEGG pathway networks as part of a fully interactive dashboard, which also contains the full networks comprising all the gene signatures, including the isolated genes. The full dashboards are available in the Supplementary Materials. Table 5. Lung cancer data. Summary of BMD, DAD, and BMD + DAD screening network Cox regression methods. The table is divided into the training and testing phase. In the training phase are indicated the type of screening, the number of screened genes I, the signature genes (β I = 0), and the number of patients at low and high risk (LR and HR). In the testing phase are reported the resulting p-values for the log-rank test and the number of patients in the high-and low-risk groups. Here, we performed the DAD screening by ordering the marginal Cox regression coefficients in absolute value and selecting the d top 70 ranked genes (i.e., |I DAD | = 70) as screened DAD genes (Table 2, step (b)). Then, we applied the network-penalized regression, which selected 59 DAD genes (i.e., withβ I DAD = 0 and α = 0.1) as potential lung cancer signature genes (see Table S2). The DAD signature genes allowed us to compute PI s and the optimal cut-off. Here, using the survival-based approach, we obtained that the number of patients was n T LR = 151 for the low-risk group and n T HR = 95 for the high-risk group ( Table 2, steps (d)-(h)). The on-line vignette provides additional results and plots on the training set T.
In the testing phase, COSMONET separated the patients into two prognostic groups, n D LR = 154 and n D HR = 92 ( Table 2, steps (i)-(k)). The on-line vignette provides additional results and plots on the testing set D. Table 5 summarizes the results.

Results Using the BMD + DAD Screening
As previously done with the microarray data, we first selected as BMD + DADscreened genes the union of the 482 BMD-screened genes and the 70 ordered DAD-screened genes for a total of 567 genes (|I BMD+DAD | = 567). Then, from the training phase, as before, we detected 54 BMD + DAD genes, i.e., withβ I BMD+DAD = 0 and α = 0.1 (see Table  S2). Here, the number of patients at low and high risk according to the adaptive-based approach was equal to n D LR = 172 and n D HR = 74, respectively ( Table 2,  We validated the survival model in the testing phase, obtaining a subdivision of the patients into two sets of cardinality, n D LR = 172 and n D HR = 74 (Table 2, steps (i)-(k)). For illustrative purposes, Figure 3 (panel C) shows the Kaplan-Meier curves for OS patients in the low-risk versus high-risk groups together with the p-values (p-values < 0.05). We show the distribution plot of PI D BMD+DAD in Figure 3 (panel D). Table 5 summarizes all the results.

Correlation Results
Furthermore, in this case, we compared the estimated PIs (Spearman rank correlation) on the training T and testing D set obtained using the different screening procedures. For illustrative purposes, Figure 3 (panel B) shows a moderate positive correlation between PI BMD and PI DAD (R = 0.44) and a strongly positive correlation between PI BMD and PI BMD+DAD (R = 0.81). Both correlations are statistically significant (p-value < 0.05).

Pathway Analysis
For the lung cancer example, we performed the same steps as in Section 3.1.6. Figure       , and DAD (C) screening approaches. Red nodes represent the mapped up genes, i.e., signatures that are among the top 500 genes biologically known to be associated with lung cancer according to HumanBase. Blue nodes represent the mapped down genes, i.e., signatures that are biologically known to be associated with lung cancer, but not among the top 500 genes according to HumanBase. Any gene that is not associated with lung cancer will be displayed as a white node (none were selected in the not-isolated gene signatures shown in the figure). COSMONET generates the KEGG pathway networks as part of a fully interactive dashboard, which also contains the full networks comprising all the gene signatures including the isolated genes. The full dashboards are available in Appendix A.
The mapped up gene PLK1 was selected by all methods, confirming its key role in lung cancer [53][54][55]. The BMD + DAD and BMD screening methods ( Figure 5 (panel A) and (panel B)) shared seven biologically relevant genes (mapped up genes), including SHC1 and CRKL. The association of these genes with lung cancer is already known [56]. However, by investigating the KEGG pathways shared by both genes (i.e., links between the two nodes), it will be possible to better understand the interaction of the two genes and their role in lung cancer.
By analyzing the links between the mapped up genes and mapped down genes, it is possible to identify new potential biomarkers, as well as the KEGG pathways associated with them. For example, Figure 5 (panel A) shows that the mapped up gene CCND2 is connected to the mapped down gene RRM2 through the KEGG P53 signaling pathway and to the mapped down gene CSNK1E through the KEGG WNT signaling pathways. This information can be useful to understand how the key gene CCND2 interacts with other genes that have not been identified as highly relevant for lung cancer. The interactive networks dashboard is available in Appendix A.2.

Computational Complexity
To provide an estimate of the runtime of the proposed methodology, we ran the Cosmonet() function using the breast cancer datasets presented above, using two separate matrices for the training and testing phases. The run times were approximately 30 s for the BMD screening, approximately 10 seconds for the DAD screening, and 40 s for the BMD + DAD screening. When running the same procedure using the LUAD dataset (using one single matrix), the run times were approximately 65 s for the BMD screening, approximately 6 s for the DAD screening, and 50 s for the BMD + DAD screening. As expected, the running time was longer when using the LUAD dataset, as this included the train/test splitting phase and a larger number of samples.
It is worth mentioning that the runtime is mostly affected by the size of the input matrices and by the number of genes selected during the screening process. The execution times were measured using the R library tictoc (https://github.com/collectivemedia/t ictoc, accessed on 13 November 2021) and the experiments were run on an Intel Core i5 processor with 8 GB of RAM.

Discussion and Conclusions
In this work, we presented COSMONET, a novel R package that merges advanced screening techniques with network-based Cox regression for survival prediction. This package implements a complete pipeline built upon an efficient computational algorithm based on two main features: (i) screening techniques to reduce the feature space p to a moderate scale d and (ii) a network-based Cox regression method to select the high-risk cancer genes among the screened covariates for the survival of patients [34,35]. The variable screening aims to find a subset of input and significant variables associated with cancer patient survival. The network-based Cox regression algorithm, based on augmented and penalized minimization-L 0 (APM-L 0 ) [45], aims to select the most functional genes related to cancer. Moreover, the package is completed by several additional functions to determine the optimal cut-off for the PIs, perform data pre-processing, and visualize and compare the results in several forms. We illustrated the capabilities of COSMONET on two case studies with gene expression data (microarrays and RNA-seq) from the literature. The vignette provides a step-by-step guide for beginner users, making COSMONET appealing for use within the biomedical community in order to plan, develop, and implement a personalized care plan.
The current version of COSMONET works with a single type of omic data. We have used gene expression, although it is possible to use other types of omics profiles. Together with several others (i.e., [35,57,58]), we have shown that integrating multiple omics data types can improve the performance of a method. The approach used in [35] can be easily extended to COSMONET. In the aforementioned proposal, we used MANCIE [59] to correct the gene expression data using the copy number aberrations (CNA), which showed better performance. The same approach can be used with COSMONET when two types of omics are available. However, the approach in MANCIE assumes that one of the omics (i.e., the gene expression) acts as the principal matrix and the second one as a support matrix (i.e., the CNA in [35]). Therefore, the approach does not easily extend to multiple omics and does not give equal importance to the experiments. To further improve COSMONET for data integration, we should use the multi-task regression learning approaches in a manner similar to [60,61]. Moreover, we have shown that COSMONET runs fast also on large datasets such as those in our case studies. However, due to the increasing complexity of a multi-omics approach, we expect a longer runtime when integrating multiple datasets or when the size of the problems increases. For this reason, future improvements of the package should integrate the option to run the process in parallel mode. Furthermore, the integration of different databases other than KEGG, such as MetaCyc [62], could also improve the user experience, allowing an analysis of the data from different perspectives. In fact, MetaCys contains significantly more reactions and pathways than KEGG. Hence, we aim to include additional pathway repositories in a future extension of COSMONET.
In conclusion, to the best of our knowledge, COSMONET is the only R package that combines both biologically driven and data-driven screening techniques within a networkpenalized Cox regression model. Hence, the proposed package provides the biomedical community with a valuable and straightforward tool for investigating new cancer biomarkers and predicting survival outcomes while using the most recent statistical techniques.  Figure S3: Heat-maps. (A) BMD + DAD genes signatures in the training set (T) and the testing set (D) selected using breast cancer data. (B) BMD + DAD gene signatures in the training set (T) and the testing set (D) identified using lung cancer data. The genes in the horizontal direction are clustered in the same order for both sets (T and D). Z-scores of the gene expression data are used. Z-scores larger than 3.5 were set to 3.5 and Z-scores smaller than −3.5 were set to −3.5. Red color indicates a high level of expression in breast cancer and green color indicates a low level of expression. The patients are divided into high-risk and low-risk groups based on the optimal cut-off PI T, * .  Institutional Review Board Statement: Not applicable since publicly available data were used.