Synergistic Effects of Different Levels of Genomic Data for the Staging of Lung Adenocarcinoma: An Illustrative Study

Lung adenocarcinoma (LUAD) is a common and very lethal cancer. Accurate staging is a prerequisite for its effective diagnosis and treatment. Therefore, improving the accuracy of the stage prediction of LUAD patients is of great clinical relevance. Previous works have mainly focused on single genomic data information or a small number of different omics data types concurrently for generating predictive models. A few of them have considered multi-omics data from genome to proteome. We used a publicly available dataset to illustrate the potential of multi-omics data for stage prediction in LUAD. In particular, we investigated the roles of the specific omics data types in the prediction process. We used a self-developed method, Omics-MKL, for stage prediction that combines an existing feature ranking technique Minimum Redundancy and Maximum Relevance (mRMR), which avoids redundancy among the selected features, and multiple kernel learning (MKL), applying different kernels for different omics data types. Each of the considered omics data types individually provided useful prediction results. Moreover, using multi-omics data delivered notably better results than using single-omics data. Gene expression and methylation information seem to play vital roles in the staging of LUAD. The Omics-MKL method retained 70 features after the selection process. Of these, 21 (30%) were methylation features and 34 (48.57%) were gene expression features. Moreover, 18 (25.71%) of the selected features are known to be related to LUAD, and 29 (41.43%) to lung cancer in general. Using multi-omics data from genome to proteome for predicting the stage of LUAD seems promising because each omics data type may improve the accuracy of the predictions. Here, methylation and gene expression data may play particularly important roles.


Introduction
Lung cancer is one of the most common types of cancer. Morbidity and mortality associated with lung cancer rank high among all cancers worldwide [1]. Lung cancer can be divided into small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). Lung adenocarcinoma (LUAD) is a histological subtype of NSCLC [2], accounting for approximately 70% of NSCLC cases. The five-year survival rate of LUAD remains very poor [3]. However, if diagnosed at an early stage, survival rates are greatly extended.
One-year survival rates for stage I NSCLC are 81-85%, while a stage IV diagnosis is associated with only a 15-19% survival rate [4]. Due to the existence of different treatment methods, the staging of LUAD is an initial and important step in clinical diagnosis and targeted treatment. This highlights the need to design computational methods for the staging prediction of LUAD to reduce the overall mortality associated with this disease and further improve the quality of life of patients.
Since cancer is related to alterations in genes that control normal cell growth and death, molecular aberrations play a critical role in cancer initiation and progression [5]. An understanding of the molecular basis of cancer helps to predict the clinical outcome of cancer patients and determine the best-fitting treatments. With the development of sequencing

Data Preparation
LUAD data were downloaded from the TCGA data portal (https://portal.gdc.cancer.gov/, accessed on 19 December 2019). TCGA [35] is a public database that contains thousands of cancer patient samples, different cancer types, and various omics data types. We selected multi-omics datasets, because we were interested in studying the impact of the fusion of different omics data types on LUAD cancer staging predictions. Among these types, CNV belongs to the genome level, methylation to the epigenetic level, gene expression and miRNA to the level of the transcriptome, and protein expression to the level of the proteome.
We obtained 351 multi-omics data samples for analysis by first excluding samples without clinical staging information and then excluding samples that did not feature all considered multi-omics data types. Figure 1 shows the numbers of patients available for each possible combination of omics data types. In accordance with a previous study [36], we defined the staging prediction of LUAD in our study as a binary classification problem, differentiating between early stages (T1-T2, n = 270) and late stages (T3-T4, n = 81). Detailed information on the patients' basic characteristics is given in Table 1.
Genes 2021, 12, 1872
We obtained 56,170 feature variables. Table 2 gives an overview of these variables. After feature selection, 70 features were retained. The feature selection process will be described in the following section.
To obtain a predictive subset of features and reduce the computational burden, we performed automatic feature selection. In general, feature selection methods can be categorized into filter methods, wrapper methods, and embedded methods [37]. In our article, we used a filter-wrapper method to select and model the features. The filter method sorts the features with respect to their importance and redundancy and the wrapper method aims at selecting a number of features that leads to the greatest classification performance.
Minimum Redundancy and Maximum Relevance (mRMR)
mRMR is a multivariate filter procedure that sorts features according to their predictive information, while taking into account their mutual information [38]. The aim of mRMR is to retain features that are maximally relevant for predicting the target class, but also minimally redundant among each other. It is a very popular method applied to select features in areas such as gene expression data analysis [39,40], protein sub-cellular localization prediction [41], and cancer survival prediction [42,43].
The mutual information between the $j$th feature $x_j$ and the target class $c$ is defined in terms of the density functions $p(x_j)$, $p(c)$, and $p(x_j, c)$ as follows:
$$I(x_j; c) = \iint p(x_j, c) \log \frac{p(x_j, c)}{p(x_j)\, p(c)} \, dx_j \, dc \tag{1}$$
$I(x_j; c)$ is a measure of the relation between the individual feature $x_j$ and the target class $c$. For categorical features, the integrals in (1) reduce to sums and estimates for the involved density functions are readily available [38]. We transformed all continuous features into categorical features (see further down for details), because the mRMR implementation used in this paper requires categorical features (as do other implementations).
Let $S$ denote a subset of features and $|S|$ the number of features in $S$. The Maximum-Relevance condition is:
$$\max_{S} D(S, c), \quad D(S, c) = \frac{1}{|S|} \sum_{x_j \in S} I(x_j; c) \tag{2}$$
Although we can use the Maximum-Relevance algorithm to choose the top individual features in descending order of $I(x_j; c)$, it has been recognized that the selected features could have rich redundancy, namely, "the m best features are not the best m features" [44]. To reduce the redundancy among selected features, a Minimum-Redundancy condition can be added:
$$\min_{S} R(S), \quad R(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \tag{3}$$
The mRMR feature set is obtained by optimizing the conditions in Equations (2) and (3) simultaneously.
In practice, a sequential incremental method is used. Suppose we already have $S_{m-1}$, a feature set with $m - 1$ sorted features. Then the task is to select the $m$th feature, that is, the next sorted feature, from the set $\{X - S_{m-1}\}$, where set $X$ represents the set of all features. This feature is selected by maximizing the single-variable relevance minus a redundancy function:
$$\max_{x_j \in X - S_{m-1}} \left[ I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right] \tag{4}$$
The features are sorted using Formula (4), stopping at a maximum value of 500 sorted features.
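The incremental selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `mutual_information` and `mrmr_rank`, the dictionary-based input format, and the toy data are hypothetical, and mutual information is estimated from simple frequency counts of categorical values.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two categorical sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in count_ab.items()
    )


def mrmr_rank(features, target, max_features=500):
    """Greedy mRMR ordering: relevance minus mean redundancy with chosen features.

    features: dict mapping a feature name to a list of categorical values
    (hypothetical input format chosen for this sketch).
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    redundancy_sum = dict.fromkeys(features, 0.0)
    remaining = sorted(features)  # sorted so that ties are broken deterministically
    selected = []
    while remaining and len(selected) < max_features:
        def score(name):
            if not selected:
                return relevance[name]
            # len(selected) plays the role of m - 1 in the incremental criterion
            return relevance[name] - redundancy_sum[name] / len(selected)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        for name in remaining:
            redundancy_sum[name] += mutual_information(features[name], features[best])
    return selected


# Toy data (not from the study): f2 is an exact copy of f1, f3 is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],
    "f2": [0, 0, 0, 1, 1, 1, 1, 1],
    "f3": [0, 1, 0, 0, 1, 1, 0, 1],
}
ranking = mrmr_rank(features, target)
print(ranking)
```

In this toy example the duplicated feature `f2` is ranked last although it is as relevant as `f1`, which is the behavior the redundancy term is meant to produce.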
As already noted above, mRMR requires the features to be categorical. Each continuous feature was, therefore, transformed into a categorical feature, which was performed as follows: the value −1 was assigned for feature values smaller than µ − ασ, the value 0 for feature values in [µ − ασ, µ + ασ], and the value 1 for feature values larger than µ + ασ. Here, µ is the mean of the values of the feature, σ is their standard deviation, and α is a parameter controlling the expression rate, which was set to 0.5.
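The three-level transformation can be written down directly. The sketch below is illustrative only: the function name `discretize` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator of σ was used.

```python
from statistics import mean, pstdev


def discretize(values, alpha=0.5):
    """Map each value to -1, 0, or 1 relative to the band [mu - alpha*sigma, mu + alpha*sigma]."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; the sample SD would also be plausible
    lower, upper = mu - alpha * sigma, mu + alpha * sigma
    return [-1 if v < lower else (1 if v > upper else 0) for v in values]


# Toy values (not from the study): mean 3, SD about 1.41, band about [2.29, 3.71].
codes = discretize([1.0, 2.0, 3.0, 4.0, 5.0])
print(codes)
```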

Feature Selection Process Based on the Filter-Wrapper Method
The filtering step using mRMR does not yet deliver a compact set of selected features, but rather a list of 500 sorted features. However, these 500 features are unlikely to yield an optimal prediction rule with respect to the classification method we use. To tackle this issue, we used a filter-wrapper method to select features.
A wrapper method [45] is built around a particular classifier and has the direct goal of maximizing the prediction performance of that classifier. After obtaining a sorted list of 500 features using mRMR, we applied the following wrapper method: first, for m = 20, 30, 40, . . . , 500, apply the considered classification method (see next subsection) using only the first m features and calculate the cross-validated AUC value associated with the resulting prediction rule. Second, identify the optimal number N of features as the value of m that was associated with the largest cross-validated AUC value in the first step.
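The two wrapper steps amount to a grid search over m. The following sketch assumes a user-supplied function `cv_auc(m)` that returns the cross-validated AUC obtained with the first m ranked features; that function, the name `choose_num_features`, and the toy AUC curve are hypothetical and only serve to make the example runnable.

```python
def choose_num_features(cv_auc, grid=range(20, 501, 10)):
    """Evaluate cv_auc(m) for every m in the grid and return the best m with all scores."""
    scores = {m: cv_auc(m) for m in grid}
    best_m = max(scores, key=scores.get)  # on ties, the smallest m is kept
    return best_m, scores


def toy_cv_auc(m):
    """Stand-in for a real cross-validated AUC; peaks at m = 120 by construction."""
    return 0.85 - abs(m - 120) / 2000


best_m, scores = choose_num_features(toy_cv_auc)
print(best_m, round(scores[best_m], 4))
```

In a real analysis `cv_auc` would train the classifier on the first m mRMR-ranked features within each cross-validation fold.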

Multiple Kernel Learning Classification
The General MKL Model
In our study, we combined different data types into one model. Different omics data types have different feature representations, which is why directly combining these multiple sources of data as an input of one model would not be efficient [46]. Multiple Kernel Learning (MKL) can fuse heterogeneous omics data by using different kernels to represent input from different sources.
Equation (5) combines $M$ kernels into one single kernel in a linear fashion:
$$K(x, x') = \sum_{m=1}^{M} d_m K_m(x, x') \tag{5}$$
where $x$ and $x'$ both represent a vector of all features, $K_m(x, x')$ denotes the $m$th kernel, and $d_m$ is the weight of the $m$th kernel. Note that it is not only possible to use different kernels for different data types, but there can also be several kernels for the same data types. Bach et al. [47] have shown that the MKL formulation is actually a dual SVM problem. The approach simpleMKL is a supervised method based on an improvement of the linear MKL framework, the decision function of which is given by:
$$f(x) = \sum_{i=1}^{l} \alpha_i y_i \sum_{m=1}^{M} d_m K_m(x_i, x) + b \tag{6}$$
where $l$ is the number of patients, $x_i$ denotes the vector of all features for the $i$th patient, and $\alpha_i$ and $b$ are the parameters of the SVM. When applying the classifier to the feature vector of a new patient, the sign of $f(x)$ is used to decide which of the two classes the patient is assigned to. To optimize the two parameters of the SVM and the kernel coefficients, simpleMKL uses an iterative gradient descent method. This approach has proven to be efficient when the number of kernels is high [48]. Importantly, the particular MKL implementation considered in this paper uses an $L_2$-norm regularization leading to a sparse solution in the kernel coefficients. The optimization problem is of the form:
$$\min_{f \in H} \; \frac{1}{2} \|f\|_{H}^{2} + C \sum_{i=1}^{l} L\big(y_i, f(x_i)\big) \tag{7}$$
where $\|f\|_{H}$ denotes the norm of $f$ in the Hilbert space $H$ associated with the overall kernel $K$, $y_i$ denotes the outcome, $L$ is the loss function, and $C$ is the regularization parameter. The overall kernel can be divided into the individual kernels, replacing $\|f\|_{H}$ by $\sum_{m} \|f_m\|_{H_m}$, which leads to:
$$\min_{f_1, \ldots, f_M} \; \frac{1}{2} \Big( \sum_{m=1}^{M} \|f_m\|_{H_m} \Big)^{2} + C \sum_{i=1}^{l} L\Big(y_i, \sum_{m=1}^{M} f_m(x_i)\Big) \tag{8}$$
This equation shows several kernels in Hilbert space being combined in $L_2$-norm formation. Detailed information can be found in [49].
The MKL Model for Multi-Omics Data
MKL can fuse heterogeneous omics data by employing different kernels for the different omics data types and also several kernels per data type in an effort to make the decision function more powerful and improve the prediction performance. Therefore, in Omics-MKL, we use the simpleMKL method to construct different independent kernels for different omics data types, integrating them into a universal model. Specifically, we construct 10 different kernels for the five considered omics data types (CNV, gene methylation, gene expression, miRNA, and protein). For the kernels, we use two types of kernel functions for each omics data type, the Gaussian kernel and the polynomial kernel. As seen in the previous subsection, the simpleMKL method directly solves an integrated support vector machine optimization problem instead of learning kernel combinations from independent kernels, which greatly reduces the computational cost [43].
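To make the kernel construction concrete, the sketch below evaluates a weighted sum of Gaussian and polynomial kernels computed separately on each omics block. It only illustrates the linear kernel combination: the kernel hyperparameters (`gamma`, `degree`, `coef0`), the fixed equal weights, the two-block toy patients, and all function names are assumptions, whereas in simpleMKL the weights are learned jointly with the SVM.

```python
from math import exp


def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an assumed hyperparameter."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel; degree and coef0 are assumed hyperparameters."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


KERNELS = {"gaussian": gaussian_kernel, "polynomial": polynomial_kernel}


def combined_kernel(x, z, weights):
    """Weighted sum of per-omics kernels.

    x, z: dict mapping an omics type to that patient's feature vector.
    weights: dict mapping (omics type, kernel name) to the kernel weight.
    """
    return sum(
        d * KERNELS[kernel](x[omics], z[omics])
        for (omics, kernel), d in weights.items()
    )


# Toy patients with two omics blocks (the study uses five blocks, hence 10 kernels).
patient_a = {"expression": [1.0, 0.0], "methylation": [0.0, 2.0]}
patient_b = {"expression": [0.0, 1.0], "methylation": [0.0, 2.0]}
weights = {(omics, kernel): 0.25 for omics in patient_a for kernel in KERNELS}

k_aa = combined_kernel(patient_a, patient_a, weights)
k_ab = combined_kernel(patient_a, patient_b, weights)
print(k_aa, k_ab)
```

Keeping one kernel per (omics type, kernel function) pair means that each weight can be read as the contribution of one data type under one similarity measure.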

Experimental Design
To evaluate the performance of the methods, we used 10-fold nested cross-validation [50]. Nested cross-validation includes an outer loop and an inner loop. In the inner loop, we determine an optimal number of features N out of 20, 30, 40, . . . , 500, where we use mRMR to sort the features. In the outer loop, the best N from the inner loop is carried forward to build the final model and the performance is evaluated. The workflow is visualized in Supplementary Figure S1. Note that we applied the filter-wrapper approach using mRMR for all compared methods.
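The nested scheme can be sketched as follows. This is a simplified skeleton under stated assumptions rather than the authors' workflow: `fit_and_score(train_idx, test_idx, n_features)` is a hypothetical user-supplied function that ranks features with mRMR on the training indices only, fits the classifier, and returns the AUC on the test indices; folds are not stratified; and per-fold AUC values are averaged, whereas a pooled ROC curve would be an equally plausible choice.

```python
import random
from statistics import mean


def k_folds(indices, k, seed=0):
    """Shuffle the indices and split them into k disjoint folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n_samples, grid, fit_and_score, outer_k=10, inner_k=10):
    """Return the mean outer-loop AUC and the N chosen in each outer fold."""
    outer_aucs, chosen = [], []
    for test in k_folds(range(n_samples), outer_k):
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        inner_folds = k_folds(train, inner_k, seed=1)

        def inner_auc(n_features):
            # Inner loop: cross-validate on the outer training part only.
            scores = []
            for val in inner_folds:
                val_set = set(val)
                inner_train = [i for i in train if i not in val_set]
                scores.append(fit_and_score(inner_train, val, n_features))
            return mean(scores)

        best_n = max(grid, key=inner_auc)
        chosen.append(best_n)
        # Outer loop: refit with the chosen N and evaluate on the held-out fold.
        outer_aucs.append(fit_and_score(train, test, best_n))
    return mean(outer_aucs), chosen


def toy_fit_and_score(train_idx, test_idx, n_features):
    """Stand-in for a real model; the score depends only on n_features."""
    return 1.0 - abs(n_features - 40) / 100


auc, chosen = nested_cv(50, range(20, 501, 10), toy_fit_and_score)
print(auc, chosen)
```

The key property is that the held-out outer fold is never used when choosing N, so the outer AUC is not optimistically biased by the tuning step.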
We used the receiver operating characteristic (ROC) curve and the area beneath it, the AUC, to evaluate the performance of the algorithms.
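For reference, the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The short sketch below computes it this way; the function name `roc_auc`, the 0/1 label coding, and the toy scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example (not from the study): 3 of the 4 positive-negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```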

Comparison of the Achieved Prediction Performances When Using Multi-Omics Data and Single-Omics Data
To assess the added value of multi-omics data over single-omics data, we compared the classification performances of Omics-MKL when using single-omics data (methylation, CNV, miRNA, RNA-seq, protein) and multi-omics data. Specifically, six different Omics-MKL-based prediction rules were constructed (five single-omics rules and one multi-omics rule), with two kernel functions used for each data type.
As Figure 2 shows, among the single-omics prediction rules, the one using methylation data (Methyl-MKL) has the highest AUC value of 0.8233, and gene expression data (RNA-seq-MKL) showed comparable performance with an AUC of 0.8074. However, none of the AUC values obtained based on the single-omics data sources were larger than the AUC of 0.8614 obtained when considering all omics data types concurrently (Omics-MKL). This illustrates that integrating multi-omics genetic data can effectively improve the accuracy of LUAD staging compared to using only single-omics data.

Effectiveness of Integrating Multiple Omics Data Types
In this subsection, we illustrate that each considered omics data type can contribute to an improved prediction performance when using multi-omics data. In an iterative fashion, we individually removed methylation, CNV, miRNA, RNA-seq, or protein information and considered an Omics-MKL prediction rule without the removed data type. This also helps to identify which omics data types play important roles in prediction. The smaller the AUC becomes after removing an omics data type, the more important the respective data type tends to be. As seen in Figure 3, the prediction rule based on all available omics data types performed best, suggesting that each data type improves the prediction of LUAD staging. Gene expression and methylation seem to play more important roles than the other data types. After removing the methylation and RNA-seq data from the multi-omics data, the AUC decreased by 0.0397 and 0.0400, respectively, whereas after removing the miRNA data, the AUC only decreased by 0.0138. The fact that the AUC decreased after removing each of the omics data types suggests that the integration of all available genomic data sources can be beneficial in terms of prediction performance in the staging of LUAD.
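As a quick check of the reported decreases, the AUC values implied for the reduced models can be derived from the full-model AUC of 0.8614 reported for Omics-MKL. The snippet below only performs this arithmetic for the three data types whose decreases are stated in the text; the variable names are illustrative, and the decreases for CNV and protein are not given here, so they are left out.

```python
full_auc = 0.8614  # AUC of Omics-MKL with all five omics data types
auc_drop = {"methylation": 0.0397, "RNA-seq": 0.0400, "miRNA": 0.0138}

# AUC implied for the model trained without the respective data type.
implied_auc = {omics: round(full_auc - drop, 4) for omics, drop in auc_drop.items()}

# Data types ordered from largest to smallest decrease.
by_importance = sorted(auc_drop, key=auc_drop.get, reverse=True)

print(implied_auc)
print(by_importance)
```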

Comparison with Basic Machine Learning Methods Using Multi-Omics Data
As already discussed in the introduction, we do not make any claims on the effectiveness of Omics-MKL over that of other multi-omics prediction methods. However, to verify that Omics-MKL delivers meaningful predictions, we compared it with basic machine learning algorithms, namely SVM, K-nearest neighbors (KNN), logistic regression (LR), and random forests (RF). The results are shown in Figure 4. The Omics-MKL algorithm delivered the largest AUC value. More precisely, Omics-MKL delivered an AUC value of 0.8614, which is 25.59%, 11.57%, 9.77%, and 7.12% higher than that obtained for KNN, RF, SVM, and LR, respectively.
Figure 3. Comparison between the prediction performance obtained using all five omics data types and after removing one omics data type at a time (prediction method: Omics-MKL).

Analysis of the Selected Features
In the previous subsections, we performed nested cross-validation to evaluate the performance of the compared approaches. With this procedure, in each iteration of the outer cross-validation loop, a different subset of omics features is selected. However, it would be interesting to obtain a single set of selected features to investigate which features seem to be particularly important for stage prediction using multi-omics data in LUAD. To obtain such a single set of selected features, we first performed a non-nested 10-fold cross-validation for each considered N value (20, 30, . . . , 500), repeating mRMR in each iteration. Subsequently, we used the N value that was associated with the maximum cross-validated AUC value to perform the final feature selection using the whole dataset, that is, without cross-validation. Figure 5 illustrates that, when the value of N is varied, the performance of the model changes strongly. For N equal to 70, the cross-validated AUC value of the Omics-MKL classifier was best. We provide the cross-validated AUC values for N = 20, 30, . . . , 70 in Supplementary Table S1. The percentages of the different data types among the selected features are shown in Figure 6. The selected features included 10 (14.29%) CNV features, 21 (30%) methylation features, 34 (48.57%) gene expression features, 3 (4.29%) miRNA features, and 2 (2.86%) protein expression features (cf. also Table 2).
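The reported percentages follow directly from the feature counts. The short check below recomputes them from the counts given in the text; the variable names are illustrative.

```python
counts = {
    "CNV": 10,
    "methylation": 21,
    "gene expression": 34,
    "miRNA": 3,
    "protein expression": 2,
}
total = sum(counts.values())  # number of selected features
shares = {omics: round(100 * n / total, 2) for omics, n in counts.items()}

print(total)
print(shares)
```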
Because we applied mRMR before feature selection, the selected features are sorted according to their association with the outcome and the mutual information between them. The ranks of the selected features are shown in Figure 7.
Figure 7. The selected features ranked according to their association with the outcome and the mutual information between them. Red represents protein features, yellow represents miRNA features, brown represents CNV features, purple represents methylation features, and blue represents RNA-seq features.

Enrichment Analysis of the Selected Features
To further understand the roles of the selected features and the differences between LUAD stages, we conducted an enrichment analysis of these features using Metascape [51]. The whole set of human genes was employed as the background, and enrichment was assessed against the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway databases. The resulting molecular functions (MFs) are shown in Figure 8. The enriched functions were kinase activity and RNA expression-related functions. In addition, the most significantly enriched biological processes (BPs), cellular components (CCs), and KEGG pathways were negative regulation of catabolic processes, the centriolar satellite, and the Phospholipase D signaling pathway, respectively (see Supplementary Figures S2-S4).

Analysis of Those Selected Features That Are Known to Be Associated with LUAD
We searched for all 70 selected features in the NCBI database and found that 18 of these features have been reported to have functions in LUAD. Moreover, 11 further features have previously been reported to be associated with lung cancer progression. Table 3 lists the top ten features related to LUAD ranked by mRMR. In addition, we provide the reported information on the remaining selected features in Supplementary Table S2.

Table 3. Rank | ID | Genes | The Content of the Report | PubMed ID

The MAPK, PI3K-Akt, Ras, and cGMP-PRKG1 signaling pathways were considered to be most probably correlated with platinum resistance.

SHC1
In NSCLC, the failure of pathways which involve factors such as DAPK1, GADD45A, SHC1, and TP53, in response to short telomeres, could promote tumor progression.

YTHDF2
The m6A-related genes METTL3, YTHDF1, and YTHDF2 could serve as novel biomarkers for the prognosis of LUAD.

BCL7B
Compared with the combined human ACs, 39 genes with similar expression changes in murine lung tumors and human ACs/LCCs were identified, such as the oncogene related BCL7B, the cell cycle regulator CDK4, and the proapoptotic Endophilin B1.
PMID: 14647414

DSG2 gene overexpression has been found to correlate with poor prognosis in LUAD patients [52]. The XAF1 gene has been found to inhibit cell proliferation and induce apoptosis in the human LUAD cell line A549 in vitro [53]. The CAPN1 gene has been reported to promote malignant behavior and erlotinib resistance in LUAD [54].
To further understand the differences between LUAD stages, we performed a statistical analysis of the 18 selected genes known to be associated with LUAD. As seen in Figure 9, for RNA-seq data, the expression of CD109, MAP4, SHC1, DSG2, and CAPNS1 was lower in the early stages of LUAD than in the late stages, while the expression of DAP, ARFRP1, and XAF1 was lower in the late stages. In the case of the methylation data, only BCAN had lower methylation at early stages, while PRKG1, CTDSPL, and HLA.E had lower methylation at late stages. Among the 18 LUAD-related features, only one protein feature was selected, PI3KP85, which was expressed more strongly in the early stages of LUAD. Interestingly, for the selected CNV feature YTHDF2, there was no change in most patients in the early stages, and neither homozygous deletion nor high-level amplification was observed in any of the patients in the late stages.

Discussion
Using five omics data types jointly delivered better classification performance than when using only four omics data types or single-omics data. These results indicate that combining various omics data types into multi-omics data seems to be an efficient way of improving the classification of lung adenocarcinoma staging.
We used the self-developed method Omics-MKL in our experiments. Given that we did not compare Omics-MKL to other multi-omics prediction approaches and that we only analyzed one specific dataset, it is not possible to recommend Omics-MKL without limitation in clinical applications. Other multi-omics approaches may deliver better prediction results.
The focus in this paper was not on Omics-MKL, but on illustrating the predictive value of multi-omics data in the staging of LUAD. An advantage of Omics-MKL in the context of the analyses performed in this paper was that the method functions in the same way when applied to single-omics data as when applied to multi-omics data. This makes the results obtained for multi-omics data and single-omics data comparable. In contrast, if we had used different methods for multi-omics data and single-omics data, this would have hampered the comparability between the results obtained for these two data types. Omics-MKL outperformed the considered traditional machine learning classification methods on the investigated dataset. A possible reason for this is that, by using different kernels for different omics data types, Omics-MKL may better capture heterogeneous information from different types of data than the other compared methods, which do not explicitly consider that the features stem from different omics data types. An advantage of using mRMR for multi-omics data is that, by minimizing redundancy in the feature selection, we account for the known fact that the predictive information in different omics data types overlaps strongly.
Gene expression and methylation were the two most important omics data types in our experiments. Methylation data played the most important role when building LUAD staging models using single-omics data. DNA methylation alteration is frequently observed in LUAD and plays an important role in carcinogenesis, diagnosis, and prediction [55,56]. The promoter regions of tumor suppressor genes are often hypermethylated, resulting in the silencing of the corresponding genes in tumors. It has been reported that BRCA2 [57], BCL2 [58], APC [59], and p16 [60] are hypermethylated in NSCLC, and p16 [60] gene promoter methylation is used as a biomarker for the diagnosis of NSCLC.
Our study has several limitations. First, the sample size for the multi-omics data is relatively small, which is why the performance estimates are likely quite variable. As shown in [61], it is not possible to obtain an unbiased estimate of the variability of cross-validated performance estimates, which is why we are not able to investigate whether the observed performance differences between the methods are statistically significant. Second, our experiments integrated only omics data. Clinical information and pathological images were not considered in our study. Third, this work considered only internal validation via cross-validation. To obtain definitive conclusions on the ranking between the approaches, it would be necessary to analyze large numbers of multi-omics data samples, which are not available at this point. Moreover, it would also be interesting to compare the investigated methods using external validation. Including clinical information and imaging data would likely improve the performance further in comparison to using multi-omics data alone. In future work, we also intend to consider classifying cancer subtypes.

Conclusions
In this article, we used a self-developed method, Omics-MKL, for evaluating and illustrating the predictive value of multi-omics data in the staging of LUAD based on a publicly available dataset. Our results clearly indicate that using multi-omics data for the staging of LUAD has the potential to outperform using single-omics data and that each omics data type improves the predictions. At the same time, through the analysis of important genes and pathways, we tried to find some biological explanations for the differences between LUAD stages, and provide guidance for exploring the biological models of these stages.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/genes12121872/s1, Figure S1: 10-fold Nested Cross-Validation with Omics_MKL, Figure S2: Bar graph of enriched biological processes based on the 70 selected features, Figure S3: Bar graph of enriched cellular components based on the 70 selected features, Figure S4: Bar graph of enriched KEGG pathways based on the 70 selected features, Table S1: Cross-validated AUC values for N = 20 to 70, Table S2: The reported information on the remaining selected features.