Adaptive Sparse Multi-Block PLS Discriminant Analysis: An Integrative Method for Identifying Key Biomarkers from Multi-Omics Data

With the growing use of high-throughput technologies, multi-omics data comprising various types of high-dimensional omics data are increasingly being generated to explore the association between the molecular mechanisms of the host and diseases. In this study, we present adaptive sparse multi-block partial least squares discriminant analysis (asmbPLS-DA), an extension of our previous work, asmbPLS. This integrative approach identifies the most relevant features across different types of omics data while discriminating multiple disease outcome groups. We used simulation data covering various scenarios and a real dataset from the TCGA project to demonstrate that asmbPLS-DA can identify key biomarkers from each type of omics data with better biological relevance than existing competitive methods. Moreover, asmbPLS-DA showed comparable performance in classifying subjects by disease status or phenotype using integrated multi-omics molecular profiles, especially when combined with other classification algorithms such as linear discriminant analysis and random forest. We have made the R package asmbPLS, which implements this method, publicly available on GitHub. Overall, asmbPLS-DA achieved competitive performance in terms of feature selection and classification. We believe that asmbPLS-DA can be a valuable tool for multi-omics research.


Introduction
The improvements in high-throughput technology have enabled the collection of different types of omics data, such as transcriptomics, epigenomics, proteomics, and metabolomics, quickly and cost-effectively for the same subject. Each type of omics data, on its own, may contribute unique information. By identifying biomarkers within each type of omics data, we can gain insight into which biological pathways or processes differ between phenotypic groups. However, when analyzing only one type of omics data, the correlation between different omics layers is not considered, preventing us from understanding the complex relationship (molecular mechanism) between the multi-omics data and the host systematically and holistically. Therefore, in order to draw a more comprehensive view of systemic biological processes, omics data from different platforms have to be integrated and analyzed [1].
Multi-omics data are often structured in blocks and characterized by hundreds or thousands of variables measured on only a small number of samples. Meanwhile, they often exhibit a high degree of correlation within and between different types of omics data. With such a large number of variables, interpreting their collective contribution to the process can be difficult. Therefore, there is a need to perform variable selection to obtain the most relevant and predictive features for each block.
In the past decade, many approaches have been proposed for identifying predictive biomarkers that are relevant to the patient's phenotypes, such as L1-regularized logistic regression and L1-regularized multinomial logistic regression [2].

Introduction to asmbPLS

Model Building
In asmbPLS, the goal is to construct a set of orthogonal PLS components that maximize the covariance between the two latent variables t_j^super and u_j, which represent information from X and Y, respectively. X and Y are always centered and scaled. For the jth PLS component (j = 1, 2, ..., J), the dimension reduction is conducted by taking a linear combination of the predictor features to obtain the block score t_j^b = X_b ω_j^b / √(m_X_b) (b = 1, ..., B) for each block, where ω_j^b contains the block variable weights in X_b, indicating the relevance of variables, and m_X_b is the number of variables in X_b, used for block scaling. Within each block, the soft-thresholding function sparse(x, λ) = sign(x)(|x| − λ)_+ is applied to ω_j^b, with a specific quantile of the weights used as λ for feature selection. By doing this, the most relevant features are retained given pre-defined quantiles for each block. After calculating t_j^b for each block, all block scores are combined into a new matrix T_j. The dimension reduction is implemented again by taking a linear combination of the different block scores, t_j^super = T_j ω_j^super, where ω_j^super is the weight vector indicating the relevance of each block and t_j^super is the super score representing the information of the predictor matrix X. Similarly, u_j is a summary vector of Y, i.e., u_j = Y q_j (q_j being the weight for each column of Y).
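The quantile-based soft thresholding above can be sketched as follows. This is a hedged Python illustration (asmbPLS itself is an R package); the function name and the quantile interpolation are assumptions, not the package API.

```python
import statistics

def sparsify(weights, quantile):
    """Soft-threshold a weight vector: sparse(x, lam) = sign(x)(|x| - lam)+.

    lam is taken as the given quantile of the absolute weights, so e.g.
    quantile = 0.95 zeroes out roughly 95% of the weights and shrinks
    the survivors toward zero. A minimal sketch of the thresholding
    step, not the asmbPLS implementation.
    """
    abs_w = [abs(w) for w in weights]
    # percentile cut points of |weights|; index picks the requested quantile
    cuts = statistics.quantiles(abs_w, n=100, method="inclusive")
    lam = cuts[max(0, min(98, round(quantile * 100) - 1))]
    return [(abs(w) - lam) * (1 if w > 0 else -1) if abs(w) > lam else 0.0
            for w in weights]

w = [0.9, -0.05, 0.1, -0.8, 0.02]
print(sparsify(w, 0.6))  # only the two largest-magnitude weights survive
```

A higher quantile therefore means a sparser block, which is how the "degree of sparsity" is tuned later in the cross-validation.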
The problem can be formally expressed as maximizing the covariance cov(t_j^super, u_j) subject to unit-norm weight vectors. Once all the parameters are calculated for the first PLS component, X and Y are deflated, and the deflated X and Y are used for the calculation of the second PLS component, and so on.

Prediction
Once the asmbPLS model is built, we can predict the outcome for both the original and new samples. Specifically, given the new data X_new, we can calculate t_j^b, T_j, and t_j^super, and the outcome for the unknown sample is then predicted from these scores; the prediction depends on the number of PLS components used.

asmbPLS Extended to asmbPLS-DA
For asmbPLS-DA, we focus on the categorical outcome instead of the continuous outcome. The extension of asmbPLS to asmbPLS-DA is straightforward. For the binary outcome, a matrix with a single dummy-variable column (n × 1) is used as the outcome matrix; for the multiclass outcome with G groups, a matrix with G dummy-variable columns (n × G) is used.
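The dummy encoding of the outcome matrix can be sketched as follows; a Python illustration of the encoding described above, with an illustrative function name (the R package has its own internal handling).

```python
def dummy_outcome(labels):
    """Encode a categorical outcome as the matrix Y used by asmbPLS-DA.

    A binary outcome becomes an n x 1 column of 0/1 indicators; a
    G-class outcome becomes an n x G matrix of one-hot dummy variables.
    """
    groups = sorted(set(labels))
    if len(groups) == 2:
        # single indicator column: second group coded as 1
        return [[1 if y == groups[1] else 0] for y in labels]
    return [[1 if y == g else 0 for g in groups] for y in labels]

print(dummy_outcome(["A", "B", "A"]))       # n x 1 for a binary outcome
print(dummy_outcome(["A", "B", "C", "B"]))  # n x 3 for three classes
```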

Data Pre-Processing
In asmbPLS-DA, the pre-processing of the data differs. X is always column-wise centered and scaled; Y is only centered. Usually, for the binary outcome, 0.5 (the average of the two group indicators, i.e., 0 and 1) is used as the cutoff for the group decision when the two groups have equal sample sizes. However, when the group sizes are unequal, the boundary is shifted toward the larger group, resulting in more misclassification. Therefore, to handle this, we implement a weighted center in asmbPLS-DA by subtracting the average of the means of the two groups, A and B, i.e., (X̄_A + X̄_B)/2, for each feature in X. We conduct the same weighted centering for Y using the same group information. For the multiclass outcome, a similar procedure is applied; for example, with three groups (A, B, and C), we use the average of the means of the three groups, (X̄_A + X̄_B + X̄_C)/3, to implement the weighted center.

Decision Rules
Over the last two decades, different decision rules have been proposed to translate the Y-estimate into a group assignment. We have included many of them in the proposed algorithm. The details are listed in Table 1, where √ and × indicate whether the decision rule can be applied to the corresponding outcome type.

Table 1. Decision rules (rule; object; binary/multiclass applicability; definition):
- Fixed Cutoff (object: Y; binary √, multiclass ×): Y-estimates over 0.5 are assigned to 1, while Y-estimates under 0.5 are assigned to 0.
- Max Y (object: Y): the sample is assigned to the group with the maximum Y-estimate.
- Euclidean Distance (ED) (object: t_super or Y; binary √, multiclass √): the sample is assigned to the group with the shortest ED using t_super or Y.
- Mahalanobis Distance (MD) (object: t_super): the sample is assigned to the group with the shortest MD using t_super.
- Principal Components + MD (object: Y): the sample is assigned to the group with the shortest MD using the first G − 1 principal components of Y.
Among the included decision rules, the Euclidean distance (ED) is one of the most widely used measures, defined as d_ig = ||x_i − x̄_g||_2, where d_ig is the ED of sample i to the centroid x̄_g of group g. The Mahalanobis distance (MD) [10] is another popular measure, which accounts for the correlation between variables: d_ig = √((x_i − x̄_g)^T S^{-1} (x_i − x̄_g)), with the pooled variance-covariance matrix S = [(n_1 − 1)S_1 + ... + (n_g − 1)S_g + ... + (n_G − 1)S_G] / (n_1 + ... + n_G − G), where S_g is the variance-covariance matrix for group g and n_g is its sample size.
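The ED rule reduces to a nearest-centroid assignment and can be sketched as follows (a Python illustration; the MD rule is the same idea with S^{-1}-weighted distances, omitted here to keep the sketch dependency-free):

```python
import math

def classify_ed(sample, centroids):
    """Assign a sample to the group with the shortest Euclidean distance.

    `sample` is a score vector (e.g., t_super for a new subject) and
    `centroids` maps each group label to the centroid of that group's
    scores. Illustrative names, not the package API.
    """
    def ed(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(centroids, key=lambda g: ed(sample, centroids[g]))

centroids = {"tumor": [2.0, 0.5], "normal": [-1.5, -0.2]}
print(classify_ed([1.8, 0.1], centroids))  # closer to the tumor centroid
```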

Parameter Tuning and Model Selection
We let B = 2 and the number of PLS components = 3 for illustration. K-fold CV is used to tune the quantile combination for the different PLS components in different blocks, obtaining the quantile combination that gives the highest balanced accuracy (BA). For the binary outcome, BA is defined as (sensitivity + specificity)/2, where sensitivity = true positive/(true positive + false negative) and specificity = true negative/(true negative + false positive); for the multiclass outcome, BA is defined as the average recall (sensitivity) across all the groups. Tuning these parameters is equivalent to choosing the "degree of sparsity", i.e., the number of non-zero weights for each PLS component in each block.
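The BA criterion can be computed directly from its definition (a Python sketch of the tuning criterion, not the package implementation):

```python
def balanced_accuracy(truth, pred):
    """Average per-class recall, matching the BA used for CV tuning.

    For two classes this reduces to (sensitivity + specificity) / 2;
    for G classes it is the mean recall across all groups.
    """
    groups = sorted(set(truth))
    recalls = []
    for g in groups:
        idx = [i for i, t in enumerate(truth) if t == g]
        recalls.append(sum(pred[i] == g for i in idx) / len(idx))
    return sum(recalls) / len(groups)

truth = [1, 1, 1, 1, 0, 0]
pred  = [1, 1, 1, 1, 1, 0]   # perfect sensitivity, specificity = 0.5
print(balanced_accuracy(truth, pred))  # 0.75
```

Unlike overall accuracy, BA is not dominated by the larger group, which matters for the unbalanced designs considered later.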
In the K-fold CV, the samples are randomly placed into K roughly equal groups, with the ratios of samples from different classes kept equal across groups. This step is repeated N_CV times to generate N_CV different sets of groups. For each set with K groups, the samples from the kth group are used as the validation set while the remaining samples serve as the training set, with this process repeated for k = 1, ..., K. For convenience, the same set of candidate degrees of sparsity, i.e., quantiles {q_1^block1, ..., q_a^block1} for block 1 and {q_1^block2, ..., q_b^block2} for block 2, is used for the different PLS components. The following steps are then taken: (1) asmbPLS-DA models using the first l (starting from l = 1) PLS component(s), with different quantile combinations for the lth PLS component, are fitted on the training set; (2) the BA for each model is calculated on the validation set; (3) the average K-fold BA across the N_CV sets is calculated for each combination as BA_{N_CV,K} = (1/(N_CV × K)) Σ_{n_CV=1}^{N_CV} Σ_{k=1}^{K} BA_{n_CV,k}, where BA_{n_CV,k} is the BA using the samples from the kth group of the n_CV-th set as the validation set; (4) the combination with the highest BA_{N_CV,K} is chosen for the lth PLS component; (5) the data deflation is then implemented for X and Y, and we let l = l + 1; and (6) steps (1)-(5) are repeated until we obtain the combinations for all three PLS components. Note that the combinations with the highest BA_{N_CV,K} obtained from the previous steps are used for the first (l − 1) PLS component(s). In our study, we choose K = 5 and N_CV = 10, which is sufficient for parameter tuning. The selection of the candidate quantiles for each block is usually based on prior knowledge of the corresponding omics data: blocks with only a small number of relevant features should be given higher quantiles, and vice versa. If there is no prior knowledge, a wider range of quantiles should be tried.
Using the results of the CV, we can also determine the number of PLS components used for classification. To avoid over-fitting, the strategy for selecting the optimal number of PLS components is: (1) let the initial number of components be comp = 1; (2) check whether including one more component increases the BA by at least 0.005, i.e., BA_comp + 0.005 ≤ BA_{comp+1}; and (3) if so, let comp = comp + 1 and go back to step (2); otherwise, let comp be the selected number of components.
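The stopping rule can be sketched in a few lines (a Python illustration; `ba_by_comp` and `delta` are our names):

```python
def select_n_components(ba_by_comp, delta=0.005):
    """Pick the number of PLS components from CV balanced accuracies.

    Starting from one component, add another only while it improves BA
    by at least `delta` (the 0.005 threshold in the text), which guards
    against over-fitting. `ba_by_comp[i]` is the CV BA using i + 1
    components.
    """
    comp = 1
    while comp < len(ba_by_comp) and ba_by_comp[comp - 1] + delta <= ba_by_comp[comp]:
        comp += 1
    return comp

print(select_n_components([0.80, 0.85, 0.852]))  # stops at 2 components
```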

Vote Algorithm
In addition to the different decision rules, we have included several vote algorithms to help the discrimination. (1) The unweighted vote, where each decision rule is assigned equal weight, and the final classification is based on the majority of the decision rules' results.
(2) The weighted vote, where each decision rule is assigned a weight based on its BA with the optimal quantile combination, i.e., weight_i = log BA_i. (3) The ranked vote, where, in addition to BA, the overall accuracy, recall, precision, and F1 score for the optimal quantile combination are also collected; the rank of each decision rule is calculated for each measure and a combined rank is obtained. The decision rule with the highest combined rank is used for the final classification.
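The weighted vote can be sketched as follows. This Python illustration uses log(BA_i / (1 − BA_i)) as the weight so that better rules get larger positive weights; the exact functional form of the weight (the text gives it only as a function of log BA_i) is an assumption here. The unweighted vote is the same computation with all weights equal to 1.

```python
import math
from collections import defaultdict

def weighted_vote(rule_predictions, rule_ba):
    """Combine decision rules by BA-weighted voting.

    Each rule votes for its predicted group with weight
    log(BA / (1 - BA)) -- an assumed log-odds weight, chosen so the
    weight is positive and increasing in BA. The group with the largest
    total weight wins.
    """
    scores = defaultdict(float)
    for rule, pred in rule_predictions.items():
        ba = rule_ba[rule]
        scores[pred] += math.log(ba / (1.0 - ba))
    return max(scores, key=scores.get)

preds = {"fixed_cutoff": "tumor", "ed": "normal", "md": "tumor"}
ba = {"fixed_cutoff": 0.90, "ed": 0.95, "md": 0.60}
print(weighted_vote(preds, ba))  # the single high-BA rule can outvote two weak ones
```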

Another Classification Strategy
In addition to the above-mentioned original asmbPLS-DA classification algorithms, we have combined asmbPLS-DA with other methods for classification. Once the asmbPLS-DA model is built, the super score of different PLS components or the selected features can be used as predictors for other classification algorithms, such as RF and LDA. By doing this, we aim to test if this two-stage strategy will increase the classification capability of asmbPLS-DA.
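Stage 1 of this strategy, projecting multi-block data onto low-dimensional super scores, can be sketched as follows (a Python illustration with our own names; in stage 2 these scores, or the selected features, would be passed to any classifier such as RF or LDA):

```python
def super_scores(blocks, block_weights, super_weight):
    """Compute one super score per sample for a single PLS component.

    `blocks[b]` is an n x p_b list-of-lists, `block_weights[b]` the
    (sparse) weight vector for block b, and `super_weight[b]` the
    block's weight in the super score. A much-simplified sketch of the
    projection, not the asmbPLS package API.
    """
    n = len(blocks[0])
    scores = []
    for i in range(n):
        s = 0.0
        for b, block in enumerate(blocks):
            # block score: linear combination of the block's features
            t_b = sum(x * w for x, w in zip(block[i], block_weights[b]))
            s += super_weight[b] * t_b
        scores.append(s)
    return scores

X1 = [[1.0, 0.0], [0.0, 1.0]]   # block 1: 2 samples x 2 features
X2 = [[2.0], [0.5]]             # block 2: 2 samples x 1 feature
print(super_scores([X1, X2], [[0.6, 0.8], [1.0]], [0.7, 0.3]))
```

The appeal of the two-stage design is that the downstream classifier sees only a handful of super scores (or selected features) instead of thousands of raw variables.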

Technicalities and Implementations
All the implementations were conducted using R 4.2.1 [11] in the "HiperGator 3.0" high-performance computing cluster, which includes 70,320 cores with 8 GB of RAM on average for each core, at the University of Florida.
The real data downloading and pre-processing were implemented using the R package TCGAbiolinks (2.25.3) [12]. The GDCquery function was used to query the required GDC data, where the project option was set to be "TCGA-BRCA". For gene data, we set data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", and workflow.type = "STAR-Counts"; for miRNA data, we set data.category = "Transcriptome Profiling", data.type = "miRNA Expression Quantification", and workflow.type = "BCGSC miRNA Profiling"; for protein data, we set data.category = "Proteome Profiling" and data.type = "Protein Expression Quantification".

Results and Discussion
asmbPLS-DA employs different decision rules and vote functions that can impact its performance in various scenarios. Thus, the identification of the best decision rules and vote functions is crucial for selecting the optimal model. Moreover, asmbPLS-DA can be combined with other classification methods, such as random forest (RF) [18] and linear discriminant analysis (LDA) [19], to improve the accuracy and reduce the risk of overfitting. In this case, the outputs of asmbPLS-DA, such as the super score (of the first two or three PLS components) or selected features, can be utilized as predictors for the other classification algorithms.

Simulation Strategies
The objective of the simulation studies is to investigate the performance of asmbPLS-DA and compare it with other classification methods. To accomplish this, we consider two correlated high-dimensional predictor blocks that vary in the total number of variables and the number of highly correlated variables in each block. In addition, we examine two phenotypic group scenarios, including binary and multiclass (three groups) outcomes. By conducting simulation studies with varying parameter settings, we aim to comprehensively evaluate the performance of asmbPLS-DA and other methods for feature selection and classification.
We simulate n samples, q variables in the first block X_1, and p variables in the second block X_2. Among the q + p variables, q_h variables from X_1 and p_h variables from X_2 are simulated to be highly correlated, and we set the correlation between all other variables to be low to moderate. The highly correlated part and the non-highly correlated part are each simulated from a multivariate normal distribution, where the mean (variance) for variables in X_1 and X_2 is set to 0 (1) and 5 (3), respectively. To simulate the within-block and between-block correlation, for Σ_h, the variance-covariance matrix of the highly correlated part, we set all the off-diagonal elements to c_ij σ_i σ_j, where σ_i and σ_j are the standard deviations of the ith and jth features, respectively, and c_ij ∼ Uniform(0.6, 0.9). Similarly, we did the same for Σ_l with c_ij ∼ Uniform(−0.5, 0.5). We used the posdefify function from the R package sfsmisc to make the variance-covariance matrix positive definite. n, q, and q_h are always set to 100, 1000, and 100, while we vary p and p_h across scenarios: (1) p = 50 and p_h = 5, where there is only a small number of features in X_2; (2) p = 200 and p_h = 20, where there are more features in X_2 but still many fewer than in X_1; and (3) p = 1000 and p_h = 100, where X_1 and X_2 have an equal number of features. Once X_1 and X_2 are simulated, they are centered and scaled (X_1.scale and X_2.scale) and then used to simulate the binary and multiclass outcomes Y following the ideas of logistic regression and multinomial logistic regression, respectively. Only the first ten variables in X_1 and the first five variables in X_2 are set to be relevant to the outcome, with non-zero coefficients, while the coefficients of all other variables are zero. For the binary outcome, Gaussian noise with covariance rI is added to the linear predictor, where I is the identity matrix and r indicates the scale of noise. For the multiclass outcome, β_1.2 and β_1.3 have the same settings as β_1, and β_2.2 and β_2.3 have the same settings as β_2.
Since only the first ten variables in X_1 and the first five variables in X_2 are set to be relevant, we consider four different variable structures for each setting mentioned above by permuting the order of the variables in different blocks: (1) cor: all of the relevant variables are highly correlated variables; (2) order: among the relevant variables, only the first six variables in X_1 and the first three variables in X_2 are highly correlated variables; (3) random: the order of all the variables is random; and (4) inverse: none of the relevant variables are highly correlated variables. In addition, r is set to different values (0, 1, 2, 3, and 5) to indicate different scales of noise. Therefore, we have 60 scenarios for each type of outcome. For each scenario, 100 datasets are generated, and an additional n_test = n samples are generated using the same design parameters to serve as the test set for each dataset.
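The outcome-generation step can be sketched in a much-simplified, single-block form (a Python illustration with our own parameter names; the actual design uses two correlated blocks and the R machinery described above):

```python
import math
import random

def simulate_binary_outcome(n, q, n_relevant, beta=1.0, r=1.0, seed=1):
    """Simulate one predictor block and a binary outcome, logistic-style.

    Only the first `n_relevant` of q standard-normal features carry a
    non-zero coefficient `beta`; Gaussian noise with standard deviation
    `r` is added to the linear predictor before the logistic draw.
    Within-block correlation is omitted to keep the sketch short.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(q)]
        eta = beta * sum(row[:n_relevant]) + rng.gauss(0.0, r)
        p = 1.0 / (1.0 + math.exp(-eta))  # logistic link
        X.append(row)
        y.append(1 if rng.random() < p else 0)
    return X, y

X, y = simulate_binary_outcome(n=100, q=20, n_relevant=5)
print(len(X), len(X[0]), sum(y))
```

Increasing `r` weakens the signal-to-noise ratio, mirroring the role of r in the scenarios above.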

Simulation Results
To evaluate the performance of asmbPLS-DA and compare it with other classification methods, we conducted a comprehensive analysis of feature selection, classification accuracy, and computational efficiency. We compared asmbPLS-DA with L1-regularized logistic regression and L1-regularized multinomial logistic regression for binary and multiclass outcomes, respectively. In addition, we examined sPLS-DA and DIABLO for both binary and multiclass outcomes, and IPF-Lasso for binary outcomes. To perform the comparisons, we used 5-fold cross-validation (CV) with 10 repetitions for asmbPLS-DA, sPLS-DA, IPF-Lasso, and DIABLO. We used the corresponding R package name glmnet to represent L1-regularized logistic regression and L1-regularized multinomial logistic regression in the figures for binary and multiclass outcomes, respectively. For asmbPLS-DA, we conducted the CV with varying quantile combinations for each scenario.
In our simulation setting, only the first ten variables in X_1 and the first five variables in X_2 are set to be relevant. The sensitivity and specificity for each block were calculated based on the feature selection results. Figure 1 displays the overall sensitivity and specificity of different methods for the binary outcome. When the relevant features are highly correlated (structure = cor), DIABLO exhibits nearly the highest sensitivity in scenarios with p = 50 and 200, while L1-regularized logistic regression and IPF-Lasso show the lowest sensitivity across dimension settings due to the Lasso-based methods' limitation of selecting only one variable from a group of highly correlated variables. asmbPLS-DA performs similarly to DIABLO in scenarios with p = 50, 200, and low noise, and performs better than all the other methods in scenarios with p = 1000 and low noise. However, in scenarios with structures = order, random, and inverse, DIABLO's performance is nearly the worst. Among the scenarios with p = 50, asmbPLS-DA shows higher sensitivity than all the other methods. As p increases to 200 and 1000, the performance of asmbPLS-DA is similar to the other methods, except for DIABLO. Correspondingly, higher sensitivity is accompanied by lower specificity. Among all methods, sPLS-DA has the lowest specificity. More details of the sensitivity and specificity of different blocks for the binary outcome are presented in Figure S1. In scenarios with p = 50, asmbPLS-DA and DIABLO display higher sensitivity than methods that do not consider group structure for the smaller block X_2, as methods without group information may assign more weight to the block with more features.
For the multiclass outcome, the results of feature selection are presented in Figure 2. In nearly all the scenarios except structure = cor and p = 200, asmbPLS-DA performs the best or close to the best regarding sensitivity. In addition, the specificity of asmbPLS-DA is the second-best among all the methods. Figure S2 shows the corresponding sensitivity and specificity of different blocks, and a similar trend to Figure S1 can be found.


Classification Performance
The classification performance of all the methods was measured in terms of overall accuracy with models fitted on the training set and then applied to the test set.
For the binary outcome, Figure S3 compares the different asmbPLS-DA methods. In scenarios with structure = cor and p = 50, 200, RF using features selected by asmbPLS-DA performs the best, followed by the original asmbPLS-DA methods with different decision rules and vote functions. The performance of asmbPLS-DA with LDA is good in low noise but decreases rapidly with increased noise, as seen in scenarios with structure = random and inverse and p = 50, 200. Furthermore, LDA and RF using the super score of asmbPLS-DA show relatively lower accuracy than other asmbPLS-DA methods in most scenarios. For scenarios with structure = order, the original asmbPLS-DA methods show the best performance overall. Additionally, for p = 1000, the original asmbPLS-DA performs the best.
Based on the comparison among the asmbPLS-DA methods, we have selected a subset of asmbPLS-DA methods to compare with other methods for simplicity ( Figure 3). The selected methods are asmbPLS-DA using fixed cutoff, RF, and LDA using features selected by asmbPLS-DA. In scenarios with structure = cor, sPLS-DA and DIABLO perform the best, followed by L1-regularized logistic regression and IPF-Lasso. RF using features selected by asmbPLS-DA gradually outperforms L1-regularized logistic regression and IPF-Lasso with increased noise. In scenarios with structure = order, random, and inverse, asmbPLS-DA methods perform worse than sPLS-DA, IPF-Lasso, and L1-regularized logistic regression with low noise, but gradually improve with increased noise and show higher or similar accuracy compared to other methods. DIABLO performs the worst in these scenarios.
For the multiclass outcome, the comparison among all asmbPLS-DA methods shows that RF using features selected by asmbPLS-DA performs the best in scenarios with structure = cor ( Figure S4). For structure = order, random, and inverse, LDA using the super scores shows overall better performance than the other asmbPLS-DA methods. In addition, the performance of asmbPLS-DA with Max Y is good in scenarios with p = 1000. Similarly, we select RF using features selected by asmbPLS-DA, LDA using super scores of asmbPLS-DA and asmbPLS-DA with Max Y for comparison with other methods for simplicity.
According to Figure 4, asmbPLS-DA methods generally perform worse than L1-regularized multinomial logistic regression and sPLS-DA, except for RF using asmbPLS-DA-selected features, which shows close-to-best performance in scenarios with structure = cor.
Table 2 displays the computation time for each method in different settings. For the binary outcome, L1-regularized logistic regression is the fastest, followed by IPF-Lasso, asmbPLS-DA, and then sPLS-DA. For the multiclass outcome, L1-regularized multinomial logistic regression is the fastest, followed by sPLS-DA and asmbPLS-DA. DIABLO is the slowest among all the methods and is therefore not implemented for the real data applications.

Application to TCGA Breast Cancer Data
We evaluated the performance of asmbPLS-DA and the other competitive methods using the TCGA breast cancer dataset, which included two types of omics data: gene expression and miRNA expression. The dataset consisted of 949 samples, including 854 tumor samples and 95 normal samples. The data were downloaded from https://portal.gdc.cancer.gov/repository. We used the tissue type (tumor vs. normal) as the binary outcome to assess the methods' performance. Additionally, we removed samples without pathologic stage information from the 854 tumor samples, resulting in 127 samples with stage I breast cancer, 491 samples with stage II breast cancer, and 221 samples with stage III or stage IV breast cancer. We used the pathologic stage as the multiclass outcome (stage I vs. stage II vs. stage III + stage IV) for evaluation. The pre-processing procedures, including outlier removal, data normalization, and filtering, were applied to the two omics datasets, resulting in a final dataset of 45,371 genes and 1410 miRNAs, with a feature ratio of approximately 32:1. For the pathologic stage outcome, we also included one more omics data type, i.e., protein expression, to further test the performance of the methods. After removing samples and features with missing data, we obtained 45,371 genes and 1410 miRNAs, as well as 457 proteins, for 822 tumor samples. Of these samples, 125 were in the stage I group, 480 were in the stage II group, and 217 were in the stage III or stage IV group.
We included three PLS components for the PLS-based methods and performed 5-fold cross-validation with 10 repetitions for asmbPLS-DA, sPLS-DA, and IPF-Lasso for parameter tuning. DIABLO was not included due to its long computation time. First, we used tissue type as the outcome to test the performance of the different methods on the binary outcome. For asmbPLS-DA methods, for simplicity, we only used asmbPLS-DA with the decision rule of fixed cutoff for both feature selection and classification. Additionally, we utilized the features selected by asmbPLS-DA in both LDA and RF for the purpose of classification.
Regarding feature selection, forty-six genes and eight miRNAs were selected by asmbPLS-DA; thirty genes and zero miRNAs were selected by sPLS-DA; thirty-six genes and zero miRNAs were selected by L1-regularized logistic regression; twenty-six genes and fifteen miRNAs were selected by IPF-Lasso. Since sPLS-DA and L1-regularized logistic regression do not consider group structure, no miRNA was collected by these methods, which is expected given the relatively small number of miRNAs compared to genes. In addition, due to the limitation of the Lasso method, the features selected by L1-regularized logistic regression and IPF-Lasso show relatively lower correlations compared to the features selected by asmbPLS-DA and sPLS-DA ( Figure 5), indicating the effectiveness of PLS-based methods for selecting features with strong correlation.  Table 3 lists the top five features with the largest absolute coefficients selected by each method for different blocks, along with corresponding references supporting the association between the features and breast cancer. Interestingly, asmbPLS-DA and sPLS-DA share the top list for the gene block, with VEGF-D, ADAMTS5, ABCA10, ADAM33, and PAMR1 being identified as the most important genes for breast cancer, which have been previously reported in the literature [20][21][22][23][24][25]. However, the top genes selected by L1-regularized logistic regression and IPF-Lasso differed from those identified by asmbPLS-DA and sPLS-DA, and not all of them were found to be related to breast cancer, with limited supporting literature. As for miRNAs, all of the miRNAs selected by asmbPLS-DA have been found to be associated with breast cancer in previous studies [26][27][28][29]. 
In contrast, the top miRNAs selected by IPF-Lasso, such as miR-6883 and miR-5191, were associated with colorectal cancer [30,31], and there was no literature supporting the effects of miR-6717 and miR-548o-2 on breast cancer. As expected, neither sPLS-DA nor L1-regularized logistic regression selected any miRNAs for breast cancer. Table 3. The top five features selected by different methods for the gene and the miRNA blocks, respectively, using tissue type as the outcome. The symbol * indicates evidence of an association between the feature and a type of cancer other than breast cancer.

Table 4 lists the classification results for the different methods, with tissue type as the outcome. For asmbPLS-DA, the quantile combination selected from the CV was used for model fitting and classification. All the methods demonstrated satisfactory classification performance in terms of overall accuracy. However, sPLS-DA exhibited the worst performance, despite performing nearly the best on the simulation datasets. Notably, 28 samples were classified as NA by sPLS-DA, which may be attributed to the unbalanced groups or the excess zero values in the real data that the corresponding R function cannot handle. Among all the methods, asmbPLS-DA with RF showed the highest overall accuracy, while asmbPLS-DA with LDA exhibited the best balanced accuracy. L1-regularized logistic regression and IPF-Lasso exhibited high recall for the group with the larger sample size, but their performance was weaker for the smaller group. To further evaluate the methods, we used the pathologic stage as a multiclass outcome with more balanced group sizes of 127, 491, and 227 samples in stage I, stage II, and stage III + stage IV, respectively. Similarly, we applied asmbPLS-DA with the Max Y decision rule for both feature selection and classification and used the selected features in LDA and RF for classification.
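Overall accuracy and balanced accuracy can disagree sharply on unbalanced data such as this tumor/normal split. A minimal Python illustration of the two metrics (not part of asmbPLS; the toy labels are invented for the example):

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of all samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so small groups weigh as much as large ones."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# A classifier that always predicts the majority class looks good on
# overall accuracy (0.9 here) but poor on balanced accuracy (0.5 here).
y_true = ["tumor"] * 9 + ["normal"]
y_pred = ["tumor"] * 10
```

This is why methods with high recall only on the larger group can still rank poorly on balanced accuracy.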
The results showed that asmbPLS-DA selected forty-six genes and fifteen miRNAs, sPLS-DA selected one hundred genes but no miRNA, and L1-regularized multinomial logistic regression selected forty-seven genes and one miRNA. The top five features from both blocks selected by each method are listed in Table 5. In contrast to the binary outcome results, asmbPLS-DA and sPLS-DA shared only one gene in the top list, i.e., DIRAS1, which has decreased expression in most breast cancers [38]. Among the other four genes selected by asmbPLS-DA, STOM, MMGT1, and DIAPH3 are associated with breast cancer progression [39][40][41], while FAM72B is related to lung adenocarcinoma [42]. In contrast, sPLS-DA selected RHBDF2 and CCL14, which are related to breast cancer [43,44], while ERG and FASTKD1 are associated with other cancers [45,46]. The associations between the genes selected by L1-regularized multinomial logistic regression and breast cancer are not supported by the available literature. Regarding the miRNAs selected by asmbPLS-DA, miR-5047 regulates the proliferation and migration of breast cancer cells [47], miR-6744 increases anoikis in breast cancer [48], and miR-4634 can be used to detect breast cancer in the early stages [49]. The only miRNA selected by L1-regularized multinomial logistic regression, i.e., miR-548ag-1, has no reported association with breast cancer. Similarly, no miRNA was selected by sPLS-DA.
The classification results using the pathologic stage as the outcome are listed in Table 6. L1-regularized multinomial logistic regression performed the best in terms of overall accuracy, followed by asmbPLS-DA with RF, asmbPLS-DA with LDA, sPLS-DA, and asmbPLS-DA with Max Y. In terms of balanced accuracy, asmbPLS-DA with LDA showed the highest value, followed by asmbPLS-DA with Max Y, asmbPLS-DA with RF, L1-regularized multinomial logistic regression, and sPLS-DA. However, all the classification methods performed poorly in classifying the smaller groups, with L1-regularized multinomial logistic regression and asmbPLS-DA with RF assigning nearly all the samples to the stage II group. This observation is consistent with the results obtained from the real data analysis using the binary outcome. To test the performance of the methods with more than two blocks, we added protein expression data to the analysis for the pathologic stage outcome. asmbPLS-DA identified twenty-eight genes, eight miRNAs, and fourteen proteins as the most important features. sPLS-DA identified seventy genes as the most important features but did not select any miRNAs or proteins. L1-regularized multinomial logistic regression selected sixty genes, two miRNAs, and four proteins. Notably, asmbPLS-DA selected nearly the same top genes and miRNAs as when using only two omics datasets (Table 7). However, sPLS-DA identified two other genes, CLEC14A and LDB2, which have been associated with breast cancer [50,51]. The genes selected by L1-regularized multinomial logistic regression did not have strong evidence supporting their relevance to breast cancer. The top miRNAs selected were similar to those from the previous results. For the protein block, the genes corresponding to the top five proteins selected by asmbPLS-DA are all associated with breast cancer [52][53][54][55][56], while two out of the three proteins selected by L1-regularized multinomial logistic regression are related to breast cancer [57,58].
However, including the protein block did not significantly improve the classification performance of any of the methods (Table S1).
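For readers unfamiliar with PLS-DA decision rules: the outcome groups are dummy-coded into indicator columns, and a rule maps each predicted indicator vector to a class. A hedged Python sketch of the two rules discussed above (the class labels and cutoff value are illustrative choices, not the package defaults):

```python
def max_y_rule(y_hat_row, classes):
    """Multiclass rule: pick the class whose predicted indicator is largest."""
    j = max(range(len(classes)), key=lambda i: y_hat_row[i])
    return classes[j]

def fixed_cutoff_rule(y_hat, cutoff=0.5, classes=("normal", "tumor")):
    """Binary rule: predict the second class when the prediction exceeds the cutoff."""
    return classes[1] if y_hat > cutoff else classes[0]
```

For example, a predicted indicator vector of (0.1, 0.7, 0.2) over the three stage groups would be assigned to stage II under Max Y.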

Conclusions
In this paper, we extended our previous work, asmbPLS, to a new method called asmbPLS-DA, which handles both binary and multiclass categorical outcomes. The aim of asmbPLS-DA is to identify the most predictive features in diverse types of omics data and then use the selected features for classification, thereby improving our understanding of the molecular mechanisms behind diseases. To achieve this goal, we included different decision rules and several vote functions, and we additionally combined asmbPLS-DA with other classification algorithms for further testing. The proposed method is implemented in the R package asmbPLS, available on our GitHub (https://github.com/RunzhiZ/asmbPLS), and will soon be uploaded to CRAN.
Simulation studies have shown that asmbPLS-DA achieves very satisfactory feature-selection performance, especially for multiclass outcomes and for cases where the different omics blocks vary greatly in their numbers of features. In methods that do not consider the group structure, such as sPLS-DA, L1-regularized logistic regression, and L1-regularized multinomial logistic regression, feature selection can be dominated by the block with the greatest number of features, ignoring information from the smaller blocks. By contrast, asmbPLS-DA, by design, can select predictive features from every block included, even when those features are highly correlated. Its feature-selection performance is stable regardless of the scenario setting. In the real data, the top features selected by asmbPLS-DA have been validated by the literature, further indicating the power of feature selection by asmbPLS-DA.
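The block-wise behavior described above can be illustrated with a toy sketch: each block's weight vector is thresholded against its own quantile, so a small block (e.g., miRNAs) keeps its own top features instead of competing with a large gene block. The thresholding below is a simplified stand-in for the package's actual sparsity step, with invented weights purely for illustration.

```python
def quantile_threshold(weights, q):
    """Zero out weights whose magnitude falls below the block's q-th quantile."""
    mags = sorted(abs(w) for w in weights)
    cut = mags[int(q * (len(mags) - 1))]  # simple quantile by sorted index
    return [w if abs(w) >= cut else 0.0 for w in weights]

# Each block is thresholded separately with its own quantile, so the
# miRNA block retains features even though its weights are smaller
# in magnitude and fewer in number than the gene block's.
gene_w = [0.9, -0.8, 0.05, 0.01, 0.02, -0.03, 0.04, 0.0]
mirna_w = [0.3, -0.02, 0.25]
sparse_gene = quantile_threshold(gene_w, 0.75)
sparse_mirna = quantile_threshold(mirna_w, 0.5)
```

Had both blocks been pooled and thresholded jointly, every miRNA weight except one would fall below the genes' cutoff and be discarded.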
In the simulation studies, asmbPLS-DA alone performed worse overall in classification than the other methods, except DIABLO. However, RF using the features selected by asmbPLS-DA performed similarly to the best method in the scenarios where all the relevant features are highly correlated, which is quite common for multi-omics data, indicating the significance of the selected features. Moreover, as the noise increases, the performance of the asmbPLS-DA methods approaches that of the best methods. In the real data, we found that asmbPLS-DA with LDA showed the best classification when considering both overall accuracy and balanced accuracy, while asmbPLS-DA with RF performed well in terms of overall accuracy only.
The current version of asmbPLS-DA does not handle missing values. Additionally, if there are no relevant features in the predictor blocks, classification can be problematic, since asmbPLS-DA always selects some features for each block, resulting in non-relevant features being selected.
In summary, we have shown that asmbPLS-DA is a useful tool for multi-omics analysis. We believe that asmbPLS-DA provides a valuable alternative for both feature selection and classification in binary and multiclass problems.
Supplementary Materials: The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/genes14050961/s1, Figure S1. The sensitivity and specificity of different methods for different blocks for simulation data with the binary outcome. Figure S2. The sensitivity and specificity of different methods for different blocks for simulation data with the multiclass outcome. Figure S3. The classification results for simulation data with the binary outcome among asmbPLS-DA methods. Figure S4. The classification results for simulation data with the multiclass outcome among asmbPLS-DA methods. Table S1. Comparison of the classification performance for different methods using the real data with an additional protein block included and the multiclass outcome.
Author Contributions: S.D. and R.Z. designed the study, S.D. provided theoretical support when required, R.Z. implemented the simulation and the analyses, and R.Z. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This work was partially supported by NIH grant 1UL1TR000064 from the Center for Scientific Review.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The real data were obtained from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/repository) (accessed on 9 January 2023), where the cancer datasets have been gathered by the TCGA research network. We downloaded the gene expression, miRNA, and protein expression datasets, as well as the clinical data, from the TCGA-BRCA project.