Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data

Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression for such scenarios, covering variable screening, model selection, order selection for response categories, and variable selection. We apply our procedure to high-dimensional gene expression data with 801 patients, 2426 genes, and five types of cancerous tumors. As a result, we recommend three finalized models: one with 74 genes achieves extremely low cross-entropy loss and a zero predictive error rate based on a five-fold cross-validation; the other two models, with 31 and 4 genes, respectively, are recommended for prognostic multi-gene signatures.


Introduction
High-dimensional data is a dataset in which the number of covariates (d) is much larger than the number of observations (n), often written as d ≫ n. It arises in many scientific fields, especially the biological sciences [1][2][3], where such datasets are difficult to analyze using classical statistical tools such as linear, nonlinear, or generalized linear regression models. When the covariates have a high percentage of zeros, it is necessary to use zero-inflated models or hurdle models to address this sparsity [4][5][6]. The analysis becomes even more challenging due to the skewness of the distributions [7].
High-dimensional sparse data is wide-ranging. It arises in many different disciplines such as genomics [8], mobile app usage logs [2], and energy technologies [9]. Omics data such as microbiome [10] and gene expression data [11] are typically high-dimensional and sparse.
As a motivating example, the RNA-seq gene expression data discussed in [1] consists of n = 801 tissue samples with 20,531 genes. Among them, 2426 genes carry a substantial proportion of zeros, between 5% and 50%. The goal of their analysis was to predict the response category, which is one of five different types of cancerous tumors, namely BRCA, COAD, KIRC, LUAD, and PRAD, based on the 2426 sparse genes. A variable selection criterion was proposed by [1] to rank the 2426 sparse genes for predicting the type of tumor, and the top 50 genes were recommended with a 1-nearest neighbor classifier, which is highly robust for high-dimensional classifications [12].
Unlike statistical methods, the 1-nearest neighbor classifier, as well as many other machine learning techniques, is a deterministic classification approach, which assigns a single predictive label to each given subject or individual. Statistical methods, by contrast, typically produce a distribution over the possible labels, called a stochastic classification [13], which assigns a probability to each possible response category. In general, a stochastic classification result provides more information than a deterministic one, especially for mixed cases or cases close to a boundary between classes.
In this paper, we consider statistical models and stochastic classifications for high-dimensional sparse data with categorical responses. If the response is binary, generalized linear models [14,15] have been widely used in practice. When the response has three or more categories, such as the five tumor types in the motivating example of gene expression data, we consider the multinomial logistic models [16][17][18], which include four commonly used logit models: baseline-category, cumulative, adjacent-categories, and continuation-ratio (please see [18] for a good review and the references therein).

Materials and Methods
In this section, we introduce the multinomial logistic models and order selection for categorical responses, a variable screening technique for sparse covariates, and variable selection techniques.

Multinomial Logistic Models
In this paper, we suppose that the original data take the form {(x_i, y_i), i = 1, ..., n} with covariates x_i = (x_{i1}, ..., x_{id})^T ∈ R^d, d ≥ 1, and categorical responses y_i ∈ {1, ..., J}, J ≥ 2. We further assume that there are only m distinct x_i vectors, and denote them as x_1, ..., x_m, m ≤ n, for simplicity. For applications with discrete covariates only, it is often the case that m ≪ n. Following the notations in [18], we consider the data in its summarized form {(x_i, Y_i), i = 1, ..., m}, where Y_i = (Y_{i1}, ..., Y_{iJ})^T and Y_{ij} = #{k | x_k = x_i and y_k = j}, the number of observations with covariates x_i and response category j. We denote n_i = ∑_{j=1}^J Y_{ij}, the number of observations with covariate x_i, i = 1, ..., m. Following [18,19], a multinomial logistic model assumes that (i) Y_i follows a multinomial distribution with number n_i and category probabilities π_{i1}, ..., π_{iJ} independently for different i, where ∑_{j=1}^J π_{ij} = 1; and (ii) the category probabilities π_{ij} are linked to functions of the covariates in one of the four following ways:

log(π_{ij} / π_{iJ}) = h_j^T(x_i) β_j + h_c^T(x_i) ζ,   (1)

log((π_{i1} + ··· + π_{ij}) / (π_{i,j+1} + ··· + π_{iJ})) = h_j^T(x_i) β_j + h_c^T(x_i) ζ,   (2)

log(π_{ij} / π_{i,j+1}) = h_j^T(x_i) β_j + h_c^T(x_i) ζ,   (3)

log(π_{ij} / (π_{i,j+1} + ··· + π_{iJ})) = h_j^T(x_i) β_j + h_c^T(x_i) ζ.   (4)

Here h_j^T(·) = (h_{j1}(·), ..., h_{jp_j}(·)) and h_c^T(·) = (h_1(·), ..., h_{p_c}(·)) are known predictor functions; β_j = (β_{j1}, ..., β_{jp_j})^T and ζ = (ζ_1, ..., ζ_{p_c})^T are unknown regression parameters, i = 1, ..., m, j = 1, ..., J − 1. It should be noted that when J = 2, all four logit models, namely baseline-category (1), cumulative (2), adjacent-categories (3), and continuation-ratio (4), are equivalent to the logistic regression model for binary responses.
In this paper, we consider two special classes of multinomial logistic models: proportional odds (po) models, which assume h_j^T(x_i) ≡ 1, i.e., the same parameters across categories except for the intercepts; and nonproportional odds (npo) models, which assume h_c^T(x_i) ≡ 0, i.e., different parameters across categories. The four logit links (1), (2), (3), and (4), combined with either the po or npo assumption, lead to eight different models. For example, a model is called an adjacent-categories po model if it is an adjacent-categories logit model (3) with the po assumption. In practice, people often adopt h_c(x_i) = x_i for po models and h_j(x_i) = (1, x_i^T)^T for npo models, also known as main-effects models. More general models and examples can be found in [18].
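To make the four links concrete, the following sketch converts a vector of linear predictors η_1, ..., η_{J−1} into category probabilities π_1, ..., π_J under each link. It is illustrative only (the paper's analysis is carried out in R with VGAM; the function names here are ours), but it makes the mapping from links (1)-(4) to probabilities explicit.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax of a 1-D array."""
    v = v - np.max(v)
    e = np.exp(v)
    return e / e.sum()

def category_probs(eta, link):
    """Map linear predictors eta (length J-1) to category probabilities (length J).

    link: 'baseline' (1), 'cumulative' (2), 'acat' (3), or 'cratio' (4).
    For the cumulative link, eta must be nondecreasing so that the
    cumulative probabilities P(Y <= j) are nondecreasing in j.
    """
    eta = np.asarray(eta, dtype=float)
    J = eta.size + 1
    if link == 'baseline':            # log(pi_j / pi_J) = eta_j
        return softmax(np.append(eta, 0.0))
    if link == 'acat':                # log(pi_j / pi_{j+1}) = eta_j
        # log pi_j = eta_j + ... + eta_{J-1} + const, log pi_J = const
        rev_cumsum = np.cumsum(eta[::-1])[::-1]
        return softmax(np.append(rev_cumsum, 0.0))
    if link == 'cumulative':          # logit P(Y <= j) = eta_j
        gamma = 1.0 / (1.0 + np.exp(-eta))          # P(Y <= j)
        gamma = np.concatenate([[0.0], gamma, [1.0]])
        return np.diff(gamma)
    if link == 'cratio':              # log(pi_j / (pi_{j+1} + ... + pi_J)) = eta_j
        delta = 1.0 / (1.0 + np.exp(-eta))          # P(Y = j | Y >= j)
        probs = np.empty(J)
        remaining = 1.0
        for j in range(J - 1):
            probs[j] = remaining * delta[j]
            remaining *= 1.0 - delta[j]
        probs[J - 1] = remaining
        return probs
    raise ValueError(f"unknown link: {link}")
```

In particular, with J = 2 (a single predictor η_1) all four links return the same pair of probabilities, which mirrors the equivalence to binary logistic regression noted above.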

The Most Appropriate Order for Categorical Responses
In the statistical literature, the baseline-category logit model (1) is also known as a multiclass logistic regression model (see, for example, [20]). It is commonly used for nominal responses, i.e., the response categories do not have a natural ordering [21]. In model (1), the Jth category is regarded as the baseline category. The choice of the baseline category is known to be irrelevant for prediction purposes for npo models. Nevertheless, as pointed out by [19], the baseline category needs to be carefully chosen for po models, since it may improve the prediction accuracy significantly. We adopt the Akaike information criterion (AIC, [20,22]) for choosing the most appropriate baseline category for baseline-category po models.
Models (2), (3), and (4) are typically used for ordinal or hierarchical responses, which assume either a natural ordering or a hierarchical structure among the response categories. Surprisingly, according to [19], even for responses whose categories do not have a natural or known order, one may still use AIC or Bayesian information criterion (BIC, see also [20]) to choose the most appropriate order, called a working order, for the response categories. Then models (2), (3), and (4) can also be used for nominal responses with the working order, which may significantly improve the prediction accuracy.
In this paper, we use AIC to choose the most appropriate (working) order for the five tumor types in the motivating example, so that all four logit models can be applied.
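The working-order search can be sketched as follows. This is a minimal toy implementation of our own (the paper's fits use R/VGAM): it fits a main-effects adjacent-categories po model by maximum likelihood for every permutation of the category labels and keeps the order with the smallest AIC.

```python
import itertools
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def acat_po_nll(params, X, y, J):
    """Negative log-likelihood of an adjacent-categories po model:
    log(pi_{i,j} / pi_{i,j+1}) = beta_j + x_i^T zeta."""
    beta, zeta = params[:J - 1], params[J - 1:]
    eta = beta[None, :] + (X @ zeta)[:, None]              # n x (J-1)
    # log pi_{ij} = eta_j + ... + eta_{J-1} + const, log pi_{iJ} = const
    logits = np.hstack([np.cumsum(eta[:, ::-1], axis=1)[:, ::-1],
                        np.zeros((X.shape[0], 1))])
    logp = logits - logsumexp(logits, axis=1, keepdims=True)
    return -logp[np.arange(y.size), y].sum()

def aic_for_order(X, y, order):
    """Relabel the categories according to `order` and return the fitted AIC."""
    J = len(order)
    pos = np.empty(J, dtype=int)
    pos[list(order)] = np.arange(J)                        # category -> new position
    y_perm = pos[y]
    p0 = np.zeros(J - 1 + X.shape[1])
    res = minimize(acat_po_nll, p0, args=(X, y_perm, J), method='BFGS')
    return 2 * p0.size + 2 * res.fun                       # AIC = 2k - 2 loglik

def best_working_order(X, y, J):
    """Enumerate all J! orders and return the AIC-minimizing one."""
    aics = {o: aic_for_order(X, y, o) for o in itertools.permutations(range(J))}
    return min(aics, key=aics.get), aics
```

As the theory predicts, an order and its reversal attain the same AIC under this model, so in practice only half of the J! orders need to be fitted.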

Sparse Variable Screening Using the AZIAD Package
To fit a multinomial logistic model and obtain the parameter estimates with the corresponding confidence intervals, we need the number m of distinct x_i vectors to be large enough to keep the corresponding Fisher information matrix positive definite. According to [18], the smallest possible m is p_c + 1 for po models or max{p_1, ..., p_{J−1}} for npo models. If we consider main-effects models only, then m ≥ d + 1 is required for both po and npo models. In other words, we need to reduce the number d of covariates to m − 1 or fewer before fitting a multinomial logistic model.
For the motivating example, following [1], we focus on the 2426 sparse genes. In this case, n = m = 801, since the gene expression levels are continuous. To reduce the number of genes below 800, we first rank all of these genes based on the AIC differences of the significance test proposed by [1], which involves model selection from a list of candidate distributions based on the p-values of Kolmogorov–Smirnov (KS) tests using the R package AZIAD [23]. More specifically, for each gene, we consider two probabilistic models. In Model I, we assume that all 801 gene expression levels follow the same probability distribution chosen from a candidate set; in Model II, we divide the 801 values into five groups according to their response labels and assume that each group follows a distinct distribution. We fit both models and denote the corresponding AIC values as AIC(I) and AIC(II), respectively. According to [1], a bigger difference between AIC(I) and AIC(II) indicates that the gene is more informative for predicting the response labels.
Since the gene expression levels are continuous and non-negative, we consider a candidate set of 12 probability distributions: normal (or Gaussian), zero-inflated normal, normal hurdle, half-normal, zero-inflated half-normal, half-normal hurdle, log-normal, zero-inflated log-normal, log-normal hurdle, exponential, zero-inflated exponential, and exponential hurdle (please see [7] for a complete list of the distributions available in AZIAD). As a result, we obtain a ranking of the 2426 sparse genes, ordered from the largest AIC difference to the smallest. We then use a five-fold cross-validation to select the best number of genes for fitting multinomial logistic models (see Section 2.4).
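The per-gene AIC difference underlying the screening can be sketched as follows. For simplicity, this illustration uses a single Gaussian candidate distribution rather than the full AZIAD candidate set of 12 distributions described above (a simplification of ours, not the paper's procedure).

```python
import numpy as np

def gaussian_aic(x):
    """AIC of a Gaussian maximum-likelihood fit (2 parameters: mean, variance)."""
    n = x.size
    var = np.var(x)                                  # MLE of the variance
    nll = 0.5 * n * (np.log(2 * np.pi * var) + 1.0)  # NLL at the MLE
    return 2 * 2 + 2 * nll

def aic_difference(expr, labels):
    """AIC(I) - AIC(II) for one gene: Model I fits a single distribution to all
    samples; Model II fits one distribution per response group.  A larger
    difference suggests the gene is more informative about the response."""
    aic_one = gaussian_aic(expr)
    aic_group = sum(gaussian_aic(expr[labels == g]) for g in np.unique(labels))
    return aic_one - aic_group
```

Genes are then ranked by `aic_difference` from largest to smallest; the actual procedure additionally selects the best-fitting distribution per gene via KS-test p-values before comparing AIC values.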

Best Number of Covariates for Categorical Regression Analysis
In [1], a five-fold cross-validation with a 1-nearest neighbor classifier was used for choosing the best number of genes. In other words, a predictive error count or error rate was used as the criterion for variable selection.
When a stochastic classification is available, the cross-entropy loss, which measures the difference between the predictive probabilities based on the fitted model and the observed class labels, is more commonly used as a criterion and is more sensitive for variable selection or model selection [20].
In this paper, we choose the number of ranked covariates that minimizes the cross-entropy loss based on a five-fold cross-validation. More specifically, given a specified regression model and the top t covariates under consideration, (1) we randomly divide the original n observations (not the summarized data) into five data blocks of roughly the same size, so that for each i = 1, ..., n, the ith observation belongs to one and only one of the five data blocks, denoted as the k(i)th block; (2) for each k = 1, ..., 5, we fit the regression model with the top t covariates based on the four data blocks other than the kth one, denoted as the kth fitted model; (3) for each i = 1, ..., n, we predict π_{ij}, j = 1, ..., J based on the k(i)th fitted model and denote the predictive probability as π̂_{ij}^{k(i)}; and (4) the (average) cross-entropy loss for the specified model and the given number t is defined as

CE(t) = −(1/n) ∑_{i=1}^n ∑_{j=1}^J 1(y_i = j) log π̂_{ij}^{k(i)}.

For the specified model, we choose the t that minimizes CE(t).
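Steps (1)-(4) can be sketched as follows, using scikit-learn's multiclass logistic regression as a stand-in for the multinomial logistic models fitted in R (the function and variable names are ours, and the columns of X are assumed to be pre-sorted by the screening rank):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_entropy_cv(X, y, t, n_splits=5, seed=0):
    """Average five-fold cross-entropy loss CE(t) using the top-t columns of X."""
    Xt = X[:, :t]
    ce = np.zeros(len(y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(Xt):
        model = LogisticRegression(max_iter=1000).fit(Xt[train], y[train])
        proba = model.predict_proba(Xt[test])
        # pick out the predicted probability of each observed label
        idx = np.searchsorted(model.classes_, y[test])
        ce[test] = -np.log(np.clip(proba[np.arange(len(test)), idx], 1e-300, None))
    return ce.mean()

def best_t(X, y, candidates):
    """Choose the number t of top-ranked covariates minimizing CE(t)."""
    losses = {t: cross_entropy_cv(X, y, t) for t in candidates}
    return min(losses, key=losses.get), losses
```

Because CE(t) penalizes overconfident wrong predictions heavily, it discriminates between candidate t values even when the predictive error counts are tied, which is the sensitivity advantage mentioned above.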

Backward Variable Selection Based on AIC
Backward variable selection aims to identify a good (if not the best) subset of variables by iteratively removing the least significant predictor from a statistical model. Typically, AIC is used as the criterion for eliminating a variable (see, for example, Section 3.3 in [20]). In our case, (1) we fit a specified multinomial logistic model with the top t covariates obtained in Section 2.4 and record its AIC value as AIC(t); (2) we initialize r = t to be the number of covariates in the model and perform the following steps while r ≥ 1; (3) for each s = 1, ..., r, we fit the regression model with the sth covariate removed and denote the corresponding AIC value as AIC(r − 1, s); (4) we let AIC(r − 1) = min_s AIC(r − 1, s), the smallest AIC value with r − 1 covariates; and (5) if AIC(r − 1) < AIC(r), we remove the covariate that attains AIC(r − 1) and go back to step (3) with r replaced by r − 1; otherwise, we stop and report the model with the current set of r covariates.
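The steps above can be sketched as follows. To keep the example self-contained, it uses a minimal baseline-category npo fitter of our own (maximum likelihood via numerical optimization) rather than the R/VGAM machinery used in the actual analysis.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def multinom_aic(X, y, J):
    """AIC of a baseline-category npo (multiclass logistic) model with intercept."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    k = (J - 1) * X1.shape[1]                      # number of free parameters

    def nll(params):
        B = params.reshape(J - 1, X1.shape[1])
        logits = np.hstack([X1 @ B.T, np.zeros((X1.shape[0], 1))])
        logp = logits - logsumexp(logits, axis=1, keepdims=True)
        return -logp[np.arange(y.size), y].sum()

    res = minimize(nll, np.zeros(k), method='BFGS')
    return 2 * k + 2 * res.fun                     # AIC = 2k - 2 loglik

def backward_select(X, y, J):
    """Repeatedly drop the covariate whose removal gives the smallest AIC,
    as long as that improves on the current AIC (steps (1)-(5))."""
    active = list(range(X.shape[1]))
    current = multinom_aic(X[:, active], y, J)
    while active:
        trials = [(multinom_aic(X[:, [c for c in active if c != s]], y, J), s)
                  for s in active]
        best_aic, drop = min(trials)
        if best_aic >= current:                    # no removal improves AIC: stop
            break
        active.remove(drop)
        current = best_aic
    return active, current
```

Each pass refits the model r times, so the procedure costs O(t^2) model fits in the worst case; this is affordable here because the starting t is modest after screening.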
It should be noted that there are other strategies for selecting a subset of covariates, such as forward selection or stepwise forward-and-backward selection. We refer the readers to [20] for more options.
Following [1], we consider only the 2426 genes whose expression levels have a proportion of zeros between 5% and 50%, which is still high-dimensional. Since the gene expression levels are continuous, all 801 x_i vectors are distinct, and thus n = m = 801 in our notations (see also Section 2.1). Applying the significance test described in [1] to each of these genes (see also Section 2.3), we obtain a ranking of the 2426 genes in terms of their relevance to the five response categories. Based on a five-fold cross-validation with the 1-nearest neighbor classifier, the top 50 genes were recommended by [1] for predicting the response categories. The 1-nearest neighbor classifier based on the selected 50 genes achieves a zero predictive error rate in the five-fold cross-validation.
In this paper, we use this data to illustrate how to use the proposed model selection procedure to build up the most appropriate regression model for analyzing a high-dimensional sparse data with categorical responses.

Data Analysis and Results
Through this comprehensive analysis on the RNA-seq gene expression data, we aim to identify the most appropriate regression model with selected genes that would provide valuable insights into the relationship between the gene expression levels and the type of cancers.

Model Selection and Variable Selection for Sparse Genes
We first use the R function vglm in the package VGAM [25,26] to fit eight candidate multinomial logistic main-effects models (see Section 2.1), given different numbers of selected genes: the baseline-category po model (nompo), the baseline-category npo model (nomnpo, also known as the multiclass logistic regression model), the cumulative po model (cumpo), the cumulative npo model (cumnpo), the continuation-ratio po model (crpo), the continuation-ratio npo model (crnpo), the adjacent-categories po model (acpo), and the adjacent-categories npo model (acnpo). The R codes for the top 80 ranked sparse genes are provided below, where yj stands for the vector (Y_{1j}, ..., Y_{nj})^T, j = 1, ..., 5.

Order Selection for Response Categories
At this stage of the proposed analysis, we keep three candidate models under consideration: the adjacent-categories po model, the adjacent-categories npo model, and the baseline-category npo model. According to [19], the order of the response categories matters only for the adjacent-categories po model (see also Section 2.2).
To find the most appropriate order of response categories for the adjacent-categories po model in this case, we explore all 5! = 120 different orders of y1, y2, y3, y4, and y5. For each order, we fit the adjacent-categories po model with the top 80 genes and record the corresponding AIC value. The best order, which achieves the smallest AIC value, is (y2, y3, y4, y1, y5), corresponding to (COAD, KIRC, LUAD, BRCA, PRAD), with an AIC value of 175.81, while the AIC value under the original order is 309.52. The improvement is significant [27]. It should be noted that the reverse of the best order achieves the same AIC as well [19].

Backward Variable Selected Models
In this section, we apply the backward variable selection procedure (see Section 2.5) to the three candidate models, starting with the top t = 80 genes.
We start with the adjacent-categories po model with order (y2, y3, y4, y1, y5) and AIC = 176. After backward variable selection, we end with 31 genes and an AIC value of 90. As for the adjacent-categories npo model, we start with 80 genes and an AIC value of 648 and end with 74 genes and an AIC value of 600. We also run the backward variable selection for the baseline-category npo model (i.e., the multiclass logistic regression model); its 80 genes are reduced to only 4 genes, with the AIC value dropping from 648 to 52. The backward variable selection improves all three models significantly in terms of AIC values [27].
In summary, after the backward variable selection, our three candidate models are reduced to (1) an adjacent-categories po model with 31 genes; (2) an adjacent-categories npo model with 74 genes; and (3) a baseline-category npo model with 4 genes. It should be noted that two genes, gene_7965 and gene_9176, appear in all three final models.

Final Models
We further evaluate the performance of the three final models obtained in Section 4.3, whose cross-entropy losses and predictive error counts/rates based on a five-fold cross-validation are provided in Table 2. In terms of prediction accuracy, the adjacent-categories npo model with 74 genes is clearly the winner, since it achieves the lowest cross-entropy loss of 2.2 × 10^−9 and an error count of 0. For the readers' reference, we provide the adjacent-categories npo model with 74 genes fitted on all 801 samples in Appendix C. Nevertheless, if one aims to find a prognostic multi-gene signature (see Section 4.5 for more details) for predicting the risk of the five types of cancers, the remaining two models have their value as well, since they require information from a much smaller number of genes. As a comparison, all three final models are much more accurate than the best model based on the top 50 genes recommended by [1], which is an adjacent-categories npo model with a five-fold cross-entropy loss of 0.14 and an error rate of 0.0087 (see Table 1).

The final adjacent-categories po model with 31 genes misclassifies only the case sample_190, with predictive probabilities (1.04 × 10^−13, 2.68 × 10^−10, 0.576, 0.424, 2.95 × 10^−37). Although the observed label of sample_190 is type 4 (or LUAD), the predictive probability 0.424 is close to 0.5, i.e., at the boundary between type 3 and type 4. We can further reduce the multi-gene signature to a set of four genes: gene_7965, gene_9176, gene_12977, and gene_15898. With the final baseline-category npo model, we can still achieve a high predictive accuracy of (801 − 2)/801 = 99.75% based on a five-fold cross-validation.

Prognostic Multi-Gene Signatures
A prognostic multi-gene signature is a set of genes whose expression patterns can serve as a prognostic biomarker for the associated phenotype or condition, such as the cancer types in our example [28,29]. In this section, we compare four potential prognostic multi-gene signatures: (1) the 31 genes (see Appendix A) selected for the adjacent-categories po (acpo) model; (2) the 74 genes (see Appendix B) selected for the adjacent-categories npo (acnpo) model; (3) the 4 genes (see Section 4.4) selected for the baseline-category npo (nomnpo) model; and (4) the top 50 genes (top50) recommended by [1].
In Table 3, we list the number of genes shared by the row multi-gene signature and the column one. As mentioned in Section 4.3, two genes, gene_7965 and gene_9176, are shared by acpo, acnpo, and nomnpo, while only one gene, gene_9176, is shared by all four signatures. It should be noted that the genes selected for acpo and nomnpo are subsets of neither acnpo nor top50. The signatures overlap with each other, but none is contained in another.

Discussion
In this paper, we propose a data analysis procedure for high-dimensional sparse data with categorical response variables, including covariate screening, model selection, order selection, and variable selection. Different from typical machine learning techniques, the prediction given the covariates is a probability distribution over the collection of response categories. For example, the prediction for sample_190 in the RNA-seq gene expression data is (1.04 × 10^−13, 2.68 × 10^−10, 0.576, 0.424, 2.95 × 10^−37) based on the adjacent-categories po model with 31 genes. Although the observed label is type 4 (or LUAD), the prediction should be interpreted as a high risk of both KIRC (type 3) and LUAD, since both predictive probabilities are close to 0.5. This is important in practice, since a doctor may suggest that the patient receive early screening examinations for both types of tumors. If, instead, the prediction were deterministic, in this case KIRC, whose predictive probability 0.576 is the highest, then the risk of LUAD would be ignored by mistake. In other words, a distributional prediction or stochastic classification is much more informative than a deterministic classification.
As mentioned in Sections 4.3-4.5, we recommend three predictive models for the RNA-seq gene expression data at different levels of cost and accuracy. For predicting the risks of five types of cancers, Model (3) based on four genes costs the least and achieves the lowest accuracy level. In practice, it may be used for general cancer screening purposes. Model (2), based on 74 genes, has the highest predictive accuracy and may be used for clinical diagnosis. Model (1), with 31 genes, stands in the middle and may be used for cancer screening practice for people with a higher risk due to a family history of cancers. In other words, each recommended model may have its own appropriate applications.
Another interesting result obtained in our analysis is the order (COAD, KIRC, LUAD, BRCA, PRAD) chosen for the adjacent-categories po model. According to the reduction in the AIC value from 309.52 to 175.81, the chosen order works significantly better than the original order for the adjacent-categories po model. It may be worth exploring whether the selected order has any implication since it is supported by the data.
It should be noted that, for illustration purposes only, the original gene expression levels x_ij are used as predictors in Section 4, which leads to main-effects models. In practice, however, transformations of covariates such as x_ij^2 and log(x_ij + 1), or interactions of covariates such as x_ij x_ik, may be considered as potential predictors as well. In that case, the selected final models are expected to be different.
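For instance, an augmented design matrix with squared terms, log-transforms, and pairwise interactions could be built as follows (an illustrative Python sketch of ours; the function name is hypothetical, and the expression levels are assumed non-negative so that log(x + 1) is well defined):

```python
import numpy as np
from itertools import combinations

def augment_features(X):
    """Augment a non-negative design matrix with x^2, log(x + 1), and
    pairwise interaction columns x_j * x_k (j < k)."""
    blocks = [X, X ** 2, np.log(X + 1.0)]
    blocks += [X[:, [j]] * X[:, [k]] for j, k in combinations(range(X.shape[1]), 2)]
    return np.hstack(blocks)
```

With d original covariates, this yields 3d + d(d − 1)/2 candidate predictors, so the screening and selection steps of the procedure become even more important after such augmentation.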
Besides the RNA-seq gene expression data and other omics data in the medical sciences, the proposed statistical approach can also be used for high-dimensional sparse data arising in many other scientific disciplines, including survey and demographic data in the social sciences, bag-of-words representations of text in natural language processing and information retrieval, mobile app usage logs in application usage analytics, recommender systems in artificial intelligence (AI), etc. [2].

Institutional Review Board Statement: Ethical review and approval were not required for this study because it uses publicly available data.

Informed Consent Statement: Not applicable.
Data Availability Statement: The gene expression data are publicly available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq#, accessed on 4 November 2022) (see also Section 3).

Conflicts of Interest: The authors declare no conflict of interest.