Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data

Sparse data with a high proportion of zeros arise in various disciplines. Modeling sparse high-dimensional data is a challenging and growing research area. In this paper, we provide statistical methods and tools for analyzing sparse data in a fairly general and complex context. We utilize two real scientific applications as illustrations, including a longitudinal vaginal microbiome dataset and a high-dimensional gene expression dataset. We recommend zero-inflated model selections and significance tests to identify the time intervals when the pregnant and non-pregnant groups of women are significantly different in terms of Lactobacillus species. We apply the same techniques to select the best 50 genes out of 2426 sparse genes in the gene expression data. The classification based on our selected genes achieves 100% prediction accuracy. Furthermore, the first four principal components based on the selected genes can explain as much as 83% of the model variability.


Introduction
Sparse or zero-inflated data arise in various disciplines such as microbiome [1], gene expression [2], epidemiology [3], health care [4], security [5], social networks [6], and more. Modeling sparse data is very challenging due to the high proportion of zeros and severe skewness of the distribution [7,8]. Modeling sparse data appropriately is also critical for successful scientific applications. For example, zero readings in microbiome and RNA-seq data have two possible sources: first, some species or genes exist but are not detected as a result of insufficient sequencing depth or inefficiencies of the technology processes (non-biological zeros); second, it is possible that some species or genes are truly never represented (biological zeros) [9].
To model sparse data, zero-inflated and hurdle models have been widely used. Both of them consist of two data-generating processes. The first process generates purely zeros, while the second one is governed by some distribution, for example, Poisson distribution, which may or may not generate zeros. The zero-inflated Poisson (ZIP) model was proposed by Lambert (1992) with an application to defects in manufacturing [10]. In 1994, Greene considered the zero-inflated negative binomial (ZINB) model with an application on consumer behavior and default on credit card loans [11].
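To make the two-process structure concrete, here is a minimal Python sketch of the zero-inflated Poisson pmf. The function name `zip_pmf` and the parameter names `phi` and `lam` are ours, chosen for illustration; they are not part of any package discussed in this paper.

```python
import math

def zip_pmf(y, phi, lam):
    """P(Y = y) under a zero-inflated Poisson (ZIP): a point mass of
    extra zeros with weight phi, plus a Poisson(lam) baseline with
    weight 1 - phi (the baseline itself can also produce zeros)."""
    baseline = math.exp(-lam) * lam**y / math.factorial(y)
    return phi * (y == 0) + (1 - phi) * baseline

# The pmf sums to 1 over y = 0, 1, 2, ...
total = sum(zip_pmf(y, phi=0.3, lam=2.0) for y in range(100))
```

Note that `zip_pmf(0, phi, lam)` equals `phi + (1 - phi) * exp(-lam)`, which is exactly the "two sources of zeros" described above.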
In this paper, we consider analyzing zero-inflated data in a more general and complex context. One motivating example is the vaginal microbiome data, which is longitudinal, and the goal is to identify the time intervals when the two groups of individuals are significantly different [2].
The vaginal microbiome is a dynamic micro-ecosystem that inhabits the vaginal surfaces and cavity. The vaginal microbiome has great significance in maintaining vaginal health and protecting the host from urogenital diseases such as sexually transmitted infections. For model selection and diagnostics, we rely on the R package AZIAD, which provides Kolmogorov-Smirnov tests (KS tests) for fairly general distributions, likelihood ratio tests (LRT) for model selection, and the Fisher information matrix and confidence intervals for parameter estimates. We provide more details on utilizing AZIAD in Section 2.4.
In this paper, we use the two motivating examples as illustrations of analyzing longitudinal or high-dimensional sparse data. For the longitudinal vaginal microbiome data, we compare the pregnant and non-pregnant groups in terms of the Lactobacillus species to identify the time intervals when the two groups are significantly different. Although we use the Lactobacillus species as an example, our methods and analysis can be applied to any other vaginal microbiome species in the dataset as well. For the gene expression data, we select the most informative genes based on sparse data model selection and show how the selected genes help with predicting the class labels. We also apply principal component analysis to the top 50 selected genes and show that the first four principal components can explain 83% of the model variability.

Materials and Methods
To analyze sparse data, we use zero-altered models (or hurdle models) and zero-inflated models. In this section, we first briefly review zero-inflated and hurdle models. Then we illustrate by examples how to use AZIAD for selecting the most appropriate model for sparse data. One of our major contributions is the significance test that we develop based on sparse data model selection, which will be used for selecting significant time points for longitudinal sparse data and selecting the most informative covariates for high-dimensional sparse data.

Zero-Altered or Hurdle Models
Zero-altered models, also known as hurdle models, have been widely used for modeling sparse data (see, for example, [1] or [8], for a good review). Technically speaking, hurdle models can also be used for modeling data with fewer zeros than expected. A general hurdle model consists of two components, one generating zeros and the other generating non-zeros. Given a baseline distribution function f_θ(y), where the parameter(s) θ = (θ_1, . . . , θ_p)^T, p ≥ 1, the distribution function of the corresponding hurdle model can be written as follows:

f_ZA(y) = φ · 1(y = 0) + (1 − φ) · [f_θ(y) / (1 − p_0(θ))] · 1(y ≠ 0),   (1)

where φ ∈ [0, 1] is the weight parameter of zeros, f_θ(y) is either a probability mass function (pmf) for some discrete baseline distribution or a probability density function (pdf) for some continuous baseline distribution, and p_0(θ) = f_θ(0) for discrete distributions or simply 0 for continuous distributions.

Zero-Inflated Models
Unlike zero-altered models, a zero-inflated model always assumes an excess of zeros. Besides zeros coming from the baseline distribution f_θ(y), there are additional zeros modeled by a weight parameter φ ∈ [0, 1]. The distribution function of the zero-inflated model can be defined as follows:

f_ZI(y) = φ · 1(y = 0) + (1 − φ) · f_θ(y).   (2)

Zero-Altered and Zero-Inflated Models with Continuous Baseline Distributions
When the baseline distribution f_θ(y) is continuous, p_0(θ) = 0 and models (1) and (2) are the same, called a zero-altered and zero-inflated (ZAZI) model (see [41] for more details). Its distribution function can be written as follows:

f_ZAZI(y) = φ · 1(y = 0) + (1 − φ) · f_θ(y) · 1(y ≠ 0).   (3)

Commonly used continuous baseline distributions include Gaussian (or normal), log-normal, half-normal, exponential, etc. The corresponding zero-inflated models are also known as zero-inflated Gaussian (ZIG), zero-inflated log-normal (ZILN), zero-inflated half-normal (ZIHN), and zero-inflated exponential (ZIE), respectively.
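Because a continuous baseline places zero probability mass at the single point 0, every exact zero in a ZAZI model can be attributed to the zero component, which makes maximum likelihood estimation straightforward: φ is estimated by the observed zero proportion, and the baseline parameters are fitted to the non-zero readings. The Python sketch below (our own illustration, not AZIAD code) does this for a zero-inflated Gaussian.

```python
import numpy as np

def fit_zig(y):
    """ML estimates for a zero-inflated Gaussian (ZAZI with a normal
    baseline): exact zeros come only from the zero component, so
    phi_hat is the zero proportion and (mu, sigma) are fitted to the
    non-zero values."""
    y = np.asarray(y, dtype=float)
    phi_hat = np.mean(y == 0)           # weight of the zero component
    nonzero = y[y != 0]
    mu_hat = nonzero.mean()             # normal baseline, non-zeros only
    sigma_hat = nonzero.std(ddof=0)     # MLE uses ddof=0
    return phi_hat, mu_hat, sigma_hat

# Simulate a ZIG sample with true parameters (phi, mu, sigma) = (0.25, 5, 2)
rng = np.random.default_rng(0)
n = 10000
is_zero = rng.random(n) < 0.25
y = np.where(is_zero, 0.0, rng.normal(5.0, 2.0, n))
phi_hat, mu_hat, sigma_hat = fit_zig(y)
```

With a large simulated sample, the three estimates land close to the true values, which is a quick sanity check on the decomposition.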

Model Selection Using AZIAD Package
To identify the most appropriate model for sparse data, we recommend the R package AZIAD. Compared with other existing R packages for analyzing zero-inflated data, (1) it considers 27 different distributions; (2) it covers both discrete and continuous baseline distributions; and (3) it provides the Fisher information matrix and confidence intervals for the estimated parameters as well.
When the baseline distribution is continuous with the pdf f θ (y), AZIAD covers normal, log-normal, half-normal, exponential, and their corresponding zero-inflated and hurdle models. When the baseline distribution is discrete with the pmf f θ (y), the package covers Poisson, geometric, negative binomial, beta binomial, beta negative binomial, and their corresponding zero-inflated and hurdle models.
We apply the KS-test to each model under consideration to test whether the model is appropriate for the given sparse data. If the p value of the KS-test is below 0.05, we usually discard the corresponding model. There are two functions built into the AZIAD package, kstest.A and kstest.B. According to [41], kstest.B is recommended for data with a sample size of about 50 or below, such as the vaginal microbiome data of the pregnant group on week three (see Section 3.1), while kstest.A is recommended for a larger sample size, such as the gene expression data (see Section 3.2). We provide below two toy examples.
We then generate a random sample Data2 from a zero-inflated beta negative binomial (ZIBNB) model using the R function sample.zi1. The model parameters are (φ, r, α, β) = (0.4, 10, 3, 2) and the sample size is N = 30. Since the sample size is below 50, we apply kstest.B to two models, ZIBNB and ZIP. Again, only the true model ZIBNB has a p value larger than 0.05.
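For readers who do not use R, the simulation step can be mimicked as follows. This Python sketch draws from a zero-inflated Poisson rather than a ZIBNB (a beta negative binomial sampler is not in the standard Python toolkits), so it is an analogue of AZIAD's sample.zi1, not a reimplementation, and then checks the implied zero probability.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_zip(n, phi, lam, rng):
    """Draw n observations from a zero-inflated Poisson: with
    probability phi emit a structural zero, otherwise draw from
    Poisson(lam)."""
    is_extra_zero = rng.random(n) < phi
    counts = rng.poisson(lam, n)
    return np.where(is_extra_zero, 0, counts)

x = sample_zip(100000, phi=0.4, lam=3.0, rng=rng)

# Under ZIP, P(Y = 0) = phi + (1 - phi) * exp(-lam)
expected_p0 = 0.4 + 0.6 * np.exp(-3.0)
observed_p0 = np.mean(x == 0)
```

Comparing `observed_p0` with `expected_p0` is the same kind of consistency check that underlies a KS-type goodness-of-fit test, just restricted to the zero class.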

Significance Test on Group Labels
In this section, we propose a significance test for selecting the most informative covariates associated with group labels of sparse data.
First of all, we briefly review two commonly used model selection criteria, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), defined as

AIC = −2 loglik + 2k,    BIC = −2 loglik + k log N,

where loglik is the maximized log-likelihood (see Section 2.4), k stands for the number of parameters in the model, and N is the sample size (see, for example, [42] for more details). According to Hastie et al. (2009), BIC is asymptotically consistent in the sense that it will choose the correct model with probability approaching 1 as the sample size N → ∞, while AIC is more suitable for small or moderate sample sizes [42].
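Both criteria are mechanical to compute once the maximized log-likelihood is available. A small Python sketch, using a normal model purely as an example (the data and function names are ours):

```python
import numpy as np

def aic_bic(loglik, k, n):
    """AIC = -2 loglik + 2k;  BIC = -2 loglik + k log(n)."""
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

# A normal model fitted by MLE has k = 2 parameters (mu, sigma)
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 200)
mu, sigma = y.mean(), y.std(ddof=0)
loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                - (y - mu)**2 / (2 * sigma**2))
aic, bic = aic_bic(loglik, k=2, n=len(y))
```

Since log(200) > 2, BIC penalizes each parameter more heavily than AIC here, so BIC exceeds AIC by exactly k(log N − 2).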
In this paper, we adopt AIC since the sample size is 52 for the vaginal microbiome data (see Section 3.1), or 801 for the gene expression data (see Section 3.2). Now we propose a significance test based on our sparse data model selection procedure. In general, we consider a data set with covariates {x ij | i = 1, . . . , N; j = 1, . . . , p}, that is, consisting of N individuals and p covariates, and class labels y i ∈ {1, . . . , m}, m ≥ 2. The goal is to select the most informative covariates for predicting the class labels. For each covariate, say the jth covariate, the readings or measures x ij , i = 1, . . . , N are grouped into m blocks according to their class labels y i , i = 1, . . . , N. We summarize the following procedure in three steps.
• Step 1: Choose the most appropriate model for all the N readings {x ij | i = 1, . . . , N} after ignoring their class labels. This task is accomplished by performing KS-tests using kstest.A on all models under consideration (see also [7]). Denote the AIC of the chosen model by Model I AIC.
• Step 2: Apply the same model selection procedure to each of the m blocks separately, and sum up the m resulting AIC values. Denote the sum by Model II AIC.
• Step 3: Calculate the difference Model I AIC − Model II AIC. A larger difference indicates that the jth covariate is more informative for predicting the class labels.
We refer the readers to [43] for more discussion on using AIC or BIC for model selection.
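The three-step procedure above can be sketched in a few lines of Python. For brevity this illustration uses a plain normal model in both steps instead of running the full KS-test-based model selection; the AIC-difference logic is the same, and all names (`normal_aic`, `aic_difference`) are ours.

```python
import numpy as np

def normal_aic(y):
    """AIC of a normal model fitted by MLE: k = 2 parameters (mu, sigma)."""
    mu, sigma = y.mean(), y.std(ddof=0)
    loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                    - (y - mu)**2 / (2 * sigma**2))
    return -2 * loglik + 2 * 2

def aic_difference(x, labels):
    """Model I AIC (one fit, labels ignored) minus Model II AIC
    (one fit per class, AIC values summed over the classes).
    A larger difference means the covariate is more informative."""
    model1 = normal_aic(x)
    model2 = sum(normal_aic(x[labels == g]) for g in np.unique(labels))
    return model1 - model2

rng = np.random.default_rng(7)
labels = np.repeat([0, 1], 100)
# A covariate whose distribution shifts with the class label...
informative = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
# ...versus one that ignores the labels entirely
uninformative = rng.normal(0, 1, 200)
diff_inf = aic_difference(informative, labels)
diff_uninf = aic_difference(uninformative, labels)
```

As expected, the informative covariate yields a large positive AIC difference, while the uninformative one hovers near (or below) zero because splitting by class only adds parameters without improving the fit.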

Two Applications
In this section, we use two real examples to illustrate how our variable selection techniques could be used for sparse data analysis.

Vaginal Microbiome
The purpose of this study is to characterize the changes in the composition of the vaginal microbiome (including Lactobacilli) over time between two groups of women, pregnant and non-pregnant. The dataset is available in Romero et al. (2014)'s study [2]. The original study includes 32 non-pregnant women and 22 pregnant women who had a term delivery without complications. Non-pregnant women self-sampled twice a week for 16 weeks. Pregnant women had a speculum examination at each visit, when a sample of vaginal fluid was collected as well. Samples were collected every 4 weeks until week 24 of pregnancy and every 2 weeks thereafter until delivery. The numbers of samples for the 22 pregnant women are not balanced and fluctuate between 3 and 8. In our analysis, we remove one non-pregnant woman since she contributed only one observation. That is, 31 non-pregnant women are kept for our analysis.
The goal of our study in this paper is to examine the AIC differences between the pregnant and non-pregnant women over the time of pregnancy (1 ≤ t ≤ 38), focusing on the Lactobacillus microbiome, and to identify the time intervals when the two groups of women are significantly different in terms of Lactobacillus. Eventually, this procedure can be extended to a parallel analysis of any other species in this dataset.

RNA-Seq Gene Expression Data
The RNA-seq gene expression dataset is a high-dimensional dataset consisting of 801 tissue samples (N = 801) and 20,531 genes (p = 20,531) [21]. The 801 samples are labeled by 5 different types of cancerous tumors, BRCA, KIRC, COAD, LUAD, and PRAD. The data can be accessed from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq#, accessed on 4 November 2022) and were collected as a part of The Cancer Genome Atlas (TCGA) analysis project [44] (see also [21]).
The goal of our study is to select the most informative genes for identifying class labels.

Data Analysis and Results
In this section, we explain the data analysis procedures and report our results for the two real applications described in Section 3.

Vaginal Microbiome
There are two challenging aspects of this dataset: missing time points and unbalanced time intervals. Our first step is data imputation for the missing time points using the k-nearest neighbor method, where k = 1 is used in our study (see, for example, [45] for other possible missing value imputation strategies). Next, we apply linear interpolation to the readings of each individual using the R function approx with option method="linear". We select 38 discrete, equally spaced time points on the curves. At each t = 1, . . . , 38, we gather 53 values of the Lactobacillus microbiome belonging to the two groups of women. By screening all samples, we find that all of the samples contain approximately 5% to 13% zeros. Therefore, zero-inflated models are more appropriate for model selection. Figures 1 and 2 summarize the 22 linearly interpolated curves for pregnant women and the 31 linearly interpolated curves for non-pregnant women over the time of pregnancy (38 weeks). As observed in the non-pregnant samples (Figure 2), there is an apparent outlier curve that behaves very differently from the other samples. We remove this outlier since it is influential for our analysis (see Appendix A for more information). As a result, our analysis is based on 30 non-pregnant women and 22 pregnant women.

After the data preparation, we perform the significance test described in Section 2.5. Since the readings are real numbers, we consider 12 models including four continuous distributions, normal (N), half-normal (HN), log-normal (LN), and exponential (E), and their zero-inflated and hurdle versions. Based on the KS-test p values (see Table 1), we obtain several candidate models with p values larger than 0.05. Among them, normal (N) and normal hurdle (NH) are the best ones (for readers' reference, an extended version of Table 1 can be found in the Supplementary Materials). We choose NH instead of N since the data contain a proportion of zeros.
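The interpolation step above can be mimicked with numpy.interp, which behaves like R's approx with method="linear" inside the observed range. The visit times and readings below are made up for illustration; they are not from the study.

```python
import numpy as np

# One subject's irregular visit times (weeks) and hypothetical readings
weeks = np.array([2.0, 6.0, 11.0, 19.0, 24.0, 30.0, 37.0])
reads = np.array([0.0, 0.3, 0.5, 0.4, 0.0, 0.6, 0.7])

# Linear interpolation onto 38 equally spaced time points t = 1, ..., 38.
# Outside the observed range, numpy.interp holds the end values constant,
# which is one simple way to handle the boundary weeks.
grid = np.arange(1, 39)
curve = np.interp(grid, weeks, reads)
```

Applying this to every subject yields one value per subject at each of the 38 common time points, which is the layout the significance test needs.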
We then calculate Model I AIC for each of the 38 time points. As Step 2, we apply the same procedure to the pregnant and non-pregnant groups separately. The proportions of zeros in the non-pregnant group are roughly between 10% and 20%, which implies that a sparse model should be used. The proportions of zeros in the pregnant group are lower than the non-pregnant group's: some samples contain 4% zeros at the beginning of the pregnancy, while no sample has any zeros after week 21. Therefore, we consider sparse models for weeks 1-21 and regular continuous models for weeks 22-38. Based on the KS-tests on the non-pregnant group (see Table A1 in Appendix B), we choose the normal hurdle (NH) model again. For the pregnant group, the NH model is chosen for weeks 1-21, and the normal distribution is chosen for weeks 22-38 (see Table A2 in Appendix B). In this case,

Model II AIC = AIC_non-pregnant(n=30) + AIC_pregnant(n=22).
As Step 3, we calculate Model I AIC − Model II AIC for each t = 1, . . . , 38. Figure 3 demonstrates the difference of AIC values over the weeks of pregnancy 1 ≤ t ≤ 38 in terms of the Lactobacillus microbiome. It indicates that the two groups tend to be significantly different after week 22 (the AIC differences are larger than 2). That is, the pregnant women and non-pregnant women are significantly different during weeks 22-38 in terms of Lactobacillus species. To further investigate the differences between the two groups before and after week 22 of pregnancy observed in Figure 3, we conduct a more detailed analysis on the estimated parameters (i.e., (μ̂, σ̂, φ̂)) over time. Figures 4 and 5 demonstrate the changes of the individual parameters over time. Before week 22, we consider the changes of the estimated parameters for the combined groups (N = 52), while after week 22 we consider the changes of the estimated parameters for the two groups separately. Figure 4 shows that the estimated mean of the pregnant group is not so different from the non-pregnant group's until week 28. Nevertheless, it is clear that the mean differences are fairly large at the end of the pregnancy. On the other hand, the estimated variances are quite different between the two groups right after week 21. The estimated variance in the pregnant group seems to be much larger than the non-pregnant group's from weeks 22 to 32, and then becomes smaller than the non-pregnant group's, especially at the end of the pregnancy. This downward difference is also significant at the end of the pregnancy. This phenomenon indicates that the Lactobacillus abundance is significantly lower in the pregnant group than in the non-pregnant group at the end of the pregnancy. These conclusions are free of the proportions of zeros in the data. Figure 5 shows the changes of the weight parameter φ of zeros over time.
The parameter estimates differ between the two groups after week 22, which might be due to the missing counts in the non-pregnant group. The estimated φ is 0 after week 22 in the pregnant group, which might be because the pregnant women were examined regularly, especially close to the end of the pregnancy.
In the microbiome literature, there are some other available methods for analyzing longitudinal microbiome data. For example, MetaLonDA is an R package capable of identifying significant time intervals of microbiome features [46]. It relies on two modeling components: the negative binomial distribution assumption for modeling the read counts, and the semi-parametric SSANOVA technique for modeling longitudinal profiles associated with different phenotypes. For comparison purposes, we apply the MetaLonDA package to the Lactobacillus species and compare the results with ours. The MetaLonDA interval p-value results (weeks 2-38) suggest that there is no significant difference between the two groups during the first 11 weeks or during weeks 22-31, while the differences are significant in weeks 12-21 and after week 31.
For example, according to MetaLonDA, there is no significant difference between the two groups from week 22 to week 31, which is different from our result. From Figures 4 and 5, one can see that the standard deviations (σ) and the proportions of zeroes (φ) are quite different between the two groups starting week 22, although the group means (μ) are fairly close. MetaLonDA is not able to detect such differences. For readers' reference, we provide the estimated parameters of the non-pregnant and pregnant groups starting week 21 in Table S4 of the Supplementary Materials.

RNA-Seq Gene Expression Data
The RNA-seq gene expression data is a high-dimensional dataset consisting of 801 samples and 20,531 genes (covariates). At our data preparation and screening stage, we first delete 267 genes (columns) that contain only zeros. Secondly, we remove the 12,356 columns (genes) that have no zeros at all so that we can focus on zero-inflated covariates (covariates without zeros are not sparse and can be dealt with by classical data analysis). For the same reason, we keep only the genes (columns) that carry a good proportion (from 5% to 50%) of zeros. As a result, the total number of genes used for our study is p = 2426, which is still high-dimensional.
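The screening stage reduces to keeping the columns whose zero proportion falls in a target band. A minimal Python sketch, with the 5% and 50% thresholds from the text; the simulated matrix and the function name `screen_sparse_columns` are illustrative only.

```python
import numpy as np

def screen_sparse_columns(X, lo=0.05, hi=0.50):
    """Keep only columns whose zero proportion lies in [lo, hi]:
    this drops all-zero columns, zero-free columns, and columns
    outside the targeted sparsity band in a single pass."""
    zero_frac = np.mean(X == 0, axis=0)
    keep = (zero_frac >= lo) & (zero_frac <= hi)
    return X[:, keep], keep

rng = np.random.default_rng(3)
X = rng.exponential(1.0, size=(100, 6))  # positive values: zero_frac = 0
X[:, 0] = 0.0                            # an all-zero gene
X[:30, 2] = 0.0                          # a gene with 30% zeros (kept)
X[:80, 4] = 0.0                          # a gene with 80% zeros (too sparse)
X_kept, keep = screen_sparse_columns(X)
```

Only the 30%-zeros column survives: the all-zero and zero-free columns are removed first, and the 80%-zeros column falls outside the band.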
The 801 gene expression samples are classified into five cancer categories, BRCA, KIRC, COAD, LUAD, and PRAD, which contain 300, 146, 78, 141, and 136 samples, respectively. We use the R package AZIAD for the rest of the analysis. Since the data consist of non-negative real numbers, we consider the same set of 12 models as in Section 4.1. For each of the 2426 genes, we apply the three steps described in Section 2.5. As Step 1, we apply the R function kstest.A to the 801 expression levels of the gene under consideration after ignoring the labels. Technically speaking, any model with a KS-test p value larger than 0.05 could be a candidate. In this study, we choose the model with the largest p value for simplicity. In case of ties, we always choose the last one in the tie list.
As a summary of the model selection results based on all 801 samples, among the 2426 genes, the normal hurdle (NH) model is chosen for 1194 genes, the log-normal hurdle (LNH) model for 885 genes, the log-normal (LN) model for 98 genes, and the zero-inflated exponential (ZIE) model for another 98 genes. The remaining genes are fitted with exponential (66 genes), normal (60 genes), zero-inflated half-normal (22 genes), and half-normal (3 genes) models. We also perform model selections for each of the five categories. The results are slightly different. The log-normal hurdle model is favored since it is either comparable to or better than the normal hurdle model in each category. For example, for the COAD category, the log-normal hurdle model passes 99% of the KS-tests, compared with 97% for the normal hurdle model. After finishing the three steps for each gene (see Section 2.5), we are able to rank the 2426 genes by their AIC differences from the largest to the smallest.
A critical question is how many genes should be chosen for predicting the class labels. To choose a good threshold, we utilize the 1-nearest neighbor classifier (see, for example, [42]) with various numbers of selected genes to predict the class labels. Table 2 lists the prediction error rate (also known as the training error rate, since all 801 class labels have been used by the classifier) of the 1-nearest neighbor classifier with 20, 50, 100, or 2426 ranked genes. For comparison purposes, we also report the prediction error rate (0.02) of the 1-nearest neighbor classifier based on the top 7 principal components suggested by [21]. Note that the principal components here are based on 20,264 genes after removing the first column (sample indices) and all-zero columns. According to Table 2, 50 seems to be a good option for the number of selected genes, whose prediction error (0.0037) is just below the top 7 PCA's. For readers' reference, we provide the list of the top 50 genes in Appendix C. It is known that the training error rate may underestimate the true prediction error rate due to over-fitting [42]. To obtain a fair estimate of the prediction error, we conduct a 5-fold cross-validation on this data. More specifically, we split the data (N = 801) into 5 roughly equal-sized blocks, each containing about 160 samples. For each block (used as the testing data), we combine the remaining 4 blocks as the training data. The genes are selected by AIC differences using the training data only, and the 1-nearest neighbor classifier utilizes the class labels of the training data only for predicting class labels of the testing data. To simplify the procedure, we fix the model to be either log-normal hurdle (with zero-inflation) or log-normal (without zero-inflation). We repeat the procedure so that each block serves as the testing data once. The prediction error rate is calculated based on the predictions on the 5 testing data sets.
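The cross-validation scheme just described can be sketched as follows. We hand-roll a 1-nearest neighbor classifier so the example stays self-contained, and we omit the fold-wise gene selection step for brevity; the data and all function names are illustrative.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbor classifier using Euclidean distance."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

def cv_error(X, y, n_folds=5, seed=0):
    """5-fold cross-validated error rate: each fold is held out once
    while the classifier is fit on the remaining folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    wrong = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = one_nn_predict(X[train], y[train], X[test])
        wrong += np.sum(pred != y[test])
    return wrong / len(y)

# Two well-separated classes should give a near-zero CV error
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(8, 1, (50, 3))])
y = np.repeat([0, 1], 50)
err = cv_error(X, y)
```

In the paper's setting, the gene ranking by AIC differences would be recomputed inside each fold (on the training blocks only) before calling the classifier, to avoid selection bias leaking into the error estimate.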
Table 3 summarizes the estimated prediction error rates based on the 5-fold cross-validation with the 1-nearest neighbor classifier as described above. As the number of top selected genes increases, roughly speaking, the prediction error rate first decreases and then increases. The best prediction error rate, 0, is attained at 50 genes. In other words, the best number of genes for this data is 50, chosen by the 5-fold cross-validation. For comparison purposes, the prediction error rate using the top 7 principal components is listed in Table 3 as well. Estimated by the 5-fold cross-validation, the prediction error rate of 7PCA is 0.0037, which is worse than that of the top 50 selected genes (its training error rate, 0.02, is listed in Table 2). We further apply principal component analysis (PCA) to the top 50 selected genes. Figure 6 shows the cumulative variance explained by various numbers of principal components of the 50 genes. We recommend the top 4 principal components, which explain 83% of the variability of the data. For readers' reference, the top 10 principal components are needed to explain 90% of the variability.
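A cumulative-variance curve like the one in Figure 6 comes from a standard PCA computation, sketched below on simulated low-rank-plus-noise data (the numbers are illustrative, not the gene expression results).

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative proportion of variance explained by the principal
    components of X, computed via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2                      # component variances (up to 1/n)
    return np.cumsum(var) / var.sum()

rng = np.random.default_rng(9)
# Rank-3 signal plus small noise: a few components carry most variance
scores = rng.normal(0.0, [5.0, 3.0, 1.0], size=(200, 3))
loadings = rng.normal(size=(3, 20))
X = scores @ loadings + 0.1 * rng.normal(size=(200, 20))
cumvar = cumulative_explained_variance(X)
```

Choosing the number of components then amounts to reading off where `cumvar` crosses a target level (83% or 90% in our application).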
Many clustering methods have been used for gene expression data, including Ewens-Pitman Attraction (EPA), MCLUST, hierarchical clustering in conjunction with the gap statistic, k-means clustering in conjunction with the gap statistic, and Table Invitation Prior (TIP) [47]. Compared with other clustering methods, the TIP clustering algorithm does not require the analyst to specify the number of clusters and can still provide a good result [21]. For illustration purposes, we apply the TIP clustering method to this data with the top 4 principal components based on our top 50 selected genes. Its one-cluster plot is shown in Figure 7, where the original 5 classes appear well separated.

Conclusions and Discussion
In this work, we use two different real-world examples to illustrate how sparse data model selection can be used for choosing the most significant covariates. The first example is the vaginal microbiome, which is dominated by one or two species of Lactobacillus. The vaginal microbiome dataset is longitudinal, and we aim to compare the Lactobacillus abundance between the pregnant and non-pregnant groups over 38 time intervals (weeks). Based on our proposed variable selection method, the two groups are statistically different after week 22. In addition, we compare the estimated parameters of the two groups at each time point and show the differences in terms of model parameters. Our selected time intervals overlap with the MetaLonDA method's to some extent. This work can be extended to other vaginal microbiome species as well. The second example is the gene expression data, which is high dimensional, and the subjects are categorized into 5 different cancer groups. The goal is to select the most important genes among a list of 2426 genes. Based on our variable selection procedure and PCA, we select the top 50 genes with 100% prediction accuracy. In addition, we pick the first 4 principal components, which explain about 83% of the variability. Finally, we apply the TIP clustering algorithm to the gene expression data with the top 4 principal components based on our top 50 selected genes; the 5 cancer classes are well separable.

Data Availability: The vaginal microbiome data are available from Romero et al. (2014) [2]. The gene expression data are publicly available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq#, accessed on 4 November 2022) (see also Section 3).

Acknowledgments:
The authors would like to thank Hsin-Hsiung Huang from the University of Central Florida for sharing their R package "TIP" (https://cran.r-project.org/web/packages/tip/ index.html, accessed on 7 December 2022).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Figure A1 shows the unstable AIC differences between the two groups over time when the outlier is present (see Section 4.1). We remove the outlier from the non-pregnant samples for our analysis. For readers' reference, extended versions of Tables A1 and A2 can be found in the Supplementary Materials.