# Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Zero-Altered or Hurdle Models

#### 2.2. Zero-Inflated Models

#### 2.3. Zero-Altered and Zero-Inflated Models with Continuous Baseline Distributions

#### 2.4. Model Selection Using AZIAD Package

`kstest.A` and `kstest.B`. According to [41], `kstest.B` is recommended for data with a sample size of about 50 or below, such as the vaginal microbiome data of the pregnant group at week three (see Section 3.1), while `kstest.A` is recommended for a larger sample size, such as the gene expression data (see Section 3.2). We provide below two toy examples.

`> set.seed(456)`

`> Data1=sample.h1(2000,phi=0.3,dist="normal",mean=10,sigma=2)`

`> kstest.A(Data1,nsim=100,bootstrap=TRUE,dist="normalh",`

`lowerbound=1e-10,upperbound=100000)$pvalue`

`> 1`

`> kstest.A(Data1,nsim=100,bootstrap=TRUE,dist="lognormal",`

`lowerbound=1e-1,upperbound=1000000)$pvalue`

`> 0`

`> kstest.A(Data1,nsim=100,bootstrap=TRUE,dist="zilognorm",`

`lowerbound=1e-1,upperbound=1000000)$pvalue`

`> 0`

`> Data2=sample.zi1(N=30,phi=0.4,r=10,alpha1=3,alpha2=5,dist="bnb")`

`> kstest.B(Data2,nsim=100,bootstrap=TRUE,dist="zibnb",`

`lowerbound=1e-10,upperbound=100000)$pvalue`

`> 0.76`

`> kstest.B(Data2,nsim=100,bootstrap=TRUE,dist="zip",`

`lowerbound=1e-10,upperbound=100000)$pvalue`

`> 0`

`sample.h1` can be used for generating random samples from hurdle models. We first generate a random sample (`Data1`) from a normal hurdle distribution with parameters $(\varphi ,\mu ,\sigma )=(0.3,10,2)$ and sample size $N=2000$. For reproducibility, we set the random seed to 456. To these data we apply `kstest.A` with three different distributions: normal hurdle, log-normal, and zero-inflated log-normal. The results show that only the true distribution, normal hurdle, is appropriate, with a p-value larger than $0.05$.
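To make the normal hurdle model in this example concrete, the following base-R sketch draws a sample with the same parameters without relying on AZIAD. The helper name `r_normal_hurdle` is hypothetical and is not claimed to reproduce the internals or the exact output of `sample.h1`; it only illustrates the structure of the model, assuming each value is zero with probability $\varphi$ and otherwise normally distributed.

```r
# Hypothetical helper (not part of AZIAD): draw n values from a normal hurdle
# model. Each value is 0 with probability phi, otherwise N(mean, sigma^2).
r_normal_hurdle <- function(n, phi, mean, sigma) {
  is_zero <- runif(n) < phi
  x <- rnorm(n, mean = mean, sd = sigma)
  x[is_zero] <- 0
  x
}

set.seed(456)
toy <- r_normal_hurdle(2000, phi = 0.3, mean = 10, sigma = 2)

# The share of exact zeros should be close to phi = 0.3, and the nonzero
# part should look like a sample from N(10, 2^2).
zero_share   <- mean(toy == 0)
nonzero_mean <- mean(toy[toy != 0])
nonzero_sd   <- sd(toy[toy != 0])
```

Because the baseline distribution is continuous, an exact zero can only come from the point mass at zero, which is why the zero share alone estimates $\varphi$ here.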

We then generate `Data2` from a zero-inflated beta negative binomial (ZIBNB) model using the R function `sample.zi1`. The model parameters are $(\varphi ,r,\alpha ,\beta )=(0.4,10,3,2)$ and the sample size is $N=30$. Since the sample size is below 50, we apply `kstest.B` to two models, ZIBNB and ZIP. Again, only the true model, ZIBNB, has a p-value larger than $0.05$.

`new.mle` for general baseline distributions, and `zih.mle` for zero-inflated and hurdle models. To demonstrate in more detail, we consider the following toy examples.

`> library(AZIAD)`

`> set.seed(657)`

`> Data1=extraDistr::rbbinom(1000,size=4,alpha=2,beta=3)`

`> new.mle(Data1,n=10,alpha1=3,alpha2=4,dist="bb")`

`> n Alpha Beta loglik`

`> 3.99 1.975527 2.923279 -3060.583`

`> Data2=sample.zi1(2000,phi=0.3,dist="bnb",r=5,alpha1=3,alpha2=3)`

`> zih.mle(Data2,r=10,alpha1=3,alpha2=4,dist="bnb.zihmle",type="zi")`

`> r alpha1 alpha2 phi loglik`

`> 5.095388 3.033706 2.902682 0.3025823 -5091.443`

`> Data3=sample.h1(2000,phi=0.3,dist="lognormal",mean=1,sigma=4)`

`> zih.mle(Data3,mean=4,sigma=2,dist="lognorm.zihmle",type="h")`

`> mean sigma phi loglik`

`> 1.049724 3.931015 0.3095 -6537.076`

`> Data4=sample.zi1(2000,phi=0.3,dist="exponential",lambda=20)`

`> zih.mle(Data4,lambda=10,dist="exp.zihmle",type="zi")`

`> lambda phi loglik`

`> 19.55911 0.305 1513`

#### 2.5. Significance Test on Group Labels

**Step 1**: Choose the most appropriate model for all $N$ values $\{x_{ij}\mid i=1,\dots ,N\}$ of the $j$th covariate after ignoring their class labels. This is done by performing KS-tests with `kstest.A` on all models under consideration (see also [7]). We then compute the MLE of the parameters for the chosen model using the R function `zih.mle`. The corresponding AIC value is denoted by $\mathrm{ModelI}_{AIC}$.

**Step 2**: For each of the $m$ classes, say the $k$th class, choose the most appropriate model for the data $\{x_{ij}\mid y_{i}=k\}$ of that class, compute the MLE, and denote the corresponding AIC value by $\mathrm{AIC}(k)$. The aggregated AIC value $\mathrm{ModelII}_{AIC}$ is the sum of the AIC values over the $m$ classes, that is, $\mathrm{ModelII}_{AIC}=\sum_{k=1}^{m}\mathrm{AIC}(k)$.

**Step 3**: Take the difference of the two AIC values, without and with class labels: $\mathrm{ModelI}_{AIC}-\mathrm{ModelII}_{AIC}=\mathrm{ModelI}_{AIC}-\sum_{k=1}^{m}\mathrm{AIC}(k)$. A larger difference indicates that the $j$th covariate is more informative for predicting the class labels.
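The three steps can be sketched in base R as follows. This is only an illustration under a strong simplifying assumption: instead of selecting the best model by KS-test and fitting it with `zih.mle`, every fit uses a normal hurdle model whose MLE is available in closed form. The helper names `aic_normal_hurdle` and `aic_difference` and the simulated data are hypothetical.

```r
# Hypothetical helper: AIC of a normal hurdle model fitted by closed-form MLE.
aic_normal_hurdle <- function(x) {
  nonzero <- x[x != 0]
  n0 <- sum(x == 0)
  n1 <- length(nonzero)
  phi <- n0 / length(x)
  mu <- mean(nonzero)
  sigma <- sqrt(mean((nonzero - mu)^2))  # MLE uses divisor n1, not n1 - 1
  loglik <- sum(dnorm(nonzero, mu, sigma, log = TRUE))
  if (n0 > 0) loglik <- loglik + n0 * log(phi)
  if (n1 > 0) loglik <- loglik + n1 * log(1 - phi)
  2 * 3 - 2 * loglik  # three parameters: phi, mu, sigma
}

# Steps 1 to 3 for one covariate x with class labels y.
aic_difference <- function(x, y) {
  model1 <- aic_normal_hurdle(x)                  # Step 1: labels ignored
  model2 <- sum(tapply(x, y, aic_normal_hurdle))  # Step 2: one fit per class
  model1 - model2                                 # Step 3: the difference
}

# Simulated data: one covariate whose mean depends on the class, one that
# does not. Both share the same zero pattern.
set.seed(1)
y <- rep(c("A", "B"), each = 200)
zero <- runif(400) < 0.2
informative <- ifelse(y == "A", rnorm(400, 5, 1), rnorm(400, 9, 1))
informative[zero] <- 0
noise <- rnorm(400, 7, 1)
noise[zero] <- 0

diff_informative <- aic_difference(informative, y)
diff_noise <- aic_difference(noise, y)
```

In this sketch the covariate whose distribution differs between classes yields a much larger AIC difference than the pure-noise covariate, which is the ranking behavior described in Step 3.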

## 3. Two Applications

#### 3.1. Vaginal Microbiome

#### 3.2. RNA-Seq Gene Expression Data

## 4. Data Analysis and Results

#### 4.1. Vaginal Microbiome

`approx` with option `method="linear"`. We select 38 discrete, equally spaced time points on the curves. At each $t=1,\dots ,38$, we gather the 53 Lactobacillus values from the women in the two groups. Screening all samples, we find that all of them contain approximately $5\%$ to $13\%$ zeros. Therefore, zero-inflated models are more appropriate candidates in model selection.
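As a minimal sketch of the interpolation step, the base-R call below evaluates a linearly interpolated curve for a single subject on a grid of 38 equally spaced time points. The visit weeks and readings are made up for illustration, and the choice `rule = 2` (carry the first and last observation outward instead of returning `NA`) is an assumption, not something stated in the text.

```r
# Made-up readings for one subject: irregular visit weeks and abundances.
weeks   <- c(2, 7, 15, 24, 33, 38)
reading <- c(0.80, 0.75, 0.00, 0.40, 0.90, 0.85)

# 38 equally spaced time points, t = 1, ..., 38.
grid <- seq(1, 38, length.out = 38)

# Linear interpolation; rule = 2 (an assumption) extends the curve with the
# nearest observed value outside the observed range.
curve <- approx(weeks, reading, xout = grid, method = "linear", rule = 2)

# curve$x is the grid, curve$y the interpolated value at each week.
head(data.frame(week = curve$x, value = curve$y))
```

Repeating this for every subject and stacking the values at a fixed $t$ gives the cross-sectional sample to which the candidate models are fitted.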

`$intervals.pvalue`

`[1] 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000`

`[9] 1.00000 1.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000`

`[17] 0.00000 0.00000 0.00000 0.00000 1.00000 1.00000 1.00000 1.00000`

`[25] 1.00000 1.00000 1.00000 1.00000 1.00000 0.69333 0.00000 0.00000`

`[33] 0.00000 0.00000 0.00000 0.00000 0.00000`

#### 4.2. RNA-Seq Gene Expression Data

`kstest.A` to the 801 expression levels of the gene under consideration after ignoring the labels. Technically speaking, any model with a KS-test p value larger than $0.05$ could be a candidate. In this study, we choose the model with the largest p value for simplicity. In case of ties, we always choose the last one in the tie list.
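The selection rule can be written in a couple of lines of base R. The p values below are invented for illustration, and the ordering of the model list is an assumption; only the rule itself (largest p value, last entry among ties) comes from the text.

```r
# Hypothetical KS-test p values for one gene, in the order the models were tried.
pvals <- c(N = 0.12, ZIN = 0.85, NH = 0.85, LN = 0.00, ZILN = 0.03)

# Every model with a p value above 0.05 is a candidate.
candidates <- names(pvals)[pvals > 0.05]

# Pick the largest p value; among ties, keep the last one in the list.
chosen <- names(pvals)[max(which(pvals == max(pvals)))]
```

With these made-up values, ZIN and NH tie for the largest p value, so the rule selects NH because it appears later in the list.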

## 5. Conclusions and Discussion

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Influence of Outlier in Vaginal Microbiome Analysis

## Appendix B. More KS-Test Results for Vaginal Microbiome Analysis

**Table A1.** KS-test p values for non-pregnant group ($N=30$) under 12 distributions at different weeks.

Time (Week) | N | ZIN | NH | HN | ZIHN | HNH | LN | ZILN | LNH | E | ZIE | EH |
---|---|---|---|---|---|---|---|---|---|---|---|---|
week 10 | 1.000 | 0.395 | 1.000 | 0.460 | 0.580 | 0.020 | 0.000 | 0.000 | 0.870 | 0.330 | 0.725 | 0.040 |
week 22 | 1.000 | 0.480 | 1.000 | 0.520 | 0.420 | 0.010 | 0.000 | 0.000 | 0.785 | 0.450 | 0.590 | 0.025 |
week 36 | 1.000 | 0.160 | 1.000 | 0.480 | 0.790 | 0.020 | 0.000 | 0.000 | 0.955 | 0.145 | 0.375 | 0.015 |

Time (Week) | N | ZIN | NH | HN | ZIHN | HNH | LN | ZILN | LNH | E | ZIE | EH |
---|---|---|---|---|---|---|---|---|---|---|---|---|
week 10 | 1.000 | 0.355 | 1.000 | 0.505 | 0.385 | 0.170 | 0.000 | 0.000 | 0.390 | 0.500 | 0.130 | 0.130 |
week 22 | 1.000 | 0.250 | 1.000 | 0.295 | 0.490 | 0.415 | 0.020 | 0.000 | 0.000 | 0.135 | 0.005 | 0.170 |
week 36 | 1.000 | 0.380 | 1.000 | 0.635 | 0.975 | 0.560 | 0.005 | 0.000 | 0.000 | 0.425 | 0.550 | 0.410 |

## Appendix C. List of 50 Selected Genes for Gene Expression Data

`"gene_14646" "gene_12695" "gene_17688" "gene_15945" "gene_5394"`

`"gene_12209" "gene_1054" "gene_3598" "gene_7235" "gene_11440"`

`"gene_4979" "gene_2288" "gene_6162" "gene_16817" "gene_15898"`

`"gene_4467" "gene_3946" "gene_16392" "gene_11566" "gene_1510"`

`"gene_9181" "gene_16246" "gene_16337" "gene_16169" "gene_10489"`

`"gene_9680" "gene_998" "gene_9176" "gene_4833" "gene_19661"`

`"gene_15447" "gene_12013" "gene_7964" "gene_13210" "gene_3461"`

`"gene_3737" "gene_15896" "gene_13497" "gene_17801" "gene_15633"`

`"gene_706" "gene_10460" "gene_3862" "gene_10950" "gene_10284"`

`"gene_9626" "gene_14866" "gene_3439" "gene_4618" "gene_3458"`

## References

1. Metwally, A.A.; Aldirawi, H.; Yang, J. A review on probabilistic models used in microbiome studies. Commun. Inf. Syst. **2018**, 18, 173–191.
2. Romero, R.; Hassan, S.S.; Gajer, P.; Tarca, A.L.; Fadrosh, D.W.; Nikita, L.; Galuppi, M.; Lamont, R.F.; Chaemsaithong, P.; Miranda, J.; et al. The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. Microbiome **2014**, 2, 4.
3. Sarkar, A.; Pati, D.; Mallick, B.K.; Carroll, R.J. Bayesian copula density deconvolution for zero-inflated data in nutritional epidemiology. J. Am. Stat. Assoc. **2021**, 116, 1075–1087.
4. Aljabri, D.; Vaughn, A.; Austin, M.; White, L.; Li, Z.; Naessens, J.; Spaulding, A. An investigation of healthcare worker perception of their workplace safety and incidence of injury. Workplace Health Saf. **2020**, 68, 214–225.
5. Chen, P.; Liu, Q.; Sun, F. Bicycle parking security and built environments. Transp. Res. Part D Transp. Environ. **2018**, 62, 169–178.
6. Kim, A. Social exclusion of multicultural families in Korea. Soc. Sci. **2018**, 7, 63.
7. Aldirawi, H.; Yang, J.; Metwally, A.A. Identifying Appropriate Probabilistic Models for Sparse Discrete Omics Data. In Proceedings of the 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Chicago, IL, USA, 19–22 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
8. Aldirawi, H.; Yang, J. Modeling Sparse Data Using MLE with Applications to Microbiome Data. J. Stat. Theory Pract. **2022**, 16, 13.
9. Jiang, R.; Sun, T.; Song, D.; Li, J.J. Statistics or biology: The zero-inflation controversy about scRNA-seq data. Genome Biol. **2022**, 23, 1–24.
10. Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics **1992**, 34, 1–14.
11. Greene, W.H. Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models. NYU Working Paper No. EC-94-10. 1994. Available online: https://ssrn.com/abstract=1293115 (accessed on 5 November 2022).
12. Reid, G.; Bocking, A. The potential for probiotics to prevent bacterial vaginosis and preterm labor. Am. J. Obstet. Gynecol. **2003**, 189, 1202–1208.
13. Witkin, S.S.; Linhares, I.M. Why do lactobacilli dominate the human vaginal microbiota? BJOG Int. J. Obstet. Gynaecol. **2017**, 124, 606–611.
14. Eschenbach, D.A.; Davick, P.R.; Williams, B.L.; Klebanoff, S.J.; Young-Smith, K.; Critchlow, C.M.; Holmes, K.K. Prevalence of hydrogen peroxide-producing Lactobacillus species in normal women and women with bacterial vaginosis. J. Clin. Microbiol. **1989**, 27, 251–256.
15. Hawes, S.E.; Hillier, S.L.; Benedetti, J.; Stevens, C.E.; Koutsky, L.A.; Wølner-Hanssen, P.; Holmes, K.K. Hydrogen peroxide-producing lactobacilli and acquisition of vaginal infections. J. Infect. Dis. **1996**, 174, 1058–1063.
16. Klaenhammer, T.R. Bacteriocins of lactic acid bacteria. Biochimie **1988**, 70, 337–349.
17. Ng, S.; Hart, A.; Kamm, M.; Stagg, A.; Knight, S.C. Mechanisms of action of probiotics: Recent advances. Inflamm. Bowel Dis. **2009**, 15, 300–310.
18. Koedooder, R.; Singer, M.; Schoenmakers, S.; Savelkoul, P.H.; Morré, S.A.; de Jonge, J.D.; Poort, L.; Cuypers, W.J.S.; Beckers, N.; Broekmans, F.; et al. The vaginal microbiome as a predictor for outcome of in vitro fertilization with or without intracytoplasmic sperm injection: A prospective study. Hum. Reprod. **2019**, 34, 1042–1054.
19. Chen, E.Z.; Li, H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics **2016**, 32, 2611–2617.
20. Zhang, X.; Guo, B.; Yi, N. Zero-inflated Gaussian mixed models for analyzing longitudinal microbiome data. PLoS ONE **2020**, 15, e0242073.
21. Harrison, C.W.; He, Q.; Huang, H.H. Clustering Gene Expressions Using the Table Invitation Prior. Genes **2022**, 13, 2036.
22. Ahlmann-Eltze, C.; Huber, W. glmGamPoi: Fitting Gamma-Poisson generalized linear models on single cell count data. Bioinformatics **2020**, 36, 5701–5702.
23. Ji, F.; Sadreyev, R.I. RNA-seq: Basic bioinformatics analysis. Curr. Protoc. Mol. Biol. **2018**, 124, e68.
24. Zappia, L.; Phipson, B.; Oshlack, A. Splatter: Simulation of single-cell RNA sequencing data. Genome Biol. **2017**, 18, 174.
25. Kharchenko, P.V.; Silberstein, L.; Scadden, D.T. Bayesian approach to single-cell differential expression analysis. Nat. Methods **2014**, 11, 740–742.
26. McDavid, A.; Finak, G.; Chattopadyay, P.K.; Dominguez, M.; Lamoreaux, L.; Ma, S.S.; Roederer, M.; Gottardo, R. Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments. Bioinformatics **2013**, 29, 461–467.
27. Peng, X.; Li, G.; Liu, Z. Zero-inflated beta regression for differential abundance analysis with metagenomics data. J. Comput. Biol. **2016**, 23, 102–110.
28. Cho, H.; Liu, C.; Park, J.; Wu, D. bzinb: Bivariate Zero-Inflated Negative Binomial Model Estimator; R Package Version 1.0.4; R Foundation for Statistical Computing: Vienna, Austria, 2018.
29. Balderama, E.; Trippe, T. hurdlr: Zero-Inflated and Hurdle Modelling Using Bayesian Inference; R Package Version 0.1; R Foundation for Statistical Computing: Vienna, Austria, 2017.
30. Wang, L.; Aldirawi, H.; Yang, J. iZID: Identify Zero-Inflated Distributions; R Package Version 0.0.1; R Foundation for Statistical Computing: Vienna, Austria, 2019.
31. Stasinopoulos, M. gamlss: Generalised Additive Models for Location Scale and Shape; R Package Version 0.0.1; R Foundation for Statistical Computing: Vienna, Austria, 2022.
32. Jackman, S. pscl: Political Science Computational Laboratory; R Package Version 0.0.1; R Foundation for Statistical Computing: Vienna, Austria, 2020.
33. Croissant, Y.; Carlevaro, F.; Hoareau, S. mhurdle: Multiple Hurdle Tobit Models; R Package Version 1.3.0; R Foundation for Statistical Computing: Vienna, Austria, 2021.
34. Waudby-Smith, I.; Li, P. rbtt: Alternative Bootstrap-Based t-Test Aiming to Reduce Type-I Error for Non-Negative, Zero-Inflated Data; R Package Version 0.1.0; R Foundation for Statistical Computing: Vienna, Austria, 2017.
35. Peng, X.; Li, G.; Liu, Z.; Chen, H. ZIBseq: Differential Abundance Analysis for Metagenomic Data via Zero-Inflated Beta Regression; R Package Version 1.2; R Foundation for Statistical Computing: Vienna, Austria, 2017.
36. Jochmann, M. zic: Bayesian Inference for Zero-Inflated Count Models; R Package Version 0.9.1; R Foundation for Statistical Computing: Vienna, Austria, 2017.
37. Yang, M.; Zamba, G.; Cavanaugh, J. ZIM: Zero-Inflated Models (ZIM) for Count Time Series with Excess Zeros; R Package Version 1.1.0; R Foundation for Statistical Computing: Vienna, Austria, 2018.
38. Xu, Z.J.; Liu, Y. ziphsmm: Zero-Inflated Poisson Hidden (Semi-)Markov Models; R Package Version 2.0.6; R Foundation for Statistical Computing: Vienna, Austria, 2018.
39. Wang, L.; Aldirawi, H.; Yang, J. Identifying zero-inflated distributions with a new R package iZID. Commun. Inf. Syst. **2020**, 20, 23–44.
40. Dousti Mousavi, N.; Aldirawi, H.; Yang, J. AZIAD: Analyzing Zero-Inflated and Zero-Altered Data; R Package Version 0.0.2; R Foundation for Statistical Computing: Vienna, Austria, 2022.
41. Dousti Mousavi, N.; Aldirawi, H.; Yang, J. An R Package AZIAD for Analyzing Zero-Inflated and Zero-Altered Data. arXiv **2022**, arXiv:2205.01294.
42. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009.
43. Burnham, K.P.; Anderson, D.R. Multimodel inference: Understanding AIC and BIC in model selection. Sociol. Methods Res. **2004**, 33, 261–304.
44. Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The cancer genome atlas pan-cancer analysis project. Nat. Genet. **2013**, 45, 1113–1120.
45. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Analytical Methods for Social Research; Cambridge University Press: Cambridge, UK, 2006.
46. Metwally, A.A.; Yang, J.; Ascoli, C.; Dai, Y.; Finn, P.W.; Perkins, D.L. MetaLonDA: A flexible R package for identifying time intervals of differentially abundant features in metagenomic longitudinal studies. Microbiome **2018**, 6, 32.
47. Harrison, C.W.; He, Q.; Huang, H.H. tip: Bayesian Clustering Using the Table Invitation Prior (TIP); R Package Version 0.1.0; R Foundation for Statistical Computing: Vienna, Austria, 2022.

**Figure 1.** Linear interpolated Lactobacillus readings of 22 pregnant women over 38 weeks of pregnancy.

**Figure 6.** Cumulative proportions of variance explained by various numbers of principal components based on the 50 selected genes.

**Table 1.** KS-test p values for combined data ($N=52$) under 12 distributions at three different weeks.

Time (Week) | N | ZIN | NH | HN | ZIHN | HNH | LN | ZILN | LNH | E | ZIE | EH |
---|---|---|---|---|---|---|---|---|---|---|---|---|
week 10 | 1.000 | 0.390 | 1.000 | 0.315 | 0.510 | 0.000 | 0.000 | 0.000 | 0.720 | 0.335 | 0.510 | 0.015 |
week 22 | 1.000 | 0.160 | 1.000 | 0.605 | 0.625 | 0.000 | 0.000 | 0.000 | 0.955 | 0.100 | 0.300 | 0.020 |
week 36 | 1.000 | 0.175 | 1.000 | 0.560 | 0.650 | 0.015 | 0.000 | 0.000 | 0.960 | 0.525 | 0.450 | 0.030 |

Number of Genes | Prediction Error |
---|---|
20 | 0.1500 |
50 | 0.0037 |
100 | 0.0037 |
2426 | 0.0012 |
7 PCA | 0.0200 |

**Table 3.** Estimated prediction error rate by 5-fold cross-validation with 1-nearest neighbor classifier.

Number of Genes | Prediction Error |
---|---|
10 | 0.0480 |
20 | 0.0012 |
30 | 0.0012 |
40 | 0.0024 |
50 | 0 |
60 | 0.0012 |
100 | 0.0012 |
7 PCA | 0.0037 |
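For readers who want to see how an error rate of this kind can be estimated, here is a minimal base-R sketch of 5-fold cross-validation with a 1-nearest-neighbor classifier. It is not the authors' code: the helper name `cv_error_1nn`, the use of Euclidean distance, the random fold assignment, and the simulated three-class data are all assumptions made for illustration.

```r
# Hypothetical helper: k-fold cross-validated error rate of a 1-nearest-neighbor
# classifier with Euclidean distance. X is a numeric matrix (rows = samples),
# y a vector of class labels.
cv_error_1nn <- function(X, y, folds = 5) {
  n <- nrow(X)
  fold_id <- sample(rep(seq_len(folds), length.out = n))  # random fold labels
  d <- as.matrix(dist(X))                                 # pairwise distances
  wrong <- 0
  for (f in seq_len(folds)) {
    test  <- which(fold_id == f)
    train <- which(fold_id != f)
    # For each test sample, find the closest training sample.
    nearest <- train[apply(d[test, train, drop = FALSE], 1, which.min)]
    wrong <- wrong + sum(y[nearest] != y[test])
  }
  wrong / n
}

# Simulated data: three well-separated classes, 30 samples each, 5 features.
set.seed(2023)
labels  <- rep(c("a", "b", "c"), each = 30)
centers <- c(a = 0, b = 4, c = 8)
X <- matrix(rnorm(90 * 5, mean = centers[labels]), nrow = 90)

err <- cv_error_1nn(X, labels)
```

Each sample is classified exactly once, by its nearest neighbor among the samples outside its own fold, so `err` is the fraction of misclassified samples over the whole data set.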


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Dousti Mousavi, N.; Yang, J.; Aldirawi, H.
Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data. *Genes* **2023**, *14*, 403.
https://doi.org/10.3390/genes14020403
