# DExMA: An R Package for Performing Gene Expression Meta-Analysis with Missing Genes

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Meta-Analysis Methods

#### 2.1.1. Effect Size Combination Methods

- There is independence between the experimental and the control group.
- Both the experimental and control groups are normally distributed with means $\mu_{E}$ and $\mu_{C}$, respectively, and share the same variance $\sigma^{2}$.

The Hedges' g effect size of the i-th study ($T_{i}$) is described as:

$$T_{i}=c\left(m\right)\frac{\overline{y}_{E}-\overline{y}_{C}}{S}$$

where:

- $c\left(m\right)=1-\frac{3}{4\left(n_{E}+n_{C}\right)-9}$ is a factor that corrects the positive bias of the estimator in small samples. $n_{E}$ and $n_{C}$ are the sample sizes of the experimental and control groups, respectively.
- $\overline{y}_{E}$ and $\overline{y}_{C}$ are the gene expression means of the experimental and control groups, respectively.
- $S=\sqrt{\frac{\left(n_{E}-1\right)S_{E}^{2}+\left(n_{C}-1\right)S_{C}^{2}}{n_{E}+n_{C}-2}}$ is the pooled standard deviation of the two groups. $S_{E}^{2}$ and $S_{C}^{2}$ are the variances in the experimental and control groups, respectively.
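As a minimal sketch of the effect size described above, the following Python function computes a bias-corrected standardized mean difference for one gene in one study. The function name `hedges_g` and the expression values are hypothetical, and the correction factor uses the usual Hedges form with numerator 3; this is an illustration under those assumptions, not the code of the DExMA package.

```python
import math
from statistics import mean, variance


def hedges_g(experimental, control):
    """Bias-corrected standardized mean difference between two groups."""
    n_e, n_c = len(experimental), len(control)
    # Pooled standard deviation S of the experimental and control groups
    pooled_sd = math.sqrt(
        ((n_e - 1) * variance(experimental) + (n_c - 1) * variance(control))
        / (n_e + n_c - 2)
    )
    # Small-sample correction factor c(m) (assumed standard Hedges form)
    correction = 1 - 3 / (4 * (n_e + n_c) - 9)
    return correction * (mean(experimental) - mean(control)) / pooled_sd


# Hypothetical log-expression values of one gene in cases and controls
cases = [5.1, 4.9, 5.6, 5.8, 5.3]
controls = [4.2, 4.6, 4.4, 4.0, 4.5]
g = hedges_g(cases, controls)
print(f"Hedges' g = {g:.3f}")
```

A positive value indicates higher expression in the experimental group; the sketch needs at least two samples per group so that both variances are defined.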

**FEM** assumes that all studies share one true common effect size, so studies that carry more information (smaller variance) receive greater weight in the combined effect size. The combined effect size ($\overline{T}$) and its variance ($V$) for k studies are calculated as [15]:

$$\overline{T}=\frac{\sum_{i=1}^{k}\omega_{i}T_{i}}{\sum_{i=1}^{k}\omega_{i}},\qquad V=\frac{1}{\sum_{i=1}^{k}\omega_{i}}\tag{2}$$

where:

- $T_{i}$ is the effect size of the i-th study.
- $\omega_{i}$ is the weight assigned to the i-th study. In a meta-analysis, the inverse of the variance is used as the weight, $\omega_{i}=\frac{1}{V_{i}}$.

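**REM** considers that the true effect size varies from one study to another, that is, there is a distribution of true effect sizes. The combined effect size ($\overline{T}^{*}$) and its variance ($V^{*}$) are calculated as:

$$\overline{T}^{*}=\frac{\sum_{i=1}^{k}\omega_{i}^{*}T_{i}}{\sum_{i=1}^{k}\omega_{i}^{*}},\qquad V^{*}=\frac{1}{\sum_{i=1}^{k}\omega_{i}^{*}}$$

where the weights $\omega_{i}^{*}=\frac{1}{V_{i}+\tau^{2}}$ take into account both the within-study variance ($V_{i}$) and the between-study variance ($\tau^{2}$) [15]. The between-study variance is obtained as:

$$\tau^{2}=\begin{cases}\frac{Q-df}{C} & \text{if } Q>df\\ 0 & \text{if } Q\le df\end{cases}$$

where: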

- $Q=\sum_{i=1}^{k}\omega_{i}\left(T_{i}-T_{.}\right)^{2}$ represents the total variance, where:
  - $\omega_{i}$ is the weight calculated for the Fixed Effects Model.
  - $T_{.}$ is the combined effect size for the Fixed Effects Model (Equation (2)).
- $C=\sum_{i=1}^{k}\omega_{i}-\frac{\sum_{i=1}^{k}\omega_{i}^{2}}{\sum_{i=1}^{k}\omega_{i}}$ is a scaling factor that accounts for Q being a weighted sum of squares.
- df = k − 1 are the degrees of freedom for the meta-analysis.
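The two models can be sketched in a few lines of Python. The sketch assumes inverse-variance weights and the usual method-of-moments estimate of the between-study variance, $\tau^{2}=\max\left(0,\frac{Q-df}{C}\right)$; the function names and input values are hypothetical and do not reproduce the DExMA implementation.

```python
def fixed_effects(effects, variances):
    """Combined effect size and its variance under the Fixed Effects Model."""
    weights = [1 / v for v in variances]  # inverse-variance weights
    total = sum(weights)
    combined = sum(w * t for w, t in zip(weights, effects)) / total
    return combined, 1 / total


def between_study_variance(effects, variances):
    """Method-of-moments estimate of tau^2 (assumes at least two studies)."""
    weights = [1 / v for v in variances]
    t_fixed, _ = fixed_effects(effects, variances)
    # Q: weighted sum of squared deviations from the fixed-effects estimate
    q = sum(w * (t - t_fixed) ** 2 for w, t in zip(weights, effects))
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    df = len(effects) - 1
    return max(0.0, (q - df) / c)


def random_effects(effects, variances):
    """Combined effect size and its variance under the Random Effects Model."""
    tau2 = between_study_variance(effects, variances)
    # Each study variance is inflated by the between-study variance
    return fixed_effects(effects, [v + tau2 for v in variances])


# Hypothetical per-study effect sizes and variances for one gene
effects = [0.2, 0.5, 0.9]
variances = [0.04, 0.05, 0.03]
print(fixed_effects(effects, variances))
print(random_effects(effects, variances))
```

When the estimated between-study variance is zero, both functions return the same result, which reflects that REM reduces to FEM in the absence of heterogeneity.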

#### 2.1.2. p-Values Combination Methods

- $p_{1},\dots,p_{k}$ are the p-values from the k independent studies.
- The test statistics $t_{1},\dots,t_{k}$ have absolutely continuous probability distributions under their corresponding null hypotheses.

**Fisher’s method** calculates a statistic ($S_{F}$) from the sum of the logarithms of the p-values, $S_{F}=-2\times\sum_{i=1}^{k}\ln\left(p_{i}\right)$ [18]. Under the null hypothesis, $S_{F}$ follows a $\chi^{2}$ distribution with $2\times k$ degrees of freedom [16].
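A short Python sketch of Fisher's method follows. It relies on the fact that the chi-square survival function has a closed form when the degrees of freedom are even (here 2k), so no external statistics library is needed. The function name and the p-values are hypothetical, and every p-value is assumed to lie in (0, 1].

```python
import math


def fisher_combined_pvalue(pvalues):
    """Return Fisher's statistic S_F and the combined p-value."""
    k = len(pvalues)
    s_f = -2 * sum(math.log(p) for p in pvalues)
    # Survival function of a chi-square with 2k degrees of freedom:
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = s_f / 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return s_f, math.exp(-half) * total


# Hypothetical p-values of one gene in three independent studies
statistic, combined = fisher_combined_pvalue([0.01, 0.04, 0.20])
print(f"S_F = {statistic:.3f}, combined p-value = {combined:.5f}")
```

With a single study the combined p-value equals the input p-value, which is a convenient sanity check for the closed-form tail probability.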

**Stouffer’s method** transforms each p-value into a z-score, $Z_{i}=\varphi^{-1}\left(1-p_{i}\right)$ [16], where $\varphi$ is the standard normal cumulative distribution function. Then, for k independent studies, the statistic is calculated as the sum of the $Z_{i}$ values divided by the square root of the number of studies, $S_{S}=\frac{\sum_{i=1}^{k}Z_{i}}{\sqrt{k}}$. Under the null hypothesis, $S_{S}$ follows a standard normal distribution [16]. Moreover, Stouffer’s method allows the inclusion of a weight ($\omega_{i}$) for each of the datasets, $S_{S}=\frac{\sum_{i=1}^{k}\omega_{i}Z_{i}}{\sqrt{\sum_{i=1}^{k}\omega_{i}^{2}}}$. The DExMA package uses the square roots of the sample sizes as weights [19].
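The weighted and unweighted forms of Stouffer's statistic can be illustrated with Python's standard `statistics.NormalDist`. The function name, p-values and sample sizes are hypothetical, and the p-values are assumed to lie strictly between 0 and 1; this is a sketch of the formulas, not of the DExMA code.

```python
import math
from statistics import NormalDist


def stouffer_combined_pvalue(pvalues, weights=None):
    """Return Stouffer's statistic S_S and the combined p-value."""
    std_normal = NormalDist()  # standard normal distribution
    if weights is None:
        weights = [1.0] * len(pvalues)  # unweighted: sum(Z) / sqrt(k)
    z_scores = [std_normal.inv_cdf(1 - p) for p in pvalues]
    s_s = sum(w * z for w, z in zip(weights, z_scores)) / math.sqrt(
        sum(w * w for w in weights)
    )
    return s_s, 1 - std_normal.cdf(s_s)


# Hypothetical p-values and sample sizes of three studies
pvals = [0.01, 0.04, 0.20]
sample_sizes = [40, 25, 60]
sqrt_weights = [math.sqrt(n) for n in sample_sizes]
print(stouffer_combined_pvalue(pvals))
print(stouffer_combined_pvalue(pvals, sqrt_weights))
```

Setting all weights to the same value gives the unweighted statistic, because the common factor cancels between numerator and denominator.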

**Tippett’s method** (also called the minimum of p-values method) and **Wilkinson’s method** (also called the maximum of p-values method) use the minimum and the maximum of the p-values, respectively, as statistics, that is, $S_{T}=\min\left(p_{1},p_{2},\dots,p_{i},\dots,p_{k}\right)$ and $S_{W}=\max\left(p_{1},p_{2},\dots,p_{i},\dots,p_{k}\right)$. Under the null hypothesis, $S_{T}$ follows a Beta(1, k) distribution, while $S_{W}$ follows a Beta(k, 1) distribution.
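Because the cumulative distribution functions of Beta(1, k) and Beta(k, 1) have simple closed forms, $1-(1-x)^{k}$ and $x^{k}$ respectively, the combined p-values of both methods can be sketched directly. The function names and p-values below are hypothetical.

```python
def tippett_combined_pvalue(pvalues):
    """Minimum of p-values: statistic and P(Beta(1, k) <= statistic)."""
    k = len(pvalues)
    s_t = min(pvalues)
    return s_t, 1 - (1 - s_t) ** k


def wilkinson_combined_pvalue(pvalues):
    """Maximum of p-values: statistic and P(Beta(k, 1) <= statistic)."""
    k = len(pvalues)
    s_w = max(pvalues)
    return s_w, s_w ** k


# Hypothetical p-values of one gene in three independent studies
pvals = [0.01, 0.2, 0.5]
print(tippett_combined_pvalue(pvals))
print(wilkinson_combined_pvalue(pvals))
```

The minimum-based statistic reacts to a single very small p-value, whereas the maximum-based statistic is small only when every study yields a small p-value.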

**ACAT method** uses a weighted sum of the Cauchy transformation of the individual p-values, $S_{ACAT}=\sum_{i=1}^{k}\omega_{i}\tan\left[\left(0.5-p_{i}\right)\pi\right]$, as a statistic, where the weights $\omega_{i}$ are non-negative and $\sum_{i=1}^{k}\omega_{i}=1$. Under the null hypothesis, $S_{ACAT}$ follows a standard Cauchy distribution [20,21].
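The ACAT statistic and its p-value, taken from the upper tail of the standard Cauchy distribution, $0.5-\arctan(S_{ACAT})/\pi$, can be sketched as follows. The function name is hypothetical, equal weights are assumed when none are supplied, and user-supplied weights are rescaled to sum to one.

```python
import math


def acat_combined_pvalue(pvalues, weights=None):
    """Return the ACAT statistic and the combined p-value."""
    k = len(pvalues)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights (assumption of this sketch)
    else:
        total = sum(weights)
        weights = [w / total for w in weights]  # rescale to sum to 1
    # Weighted sum of Cauchy-transformed p-values
    s_acat = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    )
    # Upper-tail probability of a standard Cauchy distribution
    return s_acat, 0.5 - math.atan(s_acat) / math.pi


# Hypothetical p-values of one gene in three independent studies
print(acat_combined_pvalue([0.01, 0.04, 0.20]))
```

If all studies report the same p-value, the combined p-value equals that value, since the transformation and its inverse cancel.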

#### 2.2. Control of Missing Genes

## 3. Results

#### 3.1. The DExMA Package

- “listMatrixEX”: a list of four expression matrices.
- “listPhenodatas”: a list of the four phenodata dataframes corresponding to four expression matrices.
- “listExpressionSets”: a list of four ExpressionSet objects. It contains the same information as listMatrixEX and listPhenodatas.
- “ExpressionSetStudy5”: an ExpressionSet object similar to the ExpressionSet objects of listExpressionSets.
- “maObjectDif”: the meta-analysis object (objectMA) created from the listMatrixEX and listPhenodatas objects.
- “maObject”: the meta-analysis object (objectMA) after converting all the studies to Official Gene Symbol annotation.

#### 3.1.1. Meta-Analysis Object Creation

#### 3.1.2. Quality Control

The $I^{2}$ statistic measures the inconsistency, that is, the percentage of variation across studies that is due to heterogeneity [23]. Heterogeneity is considered low when the $I^{2}$ value is less than 0.25 [23]. Therefore, the studies can be regarded as homogeneous only if most of the $I^{2}$ values are below 0.25.
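For illustration, the $I^{2}$ value of a single gene can be computed from Cochran's Q as $I^{2}=\max\left(0,\frac{Q-df}{Q}\right)$, expressed here as a proportion so that it can be compared with the 0.25 threshold. The sketch below assumes inverse-variance weights; the function name and input values are hypothetical, and it does not reproduce the output of the package's heterogeneityTest function.

```python
def i_squared(effects, variances):
    """I^2 of one gene, as a proportion between 0 and 1."""
    weights = [1 / v for v in variances]  # inverse-variance weights
    t_fixed = sum(w * t for w, t in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted sum of squared deviations from the pooled effect
    q = sum(w * (t - t_fixed) ** 2 for w, t in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0  # identical effect sizes: no observed inconsistency
    return max(0.0, (q - df) / q)


# Hypothetical effect sizes and variances of one gene in three studies
print(i_squared([0.2, 0.5, 0.9], [0.04, 0.05, 0.03]))
```

Repeating the calculation for every gene gives the distribution of $I^{2}$ values whose quantiles are inspected during quality control.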

#### 3.1.3. Missing Gene Imputation

- imputValuesSample: the number of missing values imputed per sample.
- imputPercentageSample: the percentage of missing values imputed per sample.
- imputValuesGene: the number of missing values imputed per gene.
- imputPercentageGene: the percentage of missing values imputed per gene.

#### 3.1.4. Performing Gene Expression Meta-Analysis

#### 3.1.5. Other Useful Functions

#### 3.2. Applying DExMA to Real Data

$I^{2}$ greater than 0.71, so it was concluded that there was heterogeneity between the different datasets. Therefore, as all the studies belonged to the same tissue and there was heterogeneity between them, we decided to apply a Random Effects Model (REM).

- Using only common genes (common genes approach).
- Considering the genes that are present in at least two of the studies (66%) (minimum proportion approach).
- Imputing the missing genes before performing the meta-analysis (imputing missing genes approach).
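The first two approaches differ only in the proportion of studies in which a gene must be measured. A minimal sketch of that selection rule is shown below; the function name, the threshold argument and the gene sets are hypothetical, and the imputation approach is not covered.

```python
def select_genes(studies, min_proportion=1.0):
    """Keep genes measured in at least `min_proportion` of the studies.

    `studies` is a list with one set of gene identifiers per study.
    """
    all_genes = set().union(*studies)
    k = len(studies)
    return {
        gene
        for gene in all_genes
        if sum(gene in study for study in studies) / k >= min_proportion
    }


# Hypothetical gene content of three studies
study_genes = [
    {"IFI27", "STAT1", "MX1"},
    {"IFI27", "STAT1"},
    {"IFI27", "OAS1"},
]
common = select_genes(study_genes)                  # common genes approach
at_least_two = select_genes(study_genes, 0.66)      # minimum proportion approach
print(sorted(common), sorted(at_least_two))
```

With three studies, a threshold of 0.66 keeps every gene present in at least two of them, while the default threshold of 1.0 keeps only the genes shared by all studies.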

#### 3.3. Comparison to Other Available R Packages

## 4. Discussion

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
KNN | K-nearest neighbours |
FEM | Fixed Effects Model |
REM | Random Effects Model |
ID | Identifier |
SLE | systemic lupus erythematosus |

## Appendix A

#### Loading and Preparing the Case Study Data Directly from the ADEX Database

- GSE24706.tsv: gene expression matrix of the study GSE24706.
- GSE50772.tsv: gene expression matrix of the study GSE50772.
- GSE82221_GPL10558.tsv: gene expression matrix of the study GSE82221.
- metadata.tsv: dataframe with the information from the different samples of the studies (phenodata).

## References

1. Barrett, T.; Wilhite, S.E.; Ledoux, P.; Evangelista, C.; Kim, I.F.; Tomashevsky, M.; Marshall, K.A.; Phillippy, K.H.; Sherman, P.M.; Holko, M.; et al. NCBI GEO: Archive for Functional Genomics Data Sets—Update. Nucleic Acids Res. **2013**, 41, D991–D995.
2. Perez-Riverol, Y.; Zorin, A.; Dass, G.; Vu, M.-T.; Xu, P.; Glont, M.; Vizcaíno, J.A.; Jarnuczak, A.F.; Petryszak, R.; Ping, P.; et al. Quantifying the Impact of Public Omics Data. Nat. Commun. **2019**, 10, 3512.
3. Song, G.G.; Kim, J.-H.; Seo, Y.H.; Choi, S.J.; Ji, J.D.; Lee, Y.H. Meta-Analysis of Differentially Expressed Genes in Primary Sjogren’s Syndrome by Using Microarray. Hum. Immunol. **2014**, 75, 98–104.
4. Afroz, S.; Giddaluru, J.; Vishwakarma, S.; Naz, S.; Khan, A.A.; Khan, N. A Comprehensive Gene Expression Meta-Analysis Identifies Novel Immune Signatures in Rheumatoid Arthritis Patients. Front. Immunol. **2017**, 8, 74.
5. Badr, M.T.; Häcker, G. Gene Expression Profiling Meta-Analysis Reveals Novel Gene Signatures and Pathways Shared between Tuberculosis and Rheumatoid Arthritis. PLoS ONE **2019**, 14, e0213470.
6. Kelly, J.; Moyeed, R.; Carroll, C.; Albani, D.; Li, X. Gene Expression Meta-Analysis of Parkinson’s Disease and Its Relationship with Alzheimer’s Disease. Mol. Brain **2019**, 12, 16.
7. Ibáñez, K.; Boullosa, C.; Tabarés-Seisdedos, R.; Baudot, A.; Valencia, A. Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-Analyses. PLoS Genet. **2014**, 10, e1004173.
8. Ioannidis, J.P.A. The Mass Production of Redundant, Misleading, and Conflicted Systematic Reviews and Meta-Analyses. Milbank Q. **2016**, 94, 485–514.
9. Park, J.H.; Eisenhut, M.; van der Vliet, H.J.; Shin, J.I. Statistical Controversies in Clinical Research: Overlap and Errors in the Meta-Analyses of MicroRNA Genetic Association Studies in Cancers. Ann. Oncol. **2017**, 28, 1169–1182.
10. Haynes, W.A.; Vallania, F.; Liu, C.; Bongen, E.; Tomczak, A.; Andres-Terrè, M.; Lofgren, S.; Tam, A.; Deisseroth, C.A.; Li, M.D.; et al. Empowering Multi-Cohort Gene Expression Analysis to Increase Reproducibility. Pac. Symp. Biocomput. **2016**, 22, 144–153.
11. Prada, C.; Lima, D.; Nakaya, H. MetaVolcanoR: Gene Expression Meta-Analysis Visualization Tool. 2022. Available online: https://www.bioconductor.org/packages/release/bioc/html/MetaVolcanoR.html (accessed on 1 July 2022).
12. Bobak, C.A.; McDonnell, L.; Nemesure, M.D.; Lin, J.; Hill, J.E. Assessment of Imputation Methods for Missing Gene Expression Data in Meta-Analysis of Distinct Cohorts of Tuberculosis Patients. Pac. Symp. Biocomput. **2020**, 25, 307–318.
13. Mancuso, C.A.; Canfield, J.L.; Singla, D.; Krishnan, A. A Flexible, Interpretable, and Accurate Approach for Imputing the Expression of Unmeasured Genes. Nucleic Acids Res. **2020**, 48, e125.
14. Toro-Domínguez, D.; Villatoro-García, J.A.; Martorell-Marugán, J.; Román-Montoya, Y.; Alarcón-Riquelme, M.E.; Carmona-Sáez, P. A Survey of Gene Expression Meta-Analysis: Methods and Applications. Brief. Bioinform. **2021**, 22, 1694–1705.
15. Borenstein, M.; Hedges, L.V.; Higgins, J.P.T.; Rothstein, H.R. Introduction to Meta-Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2021; ISBN 978-1-119-55838-5.
16. Heard, N.A.; Rubin-Delanchy, P. Choosing between Methods of Combining p-Values. Biometrika **2018**, 105, 239–246.
17. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies. Nucleic Acids Res. **2015**, 43, e47.
18. Li, J.; Tseng, G.C. An Adaptively Weighted Statistic for Detecting Differential Gene Expression When Combining Multiple Transcriptomic Studies. Ann. Appl. Stat. **2011**, 5, 994–1019.
19. Zaykin, D.V. Optimally Weighted Z-Test Is a Powerful Method for Combining Probabilities in Meta-Analysis. J. Evol. Biol. **2011**, 24, 1836–1841.
20. Liu, Y.; Chen, S.; Li, Z.; Morrison, A.C.; Boerwinkle, E.; Lin, X. ACAT: A Fast and Powerful p Value Combination Method for Rare-Variant Analysis in Sequencing Studies. Am. J. Hum. Genet. **2019**, 104, 410–421.
21. Liu, Y.; Xie, J. Cauchy Combination Test: A Powerful Test with Analytic p-Value Calculation under Arbitrary Dependency Structures. J. Am. Stat. Assoc. **2020**, 115, 393–402.
22. Higgins, J.P.T.; Thompson, S.G. Quantifying Heterogeneity in a Meta-Analysis. Stat. Med. **2002**, 21, 1539–1558.
23. Higgins, J.P.T.; Thompson, S.G.; Deeks, J.J.; Altman, D.G. Measuring Inconsistency in Meta-Analyses. BMJ **2003**, 327, 557–560.
24. Wickham, H.; Seidel, D. Scales: Scale Functions for Visualization. 2020. Available online: https://cran.r-project.org/web/packages/scales/index.html (accessed on 30 June 2022).
25. Lazar, C.; Meganck, S.; Taminau, J.; Steenhoff, D.; Coletta, A.; Molter, C.; Weiss-Solís, D.Y.; Duque, R.; Bersini, H.; Nowé, A. Batch Effect Removal Methods for Microarray Gene Expression Data Integration: A Survey. Brief. Bioinform. **2013**, 14, 469–490.
26. Martorell-Marugán, J.; López-Domínguez, R.; García-Moreno, A.; Toro-Domínguez, D.; Villatoro-García, J.A.; Barturen, G.; Martín-Gómez, A.; Troule, K.; Gómez-López, G.; Al-Shahrour, F.; et al. A Comprehensive Database for Integrated Analysis of Omics Data in Autoimmune Diseases. BMC Bioinform. **2021**, 22, 343.
27. Li, Q.-Z.; Karp, D.R.; Quan, J.; Branch, V.K.; Zhou, J.; Lian, Y.; Chong, B.F.; Wakeland, E.K.; Olsen, N.J. Risk Factors for ANA Positivity in Healthy Persons. Arthritis Res. Ther. **2011**, 13, R38.
28. Kennedy, W.P.; Maciuca, R.; Wolslegel, K.; Tew, W.; Abbas, A.R.; Chaivorapol, C.; Morimoto, A.; McBride, J.M.; Brunetta, P.; Richardson, B.C.; et al. Association of the Interferon Signature Metric with Serological Disease Manifestations but Not Global Activity Scores in Multiple Cohorts of Patients with SLE. Lupus Sci. Med. **2015**, 2, e000080.
29. Zhu, H.; Mi, W.; Luo, H.; Chen, T.; Liu, S.; Raman, I.; Zuo, X.; Li, Q.-Z. Whole-Genome Transcription and DNA Methylation Analysis of Peripheral Blood Mononuclear Cells Identified Aberrant Gene Regulation Pathways in Systemic Lupus Erythematosus. Arthritis Res. Ther. **2016**, 18, 162.
30. Carmona-Saez, P.; Chagoyen, M.; Tirado, F.; Carazo, J.M.; Pascual-Montano, A. GENECODIS: A Web-Based Tool for Finding Significant Concurrent Annotations in Gene Lists. Genome Biol. **2007**, 8, R3.
31. Garcia-Moreno, A.; López-Domínguez, R.; Villatoro-García, J.A.; Ramirez-Mena, A.; Aparicio-Puerta, E.; Hackenberg, M.; Pascual-Montano, A.; Carmona-Saez, P. Functional Enrichment Analysis of Regulatory Elements. Biomedicines **2022**, 10, 590.
32. Huang, R.; Grishagin, I.; Wang, Y.; Zhao, T.; Greene, J.; Obenauer, J.C.; Ngan, D.; Nguyen, D.-T.; Guha, R.; Jadhav, A.; et al. The NCATS BioPlanet—An Integrated Platform for Exploring the Universe of Cellular Signaling Pathways for Toxicology, Systems Biology, and Chemical Genomics. Front. Pharmacol. **2019**, 10, 445.
33. Stevens, J.R.; Nicholas, G. Metahdep: Hierarchical Dependence in Meta-Analysis. 2022. Available online: https://www.bioconductor.org/packages/release/bioc/html/metahdep.html (accessed on 25 June 2022).
34. Lusa, L.; Gentleman, R.; Ruschhaupt, M. GeneMeta: MetaAnalysis for High Throughput Experiments. 2021. Available online: https://www.bioconductor.org/packages/release/bioc/html/GeneMeta.html (accessed on 25 June 2022).
35. Marot, G.; Rau, A.; Jaffrezic, F.; Blanck, S. MetaRNASeq: Meta-Analysis of RNA-Seq Data. 2021. Available online: https://cran.r-project.org/web/packages/metaRNASeq/index.html (accessed on 27 June 2022).
36. Tsuyuzaki, K.; Nikaido, I. MetaSeq: Meta-Analysis of RNA-Seq Count Data in Multiple Studies. 2022. Available online: https://www.bioconductor.org/packages/release/bioc/html/metaSeq.html (accessed on 27 June 2022).
37. Marot, G. MetaMA: Meta-Analysis for MicroArrays. 2022. Available online: https://cran.r-project.org/web/packages/metaMA/index.html (accessed on 27 June 2022).
38. Pickering, A. Crossmeta: Cross Platform Meta-Analysis of Microarray Data. 2022. Available online: https://www.bioconductor.org/packages/release/bioc/html/crossmeta.html (accessed on 27 June 2022).

**Figure 1.** DExMA workflow. The figure shows the main steps of the DExMA package workflow: (1) data load and meta-analysis object creation, (2) gene annotation, (3) quality control, (4) missing gene imputation (optional), (5) gene expression meta-analysis, and (6) visualization.

**Figure 2.** Heterogeneity QQ-plot. QQ-plot of Cochran’s heterogeneity test values. The further the values are from the reference distribution (central line), the more heterogeneity there is.

**Figure 3.** Synthetic data heatmap. Heatmap of the meta-analysis results for the 40 most significant genes. The red colour indicates that a gene is overexpressed in that sample, blue that it is under-expressed, and grey that it is not present.

**Figure 4.** Heterogeneity QQ-plot of SLE data. QQ-plot of Cochran’s heterogeneity test values for the SLE case study data.

**Figure 6.** Graphical representations of the most significant pathways for each of the meta-analysis approaches. (**a**) Ten most significant pathways in the common genes approach. (**b**) Ten most significant pathways considering genes that are contained in at least two studies. (**c**) Ten most significant pathways in the missing genes imputation approach. (**d**) Number of significant genes in the top pathways in each approach.

Function | Description |
---|---|
allsameID | Sets all datasets of objectMA in the same annotation (Official Gene Symbol, Entrez, or Ensembl) |
batchRemove | Reduces the effects of batch or bias through the use of covariates |
calculateES | Calculates the effect sizes and their variances for each gene and each dataset using Hedges’ g estimator |
createObjectMA | Creates the meta-analysis object (objectMA) |
dataLog | Checks if data are log transformed and transforms them if they are not |
downloadGEOData | Downloads ExpressionSet objects from the GEO database |
elementObjectMA | Creates an object that can be added to a meta-analysis object (objectMA) |
heterogeneityTest | Shows a QQ-plot of Cochran’s test and the quantiles of the $I^{2}$ statistic values to measure heterogeneity |
makeHeatmap | Shows a heatmap with the expression of significant genes across samples |
metaAnalysisDE | Performs a meta-analysis using the selected method |
pvalueIndAnalysis | Performs a differential expression analysis in each of the studies to obtain the p-values |
missGenesImput | Imputes missing genes using the sampleKNN method |

**Table 2.** Comparison of the main features of gene expression meta-analysis packages. Input: “User data” means that the user can enter their own data, while “GEO data” means that the user can include GEO database codes. QC (quality control): “Yes” if the package has implemented functions for performing quality controls. ES: “Yes” if the package performs effect size combination methods. PV: “Yes” if the package performs p-value combination methods. Considers Missing Genes: “Yes” if the package somehow considers the unmeasured genes. Imputes Missing Genes: “Yes” if the package somehow imputes the unmeasured genes. Visualization: “Yes” if the package has implemented a function to visualize the results.

Package | Input | QC | ES | PV | Considers Missing Genes | Imputes Missing Genes | Visualization |
---|---|---|---|---|---|---|---|
DExMA | GEO/User data | Yes | Yes | Yes | Yes | Yes | Yes |
MetaIntegrator [10] | User data | Yes | Yes | Yes | Yes | No | Yes |
GeneMeta [34] | User data | No | Yes | No | No | No | Yes |
Metahdep [33] | User data | No | Yes | No | No | No | No |
Crossmeta [38] | User data | No | Yes | No | Yes | No | No |
metaMA [37] | User data | No | Yes | Yes | No | No | No |
metaRNASeq [35] | User data | No | No | No | No | No | Yes |
metaSeq [36] | User data | No | No | No | No | No | No |
MetaVolcanoR [11] | User data | No | Yes | Yes | Yes | No | Yes |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Villatoro-García, J.A.; Martorell-Marugán, J.; Toro-Domínguez, D.; Román-Montoya, Y.; Femia, P.; Carmona-Sáez, P.
DExMA: An R Package for Performing Gene Expression Meta-Analysis with Missing Genes. *Mathematics* **2022**, *10*, 3376.
https://doi.org/10.3390/math10183376
