The Role of Microarray in Modern Sequencing: Statistical Approach Matters in a Comparison Between Microarray and RNA-Seq

Raplee, Isaac D.; Borkar, Samiksha A.; Yin, Li; Venturi, Guglielmo M.; Shen, Jerry; Chang, Kai-Fen; Nepal, Upasana; Sleasman, John W.; Goodenow, Maureen M.

doi:10.3390/biotech14030055

by Isaac D. Raplee ¹,*, Samiksha A. Borkar ¹, Li Yin ¹, Guglielmo M. Venturi ², Jerry Shen ¹, Kai-Fen Chang ¹, Upasana Nepal ¹, John W. Sleasman ² and Maureen M. Goodenow ¹

¹ Molecular HIV and Host Interactions Section, National Institute of Allergy and Infectious Diseases, National Institutes of Health, 50 South Drive, Bethesda, MD 20894, USA

² Division of Allergy and Immunology, Department of Pediatrics, Duke University School of Medicine, Durham, NC 27710, USA

* Author to whom correspondence should be addressed.

BioTech 2025, 14(3), 55; https://doi.org/10.3390/biotech14030055

Submission received: 1 May 2025 / Revised: 12 June 2025 / Accepted: 1 July 2025 / Published: 5 July 2025

(This article belongs to the Section Computational Biology)

Abstract

Gene expression analysis is crucial in understanding cellular processes, development, health, and disease. With RNA-seq outpacing microarray as the chosen platform for gene expression, is there space for array data in future profiling? This study involved 35 participants from the Adolescent Medicine Trials Network for HIV/AIDS Intervention protocol. RNA was isolated from whole blood samples and analyzed using both microarray and RNA-seq technologies. Data processing included quality control, normalization, and statistical analysis using non-parametric Mann–Whitney U tests. Differential expression analysis and pathway analysis were conducted to compare the outputs of the two platforms. The study found a high correlation in gene expression profiles between microarray and RNA-seq, with a median Pearson correlation coefficient of 0.76. RNA-seq identified 2395 differentially expressed genes (DEGs), while microarray identified 427 DEGs, with 223 DEGs shared between the two platforms. Pathway analysis revealed 205 perturbed pathways by RNA-seq and 47 by microarray, with 30 pathways shared. Both microarray and RNA-seq technologies provide highly concordant results when analyzed with consistent non-parametric statistical methods. The findings emphasize that both methods are reliable for gene expression analysis and can be used complementarily to enhance the robustness of biological insights.

Keywords: microarray; RNA-sequencing; non-parametric; HIV; transcriptomics; gene expression

Key Contribution: Microarray data remains a valid and relevant tool for gene expression analysis, even in an era of RNA-seq dominance.

1. Introduction

Gene expression analysis plays an important role in evaluating the molecular mechanisms underlying cellular processes, development, health and disease [1,2,3]. RNA sequencing (RNA-seq) and microarray technologies are two well-established methods for quantifying gene expression profiles. Throughout the late 1990s and early 2000s, microarray technology was the cornerstone of transcriptome profiling and the source of most datasets deposited in the Gene Expression Omnibus (GEO) repository. The landscape of transcriptomics in GEO has shifted with the advent of RNA-seq technology, which, as of 2023, comprises 85% of all submissions [4].

While both technologies start with mRNA and polymerase chain reaction (PCR) amplification to produce cDNA, the two platforms differ greatly in their gene expression quantification technologies. Microarrays detect fluorescently labeled cDNA through hybridization to complementary sequences on a solid surface. The output is measured as a continuous variable, represented by fluorescence intensity. In contrast, RNA-seq leverages next-generation sequencing (NGS) of cDNA molecules, providing a digital readout of transcript abundance and sequence information.

Comparisons between RNA-seq and microarray technologies for gene expression yield both similar and different findings regarding the comparability of profiling by each method [5,6,7,8,9]. The gene expression discordance found in these studies may be attributed to inherent differences introduced by sample variability, preparation, and analytical approach. One study using technical replicates and simulated data to assess the differentially expressed gene (DEG) profiles found strong correlations between microarray and RNA-seq, with most discrepancies related to the different analytic algorithms for each platform [10]. Zhang et al. found that despite DEG discrepancies, microarray and RNA-seq had similar clinical endpoint predictions [11]. A comparison performed starting with the same samples and an appropriate non-parametric statistical approach may reduce gene expression discrepancies and enhance downstream pathway analyses.

In the last few years, increased application of artificial intelligence (AI), machine learning (ML), and similar approaches to transcriptomics research has offered powerful tools for data integration, spatial omics, and pattern recognition. The easy access to publicly available repositories of transcriptomic datasets may contribute to the development of increasingly sophisticated algorithms. Harnessing legacy microarray and RNA-seq studies, which often include data from hard to acquire or rare cohorts, may provide valuable resources for training AI models when analyzed appropriately.

In this comparison study using RNA-seq and microarray data, the same statistical approaches were applied to analyze the transcriptome profile of the same peripheral blood cell (PBC) samples. The goal of this study was to minimize DEG discrepancies and assess the relatedness of the functional analyses’ outputs between microarray and RNA-seq.

2. Materials and Methods

2.1. Clinical Profile of the Study Participants

The study participants were enrolled through the Adolescent Medicine Trials Network (ATN) for HIV/AIDS Intervention protocol 071/101 at 22 urban sites across the United States and Puerto Rico (ClinicalTrials.gov; https://clinicaltrials.gov, Clinical Identifier No NCT00683579). The enrollment procedure and primary outcome results for this 3-year study have been reported [12,13,14,15]. A sub-study of 35 participants aged 18–25 years included 22 youth without HIV (YWOH) and 13 youth with HIV (YWH) (Table S1) selected based on the availability of whole blood samples for both microarray and RNA-seq analyses. The study participants across the groups were predominantly male (79%) and African American (70%). YWH were on combination antiretroviral therapy (ART) with suppressed viral loads (<50 HIV-1 RNA copies/mL plasma) and reported use of marijuana and tobacco, while YWOH reported no substance use.

2.2. RNA Isolation, Hybridization and Sequencing

Total intracellular RNA was isolated from whole blood cell samples collected in PAXgene Blood RNA tubes (Becton, Dickinson and Company, Franklin Lakes, NJ, USA) using the PAXgene Blood RNA Kit (PreAnalytiX, Hombrechtikon, Switzerland) as previously described [14,15]. Globin mRNA was depleted using the GLOBINclear Kit (Ambion, Waltham, MA, USA), and RNA quality was assessed by Agilent Bioanalyzer for an RNA Integrity Number (RIN) above 7. For microarray analysis, 100 ng aliquots of globin-reduced RNA were poly(A) selected, amplified and labeled using the GeneChip 3′ IVT Express Kit (Affymetrix, Waltham, MA, USA) and hybridized to the GeneChip Human Genome U133 Plus 2.0 Array with 54,675 probes representing 20,174 genes (Affymetrix, Waltham, MA, USA) in the Interdisciplinary Center for Biotechnology Research (ICBR) at the University of Florida. For RNA-seq, 100 ng aliquots of globin-reduced RNA were processed with the poly(A) mRNA Magnetic Isolation Module and NEBNext Ultra II RNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA). Libraries were uniquely barcoded and sequenced on the Illumina HiSeq 3000 platform (2 × 100 cycles) (Illumina, San Diego, CA, USA) at the ICBR, generating 50 million paired-end reads per sample.

2.3. Data Processing

The study analytic workflow is outlined in Figure 1. For microarray data, the raw signal intensities in CEL format generated with Affymetrix GeneChip Operating Software were evaluated for quality by checking mean values, variance, and paired scatter plots. The raw probe signal values were background-corrected, quantile-normalized, and summarized using Robust Multi-Array Averaging (RMA) with the rma function from the affy R/Bioconductor package [16]. All expression values were converted to a log₂ scale for downstream analysis. Microarray data were filtered by removing the genes in the lowest 25% by interquartile range (IQR) using the R package genefilter version 1.84.0 and annotated using the hgu133plus2.db package, version 3.13.0, in R.
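The IQR-based filtering step can be illustrated with a minimal Python sketch. The study itself used the R package genefilter; the function names `iqr` and `filter_by_iqr`, the quartile method, and the toy expression values below are assumptions made for illustration only. The sketch assumes the filter drops the genes whose IQR across samples falls in the lowest quarter of all gene IQRs.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range of one gene's log2 expression across samples."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def filter_by_iqr(expr, fraction=0.25):
    """Drop the lowest-IQR fraction of genes.

    expr maps a gene name to its list of log2 expression values.
    """
    iqrs = {gene: iqr(vals) for gene, vals in expr.items()}
    ranked = sorted(iqrs.values())
    n_drop = int(len(ranked) * fraction)  # number of genes to remove
    if n_drop == 0:
        return dict(expr)
    cutoff = ranked[n_drop - 1]
    # Genes tied with the cutoff are removed as well.
    return {g: v for g, v in expr.items() if iqrs[g] > cutoff}


# Hypothetical toy matrix: 4 genes x 5 samples.
toy = {
    "geneA": [5.0, 5.1, 5.0, 5.1, 5.0],  # nearly flat, lowest IQR
    "geneB": [4.0, 6.0, 5.0, 7.0, 3.0],
    "geneC": [8.0, 8.5, 9.0, 9.5, 10.0],
    "geneD": [2.0, 4.0, 2.5, 4.5, 3.0],
}
kept = filter_by_iqr(toy)
print(sorted(kept))
```

With four genes and a 25% cutoff, only the near-constant gene is removed; on a real matrix the same call would remove roughly a quarter of the probesets.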

For RNA-seq, raw reads were checked for quality control with FastQC [17]. Low-quality reads and residual adaptor sequences were trimmed using Trimmomatic [18] and aligned to the UCSC reference transcriptome, which included 26,475 annotated genes [19]. Read counts for each gene were obtained, and transcripts per million (TPM) values were calculated. The aligned reads were assessed for batch variation and outliers using BatchQC version 2.0.0 [20]. RNA-seq data were filtered by removing all genes with a sum of 0 across all samples.
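The text does not spell out how TPM was computed, so the following Python sketch uses the standard definition: each gene's count is divided by its length, and the resulting rates are rescaled to sum to one million within a sample. The function names `tpm` and `drop_all_zero`, the gene lengths, and the counts are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million for one sample.

    counts maps gene -> read count; lengths_kb maps gene -> length in kilobases.
    """
    rates = {g: counts[g] / lengths_kb[g] for g in counts}
    total = sum(rates.values())
    return {g: rate / total * 1e6 for g, rate in rates.items()}


def drop_all_zero(matrix):
    """Remove genes whose values sum to 0 across all samples."""
    return {g: vals for g, vals in matrix.items() if sum(vals) > 0}


# Hypothetical counts and lengths for a single sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 0}
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}
values = tpm(counts, lengths_kb)

# Hypothetical gene x sample matrix for the zero-sum filter.
matrix = {"geneA": [10.0, 12.0], "geneC": [0.0, 0.0]}
filtered = drop_all_zero(matrix)
```

Because geneB has three times the reads but is three times as long, it receives the same TPM as geneA, which is the length normalization that makes TPM comparable across genes within a sample.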

2.4. Downstream Analysis

To assess the normality of the log-transformed microarray data, the Kolmogorov–Smirnov (KS) and the Anderson–Darling (AD) tests were employed (Table S2). To visually inspect the distributions of the microarray samples, the R package ggplot2 was used to create density plots of the data (Figure S1A). The function fitdistr from the MASS package in R was used to evaluate the goodness of fit of RNA-seq data against three distributions: normal, Poisson and negative binomial (NB) (Table S2). To visualize the distributions of the RNA-seq samples, ggplot2 was used to create overlaid density plots (Figure S1B) [21,22]. Depending on the downstream analysis, expression data were either log₂-transformed or subjected to variance-stabilizing transformation (VST) using the DESeq2/Bioconductor R package (1.4.4.0) [23,24,25]. Log-transformed microarray and RNA-seq data were used to compute correlation coefficients using the Pearson method in R. PCA was performed on log-transformed microarray data and VST-transformed RNA-seq data using the prcomp function and visualized with the ggfortify and cluster packages in R [26,27]. Ellipses on the PCA plot represent a 95% confidence interval.

Differential expression analysis was conducted using the non-parametric Mann–Whitney U test with the wilcox.test function in R with the paired argument set to FALSE. Multiple comparisons were adjusted using the p.adjust function with the Benjamini–Hochberg (BH) method (significance threshold p_adj < 0.05). Dynamic fold change was calculated by dividing the means of each variable for YWOH by the means of the respective variable for YWH, followed by log₂ transformation (Table S3). The KS test was used to determine if the distribution of the dynamic fold change was different between the two platforms (p-value < 0.05). Next, filtered expression values for microarray and RNA-seq data were uploaded into Qiagen’s Ingenuity Pathway Analysis (IPA) for pathway analysis [28].
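As a rough illustration of this per-gene workflow, the Python sketch below implements a two-sided Mann–Whitney U test, BH adjustment, and the log₂ ratio of group means. It is not the authors' code: the study used R's wilcox.test and p.adjust, whereas this sketch always uses the tie-corrected normal approximation with continuity correction (R may compute an exact p-value for small samples without ties). All function names and input values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for x, two-sided p-value) via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: tied values share one rank, so group by rank.
    groups = {}
    for r in ranks:
        groups[r] = groups.get(r, 0) + 1
    tie_term = sum(t ** 3 - t for t in groups.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u1, 1.0
    # Continuity correction of 0.5 toward the null mean n1*n2/2.
    z = max(abs(u1 - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(var)
    return u1, math.erfc(z / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """BH-adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p-value
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def log2_fold_change(ywoh, ywh):
    """log2 of mean(YWOH) / mean(YWH), the direction described in the text."""
    return math.log2((sum(ywoh) / len(ywoh)) / (sum(ywh) / len(ywh)))


# Hypothetical expression values for one gene in each group.
ywoh = [1.0, 2.0, 3.0, 4.0, 5.0]
ywh = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = mann_whitney_u(ywoh, ywh)
adjusted = benjamini_hochberg([p_value, 0.20, 0.03])  # other p-values made up
lfc = log2_fold_change(ywoh, ywh)
```

In a full analysis the test would be run once per gene, and the adjustment would be applied to the complete vector of per-gene p-values before thresholding at 0.05.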

3. Results

3.1. High Correlation of Gene Expression and Concordance of DEGs

After normalizing and filtering, the microarray data included 15,828 genes, about 29% fewer than RNA-seq’s 22,323 genes. The two platforms shared 13,577 genes, representing ~86% of microarray’s gene expression dataset and ~61% of RNA-seq’s gene expression dataset (Table 1). To further assess the concordance between RNA-seq and microarray data, differential expression analysis was performed, and outputs were compared (Table 1). RNA-seq analysis identified 2395 DEGs, while microarray analysis identified 427 DEGs. The two platforms shared 223 DEGs, representing 52.2% of total microarray DEGs and 9.3% of total DEGs by RNA-seq, with significant concordance in the overlap (p-value = 2.2 × 10⁻¹⁶). A Venn diagram of the DEG outputs is provided as a visualization tool (Figure S2).
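The overlap percentages above follow directly from the reported counts, as the short Python sketch below reproduces. It also includes a one-sided hypergeometric tail, which is one common way to test whether an overlap exceeds chance. The gene universe used for the reported p-value is not stated in this section, so the enrichment function is demonstrated only on small hypothetical numbers; the names `percent` and `hypergeom_upper_tail` are illustrative.

```python
from math import comb


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


def hypergeom_upper_tail(universe, set_a, set_b, overlap):
    """P(X >= overlap) when set_b genes are drawn at random from a universe
    that contains set_a genes of interest."""
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    favorable = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Counts reported in the text.
shared_of_microarray = percent(13577, 15828)   # shared genes / microarray genes
shared_of_rnaseq = percent(13577, 22323)       # shared genes / RNA-seq genes
deg_of_microarray = percent(223, 427)          # shared DEGs / microarray DEGs
deg_of_rnaseq = percent(223, 2395)             # shared DEGs / RNA-seq DEGs

# Hypothetical toy example: universe of 20 genes, two lists of 5, overlap of 3.
toy_p = hypergeom_upper_tail(20, 5, 5, 3)
```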

To determine if a linear relationship between the two datasets existed, the Pearson correlation coefficient was computed on the gene expression outputs, and the distribution of the coefficient of correlation (R) was plotted for all samples (Figure 2). The median correlation coefficient (R = 0.76 and p-value < 0.05) signifies that the expression datasets sequenced using the two platforms show a strong correlation.
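A minimal Python sketch of this kind of calculation is given below, assuming that one Pearson coefficient is computed per sample across the genes shared by both platforms and that the median is then taken over samples. The function name `pearson_r`, the sample identifiers, and the expression values are hypothetical.

```python
import math
from statistics import median


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 expression of the same four genes in three samples.
microarray = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.0, 2.0, 3.0, 4.0],
    "s3": [1.0, 2.0, 3.0, 4.0],
}
rnaseq = {
    "s1": [2.0, 4.0, 6.0, 8.0],
    "s2": [1.0, 3.0, 2.0, 4.0],
    "s3": [4.0, 3.0, 2.0, 1.0],
}

per_sample_r = {s: pearson_r(microarray[s], rnaseq[s]) for s in microarray}
median_r = median(per_sample_r.values())
```

Reporting the median of the per-sample coefficients, rather than a single pooled coefficient, keeps one poorly matched sample from dominating the summary.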

Before any downstream analyses were performed, goodness-of-fit and fit-of-distribution tests were completed (Table S2). The microarray data failed the tests of normality, while the RNA-seq data fit the negative binomial distribution best among the normal, Poisson, and negative binomial distributions.

3.2. PCA of Microarray and RNA-Seq

To assess the variance and consistency of the two gene expression platforms, PCAs were generated using log₂- and VST-transformed normalized expression values of the 223 shared DEGs (Figure 3A,B). The first two components (PC1 and PC2) accounted for 28.2% of the variability in the microarray PCA and 37.3% of the variability in the RNA-seq PCA. The PCA for each platform showed two clusters based on their biological profile. The red cluster includes all but one YWOH (22/23), while the blue cluster includes all YWH (13/13).

3.3. RNA-Seq Demonstrates a Greater Dynamic Range of Fold Change

To evaluate the resolution of differential expression, the fold change dynamics of the two platforms were compared in two ways. First, the log₂ fold change of the DEGs shared between platforms was assessed (Table S3) and plotted (Figure 4). RNA-seq had a range between −0.9 and 2.5, while the range for DEGs from the microarray platform was between −0.2 and 0.5. The second comparison consisted of the complete set of DEGs unique to the respective platform. RNA-seq’s log₂ fold change range was −4.4 to 6.0, while microarray had a more limited log₂ fold change range of −0.2 to 0.5. The distributions of fold change differed between the two platforms (KS test, p-value < 0.05).
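To show what a two-sample KS comparison of fold change distributions measures, the Python sketch below computes the KS statistic D (the largest gap between the two empirical cumulative distribution functions) and an asymptotic p-value from the Kolmogorov series. This is an illustrative approximation, not the authors' R code, and statistical packages may use exact or corrected p-values for small samples. The function names and the log₂ fold change values are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Maximum absolute difference between two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        fa = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value for samples of size n and m."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series is numerically 1 in this region
    total = 0.0
    for j in range(1, 101):
        total += (-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
    return min(max(2 * total, 0.0), 1.0)


# Hypothetical log2 fold changes: a wide spread versus a narrow spread.
wide = [-0.9, -0.4, 0.3, 1.1, 1.8, 2.5]
narrow = [-0.2, -0.1, 0.0, 0.1, 0.3, 0.5]
d_stat = ks_statistic(wide, narrow)
p_approx = ks_pvalue(d_stat, len(wide), len(narrow))
```

Because D depends on the whole shape of each distribution, the test can detect a difference in spread even when the two sets of fold changes share a similar center.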

3.4. High Concordance of Canonical Pathways

To determine potential biological relevance, canonical pathways were investigated using IPA on filtered expression outputs. RNA-seq analysis identified 205 perturbed pathways, while microarray analysis identified 47 pathways perturbed. Among the perturbed pathways identified by RNA-seq, 30 pathways were also detected by microarray analysis (Figure 5). The 30 shared pathways represent 14.6% of the total RNA-seq pathways and 63.8% of the microarray pathways, with significant concordance in the amount of overlap observed (p-value < 0.0001). The median percent of shared expressed genes across the significant pathways identified by both platforms was 15%. All significant (p-value < 0.001) pathways are in supplemental figures and tables (Table S4). The associated p-values frequently differed by orders of magnitude. For example, IL-10 signaling was enriched in both datasets, but its statistical significance was far greater in the RNA-Seq analysis (p-value = 2.95 × 10⁻⁵) compared to the microarray analysis (p-value = 7.5 × 10⁻⁴).

4. Discussion

In this study, we compared the results obtained from microarray and RNA-seq data, both derived from the same biological samples. The normality and goodness-of-fit analysis revealed that the microarray data had significant deviations from normality, as evidenced by the KS and AD tests. Conversely, the RNA-seq data showed a better fit with the NB distribution than the Poisson and normal distributions, reflecting the frequent issue of overdispersion in RNA-seq data, where the variance of low-expressed genes exceeds the mean. Using the Mann–Whitney U test for analysis ensured consistency in our statistical approach, which provided a more robust framework for comparing the outputs across different data types. The findings indicate that applying similar statistical methods to these datasets provides comparable views of canonical pathways and potential biological significance. Furthermore, the use of consistent sample preparation protocols for both RNA-seq and microarray experiments is crucial to minimize biological variability and enable a more accurate assessment of platform performance. This approach reduces variance in datasets developed from different sources, allowing for meaningful comparisons between well-curated microarray data and RNA-seq data. By using the same non-parametric statistics, we ensure a more consistent comparison between the two technologies, reducing the technical variability associated with different statistical analyses. This facilitates a fair and comprehensive evaluation of the strengths and limitations of each platform and their feasibility.

Few studies have systematically compared RNA-seq and microarray data generated from identical samples by the same statistical approach, highlighting a significant gap in the current research literature. Addressing this gap is essential to provide robust benchmarks and guide best practices in transcriptomic research. Comparative analyses hold promise to advance understanding of the factors influencing gene expression measurements. They can also improve the reproducibility and reliability of transcriptomic studies across different experimental settings and biological systems.

The Mann–Whitney U test was applied in this study due to its robustness in handling a range of varying distributions and not assuming a normal distribution [29,30], making it suitable for skewed data often encountered in RNA-seq and microarray analysis. A limitation of the method is the lack of power that parametric tests provide when the data fit their distribution [29]. Despite this limitation, we provide striking results of overlap in the reported canonical pathways.

The key findings from this study are the strong correlation of gene expression profiles determined by the coefficient of correlation, the high concordance across DEGs and canonical pathways, and the relevance and consistency of canonical pathways previously identified between the groups studied. The median correlation coefficient of 0.76 indicates a strong concordance in the gene expression profiles obtained from the two platforms. Furthermore, the comparison of DEGs between the two platforms revealed a high degree of overlap, with Fisher’s exact test confirming significant concordance. Removing the low-variance genes from the microarray dataset increased the power to detect DEGs, supporting the idea that microarray analysis benefits from focused filtering to enhance the sensitivity of differential expression analysis [31,32,33].

Two separate fold change analyses were performed: one focusing on the shared genes between the two platforms, and another using all DEGs identified for each platform individually. Restricting the analysis to shared genes limits the ability to detect platform-specific ranges, especially in RNA-seq, where more low-expressed genes are detected. However, performing fold change analysis using all DEGs may skew results, particularly in cases where one platform detects larger numbers of DEGs. The difference in dynamic fold change between the two platforms may be due to the platform-specific approaches to gene expression detection. Microarray relies on probe intensities that can saturate for highly expressed genes, limiting low-expressed gene detection and the dynamic range of fold change. RNA-seq can scale linearly with transcript abundance until the sequencing depth is exhausted. The results of the PCAs may have showcased the differences in platform-specific variability. The RNA-seq PCA had a larger PC1, possibly highlighting RNA-seq’s broad range in total variability. This may be due to RNA-seq’s much broader dynamic range, which we highlight in the fold change analysis. The difference in PC1’s contribution may also be due to RNA-seq’s greater representation of total biological diversity in gene expression, since microarray is optimized for predefined target sequences on the probe set.

Nested from a previous microarray study [14,15] that focused on the inflammatory impact of recreational marijuana and tobacco use, many canonical pathways perturbed in the current sub-study, based on their group function or biological role, were enriched. Pathways previously reported and identified in this study include interferon signaling, PI3K signaling, and cell death pathways. The similarity in pathway profiles between the two studies suggests that despite differences in technology, both platforms capture genes from similar pathways, and the approach could be applied to other conditions.

RNA-seq is currently the standard for gene expression data curation, raising the question of microarray’s past, present and future viability in the research community. Microarray was the largest contributor to gene expression data for many years, filling large repositories with publicly available datasets, some of which are from challenging-to-produce studies, like those involving neonates, neurological conditions, or tissue with limited availability. Machine learning (ML) algorithms and artificial intelligence (AI) tools could be created to integrate RNA-seq and microarray data, learning to recognize platform-specific biases, adjust analysis to increase confidence, and improve the predictive accuracy of biological outcomes. The high concordance of biologically relevant functional pathways and DEGs between the two platforms suggests that both technologies could serve as input for predictive models.

5. Conclusions

This study demonstrates that both microarray and RNA-seq technologies provide highly concordant results when analyzing the same biological samples and when analyzed with the same non-parametric statistics, such as the Mann–Whitney U test. The decreased biological variance helped to establish the robustness of the statistical analysis when comparing two different gene expression platforms. Both platforms identify similar canonical pathways and DEGs. The two platforms remain complementary tools in gene expression analysis. The results we present emphasize the importance of considering both technologies in future research, particularly in the context of AI and machine learning techniques to improve the integration of transcriptomic data for clinical and translational research. Microarray data remains a valid and relevant tool for gene expression analysis, even in an era of RNA-seq dominance.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/biotech14030055/s1: Figure S1: Density plots; Figure S2: Venn diagram comparing DEGs identified by RNA-seq (blue) and microarray (red) analysis, illustrating the overlap and unique sets of DEGs detected by the two platforms; Table S1: Demographic and clinical characteristics of study groups; Table S2: Assessment for goodness of fit for microarray’s normality and fitting of distributions for RNA-seq; Table S3: List of the shared DEGs and their respective log₂ fold change; Table S4A: List of the shared canonical pathways and their respective p-values (shared pathways are highlighted in red cells; sorted alphabetically by canonical pathway); Table S4B: List of the microarray canonical pathways and their respective p-values (sorted by p-value); Table S4C: List of RNA-seq canonical pathways and their respective p-values (sorted by p-value).

Author Contributions

Conceptualization, I.D.R., S.A.B., L.Y. and M.M.G.; methodology, I.D.R. and S.A.B.; software, I.D.R. and S.A.B.; validation, K.-F.C., L.Y., J.S. and U.N.; formal analysis, I.D.R. and S.A.B.; investigation, I.D.R. and S.A.B.; resources, M.M.G. and J.W.S.; data curation, I.D.R. and K.-F.C.; writing—original draft preparation, I.D.R.; writing—review and editing, S.A.B., K.-F.C., L.Y., U.N., G.M.V., J.S., J.W.S. and M.M.G.; visualization, I.D.R., K.-F.C. and U.N.; supervision, M.M.G. and J.W.S.; project administration, K.-F.C.; funding acquisition, M.M.G. and J.W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by NIH awards R01-DA031017 (M.M.G.), U01-DA044571 (J.W.S.), U01-HD040533 (J.W.S.), U01-HD040474 (J.W.S.), U01-A1068632 (J.W.S.), NIH Pathway Program, NIH Division of Intramural Research, NIAID (M.M.G.), NIH Office of AIDS Research, OD (M.M.G.), and 5P30-AI064518 (J.W.S., Duke University Center for AIDS Research).

Institutional Review Board Statement

Approval to analyze the biorepository samples from this study was obtained from the Duke University IRB, Consequences of Marijuana Use on Inflammatory Pathways in HIV-Infected Youth (Pro00100780), approved on 8 March 2019, and the National Institutes of Health IRB (18-NIAID-00677), approved on 10 August 2018.

Informed Consent Statement

All participants provided informed consent for enrollment in Adolescent Medicine Trials Network (ATN) for HIV/AIDS Interventions protocol 071/101 at their individual sites (ClinicalTrials.gov; https://clinicaltrials.gov/, Clinical Identifier No NCT00683579).

Data Availability Statement

The data were uploaded to dbGaP [34] and will be released upon the publication of this manuscript (Accession ID phs002090.v1.p1, Accession ID phs002218.v1.p1).

Acknowledgments

The authors would like to thank the study volunteers for their participation in the study. This work was supported in part by funding from the Adolescent Trials Network for HIV/AIDS Interventions (ATN) from the NIH (U01-HD040533 and U01-HD040474) through the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), with supplemental funding from the National Institute on Drug Abuse (NIDA) and the National Institute for Mental Health (NIMH). It was also supported by independent grants from the NIDA (R01-DA031017 and U01-DA044571). The study was co-endorsed by the International Maternal Pediatric Adolescent AIDS Clinical Trials Group (IMPAACT), which is supported by the NIAID, the Eunice Kennedy Shriver NICHD, and the NIMH (U01-A1068632). Further support was provided by the Duke University Center for AIDS Research (CFAR) (NIH 5P30-AI064518). The following ATN sites participated in this study: University of South Florida, Tampa (Emmanuel, Lujan-Zilberman, Julian), Children’s Hospital of Los Angeles (Belzer, Flores, Tucker), University of Southern California at Los Angeles (Kovacs, Homans, Lozano), Children’s National Medical Center (D’Angelo, Hagler, Trexler), Children’s Hospital of Philadelphia (Douglas, Tanney, DiBenedetto), John H. Stroger Jr. Hospital of Cook County and the Ruth M. Rothstein CORE Center (Martinez, Bojan, Jackson), University of Puerto Rico (Febo, Ayala-Flores, Fuentes-Gomez), Montefiore Medical Center (Futterman, Enriquez-Bruce, Campos), Mount Sinai Medical Center (Steever, Geiger), University of California-San Francisco (Moscicki, Auerswald, Irish), Tulane University Health Sciences Center (Abdalian, Kozina, Baker), University of Maryland (Peralta, Gorle), University of Miami School of Medicine (Friedman, Maturo, Major-Wilson), Children’s Diagnostic and Treatment Center (Puga, Leonard, Inman), St. Jude’s Children’s Research Hospital (Flynn, Dillard), and Children’s Memorial (Garofalo, Brennan, Flanagan). 
The following IMPAACT sites participated in the study: Children’s Hospital of Michigan—Wayne State (Moore, Rongkavilit, Hancock), Duke University Medical Center Pediatric CRS (Cunningham, Wilson), Johns Hopkins University (Ellen, Chang, Noletto), New Jersey Medical School CTU/CRS (Dieudonne, Bettica, Monti), St. Jude/Memphis CTU/CRS (Flynn, Dillard, McKinley), University of Colorado School of Medicine/The Children’s Hospital (Reirden, Kahn, Witte), University of Southern California Medical Center (Homans, Lozano), Howard University Hospital (Rana, Deressa). Four of the ATN and IMPAACT sites utilized their General Clinical Research Center (GCRC)/Pediatric Clinical Research Center (PCRC) for the study. The centers were supported by grants from the General Clinical Research Center Program of the National Center for Research Resources (NCRR), NIH, Department of Health and Human Services, as follows: Children’s National Medical Center, M01RR020359; Howard University Hospital, MO1-RR010284; University of California at San Francisco, UL1 RR024131; and University of Colorado School of Medicine/Children’s Hospital, UL1 RR025780. The University of Pennsylvania/Children’s Hospital of Philadelphia utilized its Institutional Clinical and Translational Science Award Research Center (CTRC), supported by grant UL1 RR024134 from NCRR. The Tulane University Health Sciences Center utilized its CTRC for the study, which was supported in whole or in part by funds provided through the Louisiana Board of Regents RC/EEP (RC/EEP-06). This research was also supported by Stephany W. Holloway University Endowed Chair for AIDS Research at the University of Florida and Intramural Research Program of the NIAID.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RNA	Ribonucleic acid
HIV	Human immunodeficiency virus
AIDS	Acquired immunodeficiency syndrome
DEG	Differentially expressed gene(s)
GEO	Gene Expression Omnibus
PCR	Polymerase chain reaction
DNA	Deoxyribonucleic acid
NGS	Next-generation sequencing
PBCs	Peripheral blood cells
ATN	Adolescent Medicine Trials Network
YWOH	Youth without HIV
YWH	Youth with HIV
ART	Antiretroviral therapy
RMA	Robust multi-array averaging
IQR	Interquartile range
PCA	Principal component analysis
TPM	Transcripts per million
VST	Variance-stabilizing transformation
NB	Negative binomial
IPA	Ingenuity Pathway Analysis
KS	Kolmogorov–Smirnov
AD	Anderson–Darling
ML	Machine learning
AI	Artificial intelligence

Figure 1. Study design. The comparison was performed by collecting aliquots from the same samples for each platform. Data processing included filtering low-variance probes from the microarray output and rows with no counts from the RNA-seq output. Data were normalized by log transformation for microarray and by transcripts per million (TPM) for RNA-seq. Principal component analysis (PCA) was performed on log-transformed data for microarray and on variance-stabilizing transformation (VST) data for RNA-seq. Analysis of differentially expressed genes (DEGs) was performed, and coefficient of correlation, concordance, dynamic range and unique/shared genes were derived from the DEG output. Pathway analysis was performed on total gene expression data. Differential expression was determined with the Mann–Whitney U (Wilcoxon rank sum) test and adjusted with Benjamini–Hochberg (BH) correction.
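The Benjamini–Hochberg adjustment named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the raw p-values have already been produced by a per-gene Mann–Whitney U test; the function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes (illustration only).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
adj_p = benjamini_hochberg(raw_p)
# Genes called differentially expressed at FDR = 0.05.
is_deg = [p <= 0.05 for p in adj_p]
```

In this toy input, the third and fourth genes have raw p-values below 0.05 but are not called DEGs after adjustment, which is the effect the correction is meant to have.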

Figure 2. Distribution of the coefficient of correlation (R) between microarray and RNA-seq gene expression. Correlation coefficients are shown for youth with HIV (blue) and without HIV (red). The median R is 0.76 for both groups, indicating a positive relationship between expression values measured on the two platforms.
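As a rough illustration of how a per-group median correlation like the one in Figure 2 could be derived, the sketch below computes a Pearson R for each participant from paired expression values and then takes the median. It assumes one R per participant across genes measured on both platforms; the participant labels and expression values are hypothetical.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical paired values per participant: (microarray, RNA-seq)
# expression for the same genes, both on a log scale.
participants = {
    "P01": ([5.1, 7.3, 9.8, 6.2], [1.0, 3.9, 7.5, 2.2]),
    "P02": ([4.8, 7.9, 9.1, 6.6], [2.1, 3.0, 6.8, 4.0]),
    "P03": ([5.5, 6.9, 9.5, 6.0], [1.5, 4.2, 6.1, 1.9]),
}

r_values = [pearson_r(array, seq) for array, seq in participants.values()]
median_r = statistics.median(r_values)
```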

Figure 3. Principal component analyses. (A) PCA of microarray data from youth with HIV (blue) and youth without HIV (red). The first principal component (PC) accounts for 15.3% of the variability, while the second PC explains 12.9% of the variability, together representing 28.2% of the total variability in the data. (B) PCA of RNA-seq data from youth with HIV (blue) and youth without HIV (red). The first PC accounts for 23.7% of the variability, with the second PC explaining 13.6% of the variability, together representing 37.3% of the total variability in the data. Ellipses represent a 95% confidence interval.

Figure 4. Violin plot of the range of fold change of shared and total DEGs. Log₂FC values are shown for DEGs common to both microarray and RNA-seq (Shared DEGs) and for all DEGs in each platform (Total DEGs), microarray (white) or RNA-seq (striped). Log₂FC is computed by dividing the mean of each variable in youth with HIV by the respective mean in youth without HIV and taking the log₂ of the ratio. For shared DEGs, microarray had a range of −0.2 to 0.5 Log₂FC and RNA-seq had a range of −0.9 to 2.5 Log₂FC. For total DEGs, microarray had a range of −0.2 to 0.5 Log₂FC and RNA-seq had a range of −4.4 to 6.0 Log₂FC. A KS test p-value below 0.05, indicated by an asterisk (*), showed that the distributions of fold changes were significantly different.
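The Log₂FC definition in the Figure 4 caption, and the statistic underlying a two-sample KS comparison, can be sketched as follows. This is a minimal illustration with hypothetical values; it computes only the KS statistic D (the largest gap between the two empirical distribution functions) and not the p-value, which in practice would come from a statistics package.

```python
import math
from bisect import bisect_right


def log2_fold_change(hiv_values, control_values):
    """log2 of (mean in youth with HIV / mean in youth without HIV)."""
    mean_hiv = sum(hiv_values) / len(hiv_values)
    mean_control = sum(control_values) / len(control_values)
    return math.log2(mean_hiv / mean_control)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The maximum gap can only occur at an observed value.
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical expression values for one gene in each group.
fc = log2_fold_change([8.0, 8.0], [2.0, 2.0])  # ratio of means is 4

# Hypothetical Log2FC distributions for the two platforms.
array_fc = [-0.2, -0.1, 0.1, 0.3, 0.5]
seq_fc = [-0.9, -0.4, 0.6, 1.8, 2.5]
d_stat = ks_statistic(array_fc, seq_fc)
```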

Figure 5. Shared pathways of IPA. IPA was used to determine functional outputs of microarray and RNA-seq data. The 30 pathways shared between microarray and RNA-seq are shown. For each pathway, the percentage of genes unique to microarray is shown in yellow, the percentage of shared genes, which is large in every pathway, is shown in grey, and the percentage of genes unique to RNA-seq is shown in green. RNA-seq contributed the most unique genes to shared pathways. The number of genes contributing to each percent value is annotated in the corresponding segment of each stacked percent bar.

Table 1. Comparison of expressed genes and DEG numbers between microarray and RNA-seq.

	Microarray ¹	Shared	RNA-Seq ²
Expressed Genes
Unique	2251		8656
Shared	(86%)	13,577	(61%)
Total	15,828		22,323
DEGs (FDR = 0.05)
Unique	204		2172
Shared	(52%)	223	(9%)
Total	427		2395

Percentages give the shared count as a fraction of each platform's total.
¹ Microarray data were filtered by removing the genes responsible for the bottom 25% variance.
² RNA-seq data were filtered by removing all genes that had sum zero across all samples.
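The percentages in Table 1 can be reproduced from the reported counts with a short sketch. The helper name `percent_shared` and the dictionary layout are hypothetical; the counts are those printed in the table.

```python
def percent_shared(shared, total):
    """Shared count as a rounded percentage of a platform's total."""
    return round(100 * shared / total)


# Counts as reported in Table 1.
table_1 = {
    "Expressed genes": {"shared": 13577, "microarray": 15828, "rnaseq": 22323},
    "DEGs (FDR = 0.05)": {"shared": 223, "microarray": 427, "rnaseq": 2395},
}

percentages = {
    label: (
        percent_shared(row["shared"], row["microarray"]),
        percent_shared(row["shared"], row["rnaseq"]),
    )
    for label, row in table_1.items()
}

for label, (pct_array, pct_seq) in percentages.items():
    print(f"{label}: {pct_array}% of microarray, {pct_seq}% of RNA-seq")
```

The same shared count is a much smaller share of the RNA-seq total than of the microarray total because RNA-seq reports more genes overall.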


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
