# t-Test at the Probe Level: An Alternative Method to Identify Statistically Significant Genes for Microarray Data

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Methodology

## 3. Results

#### 3.1. Sensitivity

#### 3.2. Robustness

**Figure 1.**ROC curve comparing the performance (sensitivity) of different ranking methodologies in identifying differentially expressed genes in spike-in data. (

**A**) Comparison of our approach (median t-values) with different preprocessing methods used with t-test as ranking method to select the differentially expressed genes. (

**B**) Comparison of our approach (median t-values) with the best performance of other ranking methods: t-test, SAM and LIMMA. The best performance of t-test, SAM and LIMMA is obtained when using RMA as a preprocessing algorithm.

**Figure 2.**Average of the fraction of genes shared by two lists of differentially expressed genes (overlap) as a function of the sample size using the Leukemia dataset. Each list of differentially expressed genes is composed by the top 100 genes chosen according to different ranking methods, i.e., t-test, SAM and LIMMA (preprocessed by the RMA method), and our approach (median t-value) which does not require a preprocessing algorithm. The average value of the overlap between the lists is calculated over 100 lists chosen randomly.

**Figure 3.**Average of the fraction of genes shared by two lists of differentially expressed genes (overlap) as a function of the sample size using the Multiple Myeloma dataset divided according to (

**A**) Overall Survival Milestone Outcome (OS-MO) and (

**B**) Event Free Survival Milestone Outcome (EFS-MO). Each list of differentially expressed genes is composed by the top 100 genes chosen according to different ranking methods, i.e., t-test, SAM and LIMMA (preprocessed by the RMA method), and our approach (median t-value) which does not require a preprocessing algorithm. The average value of the overlap between the lists is calculated over 100 lists chosen randomly.

**Figure 4.**Average of the fraction of genes shared by two lists of differentially expressed genes (overlap) as a function of the sample size using the Breast Cancer dataset divided according to (

**A**) pre-operative treatment response (pCR, pathologic complete response) and (

**B**) estrogen receptor (ER) endpoint. Each list of differentially expressed genes is composed by the top 100 genes chosen according to different ranking methods, i.e., t-test, SAM and LIMMA (preprocessed by the RMA method), and our approach (median t-value) which does not require a preprocessing algorithm. The average value of the overlap between the lists is calculated over 100 lists chosen randomly.

## 4. Discussion

## Acknowledgments

## Author Contributions

## Appendices

**Table A1.**Latin Square design, which consists of 14 spike-in gene groups in 14 experimental groups with 3 repetitions.

gene/exp | 1–3 | 4–6 | 7–9 | 10–12 | 13–15 | 16–18 | 19–21 | 22–24 | 25–27 | 28–30 | 31–33 | 34–36 | 37–39 | 40–42 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1–3 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |

4–6 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0 |

7–9 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 |

10–12 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 |

13–15 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 |

16–18 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 |

19–21 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 |

22–24 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 |

25–27 | 16 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 |

28–30 | 32 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 |

31–33 | 64 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 |

34–36 | 128 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 |

37–39 | 256 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |

40–42 | 512 | 0 | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |

**Table A2.**Class distribution of the datasets. The Breast Cancer and Multiple Myeloma datasets were obtained from the MAQC consortium [25]. The training and validation datasets were generated by different experimental groups. In our analysis we considered all samples: both training and validation samples.

Dataset | Training Set | Validation Set | ||||
---|---|---|---|---|---|---|

Number of Samples | Positive | Negative | Number of Samples | Positive | Negative | |

Breast Cancer (pCR) | 130 | 33 | 97 | 100 | 15 | 85 |

Breast Cancer (ER) | 130 | 80 | 50 | 100 | 61 | 39 |

Multiple Myeloma (OS-MO) | 340 | 51 | 289 | 214 | 27 | 187 |

Multiple Myeloma (EFS-MO) | 340 | 84 | 256 | 214 | 34 | 180 |

## Conflicts of Interest

## References

- Ein-Dor, L.; Kela, I.; Getz, G.; Givol, D.; Domany, E. Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics
**2005**, 21, 171–178. [Google Scholar] [CrossRef] [PubMed] - Ein-Dor, L.; Zuk, O.; Domany, E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. USA
**2006**, 103, 5923–5928. [Google Scholar] [CrossRef] [PubMed] - Li, C.; Wong, W.H. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA
**2001**, 98, 31–36. [Google Scholar] [CrossRef] [PubMed] - Irizarry, R.A.; Hobbs, B.; Collin, F.; Beazer-Barclay, Y.D.; Antonellis, K.J.; Scherf, U.; Speed, T.P. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics
**2003**, 4, 249–264. [Google Scholar] [CrossRef] [PubMed] - Wu, Z. A review of statistical methods for preprocessing oligonucleotide microarrays. Stat. Methods Med. Res.
**2009**, 18, 533–541. [Google Scholar] [PubMed] - Guide to Probe Logarithmic Intensity Error (Plier) Estimation. Available online: http://www.affy metrix.com/support/technical/technotes/plier_technote.pdf (accessed on 1 November 2012).
- Shi, L.; Tong, W.; Fang, H.; Scherf, U.; Han, J.; Puri, R.K.; Frueh, F.W.; Goodsaid, F.M.; Guo, L.; Su, Z.; et al. Cross-platform comparability of microarray technology: Intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinform.
**2005**, 6, eS12. [Google Scholar] [CrossRef] [PubMed] - Shi, L.; Reid, L.H.; Jones, W.D.; Shippy, R.; Warrington, J.A.; Baker, S.C.; Collins, P.J.; de Longueville, F.; Kawasaki, E.S.; Lee, K.Y.; et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol.
**2006**, 24, 1151–1161. [Google Scholar] [CrossRef] [PubMed] - Allison, D.B.; Cui, X.; Page, G.P.; Sabripour, M. Microarray data analysis: From disarray to consolidation and consensus. Nat. Rev. Genet.
**2006**, 7, 55–65. [Google Scholar] [CrossRef] [PubMed] - Jeanmougin, M.; de Reynies, A.; Marisa, L.; Paccard, C.; Nuel, G.; Guedj, M. Should we abandon the t-test in the analysis of gene expression microarray data: A comparison of variance modeling strategies. PLoS One
**2010**, 5, e0012336. [Google Scholar] [CrossRef] [PubMed] - Tusher, V.G.; Tibshirani, R.; Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA
**2001**, 98, 5116–5121. [Google Scholar] [CrossRef] [PubMed] - Cui, X.; Hwang, J.; Qiu, J.; Blades, N.; Churchill, G. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics
**2005**, 6, 59–75. [Google Scholar] [CrossRef] [PubMed] - Wright, G.W.; Simon, R.M. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics
**2003**, 19, 2448–2455. [Google Scholar] [CrossRef] [PubMed] - Smyth, G.K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol.
**2004**, 3, e3. [Google Scholar] [CrossRef] [PubMed] - Zeisel, A.; Amir, A.; Kostler, W.J.; Domany, E. Intensity dependent estimation of noise in microarrays improves detection of differentially expressed genes. BMC Bioinform.
**2010**, 11, e400. [Google Scholar] [CrossRef] [PubMed] - Baldi, P.; Long, A.D. A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics
**2001**, 17, 509–519. [Google Scholar] [CrossRef] [PubMed] - Stevens, J.R.; Bell, J.L.; Aston, K.I.; White, K.L. A comparison of probe-level and probeset models for small-sample gene expression data. BMC Bioinform.
**2010**, 11, e281. [Google Scholar] [CrossRef] [PubMed] - Lemieux, S. Probe-level linear model fitting and mixture modeling results in high accuracy detection of differential gene expression. BMC Bioinform.
**2006**, 7, e391. [Google Scholar] [CrossRef] [PubMed][Green Version] - Barrera, L.; Benner, C.; Tao, Y. Leveraging two-way probe-level block design for identifying differential gene expression with high-density oligonucleotide arrays. BMC Bioinform.
**2004**, 14, 1–14. [Google Scholar] - Astrand, M.; Mostad, P.; Rudemo, M. Empirical Bayes models for multiple probe type microarrays at the probe level. BMC Bioinform.
**2008**, 9, e156. [Google Scholar] [CrossRef] [PubMed] - Chu, J.T. On the distribution of the sample median. Ann. Math. Stat.
**1955**, 26, 112–116. [Google Scholar] [CrossRef] - Latin Square Data for Expression Algorithm Assessment. Available online: http://www.affymetrix.com/support/technical/sample_data/datasets.affx (accessed on 1 November 2012).
- Cope, L.M.; Irizarry, R.A.; Jaffee, H.A.; Wu, Z.; Speed, T.P. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics
**2004**, 20, 323–331. [Google Scholar] [CrossRef] [PubMed] - Bolstad, B.M.; Irizarry, R.A.; Astrand, M.; Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics
**2003**, 19, 185–193. [Google Scholar] [CrossRef] [PubMed] - Shi, L.; Campbell, G.; Jones, W.D.; Campagne, F.; Wen, Z.; Walker, S.J.; Su, Z.; Chu, T.M.; Goodsaid, F.M.; Pusztai, L.; et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol.
**2010**, 28, e827. [Google Scholar] [CrossRef] [PubMed] - Armstrong, S.A.; Staunton, J.E.; Silverman, L.B.; Pieters, R.; den Boer, M.L.; Minden, M.D.; Sallan, S.E.; Lander, E.S.; Golub, T.R.; Korsmeyer, S.J. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet.
**2002**, 30, 41–47. [Google Scholar] [CrossRef] [PubMed] - Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science
**1999**, 286, 531–537. [Google Scholar] [CrossRef] [PubMed] - Gyorffy, B.; Molnar, B.; Lage, H.; Szallasi, Z.; Eklund, A.C. Evaluation of microarray preprocessing algorithms based on concordance with RT-PCR in clinical samples. PLoS One
**2009**, 4, e0005645. [Google Scholar] [CrossRef] [PubMed][Green Version] - Therneau, T.M.; Ballman, K.V. What does PLIER really do? Cancer Inform.
**2008**, 6, 423–431. [Google Scholar] [PubMed]

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Boareto, M.; Caticha, N. t-Test at the Probe Level: An Alternative Method to Identify Statistically Significant Genes for Microarray Data. *Microarrays* **2014**, *3*, 340-351.
https://doi.org/10.3390/microarrays3040340

**AMA Style**

Boareto M, Caticha N. t-Test at the Probe Level: An Alternative Method to Identify Statistically Significant Genes for Microarray Data. *Microarrays*. 2014; 3(4):340-351.
https://doi.org/10.3390/microarrays3040340

**Chicago/Turabian Style**

Boareto, Marcelo, and Nestor Caticha. 2014. "t-Test at the Probe Level: An Alternative Method to Identify Statistically Significant Genes for Microarray Data" *Microarrays* 3, no. 4: 340-351.
https://doi.org/10.3390/microarrays3040340