1. Introduction
From both biological and statistical standpoints, the reproducibility and accuracy of results are crucial to the clinical utility of genome-wide Omics studies. A 2016 survey by Nature [1] indicated that 70% of researchers had failed to replicate other scientists’ studies, with more than half failing to replicate their own. While the accuracy and reproducibility of an Omics signal in multi-subject studies can be assessed by comparing subjects across distinct datasets, evaluating the accuracy of a single-subject study (SSS) remains challenging. In principle, conventional statistics deriving dispersion parameters (e.g., variance) across samples can be applied to single-subject studies using multiple repeated measures in each compared condition (e.g., t-test) or many measures over time (e.g., time series) [2,3]. However, this strategy is often prohibitively expensive, wastes valuable clinical specimens, and is rate-limiting. The foundational literature on single-subject studies [4,5] highlights the challenges and issues associated with inferential statistics on cohorts of size N = 1. Beyond the multiple-repeated-measures paradigm of conventional statistics, we and others have proposed new analytical methods designed to identify an effect size and statistical significance for a subject from one Omics sample per condition, without replicates [6,7,8,9] (in this study, specifically with transcriptomics). A reference standard consisting of the other subjects’ genomes is sufficient to quantify the frequency of a genetic variant or mutation in static DNA. However, when this strategy is applied to proteins or transcripts, it does not inform on the differences observed between an individual’s gene product expression and that of a group. Are these differences attributable to a normal physiological adaptation or to a pathological response to environmental factors unique to this individual (e.g., a combination of medications)? Furthermore, how can studies best capture these differences? This manuscript presents an alternative approach to constructing method-independent reference standards that addresses inherent challenges in transcriptome-scale single-subject studies. The proposed framework improves upon the evaluation of software tools and algorithms for differential gene expression in one subject between two sampling conditions, in the absence of replicate measures per condition. Such single-subject Omics designs are more affordable and practical for clinical settings than repeated measures in one condition, and they generally provide a more interpretable effect size and p-value for a single subject than comparing an individual against a cohort. We compare and contrast this new evaluation framework in transcriptomics to previous ones in terms of the accuracy of results beyond the previously proposed “naïve replication”, and quantify biases stemming from the anticonservative assumptions of previous evaluation frameworks.
In large-scale biological data science studies, “gold standards” produced via biological validation are rate-limiting and generally unfeasible at the Omics scale. Data scientists address this limitation with computational “reference standards” as a proxy for conventional biological gold standards. The most rigorous reference standards employ (i) analytics and (ii) samples (datasets) independent from those used for the predictions. However, these two conditions are not always feasible in single-subject studies. Furthermore, most approaches generating reference standards from an Omics dataset rely heavily on p-values, despite recommendations from statisticians for effect-size-informed approaches that address the limitations of null-hypothesis significance testing [10,11]. We synthesize and incorporate these notions into a set of standard operating procedures for the development of reliable reference standards in transcriptomes, as a foundation for evaluating reproducibility in big data science studies.
This manuscript focuses on improving the accuracy of single-subject study evaluations, beyond the “naïve reproducibility” of results and the other biases described in Table 1. In a prior study of five distinct RNA analysis methods in multiple isogenic datasets [12], we described a new method that resolves the inconsistent signal between analytical methods that the original study did not address [13]. This inconsistency (arising from distributional differences) required methods such as DESeq [14] to impose a false discovery rate (FDR) cutoff of 0.001 to detect ~3000 DEGs, while DEGseq [15] required a cutoff of FDR < 3.6 × 10⁻¹² for the same number of DEGs, with 2039 overlapping transcripts. Conversely, we also found that applying the same FDR cutoff (i.e., 0.001) to different methods produced widely divergent numbers of predictions (i.e., ~3200 vs. ~9000, with approximately 3000 overlapping transcripts, leaving ~6000 transcripts with a conflicting, unaddressed signal). Anticonservative isomorphic evaluations (Table 1) have been the conventional standard for evaluating DEG analytics in isogenic conditions (e.g., cell lines or inbred animal models), the datasets closest to single-subject studies [13,16]. Such evaluations propose a naïve replication of results under the anticonservative assumption that the same DEG analytics can be employed to create both the reference standard and the predictions. In a prior study, we constructed an ensemble learner [12] to develop reference standards; the ensemble approach resolves conflicting biomarker predictions, uses no statistical assumptions, and removes anticonservative isomorphic evaluations. That study demonstrated that, in situations comprising high technical noise, an ensemble learner maximizes the stability of a reference standard and of the DEG predictions [12]. However, ensemble learners increase the “black-box” aspect of the data analysis and muddle its interpretability.
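The scale of this cross-method discordance can be illustrated with a minimal Python sketch. The q-values below are simulated (they are not actual DESeq or DEGseq output): two methods whose FDR-adjusted q-values follow different distributions yield very different DEG counts at a shared cutoff, and matching their DEG counts requires method-specific cutoffs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 10_000

# Hypothetical FDR-adjusted q-values from two DEG methods applied to the
# same isogenic dataset; their distributions differ, mimicking the
# DESeq-vs-DEGseq discordance described in the text.
q_method_a = rng.beta(0.5, 3.0, n_genes)  # more diffuse q-value distribution
q_method_b = rng.beta(0.1, 3.0, n_genes)  # mass concentrated near zero

# Same FDR cutoff, very different DEG counts, and only a partial overlap:
deg_a = set(np.flatnonzero(q_method_a < 0.001))
deg_b = set(np.flatnonzero(q_method_b < 0.001))
print(len(deg_a), len(deg_b), len(deg_a & deg_b))

# To match method A's DEG count, method B needs a far stricter cutoff:
matched_cutoff_b = np.sort(q_method_b)[len(deg_a) - 1]
print(matched_cutoff_b)
```

Picking either a shared cutoff or matched DEG counts therefore changes which transcripts end up in the reference standard, which is precisely the conflicting signal an isomorphic evaluation leaves unaddressed.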
In principle, the reference standard should be independent from the predicted biological signal used to evaluate an analytical method. This requires independent datasets for calculating and evaluating the prediction. We sought to improve the evaluation of single-subject studies of transcriptomics-scale gene products by generating unbiased reference standards. We focused on one framework of single-subject studies: Those with two transcriptomics-scale measures (one per condition) in one subject, designed to determine altered gene products using the subject as their own control. We hypothesized that these unbiased reference standards could be achieved by: (i) Using analytical methods distinct from the one being evaluated, to avoid analytical biases, and (ii) selecting the most concordant results between multiple analytical methods according to ranges of fold-change expression between the two conditions and expression count cutoffs. We propose a framework, referenceNof1, to resolve the challenges highlighted in Table 1, offering an alternate, yet related, evaluation framework that improves the data quality of the reference standard construction. The framework is presented in Figure 1. We demonstrate the accuracy of the referenceNof1 method with transcriptome simulations and historical transcriptome data (Section 2). Section 3 discusses the implications and limitations of the current approaches, while Section 4 details the data and materials and formally introduces the referenceNof1 algorithm. Section 5 concludes the study. The referenceNof1 software is released as an R package.
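To make hypotheses (i) and (ii) concrete, the following Python sketch illustrates the concordance idea. It is an illustrative re-implementation, not the released referenceNof1 R package: the method names, simulated per-gene results, and cutoff values (q < 0.05, |log2FC| ≥ 1, count ≥ 10) are all invented for the example. A reference standard is built from the concordant calls of the methods other than the one under evaluation, filtered by fold-change and expression-count cutoffs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 1000
# Shared "truth": ~10% of genes are differentially expressed.
true_log2fc = np.where(rng.random(n_genes) < 0.1, rng.normal(0, 3, n_genes), 0.0)
counts = rng.integers(1, 500, n_genes)

def simulate_method(seed):
    """Hypothetical per-gene output of one DEG method: log2 fold change,
    mean expression count, and FDR-adjusted q-value."""
    r = np.random.default_rng(seed)
    log2fc = true_log2fc + r.normal(0, 0.3, n_genes)  # shared signal + method noise
    q = np.where(true_log2fc != 0,
                 r.uniform(0, 0.05, n_genes),  # true DEGs: small q-values
                 r.uniform(0, 1, n_genes))     # null genes: uniform q-values
    return {f"g{i}": {"log2fc": log2fc[i], "count": int(counts[i]), "q": q[i]}
            for i in range(n_genes)}

def build_reference_standard(results_by_method, evaluated_method,
                             min_abs_log2fc=1.0, min_count=10, q_cutoff=0.05):
    """Concordant DEG calls across all methods EXCEPT the evaluated one,
    restricted by fold-change and expression-count cutoffs (assumed values)."""
    calls = [
        {g for g, v in res.items()
         if v["q"] < q_cutoff
         and abs(v["log2fc"]) >= min_abs_log2fc
         and v["count"] >= min_count}
        for method, res in results_by_method.items() if method != evaluated_method
    ]
    return set.intersection(*calls)  # keep only fully concordant genes

results = {"methodA": simulate_method(2), "methodB": simulate_method(3),
           "methodC": simulate_method(4)}
# Evaluate methodA against a standard built only from methodB and methodC:
reference = build_reference_standard(results, evaluated_method="methodA")
```

Because the evaluated method never contributes to the standard it is scored against, this construction avoids the isomorphic-evaluation bias of Table 1 by design.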
5. Conclusions
Reproducibility and accuracy are central not only to Omics studies but also to precision medicine. Improving existing techniques and frameworks in single-subject studies empowers scientists to separate clinically relevant biomarkers from statistical artefacts. Transforming these initiatives into open-source software improves reproducibility and furthers the space of open precision medicine. Prior studies [2] illustrate how single-subject analyses of transcriptomes remain challenging in the absence of replicates. However, we posit that an improvement in evaluation methods, as proposed here, provides a rigorous framework for objectively assessing subsequent proposed improvements. In addition, pathway-level single-subject studies of transcriptomes have been shown to be more accurate than gene-product-level ones [3], suggesting potential future pathway-level applications of the methods we propose. This manuscript highlights four types of biases (Table 1) that confound results in both conventional analyses and the clinical translation of single-subject studies. The proposed referenceNof1, complementary to [12], follows a suite of recent work [9,12,28] in which we seek to address these challenges, resulting in a new framework for creating robust reference standards. We proposed, tested, and developed open-source software using a single strategy that reduces two additional biases: (i) Statistical distribution bias and (ii) systematic bias from isomorphic evaluations (using the same analysis in the prediction and validation sets). Despite the specific challenges posed by single-subject studies, these advances create new opportunities to combine single-subject and conventional cohort studies. In this study, we demonstrated the utility of constructing more robust reference standards in single-subject transcriptomic studies. There are multiple directions for future studies. One opportunity is a follow-up study extending these methods by incorporating ontologies to transform and aggregate gene products into pathways and gene sets, which could yield pathway-based robust reference standards. An alternate avenue is extending these tools to other ‘omics (e.g., the proteome or metabolome). Finally, future studies will include a self-learning grid search that identifies the optimal reference standard parameters. This manuscript expands upon recent work to address existing knowledge gaps and challenges in the single-subject domain, bringing our tools, technology, and analyses closer to delivering the promise of precision medicine: “the right treatment, for the right patient, at the right time.”