#### 5.1. The Reproducibility Crisis in Metabolomics Biomarker Discovery

The reproducibility crisis has prompted considerable discussion in recent years [51,52]. The phrase came to the fore when a major replication effort in psychology found that only 39% of the studies assessed were reproducible. Since then, similar replication efforts have shown that irreproducibility is a serious issue across a variety of disciplines, from cancer biology [53] to economics [54]. A 2016 survey by Nature showed that most scientific fields are facing a reproducibility crisis [55].

The terms reproducibility, replicability, reliability, robustness, and generalizability all have slightly different meanings and implications, and the study in [56] provides clarification. In this review, results reproducibility and the related phenomenon of generalizability in metabolomics biomarker discovery are discussed.

With regard to metabolomics biomarker discovery, results irreproducibility can be described as the failure of results from preliminary analyses to reproduce in validation studies, despite good performance under cross-validation and permutation testing on the initial dataset. In the field of metabolomics biomarker discovery, the majority of preliminary findings are not followed up by external validation [21]. This lack of follow-up could arise either because validation studies do not actually take place or because validation studies are carried out and produce negative results, which are not reported (or published). A recent systematic review examined biomarker discovery publications in which metabolomics biomarkers had been validated on an external test set (usually by the same group, in the same publication) and showed that apparently equivalent studies, on the same disease, obtain different biomarker lists [30].

The reasons most often cited in the literature for the reproducibility crisis in preclinical research are a lack of standards and rigor in experimental procedures, publication bias, poor data analysis techniques, questionable research methods, selective reporting of results, studies with low statistical power, and a “faulty incentive” culture in scientific and clinical research [52,57]. Solutions emanating from this view of the crisis focus on open science practices, data sharing, and increased standardisation.

These reasons notwithstanding, an alternative perspective on the reproducibility crisis has begun to appear in the literature, which suggests that expectations of reproducibility are misplaced. This view proposes that a large part of the reproducibility crisis in clinical and preclinical research can be ascribed to a failure to account for contextual sensitivity [58], phenotypic plasticity [59], and reaction norms [60]. This philosophy urges us to adjust our expectations [61] and, therefore, our analysis methods. Voelkl [62] shows that increasing variability, rather than reducing it, improves reproducibility in preclinical animal studies. An [63] describes his perception of the reproducibility crisis as a failure of typical methods of scientific investigation to account for biological heterogeneity. Describing the denominator subspace as all possible states of a biosystem, An posits that typical experiments base their models on only a sliver of the possible outcomes and that, as such, the failure of these models to be generalizable (reproducible) is inevitable.

In metabolomics biomarker discovery, as mentioned previously, validation of preliminary discovery findings is attempted in only a small minority of studies (and is almost always carried out by the same group), and validated studies by different groups on the same disease produce different lists of biomarkers [15]. However, metabolomics profiles from the same samples, analysed across different platforms by different investigators, have proved comparable even without prior standardisation [64]. Together, these findings could point towards contextual sensitivity as a possible explanation for results irreproducibility (Figure 1), as opposed to issues related to experimental standards.

#### 5.2. Multivariate Analysis and Univariate Analysis

Univariate and multivariate analysis techniques are frequently applied to metabolomics datasets and are considered to provide complementary information. Therefore, it is advised that both analysis methods be employed. Univariate techniques examine only one variable at a time, while multivariate techniques make use of covariances or correlations, which reflect the extent of the relationships among the variables. Based on the intuitive notion of gene-sets or metabolic networks, there is a general belief that multivariate analysis is superior to univariate analysis for the discovery of biomarker candidates [65].
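To make the "one variable at a time" idea concrete, the following is a minimal sketch of univariate screening, assuming Welch's t statistic as the per-feature test. The function names (`welch_t`, `rank_features`) and the toy data are illustrative assumptions rather than anything prescribed by the studies cited here. Each feature is scored independently, so no covariance between features is ever estimated.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    # Assumes at least two values per group and non-zero pooled standard error.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_features(cases, controls):
    """Score each feature on its own and rank by |t|.

    cases, controls: lists of samples, each sample a list of feature values.
    Returns (feature_index, t) pairs, largest absolute t first.
    """
    scores = []
    for j in range(len(cases[0])):
        t = welch_t([row[j] for row in cases], [row[j] for row in controls])
        scores.append((j, t))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)


# Hypothetical toy data: feature 0 differs between groups, feature 1 does not.
cases = [[5.0, 1.0], [6.0, 2.0], [7.0, 3.0]]
controls = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.5]]
ranked = rank_features(cases, controls)
print(ranked)
```

A multivariate method would, by contrast, have to estimate relationships among the features as well, which is where the low sample numbers discussed in the next paragraph become limiting.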

However, Lai et al. show that univariate selection approaches yield generally better results than multivariate approaches across the majority of the gene expression datasets that they analysed [50]. The reason for this, they conclude, is that correlation structures, even if they are present, cannot be extracted reliably due to low sample numbers. Another study shows that, for gene extraction, multivariate statistics do not lead to a substantial gain in power compared with univariate statistics, even when correlations are present and high [66].

A multivariate search for biomarkers using supervised algorithms rests on the underlying assumption of homogeneity, with multiple discriminating features together representing the fingerprint of the disease phenotype. However, a CHD may not contain a hidden fingerprint for disease that is common to most or all cases. It may instead exhibit perturbations in some features, in some cases, that reflect the different etiologies of the hidden subgroups. In that situation, the biomarker(s) from these different subsets of the disease are difficult to reveal using global methods, due to their low overall prevalence among cases in the dataset [28].
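The dilution effect described above can be illustrated with a simple mixture calculation. This is a sketch under assumptions that are not taken from the cited work: a single feature has mean $\mu$ and variance $\sigma^2$ in controls, a fraction $\pi$ of cases carries a shift $\delta$ while the remaining cases are distributed like controls, and the two groups are of equal size. Under these assumptions, the group-level mean difference and the case variance are

$$
\mathbb{E}[x_{\text{case}}] - \mathbb{E}[x_{\text{control}}] = \pi\delta,
\qquad
\operatorname{Var}(x_{\text{case}}) = \sigma^2 + \pi(1-\pi)\delta^2 ,
$$

so the standardised difference based on the pooled variance is

$$
d = \frac{\pi\delta}{\sqrt{\sigma^2 + \tfrac{1}{2}\,\pi(1-\pi)\,\delta^2}} .
$$

For example, with $\pi = 0.1$ and $\delta = 3\sigma$, this gives $d = 0.3/\sqrt{1.405} \approx 0.25$. The affected cases sit three control standard deviations from the control mean, yet the group-level effect that a global method sees is small.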

Multivariate techniques, particularly statistical learning techniques, are the de facto data analysis methods employed for biomarker discovery in metabolomics. Instability and abstraction have been identified as the two fundamental issues contributing to the failure of biomarkers obtained from statistical learning approaches, and these problems are “here to stay” due to characteristics inherent in omics datasets. Instability is due to the curse of dimensionality and the “small n large p” problem. Abstraction describes the data-driven nature of the algorithms, which yield complex decision rules that generally lack the meaning that biologists and clinicians need to “generate testable hypotheses” [38].

Partial Least Squares Discriminant Analysis (PLSDA) is the most popular learning algorithm used in metabolomics biomarker discovery. Its widespread availability in omics analysis software is credited (or blamed) for the algorithm’s ubiquity in metabolomics biomarker discovery studies [67]. PLSDA, however, has been described as an “algorithm full of dangers”: it is prone to overfitting and, therefore, to producing false positive results [67,68,69]. PLSDA is also susceptible to misuse in the hands of non-experts [67]. For example, supervised algorithms such as PLSDA can yield biased estimates of prediction accuracy when feature selection is not repeated from scratch in each iteration of the cross-validation loop. This analysis flaw was found to be common in a review of microarray statistical analyses [70]. It is reasonable to assume that such misuse of PLSDA and similar algorithms is also an issue in metabolomics data analysis. However, poor reporting makes assessing the extent of this problem difficult [36]. Finally, and of particular importance for the analysis of complex and heterogeneous datasets, Eriksson et al. note that “a necessary condition for PLSDA to work is that each class is tight and occupies a small and separate volume in X-Space... Moreover, when some of the classes are not homogeneous and spread significantly in X-space, the discriminant analysis does not work” [71]. Clearly, this has serious implications for the use of PLSDA to analyse CHD datasets.
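The cross-validation flaw mentioned above, selecting features once on the full dataset rather than afresh inside each fold, can be sketched as follows. This is an illustrative example under stated assumptions and not the procedure of any cited study. A nearest-centroid classifier stands in for PLSDA, features are ranked by the absolute difference in class means, and the data are pure noise generated with a fixed seed. All function names and parameter values are hypothetical.

```python
import random


def top_k_features(X, y, k):
    """Indices of the k features with the largest absolute class-mean difference."""
    def gap(j):
        a = [x[j] for x, lab in zip(X, y) if lab == 1]
        b = [x[j] for x, lab in zip(X, y) if lab == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x_new, feats):
    """Assign x_new to the class whose centroid is closest on the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in feats]
        dist = sum((x_new[j] - c) ** 2 for j, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy with feature selection inside or outside the loop."""
    # Outside-loop selection sees every sample, including each held-out one.
    outside = top_k_features(X, y, k)
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_k_features(X_tr, y_tr, k) if select_inside else outside
        hits += nearest_centroid_predict(X_tr, y_tr, X[i], feats) == y[i]
    return hits / len(X)


# Pure-noise data: 20 samples, 200 features, labels unrelated to the features.
rng = random.Random(0)
y = [0, 1] * 10
X = [[rng.gauss(0, 1) for _ in range(200)] for _ in y]
acc_outside = loo_accuracy(X, y, 10, select_inside=False)
acc_inside = loo_accuracy(X, y, 10, select_inside=True)
print("selection outside loop:", acc_outside)
print("selection inside loop: ", acc_inside)
```

Because the labels carry no information about the features, no classifier should be expected to generalise here. An accuracy well above chance from the outside-loop variant would therefore reflect information leaked from the held-out samples during selection, not real predictive signal.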

PLSDA and other classification algorithms may have most success in metabolomics biomarker discovery for CHD in two scenarios: (1) if the dataset under study represents a subtype of the complex disease, so that it is essentially homogeneous; and (2) if, at the time of bio-specimen sampling, the etiological pathways had converged sufficiently that the samples under analysis exhibited the same pathway, so that the dataset would again be essentially homogeneous at that time point. However, the first scenario is unlikely to hold if the phenotypic classification does not, in fact, represent a biologically unique category but instead a lumping together of cases due to similar phenotypes. The second scenario is unlikely to hold if the sample timing captures metabolomic activity at different stages along the various etiological pathways of the disease, or too early in the disease etiology, at a point where the pathways have not yet converged.

While multivariate modelling methods, such as PLSDA, often identify discriminating variables that lead to good classification upon cross-validation and permutation, these variables are unlikely to be generalizable beyond the original system due to the complexity of the system from which they are derived. The discriminating variables identified in this scenario are likely to be highly context specific and might, in fact, not even be a revelation of real world phenomena but may, instead, merely represent “a tautological relationship between a set of numbers” [72] derived from a static snapshot analysis of a dynamic system.

An [63] eloquently defines what he calls The Denominator Problem in biomedical research as the intrinsic inability of reductionist experimental biology to effectively reflect system denominator space, where the denominator is defined as the population distribution of the total possible behaviour/state space, as described by whatever metrics are chosen, of that biosystem. He explains how the requirements of experimental biology serve to constrain the sampling of denominator space, which leads to extreme sensitivity to conditions and to irreproducibility. The author proposes that increased complexity and sophistication, in the form of multi-scale mathematical (MSM) and dynamic computational models that account for the denominator subspace, will overcome this problem. Geman [38] also suggests that global mathematical models over network-scale configurations of genomic states and molecular concentrations will overcome the failure of omics-based predictors and signatures to translate to clinical use.

Somewhat conversely, Karpievitch et al. describe how a major issue affecting the success of biomarker discovery in metabolomics is that the long line from pre-processing to statistical inference in data analysis is costly in terms of loss of degrees of freedom and accumulation of bias and errors [34]. At every step, there is a loss of degrees of freedom as we “use up” information in the dataset. Variability introduced at each step is not communicated to the next step, which can lead to over-fitting of the data. Karpievitch et al. [34] suggest that the most natural solution to this problem is to carry out processing and inference in the same step, as this will greatly increase the quality of results by limiting the amount of bias communicated from one step to another. If this approach is not achievable, then shortening the data analysis pipeline, where possible, will result in fewer decisions to be made and, consequently, less opportunity for the introduction of bias, thereby increasing the possibility of obtaining meaningful results.

Therefore, it may be the case that the data analysis techniques typically employed in biomarker discovery are at once too complex and not complex enough to tackle the biological heterogeneity issue. Specifically, statistical learning techniques, such as PLSDA, may be inadequate to model the heterogeneity and complexity of the “denominator subspace” and, at the same time, may be overly complex, leading to loss of information from an already overtaxed source. The loss of degrees of freedom may not be adequately balanced by a gain in information. Until the sophistication of the methods improves (i.e., until multi-scale modelling can be achieved), the limitations of what can be accessed from a static representation of a dynamic system need to be recognised and accepted. While not sophisticated, low-constraint and low-complexity analysis methods for biomarker search may demonstrate improvements in generalisability. Figure 2 presents a schematic of this proposed notion.

As an example, in the investigation of clinically useful biomarkers from a CHD metabolomics dataset, features that exhibit stable expression across all controls are a rich source of clinically applicable information. In a CHD dataset, the case group has hidden substructure due to unknown subgroups of disease, whereas in the controls, at least for the outcome of interest, that hidden substructure does not exist. A feature that exhibits low variance, or stable expression, across a group of healthy controls and is perturbed in at least some cases is likely to be informative. More importantly, that information is likely to be generalizable beyond the study at hand, since the feature has shown stability across a group of patients (the controls) and, as such, has demonstrated that it is impervious to context sensitivity (in the actual study at least). This would be a mechanism that simultaneously leverages prior information in model building, which constrains model complexity, and overcomes the issue of contextual sensitivity.
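The control-anchored idea in this paragraph can be sketched as a simple screen. This is a minimal illustration under assumptions that are not specified in the text: stability is measured by the coefficient of variation in controls, perturbation is measured as a z-score against the control distribution, and the thresholds (`max_control_cv`, `z_cut`, `min_fraction`) are hypothetical values chosen only for the example.

```python
from statistics import mean, stdev


def control_anchored_screen(cases, controls, max_control_cv=0.1,
                            z_cut=3.0, min_fraction=0.1):
    """Flag features that are stable in controls yet perturbed in some cases.

    cases, controls: lists of samples, each sample a list of feature values.
    Returns (feature_index, fraction_of_cases_perturbed) pairs.
    """
    hits = []
    for j in range(len(controls[0])):
        ctrl = [row[j] for row in controls]
        mu, sd = mean(ctrl), stdev(ctrl)
        # Skip features with an undefined z-score or CV, or unstable controls.
        if mu == 0 or sd == 0 or sd / abs(mu) > max_control_cv:
            continue
        # Count cases lying far outside the control distribution.
        outliers = sum(abs(row[j] - mu) / sd > z_cut for row in cases)
        fraction = outliers / len(cases)
        if fraction >= min_fraction:
            hits.append((j, fraction))
    return hits


# Hypothetical toy data: feature 0 is tight in controls, feature 1 is not.
controls = [[10.0, 5.0], [10.2, 9.0], [9.8, 2.0], [10.1, 12.0], [9.9, 7.0]]
cases = [[10.1, 6.0], [9.9, 8.0], [14.0, 3.0], [10.0, 11.0]]
flagged = control_anchored_screen(cases, controls)
print(flagged)
```

The screen never fits a decision boundary across the whole case group, so a feature perturbed in only one of four cases is still reported, which is the behaviour the paragraph argues for when cases contain hidden subgroups.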