Integrating NMR and MS for Improved Metabolomic Analysis: From Methodologies to Applications

Homobono Brito de Moura, Patricia; Leleu, Guillaume; Da Costa, Grégory; Marti, Guillaume; Pétriacq, Pierre; Valls Fonayet, Josep; Richard, Tristan

doi:10.3390/molecules30122624

Open AccessReview

Integrating NMR and MS for Improved Metabolomic Analysis: From Methodologies to Applications

by

Patricia Homobono Brito de Moura

^1,2,3

,

Guillaume Leleu

¹

,

Grégory Da Costa

¹,

Guillaume Marti

^4,5

,

Pierre Pétriacq

^2,3

,

Josep Valls Fonayet

^1,2

and

Tristan Richard

^1,*

¹

Bordeaux INP, INRAE, Bordeaux Sciences Agro, OENO, UMR 1366, ISVV, University of Bordeaux, 33140 Villenave d’Ornon, France

²

Bordeaux Metabolome, MetaboHUB, PHENOME-EMPHASIS, 33140 Villenave d’Ornon, France

³

UMR 1332 Fruit Biology and Pathology, INRAE, University of Bordeaux, 33882 Villenave d’Ornon, France

⁴

Metatoul-AgromiX Platform, Laboratoire de Recherche en Sciences Végétales (UMR 5546), Université deToulouse, CNRS, INP, 31320 Auzeville-Tolosane, France

⁵

MetaboHUB-MetaToul, National Infrastructure of Metabolomics and Fluxomics, 31077 Toulouse, France

^*

Author to whom correspondence should be addressed.

Molecules 2025, 30(12), 2624; https://doi.org/10.3390/molecules30122624

Submission received: 26 May 2025 / Revised: 10 June 2025 / Accepted: 11 June 2025 / Published: 17 June 2025

(This article belongs to the Special Issue Feature Papers in Food Chemistry—3rd Edition)

Download

Browse Figures

Versions Notes

Abstract

Metabolomics, the comprehensive analysis of low-molecular-weight metabolites (typically below 1500 DA) in biological systems, relies heavily on mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy. Each technique has inherent strengths and weaknesses. MS offers high sensitivity and is commonly coupled with chromatography to analyze complex matrices, yet it is destructive, has limited reproducibility, and provides limited structural information. NMR, while less sensitive, is non-destructive and enables structural elucidation and precise quantification. Recent studies increasingly employ data fusion (DF) strategies to combine the complementary information from NMR and MS, aiming to enhance metabolomic analyses. This review summarizes DF methodologies using NMR and MS data in metabolomics studies over the past decade. A comprehensive search of SciFinder, Scopus, and Clarivate Web of Science databases was conducted to analyze fusion techniques, methods, and statistical models. The review emphasizes the growing importance of DF in metabolomics, showing its capacity to provide a more comprehensive view of biochemical processes across diverse biological systems, including clinical, plant, and food matrices.

Keywords:

metabolomics; data fusion; nuclear magnetic resonance (NMR); mass spectrometry (MS); multi-omics

1. Introduction

Metabolomics, a branch of the omics sciences, focuses on the comprehensive characterization of the metabolome, defined as the complete set of low-molecular-weight metabolites within a given biological system [1]. Since its initial description by Nicholson et al. [2], metabolomics has been widely employed across various scientific disciplines in the post-genomic era. While its applications are most frequently associated with human health [3,4,5], the approach has also proven valuable in studies involving animals, microorganisms, algae and plants, environment [6], and food systems [7], thereby offering tangible benefits to medicine, agriculture, and the food industry [8].

Among the analytical platforms most commonly used in metabolomics are mass spectrometry (MS) and nuclear magnetic resonance (NMR), both of which offer distinct advantages that enhance the robustness of the results [9,10,11,12]. Each technique, however, presents inherent limitations. MS is a highly sensitive method, particularly popular in food science applications [13]. It is typically coupled with liquid or gas chromatography to enable the analysis of complex matrices, facilitating the detection and quantification of trace metabolites in biological and environmental samples, food products, and biofluids [14]. Nonetheless, MS is a destructive technique which depends on the ionization properties of compounds and does not permit complete structural elucidation. In contrast, NMR exhibits lower sensitivity than MS [15] and is generally limited to quantifying the most abundant metabolites in a sample. However, it is non-destructive and provides valuable information for structural elucidation and accurate metabolite quantification [16].

Recent advances in metabolomics have increasingly combined NMR- and MS-based approaches to harness their complementary strengths. For instance, Ghafar et al. applied UHPLC-MS and ¹H-NMR to profile Phyllantus acidus leaf extracts, identifying key metabolites associated with antioxidant, α-glucosidase, and nitric oxide inhibitory activities [17]. Similarly, Razali et al. used these same techniques to explore the entomological origin of stingless bee honeys, successfully identifying species-specific markers and achieving high levels of classification accuracy [18]. Although both studies demonstrated the analytical power of each platform, the datasets were processed and evaluated separately, thereby maintaining the predictive value of each platform on its own [10]. More recently, research efforts have focused on employing data fusion (DF) strategies to construct more robust and informative models by integrating results from independently acquired metabolomics datasets [19]. DF has emerged as a growing trend, particularly as modern laboratories increasingly operate as multi-platform environments capable of investigating complex samples through a variety of analytical techniques.

In the light of this shift toward integratives approaches, increasing attention has been devoted to data fusion strategies that combine NMR and MS in metabolomics. These methodologies offer a more holistic view of biochemical profiles and have contributed to enhaced interpretation of complex biological systems. To capture the current state of this envolving field, this review aims to summarize publications from the past ten years that utilize DF strategies integrating these analytical techiniques in metabolomics-based studies. For this purpose, several databases, namely SciFinder, Scopus, and Clarivate Web of Science, were consulted to collect detailed information on the fusion strategies, analytical methodologies, and statistical models applied across these studies.

2. Data Fusion in Metabolomics

DF is a multidisciplinary field that allows the integration of different datasets obtained using various independent techniques to provide better insights than each approach alone [20]. In analytical chemistry, and more specifically within the field of metabolomics, the most widely used classification for data fusion is based on levels of abstraction, termed low-, mid-, and high-level data fusion, as illustrated in Figure 1 and supported by recent publications [10,21,22,23,24]. These levels represent a progression in data handling complexity: from the direct concatenation of raw or pre-processed matrices (low-level), to the integration of extracted features (mid-level), and finally to the combination of model outputs (high-level) [25]. This hierarchical representation reflects the balance between level of detail, interpretability, and computational effort. The following sections provide a detailed overview of each strategy, including their strengths, limitations, and suitability for metabolomics applications. The specific characteristics, advantages, and limitations of these approaches are detailed in the following sections.

2.1. Low-Level DF

Low-level data fusion (LLDF), also referred to as block concatenation [26], represents the most straightforward strategy for integrating data from different sensors. This approach enables the concatenation of two or more data matrices originating from different sources [21,23]. LLDF may be applied to raw data or to data that have undergone initial pre-processing steps [23]. In this regard, Campos & Reis detailed that this pre-processing can be divided into three stages [27]: (1) pre-processing the data by correcting the artefacts due to the signal acquisition for each sensor; (2) equalizing the contributions of each set of data collected from the different analytical sources (block) using methods such as mean centring or unit variance scaling; (3) correcting the weights of each block from the different analytical sources. Concatenation analysis tends to relate more to the block with the most significant (co)variance if not correctly handled.

In this context, the first two steps are related to intra-block scaling, while the third focuses on inter-block equalization. Forshed et al. also discussed many different intra- and inter-block scaling strategies in a concatenated approach of ¹H-NMR and LC-MS datasets [26]. According to the cross-validation results, the authors reported that Pareto scaling (

\frac{1}{\sqrt σ^{2}}

) was the most suitable for intra-block normalization, while inter-block normalization was performed by adjusting weights to provide equal sums of standard deviation (

\frac{1}{{(\sum σ)}_{b l o c k}}

). However, this is not a general point, as the suitability of a scaling or normalization method depends on the sctructures and variability of each dataset. Tools like score plots, explained variance, and distance to the model can serve as objective criteria to evaluate whether the applied pre-processing improves model performance and preserves the intrinsic structure of the data.

LLDF can be explored using both supervised and unsupervised methods. From this perspective, unsupervised approaches, such as Principal Component Analysis (PCA), aim to identify common and unique patterns across datasets without relying on prior outcome information. Conversely, supervised techniques, such as Partial Least Squares regression (PLS), seek to maximize covariance within the fused matrix while integrating sample class information. Nevertheless, single-block chemometric methods like PLS or PCA are often insufficient to address the complexities of concatenation analysis. As a result, advanced methods have been developed to better integrate heterogeneous data sources. Mishra et al. provided an extensive review of sequential multiblock strategies, in the context of LLDF, applied in both exploratory and predictive contexts [28].

2.2. Mid-Level DF

Smolinska et al. highlight that a drawback of low-level data fusion is related to the number of observations [23], which is often much smaller than the number of variables. A highly effective way to overcome this adversity is through dimensionality reduction of the matrices separately, and this is precisely the aim of so-called mid-level data fusion (MLDF). It can be described as a two-step methodology, which (1) aims to extract the most important characteristics from the considered matrices and then (2) concatenates the outputs to build a single matrix to be processed [21,22].

Among the possible techniques for reducing matrix dimensions, PCA seems to be the most popular, with a view to subsequent concatenation to obtain a merged model. However, there are certain specificities for this strategy to be feasible: it is usually applied to first-order data [22]. For second-order data, other methods need to be employed. When they can be arranged in a data cube, the most well-known factorization method is parallel factor analysis (PARAFAC), which is able to decompose matrices into trilinear components to extract scores [29,30]. Additionally, other factorization methods have been developed to address some limitations of PARAFAC, such as the presence of artefacts in the matrices or missing data estimation problems such as PARAFAC2 [31], multivariate curve resolution–alternating least squares (MCR-ALS) [32,33], or more recently, multimodal multitask matrix factorization (MMMF) [34].

2.3. High-Level DF

According to Azcarate et al. [21], this is the least employed data fusion approach in analytical chemistry studies, justified by its level of complexity. Briefly, while low-level data fusion can concatenate matrices and mid-level fusion merge features, high-level fusion, also known as decision-level fusion, is able to combine previously calculated models in order to improve prediction performance and reduce the uncertainty of the final combined result [35]. These values can be qualitative, as in classification models, or quantitative, as in regression models. Typical approaches include Heuristic rules, Bayesian consensus methods, and fuzzy aggregation strategies [20,35].

HLDF is particularly advantageous when integrating heterogeneous analytical platforms such as NMR and MS, which differ in dimensionality, scale, and pre-processing requirements. Rather than fusing variables directly, this approach aggregates model-level outputs using strategies like majority voting, probabilistic averaging, or supervised meta-modeling. A relevant application is the multiblock DD-SIMCA method described by Rodionova and Pomerantsev [36], in which full distances from individual models are combined into a single cumulative metric known as the Cumulative Analytical Signal (CAS). This strategy allows the preservation of interpretability and enables the contribution of each data block to be traced in the final classification. Although HLDF may introduce interpretive complexity and may not fully exploit interactions between variables from different sources, it remains a robust and flexible framework in multiblock chemometric modeling, especially in applications involving authentication, quality assurance, and the fusion of spectroscopic and sensor-based data.

2.4. Potential and Limitations of DF Levels in Data Integration

Building upon the conceptual distinctions described above, it becomes clear that selecting the most appropriate data fusion strategy is a critical step for maximizing the analytical value of combined datasets, such as those generated by NMR and MS platforms. Each level of DF offers distinct advantages and presents specific challenges as already discussed, particularly when applied to metabolomics studies seeking to extract complementary biochemical information from structurally diverse data sources. LLDF, while easy to implement, may lead to unbalanced contributions if block variance is not properly normalized. MLDF addresses this by reducing dimensionality prior to merging, which facilitates interpretation but depends on the suitability of the feature extraction method. HLDF, which combines outputs from separately trained models, offers flexibility for integrating heterogeneous datasets and is particularly useful in classification and authentication tasks.

Despite its potential, the use of HLDF in metabolomics studies remains limited compared to other strategies, to the best of our knowledge. This lower adoption may reflect the need for independent model validation [22], the demand for advanced statistical expertise, and the challenge of interpreting results from aggregated outputs [23]. Such inconsistencies are particularly problematic in metabolomics, where model outputs must often be interpreted in terms of underlying biochemical mechanisms or pathways. Discrepant predictions may therefore hinder biological interpretation and compromise mechanistic insight. Moreover, the use of HLDF may require the application of decision rules, such as voting or consensus strategies, particularly when the combined models provide conflicting or contradictory information [37]. Identifying and resolving such divergences is not always straightforward and demands both methodological rigor and a solid understanding of the biological context.

Table 1 summarizes the main strengths and limitations of each data fusion level, supporting the selection of the most appropriate approach for data integration.

3. NMR-MS Data Fusion Applied in Metabolomics Studies

Numerous studies in the literature have combined NMR and MS data, as this integrated approach offers broader coverage of the metabolome and enhances the identification of biomarkers that are specific to each technique. However, in multiplexed analytical contexts, these studies often evaluate the datasets independently, without fully integrating them into a unified model [38,39,40]. The primary aim of this review is to highlight research that has successfully merged NMR and MS data, with a particular focus on the methodological strategies employed at different levels of data fusion.

In recent years, the number of studies employing data fusion strategies has grown significantly. To illustrate this trend, a search was conducted in the Scopus database using three sets of keywords: (1) (multiblock OR “data fusion”) AND MS AND NMR; (2) (multiblock OR “data fusion”) AND MS; and (3) (multiblock OR “data fusion”) AND NMR. These searches retrieved studies not only integrating MS and NMR data, but also combining these techniques with other spectroscopic outputs such as near-infrared (NIR) and ultraviolet (UV) data. The results of this literature search are presented in Figure 2, which shows that studies based solely on MS continue to rise and represent the largest share of publications. In contrast, NMR-based studies show a more stable pattern, with no significant increase over time. Meanwhile, the combined use of MS and NMR has shown an upward trend in recent years, although still with a lower absolute number of studies. This reflects a growing interest in integrative analytical strategies and highlights the perceived value of combining complementary platforms to enhance metabolome coverage and enrich biological interpretation.

Although fewer in number, studies employing multiblock analysis with NMR and MS are also showing a clear upward trend. Understanding the latest developments in this area is therefore essential for guiding future research that integrates data from these two key analytical platforms in metabolomics.

Table 2 provides an overview of NMR-MS data fusion studies published over the past ten years. The table highlights the fusion strategy adopted, the analytical techniques applied, and specific details on data processing for integration. In general, these publications exhibit a predominant focus on untargeted studies involving food matrices, plants, and clinical samples. Most of the studies utilized outputs from one-dimensional ¹H-NMR and two-dimensional (¹H-¹H) J-resolved spectroscopy, in combination with mass spectrometry coupled to chromatographic systems such as gas chromatography (GC) and liquid chromatography (LC); one study employed direct injection (DI) mode [41].

Among the reviewed studies, LLDF was the most commonly reported strategy (18 occurrences), followed by MLDF, mentioned 6 times. As anticipated based on current trends in analytical chemistry, and to the best of our knowledge, no studies have yet applied HLDF approaches combining NMR and MS.

Given the widespread use of low-level data fusion (LLDF), it is important to examine how studies have addressed the statistical challenges associated with high dimensionality (p >> n). Table S1 summarizes the number of samples used, whether dimensionality issues were acknowledged, and the strategies applied to mitigate rank deficiency. In many cases, LLDF was implemented in settings with relatively small sample sizes (often below 100 observations), which increases the relevance of robust statistical approaches. A variety of practices were observed across the reviewed studies: while some relied primarily on general pre-processing procedures (e.g., autoscaling, UV-scaling, or block scaling), others integrated more targeted strategies to manage dimensionality. Notable examples include studies [43,50], which employed sparse or kernel-based multiblock models; [45,48,49,55], which used PCA as a dimensionality reduction step prior to supervised modeling; [56], which combined MCIA with penalized sGCCA; and [52,58], which implemented ComDim to extract global components and reduce inter-block redundancy. This range of approaches highlights the value of incorporating dimension-aware strategies to maximize model robustness in high-dimensional fusion contexts.

3.1. Importance of Evaluating Predictive Performance in NMR-MS Data Fusion

Evaluating the predictive performance of classification models is a fundamental step in assessing the effectiveness of NMR-MS integrative strategies. This analysis becomes particularly relevant when considering the growing complexity of multiblock datasets and the need to justify the integration of distinct analytical platforms. While many studies illustrate the benefits of combining NMR and MS, not all of them systematically report performance metrics that allow for clear comparisons between models built from individual datasets and those developed from fused data.

In this review, predictive accuracy was primarily assessed through Q² values or overall classification rates. However, reporting practices varied across publications. Some studies [41,50,51] chose to focus exclusively on exploring innovative data fusion approaches, without including performance comparisons with single-block models. Other studies do provide comparisons between models built on individual datasets and those based on fusion. In some cases, however, this comparison is addressed more descriptively, for example by noting separation in coordinate score plots (such as PC1 × PC2) or by referencing permutation tests, rather than through detailed numerical metrics. While these approaches offer useful insights into data structure, they may not be sufficient for evaluating the relative robustness of the predictive models.

The box plot in Figure 3 summarizes classification results extracted from studies [41,42,49,50,51,54,56,57,58,59]. These results reflect generally strong performance, but also highlight the diversity of modeling strategies, fusion levels and validation methods used across the literature. Some papers included a direct comparison of classification performance before and after data fusion, while others chose to emphasize specific fusion strategies or biological interpretations without contrasting them with the original data blocks. Both directions are scientifically valid and contribute important perspectives. Nonetheless, when the research involves two complementary datasets, explicit comparison of their combined and individual predictive capacities provides a clearer understanding of the real benefits and potential trade-offs of data fusion.

3.2. NMR and MS DF in Body Fluids—Clinical Studies

Only one study on body fluids was performed using a targeted approach [42], as shown in Table 2. This study aimed to investigate metabolic differences in the urine of children diagnosed with autism spectrum disorder (ASD) compared to a healthy control group. The selection of metabolites was guided by profiles previously reported in similar populations. LLDF was applied to pre-processed datasets obtained from one-dimensional D ¹H-NMR, 2D ¹H-¹³C Heteronuclear Single Quantum Coherence Spectroscopy (HSQC) NMR, and LC-MS analyses, in both positive and negative ionization modes. The data matrices were concatenated and explored using PCA and OPLS-DA to uncover systematic variations among the integrated datasets. Model quality was assessed based on explained variance for predictors and responses (R²X and R²Y), as well as predictive performance (Q²). Statistical significance was evaluated using cross-validation analysis of variance (CV-ANOVA). This targeted strategy enabled the detection of 377 metabolites (80 from NMR and 297 from MS), among which 27 biomarkers were highlighted using variable importance in projection (VIP) scores obtained from the OPLS-DA models. These biomarkers contributed to characterizing metabolic profile alterations in Lebanese children with ASD, achieving sensitivity and specificity values above 80%.

Marshall et al. [40] conducted another noteworthy clinical study that integrated ¹H-NMR and ESI-MS data acquired in direct injection mode (DI-ESI-MS) to investigate a neurotoxin associated with dopaminergic cell death. In this study, the authors employed the MVAPACK toolbox [62] and combined the data matrices using a multiblock approach through both supervised and unsupervised models, namely MB-PCA and MB-PLS. To ensure balanced contributions from each dataset, the blocks were scaled according to the square root of the number of variables in each matrix. For comparison, PCA and PLS were also applied separately to the individual NMR and MS datasets. All models underwent cross-validation, while only the supervised models were subjected to significance testing.

The results demonstrated that the multiblock models achieved superior separation between control and neurotoxin-treated groups compared to models based on individual datasets. These approaches also showed higher sensitivity and specificity. A particularly innovative aspect of the study was the use of backscaled loadings from MB-PLS-DA for biomarker characterization—an advanced and effective method for identifying relevant compounds. Additionally, the authors provided a detailed protocol for the combined use of ¹H-NMR and DI-MS, including sample preparation procedures and optimization of ion source conditions to enhance metabolomic analysis.

3.3. NMR and MS DF in Natural Product Matrices

Regarding studies involving plant matrices, Zanatta et al. investigated the seasonal variation in the chemical composition of Terminalia catappa leaves [46]. Ethanolic extracts from the leaves were analyzed using two-dimensional ¹H-¹H J-resolved (J-res) NMR and UHPLC-ESI-HRMS, applying an LLDF strategy to combine the outputs. Prior to fusion, each dataset underwent intra-block scaling according to established protocols [47,48]. In this study, the fused matrix, along with the individual data matrices, was analyzed using block-wise modeling in SIMCA^® 17. Both supervised multivariate techniques—PLS-DA and OPLS-DA—were employed to evaluate group separations. Model validation and assessment of potential overfitting were performed via permutation tests (n = 100), ensuring the reliability of the classification models.

The results demonstrated that DF, when integrated with multivariate analysis, supported and complemented the insights obtained from individual datasets. Moreover, the fusion approach facilitated the identification of key metabolites associated with the observed chemical variations in T. catappa leaves across different environmental conditions and seasons. Notably, the study also highlighted the contribution of NMR-MS data fusion in guiding biomarker annotation. However, it is important to mention that, in the case of NMR data, the discriminant features were annotated only at the compound class level.

3.4. NMR and MS DF in Food Matrices

Across studies involving body fluids and natural product matrices, there is a clear predominance of supervised modeling approaches, likely due to the higher degree of similarity among the variables detected in these sample types. In contrast, publications on food matrices show a more balanced distribution between supervised [54,59,63] and unsupervised models [58,60,61]. This balanced use of supervised and unsupervised approaches in food studies may reflect a greater potential for natural clustering of samples, which is less evident in body fluids and plant matrices, where supervised models are more frequently required to achieve group discrimination.

An illustrative example is the investigation of two cold plasma processes applied to camu-camu (Myrciaria dubia) juice, focusing on the beverage’s flavor and aroma profiles through GC-MS and ¹H-NMR analyses [55]. The authors justified the use of both techniques based on their complementary ability to evaluate volatile and non-volatile constituents of the juice. Following ASCII conversion, numerical matrices were generated for each dataset and exported to PLS Toolbox™ for subsequent pre-processing steps [60,64]. Then, the resulting matrices were modeled using PCA and PLS-DA, with the PLS-DA results corroborating the patterns observed in the PCA. To assess the model’s classification performance, a confusion matrix was applied. As part of the exploratory chemometric evaluation, PLS-DA was used to classify three sample groups: two corresponding to the distinct cold plasma treatments and one control group. This model explained 63% of the total variance using just two latent variables. The study confirmed that the integrated analysis yielded superior discrimination between processing methods compared to the individual analyses. Notably, the same substance-level variabilities detected in the separate datasets were also captured—and further enhanced—through the fused analysis.

A particularly relevant contribution was made by Lanza et al. [54], evaluating the discriminative capacity of fatty acid profiles (GC-MS) and polar metabolites (¹H-NMR) in milk samples from three groups of cows fed different forage-based diets: corn silage (MS), grass–legume silage and corn (GMS), and grass hay with alfalfa (HAY). The analysis was performed using canonical Discriminant Analysis (CDA) as a supervised model, implemented in SAS software. Initially, the CDA model was trained using features from the GC-MS and NMR datasets, with the data split into training (80%) and test (20%) sets based on features with statistical significance (p < 0.05), using the diet groups (MS, GMS, and HAY) as prediction factors. The trained model was then applied to the LLDF matrix, and the degree of dissimilarity among forage groups was assessed through D²-Mahalanobis distances, using leave-one-out cross-validation. The results showed that when applied to the individual GC-MS and NMR datasets, CDA achieved only moderate discriminant capacity overall, with D² values ranging from 2.6 to 31.7 (p < 0.001). Nevertheless, both datasets correctly separated the HAY group in the validation step. These outcomes were significantly improved through LLDF, which enhanced the model’s predictive performance, yielding higher D²-Mahalanobis values (7.2 to 51.2). In addition, the cross-validation step confirmed that the CDA model applied to the fused dataset was more robust and reliable, achieving greater specificity and sensitivity in authenticating milk samples from cows fed the HAY-based diet. Among the studies reviewed, this later was the only one that employed CDA as the primary discrimination strategy. Its use was associated with the advantage of generating interpretable canonical functions and clear group separation [65,66].

However, while LLDF increased robustness and discriminatory power, particularly in authenticating HAY-based milk samples, it also introduced challenges inherent to this fusion level, such as increased dimensionality and potential redundancy. These factors can complicate interpretability and increase the risk of overfitting if not properly managed. Nonetheless, the results confirmed that combining GC-MS and NMR data through LLDF provided a more reliable classification framework than either dataset alone, highlighting the promise and complexity of data fusion strategies in compositional food analysis.

Still, within the scope of food matrix studies, Jin et al. [58] explored the geographical origin of Taiping Houkui green tea using ¹H-NMR (from polar and apolar extracts) and headspace solid-phase microextraction coupled with gas chromatography–mass spectrometry (HS-SPME-GC-MS). The authors employed both LLDF and MLDF approaches and compared them with the independent datasets using MATLAB 9.5. For LLDF, two strategies were used: (1) simple concatenation and (2) Common Dimensions (ComDim) [67]. For MLDF, two feature extraction strategies were explored: (1) variables extracted from 3D PCA loading plots and (2) the top 10 variables based on PCA scores.

In this exploratory analysis of the individual datasets, HS-SPME-GC-MS produced better clustering performance, while ¹H-NMR (polar extracts) data yielded the least promising results. However, none of the independent techniques were able to distinguish green tea samples by origin. In contrast, ComDim-PLS loading plots from the data fusion approaches successfully identified the key metabolites responsible for differentiating the tea samples and highlighted the specific contribution of ¹H-NMR (polar extracts) data to the overall model. These findings reinforce the idea that complementary information across datasets can yield synergistic effects. Another important aspect of the study was the application of support vector machines (SVMs) to model both LLDF and MLDF data, using linear and radial basis function (RBF) kernels. The dataset consisted of 60 samples, split into 75% training and 25% test sets. Results showed that all fusion-based models outperformed the individual datasets, with classification accuracies reaching up to 93.3%. This work presented an innovative and elegant data processing pipeline, integrating data fusion with machine learning to improve the authentication of green tea origin.

Continuing within this framework of integrating DF and machine learning, a recent study classified different vintages of the Chinese beverage Baijiu through low- and mid-level fusion of GC-MS and ¹H-NMR datasets [57] compared with non-fused data, using Python 3.11 in the PyCharm 2020.3.2 environment and R 4.1.1 in the RStudio environment. Data processing and modeling were conducted in Python using pandas for data handling, scikit-learn for implementing machine learning algorithms and cross-validation, and matplotlib for data visualization. Meanwhile, mixOmics in R was used for multivariate analysis and for assessing variable importance via VIP scores.

In this study, for low-level data fusion (LLDF), the datasets were directly concatenated. Subsequently, several machine learning models were applied, including support vector machines (SVMs), neural networks (NNs), k-nearest neighbors (k-NN), Decision Trees (DTs), and Random Forests (RFs). In the mid-level data fusion (MLDF) approach, dimensionality reduction was first performed using PCA, PLS-DA, DT, and RF, followed by modeling with the same set of algorithms as in the LLDF approach. In addition to fusion-based models, individual datasets (GC-MS or ¹H NMR alone) were also evaluated using the same classification algorithms. Classification performance was assessed using four key metrics: accuracy, precision, recall, and F1-score (Figure 4).

The comparative results clearly highlight the advantage of MLDF approaches over the use of individual analytical platforms. Notably, the RF-RF configuration, in which both dimensionality reduction and final modeling were performed using Random Forests, yielded the best performance among all evaluated approaches. This model achieved perfect scores (1.000) for accuracy, precision, recall, and F1-score in both training and test sets, and recorded an out-of-bag (OOB) error rate of 0.957. Other MLDF strategies, such as PCA-RF and PLSDA-RF, also yielded high classification performance, reinforcing the chemometric advantage of combining latent variable modeling with robust classification algorithms. These elegantly constructed analyses reflect a significant trend in data fusion research: the combination of advanced machine learning tools and multi-platform analytical data to enhance the authentication and quality evaluation of complex beverages.

Among the available publications, different approaches are evident in accessing LLDF and MLDF models. Studies based on matrix concatenation typically involve different normalization strategies to equalize weights. It is also noted that there is no homogeneity in the use of software; environments such as SIMCA^®, MATLAB, R, Python, and PLS Toolbox^TM are highlighted for model calculations. Despite the different approaches used, the application of data fusion in unsupervised approaches showed greater power in discriminating groups. In contrast, supervised analyses showed higher predictive efficiency than independent matrix analyses. This is possibly due to the greater metabolite coverage achieved through different analytical techniques.

Looking ahead, advances in deep learning have enabled the development of powerful in silico data integration methods for handling heterogeneous, high-dimensional omics datasets. Among these, variational autoencoders (VAEs) have been applied to integrated multi-omics analysis with promising results. For example, Hira et al. [68] proposed a deep learning framework using standard VAE and a modified version, MMD-VAE, to integrate genomic, transcriptomic, and epigenomic data from ovarian cancer patients. Their study demonstrated that these models could effectively learn compressed representations from multi-omics datasets, improving the classification of molecular subtypes and survival prediction compared to traditional dimensionality reduction techniques. Importantly, these approaches circumvent manual data scaling by learning latent representations directly from raw or minimally normalized inputs, which is particularly beneficial when integrating omics datasets with highly unbalanced feature dimensions. In a different context, Li et al. [69] introduced an adaptive data fusion strategy based on convolutional neural networks with atrous convolution to integrate multi-sensor signals for industrial fault diagnosis. Their method performed the fusion directly on multi-source inputs without requiring explicit pre-processing or manual scaling, and outperformed conventional feature-engineering approaches regarding stability and classification accuracy. Similarly, Elamin and El-Rabbany [70] applied deep convolutional neural networks to fuse UAV-based LiDAR and multispectral imagery for urban land cover classification. The joint modeling of spatial and spectral information using a U-Net with ResNet101 backbone improved classification performance relative to traditional methods, again avoiding the need for prior scaling or alignment of data channels. These studies highlight a key advantage of deep learning-based fusion models: the capacity to harmonize structurally diverse datasets, such as NMR and MS, through automated representation learning, offering a robust alternative to conventional scaling procedures often inadequate for complex metabolomics workflows.

4. Metabolites Covered by NMR-MS Data Fusion

In general, the annotation of compounds remains a bottleneck in metabolomics studies. The articles that reported compound annotation primarily relied on comparisons with the literature and the use of spectral repositories such as the MetLin database, Human Metabolites Database (HMDB), Biological Magnetic Resonance Databank (BMRB), Tea Metabolome Database (TMDB), Food Database, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Dictionary of Natural Products (DNP). Specifically, for GC-MS data, the National Institute of Standards and Technology (NIST) library was widely employed, along with the calculation of Kovats indices.

Among the analytical tools cited, ¹H-NMR required more resources for compound characterization. In some studies, in addition to comparison with repository data, 2D experiments (¹H-¹³C) such as (¹H-¹H)-COSY, (¹H-¹³C)-HSQC, and (¹H-¹³C)-HMBC were described as supplementary steps in biomarker characterization.

Figure 5 and Table S2 (Supplementary Material) illustrate the compound set coverage across the reviewed publications. The data highlight the relevance of all three analytical strategies in metabolomics studies, as each contributed to the identification of specific metabolites within clinical, plant, and food matrices. A total of 571 substances were reported. Among these, 260 (46%) were exclusive to GC-MS, 192 (34%) to LC-MS, and 87 (15%) to NMR, underscoring their distinct contributions. Only 32 (6%) compounds were detected by more than one technique, and only proline was shared across all three. This finding aligns with the previous literature, which consistently reports that only glycine and proline are commonly detected across all platforms [71]. The high number of technique-specific compounds underscores the value of data fusion approaches, which improve metabolome coverage while minimizing redundancy in data integration.

Although the analytical potential of NMR is well recognized, particularly for its ability to detect a wide range of structurally diverse metabolites without the need for derivatization or separation steps, it exhibited the lowest number of annotated compounds among the reviewed studies. This observation reflects ongoing challenges in metabolite annotation by NMR, especially when dealing with complex biological matrices. In such samples, peak overlap, signal suppression, and lower intrinsic sensitivity can hinder the resolution and assignment of individual constituents, limiting the overall identification yield.

Among the platforms most frequently employed in NMR- and MS-based DF studies, ¹H-NMR and UHPLC-MS remain the primary analytical techniques associated with metabolomics approaches, as shown in Table 1. Both methods enable the simultaneous detection of a wide range of metabolites, particularly nitrogen-containing compounds such as alkaloids and amino acids, as well as various organic acids. This capability is crucial, as LC-MS and NMR are often complementary in the identification of known and unknown metabolites within complex mixtures.

To enhance this integrative analysis, specialized tools have been developed, including SUMMIT NMR/MS [72] and Data Fusion-based Discovery (DAFdiscovery) [73], which were developed to merge MS and NMR data. SUMMIT NMR/MS focuses on identifying chemical formulas through MS data, generating all plausible structures for each formula, and then comparing them to experimental NMR spectra to determine the most consistent molecular candidates. This workflow facilitates the structural elucidation of metabolites by integrating the strengths of both analytical platforms. DAFdiscovery, in turn, employs statistical total correlation spectroscopy (STOCSY) and statistical heterospectroscopy (SHY) to assist in metabolite annotation. These advanced tools underscore the importance of combining multiple analytical techniques to improve metabolome coverage and biomarker identification across diverse biological matrices.

5. Concluding Remarks

In this review, various innovative fusion strategies employing 1D and 2D NMR (¹H, ¹H-¹H J-Res, ¹H-¹³C HSQC) and MS (LC, GC, DI) over the past decade were considered. These strategies include classical multivariate exploratory or classificatory models, and some of their modifications, such as MB OPLS-DA, Consensus-PCA, and Sparse MBPLSR, as well as the application of machine learning tools on fused data.

In all these studies comparing results from independent matrices with merged data, data fusion techniques demonstrated a greater capacity for discrimination and classification of the considered variables. However, it is essential to validate these models, as not all accessed publications included this crucial step in establishing chemometric approaches.

The NMR-MS data fusion proves to be a powerful tool for studying complex matrices such as plants, clinical samples, food, fungi, and marine organisms. The promising data presented in publications highlight successful results in authentication and quality assessment of foods by origin, types of applied processes, and geographic region; plants regarding seasonal variations, origin, and extraction processes; and clinical studies emphasizing metabolic profile changes associated with the presence of disorders or dietary changes. These results reveal the unique metabolic coverage of each analytical technique used and their synergistic potential in biomarker discovery. Given all this, further research that involves the fusion of different analytical techniques, especially NMR and MS, is strongly encouraged.

The annotation of the observed biomarkers remains the main challenge. Nonetheless, mass spectrometry appears to offer more resources for compound characterization compared to NMR, especially in complex mixtures, even though that can be optimized through correlated NMR and MS analyses such as SUMMIT NMR/MS or Data Fusion-based Discovery.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/molecules30122624/s1, Table S1: Overview of sample size and rank deficiency mitigation approaches in low-level data fusion studies combining NMR and MS; Table S2: Coverage of compounds in NMR-MS data fusion studies by analytical technique.

Author Contributions

Conceptualization, P.H.B.d.M. and T.R.; writing—original draft preparation, P.H.B.d.M. and T.R.; writing—review and editing, G.L., G.D.C., J.V.F., G.M. and P.P.; supervision, T.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by ANR, MetaboHUB (ANR-11-INBS-0010) and WAPNMR (ANR-21-CE21-0014) projects.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Vignoli, A.; Ghini, V.; Meoni, G.; Licari, C.; Takis, P.G.; Tenori, L.; Turano, P.; Luchinat, C. Hochdurchsatz-Metabolomik Mit 1D-NMR. Angew. Chem. 2019, 131, 980–1007. [Google Scholar] [CrossRef]
Nicholson, J.K.; Lindon, J.C.; Holmes, E. “Metabonomics”: Understanding the Metabolic Responses of Living Systems to Pathophysiological Stimuli via Multivariate Statistical Analysis of Biological NMR Spectroscopic Data. Xenobiotica 1999, 29, 1181–1189. [Google Scholar] [CrossRef] [PubMed]
Lin, S.; Zhang, H.; Wang, C.; Su, X.L.; Song, Y.; Wu, P.; Yang, Z.; Wong, M.H.; Cai, Z.; Zheng, C. Metabolomics Reveal Nanoplastic-Induced Mitochondrial Damage in Human Liver and Lung Cells. Env. Sci. Technol. 2022, 56, 12483–12493. [Google Scholar] [CrossRef] [PubMed]
Evans, E.D.; Duvallet, C.; Chu, N.D.; Oberst, M.K.; Murphy, M.A.; Rockafellow, I.; Sontag, D.; Alm, E.J. Predicting Human Health from Biofluid-Based Metabolomics Using Machine Learning. Sci. Rep. 2020, 10, 17635. [Google Scholar] [CrossRef]
Zhang, X.W.; Li, Q.H.; Xu, Z.D.; Dou, J.J. Mass Spectrometry-Based Metabolomics in Health and Medical Science: A Systematic Review. RSC Adv. 2020, 10, 3092–3104. [Google Scholar] [CrossRef]
Bedia, C. Metabolomics in Environmental Toxicology: Applications and Challenges. Trends Environ. Anal. Chem. 2022, 34, e00161. [Google Scholar] [CrossRef]
Fraga-Corral, M.; Carpena, M.; Garcia-Oliveira, P.; Pereira, A.G.; Prieto, M.A.; Simal-Gandara, J. Analytical Metabolomics and Applications in Health, Environmental and Food Science. Crit. Rev. Anal. Chem. 2022, 52, 712–734. [Google Scholar] [CrossRef]
Marchev, A.S.; Vasileva, L.V.; Amirova, K.M.; Savova, M.S.; Balcheva-Sivenova, Z.P.; Georgiev, M.I. Metabolomics and Health: From Nutritional Crops and Plant-Based Pharmaceuticals to Profiling of Human Biofluids. Cell. Mol. Life Sci. 2021, 78, 6487–6503. [Google Scholar] [CrossRef]
Labine, L.M.; Simpson, M.J. The Use of Nuclear Magnetic Resonance (NMR) and Mass Spectrometry (MS)–Based Metabolomics in Environmental Exposure Assessment. Curr. Opin. Env. Sci. Health 2020, 15, 7–15. [Google Scholar] [CrossRef]
Letertre, M.P.M.; Dervilly, G.; Giraudeau, P. Combined Nuclear Magnetic Resonance Spectroscopy and Mass Spectrometry Approaches for Metabolomics. Anal. Chem. 2021, 93, 500–518. [Google Scholar] [CrossRef]
Marshall, D.D.; Powers, R. Beyond the Paradigm: Combining Mass Spectrometry and Nuclear Magnetic Resonance for Metabolomics. Prog. Nucl. Magn. Reson Spectrosc. 2017, 100, 1–16. [Google Scholar] [CrossRef] [PubMed]
Wishart, D.S.; Cheng, L.L.; Copié, V.; Edison, A.S.; Eghbalnia, H.R.; Hoch, J.C.; Gouveia, G.J.; Pathmasiri, W.; Powers, R.; Schock, T.B.; et al. NMR and Metabolomics—A Roadmap for the Future. Metabolites 2022, 12, 678. [Google Scholar] [CrossRef]
Shi, J.; Liu, Y.; Xu, Y.J. MS Based Foodomics: An Edge Tool Integrated Metabolomics and Proteomics for Food Science. Food Chem. 2024, 446, 138852. [Google Scholar] [CrossRef] [PubMed]
Wolfender, J.L.; Nuzillard, J.M.; Van Der Hooft, J.J.J.; Renault, J.H.; Bertrand, S. Accelerating Metabolite Identification in Natural Product Research: Toward an Ideal Combination of Liquid Chromatography-High-Resolution Tandem Mass Spectrometry and NMR Profiling, in Silico Databases, and Chemometrics. Anal. Chem. 2019, 91, 704–742. [Google Scholar] [CrossRef] [PubMed]
Tsedilin, A.M.; Fakhrutdinov, A.N.; Eremin, D.B.; Zalesskiy, S.S.; Chizhov, A.O.; Kolotyrkina, N.G.; Ananikov, V.P. How Sensitive and Accurate Are Routine NMR and MS Measurements? Mendeleev Commun. 2015, 25, 454–456. [Google Scholar] [CrossRef]
Wishart, D.S. NMR Metabolomics: A Look Ahead. J. Magn. Reson. 2019, 306, 155–161. [Google Scholar] [CrossRef] [PubMed]
Ghafar, S.Z.A.; Mediani, A.; Maulidiani, M.; Rudiyanto, R.; Mohd Ghazali, H.; Ramli, N.S.; Abas, F. Complementary NMR- and MS-Based Metabolomics Approaches Reveal the Correlations of Phytochemicals and Biological Activities in Phyllanthus Acidus Leaf Extracts. Food Res. Int. 2020, 136, 109312. [Google Scholar] [CrossRef]
Razali, M.T.A.; Zainal, Z.A.; Maulidiani, M.; Shaari, K.; Zamri, Z.; Idrus, M.Z.M.; Khatib, A.; Abas, F.; Ling, Y.S.; Rui, L.L.; et al. Classification of Raw Stingless Bee Honeys by Bee Species Origins Using the NMR- and LC-MS-Based Metabolomics Approach. Molecules 2018, 23, 2160. [Google Scholar] [CrossRef]
Ríos-Reina, R.; Azcarate, S.M.; Camiña, J.M.; Goicoechea, H.C. Multi-Level Data Fusion Strategies for Modeling Three-Way Electrophoresis Capillary and Fluorescence Arrays Enhancing Geographical and Grape Variety Classification of Wines. Anal. Chim. Acta 2020, 1126, 52–62. [Google Scholar] [CrossRef]
Castanedo, F. A Review of Data Fusion Techniques. Sci. World J. 2013, 1, 704504. [Google Scholar] [CrossRef]
Azcarate, S.M.; Ríos-Reina, R.; Amigo, J.M.; Goicoechea, H.C. Data Handling in Data Fusion: Methodologies and Applications. TrAC–Trends Anal. Chem. 2021, 143, 116355. [Google Scholar] [CrossRef]
Cocchi, M. Introduction: Ways and Means to Deal With Data From Multiple Sources. Data Handl. Sci. Technol. 2019, 31, 1–26. [Google Scholar]
Smolinska, A.; Engel, J.; Szymanska, E.; Buydens, L.; Blanchet, L. General Framing of Low-, Mid-, and High-Level Data Fusion With Examples in the Life Sciences. Data Handl. Sci. Technol. 2019, 31, 51–79. [Google Scholar]
Chen, J.; Li, J.-Q.; Li, T.; Liu, H.-G.; Wang, Y. Application of UV-Vis and Infrared Spectroscopy on Wild Edible Bolete Mushrooms Discrimination and Evaluation: A Review. Crit. Rev. Anal. Chem. 2021, 53, 1–17. [Google Scholar] [CrossRef]
Doeswijk, T.G.; Smilde, A.K.; Hageman, J.A.; Westerhuis, J.A.; van Eeuwijk, F.A. On the Increase of Predictive Performance with High-Level Data Fusion. Anal. Chim. Acta. 2011, 705, 41–47. [Google Scholar] [CrossRef] [PubMed]
Forshed, J.; Idborg, H.; Jacobsson, S.P. Evaluation of Different Techniques for Data Fusion of LC/MS and 1H-NMR. Chemom. Intell. Lab. Syst. 2007, 85, 102–109. [Google Scholar] [CrossRef]
Campos, M.P.; Reis, M.S. Data Preprocessing for Multiblock Modelling–A Systematization with New Methods. Chemom. Intell. Lab. Syst. 2020, 199, 103959. [Google Scholar] [CrossRef]
Mishra, P.; Roger, J.M.; Jouan-Rimbaud-Bouveresse, D.; Biancolillo, A.; Marini, F.; Nordon, A.; Rutledge, D.N. Recent Trends in Multi-Block Data Analysis in Chemometrics for Multi-Source Data Integration. TrAC–Trends Anal. Chem. 2021, 137, 116206. [Google Scholar] [CrossRef]
Bro, R. Chemometrics and Intelligent Laboratory Systems Tutorial PARAFAC. Tutor. Appl. 1997, 38, 149–171. [Google Scholar] [CrossRef]
Martínez Bilesio, A.R.; Batistelli, M.; García-Reiriz, A.G. Fusing Data of Different Orders for Environmental Monitoring. Anal. Chim. Acta 2019, 1085, 48–60. [Google Scholar] [CrossRef]
Kiers, H.A.L.; ten Berge, J.M.F.; Bro, R. PARAFAC2—Part I. A Direct Fitting Algorithm for the PARAFAC2 Model. J. Chemom. 1999, 13, 275–294. [Google Scholar] [CrossRef]
Jalalvand, A.R.; Goicoechea, H.C. Applications of Electrochemical Data Analysis by Multivariate Curve Resolution-Alternating Least Squares. TrAC–Trends Anal. Chem. 2017, 88, 134–166. [Google Scholar] [CrossRef]
Jalili, F.; Jalalvand, A.R. Developing a Novel and Intelligent Chemometrics-Assisted Molecularly Imprinted Electrochemical Sensor: Application to the Improvement of the Efficiency of the Treatment of Parkinson’s Disease. Chemom. Intell. Lab. Syst. 2025, 259, 105351. [Google Scholar] [CrossRef]
Hwang, J.; Lee, H. MMMF: Multimodal Multitask Matrix Factorization for Classification and Feature Selection. IEEE Access 2022, 10, 120155–120167. [Google Scholar] [CrossRef]
Ballabio, D.; Todeschini, R.; Consonni, V. Recent Advances in High-Level Fusion Methods to Classify Multiple Analytical Chemical Data. Data Handl. Sci. Technol. 2019, 31, 129–155. [Google Scholar]
Rodionova, O.; Pomerantsev, A. Multi-Block DD-SIMCA as a High-Level Data Fusion Tool. Anal. Chim. Acta 2023, 1265, 341328. [Google Scholar] [CrossRef]
Lahat, D.; Adalý, T.; Jutten, C. Challenges in Multimodal Data Fusion. In Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 1–5 September 2014; pp. 101–105. [Google Scholar]
Baky, M.H.; Badawy, M.T.; Bakr, A.F.; Hegazi, N.M.; Abdellatif, A.; Farag, M.A. Metabolome-Based Profiling of African Baobab Fruit (Adansonia digitata L.) Using a Multiplex Approach of MS and NMR Techniques in Relation to Its Biological Activity. RSC Adv. 2021, 11, 39680–39695. [Google Scholar] [CrossRef] [PubMed]
Farag, M.A.; Otify, A.; Porzel, A.; Michel, C.G.; Elsayed, A.; Wessjohann, L.A. Comparative Metabolite Profiling and Fingerprinting of Genus Passiflora Leaves Using a Multiplex Approach of UPLC-MS and NMR Analyzed by Chemometric Tools. Anal. Bioanal. Chem. 2016, 408, 3125–3143. [Google Scholar] [CrossRef]
Yoon, D.; Choi, B.R.; Ma, S.; Lee, J.W.; Jo, I.H.; Lee, Y.S.; Kim, G.S.; Kim, S.; Lee, D.Y. Metabolomics for Age Discrimination of Ginseng Using a Multiplex Approach to HR-MAS NMR Spectroscopy, UPLC-QTOF/MS, and GC × GC-TOF/MS. Molecules 2019, 24, 2381. [Google Scholar] [CrossRef]
Marshall, D.D.; Lei, S.; Worley, B.; Huang, Y.; Garcia-Garcia, A.; Franco, R.; Dodds, E.D.; Powers, R. Combining DI-ESI–MS and NMR Datasets for Metabolic Profiling. Metabolomics 2015, 11, 391–402. [Google Scholar] [CrossRef]
Bitar, T.; Mavel, S.; Emond, P.; Nadal-Desbarats, L.; Lefèvre, A.; Mattar, H.; Soufia, M.; Blasco, H.; Vourc’h, P.; Hleihel, W.; et al. Identification of Metabolic Pathway Disturbances Using Multimodal Metabolomics in Autistic Disorders in a Middle Eastern Population. J. Pharm. Biomed. Anal. 2018, 152, 57–65. [Google Scholar] [CrossRef]
Karaman, İ.; Nørskov, N.P.; Yde, C.C.; Hedemann, M.S.; Bach Knudsen, K.E.; Kohler, A. Sparse Multi-Block PLSR for Biomarker Discovery When Integrating Data from LC–MS and NMR Metabolomics. Metabolomics 2015, 11, 367–379. [Google Scholar] [CrossRef]
Maan, K.; Baghel, R.; Bakhshi, R.; Dhariwal, S.; Tyagi, R.; Rana, P. An Integrative Chemometric Approach and Correlative Metabolite Networking of LC-MS and 1H NMR Based Urine Metabolomics for Radiation Signatures. Mol. Omics 2022, 18, 214–225. [Google Scholar] [CrossRef]
Silva, L.M.A.; Filho, E.G.A.; Rodrigues, T.H.S.; Louredo, F.J.C.; Zocolo, G.J.; Canuto, K.M.; Mikich, S.B.; Liebsch, D.; De Almeida, A.; De Brito, E.S. Metabolomic Profiling of Phloem Sap from Different Pine Species and Implications on Black Capuchin. J. Chem. Ecol. 2022, 48, 660–669. [Google Scholar] [CrossRef] [PubMed]
Zanatta, A.C.; Vieira, N.C.; Dantas-Medeiros, R.; Vilegas, W.; Edrada-Ebel, R.A. Understanding the Seasonal Effect of Metabolite Production in Terminalia Catappa L. Leaves through a Concatenated MS- and NMR-Based Metabolomics Approach. Metabolites 2023, 13, 349. [Google Scholar] [CrossRef] [PubMed]
Zanatta, A.C.; Vilegas, W.; Edrada-Ebel, R.A. UHPLC-(ESI)-HRMS and NMR-Based Metabolomics Approach to Access the Seasonality of Byrsonima Intermedia and Serjania Marginata From Brazilian Cerrado Flora Diversity. Front. Chem. 2021, 9, 710025. [Google Scholar] [CrossRef]
Sampaio, B.L.; Edrada-Ebel, R.; Da Costa, F.B. Effect of the Environment on the Secondary Metabolic Profile of Tithonia Diversifolia: A Model for Environmental Metabolomics of Plants. Sci. Rep. 2016, 6, 29265. [Google Scholar] [CrossRef]
Beniddir, M.A.; Le Moyec, L.; Triba, M.N.; Longeon, A.; Deville, A.; Blond, A.; Pham, V.C.; de Voogd, N.J.; Bourguet-Kondracki, M.L. Metabolomics with Multi-Block Modelling of Mass Spectrometry and Nuclear Magnetic Resonance in Order to Discriminate Haplosclerida Marine Sponges. Anal. Bioanal. Chem. 2022, 414, 5929–5942. [Google Scholar] [CrossRef]
Mehl, F.; Marti, G.; Merle, P.; Delort, E.; Baroux, L.; Sommer, H.; Wolfender, J.L.; Rudaz, S.; Boccard, J. Integrating Metabolomic Data from Multiple Analytical Platforms for a Comprehensive Characterisation of Lemon Essential Oils. Flavour Fragr. J. 2015, 30, 131–138. [Google Scholar] [CrossRef]
Palaric, C.; Pilard, S.; Fontaine, J.X.; Boccard, J.; Mathiron, D.; Rigaud, S.; Cailleu, D.; Mesnard, F.; Gut, Y.; Renaud, T.; et al. Processing of NMR and MS Metabolomics Data Using Chemometrics Methods: A Global Tool for Fungi Biotransformation Reactions Monitoring. Metabolomics 2019, 15, 107. [Google Scholar] [CrossRef]
Parchemin, C.; Raviglione, D.; Ghosson, H.; Salvia, M.V.; Goossens, C.; Sasal, P.; Faliex, E.; Tapissier-Bontemps, N. Development of a Multiblock Metabolomics Approach to Explore Metabolite Variations of Two Algae of the Genus Asparagopsis Linked to Interspecies and Temporal Factors. Algal Res. 2023, 72, 103138. [Google Scholar] [CrossRef]
Quiroz-Delfi, G.O.; Rider, C.V.; Ferguson, S.S.; Jarmusch, A.K.; Mueller, G.A. Non-Targeted Chemical Analysis of Consumer Botanical Products Labeled as Blue Cohosh (Caulophyllum thalictroides), Goldenseal (Hydrastis canadensis), or Yohimbe Bark (Pausinystalia yohimbe) by NMR and MS. Anal. Bioanal. Chem. 2024, 416, 175–189. [Google Scholar] [CrossRef] [PubMed]
Lanza, I.; Lolli, V.; Segato, S.; Caligiani, A.; Contiero, B.; Lotto, A.; Galaverna, G.; Magrin, L.; Cozzi, G. Use of GC–MS and 1H NMR Low-Level Data Fusion as an Advanced and Comprehensive Metabolomic Approach to Discriminate Milk from Dairy Chains Based on Different Types of Forage. Int. Dairy J. 2021, 123, 105174. [Google Scholar] [CrossRef]
Alves Filho, E.G.; Mara, L.; Silva, A.; De Brito, E.S.; Castro, D.R.G.; Bezerra, J.A.; Sanches, E.A.; Rodrigues, S.; Fernandes, F.A.N.; Campelo, P.H. Effect of Glow and Dielectric Barrier Discharges Plasma on Volatile and Non-Volatile Chemical Profiling of Camu-Camu Juice. Food Bioproc. Tech. 2021, 14, 1275–1286. [Google Scholar] [CrossRef]
Becchi, P.P.; Lolli, V.; Zhang, L.; Pavanello, F.; Caligiani, A.; Lucini, L. Integration of LC-HRMS and 1H NMR Metabolomics Data Fusion Approaches for Classification of Amarone Wine Based on Withering Time and Yeast Strain. Food Chem. X 2024, 23, 101607. [Google Scholar] [CrossRef]
Chen, B.; Wang, L.; Wang, L.; Han, Y.; Yan, G.; Chen, L.; Li, C.; Zhu, Y.; Lu, J.; Han, L. A Novel Data Fusion Strategy of GC-MS and 1H NMR Spectra for the Identification of Different Vintages of Maotai-Flavor Baijiu. J. Agric. Food Chem. 2024, 72, 14865–14873. [Google Scholar] [CrossRef] [PubMed]
Jin, G.; Zhu, Y.; Cui, C.; Yang, C.; Hu, S.; Cai, H.; Ning, J.; Wei, C.; Li, A.; Hou, R. Tracing the Origin of Taiping Houkui Green Tea Using 1H NMR and HS-SPME-GC–MS Chemical Fingerprints, Data Fusion and Chemometrics. Food Chem. 2023, 425, 136538. [Google Scholar] [CrossRef]
Belmonte-Sánchez, J.R.; Romero-González, R.; Martínez Vidal, J.L.; Arrebola, F.J.; Garrido Frenich, A. 1H NMR and Multi-Technique Data Fusion as Metabolomic Tool for the Classification of Golden Rums by Multivariate Statistical Analysis. Food Chem. 2020, 317, 126363. [Google Scholar] [CrossRef]
Alves Filho, E.G.; Silva, L.M.A.; Wurlitzer, N.J.; Fernandes, F.A.N.; Fonteles, T.V.; Rodrigues, S.; de Brito, E.S. An Integrated Analytical Approach Based on NMR, LC–MS and GC–MS to Evaluate Thermal and Non-Thermal Processing of Cashew Apple Juice. Food Chem. 2020, 309, 125761. [Google Scholar] [CrossRef]
Spiteri, M.; Dubin, E.; Cotton, J.; Poirel, M.; Corman, B.; Jamin, E.; Lees, M.; Rutledge, D. Data Fusion between High Resolution 1H-NMR and Mass Spectrometry: A Synergetic Approach to Honey Botanical Origin Characterization. Anal. Bioanal. Chem. 2016, 408, 4389–4401. [Google Scholar] [CrossRef]
Worley, B.; Powers, R. MVAPACK: A Complete Data Handling Package for NMR Metabolomics. ACS Chem. Biol. 2014, 9, 1138–1144. [Google Scholar] [CrossRef] [PubMed]
Rivera-Pérez, A.; Romero-González, R.; Garrido Frenich, A. Application of an Innovative Metabolomics Approach to Discriminate Geographical Origin and Processing of Black Pepper by Untargeted UHPLC-Q-Orbitrap-HRMS Analysis and Mid-Level Data Fusion. Food Res. Int. 2021, 150, 110722. [Google Scholar] [CrossRef] [PubMed]
Biancolillo, A.; Liland, K.H.; Måge, I.; Næs, T.; Bro, R. Variable Selection in Multi-Block Regression. Chemom. Intell. Lab. Syst. 2016, 156, 89–101. [Google Scholar] [CrossRef]
Oh, S.W.; Imran, M.; Kim, E.H.; Park, S.Y.; Lee, S.G.; Park, H.M.; Jung, J.W.; Ryu, T.H. Approach Strategies and Application of Metabolomics to Biotechnology in Plants. Front. Plant Sci. 2023, 14, 1192235. [Google Scholar] [CrossRef]
Ariza, A.G.; Arbulu, A.A.; González, F.J.N.; Bermejo, J.V.D.; Vallejo, M.E.C. Discriminant Canonical Analysis as a Validation Tool for Multivariety Native Breed Egg Commercial Quality Classification. Foods 2021, 10, 632. [Google Scholar] [CrossRef]
Cariou, V.; Jouan-Rimbaud Bouveresse, D.; Qannari, E.M.; Rutledge, D.N. ComDim Methods for the Analysis of Multiblock Data in a Data Fusion Perspective. Data Handl. Sci. Technol. 2019, 31, 179–204. [Google Scholar] [CrossRef]
Hira, M.T.; Razzaque, M.A.; Angione, C.; Scrivens, J.; Sawan, S.; Sarkar, M. Integrated Multi-Omics Analysis of Ovarian Cancer Using Variational Autoencoders. Sci. Rep. 2021, 11, 6265. [Google Scholar] [CrossRef] [PubMed]
Li, S.; Wang, H.; Song, L.; Wang, P.; Cui, L.; Lin, T. An Adaptive Data Fusion Strategy for Fault Diagnosis Based on the Convolutional Neural Network. Measurement 2020, 165, 108122. [Google Scholar] [CrossRef]
Elamin, A.; El-Rabbany, A. UAV-Based Multi-Sensor Data Fusion for Urban Land Cover Mapping Using a Deep Convolutional Neural Network. Remote Sens. 2022, 14, 4298. [Google Scholar] [CrossRef]
Bruno, C.; Patin, F.; Bocca, C.; Nadal-Desbarats, L.; Bonnier, F.; Reynier, P.; Emond, P.; Vourc’h, P.; Joseph-Delafont, K.; Corcia, P.; et al. The Combination of Four Analytical Methods to Explore Skeletal Muscle Metabolomics: Better Coverage of Metabolic Pathways or a Marketing Argument? J Pharm Biomed Anal 2018, 148, 273–279. [Google Scholar] [CrossRef]
Bingol, K.; Bruschweiler-Li, L.; Yu, C.; Somogyi, A.; Zhang, F.; Brüschweiler, R. Metabolomics beyond Spectroscopic Databases: A Combined MS/NMR Strategy for the Rapid Identification of New Metabolites in Complex Mixtures. Anal. Chem. 2015, 87, 3864–3870. [Google Scholar] [CrossRef] [PubMed]
Borges, R.M.; das Neves Costa, F.; Chagas, F.O.; Teixeira, A.M.; Yoon, J.; Weiss, M.B.; Crnkovic, C.M.; Pilon, A.C.; Garrido, B.C.; Betancur, L.A.; et al. Data Fusion-Based Discovery (DAFdiscovery) Pipeline to Aid Compound Annotation and Bioactive Compound Discovery across Diverse Spectral Data. Phytochem. Anal. 2023, 34, 48–55. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Flow chart of multi-sensor data fusion.

Figure 2. Survey of studies carried out in recent years involving NMR and MS data fusion.

Figure 3. Boxplot of classification accuracy values reported in studies using MS, NMR, LLDF, and MLDF.

Figure 4. Heatmap of classification performance metrics (accuracy, precision, recall, and F1-score) for all evaluated models, based on the real results reported in the study by Chen et al. [51]. Models are labeled by analytical platform and machine learning algorithm, with mid-level fusion configurations including the dimensionality reduction method used prior to classification.

Figure 5. Venn diagram of the compounds covered by the NMR and MS analytical techniques, highlighting the intersections between these tools.

Table 1. Summary of strengths, limitations, and applications of data fusion levels.

Criterion	Low-Level	Mid-Level	High-Level
Implementation	Simple, direct	Feature reduction needed	Separate models required
Data Richness	Full information	Compressed, relevant	Outputs only
Overfitting Risk	High (many variables)	Moderate	Low (if models are robust)
Interpretability	Moderate, variable-level	Component-based	Low, indirect
Data Compatibility	Low (scaling required)	Moderate (pre-processing helps)	High (platform-independent)
Application Focus	Exploration, trend mapping	Clustering, pattern discovery	Classification, decision support
Limitations	Sensitive to scale; risk of bias	Loss of nuance; method-dependent	Requires expertise; low traceability

Table 2. Overview of published studies employing data fusion approaches combining NMR and MS analyses, categorized by biological matrix (body fluids, natural products, and food), analytical strategy, modeling techniques, and validation criteria.

Approach	Aim	Datasets	DF Level	Fusion Strategy	Stat. Modelling	Model Validation	Ref.
Body Fluid Matrices—Clinical Studies
Targeted	Compare the metabolic profiles of patients affected by autistic disorders	LC-MS ¹H-NMR ¹H-¹³C (HSQC)	LLDF	Concatenation of pre-processed datasets	OPLS-DA	CV-ANOVA	[42]
Untargeted	Evaluate the effects of dietary variations on plasma responses in pigs	LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed blocks	Sparse MB-PLS	RMSECV	[43]
Untargeted	Differentiate the urinary metabolic profiles of irradiated and non-irradiated mice	LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	MB-OPLSDA	CV-ANOVA	[44]
Untargeted	Evaluate optimization methods for NMR and MS data fusion using multiblock bilinear factorization in neurotoxin analysis	DI-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	MB-PCA, MB-PLS	LOOCV	[41]
Natural Products Matrices (plant, marine, and fungal sources)
Untargeted	Evaluate the variability in primary metabolites and aroma compounds of pines	GC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	PLS-DA	Permutation test	[45]
Untargeted	Evaluate the environmental factors driving seasonal variations of a medicinal plant	LC-MS ¹H-¹H (J-resolved)	LLDF	Concatenation of pre-processed datasets	PLS-DA, OPLS-DA	Permutation test	[46]
Untargeted	Assess the influence of seasonal factors on the chemical composition of two medicinal plant species	LC-MS ¹H-¹H (J-resolved)	LLDF	Concatenation of pre-processed datasets	PLS-DA	Permutation test	[47]
Untargeted	Examine the relationships between environmental factors and the growth of an invasive weed	LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	PCA, HCA, OPLS-DA	-	[48]
Untargeted	Discriminate marine sponges	LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	Consensus PCA, MB-PLS	LOOCV	[49]
Untargeted	Classify lemon essential oils based on their extraction methods	LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	MB-PLS, consensus K-OPLS-DA	LOOCV	[50]
Untargeted	Monitor the biotransformation medium of a potential histamine H3 antagonist	LC-MS ¹H-NMR ¹H-¹H (J-resolved)	LLDF	Concatenation of pre-processed datasets	Consensus OPLS-DA	K-fold	[51]
Untargeted	Evaluate metabolic differences of algae	GC-MS LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	MB ComDim	Permutation test	[52]
Untargeted	Determine the botanical origin and authenticity of blue cohosh	LC-MS ¹H-NMR	MLDF	This approach analyzed the top five principal components (i.e., latent variables) of the NMR and MS datasets	PCA	-	[53]
Food Matrices
Untargeted	Discriminate milk from dairy chains based on different dietary types	GC-MS ¹H-NMR	LLDF	Concatenation of datasets	CDA	LOOCV	[54]
Untargeted	Investigate the effect of operating conditions on both non-volatile and volatile compounds in juice	GC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	PCA, PLS-DA	RMSEC, RMSECV	[55]
Untargeted	Classify wines according to grape withering time and yeast strain	LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	MCIA, SPLS-DA	3-fold cross validation	[56]
Untargeted	Investigate the different vintages of Baijiu beverage	GC-MS ¹H-NMR	LLDF	Concatenation of datasets	PCA, PLS-DA machine learning: SVM, DT, RF, k-NN	10-fold cross validation	[57]
			MLDF	Extraction of features by: PCA, PLS-DA, Decision Tree (DT), Random Forest (RF)
Untargeted	Discriminate geographical origin of green tea	GC-MS ¹H-NMR	LLDF	Concatenation of pre-processed datasets	PCA, ComDim-PLS	-	[58]
			MLDF	Concatenation of 3D PCA loading plots (approach 1)	PCA
				Concatenation VIP scores (top 10-approach 2)	SVM
Untargeted	Classify rums based on fermentation barrel, raw material, distillation method, and aging	GC-MS LC-MS ¹H-NMR	LLDF	Concatenation of pre-processed raw data	PLS-DA	RMSECV, CV ANOVA, permutation test	[59]
			MLDF	Concatenation of features based on VIP scores
Untargeted	Evaluate the influence of two processing methods on juice composition	LC-MS ¹H-NMR	MLDF	Pre-processed matrices of NMR and MS were converted to ASCII	PCA	-	[60]
Untargeted	Honey origin discrimination and comparative analysis of mid-level fusion strategies	LC-MS ¹H-NMR	MLDF	Concatenation scores of PCAs of each analytical method (no variable selection) Selection of relevant variables of each analytical method using PLS	PCA, PLS-DA	-	[61]

Nuclear magnetic resonance (NMR); Proton NMR (¹H-NMR); Carbon-13 NMR (¹³C-NMR); ¹H-¹³C Correlation (¹H-¹³C); ¹H-¹H Correlation (¹H-¹H); Heteronuclear Single Quantum Coherence (HSQC); J-resolved spectroscopy (J-RES); gas chromatography (GC); liquid chromatography (LC); mass spectrometry (MS); direct injection (DI); low-level data fusion (LLDF); mid-level data fusion (MLDF); multiblock (MB); Text Export Format (ASCII). Principal Component Analysis (PCA); Partial Least Squares (PLS); Sparse Partial Least Squares (SPLS); Canonical Discriminant Analysis (CDA); Discriminant Analysis (DA); Orthogonal Partial Least Squares (OPLS); support vector machine (SVM); Random Forest (RF); Decision Tree (DT); neural network (NN); Multiple Co-Inertia Analysis (MCIA); cross-validation (CV); leave-one-out CV (LOOCV); analysis of variance (ANOVA); Root Mean Square Error of Cross-Validation (RMSECV); RMSE of Calibration (RMSEC); variable importance in projection (VIP).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Homobono Brito de Moura, P.; Leleu, G.; Da Costa, G.; Marti, G.; Pétriacq, P.; Valls Fonayet, J.; Richard, T. Integrating NMR and MS for Improved Metabolomic Analysis: From Methodologies to Applications. Molecules 2025, 30, 2624. https://doi.org/10.3390/molecules30122624

AMA Style

Homobono Brito de Moura P, Leleu G, Da Costa G, Marti G, Pétriacq P, Valls Fonayet J, Richard T. Integrating NMR and MS for Improved Metabolomic Analysis: From Methodologies to Applications. Molecules. 2025; 30(12):2624. https://doi.org/10.3390/molecules30122624

Chicago/Turabian Style

Homobono Brito de Moura, Patricia, Guillaume Leleu, Grégory Da Costa, Guillaume Marti, Pierre Pétriacq, Josep Valls Fonayet, and Tristan Richard. 2025. "Integrating NMR and MS for Improved Metabolomic Analysis: From Methodologies to Applications" Molecules 30, no. 12: 2624. https://doi.org/10.3390/molecules30122624

APA Style

Homobono Brito de Moura, P., Leleu, G., Da Costa, G., Marti, G., Pétriacq, P., Valls Fonayet, J., & Richard, T. (2025). Integrating NMR and MS for Improved Metabolomic Analysis: From Methodologies to Applications. Molecules, 30(12), 2624. https://doi.org/10.3390/molecules30122624

Article Menu

Integrating NMR and MS for Improved Metabolomic Analysis: From Methodologies to Applications

Abstract

1. Introduction

2. Data Fusion in Metabolomics

2.1. Low-Level DF

2.2. Mid-Level DF

2.3. High-Level DF

2.4. Potential and Limitations of DF Levels in Data Integration

3. NMR-MS Data Fusion Applied in Metabolomics Studies

3.1. Importance of Evaluating Predictive Performance in NMR-MS Data Fusion

3.2. NMR and MS DF in Body Fluids—Clinical Studies

3.3. NMR and MS DF in Natural Product Matrices

3.4. NMR and MS DF in Food Matrices

4. Metabolites Covered by NMR-MS Data Fusion

5. Concluding Remarks

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI