Critical Review of Selected Analytical Platforms for GC-MS Metabolomics Profiling—Case Study: HS-SPME/GC-MS Analysis of Blackberry’s Aroma

Data processing and data extraction are the first, and often the most crucial, steps in metabolomics and in multivariate data analysis in general. Several software solutions exist for these purposes in GC-MS metabolomics, but it is often unclear which platform offers what kind of data and how that information influences the conclusions of the analysis. In this study, selected analytical platforms for GC-MS metabolomics profiling, SpectConnect and XCMS, as well as the MestReNova software, were used to process the results of HS-SPME/GC-MS aroma analyses of several blackberry varieties. In addition, a detailed analysis of the identification of the individual components of the blackberry aroma club varieties was performed. In total, 72 components were detected with the XCMS platform, 119 with SpectConnect, and 87 and 167 with MestReNova using automatic integration and manual correction, respectively, while 219 aroma components were found after manual analysis of the GC-MS chromatograms. The obtained datasets were imported into SIMCA software for multivariate data analysis, and PCA, OPLS, and OPLS-DA models were built. The results of the validation tests and the VIP-pred. scores were analyzed in detail.


Introduction
Metabolomics has been employed in a variety of applications, including the discovery of biomarkers and enzymes in food and nutrition, plant biotechnology, and health. Metabolomics is described as the comprehensive, simultaneous examination of many metabolites in biological systems, and it has emerged in a wide range of research areas. Huge progress in this area has been aided by the development of high-resolution analytical techniques such as nuclear magnetic resonance (NMR) and mass spectrometry (MS), which allow for the examination of a wide range of metabolites at various concentration levels. Multivariate data analysis is an essential tool for the analysis of large and complex data sets [1,2], and it is what makes such analysis feasible [3]. It was developed in parallel with advances in computing. In order to adequately process a large data set, such as the metabolome, the application of multivariate analysis is necessary. Additionally, the large number of MS databases that make it easier to analyze and identify compounds greatly facilitates this process. The metabolome is very complex, and therefore there is a justified need to develop methods that will facilitate its interpretation and processing in order to discover metabolites of importance [4][5][6]. One of the most important steps is the initial processing of raw data in order to obtain a reliable model using a multivariate analysis [7,8]. In addition, depending on the analytical technique used, it is necessary to establish procedures and workflows for research [9]. A number of platforms in use today allow for the fast and efficient processing of large amounts of data, as well as their preparation for further metabolome analysis [10]. High resolving power, robustness, good reproducibility, selectivity, and high sensitivity are characteristics that make GC-MS an excellent analytical platform. Electron ionization (EI) is most often used because there
are a large number of mass spectral libraries that facilitate the analysis. The results of GC-MS analysis consist of m/z values, retention times, and intensities of different peaks [11]. Fluctuations in the chromatograms' retention times, which are particularly noticeable when a high number of samples is analyzed, can be an issue [12]. Therefore, it is necessary to properly process the obtained raw data in order to obtain valid datasets [13]. Some of the basic steps are baseline correction, to define the area of the peaks of interest, and alignment, to compensate for shifting retention times and to ensure the uniformity of the entire data set. There are a number of ways to process raw GC-MS results, using different platforms and software to prepare them for multivariate analysis [14,15].
In order to characterize cultivars, it is necessary to take the entire profile into consideration during the analysis in order to draw appropriate conclusions and understand the correlation of all metabolites in the metabolome [16]. The whole chromatogram (for all detected m/z values) is significant for nontargeted metabolomic investigations, which motivates attempts to choose experimental conditions that enhance metabolite peak accessibility [17]. Accurate analysis of all of these data is usually accompanied by some difficulties. It can be challenging to discern certain "real" chromatographic peaks from noise. Sometimes the separate MS scans contain peaks from coeluting mixtures of metabolites that are not chromatographically separated. Hence, peak enumeration, which separates "true" peaks from noise in a chromatogram, and spectral deconvolution, which, according to the recent literature, is becoming more and more common, are the first processing steps that follow the storage of raw GC-MS data [18]. These steps yield putative pure spectra from overlapping peaks. These procedures can be carried out either using commercial software designed for a particular manufacturer's equipment or using freely accessible software such as AMDIS [19]. In previous works, the main focus of the researchers was on the dominant components of aroma, which were treated as markers for the recognition of certain types of fruit [20] or of geographical origin [21]. In plant metabolomics of fruit varieties, differences in genetic variability can be ruled out due to the dominant vegetative method of propagation (cloning) [22]. However, differences between replicates of the same type of sample can often have bigger effects on the main compounds than differences between samples. This is mostly because of different levels of maturity or stress-causing environmental factors such as sunlight, humidity, and pathogen exposure [23]. For this reason, the scaling and preprocessing of data are as important as
the degree of sensitivity and selectivity of the applied instrumental technique and the applied statistical platform [24,25].
Blackberries are widely consumed fruits that are employed in many different processed goods. Both consumers and food suppliers place a great deal of importance on fruit quality, with premium fruits typically having greater market potential. As a result, berry growers worldwide seek berries that are large, firm, flavorful, and nutrient-rich [26]. With their bioactive components, which include phenolic acids and flavonoids [27], they have strong antioxidant activity [28] and help prevent a variety of ailments. Up to now, blackberries have been the subject of extensive studies using liquid chromatography with mass spectrometry and multivariate data analysis [27][28][29][30]. To the best of our knowledge, this is the first time that a metabolomic approach has been used in the analysis of blackberry aroma, as well as the first time that the results of different platforms for preprocessing GC-MS results have been compared. This paper presents the influence of different online analytical platforms for data "extraction" and of MestReNova software integration solutions on the results of a multivariate analysis for determining the aroma profile of different blackberry cultivars that were grown in the Zeleni Hit Company experimental field, near Belgrade. The headspace solid-phase microextraction coupled to gas chromatography-mass spectrometry (HS-SPME/GC-MS) method was used to track the changes. Non-targeted quantitative component profiling was used to investigate the aroma profiles of the different blackberry cultivars. Six blackberry cultivars, Columbia Star [31], Loch Ness [32], Nachez [33], Ouachita [34], Prime-Ark 45 [35], and Von [36], were analyzed on three different platforms: MestReNova 12.0 with automatic and manual corrections of detected peaks, XCMS Online [37], and SpectConnect [38], in order to compare their results and identify an optimal solution for GC-MS data preprocessing. The criteria for comparison were the number of identified peaks, the quality of the peaks, and the
results of statistical analyses. The peak numbers and identification were validated through a manual check of one representative chromatogram of each cultivar. In total, 269 compounds were detected and 219 were identified. Methyl undecanoate was used as an internal standard for the normalization and quantification of each compound.

Sample Collection and Preparation
In July 2021, fruits of the six blackberry varieties Loch Ness, Ouachita, Nachez, Von, Prime-Ark 45, and Columbia Star were collected in the experimental field of the Zeleni Hit DOO company (Batajnički put ZH, Belgrade 11080). The samples were stored in sterile plastic bottles at a temperature of −18 °C until analysis.
Each headspace vial was filled with 2 g of sample, 100 mg of NaCl (Sigma-Aldrich, Saint Louis, MO, USA), and 1 µL of a methyl undecanoate solution in dichloromethane (Poly-Science Corp., Niles, IL, USA) with a concentration of 2 ppm. The vials were tightly closed and incubated in a water bath at 60 °C for 30 min. During incubation, a solid-phase microextraction fiber with polydimethylsiloxane (PDMS) as the adsorbent was exposed to the headspace of the vial. A manual SPME arrow injection kit was used for the incubation and for the injection of the concentrated blackberry volatiles into the GC-MS inlet. After injection, the fiber was kept in the heated inlet for 20 s before starting the analysis, for desorption, and for another four minutes after starting the analysis, to condition it for the next sample. Blank samples were measured every day before measuring the berry samples.

GC-MS Analysis
The Agilent 7890B GC system (Agilent Technologies, Santa Clara, CA, USA) equipped with a 5977 mass selective detector (MSD) was used for the GC-MS analysis of aroma compounds. For separation, a non-polar HP-5MSI capillary column (30 m × 0.25 mm, 0.25 µm film thickness) was used. The oven temperature was programmed to increase linearly from 60 °C to 240 °C at a rate of 3 °C/min. Helium was used as the carrier gas, the inlet pressure was constant at 16.7 psi (flow 1.0 mL/min at 210 °C), and splitless mode was used. The MS range was 40-550 amu, the electron ionization energy was 70 eV at 230 °C, and the quadrupole temperature was 150 °C. The transfer line temperature was kept at 315 °C.

Data Processing
Library search and mass spectral deconvolution and extraction of the derivatized compounds were performed using the MSD ChemStation software, version E02.02 (Agilent Technologies, Santa Clara, CA, USA), the NIST AMDIS (Automated Mass Spectral Deconvolution and Identification System) software version 2.70, and the commercially available Adams04, NIST17, and Wiley07 libraries containing approximately 500,000 spectra.
For the eXtensible Computational Mass Spectrometry (XCMS) online platform (Version 3.7.1), all the MS chromatograms were converted to the AIA format using the MSD ChemStation software. Peak picking, nonlinear peak alignment, and matching of the retention times were then carried out on this R-based platform [39,40]. Using the centWave feature detection algorithm, the maximum allowed m/z deviation in consecutive scans was set at 100 ppm. The minimum and maximum chromatographic peak widths were set at 5 and 10 s, respectively. The minimum difference in m/z for peaks with overlapping retention times was set at 0.01 and the signal/noise threshold was set at 6. For creating peak density chromatograms and grouping peaks across samples, the minimum fraction of samples in which a peak had to be present in at least one sample group for the group to be valid was set at 0.5. The allowable retention time deviation for peak alignment was 10 s. After being normalized to the content of the internal standard (methyl undecanoate), the data in the table from the XCMS online platform were subjected to multivariate data analysis.
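The minimum-fraction rule described above can be illustrated with a short Python sketch. This is not the XCMS implementation; the function name `passes_minfrac`, the detection matrix, and the group labels are hypothetical and serve only to show how a threshold of 0.5 applied per sample group decides whether a feature group is retained.

```python
import numpy as np

def passes_minfrac(detected, groups, minfrac=0.5):
    """Return a boolean mask over features.

    detected : bool array (n_samples, n_features), True where a peak was found
    groups   : array of group labels, one per sample
    A feature is kept if it was detected in at least `minfrac` of the
    samples of at least one group.
    """
    detected = np.asarray(detected, dtype=bool)
    groups = np.asarray(groups)
    keep = np.zeros(detected.shape[1], dtype=bool)
    for g in np.unique(groups):
        # Fraction of samples in group g where each feature was detected
        frac = detected[groups == g].mean(axis=0)
        keep |= frac >= minfrac
    return keep

# Hypothetical detection matrix: 2 groups x 4 samples, 3 features
detected = np.array([
    [1, 1, 0],   # group A
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 1],   # group B
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
], dtype=bool)
groups = np.array(["A"] * 4 + ["B"] * 4)

mask = passes_minfrac(detected, groups, minfrac=0.5)
print(mask)
```

In this toy example the second feature is found in only one of four samples in each group, so it falls below the 0.5 threshold in both groups and is discarded.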
The SpectConnect (Version 1.0) online platform was used according to the instructions given in the paper by Styczynski et al. and in online instructions [15,38].The AMDIS software (Version 2.73) was used for spectral deconvolution and data set extraction.
The MestReNova 12.0 software was used for the automatic detection and integration of chromatographic peaks. Peaks were detected in the range from 2.9 to 45 min with the highest sensitivity (200), automatic smoothing, and no area threshold (0%). The obtained tables were merged into one dataset. The siloxane signals that were also present in the blanks were manually removed. DRS was used for the automatic identification of individual components in representative chromatograms of each cultivar.
Finally, using the MestReNova 12.0 software, each chromatogram was manually checked, and the peaks that the software had not recognized were additionally integrated. Every peak from one representative chromatogram of each cultivar was manually checked for MS fragmentation pattern matching as well as for agreement of its retention index with NIST17 and Wiley07 library data.
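The text does not state how the retention indexes were calculated. For temperature-programmed GC, one common choice is the linear retention index of van den Dool and Kratz, which interpolates between the two n-alkanes that bracket the analyte. The Python sketch below assumes that an n-alkane series was run under the same oven program; the function name and the retention times are hypothetical.

```python
from bisect import bisect_right

def linear_retention_index(t_x, alkane_rt):
    """Linear (temperature-programmed) retention index.

    t_x       : retention time of the analyte
    alkane_rt : dict mapping n-alkane carbon number -> retention time,
                measured under the same conditions (assumed available)
    """
    carbons = sorted(alkane_rt)
    times = [alkane_rt[c] for c in carbons]
    if not times[0] <= t_x <= times[-1]:
        raise ValueError("retention time is outside the n-alkane range")
    # Index of the alkane eluting at or just before the analyte
    i = min(bisect_right(times, t_x) - 1, len(times) - 2)
    n, n_next = carbons[i], carbons[i + 1]
    t_n, t_next = times[i], times[i + 1]
    return 100.0 * (n + (n_next - n) * (t_x - t_n) / (t_next - t_n))

# Hypothetical retention times (min) for C10 to C12 n-alkanes
alkanes = {10: 12.0, 11: 16.0, 12: 20.0}
ri = linear_retention_index(14.0, alkanes)
print(ri)
```

An analyte eluting halfway between decane and undecane gets an index halfway between 1000 and 1100, which can then be compared with the library value alongside the mass spectral match.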
Multivariate data analysis was performed using SIMCA software (version 15, Umetrics, Umeå, Sweden). The GC-MS data were mean-centered and scaled using the square root of the standard deviation as the scaling factor (Pareto). For the MestReNova data sets, Excel 16 was used for the normalization to the content of the internal standard (methyl undecanoate).
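The two preprocessing operations named above, normalization to the internal standard and Pareto scaling, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a reproduction of what SIMCA or Excel do internally; the function names, the column layout, and the peak areas are hypothetical.

```python
import numpy as np

def normalize_to_internal_standard(areas, is_column):
    """Divide every peak area by the internal standard area of the same
    sample and drop the internal standard column."""
    areas = np.asarray(areas, dtype=float)
    ratios = areas / areas[:, [is_column]]
    return np.delete(ratios, is_column, axis=1)

def pareto_scale(X):
    """Mean-center each variable and divide by the square root of its
    standard deviation (Pareto scaling)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0          # leave constant variables unscaled
    return centered / np.sqrt(sd)

# Hypothetical peak areas: rows are samples, column 1 is the internal standard
areas = np.array([
    [200.0, 100.0, 50.0],
    [400.0, 100.0, 90.0],
    [900.0, 150.0, 60.0],
])

ratios = normalize_to_internal_standard(areas, is_column=1)
scaled = pareto_scale(ratios)
print(ratios)
print(scaled)
```

Compared with unit-variance scaling, dividing by the square root of the standard deviation reduces the dominance of large peaks while keeping part of the original intensity structure, which is one reason it is often chosen for chromatographic data.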

Results and Discussion
Ten replicates (one berry in the stage of technological maturity) of each cultivar were analyzed using the headspace SPME GC-MS instrumental technique. The obtained data were processed using the online platforms (SpectConnect and XCMS), where the SpectConnect platform offers several data sets for building models (relative abundance, RA; integrated signal, IS; and base peak, AM), and were compared with the semi-manually and manually processed data obtained with the MestReNova software. The PCA, PLS-DA, and OPLS-DA models were generated, and the main statistical parameters of each of them are presented in Table 1. According to the initial PCA model, the Loch Ness cultivar had the most distinctive data set; it was separated the most from the others in the score plot. For this reason, it was chosen for pairwise comparison with each of the other cultivars in OPLS-DA models. Each OPLS-DA model was validated with CV-ANOVA (see Table 1) and a permutation test. In order to compare the results of those models, VIP-predictive plots were analyzed.
Compared with SpectConnect (119) and XCMS (72), the highest number of volatile components (peaks) was observed after manual evaluation of the chromatograms (167), about twice as many as with automatic peak detection at the highest sensitivity (87). In the further detailed investigation of the total ion chromatogram (TIC) using the AMDIS deconvolution algorithm, 269 compounds were detected. Of these, 219 volatile components were identified, but for 50 minor (less than 5%) compounds an identification was not possible due to a low ion abundance (concentration) and/or the lack of reference spectra and retention indexes in the libraries (see Supporting Information Table S1).
By comparing the models obtained using different datasets and their major statistical parameters, it can be noticed that the highest coefficient of determination (R²) is found for the models built on the XCMS data set and on the SpectConnect base peak data set, with an R² value of about 0.9. In the remaining models, the R² value ranged from 0.516 to 0.766. A similar trend was observed for the predictive ability of the models (Q²) (see Table 1). The major reason for that could be the number of variables as well as the data type that was used to generate the data set. Specifically, in the XCMS platform the abundance of the base ion was extracted instead of the total ion current. In that instance, the noise detection threshold is rather high, particularly for GC-overlapped analytes. This reduces the number of detected metabolites that would be subject to multivariate analysis but provides models with high statistical significance. Although it is quite intuitive and easy to use, this platform is not the most suitable for the metabolomic analysis of GC-MS results where the electron impact ionization technique is used and plant extracts are the subject of analysis. By contrast, the application of this platform to the analysis of compounds of similar polarity (e.g., aromas) gives reliable results [41]. To overcome the problems of peak overlapping, the SpectConnect platform used NIST software, which extracts individual component spectra from gas chromatography/mass spectrometry (GC-MS) data files by deconvolution and ion-counting noise procedures (Figure 1). As a result of the SpectConnect platform, three similar datasets based on relative abundance (RA), integrated signal (IS), and base peak (AM), as well as a retention time (RT) dataset showing the changes in retention times for individual compounds in different chromatograms, were obtained. Comparing the major statistical parameters generated from the datasets of the SpectConnect platform, there are no significant differences. On the other hand,
automatic integration in the MestReNova software provides average sensitivity compared to the previously mentioned platforms, while manual correction was able to detect even the smallest peak. When the SIMCA model statistical parameters were compared, there were no big differences between the obtained models when the PCA model for MNOVA-automatic was excluded. For the Ouachita cultivar sample, an example of the expanded region of the SPME GC-MS chromatogram displays the output data for every tested software solution (Figure 1). Automatic integration without manual correction and TIC checks resulted in the absence of component C and in its contribution being added to the peak of component B. By contrast, the XCMS platform recognized all three components and expressed their abundance through the base ion peak of each. It can be seen from Figure 1 that, although there is a slight overlap between the observed components, the XCMS platform does not recognize this and sets sharp limits for each of them. Finally, the SpectConnect platform, using AMDIS deconvolution, separates the superimposed ion currents of the different compounds and gives the total areas for each of them.
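The difference between a total ion current and a base ion abundance, which underlies the platform comparison above, can be shown with a minimal Python sketch. The scan matrix and m/z values are hypothetical, and the function is an illustration rather than the extraction routine of any of the platforms discussed.

```python
import numpy as np

def tic_and_base_peak(intensities, mz):
    """For each scan return the total ion current, the m/z of the base
    (most intense) ion, and the intensity of that base ion.

    intensities : array (n_scans, n_mz)
    mz          : array (n_mz,) of m/z values for the columns
    """
    intensities = np.asarray(intensities, dtype=float)
    mz = np.asarray(mz)
    tic = intensities.sum(axis=1)              # sum over all ions
    base_idx = intensities.argmax(axis=1)      # position of the base ion
    base_int = intensities[np.arange(len(intensities)), base_idx]
    return tic, mz[base_idx], base_int

# Hypothetical data: two scans, four m/z channels
mz = np.array([41, 55, 69, 83])
scans = np.array([
    [100.0, 300.0,  50.0, 50.0],
    [400.0, 350.0, 200.0, 50.0],
])

tic, base_mz, base_int = tic_and_base_peak(scans, mz)
print(tic, base_mz, base_int)
```

In this toy case the total ion current doubles between the two scans while the base ion intensity rises by only a third and the base ion itself changes, which is the kind of discrepancy that can arise when coeluting compounds contribute different fragment ions to the same scan.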
Although the parameters of the model and the number of input variables are important when deciding which platform to choose, it is also necessary to consider the variables that the model recognizes as significant for the separation of certain data sets. Thus, in the mentioned models, a VIP-predictive value of 1.3 or more was taken as the cutoff, and the variables that met it are given in Table 2. The criterion for choosing the limit value for VIP-predictive was the number of variables exceeding that value in each of the models; a compromise was made between too few and too many variables that were important for separation. Just by simple counting, it is clear that the largest number of variables significant for separation is present in the model with the most input data. The three models based on the SpectConnect platform data, on the other hand, differ in both the number and the types of aroma components that separate the two varieties (see Table 2). Nevertheless, the observed differences are not large. In the OPLS-DA models of Loch Ness/Columbia Star, only (E)-2-hexenal appeared in the RA dataset as relevant for separation. Apart from hexanal, hexenal, 2-heptanone, octanal, nonanal, theaspirane B, and ethyl dodecanoate, in the VIP-pred. analysis of the Loch Ness/Von OPLS-DA model, 1-hexanol and 2-heptanol were unique to the RA dataset, and 2-heptanol, hexyl acetate, decanal, and 2,6,10-trimethylpentadecane were unique to the IS dataset. Ethyl tiglate, myrcene, and heptanone were non-characteristic in all the Loch Ness/Nachez models; ethyl benzoate, α-muurolene, (E)-2-hexenal, (Z)-2-hexen-1-ol, 1,3,8-p-menthatriene, β-gurjunene, and (Z)-β-ionone were non-characteristic in Loch Ness/Prime-Ark 45; and decanal, (Z)-calamenene, and ethyl dodecanoate were non-characteristic in the Loch Ness/Ouachita OPLS-DA models. As mentioned, the number of detected and processed variables has decreased, as has the number of potential biomarkers, or, in this case, chemical
constituents of aroma that can be defined as characteristic of certain varieties. Even so, this had an effect on the total number of variables with a VIP-pred. score greater than 1.3; however, fewer than half of them were also flagged on the other platforms.
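To make the VIP cutoff concrete, the following Python sketch computes the generic variable importance in projection from PLS weights, scores, and y-loadings and then applies the 1.3 threshold. It is an assumption that this textbook formula is a fair stand-in here: SIMCA's VIP-predictive for OPLS models is restricted to the predictive component and may differ in detail. All names and numbers are hypothetical.

```python
import numpy as np

def vip_scores(W, T, Q):
    """Generic PLS VIP.

    W : (n_variables, n_components) weight vectors
    T : (n_samples, n_components) scores
    Q : (n_components,) y-loadings
    """
    W = np.asarray(W, dtype=float)
    T = np.asarray(T, dtype=float)
    Q = np.asarray(Q, dtype=float).ravel()
    p = W.shape[0]
    # Sum of squares of y explained by each component
    ssy = (Q ** 2) * np.sum(T ** 2, axis=0)
    # Squared, column-normalized weights
    w2 = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(p * (w2 @ ssy) / ssy.sum())

# Hypothetical one-component model with four variables
W = np.array([[3.0], [4.0], [0.0], [0.0]])
T = np.array([[1.0], [-1.0]])
Q = np.array([2.0])

vip = vip_scores(W, T, Q)
selected = np.flatnonzero(vip >= 1.3)   # indices of variables above the cutoff
print(vip, selected)
```

Because the squared VIP values always average to one, a cutoff above 1 such as 1.3 keeps only variables that contribute clearly more than average, and raising or lowering it trades off between too few and too many candidate markers.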
Aligning the peaks in MestReNova only by retention times, without the possibility of checking the agreement of their mass spectra, leads to the absence of some of the components recognized as noteworthy by the online platforms, while detecting even the smallest peaks increases the probability of finding additional distinguishing markers. All this resulted in the appearance of new variables that were not recognized on any other platform/model. Regardless of all the challenges presented, common features can be found in the presented results, and these can be regarded as reliable data that carry weight for the conclusions. The remaining question is the purpose of the research, which should guide the choice of the appropriate approach at the design stage while keeping in mind all the mentioned facts.

Table 1. The main statistical parameters of the models obtained with different platforms.