The Sample, the Spectra and the Maths—The Critical Pillars in the Development of Robust and Sound Applications of Vibrational Spectroscopy

The last two decades have witnessed an increasing interest in the use of so-called rapid analytical methods or high-throughput techniques. Most of these applications report the use of vibrational spectroscopy methods (near infrared (NIR), mid infrared (MIR), and Raman) on a wide range of samples (e.g., food ingredients and natural products). In these applications, the analytical method is integrated with a wide range of multivariate data analysis (MVA) techniques (e.g., pattern recognition, modelling techniques, calibration, etc.) to develop the target application. The availability of modern and inexpensive instrumentation, together with access to easy-to-use software, is driving a steady growth in the number of uses of these technologies. This paper underlines and briefly discusses the three critical pillars that support the development and implementation of vibrational spectroscopy applications: the sample (e.g., sampling, variability, etc.), the spectra and the mathematics (e.g., algorithms, pre-processing, data interpretation, etc.).

In recent years, vibrational spectroscopy has also been considered for its potential as a high-throughput phenotyping tool in both animals and plants, where novel applications related to plant breeding and selection, plant nutrition and physiology have been reported in the last 20 years [9][10][11][12][13][14][15]. More recently, vibrational spectroscopy techniques (e.g., NIR, MIR, Raman and hyperspectral imaging systems) have shown their ability to qualitatively analyse (e.g., classify, identify, and monitor) several types of samples (e.g., wine, meat, coffee, condiments, etc.), targeting issues related to the origin, traceability, and provenance of foods and food ingredients [9][10][11][12][13][14][15][16][17]. Concomitantly, recent developments in hardware (e.g., imaging techniques, optical sensors, handheld instrumentation, etc.) are adding new analytical possibilities for the potential users of these technologies, making them very [...] single sample using vibrational spectroscopy has been reduced, the time dedicated to interpreting and mining the data has exponentially increased, depending on the dataset [32][33][34].
Classical statistics are not able to handle the current increase in the volume of data generated with these approaches. In this context, the scope of MVA is wide: its applications are found in many fields, and the number of so-called toolboxes or methods is diverse [11][25][26][27][28][29][30][31]. The integration of MVA into vibrational spectroscopy provides the means to move the analysis beyond the one-dimensional (univariate) space, revealing constituents or properties that are important through the various interferences and interactions in the matrix [11][25][26][27][28][29][30][31]. Today, many modern instrumental measurement techniques are multivariate and based on indirect measurements of the chemical and physical properties of the sample [11][25][26][27][28][29][30][31]. Figure 1 shows the theoretical and practical links between the sample, the method or technique and the mathematics during the development of an application.
Molecules 2020, 25
Beyond the many advantages that the integration of vibrational spectroscopy with MVA offers, the ability to provide a holistic view of the system or sample analysed (e.g., fingerprint analysis) makes these approaches advantageous when compared with other analytical methods. In addition, the availability of modern and inexpensive instrumentation, together with access to easy-to-use software, is driving a steady growth in the number of applications of these technologies. Please note that this paper does not intend to be "another" review of multivariate data analysis and/or vibrational spectroscopy. The reader can find several excellent dedicated reviews already published in the scientific literature. Instead, the intention is to discuss and provide a guide to the main issues that can affect the successful implementation of these approaches.
Therefore, this paper underlines and briefly discusses the three critical pillars that support the development and implementation of vibrational spectroscopy applications: the sample (e.g., sampling, variability, etc.), the spectra and the mathematics (e.g., algorithms, pre-processing, data interpretation, etc.).

The Theory of Sampling and Uncertainty
Regardless of all the care taken during sampling, the sample always differs in composition from the intended target [35][36][37][38]. Even randomly replicated samples from the same target will differ among themselves, giving rise to the so-called sampling uncertainty [35][36][37][38]. Understanding the uncertainty derived from both the sampling and the analysis allows rational decisions to be made about a given process, classification or calibration result [35][36][37][38]. It is worth noting that the final application will be connected to making decisions about the target rather than about the sample [35][36][37][38].
Different authors have highlighted that one of the most important issues to be considered during sampling is whether the uncertainty is acceptable for the intended purpose [35][36][37][38]. One important issue to consider (and remember) is that the measurement uncertainty that arises from sampling is non-negligible [35][36][37][38][39]. This is even more significant when raw materials (e.g., food ingredients) and environmental samples (e.g., soil and water) are collected, where the uncertainty of the sampling exceeds the analytical contribution [35][36][37][38][39]. Therefore, the theory of sampling becomes highly relevant during the development of a given application.
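As an illustration of how the sampling and analytical contributions can be separated, the following minimal sketch applies a classical balanced nested analysis of variance. It assumes a duplicate-style design that is not described in this paper: several targets, at least two samples per target and at least two analyses per sample. The function name variance_components and the protein values are hypothetical.

```python
from statistics import fmean


def variance_components(data):
    """Split measurement variance into sampling and analytical parts.

    data[t][s][a] is the result for target t, sample s, analysis a.
    The design must be balanced, with at least two samples per target
    and at least two analyses per sample.
    """
    n_targets = len(data)
    n_samples = len(data[0])
    n_analyses = len(data[0][0])
    if n_samples < 2 or n_analyses < 2:
        raise ValueError("need at least two samples and two analyses")

    ss_analysis = 0.0  # spread of repeat analyses around their sample mean
    ss_sampling = 0.0  # spread of sample means around their target mean
    for target in data:
        sample_means = [fmean(sample) for sample in target]
        target_mean = fmean(sample_means)
        ss_sampling += sum((m - target_mean) ** 2 for m in sample_means)
        for sample, m in zip(target, sample_means):
            ss_analysis += sum((x - m) ** 2 for x in sample)

    ms_analysis = ss_analysis / (n_targets * n_samples * (n_analyses - 1))
    ms_sampling = n_analyses * ss_sampling / (n_targets * (n_samples - 1))

    var_analysis = ms_analysis
    # The between-sample mean square also contains analytical variance,
    # so it is subtracted; negative estimates are truncated to zero.
    var_sampling = max(0.0, (ms_sampling - ms_analysis) / n_analyses)
    return var_sampling, var_analysis


# Hypothetical protein results (%): 3 targets x 2 samples x 2 analyses.
results = [
    [[12.1, 12.3], [13.0, 12.8]],
    [[10.4, 10.5], [11.2, 11.0]],
    [[14.9, 15.1], [14.0, 14.2]],
]
var_sampling, var_analysis = variance_components(results)
print(f"sampling variance:   {var_sampling:.3f}")
print(f"analytical variance: {var_analysis:.3f}")
```

In this made-up dataset the sampling variance is much larger than the analytical variance, which is the situation described above for raw materials and environmental samples.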
The theory of sampling (TOS) comprehensively documents all aspects of the mechanical structure and chemical variation within a target in relation to the procedure for obtaining a primary sample from it [35][36][37][38][39]. Some of the main issues considered in the TOS are associated with the characteristics and/or properties of the target, including the size range of the particles comprising the target, the shapes of the particles, the compositional variation of the particles and the degree and style of the heterogeneity of the target, among others [35][36][37][38][39]. The method of collecting or extracting the sample and the degree of comminution/homogenisation/grinding at the different steps during the sampling process are important aspects included in the TOS [35][36][37][38][39]. All of these issues and properties contribute to identifying the types of "error" of a given analysis or process [35][36][37][38][39].
The different sources and types of "errors" should be eliminated, and attention to detail will define the procedure or sampling protocol that will deliver the "correct" sample [35][36][37][38][39]. Researchers and practitioners in the field state that "correct" should be interpreted as "unbiased", that is, sampling bias is avoided by definition [35][36][37][38][39].
It has been reported that, during the application of the TOS, sampling uncertainty is ignored and only the analytical uncertainty is considered [35][36][37][38][39]. The scientific literature in the field also suggests that the heterogeneity in the population, and the ways of counteracting its adverse influence due to sampling/signal acquisition, sub-sampling and sample preparation/presentation processes, must be considered and evaluated before analysis [35][36][37][38][39].
In summary, the TOS is the main framework that must be used as a guide during meta-analysis of any application using vibrational spectroscopy [35][36][37][38][39]. It has been highlighted that the TOS emphasises the fundamental sampling principle (FSP), which states that all potential units from an original material must have an equal probability of being sampled in practice, and that samples are not altered in any way after sampling [35][36][37][38][39]. In the context of model development (e.g., calibration/validation and prediction), the main interactions between the sampling and the analysis (e.g., physical sampling), or the sampling and the on-line application, must be evaluated and understood in order to avoid inaccuracies and mistakes [35][36][37][38][39].

Samples
In any given application of vibrational spectroscopy, the sample itself plays an important role in defining the success of such an application. However, the importance of both the sampling and the sample is usually overlooked. Two of the main characteristics or properties that define the success of a given application using vibrational spectroscopy are associated with the perturbation and the observation of the sample [39][40][41][42][43][44][45][46]. The perturbation is usually associated with the experimental conditions used to develop the application (e.g., dry vs. wet sample, temperature, whole vs. powder, etc.), while the observations/samples are associated with the sampling protocol and the property to be measured (e.g., limit of detection, range in concentration, standard error of the laboratory, number of samples, etc.) [39][40][41][42][43][44][45][46].

Sample Variability
Probably one of the main questions asked during the development of an application concerns the selection of the most suitable samples to be used during calibration development [47]. Several researchers agree that the samples used to build a given calibration model have to be selected from samples similar to those that will be analysed in the future [39][40][41][42][43][44][45][46][47][48]. In addition, the samples have to be exposed to the same pre-processing and handling steps, and this should be maintained when future samples are incorporated into the calibration. Samples used in calibration must cover a wide range of composition, or at least the expected range of the composition [39][40][41][42][43][44][45][46][47]. All sources of possible variation to be encountered in the future must be considered and/or incorporated into the sample set [39][40][41][42][43][44][45][46][47][48]. If samples are used to represent a process, all potential variations in the system, including factors such as temperature, changes in particle size, physical changes in the sample, and equipment, should be incorporated [39][40][41][42][43][44][45][46][47]. When dealing with biological materials (e.g., plants, animal muscle or tissues), other variations must be evaluated, such as harvest time and type of tissue (e.g., type of muscle), among others [39][40][41][42][43][44][45][46][47][48].
However, the selection of samples is not an arbitrary task and demands care. For example, during calibration development, the aim is to obtain homogeneous and representative samples that are well distributed across the dataset. If too many samples are available, it is advisable to select a subset in order to develop a well-balanced dataset. Although randomisation is the preferred method to select samples to be included in the calibration, a better approach is the use of robust techniques based on either Mahalanobis or Euclidean distances, or on the Kennard-Stone algorithm [49,50]. Recently, the use of kernel distances has been reported as a robust method to objectively select samples [49,50].
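The distance-based selection mentioned above can be sketched as follows. This is a minimal illustration of a Kennard-Stone style max-min rule using Euclidean distances, not the implementation used in the cited works; the function names and the one-dimensional toy data are assumptions made for the example. In practice the inputs would be spectra or principal component scores.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone(samples, n_select):
    """Return the indices of n_select samples chosen by a max-min rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    dist = [[euclidean(samples[i], samples[j]) for j in range(n)] for i in range(n)]

    # Start with the two samples that are furthest apart.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    while len(selected) < n_select:
        # For each candidate, find its distance to the nearest selected
        # sample, then take the candidate for which that distance is largest.
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: five samples described by a single variable.
toy = [[0.0], [1.0], [2.0], [10.0], [5.0]]
print(kennard_stone(toy, 3))
```

The rule favours samples at the edges of the data first and then fills the gaps, which tends to give a more even coverage of the variable space than random selection when the dataset contains many near-duplicates.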

Collecting the Information-The Spectra
A wide range of analytical methods and techniques based on vibrational spectroscopy are available in the market nowadays (e.g., NIR, MIR, Raman, lab bench and handheld instrumentation, hyperspectral imaging, etc.) [51][52][53]. All of these techniques have in common the fact that they generate large amounts of data. Munck and collaborators stated that most instruments based on vibrational spectroscopy are extensively used as black box devices for the estimation of chemical composition based on calibrations [51][52][53]. Very few scientists are aware that black box technology can be expanded to the physical-chemical characterisation of spectra [51][52][53]. Please note that it is not the objective of this paper to provide a comprehensive and detailed description of the different vibrational methods used as rapid or high-throughput methods [54][55][56][57][58][59][60][61][62][63][64]. More detailed information about the different methods and techniques available, as well as the technical characteristics or properties of the commercial instrumentation available in the market, can be found elsewhere [54][55][56][57][58][59][60][61][62][63][64].
It has been stated (and is sometimes believed by some users of MVA) that if the data already contain information, then any MVA method will succeed [35][36][37][38][39]. Unfortunately, data are rarely as clean as expected: sampling errors, instrument noise and typing mistakes, among others, can have a large impact, and in such cases neither pre-processing nor any other correction will improve the accuracy of the analytical results (e.g., inaccuracies can never be modelled) [35][36][37][38][39]. Therefore, a word of caution: MVA is not a "black box" or "push button" approach where the modelling will automatically do the rest [35][36][37][38][39].

Data Pre-Processing
Before starting with the analysis, interpretation and model development, data pre-processing is a critical stage, as it affects the performance of the algorithms used and therefore the results (e.g., calibration and classification) [79][80][81][82][83]. Different methods and/or techniques for data pre-processing have been developed and applied specifically to different types of data and experimental designs [79][80][81][82][83]. For example, pre-processing of the spectra using first and second derivatives, smoothing, multiplicative scatter correction (MSC), standard normal variate (SNV) and other normalization techniques has been reported in most of the applications using vibrational spectroscopy [79][80][81][82][83]. Details about these pre-processing methods and techniques can be found in reviews by other authors [79][80][81][82][83].
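As a concrete example of one of the pre-processing methods listed above, the sketch below applies the standard normal variate (SNV) transformation to a single spectrum: each spectrum is centred on its own mean and scaled by its own standard deviation. The function name snv and the absorbance values are illustrative assumptions; the sample standard deviation (n - 1 denominator) is used here.

```python
from statistics import fmean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own statistics."""
    mean = fmean(spectrum)
    sd = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("a flat spectrum cannot be scaled")
    return [(value - mean) / sd for value in spectrum]


# Two hypothetical spectra of the same material: the second one has an
# additive offset and a multiplicative scaling, as scatter effects can produce.
spectrum_a = [0.10, 0.20, 0.30, 0.25, 0.15]
spectrum_b = [2.0 * value + 0.5 for value in spectrum_a]

corrected_a = snv(spectrum_a)
corrected_b = snv(spectrum_b)
largest_gap = max(abs(a - b) for a, b in zip(corrected_a, corrected_b))
print(f"largest difference after SNV: {largest_gap:.2e}")
```

After SNV the two spectra coincide up to rounding error, which shows why the transformation is commonly used to reduce offset and scaling differences between spectra. It does not remove other effects such as wavelength-dependent baseline curvature.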
The validation of classification models (e.g., discrimination) derived from the application of hyperspectral imaging has its own challenges [105,117]. A recent tutorial reviewed the different validation methods used in hyperspectral imaging analysis [105,117]. One of the main issues encountered is related to the samples used to develop the models. If too many samples are used (e.g., oversampling), unconstrained bootstrap and k-fold cross-validation might yield inaccurate results, failing to provide a realistic estimate of the predictive performance of the model [105,117]. Factors that can have a large influence during the analysis include the range of data points (e.g., wavenumbers) used, the size of the image, the distribution of pixels from the different classes in the image and the number of pixels included in the training set [117]. The authors of the tutorial indicated that better results were obtained when randomised samples were used to develop the calibration and validation datasets [117].
The development of discriminant models utilising image data acquired from a single sample is highly risky, as the models might not take into consideration the effect of several sources of variation in the IR signal (e.g., age, body mass index, collection dates, sample storage or instrument performance) [105,114]. Therefore, validation using an external validation set is necessary in order to avoid overoptimistic results [105,116,117]. Other validation methods have been proposed during the integration of discriminant approaches into hyperspectral image analysis [116][122][123][124]. A summary of these applications can be found in a review by Guaita and collaborators [116][122][123][124].
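One practical way to reduce the risk described above is to split the data by physical sample rather than by individual spectrum or pixel, so that all spectra from one sample fall entirely in either the training or the test set. The sketch below is a simple illustration of that idea and is not the procedure of the cited tutorial; the function name grouped_split, the test fraction and the sample identifiers are assumptions.

```python
import random


def grouped_split(sample_ids, test_fraction=0.3, seed=0):
    """Split row indices so that each physical sample stays on one side only.

    sample_ids[i] identifies the physical sample that spectrum (or pixel) i
    was measured on. Returns (train_indices, test_indices).
    """
    groups = sorted(set(sample_ids))
    if len(groups) < 2:
        raise ValueError("at least two physical samples are needed")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(groups)

    # Keep at least one group for testing and at least one for training.
    n_test = min(len(groups) - 1, max(1, round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])

    train = [i for i, s in enumerate(sample_ids) if s not in test_groups]
    test = [i for i, s in enumerate(sample_ids) if s in test_groups]
    return train, test


# Hypothetical pixel spectra taken from four physical samples.
ids = ["meat_01", "meat_01", "meat_02", "meat_02", "meat_03", "meat_03", "meat_04"]
train_idx, test_idx = grouped_split(ids)
print("train:", [ids[i] for i in train_idx])
print("test: ", [ids[i] for i in test_idx])
```

Splitting pixels at random would place near-identical neighbouring pixels in both sets, which is one way overoptimistic results arise. A grouped split of this kind is still weaker than a truly independent external validation set collected at another time or place.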

Data Interpretation
One of the main issues is that the comparison of results from the literature is usually complicated by variations in the population size and structure with respect to the attribute of interest. It is therefore critical to report the standard deviation (SD) of the population for the attribute of interest [28,40,41,46,48,78,109,110,113]. In general, a range of statistics must be reported in order to compare different calibrations, including the coefficient of correlation (R), the root mean square error of cross-validation (RMSECV), the standard error of prediction (SEP), the SD, the number of samples used, the number of outliers removed, and the number of principal components [28,40,41,46,48,78,109,110,113,116]. Reporting marginal gains in the standard error of cross-validation or prediction after the use of several pre-processing methods should be avoided. The same applies when different algorithms are used with no real improvement in the predictive ability of the models. A summary of the main statistics to be considered during calibration interpretation and reporting can be found in the report by Williams and collaborators [112].
Calibration models are often evaluated and/or reported using a combination of some of the statistics presented above. However, the interpretation and evaluation of statistics alone is not enough, and the loadings or coefficients of regression must be interpreted in the context of the property or the measured chemical analyte [28,40,41,46,50,78,109,112,113]. For example, if a calibration was developed to measure or predict protein, it is expected that wavelengths or frequencies that contain information about the N-H bonds will be prevalent. In real-life applications of vibrational spectroscopy, the calibration or model must be judged in relation to its fit-for-purpose criterion [28,40,41,46,50,78,109,112,113].
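To make the reporting statistics above concrete, the following sketch computes the bias, the root mean square error, the bias-corrected standard error of prediction (SEP), the SD of the reference values and the coefficient of correlation (R) from paired reference and predicted values. The function name and the example values are hypothetical, and the formulas are the commonly used definitions rather than ones taken from this paper. When the predictions come from cross-validation, the root mean square error corresponds to the RMSECV.

```python
import math
from statistics import fmean, stdev


def prediction_statistics(reference, predicted):
    """Summary statistics for paired reference and predicted values."""
    n = len(reference)
    if n != len(predicted) or n < 2:
        raise ValueError("need at least two paired values")
    errors = [p - r for p, r in zip(predicted, reference)]
    bias = fmean(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # SEP: standard deviation of the errors after removing the bias.
    sep = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))

    mean_r, mean_p = fmean(reference), fmean(predicted)
    sxy = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    sxx = sum((r - mean_r) ** 2 for r in reference)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    r_coef = sxy / math.sqrt(sxx * syy)

    return {
        "n": n,
        "SD": stdev(reference),
        "bias": bias,
        "RMSE": rmse,
        "SEP": sep,
        "R": r_coef,
    }


# Hypothetical reference and predicted values for a validation set.
reference = [10.2, 11.5, 12.1, 13.4, 14.0, 15.3]
predicted = [10.6, 11.3, 12.5, 13.1, 14.4, 15.0]
for name, value in prediction_statistics(reference, predicted).items():
    print(f"{name}: {value:.3f}")
```

Reporting the SD of the reference values alongside the error statistics lets a reader judge the error relative to the spread of the population, which is the comparison problem raised in the paragraph above.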
During the application of any of the MVA techniques presented above, it is important to select the appropriate number of components or latent variables (optimization) [117][118][119][120]. In this process, there is a delicate balance: if too many are used, there is too much redundancy in the independent variables used during the development of the model, causing the model to become overfitted [117][118][119][120]. In this case, the calibration model will be very dependent on the dataset and might provide poor prediction results [117][118][119][120][121][122][123][124]. On the other hand, using too few components will cause underfitting and the model will not be large enough to capture the variability in the data [117][118][119][120]. This "fitting" effect is strongly dependent on the number of samples used to develop the model and, in general, more samples give rise to more accurate predictions [117][118][119][120][121][122][123][124].
Overall, the use of MVA carries the risk of overfitting (over-parameterization), which increases the risk of false discovery [121]. Overfitting can be reduced during exploratory applications of vibrational spectroscopy by the use of rank optimization (e.g., based on pragmatic cross-validation), or by the use of double cross-validation (cross-model validation) [121]. These approaches, although not ideal, can be used until large, representative and independent test sets are obtained [121].
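A simple parsimony rule for rank optimization can be sketched as follows. It assumes that a cross-validation error (e.g., RMSECV) has already been computed for models with 1, 2, 3, ... components, and it selects the smallest number of components whose error lies within a chosen tolerance of the minimum. This heuristic, the function name choose_components and the 5% tolerance are assumptions made for illustration and are not a procedure prescribed by the cited works.

```python
def choose_components(cv_errors, tolerance=0.05):
    """Pick the smallest model whose cross-validation error is near the minimum.

    cv_errors[k - 1] is the cross-validation error of the model with k
    components. tolerance is the accepted relative excess over the minimum.
    """
    if not cv_errors:
        raise ValueError("cv_errors must not be empty")
    threshold = min(cv_errors) * (1.0 + tolerance)
    for k, error in enumerate(cv_errors, start=1):
        if error <= threshold:
            return k


# Hypothetical RMSECV values for models with 1 to 6 components.
rmsecv = [1.00, 0.60, 0.41, 0.40, 0.405, 0.43]
print(choose_components(rmsecv))
```

In this made-up series the error is lowest with four components, but three components are selected because the fourth only brings a marginal gain. Choosing the tolerance remains a judgment call and should be reported together with the results.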
The steps needed to develop an application combining the sample, the spectra and the reference data are summarised in Figure 2.

Figure 2. Steps needed to develop an application combining the sample, the spectra and the reference data.

Concluding Remarks
The integration of vibrational spectroscopy with MVA to develop analytical applications (e.g., calibration and classification) can be considered by the non-expert purely as a mathematical or statistical exercise. This, however, could not be further from the truth. Calibration development is a complex process that requires an understanding of a system created by the sample and its inherent characteristics (e.g., physical and chemical properties, variability, origin, pre-processing, etc.), the origin of the spectra (e.g., instrument characteristics, sample collection mode, etc.) and all the aspects of the multivariate data analysis (e.g., pre-processing, selection of samples for calibration and validation, linear and non-linear algorithms, outliers, etc.).
These developments require a basic understanding of the different variables that contribute to the system, including the sample, fundamentals of spectroscopy, data processing and analysis, sampling protocols, and limit of detection (see Figure 3). The ability of vibrational spectroscopy to contribute efficiently and reliably to a growing number of applications in analytical chemistry, process analytical technologies, traceability of food ingredients, and natural products makes these techniques an ideal set of methodologies for sustainability along the food value chain. An increasing number of research groups have investigated the use of vibrational spectroscopy, as shown in several applications reported in the literature. However, commercial implementation of these techniques is still under development in some industries.

Even though several articles have been published in the scientific literature, most of them describe feasibility studies or potential applications of vibrational spectroscopy, where small datasets containing few samples are analysed and cross-validation, rather than an independent dataset, is used to validate the developed models (e.g., calibration). Adding to this is the limited in-depth understanding of the reference laboratory method (e.g., the standard error of the laboratory method). Most applications of vibrational spectroscopy are considered correlative methods, and their accuracy depends on the error of the reference method. Therefore, knowledge of the extent to which results are repeatable using wet chemistry or biochemical procedures is of paramount importance in judging the reliability of a calibration.
It is important to remember that the wet chemistry or reference data, with all their known inadequacies, are used to assess the performance of the calibrations; thus, before assessing the accuracy of a calibration or model, the error associated with the reference method should be known, a fact that is often ignored. In addition, the interpretation of loadings, the significance of the coefficients of regression, and the inter-correlations among measured variables and chemical compounds are usually missing from the interpretation.
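The error of the reference method can be estimated when the reference laboratory analyses each sample in duplicate. The sketch below uses the common duplicate-based formula for the standard error of the laboratory (SEL), the square root of the sum of squared differences between duplicates divided by twice the number of pairs. The function name and the duplicate values are hypothetical, and other SEL definitions exist for designs with more than two replicates.

```python
import math


def standard_error_of_laboratory(duplicates):
    """SEL from duplicate reference analyses: sqrt(sum(d^2) / (2 * n))."""
    n = len(duplicates)
    if n == 0:
        raise ValueError("at least one duplicate pair is needed")
    squared_differences = sum((first - second) ** 2 for first, second in duplicates)
    return math.sqrt(squared_differences / (2 * n))


# Hypothetical duplicate wet chemistry results for three samples.
pairs = [(10.0, 10.2), (12.1, 11.9), (9.8, 10.0)]
sel = standard_error_of_laboratory(pairs)
print(f"SEL: {sel:.3f}")
```

Comparing the SEL with the standard error of prediction gives a first indication of how much of the calibration error may be attributable to the reference method rather than to the spectra or the model.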
The use of MVA reveals interesting information about the system, but important aspects might remain undiscovered. Even extensive use of good MVA (e.g., new algorithms, new software, or mathematical pre-processing) is meaningless if we fail to evaluate the best sample presentation and processing, or the interactions between sample collection and analysis.
One of the interesting aspects of the modern integration of these technologies is that it requires and sources information and knowledge from many fields (e.g., spectroscopy, analytical chemistry, data analysis, biology, physics, etc.). This gives the approach its unique multidisciplinary character. A close collaboration between several researchers is therefore critical for the application and development of the technology. It is also important that everyone involved in the process understands and agrees upon the goals and requirements of the study beforehand, to reduce the risk of weak links in the study. The definition of protocols for reporting the outcomes and results of any given study is also important.
Knowing and understanding the reference laboratory method (such as the standard error of the lab method), the limitations of the method, and the physical and chemical basis of the spectra, as well as knowing and interpreting the interactions that exist between the sample and the instrument, will allow the user to better interpret the calibration or the mathematical relationships obtained. It is therefore important that the individual who develops such calibrations has this knowledge in order to produce a reliable method.
Martens [121] has highlighted that the scientific process of boring into the solid "mountain of the unknown" never stops, and that it is continuous. The author suggested that statistically valid claims must be replicated independently, intuitive hunches should be chased and solid manmade theories should be assessed critically.
The ability of vibrational spectroscopy to predict multiple parameters, together with its speed of analysis, means that we have a powerful tool that can revolutionise the way we produce foods. The future development of such applications will provide the industry with a very fast and non-destructive method to monitor composition or changes and to detect unwanted problems, providing a rapid means of qualitative rather than quantitative analysis. Moreover, the choice of measuring device(s) may benefit from the experience in, e.g., multichannel diffuse near infrared (NIR) spectroscopy, measuring many properties, preferably more than necessary (it usually does not cost much extra).
However, various hurdles still hinder the growth and development of vibrational spectroscopy applications. Among them is the reluctance to accept the combination of vibrational spectroscopy with new statistical tools, such as multivariate data analysis techniques, as routine analytical or quality control methods. In addition, most of the current courses and training programmes in food still focus on the so-called classical approach, and several aspects related to new technologies, sensors and programming are not yet incorporated in the curricula. The same can be said regarding research and other aspects of informal training and extension. Together with the silo mentality that still exists in the food industry, this hinders the possibility of exploiting the full potential of these systems by the industry.
Finally, one of the most important and critical aspects of the development of vibrational spectroscopy is the need for an appropriate level of training. For example, although knowledge of the chemistry of a sample material is useful, routine analyses can be performed by analysts with a high-school education. On the other hand, calibration development (interpretation, application and monitoring) is by far the most critical aspect and thus requires a high level of expertise, particularly in multivariate data analysis, in order to make an application successful. Where methods based on vibrational spectroscopy have been applied in industry situations, the potential savings, reduction in time and cost of analysis have been demonstrated. These methods show promising potential for in-field and process analysis.