Comparison of Multivariate Regression Models Based on Water- and Carbohydrate-Related Spectral Regions in the Near-Infrared for Aqueous Solutions of Glucose

The predictive power of the two major water bands centered at 6900 cm-1 and 5200 cm-1 in the near-infrared (NIR) region was compared to carbohydrate-related spectral areas located in the first overtone (around 6000 cm-1) and combination (around 4500 cm-1) region using glucose in aqueous solutions as a model substance. For the purpose of optimal coverage of stronger as well as weaker absorbing NIR regions, cells with three different declared optical pathlengths were employed. The sample set consisted of multiple separately prepared batches in the range of 50–200 mmol/L. Moreover, the samples were divided into a calibration set for the construction of the partial least squares regression (PLS-R) models and a test set for the validation process with independent samples. The first overtone and combination region showed relative prediction errors between 0.4–1.6% with only one PLS-R factor required. On the other hand, the errors for the water bands were found between 1.6–8.3% and up to three PLS-R factors required. The best PLS-R models resulted from the cell with 1 mm optical pathlength. In general, the results suggested that the carbohydrate-related regions in the first overtone and combination region should be preferred over the regions of the two dominant water bands.


Introduction
Glucose is of great importance in physiological systems, medicine, and health care, as well as in the food and beverage industry. Besides other analytical methods, glucose and other carbohydrates are often quantitatively determined enzymatically or via chromatographic methods such as high performance liquid chromatography (HPLC) or gas chromatography (GC) [1][2][3]. However, these methods are rather time-consuming and expensive as they usually require sample preparation, long measurement times (incubation of enzymes, separation on column, etc.), and qualified personnel. Considering these drawbacks of conventional analytical techniques, near-infrared spectroscopy (NIRS) is of increasing interest as an alternative method for the quantification of glucose and other carbohydrates in aqueous solutions. NIRS usually requires no sample preparation, offers fast and non-invasive analyses, and makes multiple sample characteristics accessible with a single measurement. Moreover, NIR spectrometers are cheap to run and can be operated by relatively untrained personnel. Alongside the mentioned advantages, NIRS comes with a few downsides. One is that the information contained in NIR spectra often needs to be extracted using multivariate data analysis tools such as principal component analysis (PCA) or partial least squares regression (PLS-R). In this study, cells with three different optical pathlengths were employed in order to access the whole NIR region from 10,000-4000 cm −1 (thinner cell pathlength) and to account for the lower absorption in the first overtone region (thicker cell pathlength).

Samples
D-(+)-glucose (≥99.5%) was purchased from Carl Roth (Karlsruhe, Germany) and Milli-Q water with a resistivity of 18.2 MΩ cm was used for the preparation of the glucose solutions. The calibration set was composed of pure Milli-Q water and glucose concentrations ranging from 50-200 mmol/L in steps of 30 mmol/L. For the test set, samples with glucose concentrations of 60.2, 130.5 and 186.0 mmol/L were prepared.
In order to avoid any effects of measurement time, multiple independent batches for both calibration and test samples were prepared. The calibration and test set consisted of three and two batches per sample, respectively. The preparation and measurement of the samples were randomized. Furthermore, all samples were measured on the day they were prepared.
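The sample design described above can be written out as a short sketch. The variable names are illustrative, and the batch indices are placeholders rather than identifiers used in the study.

```python
from itertools import product

# Calibration levels: pure water (0 mmol/L) plus 50-200 mmol/L in steps of 30.
cal_levels = [0.0] + [float(c) for c in range(50, 201, 30)]
# Test levels as stated in the text.
test_levels = [60.2, 130.5, 186.0]

# Three independent batches per calibration level, two per test level.
cal_samples = list(product(cal_levels, range(1, 4)))
test_samples = list(product(test_levels, range(1, 3)))

print(cal_levels)
print(len(cal_samples), len(test_samples))
```

Enumerating the design this way gives 21 calibration samples and six test samples.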

FT-NIR Measurements
The Büchi NIRFlex N-500 FT-NIR spectrometer (Büchi, Flawil, Switzerland) equipped with the liquids measurement cell was used to acquire NIR spectra of the glucose solutions. The spectra were recorded in transmission mode in the range of 10,000-4000 cm −1 with a spectral resolution of 8 cm −1, and each sample was scanned 64 times. In order to account for the lower absorption towards increasing wavenumbers, the measurements were performed using three cell types with different declared optical pathlengths. The cells were purchased from Hellma GmbH & Co. KG (Müllheim, Germany) and specified as follows: one 106-QS quartz SUPRASIL® cell with 0.1 mm optical pathlength and demountable cell windows, and multiple 100-QX quartz SUPRASIL® cells with 1 mm and 2 mm optical pathlength, respectively. Spectrometer reference measurements were performed at the beginning and in the middle of each measurement day. Data acquisition was accomplished using the NIRWare 1.4.3010 software package (Büchi, Flawil, Switzerland).
Samples were always freshly prepared and measured randomly over a period of four weeks. In contrast to the 1 mm and 2 mm cells, which were simply filled with a certain amount of sample solution, the 0.1 mm cell is demountable and thus had to be filled differently. For the 0.1 mm cell, approximately 40 µL of glucose solution was applied onto the sample recess of one optical cell window, followed by the careful attachment of the second cell window. Excess sample solution was displaced and collected with a tissue. Sticky glucose residues on the outside surface of the cell were removed.
The NIR measurements were performed at 35°C due to the fact that the NIRFlex N-500 liquids measurement cell was subjected to significant fluctuations at lower temperatures. However, at 35°C, the temperature fluctuation stabilized at ±0.1°C [13]. In order to avoid the introduction of temperature-driven shifts in the NIR spectrum, each cell filled with sample solution was tempered to 35°C before the measurements were started. The 0.1 mm cell was thermally equilibrated for 30 s, while the 1 mm and the 2 mm cells were thermally equilibrated for 1 min and for 2 min, respectively. To avoid water evaporation, the 1 mm and 2 mm cells were covered with a lid, whereas the 0.1 mm cell did not offer any cover possibility since the two cell windows were kept together by adhesion. The only possibility to prevent the evaporation of water out of the 0.1 mm cell was to minimize the time between the filling of the cell with sample solution and the actual sample measurement.
All samples were measured nine times, while the cells were refilled with fresh solution for each of the nine repeat measurements. The cells were cleaned thoroughly after every single measurement using Milli-Q water and ethanol. Lint-free tissues, as well as a conventional compressed air system, were used in order to dry the cells and remove potential dust particles.

Band Assignment and Division of Spectral Regions
In this study, the two water bands at around 6900 cm −1 and 5200 cm −1, as well as two regions of glucose-related vibrations located at around 5900 cm −1 and 4400 cm −1, were used to compare the predictive power of each region separately at different cell pathlengths. Since water is known to be a strong absorber in the near-infrared [8,30], complete absorption of the NIR light can occur in certain spectral regions, depending on the cell pathlength, infrared source, and detector [30]. As a consequence, the water band centered at 5200 cm −1 could not be used for quantitative analysis with either the 1 mm or the 2 mm cell. This region was only accessible using the 0.1 mm cell.
The NIR regions related to water were selected in such a way that they ranged from the beginning to the end of the corresponding NIR band, while the regions related to glucose were chosen according to the spectral pattern after the application of the second derivative discussed later (Section 3). For the water band at around 6900 cm −1, a spectral range of 7692-6248 cm −1 was selected in order to match the region frequently used in aquaphotomics [27]; this region was labeled W1 in this study. This band is commonly referred to as the first overtone of water [27,31], although it is actually a combination of symmetric and antisymmetric stretching vibration modes of water [8,9]. The spectral region for the second water band, labeled W2, was set to 5400-4600 cm −1 and is assigned to the combination of bending and antisymmetric water stretching modes [7,8].
The two regions at around 5900 cm −1 and 4400 cm −1 related to glucose vibrations in water were labeled as G1 and G2, respectively. For G1, the spectral region was set to 6100-5800 cm −1 and is assigned to first overtone vibrations of C−H compounds [8,16,32]. The spectral region for G2 was set to 4520-4300 cm −1 and is assigned to combinations of C−H stretching and CH2 deformation vibrations, as well as combinations of stretching vibrations of glucose-related O−H and C−O compounds [8,16]. Figure 1 shows an exemplary NIR spectrum of water containing glucose with the described division of spectral regions. Note that the small band around 4500 cm −1 (marked with an asterisk in Figure 1) is caused by O−H residues in the quartz windows of the 0.1 mm cell (probably due to water impurities [33]) and is assigned to a combination of an O−H stretching vibration and one of the SiO2 fundamental vibrations [8,34].
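The region limits given above can be collected in a small lookup, sketched here with illustrative names. The 4 cm−1 point spacing of the dummy axis is an assumption made only for this demonstration; the text states the spectral resolution (8 cm−1) but not the digital point spacing.

```python
# Region limits in cm^-1 as stated in the text (upper limit, lower limit).
REGIONS = {
    "W1": (7692.0, 6248.0),
    "W2": (5400.0, 4600.0),
    "G1": (6100.0, 5800.0),
    "G2": (4520.0, 4300.0),
}

# Regions usable per cell pathlength in mm; W2 saturates in the 1 mm and 2 mm cells.
USABLE = {
    0.1: ["W1", "W2", "G1", "G2"],
    1.0: ["W1", "G1", "G2"],
    2.0: ["W1", "G1", "G2"],
}

def select_region(wavenumbers, spectrum, name):
    """Return the (wavenumber, intensity) points that fall inside a named region."""
    upper, lower = REGIONS[name]
    pairs = [(w, a) for w, a in zip(wavenumbers, spectrum) if lower <= w <= upper]
    return [w for w, _ in pairs], [a for _, a in pairs]

# Dummy axis from 10,000 to 4000 cm^-1 with an assumed 4 cm^-1 spacing.
wn = [10000.0 - 4.0 * i for i in range(1501)]
spec = [0.0] * len(wn)
g2_wn, g2_abs = select_region(wn, spec, "G2")
```

Keeping the limits in one place makes it easy to apply the same region definition to the spectra of all three cells.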

Multivariate Data Analysis
The Unscrambler X Ver. 10.5 (Camo Software AS, Oslo, Norway) was used for the pre-treatment of the NIR spectra as well as the construction and validation of the multivariate regression models. Due to the occurrence of interference fringes using the 0.1 mm cell, the frequency filtering technique fast Fourier transform filter (FFT-filter) [13] was applied to these NIR spectra using OriginPro Ver. 9.1G (OriginLab Corporation, Northampton, MA, USA). Thereby, the NIR spectra were first Fourier transformed, then multiplied by a filter function, and finally retransformed by inverse Fourier transformation. In order to eliminate only the disturbing interferences and leave the regular spectra containing the targeted information untouched, the parabolic low-pass filter was chosen as the FFT filter function. This filter blocks all frequencies above a certain threshold value (cutoff frequency), while lower frequency elements are allowed to pass [13]. The cutoff frequency was set to 0.02625 Hz; all frequencies above this value were eliminated before the spectra were inverse Fourier transformed. The suitability of this approach has been validated before [13]. Afterward, the spectra of all three cell pathlengths were transformed from transmittance to absorbance.
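The transform, filter, and back-transform sequence can be illustrated with a minimal NumPy sketch. It uses a hard cutoff instead of OriginPro's parabolic window, so it is a simplified stand-in rather than the filter used in the study. The function names, the synthetic spectrum, and the unit sample spacing `d` are assumptions. The conversion to absorbance uses the standard relation A = −log10(T).

```python
import numpy as np

def fft_lowpass(y, cutoff, d=1.0):
    """Zero all Fourier components above `cutoff` (cycles per unit of spacing `d`)."""
    y = np.asarray(y, dtype=float)
    spectrum = np.fft.rfft(y)                  # forward transform
    freqs = np.fft.rfftfreq(y.size, d=d)       # frequency of each component
    spectrum[freqs > cutoff] = 0.0             # hard cutoff (simplification)
    return np.fft.irfft(spectrum, n=y.size)    # inverse transform

def transmittance_to_absorbance(t):
    """Convert fractional transmittance to absorbance."""
    return -np.log10(np.asarray(t, dtype=float))

# Synthetic transmittance: a slow baseline plus a fast sinusoidal "fringe".
x = np.arange(512)
slow = 0.5 + 0.1 * np.cos(2 * np.pi * 2 * x / 512)
fringe = 0.01 * np.sin(2 * np.pi * 64 * x / 512)
filtered = fft_lowpass(slow + fringe, cutoff=0.02625)
absorbance = transmittance_to_absorbance(filtered)
```

In this synthetic case the fringe lies well above the cutoff and is removed completely, while the slow component passes unchanged. With real spectra a hard cutoff can introduce ringing, which is one reason a smoother window such as the parabolic one may be preferred.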
The NIR spectra were reduced batchwise from nine spectra to one representative spectrum for each batch. All previously defined spectral regions were subjected to an individual optimization of pre-treatments (see Table 1). However, in the case of the regions W1 and W2, the pre-treatments were chosen as proposed in aquaphotomics literature [27] with an additional application of a standard normal variate (SNV) transformation [35]. For the regions G1 and G2, it was found that a second order Savitzky-Golay derivative [36] with a second order polynomial and a varying number of smoothing points was optimal. Second order derivative spectra were also calculated for the two water-related regions W1 and W2. The results were inferior compared to the pre-treatments mentioned above and will therefore not be discussed any further. For each spectral region, regression models were calculated using partial least squares regression (PLS-R) along with the NIPALS algorithm. The calibration process incorporated 21 calibration samples from three batches, whereas the performance of the calibration models was evaluated with six completely independent samples from two batches (test set validation). Note that these samples were never employed in any calibration [37,38]. The performances of the PLS-R models were assessed using the root mean square error (RMSE), which was calculated according to Equation (1), where y_i and ŷ_i represent the reference and predicted values, respectively. Furthermore, in order to enable a more straightforward interpretation of the RMSE's scale, a percentage error called normalized RMSE (NRMSE) was introduced, which refers to the calibration range of 0-200 mmol/L (see Equation (2)).
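The pre-treatments named above (batchwise averaging, SNV, and a second-order Savitzky-Golay second derivative) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the Unscrambler implementation: the function names are hypothetical, SNV is assumed to use the sample standard deviation, and edge points without a full smoothing window are simply dropped.

```python
import numpy as np

def batch_mean(replicates):
    """Average replicate spectra (rows) into one representative spectrum."""
    return np.mean(np.asarray(replicates, dtype=float), axis=0)

def snv(spectrum):
    """Standard normal variate: centre and scale a spectrum by its own mean and SD."""
    s = np.asarray(spectrum, dtype=float)
    return (s - s.mean()) / s.std(ddof=1)

def savgol_second_derivative(y, window, delta=1.0):
    """Second derivative from a local quadratic least-squares fit.

    `window` is the odd number of smoothing points; `delta` is the point spacing.
    """
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be odd and >= 3")
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Columns 1, t, t^2; the pseudo-inverse yields the least-squares coefficients.
    design = np.vander(offsets, 3, increasing=True)
    # The second derivative of c0 + c1*t + c2*t^2 is 2*c2; rescale by the spacing.
    coeffs = np.linalg.pinv(design)[2] * 2.0 / delta**2
    return np.correlate(np.asarray(y, dtype=float), coeffs, mode="valid")

# Check on a parabola: the second derivative of x^2 is 2 everywhere.
x = np.arange(20) * 0.5
d2 = savgol_second_derivative(x**2, window=7, delta=0.5)
z = snv([1.0, 2.0, 3.0, 4.0])
```

Because the derivative is only evaluated where a full window fits, a larger number of smoothing points shortens the usable spectral range at both ends.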
The errors of the calibration (CAL) and test set validation (TSV) were referred to as root mean square error of calibration (RMSEC) and root mean square error of prediction (RMSEP), respectively:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}} \times 100\% \tag{2}$$

where n is the number of samples and y_max − y_min = 200 mmol/L is the calibration range.
The RMSE's magnitude is closely associated with the number of PLS-R factors (or latent variables), which is a crucial parameter for a satisfactorily performing PLS-R model [37,38]. Since the glucose-water system used in this study is rather simple, the number of PLS-R factors employed in the PLS-R models should be kept quite low in order to avoid modeling of noise and thus non-relevant spectral information (overfitting). However, using too few PLS-R factors can lead to poor model performance due to the lack of explained variance in the NIR spectra (underfitting). The optimal number of PLS-R factors was determined by examining the regression coefficients, the loadings and correlation loadings of each PLS-R factor, and the explained variances.
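As a minimal sketch, the two error measures can be computed as follows, assuming the usual definitions: RMSE is the square root of the mean squared residual, and NRMSE is the RMSE divided by the 200 mmol/L calibration range. The function names are illustrative.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(reference, predicted))
    return math.sqrt(squared / len(reference))

def nrmse_percent(rmse_value, y_min=0.0, y_max=200.0):
    """RMSE as a percentage of the calibration range (default 0-200 mmol/L)."""
    return 100.0 * rmse_value / (y_max - y_min)

# RMSEP values quoted in the text and their relative counterparts.
print(round(nrmse_percent(3.2), 2))   # 1.6
print(round(nrmse_percent(22.6), 1))  # 11.3
```

With this normalization, an RMSEP of 3.2 mmol/L corresponds to the 1.6% reported for region W1 with the 1 mm cell, and 22.6 mmol/L corresponds to the "more than 11%" reported for region G1 with the 0.1 mm cell.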

Results and Discussion
The full-range raw NIR spectra of the calibration and test set for all three utilized cell pathlengths are depicted in Figure 2. The artifacts occurring in the region around 5200 cm −1 in the raw NIR spectra of the 1 mm and 2 mm cells in Figure 2 are caused by the complete absorption of NIR light in this spectral region [11].
The pre-treated NIR calibration set spectra of the regions W1, W2, G1 and G2 for all three cell pathlengths are presented in Figure 3. Since each of the three batches per concentration was averaged from nine spectra to one representative spectrum, three spectra per concentration are shown in Figure 3. The glucose-related regions G1 and G2 showed an evident pattern towards increasing glucose concentrations (Figure 3e,f,j-l), while such an obvious pattern was missing at first glance in the NIR spectra of the water-related regions W1 and W2 (Figure 3a-c,g). However, a closer look revealed that there actually was a certain concentration-dependent pattern, although it was not as pronounced as in the regions associated with glucose vibrations. An example of this is shown in Figure 4.
In the course of data analysis, the number of smoothing points for the second derivative in the first overtone and combination region was individually optimized prior to the PLS-R. As a consequence, the exact spectral range used for the PLS-R varied for each cell. Nevertheless, the spectral regions subjected to the calculation of the derivative spectra did not vary between the three cell pathlengths.

Measurements with 0.1 mm Cell Pathlength
The results of the PLS-R calibration and test set validation procedure are presented in Table 2. The 0.1 mm cell allowed the evaluation of the performance of all four investigated regions. Comparing the results for the 0.1 mm cell in Table 2, probably the most noticeable value is the relatively high prediction error of RMSEP = 22.6 mmol/L of the first overtone region G1. This error's magnitude of more than 11% employing three PLS-R factors was hardly surprising, considering the lack of a distinct concentration-dependent pattern in Figure 3d. The reason for this was that a small amount of interference fringes was still present in this spectral region and that there was insufficient spectral information content due to the short pathlength [30]. These remaining fringes were hardly noticeable in the regular absorption spectrum but became evident after the application of the second derivative, despite previous smoothing of the NIR spectra. Considering the PLS-R scores of the calibration set of region G1 in Figure 5d-f, the first PLS-R factor mostly accounted for the changes in glucose concentration. Despite that, the PLS-R calibration model was not able to predict the test set adequately.
The models for the two water-related regions W1 and W2 both showed similar percentage errors of around NRMSEP = 4% for the prediction of unknown samples from the test set, each using two PLS-R factors. A consideration of more than two PLS-R factors for each model would have further reduced the prediction error; however, a closer look at the model statistics gave no justification for the use of a third PLS-R factor. In the case of region W1, the validation model's explained Y-variance (variance in glucose concentration) increased comparably from PLS-R factor 1 to 2 and from PLS-R factor 2 to 3 (see Table 3), and therefore suggested a model based on three PLS-R factors. However, the correlation loadings of PLS-R factor 3 showed very low values with a maximum of 0.2 (see Figure 6a), which led to the exclusion of PLS-R factor 3 from the PLS-R model due to the risk of modeling glucose-unrelated spectral information. For the combination band of water (region W2), two PLS-R factors were considered optimal, as the explained Y-variance in the PLS-R validation model increased by 2.4% from PLS-R factor 1 to 2 (see Table 3) and the correlation loadings indicated many X-variables with strong contributions to the second PLS-R factor (see Figure 6g).
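The factor-by-factor bookkeeping of explained Y-variance can be illustrated with a compact PLS1 NIPALS sketch in NumPy. This is a generic textbook-style implementation on synthetic data, not the Unscrambler routine. The band shapes, the drift level, and the definition of explained Y-variance as 100·(1 − SS_res/SS_tot) on the validation set are assumptions made for illustration.

```python
import numpy as np

def pls1_nipals(X, y, n_factors):
    """Fit PLS1 by NIPALS; return (regression vector, intercept)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # centred X and y residuals
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f                          # weights: covariance of X with y
        w = w / np.linalg.norm(w)
        t = E @ w                            # scores
        tt = t @ t
        p = E.T @ t / tt                     # X loadings
        c = (f @ t) / tt                     # y loading
        E = E - np.outer(t, p)               # deflate X
        f = f - c * t                        # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return b, y_mean - x_mean @ b

def explained_y_variance(X_cal, y_cal, X_val, y_val, max_factors):
    """Validation-set explained Y-variance in % for 1..max_factors factors."""
    X_val, y_val = np.asarray(X_val, dtype=float), np.asarray(y_val, dtype=float)
    ss_tot = np.sum((y_val - y_val.mean()) ** 2)
    out = []
    for a in range(1, max_factors + 1):
        b, b0 = pls1_nipals(X_cal, y_cal, a)
        resid = y_val - (X_val @ b + b0)
        out.append(100.0 * (1.0 - np.sum(resid ** 2) / ss_tot))
    return out

# Synthetic spectra: an analyte band plus a randomly varying sloped baseline.
rng = np.random.default_rng(0)
axis = np.linspace(-1.0, 1.0, 60)
band = np.exp(-(axis / 0.2) ** 2)            # hypothetical analyte profile
baseline = 1.0 + 0.5 * axis                  # hypothetical interferent profile

def simulate(conc):
    conc = np.asarray(conc, dtype=float)
    drift = rng.normal(0.0, 0.02, conc.size)
    return np.outer(conc, 1e-3 * band) + np.outer(drift, baseline)

y_cal = np.repeat([0, 50, 80, 110, 140, 170, 200], 3).astype(float)
y_val = np.repeat([60.2, 130.5, 186.0], 2)
explained = explained_y_variance(simulate(y_cal), y_cal, simulate(y_val), y_val, 2)
print(explained)
```

In this noise-free two-component example the second factor completes the model. With real spectra, a further gain in explained Y-variance still has to be weighed against the loadings and correlation loadings, as done in the text.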
Among all prediction errors of the measurements conducted with the 0.1 mm cell, the glucose-related combination region G2 yielded by far the lowest prediction error. The model required only one PLS-R factor to yield an NRMSEP as low as 0.7% along with an R 2 TSV of 0.9993. The explained Y-variance of the test set already reached 99.9% in the first PLS-R factor (see Table 3), and thus made the use of more PLS-R factors invalid. This remarkable prediction performance of region G2 can be attributed to the very distinct concentration pattern in the second derivative NIR spectra (see Figure 3j), which was not observed in the regular (untreated) spectra. In addition to that, the concentrations in the PLS-R score plot of region G2 in Figure 5c were perfectly separated along PLS-R factor 1. This demonstrated that PLS-R factor 1 exclusively accounted for changes in glucose concentration and thus allowed the exclusion of further PLS-R factors.

Measurements with 1 mm Cell Pathlength
The performance of the three exploitable regions of the 1 mm cell was remarkable. The PLS-R calibration model of the so-called first overtone of water (region W1) predicted the independent test set samples with an error of RMSEP = 3.2 mmol/L and a relative error of NRMSEP = 1.6% (see Table 2). These errors were achieved using the first two PLS-R factors, which together accounted for 99.7% of the Y-variance (see Table 3) and, as a consequence, did not justify the consideration of further PLS-R factors in the model for region W1.
The pre-treated NIR spectra of the glucose-related regions G1 and G2 in Figure 3e,k, respectively, showed the same concentration-dependent pattern from pure water towards increasing glucose content. This clearly evident pattern indicated that glucose in aqueous solution produces its own NIR bands. This finding is in contrast to the view frequently found in the literature [26,39], according to which carbohydrates do not exhibit their own NIR bands in aqueous solutions, but rather characteristically disturb the water structure. At low concentrations, these bands appear only as tiny changes in the course of the untreated spectra, which cannot be recognized by eye, but are revealed and highlighted by calculating derivative spectra. The high predictive power of the two glucose-related regions is best represented by the low errors in the prediction of the independent test set: the PLS-R model for the first overtone region G1 yielded an NRMSEP value of 0.9%, whereas the NRMSEP for the combination region G2 was as low as 0.4% (see Table 2). The fact that only one PLS-R factor was necessary for each PLS-R model to achieve the mentioned prediction errors showed the distinct glucose-related nature of these two regions. The use of only one PLS-R factor for the two glucose-related regions was further supported by the fact that the concentrations in the PLS-R score plots in Figure 7a,b were clearly separated along the first PLS-R factor.
However, these findings allow for reconsidering the statement of Chen et al. [17], according to which a 1 mm cell pathlength is too thin for satisfactory glucose quantification from NIR spectra in the first overtone region in an aqueous matrix. The authors of the aforementioned study did not use derivative spectra. In contrast to Chen et al. [17], the results presented herein rather suggest that a cell pathlength of 1 mm is perfectly suitable. By applying a second derivative function to the first overtone region, a clear concentration dependent pattern becomes evident (see Figure 3e) and thus allows the construction of highly accurate PLS-R models for glucose quantifications. Our study did not investigate the exact cell pathlength at which it becomes too thin for high-quality NIR spectra. Nevertheless, considering the relatively high prediction error of the 0.1 mm cell in region G1, it can be concluded that this limit is below a cell pathlength of 1 mm.

Measurements with 2 mm Cell Pathlength
For the 2 mm cell, the test set validation for the so-called first overtone of water (region W1) yielded an NRMSEP of around 5% utilizing three PLS-R factors (see Table 2). An additional consideration of PLS-R factor 4 would have reduced the relative error by nearly half, but, from the interpretation of the model statistics, it was concluded that this would have led to the modeling of noise or glucose-unrelated spectral information. Although the explained Y-variance of the validation model increased by 2.4% from PLS-R factor 3 to PLS-R factor 4 (see Table 3), the correlation loadings showed negligibly small values for PLS-R factor 4 (see Figure 6c). This suggested that the X-variables modeled in PLS-R factor 4 were not of importance for the regression model and therefore might have contained non-relevant spectral information for the quantification of the target solute.
In direct comparison to the two glucose-related regions, the predictive power of region W1 was inferior. Using a pathlength of 2 mm, the PLS-R calibration models for the regions G1 and G2 predicted the independent test set samples with prediction errors of NRMSEP = 0.8% and NRMSEP = 1.6%, respectively, and both models required only one PLS-R factor (see Table 2). The second derivative spectra in Figure 3f,l showed a clear glucose concentration-dependent pattern towards increasing glucose content, which was also reflected in the PLS-R score plots in Figure 8d,e. However, compared to the derivative spectra of the 1 mm cell in the combination region G2 (Figure 3k), the spectra in Figure 3l appeared noisy to some extent. This noisy pattern could be removed with a higher number of smoothing points in the second derivative, but the test set validation then yielded poorer RMSEP values and required more PLS-R factors. It is conceivable that the high absorption of the adjacent water combination band and the associated spectral artifacts (see Figure 2) had an impact on region G2 and consequently led to the somewhat higher prediction error in this region.

Comparison between Cell Pathlengths
An overall comparison between the predictive power of water- and glucose-based PLS-R models indicated that the glucose-related regions G1 and G2 considerably outperformed the two water bands W1 and W2. The glucose regions yielded far lower prediction errors with NRMSEP values as low as 0.4% along with the utilization of only one PLS-R factor. This emphasizes the dominant presence of glucose-related spectral information in these two NIR regions at around 5900 cm −1 and 4400 cm −1. The only exception with poor predictive power was the 0.1 mm cell in region G1 due to the reasons described earlier. With regard to the pathlength, the 1 mm cell turned out to give the most accurate PLS-R models for both water- and glucose-related regions. This optical pathlength seemed to have a favorable ratio between transmitted and absorbed NIR light for quantitative analyses of aqueous glucose solutions and most probably for carbohydrate solutions in general.
Table 3. Explained Y-variances for each spectral region and utilized cell. The type of validation (calibration or test set validation) and the number of PLS-R factors are specified. The values for the explained variances are given in %.

Conclusions
The good predictive performance of the PLS-R models with the water-related regions W1 and W2 confirmed the well-documented fact that sugars (in this case glucose) in aqueous solutions affect the water bands in the NIR region by disturbing the structure of the hydrogen bond network of liquid water [26,39]. On the other hand, considering the significantly higher predictive power of the PLS-R models based on the regions G1 and G2, it must be concluded that the regions associated with carbohydrate vibrations (i.e., C−H, O−H, C−O) are even better suited for highly accurate quantifications. These vibrations cause rather small bands in the NIR spectrum; thus, derivative functions need to be applied in order to reveal the concentration-dependent patterns.
The validity of the results obtained herein was further confirmed by the fact that multiple batches were used for the construction of the PLS-R calibration models. Moreover, independent test set samples were utilized in the process of PLS-R model validation. Therefore, a certain robustness of the constructed PLS-R models can be assumed.
This study demonstrated the superiority of characteristic glucose bands over the dominant and intense water bands in the NIR spectrum in terms of quantitative predictive power. Relative prediction errors lower than 1% were obtained while only one PLS-R factor was required. Further investigations need to be carried out in order to determine reliable values for the limit of detection (LOD) and the limit of quantification (LOQ) of both the water- and glucose-related regions. However, despite the promising predictive power of the glucose bands, it has to be noted that the sample matrix employed in the present study was rather simple. The establishment of reliable PLS-R models based on NIR data obtained from more complex matrices like body fluids such as blood or urine, or food products like beverages, is undoubtedly much more challenging. Nevertheless, the findings reported herein can support the selection of the most informative NIR regions for investigations of aqueous carbohydrate systems.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: