Article

Automated Spectral Preprocessing via Bayesian Optimization for Chemometric Analysis of Milk Constituents

by
Habeeb Abolaji Babatunde
1,
Owen M. McDougal
2 and
Timothy Andersen
1,*
1
Computer Science, Boise State University, Boise, ID 83725, USA
2
Department of Chemistry and Biochemistry, Boise State University, Boise, ID 83725, USA
*
Author to whom correspondence should be addressed.
Foods 2025, 14(17), 2996; https://doi.org/10.3390/foods14172996
Submission received: 14 July 2025 / Revised: 16 August 2025 / Accepted: 19 August 2025 / Published: 27 August 2025
(This article belongs to the Section Food Quality and Safety)

Abstract

The preprocessing of infrared spectra can significantly improve predictive accuracy for protein, carbohydrate, lipid, or other nutritional components, yet optimal preprocessing selection is typically empirical, tedious, and dataset specific. This study introduces a Bayesian optimization-based framework designed for the automated selection of optimal spectral preprocessing pipelines within a chemometric modeling context. The framework was applied to mid-infrared spectra of milk to predict compositional parameters for fat, protein, lactose, and total solids. A total of 385 averaged spectra corresponding to 193 unique samples were split into training and test sets in a 70/30 ratio using a group-aware Kennard-Stone algorithm, resulting in 269 averaged spectra (135 unique samples) for training and 116 spectra (58 unique samples) for testing. Six regression models, namely Elastic Net, Gradient Boosting Machines (GBM), Partial Least Squares (PLS), RidgeCV Regression, LassoLarsCV, and Support Vector Regression (SVR), were evaluated across three preprocessing conditions: (1) no preprocessing, (2) literature-derived custom preprocessing (e.g., MSC, SNV, and first and second derivatives), and (3) optimized preprocessing via the proposed Bayesian framework. Optimized preprocessing consistently outperformed the other methods, with RidgeCV achieving the best performance for all components except lactose, where PLS slightly outperformed it. Improvements in predictive accuracy, particularly in terms of RMSEP, were observed across all milk components. The best RMSEP results were achieved for protein (RMSEP = 0.054, R² = 0.981) and lactose (RMSEP = 0.026, R² = 0.917), followed by fat (RMSEP = 0.139, R² = 0.926) and total solids (RMSEP = 0.154, R² = 0.960). Literature-based pipelines demonstrated inconsistent effectiveness, highlighting the limitations of transferring preprocessing methods between datasets. The Bayesian optimization approach identified relatively simple yet highly effective preprocessing pipelines, typically involving few steps. By eliminating manual trial and error, this data-driven strategy offers a robust and generalizable solution that streamlines spectral modeling in dairy analysis and can be readily applied to other types of spectroscopic data across various domains.

1. Introduction

Milk is a nutrient-dense biological fluid that contains essential macronutrients, such as fat, protein, and lactose, along with vitamins, minerals, and bioactive compounds that support human growth, health, and disease prevention [1]. Fats contribute to energy provision, flavor, and fat-soluble vitamin transport; proteins such as casein and whey not only supply essential amino acids with high biological value but also exert a range of bioactive functions, including antibacterial, immunomodulatory, antioxidant, antihypertensive, and opioid-like activities, in addition to providing functional properties important in dairy processing [2]. Lactose serves as an energy source and facilitates calcium absorption, while minerals such as calcium, phosphorus, and magnesium are critical for bone development and metabolic functions [2]. The balance of these constituents determines not only the nutritional value of milk but also its technological functionality in the production of a wide range of dairy products [2]. The accurate determination of these components is therefore fundamental for quality control, economic valuation, and optimization of dairy production.
Conventional analysis of milk composition relies on standardized wet-chemical and instrumental reference methods to ensure accuracy and compliance. Protein content is determined using Kjeldahl nitrogen analysis or Dumas combustion; fat is measured through Gerber/Babcock acid digestion or Rose–Gottlieb solvent extraction; lactose is quantified by polarimetry; and minerals are analyzed through ashing followed by atomic absorption spectroscopy (AAS) or inductively coupled plasma (ICP) spectroscopy [3,4,5,6,7]. High-performance liquid chromatography (HPLC) is a versatile technique that can be applied for the quantification of lactose, proteins, fats, vitamins, and other bioactive compounds, offering high sensitivity and specificity across multiple milk components. While these conventional methods deliver high precision, they are often labor intensive, time consuming, and require skilled operators, making them less suitable for real-time process control in modern dairy operations [8].
In recent years, advanced food processing technologies, such as high-pressure processing (HPP), microfiltration, pulsed electric fields (PEFs), UV-C (ultraviolet C) treatment, and high-pressure homogenization, have been developed to enhance microbial safety, extend shelf life, and preserve the nutritional and sensory qualities of milk [9]. HPP inactivates pathogenic and spoilage microorganisms using a hydrostatic pressure of 400–600 MPa without significant heat, maintaining vitamins, flavor, and protein functionality [10]. Microfiltration employs membrane separation to remove bacteria, spores, and somatic cells while retaining desirable components such as proteins and minerals [11]. PEF uses short bursts of high-voltage electric pulses to disrupt microbial cell membranes, achieving pasteurization-like safety with minimal thermal damage [12]. UV-C treatment, operating in the 200–280 nm wavelength range, inactivates microbes by damaging their DNA [13]. It is particularly effective for surface decontamination and thin-film liquid applications, helping preserve the nutritional and sensory qualities of foods while extending shelf life [14]. High-pressure homogenization applies intense shear forces at elevated pressures to reduce fat globule size, improve emulsion stability, and enhance microbial inactivation [15]. Complementing these advances, rapid analytical tools, including vibrational spectroscopies (mid-infrared, near-infrared, and Raman), fluorescence sensors, dielectric/impedance detectors, and biosensors, enable non-destructive, on-site measurement of key components and contaminants within seconds. These analytical systems can function independently or in tandem with processing methods to verify composition, confirm microbial inactivation, detect adulteration, and optimize parameters in real time. Their integration, particularly through inline sensors and IoT-enabled monitoring, closes the loop between processing and quality assurance, ensuring milk safety, consistency, and consumer acceptability.
Mid-infrared (MIR) spectroscopy, when coupled with chemometric modeling, offers a rapid, non-destructive, and cost-effective alternative to improve predictive accuracy and processing efficiency [8,16]. MIR spectroscopy measures the absorption of infrared radiation within the mid-infrared region of the electromagnetic spectrum, spanning the 4000–400 cm⁻¹ spectral range, where molecular vibrations associated with specific functional groups occur [17]. These vibrations correspond to the stretching and bending motions of chemical bonds, making MIR highly effective for identifying and quantifying key milk components. Within this range, various absorption peaks correspond to the vibrational modes of molecular bonds in milk components interacting with infrared radiation [18]. Fats are characterized by absorption bands associated with the stretching vibrations of C-H bonds in fatty acid chains. In particular, peaks at approximately 2922 cm⁻¹ and 2852 cm⁻¹ correspond to the asymmetric and symmetric stretching vibrations of the methylene (CH₂) groups, respectively. Additionally, an absorption peak around 1743 cm⁻¹ is linked to the C=O stretching vibrations of ester carbonyl groups in triglycerides, providing a distinctive marker for lipids in milk [18]. The spectral range between 1700 cm⁻¹ and 1500 cm⁻¹ is characterized by prominent peaks associated with peptide bonds in proteins. Two major bands are the amide I band around 1635 cm⁻¹, attributed to C=O stretching and N-H bending vibrations, and the amide II band near 1548 cm⁻¹, corresponding to N-H bending coupled with C-N stretching. These bands are directly related to the peptide bonds in milk proteins such as casein and whey proteins [19]. Additionally, the region between 1200 and 900 cm⁻¹ contains absorption peaks linked to carbohydrates, particularly lactose. For instance, a peak at approximately 1077 cm⁻¹ is associated with C-O stretching vibrations in lactose [20].
Despite these advantages, MIR spectra are inherently convoluted due to overlapping absorption bands from various constituents. Unlike HPLC, which yields distinct peaks for individual analytes, MIR does not allow the direct deconvolution of each component without additional statistical modeling. As a result, chemometrics plays a crucial role in the deconvolution of MIR spectra and linking them to quantifiable milk components such as fat, protein, lactose, and total solids [8]. A typical chemometric workflow includes three essential steps: spectral preprocessing, wavenumber selection, and predictive model development. Of these, preprocessing is foundational because the raw spectra often contain instrumental noise, baseline drift, scattering effects, and sample inconsistencies that obscure meaningful chemical information [21]. Spectral preprocessing comprises mathematical transformations designed to minimize unwanted variation and enhance relevant features of the spectra. Common methods include baseline correction, scatter correction (e.g., standard normal variate (SNV), multiplicative scatter correction (MSC)), smoothing, normalization, and derivatives (e.g., Savitzky–Golay (SavGol)) [22]. These techniques improve the signal-to-noise ratio and promote consistency across samples, thus enhancing the accuracy and robustness of the resulting chemometric models [21,23]. However, choosing the appropriate preprocessing pipeline remains a significant challenge. Most studies rely on manual selection or predefined methods from previous work, often without evaluating their suitability for the current dataset or target analyte [22,24]. This trial-and-error approach introduces subjectivity and can lead to suboptimal model performance. Several studies have highlighted the limitations of such practices. For example, Zhu et al. [25] demonstrated that transferring preprocessing techniques such as SNV, SavGol, and first- and second-order derivatives developed for fruit ripening [26], portable near-infrared (NIR) devices for milk assessment [27], and meat quality classification [28] to dielectric spectroscopy for milk fat analysis yielded poor calibration performance, with only SNV in combination with least squares support vector machines (LSSVMs) providing marginal gains. Pinto et al. [29] similarly showed that preprocessing effectiveness in MIR-based lactose prediction depended heavily on the selected spectral region and transformation method. Amsaraj et al. [30] applied preprocessing pipelines derived from tea sample analysis to milk adulterant detection with limited success, underscoring the risks of direct method transfer. Inon et al. [31] observed that the application of MSC, originally developed for NIR spectra, failed to improve the prediction accuracy when adapted to FTIR spectra. Collectively, these works reveal the necessity of dataset-specific preprocessing optimization.
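To make these transformations concrete, the sketch below shows minimal NumPy/SciPy implementations of SNV, MSC, and a Savitzky–Golay derivative. It is an illustrative sketch with assumed default parameters, not the code used in this study.

```python
# Minimal sketches (assumed parameter defaults) of the scatter-correction and
# derivative steps named above; not the exact code used in this study.
import numpy as np
from scipy.signal import savgol_filter

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative scatter correction against a (mean) reference spectrum."""
    ref = X.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, deg=1)  # fit row ≈ slope*ref + intercept
        corrected[i] = (row - intercept) / slope
    return corrected

def savgol_derivative(X, window=15, poly=4, deriv=1):
    """Savitzky-Golay smoothing/derivative along the wavenumber axis."""
    return savgol_filter(X, window_length=window, polyorder=poly,
                         deriv=deriv, axis=1)
```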
Bayesian optimization (BO) offers a principled framework for addressing this issue. Unlike grid or random search methods, which either exhaustively or blindly explore the hyperparameter space, BO employs probabilistic models (e.g., Gaussian processes or Tree-structured Parzen Estimators) to guide the search toward promising regions of the solution space [32,33]. This enables efficient and scalable optimization, particularly in high-dimensional or complex domains like spectral preprocessing. In chemometrics, BO has been shown to outperform greedy or uninformed strategies in tasks such as PLS calibration [33], spectral feature selection [34], and MIR-based protein quantification [8].
Despite recent advances in wavenumber selection and model tuning, preprocessing optimization remains underexplored. Notable efforts such as the nippy package by Torniainen et al. [22] introduced the automated comparison of preprocessing strategies but relied on greedy search, which becomes computationally expensive and lacks global exploration capabilities. Moreover, the need for more adaptive and scalable preprocessing optimization has been emphasized in the recent chemometric literature [21,22,24].
Motivated by this gap, we propose a novel framework for automated preprocessing optimization in spectroscopic data analysis. The approach integrates spectroscopy-specific and general machine learning preprocessing techniques and leverages Gaussian Process-based Bayesian Optimization to dynamically identify the most effective pipeline for each predictive task. We apply this method to a mid-infrared (MIR) milk spectroscopy dataset to optimize the prediction of fat, protein, lactose, and total solids. Our results demonstrate that data-driven preprocessing selection within a chemometric modeling framework improves model accuracy and robustness while reducing the reliance on intuition and manual tuning. Although developed for milk analysis, this framework is a generalizable solution applicable to a wide range of infrared spectroscopic datasets in food, pharmaceutical, environmental, and agricultural domains.

2. Materials and Methods

2.1. Spectra Acquisition

MIR spectral data of milk were obtained from Agropur Jerome Cheese, Jerome, ID, USA; the dataset consisted of MIR spectra acquired during routine milk analysis throughout processing. All spectra were collected using a MilkoScan FT1 (Foss North America, Eden Prairie, MN, USA). The MIR dataset included samples from multiple production sources, encompassing various vats, raw tanks (RT), and other operational milk streams that are sampled throughout everyday operations. All spectra were provided in their raw form (absorbance units), without any additional preprocessing applied prior to analysis.

2.2. Dataset Description

The dataset used in this study comprises a total of 1772 spectra and 1193 reference records. After aligning and matching the two sources, 6362 spectral-reference matched samples were obtained. Each spectrum consists of 1060 variables, corresponding to wavenumbers ranging from 4999.99 cm⁻¹ to 925.07 cm⁻¹, with an approximate step size of 4 cm⁻¹. These spectra were paired with the 4 target variables used in this study: fat (%), true protein (%), lactose (%), and total solids (TS, %). Because the spectral data contain multiple replicates (e.g., samples labeled VAT15_1 and VAT15_2, with repeated measurements of the same label, such as VAT15_1 appearing more than once), a data reduction step was necessary. To resolve redundancy and improve consistency, mean spectra were computed for the replicates of each unique sample group, resulting in a final dataset containing 385 averaged spectra from 193 unique samples.
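As an illustration of this reduction step, the following pandas sketch averages replicate scans per unique sample label; the file name and column layout are hypothetical stand-ins for the actual spectral export.

```python
# Illustrative replicate averaging with pandas; the file name and column
# layout are hypothetical, standing in for the actual spectral export.
import pandas as pd

spectra = pd.read_csv("milk_spectra.csv")   # one row per scan (hypothetical)
wavenumber_cols = [c for c in spectra.columns if c != "sample_id"]

# One mean spectrum per unique sample label (e.g., VAT15_1) removes redundancy.
averaged = spectra.groupby("sample_id")[wavenumber_cols].mean().reset_index()
```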

2.3. Data Splitting

To ensure robust model validation, a modified Kennard-Stone algorithm with replicate handling (Kennard-StoneR) was implemented for data partitioning as presented in Algorithm 1. This approach addresses a critical limitation in the vanilla Kennard-Stone method [35], which can inadvertently place replicates from the same sample in both training and test sets, potentially leading to data leakage and overly optimistic model performance estimates. The Kennard-StoneR algorithm maintains the core principle of the original Kennard-Stone method that maximizes the Euclidean distance between selected samples to ensure representative coverage of the feature space while incorporating group-aware selection to prevent replicate splitting. The algorithm proceeds as follows.
Algorithm 1 Modified Kennard-Stone with replicate handling (Kennard-StoneR).
Require: Data matrix X ∈ ℝ^(n×d), labels y ∈ ℝ^n, group identifiers g ∈ ℝ^n, test proportion p
Ensure: Training and test indices
1: Aggregate replicates by computing the centroid of each sample group:
2: for each unique group identifier i do
3:     X̄_i ← mean(X[g = i])
4: end for
5: Determine the number of training groups:
6:     m ← round((1 − p) × number of unique groups)
7: Compute the distance matrix D between all group centroids:
8:     D_ij ← ‖X̄_i − X̄_j‖₂
9: Initialize selected groups S with the pair of groups having maximum distance:
10:     S ← {argmax_(i,j) D_ij}
11: while |S| < m do
12:     For each unselected group u:
13:         Compute minimum distance to any selected group:
14:         d_min(u) ← min{D_su : s ∈ S}
15:     Add the unselected group with maximum minimum distance:
16:         S ← S ∪ {argmax_u d_min(u)}
17: end while
18: Map selected groups to original sample indices:
19:     train_indices ← {j : g[j] ∈ S}
20:     test_indices ← {j : g[j] ∉ S}
21: return X[train_indices], X[test_indices], y[train_indices], y[test_indices]
The algorithm generates a training set that optimally spans the feature space while reserving a representative proportion (p) of samples for independent testing. For our 193 unique milk composition MIR spectra, this technique resulted in a training set comprising 70% of the groups (135 unique samples) and a test set containing the remaining 30% (58 unique samples), with the complete separation of replicates between sets. Group-aware cross-validation was implemented by grouping samples that share the same base identifier, regardless of replicate index. For example, samples labeled VAT15_1 and VAT15_2 were treated as belonging to the same group (VAT15). This approach ensured that all replicates of a given sample were kept together during both training and testing, thereby preventing data leakage. In total, 193 unique groups were identified based on this naming convention. We verified that spectra within each group had similar reference (target) values, confirming their validity as true replicates.
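For readers who prefer code, a minimal NumPy/SciPy sketch of Algorithm 1 is given below. It is a simplified illustration of the group-aware maximin selection, not the exact implementation used in this study.

```python
# Minimal sketch of Algorithm 1 (group-aware Kennard-Stone): replicates are
# collapsed to group centroids, training groups are chosen by maximin
# Euclidean distance, and all replicates of a group end up in the same split.
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone_r(X, groups, test_size=0.3):
    uniq = np.unique(groups)
    centroids = np.vstack([X[groups == g].mean(axis=0) for g in uniq])
    n_train = round((1 - test_size) * len(uniq))

    D = cdist(centroids, centroids)               # pairwise centroid distances
    i, j = np.unravel_index(D.argmax(), D.shape)  # seed with the farthest pair
    selected = [i, j]
    while len(selected) < n_train:
        remaining = [k for k in range(len(uniq)) if k not in selected]
        # pick the group whose nearest selected centroid is farthest away
        d_min = [D[k, selected].min() for k in remaining]
        selected.append(remaining[int(np.argmax(d_min))])

    train_groups = set(uniq[selected])
    in_train = np.isin(groups, list(train_groups))
    return np.where(in_train)[0], np.where(~in_train)[0]
```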

2.4. Automated Pipeline Optimization for Spectral Preprocessing and Modeling

To enhance the robustness, reproducibility, and efficiency of spectral data analysis, we developed a Python-based framework for automated preprocessing pipeline optimization. At its core is the PipelineOptimizer class, which leverages Bayesian optimization [36] to systematically explore and fine-tune combinations of preprocessing techniques and model hyperparameters. This process is designed to yield the most predictive and scientifically valid pipeline tailored to the user’s dataset.
The framework supports a diverse set of preprocessing methods, including both spectroscopy-specific transformations and general purpose machine learning preprocessing from scikit-learn. During optimization, the framework intelligently excludes incompatible combinations based on predefined rules as described subsequently, ensuring that only valid configurations are evaluated. The high-level workflow, encompassing preprocessing configuration, validation, and pipeline optimization, is summarized in Algorithm 2 and Figure 1. A complete version with detailed steps and procedures is provided in Appendix A (Algorithm A1). This algorithm outlines the core logic behind candidate generation, evaluation using cross-validation or test data, and the Bayesian search strategy employed for optimization. This structured and reproducible approach provides a powerful tool for advancing chemometric analysis in both research and applied settings.
Algorithm 2 Automated spectroscopic data pipeline optimization.
1: Input: X_train, y_train, preprocessing steps S, incompatibilities I, allowed lengths L, bounds Θ, n_init, n_iter
2: Optional: X_test, y_test
3: procedure GeneratePipelines(S, I, L)
4:     Generate all valid preprocessing pipelines P subject to incompatibilities
5: end procedure
6: procedure Evaluate(θ)
7:     Decode θ to build pipeline p_θ
8:     if X_test is available then
9:         Fit and evaluate p_θ on the test set
10:     else
11:         Cross-validate p_θ on the training set
12:     end if
13:     return negative RMSE as score
14: end procedure
15: procedure Optimize(Θ, n_init, n_iter)
16:     Use Bayesian optimization to find θ* maximizing Evaluate
17:     Build best pipeline p* from θ*
18:     Fit p* on X_train; evaluate on X_test if available
19:     return p*, θ*
20: end procedure
21: Output: Optimized pipeline p* and parameters θ*

2.4.1. Overview of Pipeline Optimization Strategy

The PipelineOptimizer class supports spectral datasets formatted as NumPy arrays, allowing users to specify training and testing sets, cross-validation strategies, and optional grouping variables. The framework incorporates two group-aware validation strategies: GroupShuffleSplit and LeavePGroupsOut, ensuring the robust evaluation of pipelines in the presence of samples with repeated measurements [37,38].

2.4.2. Preprocessing Configuration Space

Users can provide a custom list of candidate preprocessing steps, which are then filtered for compatibility using a set of predefined rules. The framework supports the following spectroscopy-specific preprocessing methods: SNV, SavGol, MSC, Extended Multiplicative Signal Correction (EMSC), Mean Centering (MeanCN), Detrending, AsymmetricLeastSquareBaselineCorrection, Localized SNV (LSNV), and Robust Normal Variate (RNV) [39,40,41]. Additionally, the framework seamlessly integrates general purpose preprocessing methods from scikit-learn: Standard Scaling, Robust Scaling, Global Scaling, MinMaxScaler, Normalization, QuantileTransformer, Principal Component Analysis (PCA), Locally Linear Embedding (LLE), fast Independent Component Analysis (fast-ICA), kernel-PCA, and PowerTransformer. All valid preprocessing pipelines comprising up to a user-defined maximum number of steps are enumerated in advance, allowing optimization to occur over this discrete configuration space. Users can further constrain the search by specifying the allowed pipeline lengths. For example, setting the maximum pipeline length to 2 allows for either single preprocessing steps or combinations of two compatible methods.

2.4.3. Bayesian Optimization

Pipeline optimization is performed using Bayesian optimization with an Expected Improvement (EI) acquisition function via the Bayesian optimization Python library to autonomously identify optimal preprocessing pipelines, thereby eliminating the traditionally labor-intensive process of manual tuning in spectroscopic analysis [36]. Bayesian optimization is a probabilistic model-based approach that efficiently locates the extrema of objective functions with minimal evaluations, making it particularly well-suited for complex optimization tasks [36]. The optimization process begins with n_init random initial configurations, followed by n_iter intelligently selected configurations guided by the Bayesian model's posterior distribution (with both n_init and n_iter defined by the user). The objective function dynamically constructs and evaluates pipelines based on a sampled index into the list of possible preprocessing configurations. Each pipeline is appended with a Ridge regression estimator, whose regularization strength (θ) is also optimized.
During each evaluation, the selected pipeline is fitted and validated using the configured cross-validation strategy. The objective function returns the negative Root Mean Squared Error (RMSE), penalizing unstable or ill-conditioned configurations (e.g., those leading to LinAlgError). Logging is integrated throughout the optimization process to track evaluated configurations, metric values (RMSE, R²), and potential numerical issues.
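A simplified sketch of how such an objective can be wired to the bayesian-optimization package is shown below. The toy data, pipeline list, and bounds are assumptions; the full framework adds group-aware cross-validation, compatibility filtering, and logging on top of this pattern.

```python
# Simplified sketch (assumptions: toy data, toy pipeline list, illustrative
# bounds) of driving pipeline selection with the bayesian-optimization package.
import numpy as np
from bayes_opt import BayesianOptimization
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(120, 60)), rng.normal(size=120)
X_test, y_test = rng.normal(size=(40, 60)), rng.normal(size=40)

# Factories so each evaluation builds fresh (unfitted) preprocessing steps.
pipeline_recipes = [
    lambda: [("scaler", StandardScaler())],
    lambda: [("pca", PCA(n_components=10))],
    lambda: [("scaler", StandardScaler()), ("pca", PCA(n_components=10))],
]

def objective(pipeline_config, log_alpha):
    idx = int(np.clip(round(pipeline_config), 0, len(pipeline_recipes) - 1))
    model = Pipeline(pipeline_recipes[idx]() +
                     [("ridge", Ridge(alpha=10.0 ** log_alpha))])
    try:
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    except np.linalg.LinAlgError:
        return -1e6                      # penalize ill-conditioned configurations
    return -rmse                         # BO maximizes, so negate RMSE

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"pipeline_config": (0, len(pipeline_recipes) - 1),
             "log_alpha": (-6, 6)},      # theta searched on a log scale
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=20)  # the study used n_init=50, n_iter=200
print(optimizer.max)
```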

2.4.4. Cross-Validation Methods

To ensure robust and realistic evaluation of preprocessing pipelines, we implemented group-aware cross-validation strategies suitable for spectroscopic data and chemometric modeling. These methods are particularly suited for spectral datasets, where measurements may be recorded as replicates. The framework allows users to provide an optional group parameter, specifying the group to which each sample belongs. If no group information is supplied, each sample is treated as independent, and traditional sample-level validation is performed. Two primary group-based cross-validation techniques are supported:
  • Group-Shuffle-Split: This method randomly divides groups of samples into training and validation sets while ensuring that all samples within a group are assigned to the same split. This technique helps mitigate data leakage and preserves the natural structure of the data, which is important for spectral datasets prone to replicate effects.
  • Leave-P-Groups-Out: This exhaustive method iteratively leaves out P groups as a validation set, training on the remaining groups. It offers a more stringent assessment of generalization to unseen groups, though at a higher computational cost.
In addition to these cross-validation strategies, we enhanced the evaluation function with conditional logic to leverage external test data when available. Specifically, we added a mechanism to check whether both X_test and y_test are present. If test data is provided, the evaluation proceeds as follows:
  • The pipeline is fit on the training data.
  • Predictions are made on the external test data.
  • Performance metrics: Root Mean Squared Error (RMSE) and coefficient of determination ( R 2 ) are computed on the test set.
  • The negative RMSE is returned as the optimization score for compatibility with minimization-based search frameworks.
If external test data is not available, or if an error occurs during test-based evaluation, the function defaults to the original group-based cross-validation strategy using either Group-Shuffle-Split or Leave-P-Groups-Out.
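The snippet below illustrates, on assumed toy data, how these two group-aware splitters can be used with scikit-learn's cross_val_score; the group labels mirror the replicate naming described earlier.

```python
# Illustrative group-aware evaluation (toy data assumed); group labels mirror
# the replicate naming used in this study (all replicates share a base ID).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupShuffleSplit, LeavePGroupsOut, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 30)), rng.normal(size=60)
groups = np.repeat([f"VAT{i}" for i in range(20)], 3)  # 3 replicates per sample

# Randomized group-level splits: all replicates of a group stay together.
gss = GroupShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
scores = cross_val_score(Ridge(), X, y, groups=groups, cv=gss,
                         scoring="neg_root_mean_squared_error")
print("GroupShuffleSplit mean negative RMSE:", scores.mean())

# Exhaustive alternative: hold out every combination of P groups (costlier).
lpgo = LeavePGroupsOut(n_groups=2)
```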

2.4.5. Compatibility Rules for Preprocessing Pipelines

The framework is designed to support a wide range of preprocessing techniques, drawing from both domain-specific spectroscopic methods and general purpose machine learning transformations available through scikit-learn. While this flexibility enables the construction of diverse and powerful pipelines, it also introduces the risk of combining methods that are theoretically redundant, semantically incompatible, or computationally conflicting.
To address this, we implemented a set of incompatibility rules that automatically prevent mutually exclusive or conceptually redundant methods from being used together. These rules are defined based on both functional similarity and insights from prior spectroscopic and chemometric literature [22].
For example, the following groups of preprocessing steps are treated as mutually incompatible:
  • Scatter Correction Methods: SNV, MSC, EMSC, LSNV, RNV are all methods that correct for scatter effects in spectral data. Applying more than one of these techniques can lead to overcorrection or unintended distortions.
  • Scaling and Normalization Methods: Methods such as scaler, autoscale, globalscale, normalization, robust-scaler, minmax-scaler, power-transformer, quantile-transformer, and row-standardizer all perform some form of scaling or normalization. Using multiple scaling approaches in the same pipeline may introduce redundancy and instability.
  • Method-Specific Incompatibilities: Specific combinations such as SNV with row-standardizer, or autoscale with scaler, are excluded due to their overlapping functionalities.
  • Dimensionality Reduction Methods: Techniques such as PCA, fast-ICA, kernel-PCA, and LLE aim to reduce data dimensionality and are typically not applied together, as they each represent distinct reduction philosophies.
These constraints are enforced internally through a predefined list of incompatibility sets. When a user supplies a list of candidate preprocessing techniques, some of which may be mutually incompatible, the framework ensures that such combinations are automatically excluded from consideration during pipeline optimization. Instead of raising an error, the system filters out any configurations that violate the defined compatibility rules, thereby streamlining the search space and maintaining both computational efficiency and methodological validity. This ensures that only scientifically coherent and practically feasible pipelines are explored during the optimization process, in line with established best practices in chemometric data preprocessing [22].
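A minimal sketch of this rule-based enumeration is shown below; the step names and incompatibility sets are abbreviated versions of the groups listed above.

```python
# Sketch of rule-based pipeline enumeration: all combinations up to the allowed
# lengths are generated, and any combination containing two methods from the
# same incompatibility set is dropped (abbreviated sets shown).
from itertools import combinations

steps = ["snv", "msc", "savgol", "detrend", "scaler", "minmax-scaler", "pca"]
incompatibility_sets = [
    {"snv", "msc", "emsc", "lsnv", "rnv"},                          # scatter correction
    {"scaler", "minmax-scaler", "robust-scaler", "normalization"},  # scaling
    {"pca", "fast-ica", "kernel-pca", "lle"},                       # dimension reduction
]

def is_valid(pipeline):
    return all(len(set(pipeline) & rule) <= 1 for rule in incompatibility_sets)

allowed_lengths = [1, 2]
pipelines = [p for l in allowed_lengths
             for p in combinations(steps, l) if is_valid(p)]
print(len(pipelines))
```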
By integrating these methodological advances, the proposed framework represents a significant improvement over traditional approaches to spectroscopic data preprocessing, enabling more systematic, objective, and reproducible preprocessing pipeline optimization for chemometric applications.

2.5. Regression Analysis

To model the relationship between spectral features and the target variable(s), we employed six regression algorithms provided by the scikit-learn library [42]. These included Elastic Net, Partial Least Squares (PLS), Support Vector Regression (SVR), LassoLarsCV, RidgeCV, and Gradient Boosting Machines (GBMs).
The regression models are briefly described below:
  • Elastic Net regression combines both L1 (Lasso) and L2 (Ridge) regularization penalties. It is particularly effective for datasets with multicollinearity and for performing variable selection.
  • Partial Least Squares (PLS) regression projects both predictors and response variables to a latent space, maximizing their covariance. It is especially suitable for spectral data due to its ability to handle high-dimensional and collinear variables.
  • Support Vector Regression (SVR) models non-linear relationships by transforming data into a higher-dimensional space using kernel functions. It aims to fit the best hyperplane within a tolerance margin.
  • LassoLarsCV uses the lasso and the Least Angle Regression (LARS) algorithms with built-in cross-validation to select the optimal amount of L1 regularization. It encourages sparsity and aids in automatic feature selection.
  • RidgeCV applies L2 regularization and selects the best regularization parameter using cross-validation. It is robust against multicollinearity and can stabilize coefficient estimates.
  • Gradient Boosting Machines (GBMs) is a powerful ensemble method that builds a sequence of weak learners, typically decision trees, to minimize prediction error. It incrementally fits residuals from previous models to improve overall performance.
Each model was trained on the optimally preprocessed data and evaluated on test data. The hyperparameters of each model were tuned using Bayesian optimization to maximize predictive performance. To prevent data leakage, all steps including hyperparameter optimization were strictly confined to the training data.
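For reference, the six model families can be instantiated in scikit-learn roughly as follows; the hyperparameter values shown are placeholders, since the actual values were tuned via Bayesian optimization (Section 2.6).

```python
# The six regression families evaluated, instantiated with scikit-learn.
# Hyperparameter values shown are placeholders; in the study they were tuned
# by Bayesian optimization on the training data only (see Section 2.6).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, LassoLarsCV, RidgeCV
from sklearn.svm import SVR

models = {
    "elastic_net": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "pls": PLSRegression(n_components=10),
    "svr": SVR(kernel="rbf", C=10.0, epsilon=0.01),
    "lassolarscv": LassoLarsCV(cv=5),
    "ridgecv": RidgeCV(alphas=np.logspace(-6, 6, 50)),
    "gbm": GradientBoostingRegressor(n_estimators=300, learning_rate=0.05),
}
```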

2.6. Hyperparameter Tuning Strategy

We adopted a two-stage optimization framework that decouples preprocessing pipeline optimization from final model hyperparameter tuning to balance computational efficiency and modeling flexibility.
During the first stage, preprocessing pipelines were optimized using Bayesian optimization with cross-validation, where each pipeline configuration was evaluated using a RidgeCV regression model. RidgeCV was selected as the estimator during this stage due to its single hyperparameter and efficient internal cross-validation. This allowed the framework to explore a wide variety of preprocessing configurations without the added computational burden of simultaneously tuning complex model architectures. The θ values were searched over a log-spaced range from 10⁻⁶ to 10⁶. The values of n_init and n_iter were set to 50 and 200, respectively.
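One way to realize this stage-one estimator in scikit-learn is sketched below; only the 10⁻⁶ to 10⁶ range comes from the text, and the grid density is an assumption.

```python
# One way to realize the stage-one estimator: RidgeCV with a log-spaced
# regularization grid from 1e-6 to 1e6; the 50-point density is an assumption.
import numpy as np
from sklearn.linear_model import RidgeCV

stage_one_estimator = RidgeCV(alphas=np.logspace(-6, 6, 50))
```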
Table 1 summarizes the hyperparameter search space for a few of the preprocessing components explored during optimization.
In the second stage, the best performing preprocessing pipelines were fixed and used to evaluate multiple regression models. These included Partial Least Squares (PLS), Elastic Net, RidgeCV, LassoLarsCV, Support Vector Regression (SVR), and Gradient Boosting Machines (GBMs). Each model except RidgeCV and LassoLarsCV underwent hyperparameter tuning using Bayesian optimization on the training data, with the search spaces listed in Table 2. The values of n_init and n_iter were set to 5 and 100, respectively, for all models except PLS, where they were set to 5 and 10.
This two-stage procedure provides a clear separation between preprocessing pipeline discovery and model learning, enabling flexible experimentation while keeping overall search complexity tractable. Moreover, models like Elastic Net and LassoLarsCV inherently perform feature selection by assigning zero weights to less informative variables, offering indirect insights into variable importance.

2.7. Statistical Analysis

To evaluate whether the optimized preprocessing pipeline statistically outperformed baseline methods, we conducted hypothesis testing on fold-level RMSE values from 5-fold GroupShuffleSplit cross-validation across models and milk components. We restricted the analysis to the PLS and RidgeCV models, as they yielded the best overall results. This led to a total of 24 pairwise comparisons (2 models × 4 components × 3 comparisons).
For each comparison, we tested the following hypotheses:
H₀: μ_optimized − μ_baseline = 0 (no difference in mean RMSE)
H₁: μ_optimized − μ_baseline ≠ 0 (significant difference in mean RMSE)
The choice of statistical test was based on the normality of the paired RMSE differences, assessed using the Shapiro–Wilk test (p > 0.05). If normality held, we used a paired t-test; otherwise, the Wilcoxon signed-rank test was applied [43,44].
To control the family-wise error rate from multiple comparisons, we applied the Bonferroni correction to the resulting p-values [45]. Cohen’s d was computed to quantify the effect size and direction of each comparison, with negative values indicating better performance (lower RMSE) by the optimized pipeline. The absolute magnitude of d follows conventional benchmarks: values greater than 0.8 denote a large effect size, and values exceeding 1.3 are considered very large [46]. We also reported 95% confidence intervals for the mean RMSE differences. All tests were two-sided, with a corrected significance threshold of α = 0.05 .
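The sketch below shows one way to implement this testing procedure with SciPy on a single pair of fold-level RMSE vectors; the numbers are toy values, not results from the study.

```python
# Toy sketch of the per-comparison test: Shapiro-Wilk for normality, paired
# t-test or Wilcoxon signed-rank as fallback, Bonferroni correction, Cohen's d.
# RMSE values below are illustrative, not results from the study.
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

rmse_optimized = np.array([0.054, 0.056, 0.052, 0.055, 0.053])  # 5 CV folds
rmse_baseline = np.array([0.063, 0.066, 0.061, 0.064, 0.062])

diff = rmse_optimized - rmse_baseline
if shapiro(diff).pvalue > 0.05:        # differences consistent with normality
    _, p = ttest_rel(rmse_optimized, rmse_baseline)
else:                                  # nonparametric fallback
    _, p = wilcoxon(rmse_optimized, rmse_baseline)

p_corrected = min(1.0, p * 24)         # Bonferroni over the 24 comparisons
cohens_d = diff.mean() / diff.std(ddof=1)  # negative favors the optimized pipeline
print(p_corrected, cohens_d)
```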
Boxplots showing RMSE distributions for each preprocessing method and milk component were generated to visually support the statistical findings, with asterisks indicating significance levels (* p < 0.05, ** p < 0.01 after Bonferroni correction).

3. Results and Discussion

3.1. Dataset Statistics and Distribution

Table 3 provides summary statistics for the four milk components analyzed: fat, protein, lactose, and total solids. Among these, total solids show the highest mean concentration (16.15%), followed by fat (5.57%), protein (4.76%), and lactose (4.56%). Protein and lactose display narrow standard deviations (0.51 and 0.12, respectively), indicating relatively consistent composition across samples, whereas fat and total solids show more variability. The lower bounds for fat and total solids (3.07% and 12.11%, respectively) also suggest possible sample dilution or formulation effects.
Figure 2 provides a visual overview of the relationships and distribution patterns among the analyzed milk components (fat, protein, lactose, and TS). The correlation matrix (top left) reveals a strong positive correlation between fat and total solids (r = 0.88), indicating that higher fat content tends to be associated with higher total solids, an expected trend in milk composition. A moderate negative correlation is observed between protein and lactose (r = −0.58), suggesting that as protein levels increase, lactose concentrations may slightly decrease. Additional correlations include a moderate positive relationship between protein and total solids (r = 0.76) and a weak positive correlation between fat and protein (r = 0.38).
The boxplots (top right) confirm that protein and lactose concentrations are relatively uniform across samples, whereas fat and total solids show broader variability which may reflect processing practices or targeted composition adjustment in the sample set.
Lastly, the density plots (bottom) show that most components are negatively skewed, particularly protein and total solids, suggesting the majority of samples cluster near upper concentration ranges. Elevated kurtosis values (e.g., 3.86 for protein and 4.37 for lactose) indicate peaked distributions with a few low-value outliers. These trends point toward controlled or processed milk samples rather than fresh raw milk, which is often subject to greater component variation.

3.2. Spectral Preprocessing

To mitigate the risk of overfitting and enhance model robustness, we explored spectral preprocessing pipelines with a restricted number of steps, using our automated optimization framework based on Bayesian optimization. Two configurations were considered by setting the allowed preprocessing pipeline length to either [1, 2] (single preprocessing steps or combinations of two compatible methods) or [1, 2, 3] (single preprocessing steps or combinations of two or three compatible methods).
The optimization was performed solely on the training set, leveraging group-aware cross-validation using the Group-Shuffle-Split method to respect sample dependencies. We employed Bayesian optimization with n_init = 50 and n_iter = 200. All preprocessing steps were drawn from the following set: MSC, SavGol, detrend, scaler, SNV, robust_scaler, EMSC, PCA, normalization, autoscale, globalscaler, and meancn.
The optimized preprocessing pipelines for each milk component under the [1, 2] and [1, 2, 3] configurations are summarized in Table 4. Interestingly, the optimal pipelines were identical across both configurations, suggesting that a simpler preprocessing structure was sufficient for our data.
The consistency of results across both experimental configurations supports the robustness of the identified preprocessing schemes.
Figure 3 and  Figure 4 provide a visual comparison of the raw spectral data and the effects of various preprocessing strategies.
Figure 3 presents the raw MIR spectra of 135 milk calibration samples, which exhibit high overall alignment and minimal baseline or scatter artifacts. The spectra are smooth and consistent across the 3000–1000 cm⁻¹ range, with major absorbance bands clearly preserved. A localized region of high-frequency variation is visible between approximately 1750 cm⁻¹ and 1600 cm⁻¹, likely reflecting chemical variability or instrument-related noise in that spectral window.
In Figure 4, the top row displays preprocessing pipelines optimized via Bayesian optimization and tailored for individual milk components, while the bottom row includes commonly reported literature methods such as SNV, MSC, and their combinations with derivatives. Literature-based techniques (e.g., SNV and MSC) effectively smooth the spectra and suppress global variation, resulting in visually cleaner profiles. However, this visual uniformity can come at the cost of reducing predictive information, particularly if relevant spectral variability is filtered out.
Conversely, the optimized pipelines introduced sharper variations, especially in regions such as 1450–1250 cm⁻¹ and 2250–1750 cm⁻¹, due to the application of scalers and derivatives. While these transformations may appear noisier, they are selected based on their ability to enhance model-relevant features rather than aesthetic smoothness. This contrast reinforces the core philosophy behind the Bayesian optimization approach: preprocessing should be optimized for predictive performance, not visual clarity.

3.3. Regression Analysis and the Importance of Optimized Preprocessing

To assess the impact of preprocessing techniques on predicting milk component concentrations on the test set, we conducted regression analyses under three distinct scenarios: without preprocessing, with optimized preprocessing obtained via Bayesian optimization, and using the custom preprocessing techniques previously reported in the literature (MSC, SNV, first derivative, and second derivative).
From Table 5, without preprocessing (baseline scenario), predictive models yielded reasonably accurate results on the test set. For example, fat prediction achieved an RMSEP of 0.159 (PLS regression, R² = 0.903), protein showed high prediction accuracy with an RMSEP of 0.063 (LassoLarsCV, R² = 0.974), lactose presented a moderate predictive accuracy (RMSEP = 0.027, PLS regression, R² = 0.912), and total solids predictions demonstrated robust accuracy (RMSEP = 0.158, PLS regression, R² = 0.958).
Applying optimized preprocessing improved model performance for protein and lactose predictions. Specifically, protein prediction RMSEP decreased to 0.054 (RidgeCV regression, R² = 0.981), enhancing predictive accuracy compared to the baseline scenario. Similarly, lactose prediction benefited from preprocessing optimization, achieving a lower RMSEP of 0.026 (PLS regression, R² = 0.917). Total solids and fat predictions also showed moderate improvements, with the best total solids prediction yielding an RMSEP of 0.154 (RidgeCV regression, R² = 0.960) and fat prediction reaching an RMSEP of 0.139 (RidgeCV regression, R² = 0.926).
Notably, across all three scenarios, support vector regression (SVR) consistently underperformed on the test set, despite often achieving strong performance on the training data. This discrepancy highlights the risk of overfitting when using highly flexible models on relatively limited datasets. The use of group-aware cross-validation during model development proved effective in providing a more realistic assessment of model generalization ability, particularly where reserving a separate internal validation set was not practical.
These findings are visually confirmed in Figure 5, which displays predicted versus true plots for three representative models (PLS, RidgeCV, and LassoLarsCV) on the test set using optimized preprocessing obtained through Bayesian optimization. The best-performing models (highlighted in red) align closely with the identity line, especially for protein and total solids. Results for the remaining models, i.e., SVR, GBM, and ElasticNet, are included in Appendix C (Figure A1). The full regression performance metrics on both the training and test sets for each model and preprocessing strategy are provided in Appendix B (Table A1, Table A2 and Table A3).
From Table 6, using custom preprocessing methods commonly reported in the literature, we identified several studies that applied spectroscopy techniques to milk datasets. Zhu et al. [25] and Wu et al. [47] both reported SNV as the optimal preprocessing technique for their respective datasets. Wu et al. specifically employed short-wave NIR spectroscopy in the 800–1050 nm range to analyze the primary compounds in milk powder. Similarly, Amsaraj et al. [30] and Bonfatti et al. [48] identified a combination of SNV and first-derivative Savitzky–Golay (SavGol) filtering as their optimal preprocessing approach. Bonfatti et al. [48] specified SavGol parameters as a window length of 15, derivative order of 1, and polynomial order of 4. As Amsaraj et al. [30] did not report their SavGol parameters, we adopted the same values for consistency.
Although the literature has suggested that MSC and SNV are generally effective in improving model performance, our results contradict this assumption. On our dataset, these methods produced inferior outcomes compared to both unprocessed data and the results achieved through our optimized preprocessing pipeline. Similarly, the use of first and second derivatives, often recommended for enhancing predictive power, offered only marginal benefits over no preprocessing in certain cases. The best fat prediction using custom preprocessing (SNV + 1st Der SavGol, PLS, R² = 0.896), as used by Amsaraj et al. [30] and Bonfatti et al. [48], was lower than both no preprocessing (PLS, R² = 0.903) and our method (RidgeCV, R² = 0.926). For protein, the top custom result (MSC, RidgeCV, R² = 0.972) from Inon et al. [31] also underperformed compared to no preprocessing (LassoLarsCV, R² = 0.974) and our method (RidgeCV, R² = 0.981). Lactose prediction using SNV + 1st Der SavGol (LassoLarsCV, R² = 0.846) similarly lagged behind no preprocessing (PLS, R² = 0.912) and our approach (PLS, R² = 0.917). For total solids, MSC with LassoLarsCV (R² = 0.950) was outperformed by both no preprocessing (PLS, R² = 0.958) and our method (RidgeCV, R² = 0.960).
These comparative findings underscore the critical importance of dataset-specific preprocessing optimization. Adopting preprocessing methods from unrelated or even closely related prior studies without validation can negatively affect prediction accuracy. Thus, optimized preprocessing tailored explicitly to individual datasets and prediction targets remains an essential step for achieving maximum accuracy and reliability in milk component prediction models.

3.4. Statistical Comparison of Preprocessing Methods

Statistical testing confirmed that the optimized pipeline significantly outperformed baseline preprocessing methods for most milk components, as presented in Table 7. Before Bonferroni correction, almost all pairwise comparisons were statistically significant at the α = 0.05 level; this included all comparisons under RidgeCV and 8 of 12 under PLS. The normality of RMSE differences was assessed using the Shapiro–Wilk test for all pairwise comparisons. A paired t-test was used in all cases except one under PLS, where normality was violated; in that instance, the nonparametric Wilcoxon signed-rank test was applied.
After applying the Bonferroni correction, eight of nine RidgeCV comparisons remained significant, with the exception of total solids. In particular, three comparisons exhibited strong significance at the p < 0.01 level: fat (optimized vs. SNV + SG, p = 0.0047), true protein (optimized vs. MSC, p = 0.0023), and true protein (optimized vs. SNV, p = 0.0062). In contrast, only one PLS comparison remained significant after correction, despite the strong trends observed prior to correction.
All RidgeCV comparisons produced large to extremely large effect sizes (Cohen's |d| ranging from 2.3 to 7.4), supporting the practical relevance of the optimized pipeline. PLS comparisons also consistently showed large effect sizes despite losing corrected significance.
Boxplots (Figure 6, Figure 7, Figure 8 and Figure A2) further illustrate these results, showing consistent reductions in RMSE and variability for the optimized pipeline across components. Even for total solids, where Bonferroni-corrected significance was not observed, the optimized pipeline exhibited a visibly lower RMSE distribution than all other methods.
This study highlights the pivotal role of spectral preprocessing in improving the accuracy of milk component predictions. Using a Bayesian optimization-based framework, we identified preprocessing pipelines that consistently outperformed both no preprocessing and alternative algorithms reported in the literature, especially for protein and lactose. For fat and total solids, which have stronger IR spectral signatures, we observed more modest gains, suggesting that these analytes may require only simpler corrections.
A major insight is the data- and component-specific nature of preprocessing. Optimal pipelines vary between components, confirming that a universal approach is inadequate. This aligns with previous work in spectroscopy, such as Vestergaard et al. [49], which found that no single preprocessing strategy was the best across analytes. Our findings further show that commonly used methods (e.g., MSC and SNV) underperform when applied without dataset-specific tuning, reinforcing the need for empirical evaluation.
The Bayesian optimization approach offers a significant advantage by automating preprocessing selection, reducing reliance on trial and error. This method efficiently explores the pipeline space, often identifying simple but highly effective two-step combinations that enhance both interpretability and generalizability. Moreover, the transparent and reproducible nature of this approach makes it suitable for broader spectroscopic applications beyond milk, including food quality control and authenticity testing.

4. Conclusions

We present a Bayesian optimization-based framework for selecting spectral preprocessing pipelines, demonstrating its utility in accurately determining the amounts of fat, protein, lactose, and total solids from MIR spectra of milk. By comparing optimized preprocessing to both unprocessed data and methods commonly reported in the literature, the Bayesian optimization approach consistently delivered improved regression accuracy, especially for protein and lactose. These findings underscore the benefit of data-driven preprocessing over heuristic or borrowed approaches.
A key strength of this framework lies in its ability to identify simple yet effective pipelines, typically involving just two to three steps, eliminating the need for exhaustive manual tuning. This makes it especially valuable when seeking high performance without sacrificing interpretability or implementation simplicity. The immediate applicability of the method to MIR milk spectra suggests that even small accuracy gains can impact quality control and formulation in industry settings.
Looking ahead, while the current analysis focused primarily on spectral preprocessing optimization, future work may involve incorporating wavenumber selection techniques as an additional step to further enhance predictive performance. Expanding the framework to support broader spectral domains and integrating it with model hyperparameter tuning could also promote the development of fully automated modeling pipelines. In summary, this approach provides a scalable framework for spectral data preparation, supporting robust and generalizable modeling across diverse chemometric applications.

Author Contributions

Conceptualization, H.A.B., O.M.M. and T.A.; methodology, H.A.B., O.M.M. and T.A.; software, H.A.B.; validation, H.A.B.; formal analysis, H.A.B.; investigation, H.A.B.; resources, H.A.B., O.M.M. and T.A.; data curation, H.A.B., O.M.M. and T.A.; writing—original draft preparation, H.A.B.; writing—review and editing, H.A.B., O.M.M. and T.A.; visualization, H.A.B.; supervision, O.M.M. and T.A.; project administration, O.M.M. and T.A.; funding acquisition, O.M.M. and T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by BUILD Dairy and Daisy Brand, a Product Research grant from the National Dairy Council/Dairy Management Inc., and the National Science Foundation Convergence Accelerator Track J Award #2345069.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The milk spectral dataset used in this study was obtained from Agropur and is available upon reasonable request from the authors. The Bayesian optimization framework is open-source and can be installed via pip or conda. The source code and documentation are available at https://github.com/habeeb3579/Spectoprep (accessed on 4 May 2025).

Acknowledgments

We acknowledge Agropur for providing the MIR spectra that were used to develop, test, and validate SpectroPrep. We also recognize Rose Saxton for providing material support for this project.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Algorithm A1 Automated spectroscopic data pipeline optimization.
1: Input: Training data X_train ∈ ℝ^(n×p), y_train ∈ ℝ^n
2: Input: Optional test data X_test ∈ ℝ^(m×p), y_test ∈ ℝ^m
3: Input: Preprocessing steps S = {s_1, …, s_k}, incompatibility sets I = {I_1, …, I_l}
4: Input: Maximum pipeline length L_max, allowed lengths L ⊆ {1, …, L_max}
5: Input: Parameter bounds Θ = {θ_1, …, θ_d}
6: Input: Bayesian optimization parameters: initial points n_init, iterations n_iter
7: procedure GeneratePipelineConfigurations(S, I, L)
8:     P ← ∅
9:     for l ∈ L do
10:         C_l ← all combinations of l elements from S
11:         for pipeline p ∈ C_l do
12:             valid ← True
13:             for I_j ∈ I do
14:                 if |p ∩ I_j| > 1 then
15:                     valid ← False
16:                     break
17:                 end if
18:             end for
19:             if valid then
20:                 P ← P ∪ {p}
21:             end if
22:         end for
23:     end for
24:     return P
25: end procedure
26: P ← GeneratePipelineConfigurations(S, I, L)
27: N_p ← |P|    ▹ Number of valid pipelines
28: procedure BuildPipeline(i, θ)
29:     p ← P[i]
30:     pipeline ← empty Pipeline
31:     for s_j ∈ p do
32:         preprocessor_j ← BuildPreprocessor(s_j, θ)
33:         Add (s_j, preprocessor_j) to pipeline
34:     end for
35:     Add ("ridge", Ridge(α = θ[ridge_alpha])) to pipeline
36:     return pipeline
37: end procedure
38: procedure EvaluateObjective(θ)
39:     i ← ⌊θ[pipeline_config] + 0.5⌋
40:     i ← max(0, min(i, N_p − 1))
41:     pipeline ← BuildPipeline(i, θ)
42:     if X_test ≠ None and y_test ≠ None then
43:         Fit pipeline on X_train, y_train
44:         ŷ_test ← pipeline.predict(X_test)
45:         RMSE_test ← √((1/m) Σ_{j=1}^{m} (y_test,j − ŷ_test,j)²)
46:         R²_test ← 1 − Σ_{j=1}^{m} (y_test,j − ŷ_test,j)² / Σ_{j=1}^{m} (y_test,j − ȳ_test)²
47:         score ← −RMSE_test
48:     else
49:         Initialize cross-validation splitter CV
50:         predictions ← [], actuals ← []
51:         for (train_idx, val_idx) in CV.split(X_train, y_train, groups) do
52:             X_train^fold ← X_train[train_idx]
53:             y_train^fold ← y_train[train_idx]
54:             X_val^fold ← X_train[val_idx]
55:             y_val^fold ← y_train[val_idx]
56:             Fit pipeline on X_train^fold, y_train^fold
57:             ŷ_val^fold ← pipeline.predict(X_val^fold)
58:             Append ŷ_val^fold to predictions
59:             Append y_val^fold to actuals
60:         end for
61:         if predictions is empty then
62:             return −10⁶    ▹ Penalty for failure
63:         end if
64:         RMSE_val ← √((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²)
65:         score ← −RMSE_val
66:     end if
67:     return score
68: end procedure
69: procedure BayesianOptimize(EvaluateObjective, Θ, n_init, n_iter)
70:     Initialize Bayesian optimizer with bounds Θ
71:     Evaluate n_init random points
72:     for i = 1 to n_iter do
73:         Select next point θ^(i)
74:         Evaluate EvaluateObjective(θ^(i))
75:         Update surrogate model
76:     end for
77:     θ* ← best parameters found
78:     i* ← ⌊θ*[pipeline_config] + 0.5⌋
79:     p* ← P[i*]
80:     best_pipeline ← BuildPipeline(i*, θ*)
81:     Fit best_pipeline on full X_train, y_train
82:     return best_pipeline, θ*
83: end procedure
84: best_pipeline, best_params ← BayesianOptimize(EvaluateObjective, Θ, n_init, n_iter)
85: if X_test ≠ None then
86:     ŷ_test ← best_pipeline.predict(X_test)
87:     RMSE_test ← √((1/m) Σ_{i=1}^{m} (y_test,i − ŷ_test,i)²)
88:     R²_test ← 1 − Σ_{i=1}^{m} (y_test,i − ŷ_test,i)² / Σ_{i=1}^{m} (y_test,i − ȳ_test)²
89: end if
90: Output: Best pipeline, best parameters, performance metrics

Appendix B

Table A1. Full regression performance on both training and test sets for each milk component after Bayesian optimization-based preprocessing. Metrics include R² and RMSE for each model evaluated.

Component          Model         R²train   RMSEtrain     R²P       RMSEP
Fat                lassolarscv   0.9883    0.0826        0.9200    0.1445
                   ridgecv       0.9937    0.0733        0.9255    0.1394
                   gbm           1.0000    1.48 × 10⁻⁸   0.8950    0.1655
                   pls           0.9939    0.0712        0.9209    0.1437
                   svr           0.9999    0.0011        −0.1742   0.5554
                   elastic_net   0.9999    0.0024        0.8823    0.1753
Protein            lassolarscv   0.9966    0.0314        0.9807    0.0545
                   ridgecv       0.9974    0.0277        0.9812    0.0539
                   gbm           1.0000    1.45 × 10⁻⁸   0.9483    0.0894
                   pls           0.9893    0.0574        0.9728    0.0650
                   svr           0.9994    0.0137        0.4304    0.2967
                   elastic_net   0.9993    0.0015        0.9742    0.0631
Lactose            lassolarscv   0.9514    0.0273        0.9097    0.0275
                   ridgecv       0.9492    0.0279        0.9018    0.0286
                   gbm           1.0000    1.48 × 10⁻⁸   0.7717    0.0437
                   pls           0.9567    0.0257        0.9173    0.0263
                   svr           0.9926    0.0011        0.7026    0.0499
                   elastic_net   0.9972    0.0065        0.8173    0.0391
Total Solids (TS)  lassolarscv   0.9899    0.1274        0.9560    0.1615
                   ridgecv       0.9906    0.1233        0.9602    0.1536
                   gbm           1.0000    1.47 × 10⁻⁸   0.9344    0.1973
                   pls           0.9952    0.0879        0.9536    0.1659
                   svr           0.9999    0.0119        −0.0041   0.7772
                   elastic_net   0.9999    0.0021        0.9077    0.2340
Table A2. Full regression performance on both training and test sets for each milk component using no preprocessing.

Component | Model | R²train | RMSEtrain | R²P | RMSEP
Fat | lassolarscv | 0.9687 | 0.0161 | 0.8622 | 0.1897
Fat | ridgecv | 0.9817 | 0.0123 | 0.8745 | 0.1810
Fat | gbm | 1.0000 | 0.0000 | 0.8792 | 0.1776
Fat | pls | 0.9846 | 0.0113 | 0.9033 | 0.1589
Fat | svr | 0.9999 | 0.0107 | 0.4262 | 0.3869
Fat | elastic_net | 0.9897 | 0.0093 | 0.8984 | 0.1628
Protein | lassolarscv | 0.9978 | 0.0255 | 0.9743 | 0.0600
Protein | ridgecv | 0.9981 | 0.0236 | 0.9720 | 0.0657
Protein | gbm | 1.0000 | 0.0000 | 0.9483 | 0.0894
Protein | pls | 0.9893 | 0.0557 | 0.9727 | 0.0650
Protein | svr | 0.9994 | 0.0135 | 0.8354 | 0.1595
Protein | elastic_net | 0.9913 | 0.0503 | 0.9601 | 0.0786
Lactose | lassolarscv | 0.9494 | 0.0278 | 0.7985 | 0.0410
Lactose | ridgecv | 0.9894 | 0.0127 | 0.8233 | 0.0384
Lactose | gbm | 1.0000 | 0.0000 | 0.8028 | 0.0451
Lactose | pls | 0.9498 | 0.0277 | 0.9121 | 0.0271
Lactose | svr | 0.9915 | 0.0114 | 0.3185 | 0.0755
Lactose | elastic_net | 0.9562 | 0.0259 | 0.7620 | 0.0446
Total Solids (TS) | lassolarscv | 0.9869 | 0.0145 | 0.9555 | 0.1625
Total Solids (TS) | ridgecv | 0.9885 | 0.0136 | 0.9537 | 0.1667
Total Solids (TS) | gbm | 1.0000 | 0.0000 | 0.9319 | 0.2010
Total Solids (TS) | pls | 0.9910 | 0.0120 | 0.9578 | 0.1582
Total Solids (TS) | svr | 0.9999 | 0.0118 | 0.8875 | 0.2584
Total Solids (TS) | elastic_net | 0.9934 | 0.0103 | 0.9493 | 0.1733
Table A3. Full regression performance on both training and test sets for each milk component using SNV preprocessing.

Component | Model | R²train | RMSEtrain | R²P | RMSEP
Fat | lassolarscv | 0.9766 | 0.1393 | 0.8589 | 0.1919
Fat | ridgecv | 0.9784 | 0.1339 | 0.8472 | 0.1997
Fat | gbm | 1.0000 | 1.46 × 10⁻⁸ | 0.7247 | 0.2680
Fat | pls | 0.9794 | 0.1305 | 0.8825 | 0.1751
Fat | svr | 0.9999 | 0.0011 | −0.1742 | 0.5536
Fat | elastic_net | 0.9930 | 0.0760 | 0.8605 | 0.1908
Protein | lassolarscv | 0.9597 | 0.1083 | 0.9565 | 0.0820
Protein | ridgecv | 0.9960 | 0.0341 | 0.9598 | 0.0789
Protein | gbm | 1.0000 | 1.48 × 10⁻⁸ | 0.8915 | 0.1295
Protein | pls | 0.9757 | 0.0842 | 0.9490 | 0.0888
Protein | svr | 0.9994 | 0.0137 | −0.2033 | 0.4312
Protein | elastic_net | 0.9929 | 0.0452 | 0.9597 | 0.0789
Lactose | lassolarscv | 0.6510 | 0.7309 | 0.4230 | 0.0694
Lactose | ridgecv | 0.9886 | 0.0132 | 0.8039 | 0.0404
Lactose | gbm | 1.0000 | 1.47 × 10⁻⁸ | 0.3484 | 0.0738
Lactose | pls | 0.8741 | 0.4389 | 0.6422 | 0.0547
Lactose | svr | 0.9916 | 0.0114 | 0.2172 | 0.0809
Lactose | elastic_net | 0.9764 | 0.0190 | 0.8343 | 0.0372
Total Solids (TS) | lassolarscv | 0.8836 | 0.4332 | 0.8491 | 0.2992
Total Solids (TS) | ridgecv | 0.9875 | 0.1418 | 0.9167 | 0.2222
Total Solids (TS) | gbm | 1.0000 | 1.48 × 10⁻⁸ | 0.8831 | 0.2634
Total Solids (TS) | pls | 0.9796 | 0.1820 | 0.9167 | 0.2230
Total Solids (TS) | svr | 0.9999 | 0.0119 | −0.0040 | 0.7772
Total Solids (TS) | elastic_net | 0.9994 | 0.0976 | 0.8988 | 0.2451

Appendix C

Figure A1. Predicted vs. true value scatter plots for each milk component using optimized preprocessing obtained via Bayesian optimization. This figure presents results for additional models: SVR, GBM, and ElasticNet.
Table A4. Complete statistical test results comparing the optimized pipeline with baseline preprocessing methods across all components and models. Includes effect sizes, confidence intervals, and both uncorrected and Bonferroni-corrected significance.

Model | Component | Comparison | Test | Statistic | P | Mean Diff | Cohen's D | CI Lower | CI Upper | Sig. (P) | P (Corr.) | Sig. (Corr.)
pls | Fat | MSC Only | paired t-test | −4.30 | 0.013 | −0.031 | −2.61 | −0.051 | −0.011 | Yes | 0.304 | No
pls | Fat | SNV + SG | paired t-test | −6.60 | 0.003 | −0.047 | −3.58 | −0.066 | −0.027 | Yes | 0.065 | No
pls | Fat | SNV | paired t-test | −3.74 | 0.020 | −0.032 | −2.27 | −0.056 | −0.008 | Yes | 0.483 | No
pls | Lactose | MSC Only | wilcoxon | 0.00 | 0.063 | −0.016 | −2.06 | −0.024 | −0.008 | No | 1.000 | No
pls | Lactose | SNV + SG | paired t-test | −2.13 | 0.100 | −0.005 | −0.70 | −0.012 | 0.002 | No | 1.000 | No
pls | Lactose | SNV | paired t-test | −7.16 | 0.002 | −0.028 | −3.01 | −0.039 | −0.017 | Yes | 0.048 | Yes
pls | TS | MSC Only | paired t-test | −1.67 | 0.171 | −0.017 | −0.91 | −0.045 | 0.011 | No | 1.000 | No
pls | TS | SNV + SG | paired t-test | −4.11 | 0.015 | −0.055 | −3.22 | −0.092 | −0.018 | Yes | 0.353 | No
pls | TS | SNV | paired t-test | −4.34 | 0.012 | −0.091 | −2.85 | −0.150 | −0.033 | Yes | 0.293 | No
pls | True Protein | MSC Only | paired t-test | −6.15 | 0.004 | −0.021 | −1.23 | −0.031 | −0.012 | Yes | 0.085 | No
pls | True Protein | SNV + SG | paired t-test | −2.76 | 0.051 | −0.037 | −2.02 | −0.073 | 0.000 | No | 1.000 | No
pls | True Protein | SNV | paired t-test | −4.90 | 0.008 | −0.023 | −1.18 | −0.035 | −0.010 | Yes | 0.193 | No
ridgecv | Fat | MSC Only | paired t-test | −9.38 | 0.001 | −0.077 | −3.43 | −0.099 | −0.054 | Yes | 0.017 | Yes
ridgecv | Fat | SNV + SG | paired t-test | −13.09 | 0.000 | −0.091 | −7.43 | −0.111 | −0.072 | Yes | 0.005 | Yes
ridgecv | Fat | SNV | paired t-test | −8.61 | 0.001 | −0.079 | −3.52 | −0.105 | −0.054 | Yes | 0.024 | Yes
ridgecv | Lactose | MSC Only | paired t-test | −7.33 | 0.002 | −0.036 | −3.38 | −0.049 | −0.022 | Yes | 0.044 | Yes
ridgecv | Lactose | SNV + SG | paired t-test | −8.49 | 0.001 | −0.029 | −4.47 | −0.038 | −0.019 | Yes | 0.025 | Yes
ridgecv | Lactose | SNV | paired t-test | −8.38 | 0.001 | −0.036 | −4.99 | −0.048 | −0.024 | Yes | 0.027 | Yes
ridgecv | TS | MSC Only | paired t-test | −4.47 | 0.011 | −0.149 | −3.37 | −0.241 | −0.056 | Yes | 0.266 | No
ridgecv | TS | SNV + SG | paired t-test | −4.26 | 0.013 | −0.149 | −2.91 | −0.247 | −0.052 | Yes | 0.314 | No
ridgecv | TS | SNV | paired t-test | −5.60 | 0.005 | −0.195 | −3.57 | −0.291 | −0.098 | Yes | 0.120 | No
ridgecv | True Protein | MSC Only | paired t-test | −15.71 | 0.000 | −0.062 | −2.33 | −0.073 | −0.051 | Yes | 0.002 | Yes
ridgecv | True Protein | SNV + SG | paired t-test | −6.48 | 0.003 | −0.076 | −4.04 | −0.109 | −0.043 | Yes | 0.070 | No
ridgecv | True Protein | SNV | paired t-test | −12.20 | 0.000 | −0.064 | −2.41 | −0.079 | −0.050 | Yes | 0.006 | Yes
Figure A2. RMSE distribution across preprocessing methods for the total solids (TS %) component. The optimized pipeline shows the most consistent and lowest RMSE values (significance Bonferroni-corrected).

References

  1. Cimmino, F.; Catapano, A.; Petrella, L.; Villano, I.; Tudisco, R.; Cavaliere, G. Role of milk micronutrients in human health. Front. Biosci.-Landmark 2023, 28, 41. [Google Scholar] [CrossRef] [PubMed]
  2. Pereira, P.C. Milk nutritional composition and its role in human health. Nutrition 2014, 30, 619–627. [Google Scholar] [CrossRef] [PubMed]
  3. Etheridge, R.; Pesti, G.; Foster, E. A comparison of nitrogen values obtained utilizing the Kjeldahl nitrogen and Dumas combustion methodologies (Leco CNS 2000) on samples typical of an animal nutrition analytical laboratory. Anim. Feed Sci. Technol. 1998, 73, 21–28. [Google Scholar] [CrossRef]
  4. Kleyn, D.H.; Lynch, J.M.; Barbano, D.M.; Bloom, M.J.; Mitchell, M.W. Determination of fat in raw and processed milks by the Gerber method: Collaborative study. J. AOAC Int. 2001, 84, 1499–1508. [Google Scholar] [CrossRef]
  5. Stefanov, I.; Vlaeminck, B.; Fievez, V. A novel procedure for routine milk fat extraction based on dichloromethane. J. Food Compos. Anal. 2010, 23, 852–855. [Google Scholar] [CrossRef]
  6. Caprita, R.; Caprita, A.; Cretescu, I. Determination of lactose concentration in milk serum by refractometry and polarimetry. Sci. Pap. Anim. Sci. Biotechnol. 2014, 47, 158. [Google Scholar]
  7. Poitevin, E. Official methods for the determination of minerals and trace elements in infant formula and milk products: A Review. J. AOAC Int. 2016, 99, 42–52. [Google Scholar] [CrossRef]
  8. Babatunde, H.A.; Collins, J.; Lukman, R.; Saxton, R.; Andersen, T.; McDougal, O.M. SVR chemometrics to quantify β-lactoglobulin and α-lactalbumin in milk using MIR. Foods 2024, 13, 166. [Google Scholar] [CrossRef]
  9. Pegu, K.; Arya, S.S. Non-thermal processing of milk: Principles, mechanisms and effect on milk components. J. Agric. Food Res. 2023, 14, 100730. [Google Scholar] [CrossRef]
  10. Stratakos, A.C.; Inguglia, E.S.; Linton, M.; Tollerton, J.; Murphy, L.; Corcionivoschi, N.; Koidis, A.; Tiwari, B.K. Effect of high pressure processing on the safety, shelf life and quality of raw milk. Innov. Food Sci. Emerg. Technol. 2019, 52, 325–333. [Google Scholar] [CrossRef]
  11. Verruck, S.; Sartor, S.; Marenda, F.B.; da Silva Barros, E.L.; Camelo-Silva, C.; Canella, M.M.; Prudencio, E.S. Influence of heat treatment and microfiltration on the milk proteins properties. Adv. Food Technol. Nutr. Sci. 2019, 5, 54–66. [Google Scholar] [CrossRef]
  12. Cavalcanti, R.N.; Balthazar, C.F.; Margalho, L.P.; Freitas, M.Q.; Sant’Ana, A.S.; Cruz, A.G. Pulsed electric field-based technology for microbial inactivation in milk and dairy products. Curr. Opin. Food Sci. 2023, 54, 101087. [Google Scholar] [CrossRef]
  13. Delorme, M.M.; Guimarães, J.T.; Coutinho, N.M.; Balthazar, C.F.; Rocha, R.S.; Silva, R.; Margalho, L.P.; Pimentel, T.C.; Silva, M.C.; Freitas, M.Q.; et al. Ultraviolet radiation: An interesting technology to preserve quality and safety of milk and dairy foods. Trends Food Sci. Technol. 2020, 102, 146–154. [Google Scholar] [CrossRef]
  14. Atik, A.; Gumus, T. The effect of different doses of UV-C treatment on microbiological quality of bovine milk. LWT 2021, 136, 110322. [Google Scholar] [CrossRef]
  15. Qi, P.X.; Ren, D.; Xiao, Y.; Tomasula, P.M. Effect of homogenization and pasteurization on the structure and stability of whey protein in milk. J. Dairy Sci. 2015, 98, 2884–2897. [Google Scholar] [CrossRef]
  16. Zappi, A.; Marassi, V.; Giordani, S.; Kassouf, N.; Roda, B.; Zattoni, A.; Reschiglian, P.; Melucci, D. Extracting information and enhancing the quality of separation data: A review on chemometrics-assisted analysis of volatile, soluble and colloidal samples. Chemosensors 2023, 11, 45. [Google Scholar] [CrossRef]
  17. Saxton, R.; McDougal, O.M. Whey protein powder analysis by mid-infrared spectroscopy. Foods 2021, 10, 1033. [Google Scholar] [CrossRef]
  18. Mohamed, H.; Nagy, P.; Agbaba, J.; Kamal-Eldin, A. Use of near and mid infra-red spectroscopy for analysis of protein, fat, lactose and total solids in raw cow and camel milk. Food Chem. 2021, 334, 127436. [Google Scholar] [CrossRef]
  19. Etzion, Y.; Linker, R.; Cogan, U.; Shmulevich, I. Determination of protein concentration in raw milk by mid-infrared Fourier transform infrared/attenuated total reflectance spectroscopy. J. Dairy Sci. 2004, 87, 2779–2788. [Google Scholar] [CrossRef] [PubMed]
  20. De Luca, M.; Ioele, G.; Spatari, C.; Caruso, L.; Galasso, M.P.; Ragno, G. Evaluation of human breastmilk adulteration by combining Fourier transform infrared spectroscopy and partial least square modeling. Food Sci. Nutr. 2019, 7, 2194–2201. [Google Scholar] [CrossRef] [PubMed]
  21. Rinnan, Å.; Van Den Berg, F.; Engelsen, S.B. Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends Anal. Chem. 2009, 28, 1201–1222. [Google Scholar] [CrossRef]
  22. Torniainen, J.; Afara, I.O.; Prakash, M.; Sarin, J.K.; Stenroth, L.; Töyräs, J. Open-source python module for automated preprocessing of near infrared spectroscopic data. Anal. Chim. Acta 2020, 1108, 1–9. [Google Scholar] [CrossRef] [PubMed]
  23. Schoot, M.; Kapper, C.; van Kollenburg, G.H.; Postma, G.J.; van Kessel, G.; Buydens, L.M.; Jansen, J.J. Investigating the need for preprocessing of near-infrared spectroscopic data as a function of sample size. Chemom. Intell. Lab. Syst. 2020, 204, 104105. [Google Scholar] [CrossRef]
  24. Engel, J.; Gerretzen, J.; Szymańska, E.; Jansen, J.J.; Downey, G.; Blanchet, L.; Buydens, L.M. Breaking with trends in pre-processing? TrAC Trends Anal. Chem. 2013, 50, 96–106. [Google Scholar] [CrossRef]
  25. Zhu, X.; Guo, W.; Liu, D.; Kang, F. Determining the fat concentration of fresh raw cow milk using dielectric spectroscopy combined with chemometrics. Food Anal. Methods 2018, 11, 1528–1537. [Google Scholar] [CrossRef]
  26. Guo, W.; Fang, L.; Liu, D.; Wang, Z. Determination of soluble solids content and firmness of pears during ripening by using dielectric spectroscopy. Comput. Electron. Agric. 2015, 117, 226–233. [Google Scholar] [CrossRef]
  27. Feng, X.D.; Su, R.; Xu, N.; Wang, X.H.; Yu, A.M.; Zhang, H.Q.; Cao, Y.B. Portable analyzer for rapid analysis of total protein, fat and lactose contents in raw milk measured by non-dispersive short-wave near-infrared spectrometry. Chem. Res. Chin. Univ. 2013, 29, 15–19. [Google Scholar] [CrossRef]
  28. Li, X.; Feng, F.; Gao, R.; Wang, L.; Qian, Y.; Li, C.; Zhou, G. Application of near infrared reflectance (NIR) spectroscopy to identify potential PSE meat. J. Sci. Food Agric. 2016, 96, 3148–3156. [Google Scholar] [CrossRef]
  29. Pinto, P.; Anconi, A.; de Abreu, L.; Magalhães, E.; Nunes, C. Strategies to determine lactose in cow milk by mid infrared spectroscopy. J. Food Compos. Anal. 2021, 104, 104176. [Google Scholar] [CrossRef]
  30. Amsaraj, R.; Ambade, N.; Mutturi, S. Variable selection coupled to PLS2, ANN and SVM for simultaneous detection of multiple adulterants in milk using spectral data. Int. Dairy J. 2021, 123, 105172. [Google Scholar] [CrossRef]
  31. Inón, F.; Garrigues, S.; de la Guardia, M. Nutritional parameters of commercially available milk samples by FTIR and chemometric techniques. Anal. Chim. Acta 2004, 513, 401–412. [Google Scholar] [CrossRef]
  32. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Neural Information Processing Systems; Neural Information Processing Systems Foundation: San Diego, CA, USA, 2011; Volume 24. [Google Scholar]
  33. Khater, O.; Khater, A.; Al-Nasr, A.S.; Abozyd, S.; Mortada, B.; Sabry, Y.M. Advancing near-infrared spectroscopy: A synergistic approach through Bayesian optimization and model stacking. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 318, 124492. [Google Scholar] [CrossRef] [PubMed]
  34. Mehdizadeh, S.A.; Noshad, M.; Hojjati, M. A modified sequential wavenumber selection-discriminant analysis with Bayesian optimization strategy for detection and identification of chia seed oil adulteration using Raman spectroscopy. Talanta 2024, 277, 126439. [Google Scholar] [CrossRef] [PubMed]
  35. Kennard, R.W.; Stone, L.A. Computer Aided Design of Experiments. Technometrics 1969, 11, 137–148. [Google Scholar] [CrossRef]
  36. Nogueira, F. Bayesian Optimization: Open Source Constrained Global Optimization Tool for Python. 2014. Available online: https://github.com/bayesian-optimization/BayesianOptimization (accessed on 4 May 2025).
  37. Scikit-Learn Developers. GroupShuffleSplit: Scikit-Learn Documentation. Scikit-Learn. 2023. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html (accessed on 11 April 2025).
  38. Scikit-Learn Developers. LeavePGroupsOut: Scikit-Learn Documentation. Scikit-Learn. 2007. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePGroupsOut.html (accessed on 11 April 2025).
  39. Barnes, R.; Dhanoa, M.S.; Lister, S.J. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 1989, 43, 772–777. [Google Scholar] [CrossRef]
  40. Martens, H.; Jensen, S.; Geladi, P. Multivariate linearity transformation for near-infrared reflectance spectrometry. In Proceedings of the Nordic Symposium on Applied Statistics, Stavanger, Norway, 12–14 June 1983; Stokkand Forlag Publishers: Stavanger, Norway, 1983; pp. 205–234. [Google Scholar]
  41. Guo, Q.; Wu, W.; Massart, D. The robust normal variate transform for pattern recognition with near-infrared data. Anal. Chim. Acta 1999, 382, 87–103. [Google Scholar] [CrossRef]
  42. Scikit-Learn Developers. Supervised Learning: Scikit-Learn Documentation. Scikit-Learn. 2007. Available online: https://scikit-learn.org/stable/supervised_learning.html (accessed on 11 April 2025).
  43. SciPy Developers. SciPy: Scientific Library for Python. SciPy. 2023. Available online: https://docs.scipy.org/doc/scipy/ (accessed on 11 April 2025).
  44. Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611. [Google Scholar] [CrossRef]
  45. Armstrong, R.A. When to use the Bonferroni correction. Ophthalmic Physiol. Opt. 2014, 34, 502–508. [Google Scholar] [CrossRef]
  46. Sullivan, G.M.; Feinn, R. Using effect size—or why the P value is not enough. J. Grad. Med. Educ. 2012, 4, 279–282. [Google Scholar] [CrossRef]
  47. Wu, D.; He, Y.; Feng, S. Short-wave near-infrared spectroscopy analysis of major compounds in milk powder and wavelength assignment. Anal. Chim. Acta 2008, 610, 232–242. [Google Scholar] [CrossRef]
  48. Bonfatti, V.; Di Martino, G.; Carnier, P. Effectiveness of mid-infrared spectroscopy for the prediction of detailed protein composition and contents of protein genetic variants of individual milk of Simmental cows. J. Dairy Sci. 2011, 94, 5776–5785. [Google Scholar] [CrossRef]
  49. Vestergaard, R.J.; Vasava, H.B.; Aspinall, D.; Chen, S.; Gillespie, A.; Adamchuk, V.; Biswas, A. Evaluation of optimized preprocessing and modeling algorithms for prediction of soil properties using vis-nir spectroscopy. Sensors 2021, 21, 6745. [Google Scholar] [CrossRef]
Figure 1. Automated spectroscopic data pipeline optimization framework. The workflow integrates preprocessing, evaluation, and Bayesian optimization to discover optimal preprocessing sequences. Phase 1 generates valid pipeline configurations based on compatibility constraints. Phase 2 evaluates pipelines through cross-validation or direct test set assessment. Phase 3 applies Bayesian optimization to fine-tune hyperparameters, ultimately yielding preprocessing pipelines that maximize model performance for spectroscopic data analysis.
Figure 2. Distribution of milk components with their statistical properties.
Figure 3. Raw MIR spectra of 135 calibration milk samples in the 3000–1000 cm−1 region.
Figure 4. Comparison of spectral preprocessing pipelines. Top: automated, data-driven pipelines identified via Bayesian optimization. Bottom: commonly used literature-based methods (e.g., SNV, MSC, and derivatives) [25,30,31,47,48].
Figure 5. Predicted vs. true value scatter plots for each milk component using optimized preprocessing obtained via Bayesian optimization. This figure includes the results for the top-performing models (PLS, RidgeCV, and LassoLarsCV). Models with the highest R² for each component are outlined in red.
Figure 6. RMSE distribution across preprocessing methods for the fat (%) component. The optimized pipeline achieved the lowest median RMSE. Statistical significance is annotated using * p < 0.05 and ** p < 0.01 (Bonferroni-corrected).
Figure 7. RMSE distribution across preprocessing methods for the true protein (%) component. Significant differences favor the optimized pipeline. Significance levels are marked as ** p < 0.01 (Bonferroni-corrected).
Figure 8. RMSE distribution across preprocessing methods for the lactose (%) component. The optimized pipeline outperforms baselines with statistically significant differences; * p < 0.05 (Bonferroni-corrected).
Table 1. Hyperparameter search space for preprocessing components used during optimization.

Preprocessing | Parameter | Search Space | Description
RobustScaler | quantile_range | [5, 80], [10, 90], [25, 75] | Interquantile scaling range
RobustScaler | with_centering | True, False | Whether to center data before scaling
RobustScaler | with_scaling | True, False | Whether to scale data
RobustScaler | unit_variance | True, False | Whether to scale to unit variance
StandardScaler | with_mean | True, False | Centering option
StandardScaler | with_std | True, False | Scaling option
SavitzkyGolay | window_length | 5, 7, 11, 15 | Smoothing window size
SavitzkyGolay | polyorder | 2, 3, 4 | Polynomial order
MultiplicativeScatterCorrection | reference | mean, median | Reference spectrum type
Detrend | method | simple, polynomial, spline | Detrending method
Detrend | order | integer | Order of the polynomial or spline fit
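Several of the preprocessors in Table 1, Savitzky-Golay filtering and MSC in particular, have no built-in scikit-learn transformer, so chaining them in a Pipeline requires small wrappers. The sketch below shows one plausible way to do this; the class names and MSC regression details are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler

class SavitzkyGolay(BaseEstimator, TransformerMixin):
    """Savitzky-Golay smoothing/derivative along the wavenumber axis."""
    def __init__(self, window_length=11, polyorder=3, deriv=0):
        self.window_length = window_length
        self.polyorder = polyorder
        self.deriv = deriv
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return savgol_filter(np.asarray(X), self.window_length,
                             self.polyorder, deriv=self.deriv, axis=1)

class MSC(BaseEstimator, TransformerMixin):
    """Multiplicative scatter correction against a training-set reference."""
    def __init__(self, reference="mean"):
        self.reference = reference
    def fit(self, X, y=None):
        agg = np.mean if self.reference == "mean" else np.median
        self.ref_ = agg(np.asarray(X), axis=0)   # reference spectrum
        return self
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        out = np.empty_like(X)
        for k, row in enumerate(X):
            # Regress each spectrum on the reference, then correct.
            slope, intercept = np.polyfit(self.ref_, row, 1)
            out[k] = (row - intercept) / slope
        return out

# Built-in scalers are parameterized directly from Table 1, e.g.:
scaler = RobustScaler(quantile_range=(5, 80), with_centering=True,
                      with_scaling=True, unit_variance=False)
```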
Table 2. Hyperparameter search space for regression models used in the final evaluation.

Model | Parameter | Search Space | Description
PLS | n_components | 2 to min(10, n_features) | Number of latent variables
SVR | C | 0.1–100 | Regularization strength
SVR | epsilon | 0.01–1.0 | Epsilon-tube width
SVR | gamma | 10⁻⁴–1.0 (log-scale) | Kernel coefficient for RBF
RidgeCV | alpha | 10⁻⁶–10⁶ (log-scale) | Regularization values
LassoLarsCV | cv | 5 (fixed) | Number of CV folds
Elastic Net | l1_ratio | 0.1–0.9 | Balance of L1 vs. L2 penalty
Elastic Net | alpha | 10⁻⁴–1.0 | Regularization strength
GBM | n_estimators | 50–300 | Number of boosting rounds
GBM | learning_rate | 0.01–0.3 | Step size shrinkage
GBM | max_depth | 3–10 | Maximum tree depth
GBM | min_samples_split | 2–10 | Minimum samples to split node
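Because Bayesian optimization operates over continuous bounds, integer-valued parameters in Table 2 (e.g., n_components, max_depth) are typically rounded inside the objective, and log-scaled parameters are searched in exponent space. A hypothetical encoding of the Table 2 bounds, with illustrative names, might look like this:

```python
# Hypothetical encoding of Table 2 bounds for a continuous-space optimizer.
MODEL_BOUNDS = {
    "pls":         {"n_components": (2, 10)},      # rounded to int in use
    "svr":         {"C": (0.1, 100.0),
                    "epsilon": (0.01, 1.0),
                    "log10_gamma": (-4.0, 0.0)},   # gamma = 10**value
    "ridgecv":     {"log10_alpha": (-6.0, 6.0)},   # alpha = 10**value
    "elastic_net": {"l1_ratio": (0.1, 0.9),
                    "alpha": (1e-4, 1.0)},
    "gbm":         {"n_estimators": (50, 300),
                    "learning_rate": (0.01, 0.3),
                    "max_depth": (3, 10),
                    "min_samples_split": (2, 10)},
}

def decode_svr(params):
    """Map optimizer coordinates back to SVR keyword arguments."""
    return {"C": params["C"],
            "epsilon": params["epsilon"],
            "gamma": 10 ** params["log10_gamma"]}
```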
Table 3. Descriptive statistics for milk components: fat, protein, lactose, and total solids.

Metric | Fat | Protein | Lactose | Total Solids
Count | 385 | 385 | 385 | 385
Mean | 5.5679 | 4.7644 | 4.5607 | 16.1490
Std | 0.8178 | 0.5065 | 0.1152 | 1.1442
Min | 3.0700 | 3.1900 | 4.0900 | 12.1100
25% | 5.2167 | 4.7400 | 4.5100 | 16.0800
50% | 5.4750 | 4.7900 | 4.5500 | 16.3560
75% | 6.2625 | 5.0225 | 4.5900 | 16.8500
Max | 6.7750 | 6.5960 | 4.9517 | 17.5600
Table 4. Optimal preprocessing pipelines identified through automated Bayesian optimization using allowed combinations [1, 2] and [1, 2, 3].

Component | Preprocessing Step(s) | Parameters
Fat | savgol → scaler | deriv_order = 1, filter_win = 13, poly_order = 5, with_mean = True, with_std = True
Lactose | scaler → detrend | with_mean = True, with_std = True, method = "polynomial", order = 2, dspline = 100
Protein | robust_scaler | quantile_range = [5, 80], with_scaling = True, unit_variance = False
TS | savgol → scaler | deriv_order = 1, filter_win = 13, poly_order = 5, with_mean = True, with_std = True
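As an illustration, the fat and total-solids pipeline from Table 4 (first-derivative Savitzky-Golay, window 13, polynomial order 5, followed by standard scaling) can be assembled in a few lines. RidgeCV is attached here as the downstream model per Table 5, and the alpha grid is an assumption based on the RidgeCV search space in Table 2.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import RidgeCV

# First-derivative Savitzky-Golay, then standardization, then RidgeCV.
fat_pipeline = Pipeline([
    ("savgol", FunctionTransformer(
        lambda X: savgol_filter(X, window_length=13, polyorder=5,
                                deriv=1, axis=1))),
    ("scaler", StandardScaler(with_mean=True, with_std=True)),
    ("ridge", RidgeCV(alphas=np.logspace(-6, 6, 25))),
])

# Usage (X_train/X_test are spectra matrices, y_train the reference values):
# fat_pipeline.fit(X_train, y_train)
# y_pred = fat_pipeline.predict(X_test)
```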
Table 5. Regression performance comparison between no preprocessing and optimized preprocessing obtained via Bayesian optimization.

Component | Best Model (No Preprocessing) | RMSEP | R² | Best Model (Optimized) | RMSEP | R²
Fat | PLS | 0.159 | 0.903 | RidgeCV | 0.139 | 0.926
Protein | LassoLarsCV | 0.063 | 0.974 | RidgeCV | 0.054 | 0.981
Lactose | PLS | 0.027 | 0.912 | PLS | 0.026 | 0.917
Total Solids | PLS | 0.158 | 0.958 | RidgeCV | 0.154 | 0.960
Table 6. Comparison of best literature-based preprocessing methods across milk components.

Component | Best Method (Zhu et al. [25], Wu et al. [47]) | R² | Best Method (Amsaraj et al. [30], Bonfatti et al. [48]) | R² | Best Method (Inon et al. [31]) | R²
Fat | SNV (PLS) | 0.882 | SNV + 1st Der SavGol (PLS) | 0.896 | MSC (PLS) | 0.891
Protein | SNV (Elastic Net) | 0.960 | SNV + 1st Der SavGol (LassoLarsCV) | 0.970 | MSC (RidgeCV) | 0.972
Lactose | SNV (Elastic Net) | 0.835 | SNV + 1st Der SavGol (LassoLarsCV) | 0.846 | MSC (RidgeCV) | 0.822
Total Solids | SNV (RidgeCV) | 0.917 | SNV + 1st Der SavGol (PLS) | 0.933 | MSC (LassoLarsCV) | 0.950
Table 7. Statistical comparison of the optimized pipeline with baseline preprocessing methods. Significance is indicated for both uncorrected and Bonferroni-corrected p-values.

Component | Comparison | Cohen's D | P | Sig. (P) | P (Corr.) | Sig. (Corr.)
Fat | MSC Only | −3.43 | 0.001 | Yes | 0.017 | Yes
Fat | SNV + SG | −7.43 | 0.000 | Yes | 0.005 | Yes
Fat | SNV | −3.52 | 0.001 | Yes | 0.024 | Yes
Lactose | MSC Only | −3.38 | 0.002 | Yes | 0.044 | Yes
Lactose | SNV + SG | −4.47 | 0.001 | Yes | 0.025 | Yes
Lactose | SNV | −4.99 | 0.001 | Yes | 0.027 | Yes
TS | MSC Only | −3.37 | 0.011 | Yes | 0.266 | No
TS | SNV + SG | −2.91 | 0.013 | Yes | 0.314 | No
TS | SNV | −3.57 | 0.005 | Yes | 0.120 | No
True Protein | MSC Only | −2.33 | 0.000 | Yes | 0.002 | Yes
True Protein | SNV + SG | −4.04 | 0.003 | Yes | 0.070 | No
True Protein | SNV | −2.41 | 0.000 | Yes | 0.006 | Yes
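The comparison procedure behind Table 7 (and Table A4) can be sketched as follows: per-fold RMSE differences are tested with a paired t-test when the Shapiro-Wilk test [44] does not reject normality and with a Wilcoxon signed-rank test otherwise, Cohen's d [46] quantifies effect size, and p-values are Bonferroni-corrected [45] across the twelve comparisons. The RMSE arrays below are illustrative values, not the study's measurements.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

def compare(rmse_opt, rmse_base, n_comparisons=12, alpha=0.05):
    diff = rmse_opt - rmse_base
    # Paired t-test if the differences look normal, otherwise Wilcoxon.
    if shapiro(diff).pvalue > alpha:
        stat, p = ttest_rel(rmse_opt, rmse_base)
        test = "paired t-test"
    else:
        stat, p = wilcoxon(rmse_opt, rmse_base)
        test = "wilcoxon"
    cohens_d = diff.mean() / diff.std(ddof=1)   # paired-samples effect size
    p_corr = min(1.0, p * n_comparisons)        # Bonferroni correction
    return test, stat, p, cohens_d, p_corr

# Illustrative per-fold RMSE values (not the study's numbers):
rmse_optimized = np.array([0.135, 0.141, 0.138, 0.136, 0.142])
rmse_snv       = np.array([0.190, 0.205, 0.198, 0.201, 0.195])
print(compare(rmse_optimized, rmse_snv))
```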