1. Introduction
Milk is a nutrient-dense biological fluid that contains essential macronutrients, such as fat, protein, and lactose, along with vitamins, minerals, and bioactive compounds that support human growth, health, and disease prevention [1]. Fats contribute to energy provision, flavor, and fat-soluble vitamin transport; proteins such as casein and whey not only supply essential amino acids with high biological value but also exert a range of bioactive functions, including antibacterial, immunomodulatory, antioxidant, antihypertensive, and opioid-like activities, in addition to providing functional properties important in dairy processing [2]. Lactose serves as an energy source and facilitates calcium absorption, while minerals such as calcium, phosphorus, and magnesium are critical for bone development and metabolic functions [2]. The balance of these constituents determines not only the nutritional value of milk but also its technological functionality in the production of a wide range of dairy products [2]. The accurate determination of these components is therefore fundamental for quality control, economic valuation, and optimization of dairy production.
Conventional analysis of milk composition relies on standardized wet-chemical and instrumental reference methods to ensure accuracy and compliance. Protein content is determined using Kjeldahl nitrogen analysis or Dumas combustion; fat is measured through Gerber/Babcock acid digestion or Rose–Gottlieb solvent extraction; lactose is quantified by polarimetry; and minerals are analyzed through ashing followed by atomic absorption spectroscopy (AAS) or inductively coupled plasma (ICP) spectroscopy [3,4,5,6,7]. High-performance liquid chromatography (HPLC) is a versatile technique that can be applied for the quantification of lactose, proteins, fats, vitamins, and other bioactive compounds, offering high sensitivity and specificity across multiple milk components. While these conventional methods deliver high precision, they are often labor intensive, time consuming, and require skilled operators, making them less suitable for real-time process control in modern dairy operations [8].
In recent years, advanced food processing technologies, such as high-pressure processing (HPP), microfiltration, pulsed electric fields (PEF), UV-C (ultraviolet C) treatment, and high-pressure homogenization, have been developed to enhance microbial safety, extend shelf life, and preserve the nutritional and sensory qualities of milk [9]. HPP inactivates pathogenic and spoilage microorganisms using a hydrostatic pressure of 400–600 MPa without significant heat, maintaining vitamins, flavor, and protein functionality [10]. Microfiltration employs membrane separation to remove bacteria, spores, and somatic cells while retaining desirable components such as proteins and minerals [11]. PEF uses short bursts of high-voltage electric pulses to disrupt microbial cell membranes, achieving pasteurization-like safety with minimal thermal damage [12]. UV-C treatment, operating in the 200–280 nm wavelength range, inactivates microbes by damaging their DNA [13]. It is particularly effective for surface decontamination and thin-film liquid applications, helping preserve the nutritional and sensory qualities of foods while extending shelf life [14]. High-pressure homogenization applies intense shear forces at elevated pressures to reduce fat globule size, improve emulsion stability, and enhance microbial inactivation [15]. Complementing these advances, rapid analytical tools, including vibrational spectroscopies (mid-infrared, near-infrared, and Raman), fluorescence sensors, dielectric/impedance detectors, and biosensors, enable non-destructive, on-site measurement of key components and contaminants within seconds. These analytical systems can function independently or in tandem with processing methods to verify composition, confirm microbial inactivation, detect adulteration, and optimize parameters in real time. Their integration, particularly through inline sensors and IoT-enabled monitoring, closes the loop between processing and quality assurance, ensuring milk safety, consistency, and consumer acceptability.
Mid-infrared (MIR) spectroscopy, when coupled with chemometric modeling, offers a rapid, non-destructive, and cost-effective alternative to improve predictive accuracy and processing efficiency [8,16]. MIR spectroscopy measures the absorption of infrared radiation in the 4000–400 cm−1 region of the electromagnetic spectrum, where molecular vibrations associated with specific functional groups occur [17]. These vibrations correspond to the stretching and bending motions of chemical bonds, making MIR highly effective for identifying and quantifying key milk components. Across this range, various absorption peaks correspond to the vibrational modes of molecular bonds in milk components interacting with infrared radiation [18]. Fats are characterized by absorption bands associated with the stretching vibrations of C-H bonds in fatty acid chains. In particular, peaks at approximately 2922 cm−1 and 2852 cm−1 correspond to the asymmetric and symmetric stretching vibrations of the methylene (CH2) groups, respectively. Additionally, an absorption peak around 1743 cm−1 is linked to the C=O stretching vibrations of ester carbonyl groups in triglycerides, providing a distinctive marker for lipids in milk [18]. The spectral range between 1700 cm−1 and 1500 cm−1 is characterized by prominent peaks associated with peptide bonds in proteins. Two major bands are the amide I band around 1635 cm−1, attributed to C=O stretching and N-H bending vibrations, and the amide II band near 1548 cm−1, corresponding to N-H bending coupled with C-N stretching. These bands are directly related to the peptide bonds in milk proteins such as casein and whey proteins [19]. Additionally, the region between 1200 and 900 cm−1 contains absorption peaks linked to carbohydrates, particularly lactose. For instance, a peak at approximately 1077 cm−1 is associated with C-O stretching vibrations in lactose [20].
Despite these advantages, MIR spectra are inherently convoluted due to overlapping absorption bands from various constituents. Unlike HPLC, which yields distinct peaks for individual analytes, MIR does not allow the direct deconvolution of each component without additional statistical modeling. As a result, chemometrics plays a crucial role in the deconvolution of MIR spectra and in linking them to quantifiable milk components such as fat, protein, lactose, and total solids [8]. A typical chemometric workflow includes three essential steps: spectral preprocessing, wavenumber selection, and predictive model development. Of these, preprocessing is foundational because the raw spectra often contain instrumental noise, baseline drift, scattering effects, and sample inconsistencies that obscure meaningful chemical information [21]. Spectral preprocessing comprises mathematical transformations designed to minimize unwanted variation and enhance relevant features of the spectra. Common methods include baseline correction, scatter correction (e.g., standard normal variate (SNV), multiplicative scatter correction (MSC)), smoothing, normalization, and derivatives (e.g., Savitzky–Golay (SavGol)) [22]. These techniques improve the signal-to-noise ratio and promote consistency across samples, thus enhancing the accuracy and robustness of the resulting chemometric models [21,23]. However, choosing the appropriate preprocessing pipeline remains a significant challenge. Most studies rely on manual selection or predefined methods from previous work, often without evaluating their suitability for the current dataset or target analyte [22,24]. This trial-and-error approach introduces subjectivity and can lead to suboptimal model performance. Several studies have highlighted the limitations of such practices. For example, Zhu et al. [25] demonstrated that transferring preprocessing techniques such as SNV, SavGol, and first- and second-order derivatives developed for fruit ripening [26], portable near-infrared (NIR) devices for milk assessment [27], and meat quality classification [28] to dielectric spectroscopy for milk fat analysis yielded poor calibration performance, with only SNV in combination with least squares support vector machines (LSSVMs) providing marginal gains. Pinto et al. [29] similarly showed that preprocessing effectiveness in MIR-based lactose prediction depended heavily on the selected spectral region and transformation method. Amsaraj et al. [30] applied preprocessing pipelines derived from tea sample analysis to milk adulterant detection with limited success, underscoring the risks of direct method transfer. Inon et al. [31] observed that the application of MSC, originally developed for NIR spectra, failed to improve the prediction accuracy when adapted to FTIR spectra. Collectively, these works reveal the necessity of dataset-specific preprocessing optimization.
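Because SNV and Savitzky–Golay derivatives recur throughout these studies, a minimal sketch of the two transformations is given below; the function names and toy data are ours, and the parameter choices are illustrative rather than taken from any cited work.

import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    # Standard normal variate: center and scale each spectrum (row) individually.
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def savgol_first_derivative(spectra, window=15, polyorder=2):
    # Savitzky-Golay first derivative taken along the wavenumber axis.
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=1, axis=1)

# X: (n_samples x n_wavenumbers) absorbance matrix; random placeholder here.
X = np.random.default_rng(0).normal(size=(10, 1060))
X_pre = savgol_first_derivative(snv(X))  # SNV followed by a first derivative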
Bayesian optimization (BO) offers a principled framework for addressing this issue. Unlike grid or random search methods, which either exhaustively or blindly explore the hyperparameter space, BO employs probabilistic models (e.g., Gaussian processes or Tree-structured Parzen Estimators) to guide the search toward promising regions of the solution space [32,33]. This enables efficient and scalable optimization, particularly in high-dimensional or complex domains like spectral preprocessing. In chemometrics, BO has been shown to outperform greedy or uninformed strategies in tasks such as PLS calibration [33], spectral feature selection [34], and MIR-based protein quantification [8].
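To make the idea concrete, the sketch below maximizes a toy one-dimensional objective with the bayesian-optimization Python package (the library later used in Section 2.4.3); the objective and bounds are placeholders standing in for a chemometric score such as negative cross-validated RMSE.

from bayes_opt import BayesianOptimization  # pip install bayesian-optimization

def objective(x):
    # Toy smooth function standing in for a model-quality score; BO maximizes it.
    return -(x - 2.0) ** 2

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"x": (-4.0, 4.0)},  # search bounds for each parameter
    random_state=42,
)
# A few random probes, then surrogate-guided evaluations.
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best parameter setting and score found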
Despite recent advances in wavenumber selection and model tuning, preprocessing optimization remains underexplored. Notable efforts such as the nippy package by Torniainen et al. [22] introduced the automated comparison of preprocessing strategies but relied on greedy search, which becomes computationally expensive and lacks global exploration capabilities. Moreover, the need for more adaptive and scalable preprocessing optimization has been emphasized in the recent chemometric literature [21,22,24].
Motivated by this gap, we propose a novel framework for automated preprocessing optimization in spectroscopic data analysis. The approach integrates spectroscopy-specific and general machine learning preprocessing techniques and leverages Gaussian Process-based Bayesian Optimization to dynamically identify the most effective pipeline for each predictive task. We apply this method to a mid-infrared (MIR) milk spectroscopy dataset to optimize the prediction of fat, protein, lactose, and total solids. Our results demonstrate that data-driven preprocessing selection within a chemometric modeling framework improves model accuracy and robustness while reducing the reliance on intuition and manual tuning. Although developed for milk analysis, this framework is a generalizable solution applicable to a wide range of infrared spectroscopic datasets in food, pharmaceutical, environmental, and agricultural domains.
2. Materials and Methods
2.1. Spectra Acquisition
MIR spectral data of milk were obtained from Agropur Jerome Cheese, Jerome, ID, USA; the dataset consisted of MIR spectra collected during routine in-process milk analysis. All spectra were acquired using a MilkoScan FT1 (Foss North America, Eden Prairie, MN, USA). The MIR dataset included samples from multiple production sources, encompassing various vats, raw tanks (RT), and other operational milk streams that are sampled throughout everyday operations. All spectra were provided in their raw form (absorbance units), without any additional preprocessing applied prior to analysis.
2.2. Dataset Description
The dataset used in this study comprises a total of 1772 spectra and 1193 reference records. After aligning and matching the two sources, 6362 spectral-reference matched samples were obtained. Each spectrum consists of 1060 variables, corresponding to wavenumbers ranging from 4999.99 cm−1 to 925.07 cm−1, with an approximate step size of 4 cm−1. These spectra were paired with the four target variables used in this study: fat (%), true protein (%), lactose (%), and total solids (TS, %). Because the spectral data contained multiple replicates (e.g., samples labeled VAT15_1 and VAT15_2, as well as repeated measurements under the same label), a data reduction step was necessary. To resolve redundancy and improve consistency, mean spectra were computed for the replicates of each unique sample group, resulting in a final dataset containing 385 averaged spectra and 193 unique samples.
2.3. Data Splitting
To ensure robust model validation, a modified Kennard-Stone algorithm with replicate handling (Kennard-StoneR) was implemented for data partitioning, as presented in Algorithm 1. This approach addresses a critical limitation of the vanilla Kennard-Stone method [35], which can inadvertently place replicates from the same sample in both training and test sets, potentially leading to data leakage and overly optimistic model performance estimates. The Kennard-StoneR algorithm maintains the core principle of the original Kennard-Stone method, maximizing the Euclidean distance between selected samples to ensure representative coverage of the feature space, while incorporating group-aware selection to prevent replicate splitting. The algorithm proceeds as follows.
Algorithm 1 Modified Kennard-Stone with replicate handling (Kennard-StoneR). |
Require: Data matrix X, labels y, group identifiers G, test proportion p
Ensure: Training and test indices
1: Aggregate replicates by computing the centroid of each sample group:
2: for each unique group identifier i do
3:   c_i ← mean of all rows of X whose group identifier is i
4: end for
5: Determine the number of training groups:
6:   n_train ← round((1 − p) × number of unique groups)
7: Compute the distance matrix between all group centroids:
8:   D_ij ← ‖c_i − c_j‖₂
9: Initialize selected groups S with the pair of groups having maximum distance:
10:   S ← {i*, j*}, where (i*, j*) = argmax_(i,j) D_ij
11: while |S| < n_train do
12:   For each unselected group u:
13:     Compute minimum distance to any selected group:
14:       d_u ← min_(s ∈ S) D_us
15:   Add the unselected group with maximum minimum distance:
16:     S ← S ∪ {argmax_u d_u}
17: end while
18: Map selected groups to original sample indices:
19:   train_indices ← {k : G_k ∈ S}
20:   test_indices ← {k : G_k ∉ S}
21: return X[train_indices], X[test_indices], y[train_indices], y[test_indices]
|
The algorithm generates a training set that optimally spans the feature space while reserving a representative portion (p) of samples for independent testing. For our 193 unique milk composition MIR spectra, this technique resulted in a training set comprising 70% or 135 unique samples and a test set with the remaining 30% or 58 samples, with the complete separation of replicates between sets. Group-aware cross-validation was implemented by grouping samples that share the same base identifier, regardless of replicate index. For example, samples labeled VAT15_1 and VAT15_2 were treated as belonging to the same group (VAT15). This approach ensured that all replicates of a given sample were kept together during both training and testing, thereby preventing data leakage. In total, 193 unique groups were identified based on this naming convention. We verified that spectra within each group had similar reference (target) values, confirming their validity as true replicates.
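A compact NumPy rendering of Algorithm 1 is sketched below, assuming X is the spectral matrix and groups holds one base identifier (e.g., VAT15) per row; this is our illustration of the procedure, not the authors' exact implementation.

import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone_r(X, groups, test_size=0.3):
    # Group-aware Kennard-Stone: replicates of one sample never straddle the split.
    groups = np.asarray(groups)
    uniq = np.unique(groups)
    # Step 1: replace each group's replicates by their centroid spectrum.
    centroids = np.vstack([X[groups == g].mean(axis=0) for g in uniq])
    n_train = int(round((1 - test_size) * len(uniq)))
    # Step 2: pairwise Euclidean distances between group centroids.
    D = cdist(centroids, centroids)
    # Step 3: seed the selection with the two most distant groups.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    # Step 4: repeatedly add the group farthest from the current selection (max-min).
    while len(selected) < n_train:
        remaining = [k for k in range(len(uniq)) if k not in selected]
        d_min = D[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(d_min))])
    in_train = np.isin(groups, uniq[selected])
    return np.where(in_train)[0], np.where(~in_train)[0]

# Example: 30 spectra, 10 groups of 3 replicates each.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1060))
groups = np.repeat([f"VAT{i}" for i in range(10)], 3)
train_idx, test_idx = kennard_stone_r(X, groups, test_size=0.3)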
2.4. Automated Pipeline Optimization for Spectral Preprocessing and Modeling
To enhance the robustness, reproducibility, and efficiency of spectral data analysis, we developed a Python-based framework for automated preprocessing pipeline optimization. At its core is the PipelineOptimizer class, which leverages Bayesian optimization [36] to systematically explore and fine-tune combinations of preprocessing techniques and model hyperparameters. This process is designed to yield the most predictive and scientifically valid pipeline tailored to the user’s dataset.
The framework supports a diverse set of preprocessing methods, including both spectroscopy-specific transformations and general purpose machine learning preprocessing from scikit-learn. During optimization, the framework automatically excludes incompatible combinations based on predefined rules, as described subsequently, ensuring that only valid configurations are evaluated. The high-level workflow, encompassing preprocessing configuration, validation, and pipeline optimization, is summarized in Algorithm 2 and Figure 1. A complete version with detailed steps and procedures is provided in Appendix A (Algorithm A1). This algorithm outlines the core logic behind candidate generation, evaluation using cross-validation or test data, and the Bayesian search strategy employed for optimization. This structured and reproducible approach provides a powerful tool for advancing chemometric analysis in both research and applied settings.
Algorithm 2 Automated spectroscopic data pipeline optimization. |
1: Input: X_train, y_train, preprocessing steps P, incompatibilities I, allowed lengths L, bounds on the regularization strength α, n_init, n_iter
2: Optional: X_test, y_test
3: procedure GeneratePipelines(P, I, L)
4:   Generate all valid preprocessing pipelines subject to incompatibilities I
5: end procedure
6: procedure Evaluate(θ)
7:   Decode θ to build pipeline (preprocessing steps followed by Ridge(α))
8:   if (X_test, y_test) is available then
9:     Fit on the training set and evaluate on the test set
10:   else
11:     Cross-validate on the training set
12:   end if
13:   return negative RMSE as score
14: end procedure
15: procedure Optimize()
16:   Use Bayesian optimization to find θ* maximizing Evaluate
17:   Build best pipeline π* from θ*
18:   Fit π* on (X_train, y_train); evaluate on (X_test, y_test) if available
19:   return π*, θ*
20: end procedure
21: Output: Optimized pipeline π* and parameters θ*
|
2.4.1. Overview of Pipeline Optimization Strategy
The PipelineOptimizer class supports spectral datasets formatted as NumPy arrays, allowing users to specify training and testing sets, cross-validation strategies, and optional grouping variables. The framework incorporates two group-aware validation strategies, GroupShuffleSplit and LeavePGroupsOut, ensuring the robust evaluation of pipelines in the presence of samples with repeated measurements [37,38].
2.4.2. Preprocessing Configuration Space
Users can provide a custom list of candidate preprocessing steps, which are then filtered for compatibility using a set of predefined rules. The framework supports the following spectroscopy-specific preprocessing methods: SNV, SavGol, MSC, Extended Multiplicative Signal Correction (EMSC), Mean Centering (MeanCN), Detrending, AsymmetricLeastSquareBaselineCorrection, Localized SNV (LSNV), and Robust Normal Variate (RNV) [39,40,41]. Additionally, the framework seamlessly integrates general purpose preprocessing methods from scikit-learn: Standard Scaling, Robust Scaling, Global Scaling, MinMaxScaler, Normalization, QuantileTransformer, Principal Component Analysis (PCA), Locally Linear Embedding (LLE), fast Independent Component Analysis (fast-ICA), kernel-PCA, and PowerTransformer. All valid preprocessing pipelines comprising up to a user-defined maximum number of steps are enumerated in advance, allowing optimization to occur over this discrete configuration space. Users can further constrain the search by specifying the allowed pipeline lengths. For example, setting the maximum pipeline length to 2 allows for either single preprocessing steps or combinations of two compatible methods.
2.4.3. Bayesian Optimization
Pipeline optimization is performed using Bayesian optimization with an Expected Improvement (EI) acquisition function via the Bayesian optimization Python library to autonomously identify optimal preprocessing pipelines, thereby eliminating the traditionally labor-intensive process of manual tuning in spectroscopic analysis [36]. Bayesian optimization is a probabilistic model-based approach that efficiently locates the extrema of objective functions with minimal evaluations, making it particularly well-suited for complex optimization tasks [36]. The optimization process begins with n_init random initial configurations, followed by n_iter intelligently selected configurations guided by the Bayesian model’s posterior distribution (with both n_init and n_iter defined by the user). The objective function dynamically constructs and evaluates pipelines based on a sampled index into the list of possible preprocessing configurations. Each pipeline is appended with a Ridge regression estimator, whose regularization strength (α) is also optimized.
During each evaluation, the selected pipeline is fitted and validated using the configured cross-validation strategy. The objective function returns the negative Root Mean Squared Error (RMSE), penalizing unstable or ill-conditioned configurations (e.g., those leading to LinAlgError). Logging is integrated throughout the optimization process to track evaluated configurations, metric values (RMSE, R²), and potential numerical issues.
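The sketch below illustrates one way such an objective can be wired up with scikit-learn, indexing into the pre-enumerated pipeline list and returning the negative cross-validated RMSE; candidate_pipelines, the helper name, and the penalty value are our assumptions, not the framework's actual API.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline

def make_objective(candidate_pipelines, X, y, groups):
    # candidate_pipelines: list of step lists, e.g. [("snv", SomeSNVTransformer())]
    # (the transformer class name here is hypothetical).
    cv = GroupShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

    def objective(pipe_idx, log_alpha):
        # BO proposes continuous values; round the index into the discrete list.
        steps = list(candidate_pipelines[int(round(pipe_idx))])
        model = Pipeline(steps + [("ridge", Ridge(alpha=10.0 ** log_alpha))])
        try:
            scores = cross_val_score(model, X, y, groups=groups, cv=cv,
                                     scoring="neg_root_mean_squared_error")
            return scores.mean()  # negative RMSE: larger is better
        except np.linalg.LinAlgError:
            return -1e6  # heavily penalize ill-conditioned configurations

    return objective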
2.4.4. Cross-Validation Methods
To ensure robust and realistic evaluation of preprocessing pipelines, we implemented group-aware cross-validation strategies suitable for spectroscopic data and chemometric modeling. These methods are particularly suited for spectral datasets, where measurements may be recorded as replicates. The framework allows users to provide an optional group parameter, specifying the group to which each sample belongs. If no group information is supplied, each sample is treated as independent, and traditional sample-level validation is performed. Two primary group-based cross-validation techniques are supported:
Group-Shuffle-Split: This method randomly divides groups of samples into training and validation sets while ensuring that all samples within a group are assigned to the same split. This technique helps mitigate data leakage and preserves the natural structure of the data, which is important for spectral datasets prone to replicate effects.
Leave-P-Groups-Out: This exhaustive method iteratively leaves out P groups as a validation set, training on the remaining groups. It offers a more stringent assessment of generalization to unseen groups, though at a higher computational cost.
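A minimal example of both validation strategies using scikit-learn is shown below; the toy data and group labels are illustrative.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit, LeavePGroupsOut

X = np.random.default_rng(1).normal(size=(12, 5))          # toy spectra
y = np.random.default_rng(2).normal(size=12)               # toy targets
groups = np.repeat(["VAT15", "VAT16", "RT01", "RT02"], 3)  # 3 replicates per sample

# Group-Shuffle-Split: random group-level train/validation partitions.
gss = GroupShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_idx, val_idx in gss.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])  # no replicate leakage

# Leave-P-Groups-Out: exhaustively hold out every combination of P groups.
lpgo = LeavePGroupsOut(n_groups=1)
for train_idx, val_idx in lpgo.split(X, y, groups=groups):
    pass  # fit on train_idx, validate on val_idx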
In addition to these cross-validation strategies, we enhanced the evaluation function with conditional logic to leverage external test data when available. Specifically, we added a mechanism to check whether both X_test and y_test are present. If test data is provided, the evaluation proceeds as follows:
The pipeline is fit on the training data.
Predictions are made on the external test data.
Performance metrics, Root Mean Squared Error (RMSE) and the coefficient of determination (R²), are computed on the test set.
The negative RMSE is returned as the optimization score for compatibility with minimization-based search frameworks.
If external test data is not available, or if an error occurs during test-based evaluation, the function defaults to the original group-based cross-validation strategy using either Group-Shuffle-Split or Leave-P-Groups-Out.
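A sketch of this conditional evaluation logic is given below; the function signature and fallback behavior are our rendering of the description above, not the framework's verbatim code.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

def evaluate(pipeline, X_train, y_train, X_test=None, y_test=None,
             cv=None, groups=None):
    # Prefer the external test set; fall back to group-aware cross-validation.
    if X_test is not None and y_test is not None:
        try:
            pipeline.fit(X_train, y_train)
            pred = pipeline.predict(X_test)
            rmse = np.sqrt(mean_squared_error(y_test, pred))
            r2 = r2_score(y_test, pred)  # logged alongside RMSE in the framework
            return -rmse
        except Exception:
            pass  # on any failure, fall through to cross-validation
    scores = cross_val_score(pipeline, X_train, y_train, groups=groups, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return scores.mean()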
2.4.5. Compatibility Rules for Preprocessing Pipelines
The framework is designed to support a wide range of preprocessing techniques, drawing from both domain-specific spectroscopic methods and general purpose machine learning transformations available through scikit-learn. While this flexibility enables the construction of diverse and powerful pipelines, it also introduces the risk of combining methods that are theoretically redundant, semantically incompatible, or computationally conflicting.
To address this, we implemented a set of incompatibility rules that automatically prevent mutually exclusive or conceptually redundant methods from being used together. These rules are defined based on both functional similarity and insights from the prior spectroscopic and chemometric literature [22].
For example, the following groups of preprocessing steps are treated as mutually incompatible:
Scatter Correction Methods: SNV, MSC, EMSC, LSNV, RNV are all methods that correct for scatter effects in spectral data. Applying more than one of these techniques can lead to overcorrection or unintended distortions.
Scaling and Normalization Methods: Methods such as scaler, autoscale, globalscale, normalization, robust-scaler, minmax-scaler, power-transformer, quantile-transformer, and row-standardizer all perform some form of scaling or normalization. Using multiple scaling approaches in the same pipeline may introduce redundancy and instability.
Method-Specific Incompatibilities: Specific combinations such as SNV with row-standardizer, or autoscale with scaler, are excluded due to their overlapping functionalities.
Dimensionality Reduction Methods: Techniques such as PCA, fast-ICA, kernel-PCA, and LLE aim to reduce data dimensionality and are typically not applied together, as they each represent distinct reduction philosophies.
These constraints are enforced internally through a predefined list of incompatibility sets. When a user supplies a list of candidate preprocessing techniques, some of which may be mutually incompatible, the framework ensures that such combinations are automatically excluded from consideration during pipeline optimization. Instead of raising an error, the system filters out any configurations that violate the defined compatibility rules, thereby streamlining the search space and maintaining both computational efficiency and methodological validity. This ensures that only scientifically coherent and practically feasible pipelines are explored during the optimization process, in line with established best practices in chemometric data preprocessing [22].
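The sketch below shows one way such rule-based filtering can be realized with itertools, enumerating all ordered pipelines up to a maximum length and discarding any that contain two members of the same incompatibility set; the step names and rule sets shown are an abridged, illustrative subset.

from itertools import permutations

steps = ["snv", "msc", "savgol", "detrend", "scaler", "minmax-scaler", "pca"]

# Incompatibility sets: a pipeline may use at most one member of each set.
incompatible = [
    {"snv", "msc"},               # scatter correction methods
    {"scaler", "minmax-scaler"},  # scaling/normalization methods
]

def is_valid(pipeline):
    chosen = set(pipeline)
    return all(len(chosen & rule) <= 1 for rule in incompatible)

def enumerate_pipelines(steps, max_len=2):
    # All ordered pipelines with 1..max_len steps that satisfy the rules.
    for k in range(1, max_len + 1):
        for candidate in permutations(steps, k):
            if is_valid(candidate):
                yield candidate

pipelines = list(enumerate_pipelines(steps, max_len=2))  # the discrete search space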
By integrating these methodological advances, the proposed framework represents a significant improvement over traditional approaches to spectroscopic data preprocessing, enabling more systematic, objective, and reproducible preprocessing pipeline optimization for chemometric applications.
2.5. Regression Analysis
To model the relationship between spectral features and the target variable(s), we employed six regression algorithms provided by the scikit-learn library [42]. These included Elastic Net, Partial Least Squares (PLS), Support Vector Regression (SVR), LassoLarsCV, RidgeCV, and Gradient Boosting Machines (GBMs).
The regression models are briefly described below:
Elastic Net regression combines both L1 (Lasso) and L2 (Ridge) regularization penalties. It is particularly effective for datasets with multicollinearity and for performing variable selection.
Partial Least Squares (PLS) regression projects both predictors and response variables to a latent space, maximizing their covariance. It is especially suitable for spectral data due to its ability to handle high-dimensional and collinear variables.
Support Vector Regression (SVR) models non-linear relationships by transforming data into a higher-dimensional space using kernel functions. It aims to fit the best hyperplane within a tolerance margin.
LassoLarsCV uses the lasso and the Least Angle Regression (LARS) algorithms with built-in cross-validation to select the optimal amount of L1 regularization. It encourages sparsity and aids in automatic feature selection.
RidgeCV applies L2 regularization and selects the best regularization parameter using cross-validation. It is robust against multicollinearity and can stabilize coefficient estimates.
Gradient Boosting Machines (GBMs) are a powerful ensemble method that builds a sequence of weak learners, typically decision trees, to minimize prediction error. Each new learner incrementally fits the residuals of the previous models to improve overall performance.
Each model was trained on the optimally preprocessed data and evaluated on test data. The hyperparameters of each model were tuned using Bayesian optimization to maximize predictive performance. To prevent data leakage, all steps including hyperparameter optimization were strictly confined to the training data.
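For illustration, the six regressors can be instantiated from scikit-learn as below; the hyperparameter values shown are defaults or simple placeholders (X_train_pre and similar names are assumed, not from the study), whereas in the study they were tuned by Bayesian optimization.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, LassoLarsCV, RidgeCV
from sklearn.svm import SVR

models = {
    "PLS": PLSRegression(n_components=10),
    "ElasticNet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "SVR": SVR(kernel="rbf", C=1.0, epsilon=0.1),
    "LassoLarsCV": LassoLarsCV(cv=5),
    "RidgeCV": RidgeCV(alphas=np.logspace(-6, 6, 25)),
    "GBM": GradientBoostingRegressor(n_estimators=200, learning_rate=0.05),
}

# Fit each candidate on the optimally preprocessed training data and
# score it on the held-out test set:
# for name, model in models.items():
#     model.fit(X_train_pre, y_train)
#     print(name, model.score(X_test_pre, y_test))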
2.6. Hyperparameter Tuning Strategy
We adopted a two-stage optimization framework that decouples preprocessing pipeline optimization from final model hyperparameter tuning to balance computational efficiency and modeling flexibility.
During the first stage, preprocessing pipelines were optimized using Bayesian optimization with cross-validation, where each pipeline configuration was evaluated using a RidgeCV regression model. RidgeCV was selected as the estimator during this stage due to its single hyperparameter and efficient internal cross-validation. This allowed the framework to explore a wide variety of preprocessing configurations without the added computational burden of simultaneously tuning complex model architectures. The regularization strength α was searched over a log-spaced range. The values of n_init and n_iter were set to 50 and 200, respectively.
Table 1 summarizes the hyperparameter search space for a few of the preprocessing components explored during optimization.
In the second stage, the best performing preprocessing pipelines were fixed and used to evaluate multiple regression models. These included Partial Least Squares (PLS), Elastic Net, RidgeCV, LassoLarsCV, Support Vector Regression (SVR), and Gradient Boosting Machines (GBMs). Each model except RidgeCV and LassoLarsCV underwent hyperparameter tuning using Bayesian optimization on the training data, with the search spaces listed in Table 2. The values of n_init and n_iter were set at 5 and 100, respectively, for all models except PLS, where they were set to 5 and 10.
This two-stage procedure provides a clear separation between preprocessing pipeline discovery and model learning, enabling flexible experimentation while keeping overall search complexity tractable. Moreover, models like Elastic Net and LassoLarsCV inherently perform feature selection by assigning zero weights to less informative variables, offering indirect insights into variable importance.
2.7. Statistical Analysis
To evaluate whether the optimized preprocessing pipeline statistically outperformed baseline methods, we conducted hypothesis testing on fold-level RMSE values from 5-fold GroupShuffleSplit cross-validation across models and milk components. We restricted the analysis to the PLS and RidgeCV models, as they yielded the best overall results. This led to a total of 24 pairwise comparisons (2 models × 4 components × 3 comparisons).
For each comparison, we tested the null hypothesis that the mean fold-level RMSE difference between the optimized pipeline and the baseline method is zero against the two-sided alternative that it differs from zero.
The choice of statistical test was based on the normality of the paired RMSE differences, assessed using the Shapiro–Wilk test (p > 0.05). If normality held, we used a paired t-test; otherwise, the Wilcoxon signed-rank test was applied [43,44].
To control the family-wise error rate from multiple comparisons, we applied the Bonferroni correction to the resulting p-values [45]. Cohen’s d was computed to quantify the effect size and direction of each comparison, with negative values indicating better performance (lower RMSE) by the optimized pipeline. The absolute magnitude of d follows conventional benchmarks: values greater than 0.8 denote a large effect size, and values exceeding 1.3 are considered very large [46]. We also reported 95% confidence intervals for the mean RMSE differences. All tests were two-sided, with a Bonferroni-corrected significance threshold.
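The sketch below outlines this testing procedure for one pairwise comparison using SciPy; the fold-level RMSE values are toy numbers, and the function is our rendering of the described workflow rather than the authors' script.

import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

def compare_fold_rmse(rmse_optimized, rmse_baseline, n_comparisons=24):
    # Paired comparison of fold-level RMSEs with Bonferroni correction and Cohen's d.
    diff = np.asarray(rmse_optimized) - np.asarray(rmse_baseline)
    _, p_norm = shapiro(diff)  # normality of the paired differences
    if p_norm > 0.05:
        _, p = ttest_rel(rmse_optimized, rmse_baseline)
        test = "paired t-test"
    else:
        _, p = wilcoxon(rmse_optimized, rmse_baseline)
        test = "Wilcoxon signed-rank"
    p_corrected = min(p * n_comparisons, 1.0)   # Bonferroni correction
    cohens_d = diff.mean() / diff.std(ddof=1)   # negative favors the optimized pipeline
    return {"test": test, "p_corrected": p_corrected, "cohens_d": cohens_d}

# Toy example with 5 cross-validation folds:
print(compare_fold_rmse([0.051, 0.055, 0.049, 0.053, 0.050],
                        [0.061, 0.066, 0.060, 0.064, 0.059]))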
Boxplots showing RMSE distributions for each preprocessing method and milk component were generated to visually support the statistical findings, with asterisks indicating Bonferroni-corrected significance levels (* and **).
3. Results and Discussion
3.1. Dataset Statistics and Distribution
Table 3 provides summary statistics for the four milk components analyzed: fat, protein, lactose, and total solids. Among these, total solids show the highest mean concentration (16.15%), followed by fat (5.57%), protein (4.76%), and lactose (4.56%). Protein and lactose display narrow standard deviations (0.51 and 0.12, respectively), indicating relatively consistent composition across samples, whereas fat and total solids show more variability. The lower bounds for fat and total solids (3.07% and 12.11%, respectively) also suggest possible sample dilution or formulation effects.
Figure 2 provides a visual overview of the relationships and distribution patterns among the analyzed milk components (fat, protein, lactose, and TS). The correlation matrix (top left) reveals a strong positive correlation between fat and total solids, indicating that higher fat content tends to be associated with higher total solids, an expected trend in milk composition. A moderate negative correlation is observed between protein and lactose, suggesting that as protein levels increase, lactose concentrations may slightly decrease. Additional correlations include a moderate positive relationship between protein and total solids and a weak positive correlation between fat and protein.
The boxplots (top right) confirm that protein and lactose concentrations are relatively uniform across samples, whereas fat and total solids show broader variability, which may reflect processing practices or targeted composition adjustment in the sample set.
Lastly, the density plots (bottom) show that most components are negatively skewed, particularly protein and total solids, suggesting the majority of samples cluster near upper concentration ranges. Elevated kurtosis values (e.g., 3.86 for protein and 4.37 for lactose) indicate peaked distributions with a few low-value outliers. These trends point toward controlled or processed milk samples rather than fresh raw milk, which is often subject to greater component variation.
3.2. Spectral Preprocessing
To mitigate the risk of overfitting and enhance model robustness, we explored spectral preprocessing pipelines with a restricted number of steps, using our automated optimization framework based on Bayesian optimization. Two configurations were considered by setting the allowed preprocessing pipeline length to either 2 (single preprocessing steps or combinations of two compatible methods) or 3 (single preprocessing steps or combinations of two or three compatible methods).
The optimization was performed solely on the training set, leveraging group-aware cross-validation using the Group-Shuffle-Split method to respect sample dependencies. We employed Bayesian optimization with n_init = 50 and n_iter = 200, as described in Section 2.6. All preprocessing steps were drawn from the following set: MSC, SavGol, detrend, scaler, SNV, robust_scaler, EMSC, PCA, normalization, autoscale, globalscaler, and meancn.
The optimized preprocessing pipelines for each milk component are summarized in Table 4 for the length-2 and length-3 configurations, respectively. Interestingly, the optimal pipelines were identical across both configurations, suggesting that a simpler preprocessing structure was sufficient for our data.
The consistency of results across both experimental configurations supports the robustness of the identified preprocessing schemes.
Figure 3 and Figure 4 provide a visual comparison of the raw spectral data and the effects of various preprocessing strategies. Figure 3 presents the raw MIR spectra of 135 milk calibration samples, which exhibit high overall alignment and minimal baseline or scatter artifacts. The spectra are smooth and consistent across the 3000–1000 cm−1 range, with major absorbance bands clearly preserved. A localized region of high-frequency variation is visible between approximately 1750 cm−1 and 1600 cm−1, likely reflecting chemical variability or instrument-related noise in that spectral window.
In Figure 4, the top row displays preprocessing pipelines optimized via Bayesian optimization and tailored to individual milk components, while the bottom row includes commonly reported literature methods such as SNV, MSC, and their combinations with derivatives. Literature-based techniques (e.g., SNV and MSC) effectively smooth the spectra and suppress global variation, resulting in visually cleaner profiles. However, this visual uniformity can come at the cost of reduced predictive information, particularly if relevant spectral variability is filtered out. Conversely, the optimized pipelines introduced sharper variations, especially in regions such as 1450–1250 cm−1 and 2250–1750 cm−1, due to the application of scalers and derivatives. While these transformations may appear noisier, they are selected based on their ability to enhance model-relevant features rather than aesthetic smoothness. This contrast reinforces the core philosophy behind the Bayesian optimization approach: preprocessing should be optimized for predictive performance, not visual clarity.
3.3. Regression Analysis and the Importance of Optimized Preprocessing
To assess the impact of preprocessing techniques on predicting milk component concentrations on the test set, we conducted regression analyses under three distinct scenarios: without preprocessing, with optimized preprocessing obtained via Bayesian optimization, and using the custom preprocessing techniques previously reported in the literature (MSC, SNV, first derivative, and second derivative).
From Table 5, without preprocessing (baseline scenario), predictive models yielded reasonably accurate results on the test set. For example, fat prediction achieved an RMSEP of 0.159 (PLS regression), protein showed high prediction accuracy with an RMSEP of 0.063 (LassoLarsCV), lactose presented moderate predictive accuracy (RMSEP = 0.027, PLS regression), and total solids predictions demonstrated robust accuracy (RMSEP = 0.158, PLS regression).
Applying optimized preprocessing improved model performance for protein and lactose predictions. Specifically, protein prediction RMSEP decreased to 0.054 (RidgeCV regression), enhancing predictive accuracy compared to the baseline scenario. Similarly, lactose prediction benefited from preprocessing optimization, achieving a lower RMSEP of 0.026 (PLS regression). Total solids and fat predictions also showed moderate improvements, with the best total solids prediction yielding an RMSEP of 0.154 (RidgeCV regression) and fat prediction reaching an RMSEP of 0.139 (RidgeCV regression).
Notably, across all three scenarios, support vector regression (SVR) consistently underperformed on the test set, despite often achieving strong performance on the training data. This discrepancy highlights the risk of overfitting when using highly flexible models on relatively limited datasets. The use of group-aware cross-validation during model development proved effective in providing a more realistic assessment of model generalization ability, particularly where reserving a separate internal validation set was not practical.
These findings are visually confirmed in Figure 5, which displays predicted versus true plots for three representative models, PLS, RidgeCV, and LassoLarsCV, on the test set using optimized preprocessing obtained through Bayesian optimization. The best-performing models (highlighted in red) align closely with the identity line, especially for protein and total solids. Results for the remaining models, i.e., SVR, GBM, and ElasticNet, are included in Appendix C (Figure A1). The full regression performance metrics on both the training and test sets for each model and preprocessing strategy are provided in Appendix B (Table A1, Table A2 and Table A3).
From Table 6, using custom preprocessing methods commonly reported in the literature, we identified several studies that applied spectroscopy techniques to milk datasets. Zhu et al. [25] and Wu et al. [47] both reported SNV as the optimal preprocessing technique for their respective datasets. Wu et al. specifically employed short-wave NIR spectroscopy in the 800–1050 nm range to analyze the primary compounds in milk powder. Similarly, Amsaraj et al. [30] and Bonfatti et al. [48] identified a combination of SNV and first-derivative Savitzky–Golay (SavGol) filtering as their optimal preprocessing approach. Bonfatti et al. [48] specified SavGol parameters as a window length of 15, derivative order of 1, and polynomial order of 4. As Amsaraj et al. [30] did not report their SavGol parameters, we adopted the same values for consistency.
Although the literature has suggested that MSC and SNV are generally effective in improving model performance, our results contradict this assumption. On our dataset, these methods produced inferior outcomes compared to both unprocessed data and the results achieved through our optimized preprocessing pipeline. Similarly, the use of first and second derivatives, often recommended for enhancing predictive power, offered only marginal benefits over no preprocessing in certain cases. The best fat prediction using custom preprocessing (SNV + 1st Der SavGol, PLS), as used by Amsaraj et al. [30] and Bonfatti et al. [48], was inferior to both no preprocessing (PLS) and our method (RidgeCV). For protein, the top custom result (MSC, RidgeCV) from Inon et al. [31] also underperformed compared to no preprocessing (LassoLarsCV) and our method (RidgeCV). Lactose prediction using SNV + 1st Der SavGol (LassoLarsCV) similarly lagged behind no preprocessing (PLS) and our approach (PLS). For total solids, MSC with LassoLarsCV was outperformed by both no preprocessing (PLS) and our method (RidgeCV).
These comparative findings underscore the critical importance of dataset-specific preprocessing optimization. Adopting preprocessing methods from unrelated or even closely related prior studies without validation can negatively affect prediction accuracy. Thus, optimized preprocessing tailored explicitly to individual datasets and prediction targets remains an essential step for achieving maximum accuracy and reliability in milk component prediction models.
3.4. Statistical Comparison of Preprocessing Methods
Statistical testing confirmed that the optimized pipeline significantly outperformed baseline preprocessing methods for most milk components, as presented in Table 7. Before Bonferroni correction, almost all pairwise comparisons were statistically significant; this included all comparisons under RidgeCV and 8 of 12 under PLS. The normality of RMSE differences was assessed using the Shapiro–Wilk test for all pairwise comparisons. A paired t-test was used in all cases except one under PLS, where normality was violated; in that instance, the nonparametric Wilcoxon signed-rank test was applied.
After applying the Bonferroni correction, eight of nine RidgeCV comparisons remained significant, with the exception of total solids. In particular, three comparisons exhibited strong significance: fat (optimized vs. SNV+SG), true protein (optimized vs. MSC), and true protein (optimized vs. SNV). In contrast, only one PLS comparison remained significant after correction, despite the strong trends observed beforehand.
All RidgeCV comparisons produced large to extremely large effect sizes (Cohen’s d ranging from 2.3 to 7.4), supporting the practical relevance of the optimized pipeline. PLS comparisons also consistently showed large effect sizes despite losing corrected significance.
Boxplots (Figure 6, Figure 7 and Figure 8 and Figure A2) further illustrate these results, showing consistent reductions in RMSE and variability for the optimized pipeline across components. Even for total solids, where Bonferroni-corrected significance was not observed, the optimized pipeline exhibited a visibly lower RMSE distribution compared with all other methods.
This study highlights the pivotal role of spectral preprocessing in improving the accuracy of milk component predictions. Using a Bayesian optimization-based framework, we identified preprocessing pipelines that consistently outperformed both no preprocessing and alternative algorithms reported in the literature, especially for protein and lactose. Fat and total solids have stronger IR spectral signatures, and for these we observed more modest gains, suggesting that such analytes may require only simpler corrections.
A major insight is the data- and component-specific nature of preprocessing. Optimal pipelines vary between components, confirming that a universal approach is inadequate. This aligns with previous work in spectroscopy, such as Vestergaard et al. [49], which found that no single preprocessing strategy was the best across analytes. Our findings further show that commonly used methods (e.g., MSC and SNV) underperform when applied without dataset-specific tuning, reinforcing the need for empirical evaluation.
The Bayesian optimization approach offers a significant advantage by automating preprocessing selection, reducing reliance on trial and error. This method efficiently explores the pipeline space, often identifying simple but highly effective two-step combinations that enhance both interpretability and generalizability. Moreover, the transparent and reproducible nature of this approach makes it suitable for broader spectroscopic applications beyond milk, including food quality control and authenticity testing.