1. Introduction
In recent years, with the increasing scarcity of high-quality natural wood resources, the annual supply–demand gap of precious wood in China has exceeded 20 million cubic meters, echoing the global trend reported by FAO (2023) for high-grade timber scarcity [1]. At the same time, the market demand for decorative wood products has continued to grow, exacerbating the structural contradiction between resources and the market. Wood dyeing technology, as a means to improve the decorativeness and added value of artificial forest materials, can impart the visual characteristics of precious hardwoods to fast-growing species, potentially increasing their market value by a factor of 3 to 5 [2]. In recent years, research teams in Europe and North America have explored standardized digital dyeing workflows for engineered wood, emphasizing spectral precision, color reproducibility, and environmental efficiency; however, studies integrating high-dimensional spectral data with robust ML algorithms in wood dyeing remain limited. Moreover, traditional dyeing methods mostly rely on empirical ratios and suffer from problems such as large color difference (ΔE > 5), uneven coloring, and poor recipe reproducibility, which makes it difficult to meet the requirements of high-end markets for product consistency and standardization [3].
In order to improve the accuracy and controllability of the color matching process, researchers have in recent years tried to combine computer-aided color matching with hyperspectral imaging (HSI) technology. By extracting the spectral response information of the material, dye concentration or color parameters are inverted through modeling, avoiding the phenomenon of metamerism [4]. HSI technology has been widely used in forestry fields such as wood species identification, pest and disease detection, and forest fire assessment due to its high resolution of material surface and microstructure [5,6,7], and it also provides a technical basis for high-dimensional feature input of dyeing formulas.
At the same time, machine learning methods have become the mainstream means of building dyeing prediction models. Owing to gradient vanishing and local-optimum problems, the accuracy of traditional backpropagation (BP) neural networks drops by 37% when the sample size is less than 500, and their generalization ability and stability are limited [8]. Although optimization strategies such as particle swarm optimization (PSO) and genetic algorithms can improve model stability [9,10], in industrial real-time scenarios, model convergence speed and interpretability remain significant obstacles. In addition, physical modeling methods such as the Stearns–Noechel full-band spectral color matching model offer high theoretical accuracy, but the noise interference caused by band redundancy, feature collinearity, and algorithmic complexity limits their practical adoption in the dyeing industry [11].
In recent years, some studies have begun to explore the value of explainable machine learning in modeling complex material–process interactions in forestry and beyond [12]. Sarfarazi et al. [13], for example, proposed an interpretable AI framework that integrates machine learning with finite element modeling to predict structural responses in steel materials. Although their study focused on structural engineering, the methodology underscores the broader relevance of coupling data-driven models with physical insights to enhance predictive accuracy and model transparency, principles that are equally applicable to wood dyeing systems characterized by heterogeneous microstructures and nonlinear dye–material interactions. Among the emerging algorithms, CatBoost, a gradient boosting method based on symmetric trees and ordered boosting, has shown superior performance in high-dimensional, small-sample tasks. It has been successfully applied in wood property prediction, moisture content estimation, and species classification, often outperforming traditional models like XGBoost and Random Forest [14]. However, its application in wood dye formulation prediction remains underexplored, particularly for multi-dye systems requiring precise concentration inversion under complex spectral inputs.
Based on this, this paper proposes a hyperspectral intelligent prediction model based on the CatBoost algorithm to establish a predictive framework for multi-dye wood coloring formulation. The research objectives are as follows:
1. Construct a CatBoost prediction model for spectral features: use the CatBoost algorithm combined with random forests to screen sensitive bands and rank feature contributions, addressing issues such as information redundancy and spectral noise in high-dimensional spectral data;
2. Compare and analyze the performance of various prediction algorithms: systematically evaluate the model against traditional algorithms such as XGBoost, random forest (RF), and support vector regression (SVR) in terms of mean squared error (MSE), coefficient of determination (R2), mean absolute error (MAE), and other indicators;
3. Construct a multimodal verification mechanism: introduce a joint characterization method combining hyperspectral imaging and scanning electron microscopy (SEM) to verify the consistency and physical rationality of the prediction model from the perspective of the relationship between microscopic particle distribution and surface reflectance.
The innovation of this study is that the CatBoost algorithm is introduced into the field of wood dyeing color matching prediction for the first time, and a set of intelligent dyeing modeling strategies with high precision, strong interpretability, and industrial scalability is constructed by combining hyperspectral data mining with multi-source characterization methods. The research results are expected to provide data support and technical paths for the intelligent transformation of wood dyeing technology and the high-value utilization of artificial forest resources.
2. Materials and Methods
The overall research process of this paper is shown in Figure 1, covering the following: (1) pretreatment of Scots pine veneer with hydrogen peroxide bleaching; (2) 306 groups of dye concentration gradient dyeing; (3) 400–700 nm hyperspectral reflectance acquisition and pretreatment; (4) sensitive band screening and spectral feature parameter extraction based on random forest; (5) CatBoost model optimization (compared with XGBoost/RF/SVR); (6) multimodal verification (statistical indicators + SEM micro-mechanism). The detailed methods of each step are described below.
2.1. Experimental Materials and Data Collection
In this study, veneer of Pinus sylvestris var. mongolica, purchased from Harbin, Heilongjiang Province, China, was employed as the dyeing substrate. All specimens were cut along the longitudinal grain direction to dimensions of 30 mm × 15 mm × 1 mm. The initial moisture content was 12 ± 1%, determined in accordance with GB/T 1931-2009 [15]. Surface roughness was controlled at Ra = 6.3 ± 0.5 μm. Prior to pretreatment, all samples were conditioned for 7 days in a climate-controlled chamber at 20 ± 2 °C and 65 ± 5% relative humidity to ensure moisture equilibrium. The complete experimental procedure is illustrated in Figure 2, which includes both the workflow of the dyeing process and the molecular structures of the three reactive dyes employed: reactive red X-3B, reactive yellow X-RG, and reactive blue.
1. Dyes and reagents
Dyes: Three reactive dyes were selected for this study based on their complementary absorption characteristics within the visible spectrum. Reactive yellow X-RG exhibits a primary absorption peak near 450 nm in the blue light region, with its azo group (–N=N–) showing strong absorption between 400 and 500 nm. Reactive red X-3B features a main absorption peak around 550 nm in the green light region, attributed to the anthraquinone structure’s characteristic absorption between 500 and 600 nm. Reactive blue demonstrates significant absorption near 640 nm in the orange-red region, where the copper phthalocyanine complex facilitates intense π→π* electronic transitions. All dyes were of analytical grade with purity ≥95%, supplied by Beijing Chemical Plant. This combination of dyes effectively covers the major regions of the visible spectrum and serves as a theoretical foundation for the subsequent screening of sensitive spectral bands using a random forest model.
Auxiliary reagents: 4% H2O2 solution (bleach, analytical grade), aqueous JFC penetrant (industrial grade), anhydrous Na2CO3 (fixing agent, analytical grade), and NaCl (dyeing accelerator, analytical grade).
2. Bleaching and pretreatment
To remove impurities from wood and improve dye penetration, the veneer was treated in a bleaching solution consisting of 4% H2O2 (500 mL), Na2SiO3 (0.5 g, stabilizer), and Na3PO4 (0.5 g, buffer) in a 65 °C constant-temperature water bath for 2 h (liquid-to-material ratio 1:20), and the pH value was controlled at 10.5 ± 0.2. After treatment, the sample was fully rinsed with deionized water and dried at 60 °C to a moisture content of 6–8% (according to standard GB/T 1931-2009).
3. Dyeing and fixation
In the dyeing stage, a total of 306 groups of different dye concentration ratios were designed. For each group, 500 mL of dye solution was prepared, 0.5 mL of aqueous JFC and 15 g·L⁻¹ NaCl were added, and 6 veneers were placed in each group during the dyeing process. The pretreated samples were dyed in a 65 °C water bath for 2 h, and then 20 g·L⁻¹ Na2CO3 was added for fixation for 30 min [16]. Finally, they were rinsed thoroughly with clean water and dried at 60 °C for 2 h to remove floating color.
4. Hyperspectral data acquisition
The sample reflectance data was collected using the HySpex VNIR-1800 hyperspectral imaging system (Norsk Elektro Optikk, Oslo, Norway), with a spectral range of 400–1000 nm and a spectral resolution of 2.6 nm. The acquisition settings were as follows: integration time of 100 ms, frame rate of 30 fps, spatial resolution of 0.2 mm/pixel, and a signal-to-noise ratio greater than 300:1. Dark and white reference calibration was performed every 30 min to ensure spectral stability. All data were collected under controlled environmental conditions (23 ± 1 °C, 50 ± 5% RH).
According to the spectral response characteristics of the visible light band (400–700 nm) relevant to wood dyeing, the reflectance data in this range were extracted for subsequent analysis. The pixel-average spectrum of a homogeneous area on the surface of each veneer sample was obtained using ENVI 5.3 software (Harris Geospatial Solutions, Broomfield, CO, USA). In total, 306 sets of hyperspectral reflectance spectra were obtained for model training and validation (see Supplementary Data S1 for the complete dataset).
2.2. Hyperspectral Data Processing and Feature Engineering
2.2.1. Spectral Data Denoising and Smoothing Methods
In order to improve the data reliability of model training, this study systematically preprocessed the collected hyperspectral reflectance data in three steps, namely outlier removal, noise smoothing, and scale normalization, as shown in Figure 3.
First, in order to eliminate outlier data that may appear in the measurement (caused by light source fluctuations, equipment jitter, etc.), the interquartile range (IQR) method is used to identify spectral anomalies, the threshold range is set to [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR], and samples outside this range are eliminated.
Subsequently, the Savitzky–Golay convolution smoothing algorithm (window width of 11 band points, second-order polynomial fitting) was applied to denoise the spectral curves, retaining spectral edge features while suppressing high-frequency noise interference. This algorithm has been widely used in plant and wood hyperspectral processing because of its ability to preserve the shape of nonlinear hyperspectral responses [17].
Finally, the reflectance values of all bands were linearly compressed to the [0, 1] interval through Min-Max scaling to unify the data scale, avoid model weight bias caused by dimensional differences, and provide standardized input for subsequent sensitive band screening and machine learning modeling [18].
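The three preprocessing steps above can be sketched in Python. The function below is an illustrative pipeline, not the authors' exact implementation: the array layout, the per-sample outlier statistic, and the global Min-Max range are assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess_spectra(spectra):
    """Illustrative preprocessing for reflectance spectra
    (rows = samples, columns = bands): IQR outlier removal,
    Savitzky-Golay smoothing (window 11, 2nd-order polynomial),
    and Min-Max scaling to [0, 1]."""
    # 1. IQR screening on the per-sample mean reflectance (assumed statistic)
    means = spectra.mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    iqr = q3 - q1
    mask = (means >= q1 - 1.5 * iqr) & (means <= q3 + 1.5 * iqr)
    kept = spectra[mask]

    # 2. Savitzky-Golay smoothing along the band axis
    smoothed = savgol_filter(kept, window_length=11, polyorder=2, axis=1)

    # 3. Min-Max scaling of the whole array to [0, 1]
    lo, hi = smoothed.min(), smoothed.max()
    return (smoothed - lo) / (hi - lo)
```

Scaling globally (rather than per band) preserves the relative band-to-band shape of each spectrum; a per-band scaler is the common alternative when bands differ greatly in magnitude.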
2.2.2. Sensitive Band Screening
In the study of wood dyeing formulation prediction, although traditional approaches typically use the full spectral range of 400–700 nm as model input, not all bands exhibit equal sensitivity to changes in dye concentration. For instance, reactive yellow dye shows strong absorption characteristics around 450 nm (blue region), making the reflectance in this band highly informative for color matching prediction. In contrast, reflectance variations in the far-red or green regions tend to be more stable. Therefore, identifying key sensitive bands within the spectrum is crucial for improving prediction accuracy.
To screen these critical bands and assess their importance, this study introduces the random forest (RF) algorithm. RF has demonstrated strong stability and robustness in handling high-dimensional, nonlinear data, making it well-suited for hyperspectral modeling in this multi-dye mixture system [19]. In this work, the Gini Index is used as the basis for evaluating feature importance. The specific steps are as follows:
1. Impurity Calculation

For each decision tree node $m$, the Gini Index is defined as

$$\mathrm{Gini}(m) = 1 - \sum_{k=1}^{K} p_{mk}^{2} \tag{1}$$

where $p_{mk}$ is the proportion of samples in node $m$ belonging to concentration class $k$. This index reflects the impurity of the concentration distribution at the current node.

2. Impurity Reduction by Splitting

When a feature $X_j$ (i.e., a specific spectral band) is used to split node $m$, the resulting impurity reduction is calculated as

$$\Delta \mathrm{Gini}(m, X_j) = \mathrm{Gini}(m) - \frac{n_L}{n_m}\,\mathrm{Gini}(m_L) - \frac{n_R}{n_m}\,\mathrm{Gini}(m_R) \tag{2}$$

where $\mathrm{Gini}(m_L)$ and $\mathrm{Gini}(m_R)$ represent the Gini indices of the left and right child nodes, respectively, and $n_L$, $n_R$ are the numbers of samples in the child nodes ($n_m = n_L + n_R$). This quantifies the effectiveness of feature $X_j$ in partitioning node $m$.

3. Feature Importance Aggregation

The total feature importance for $X_j$ is computed by summing its contributions across all trees in the forest:

$$\mathrm{VIM}_j = \sum_{t=1}^{T} \sum_{m \in M_{t,j}} \Delta \mathrm{Gini}(m, X_j) \tag{3}$$

where $M_{t,j}$ denotes the set of all nodes in the $t$-th tree that are split using feature $X_j$. The resulting importance scores are normalized so that $\sum_{j} \mathrm{VIM}_j = 1$.
By incorporating Gini-based feature importance analysis, this method enables the identification of the most representative and sensitive spectral regions within the 400–700 nm range. The importance score quantifies each band’s contribution to improving node purity in concentration-based classification, thus revealing the physical nature of spectral responses: bands with high values stem from specific dye absorption mechanisms, where reflectance decreases monotonically with concentration; conversely, low values typically indicate weak absorption regions dominated by scattering noise from the wood substrate. Excluding such noisy bands enhances model robustness and allows for a ~50% reduction in input dimensionality, optimizing industrial detection efficiency.
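As an illustration of this screening step, the sketch below uses scikit-learn's impurity-based feature importances to rank bands and keep roughly half of them. The estimator settings and the `top_frac` parameter are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rank_bands(X, y, wavelengths, top_frac=0.5):
    """Rank spectral bands by random-forest impurity-based importance
    and keep the most informative fraction (~50% dimensionality cut)."""
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X, y)
    imp = rf.feature_importances_            # normalized: sums to 1
    order = np.argsort(imp)[::-1]            # most important bands first
    n_keep = max(1, int(len(wavelengths) * top_frac))
    return [wavelengths[i] for i in order[:n_keep]], imp
```

In practice one would replace `top_frac` with the cumulative-importance threshold described above (stop once the kept bands reach ~80% of total importance).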
2.2.3. Dye Concentration Optimization
Figure 4 shows the visible spectral reflectance curves of Pinus sylvestris var. mongolica veneer under different concentrations (0.25, 0.5, 0.75, and 1.0 g·L⁻¹) of the three dyes: reactive red, reactive yellow, and reactive blue. Preliminary experiments found that when the dye concentration exceeded 0.75 g·L⁻¹, the spectral response showed obvious nonlinear characteristics. To explore the underlying mechanism, this paper further combined scanning electron microscope (SEM) images to analyze the aggregation state of dye particles at high concentrations and their influence on the spectral response.
2.2.4. Spectral Feature Parameter Extraction
As the dye concentration changes, the characteristic parameters in the spectral data change accordingly. For example, the blue edge amplitude can reflect the absorption characteristics of the dye in the blue region, especially for dyes with significant blue-light absorption; this parameter can be used to distinguish and quantify the concentration of such dyes. Green peak reflectance directly reflects the reflectance characteristics of the dye in the green region; different dyes change the reflectance of wood in this region, and green peak reflectance can be used to quantify this change. Therefore, selecting key spectral characteristic parameters helps to identify the presence and concentration of specific dyes. Common spectral characteristic parameters and their corresponding meanings are shown in Table 1.
2.3. Research Methods
This study used four machine learning algorithms, CatBoost, XGBoost, random forest (RF), and support vector regression (SVR), to build prediction models for the wood dyeing formula. The dataset (306 groups in total) was divided into a training set (214 groups) and a test set (92 groups) in a ratio of 7:3 to evaluate the generalization ability of the models. To improve reliability, 8-fold cross-validation (8-fold CV) was used to assess training stability, and key hyperparameters were optimized via grid search. Finally, evaluation metrics were defined to compare the algorithms.
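The split and cross-validation setup can be sketched as follows on synthetic stand-in data. Only the 306-sample size, the 7:3 split, and the 8 folds come from the text; the features, target, and regressor are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data with the study's sample count (306)
rng = np.random.default_rng(42)
X = rng.normal(size=(306, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=306)

# 7:3 split -> 214 training / 92 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 8-fold cross-validation on the training set to check stability
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_train, y_train,
    cv=KFold(n_splits=8, shuffle=True, random_state=0),
    scoring="r2",
)
print(len(X_train), len(X_test), round(scores.mean(), 3))
```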
2.3.1. Category Gradient Boosting
CatBoost is an ensemble learning algorithm based on gradient boosting decision trees, designed for efficient processing of categorical features and small-sample data scenarios. Compared with traditional GBDT algorithms (such as XGBoost and LightGBM), its core innovations are the symmetric tree structure and ordered boosting technology, which can effectively mitigate the noise interference and feature collinearity problems in high-dimensional spectral data, especially in scenarios with small samples or a high proportion of categorical features [20].
The training process of CatBoost consists of multiple weak learners (decision trees). Each new decision tree is built based on the residual of the current model (the difference between the predicted value and the true value). Through multiple rounds of iteration, the overall performance is gradually improved.
In terms of formulation, the objective function of CatBoost is similar to that of gradient boosting, as shown in Formula (4):

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; F_{t-1}(x_i) + f_t(x_i)\right) \tag{4}$$

where $F_{t-1}$ is the ensemble after $t-1$ rounds and $f_t$ is the decision tree added in round $t$.

One of the most important innovations of CatBoost is its processing of categorical features. Traditional methods (such as one-hot encoding) are prone to producing high-dimensional sparse matrices, whereas CatBoost adopts a target encoding strategy that dynamically calculates the statistical correlation between categorical features and target values. Formula (5) is

$$\hat{x}_k = \frac{\sum_{j \in \mathrm{past}(k)} y_j + \alpha \cdot P}{N_{\mathrm{past}} + \alpha} \tag{5}$$

where $y_j$ are the target values of the preceding training samples that share the category of sample $k$, $N_{\mathrm{past}}$ is the number of such past samples, $P$ is the global prior (the mean target value), and $\alpha$ is a smoothing parameter that prevents overfitting when the frequency of a category is too low.
To solve the target leakage problem in traditional gradient boosting, CatBoost uses an ordered learning (ordered boosting) strategy: when making a prediction for a sample, only information from the samples preceding it is used to build the decision tree, thereby eliminating the dependence on future data. This strategy significantly improves the robustness of the model in small-sample scenarios.
Thanks to the above design, CatBoost is particularly suitable for dealing with nonlinear feature selection, noise interference, and collinearity problems in high-dimensional spectral data, and has demonstrated superior performance in many fields such as image recognition, text classification, and hyperspectral regression. In this paper, it is applied to the wood dyeing formula prediction task to improve the model’s ability to model the complex mapping relationship between reflectance spectrum and dye concentration.
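To make the ordered target-statistics idea concrete, the pure-NumPy sketch below encodes each sample's category using only the samples that precede it, as in Formula (5). It is a didactic illustration of the principle, not the CatBoost library's implementation (which also averages over random permutations).

```python
import numpy as np

def ordered_target_encoding(categories, targets, alpha=1.0):
    """Ordered target statistics: each sample's category is encoded
    from *preceding* samples only, smoothed toward the global prior,
    so no target leakage from future data can occur."""
    prior = float(np.mean(targets))
    sums, counts = {}, {}
    encoded = np.empty(len(targets))
    for i, (c, y) in enumerate(zip(categories, targets)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + alpha * prior) / (n + alpha)  # past info only
        sums[c] = s + y                                 # update AFTER encoding
        counts[c] = n + 1
    return encoded
```

Note that the first occurrence of any category is encoded as the global prior, since no past statistics exist for it yet.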
2.3.2. Extreme Gradient Boosting
XGBoost (Extreme Gradient Boosting) is an efficient ensemble learning algorithm based on the gradient boosting machine (GBM) framework, widely used in regression and classification tasks [21]. Its core idea is to iteratively train weak learners (decision trees) and combine their predictions in an additive manner to gradually approach the optimum of the objective function. Compared with the traditional gradient boosting algorithm, XGBoost significantly improves model accuracy and training efficiency by introducing regularization terms, parallel computing, and tree-structure optimization strategies.
XGBoost’s objective function consists of a loss function and a regularization term, as shown in Equation (6), which aims to balance the model’s fitting ability and complexity:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t), \qquad \Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} \tag{6}$$

where $\mathcal{L}^{(t)}$ is the loss function for the $t$-th iteration, $\gamma$ and $\lambda$ are the hyperparameters controlling tree complexity and leaf-node weight regularization, respectively, to effectively suppress overfitting, $T$ is the number of leaf nodes in the tree, and $w_j$ is the weight of the $j$-th leaf.
XGBoost achieves multi-threaded parallelism in the feature split-point evaluation stage, accelerating model training. At the same time, each iteration updates the model through additive training, namely

$$F_t(x) = F_{t-1}(x) + \eta f_t(x) \tag{7}$$

where $\eta$ is the learning rate (default $\eta = 0.1$), which is used to suppress the contribution of a single tree and prevent overfitting.
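The additive update with shrinkage in Equation (7) can be illustrated with a minimal boosting loop: each round fits a small tree to the current residuals and adds a scaled copy to the ensemble. This is plain residual boosting with scikit-learn trees; it omits XGBoost's regularized objective and split-finding machinery.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(X, y, n_rounds=50, eta=0.1, max_depth=2):
    """Sketch of additive training F_t = F_{t-1} + eta * f_t:
    each round fits a shallow tree to the residuals and the
    shrunken prediction is added to the running ensemble."""
    pred = np.full(len(y), y.mean())  # F_0: constant baseline
    trees = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, y - pred)          # fit current residuals
        pred += eta * tree.predict(X)  # shrunken additive update
        trees.append(tree)
    return trees, y.mean(), pred
```

The small `eta` means each tree corrects only a fraction of the remaining error, which is exactly the overfitting control the text describes.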
2.3.3. Random Forest
The basic principles of the random forest (RF) method were briefly introduced above and are not repeated here. In this study, the accuracy and generalization ability of the RF model are mainly governed by two parameters: the number of decision trees and the leaf size. Specifically, the number of trees was varied between 0 and 3000, and the leaf size (i.e., the minimum number of samples per leaf node) was set in the range of 1 to 10. Experimental verification showed that when the number of trees was increased to 200 and two different leaf size settings were used, the model performance was more stable and the prediction results more reliable [22].
In addition, in the modeling process of the random forest, a bootstrap sampling strategy is adopted to randomly generate multiple training subsets (denoted as S1, S2, …, Sn), each of which is used to train a corresponding decision tree (denoted as R1, R2, …, Rn). Finally, the prediction outputs of all decision trees are aggregated to produce the overall model prediction [23].
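A minimal sketch of the bootstrap step: each subset S_i is drawn with replacement from the training indices, so each tree R_i sees a different resample containing on average about 63.2% of the unique samples.

```python
import numpy as np

def bootstrap_subsets(n_samples, n_trees, seed=0):
    """Generate bootstrap index sets S_1..S_n for a random forest:
    each subset is sampled with replacement and has the same size
    as the original training set."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_samples, size=n_samples)
            for _ in range(n_trees)]
```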
2.3.4. Basic Principles of SVR
Support vector regression (SVR) is an extended form of the support vector machine (SVM) designed for regression tasks. Its core idea is to construct an ε-insensitive band so that most sample points fall within this band, thereby achieving robust fitting while maintaining the sparsity of the model. The optimization goal of SVR is to find a hyperplane that all sample points lie as close to as possible [24]. That is, for a given sample set $\{(x_i, y_i)\}_{i=1}^{n}$, we seek the regression function

$$f(x) = w^{\top}\phi(x) + b \tag{8}$$

obtained by solving

$$\min_{w,\, b,\, \xi,\, \xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right) \tag{9}$$

At the same time, the constraints are

$$y_i - w^{\top}\phi(x_i) - b \le \varepsilon + \xi_i, \qquad w^{\top}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \qquad \xi_i,\ \xi_i^{*} \ge 0 \tag{10}$$

where $w$ is the model weight, $b$ is the bias term, $\phi(x)$ is the feature vector after kernel function mapping, $\xi_i$ and $\xi_i^{*}$ are the slack variables, $C$ is the regularization parameter, and $\varepsilon$ is the insensitive loss value.

In SVR, a deviation of up to $\varepsilon$ is tolerated, and the loss function is

$$L_{\varepsilon}\left(y, f(x)\right) = \max\left(0,\ \lvert y - f(x) \rvert - \varepsilon\right) \tag{11}$$

When $\lvert y - f(x) \rvert \le \varepsilon$, the sample is treated as lossless; when the sample lies outside the band, the loss is $\lvert y - f(x) \rvert - \varepsilon$. As $\varepsilon$ becomes larger, the sensitivity of the model decreases, which can lead to "under-learning", while an $\varepsilon$ that is too small can lead to "over-learning".
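The ε-insensitive loss of Equation (11) is a one-line NumPy function, shown here to make the dead-zone behavior explicit:

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """SVR's epsilon-insensitive loss: deviations within the eps-band
    cost nothing; beyond it the loss grows linearly as |error| - eps."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)
```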
2.3.5. Hyperparameter Optimization
In order to improve the generalization ability of the model and suppress the risk of overfitting, this study used k-fold cross-validation combined with a grid search strategy for hyperparameter optimization [25].
Specifically, the entire dataset is first divided into a training set and a test set. Subsequently, k-fold cross-validation is performed on the training data, that is, the training set is evenly divided into k subsets (folds), one of which is selected as the validation set each time, and the remaining k − 1 subsets are used for model training. This process is repeated k times to ensure that each subset is used as a validation set for training and evaluation. By combining the validation performance indicators of each fold, the stability and generalization ability of the model can be effectively evaluated.
In the hyperparameter tuning stage, k-fold cross-validation and grid search are combined to traverse the preset hyperparameter combinations, and a complete cross-validation evaluation is performed for each combination. Finally, the hyperparameter set with the best validation performance is selected for the final training and testing of the model. This method significantly improves the overall performance of the model and ensures the robustness and reliability of the results. The specific process is shown in Figure 5.
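This combination maps directly onto scikit-learn's GridSearchCV. The sketch below runs an 8-fold grid search on synthetic stand-in data; the grid values are illustrative, not the paper's actual hyperparameter ranges.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (features and target are placeholders)
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Every grid point is scored by full 8-fold cross-validation;
# the best combination is then refit on the whole training set.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "min_samples_leaf": [1, 5]},
    cv=KFold(n_splits=8, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```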
2.4. Model Evaluation
After testing the primary assumptions of the model, it is essential to evaluate the usefulness and predictive ability of the proposed approach [26]. To comprehensively assess the performance of the dye ratio prediction model, this study employed four commonly used statistical metrics: mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R2). Among them, MSE and MAE measure the numerical deviation between the predicted and actual values, MAPE provides a unit-free assessment of relative error, and R2 reflects the model’s ability to explain the overall variability in the data. The definitions of these metrics are as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i - \hat{y}_i \right\rvert$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert, \qquad R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}$$

Here, $y_i$ denotes the true dye concentration of the sample, $\hat{y}_i$ is the predicted value generated by the model, $\bar{y}$ is the mean of the true values, and $n$ represents the total number of samples in the test set.
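These four metrics can be computed directly from the predictions; a compact NumPy implementation of the standard definitions (MAPE assumes no zero true values):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute MSE, MAE, MAPE (%), and R2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "MAE": mae, "MAPE": mape, "R2": r2}
```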
In addition, to further evaluate the model’s fitting performance across different concentration ranges, the following sections incorporate residual plots for structural error analysis. Residual plots provide a visual means to assess whether systematic bias exists at extreme concentration levels, thereby helping to identify potential issues such as overfitting or underfitting. Through this multi-perspective evaluation framework, the model is assessed comprehensively in terms of numerical accuracy, relative error performance, and structural fit, ensuring its practical applicability and robustness in predicting dye ratios for wood coloration.
In summary, the main objective of this section is to validate the model’s performance using appropriate evaluation methods to support accurate dye ratio prediction. In this process, model construction and performance evaluation constitute the two critical phases of the modeling workflow. The overall methodological framework is illustrated in Figure 6.
3. Results
3.1. Evaluation of Spectral Data Denoising and Smoothing
In this study, 822 sets of hyperspectral samples (a total of 115 bands) of Pinus sylvestris var. mongolica were systematically preprocessed, including outlier detection, smoothing and denoising, and scale normalization. The results show that all reflectance data points are within the reasonable range of [244.48, 852.76], and no outliers are found, indicating that the original data collection quality is high. After processing with the Savitzky–Golay algorithm (window width 11 points), the standard deviation of the first five channels of the sample is reduced from 4.5077 to 3.7025, the noise is significantly weakened, and the spectral line profile is well maintained. After normalization, all data are successfully mapped to the [0, 1] interval, which effectively unifies the dimension and improves the data consistency, providing a stable and reliable input basis for subsequent modeling.
3.2. Key Band Screening Results Verification
In this study, the random forest algorithm was employed to analyze the feature importance of hyperspectral reflectance data within the 400–700 nm range, aiming to identify the key sensitive bands that contribute most to dye concentration prediction. As shown in Figure 7, the importance scores of the 400–450 nm, 550–600 nm, and 600–650 nm bands were 0.330, 0.266, and 0.201, respectively, with a cumulative contribution rate of 79.7%, significantly higher than that of the remaining bands. This roughly 80/20 distribution, in which a small subset of bands carries most of the importance, justifies excluding the less informative bands once the cumulative importance of the selected bands approaches 80%. A similar threshold-based strategy was adopted by Sarfarazi et al. in spectral prediction studies for materials [13], confirming the academic validity and practical value of this feature selection approach (detailed feature importance scores for all wavelengths are provided in Supplementary Data S2).
Notably, the 400–450 nm and 550–600 nm bands exhibited the highest importance scores, corresponding to the blue and yellow-green regions of the spectrum, respectively. This aligns well with the characteristic absorption behavior of the three primary dyes used. Meanwhile, the red region (600–650 nm) also demonstrated the third highest importance, further validating the intrinsic connection between the dyeing process and the trichromatic composition of visible light.
To further verify the physical rationality of the above sensitive bands, this study used three monochromatic dyes, namely reactive red, reactive yellow, and reactive blue, to dye Pinus sylvestris var. mongolica veneer in groups, set concentration gradients (0.25, 0.50, 0.75, and 1.00 g·L⁻¹), and collected hyperspectral reflectance data under each group of conditions. The results showed that with increasing dye concentration, the reflectance of the 400–450 nm, 550–600 nm, and 600–650 nm bands decreased most significantly, showing a strong concentration-to-reflectance response that is highly consistent with the feature importance analysis. Based on the combined feature importance evaluation and the experimental response trends, 400–450 nm, 550–600 nm, and 600–650 nm were determined as the key bands for dyeing ratio prediction in this study.
3.3. Analysis of Dye Concentration Optimization Results
As the dye concentration increases, the distribution state of dye particles on the wood surface changes significantly. As shown in Figure 8, the particles formed by the red dye at different concentrations show a trend from dispersion to aggregation on the wood surface. Specifically, at low concentrations, the particles are relatively sparse and evenly distributed; as the concentration increases, the particles gradually accumulate and the number of surface aggregates rises significantly, indicating that the adsorption capacity of the dye on the wood surface increases with concentration. Figure 8a–d corresponds to the surface microstructure of the dyed veneer at dye concentrations of 0.25, 0.50, 0.75, and 1.00 g·L⁻¹, respectively. In order to avoid the influence of abnormal particle distribution under extreme concentration conditions on the spectral modeling results, this study limited the experimental concentration to 0.80 g·L⁻¹ to ensure the stability and predictability of the spectral response.
3.4. Analysis of Spectral Characteristic Parameter Extraction Results
In order to further understand how the spectral characteristic parameters change with dye concentration, we first fixed the concentrations of two dyes in the mixed dye solution and then observed how the characteristic parameters varied with the concentration of the third dye. To ensure the generality of the observations, we set different gradient values for the two fixed dye concentrations. The results are shown in Table 2, Table 3 and Table 4.
The experimental results show that when the red dye concentration increases from 0.0 to 0.5 g·L−1, the blue edge area decreases nonlinearly (e.g., from −313.8 to −677.9 under certain combinations of yellow and blue concentrations), and the reflectance of both the green peak and the red valley declines systematically. As the yellow dye concentration increases, the yellow edge area declines continuously; for example, at a red/blue concentration of 0.3 g·L−1, it decreases from −858.4 to −1307.4. Increasing the blue dye concentration markedly raises the blue edge amplitude, a pronounced response in the short-wave region (e.g., from 9.89 to 15.21), and several other spectral characteristic parameters change in a coordinated manner.
These results indicate that the spectral characteristic parameters respond well to changes in dye concentration within the typical concentration range. Parameters such as blue edge amplitude, green peak reflectance, and red valley reflectance effectively capture changes in the dye ratio and serve as valuable model inputs. Based on these findings, we constructed an enhanced dataset by integrating the high-importance spectral characteristic parameters (Blue_Area, Green_Peak, Red_Valley) with the previously identified sensitive bands (400–450 nm, 550–600 nm, 600–650 nm), creating a comprehensive feature set for improved prediction accuracy (see Supplementary Data S3).
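The Table 1 descriptors used in the enhanced feature set can be computed directly from a reflectance spectrum. The following minimal sketch derives the blue edge amplitude and area, green peak, and red valley from the window definitions in Table 1; the function and key names are illustrative, not the paper's code:

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal area (written out to avoid NumPy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def spectral_parameters(wavelengths, reflectance):
    """Descriptors per Table 1: blue edge 490-530 nm, green peak
    510-560 nm, red valley 640-680 nm."""
    d = np.gradient(reflectance, wavelengths)           # first-derivative spectrum
    blue = (wavelengths >= 490) & (wavelengths <= 530)  # blue edge window
    green = (wavelengths >= 510) & (wavelengths <= 560) # green peak window
    red = (wavelengths >= 640) & (wavelengths <= 680)   # red valley window
    return {
        "Blue_Amplitude": float(d[blue].max()),           # max first derivative
        "Blue_Area": _trapz(d[blue], wavelengths[blue]),  # area under derivative
        "Green_Peak": float(reflectance[green].max()),    # raw-spectrum maximum
        "Red_Valley": float(reflectance[red].min()),      # raw-spectrum minimum
    }
```

Concatenating these descriptors with the reflectance of the three sensitive bands yields the enhanced feature vector described above.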
3.5. Analysis of Hyperparameter Optimization Results
As the description of k-fold cross-validation indicates, the training dataset strongly influences the final hyperparameters, so appropriate hyperparameters must be selected according to the data type and range. We used the data before and after processing as model inputs and compared the optimal hyperparameters obtained under each input. The hyperparameter optimization results are shown in Table 5 and Table 6.
Table 5 shows that, with the original data as input, the hyperparameters of the CatBoost, XGBoost, RF, and SVR models tend toward conservative configurations, such as lower learning rates, fewer iterations, and smaller depth parameters. In Table 6, the optimal hyperparameter combinations after data processing are generally more complex (e.g., greater depth, more iterations), reflecting stronger fitting ability and expressive capacity under the optimized data conditions.
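A cross-validated grid search of the kind behind Tables 5 and 6 can be sketched with scikit-learn's GridSearchCV. Because the catboost package may not be available in every environment, GradientBoostingRegressor stands in for CatBoost here; the parameter names and value grids are an illustrative subset of those in the tables:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative search space echoing the iterations/depth/learning-rate
# axes of Tables 5-6 (values are a small subset).
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}

def tune(X, y, cv=3):
    """Return the best hyperparameters found by k-fold grid search,
    selecting on mean squared error."""
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid,
        cv=cv,
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    return search.best_params_
```

Running the same search on the raw and the preprocessed feature matrices makes the shift toward more complex optima (Table 5 vs. Table 6) directly observable.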
3.6. K-Fold Cross-Validation Results Analysis
This study adopts a cross-validation framework spanning k = 2 to k = 10 and evaluates the impact of data preprocessing (feature standardization and outlier removal) on model generalization through a stratified resampling strategy based on target-variable quantiles. Models were then trained and tested at k = 2, k = 4, k = 6, and k = 8 using random, non-overlapping partition folds, and the corresponding performance indicators were recorded. Finally, stratified repeated cross-validation was used to evaluate model effectiveness, as shown in Table 7.
The results show that the preprocessed CatBoost model achieved optimal predictive performance under 6-fold cross-validation (MSE = 0.00271 g2·L−2, MAPE = 3.134%, MAE = 0.0349 g·L−1), representing a 28.7% reduction in MSE compared to the untreated baseline (p < 0.01, paired t-test). As the number of folds increased to k = 8, the model’s mean absolute percentage error (MAPE) varied by less than 0.4 percentage points (from 3.13% at k = 6 to 3.49% at k = 8), indicating that hyperparameter optimization effectively suppressed overfitting. Ultimately, the relative prediction error of CatBoost stabilized within ±2.5% (i.e., the percentage deviation between the predicted and actual concentrations for individual samples), meeting the accuracy requirements for engineering cost estimation (ΔE < 1.75 corresponds to <5% concentration error).
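The quantile-based stratification described in this section can be sketched as follows; the function name and the choice of five quantile bins are assumptions, since the paper does not publish its splitting code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def quantile_stratified_folds(y, k, n_bins=5, seed=0):
    """Yield (train, test) index arrays stratified by target quantile,
    so every fold spans the full concentration range."""
    ranks = np.argsort(np.argsort(y))   # rank of each target value
    bins = (ranks * n_bins) // len(y)   # equal-size quantile bins 0..n_bins-1
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    yield from skf.split(np.zeros(len(y)), bins)
```

Binning the continuous concentrations before splitting is what prevents a fold from containing only low- or only high-concentration samples, which would otherwise inflate the fold-to-fold variance of MSE and MAPE.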
3.7. Model Evaluation Results
When the unoptimized models are trained on the dataset before and after data processing, the resulting mean squared error and running speed are shown in Table 8. As the core algorithm of this study, CatBoost demonstrated excellent prediction accuracy and computational efficiency after spectral optimization. With sensitive band screening and feature descriptor extraction, the mean squared error (MSE) of CatBoost decreased by 10.6%, outperforming XGBoost and random forest in absolute error minimization. Notably, its training speed also improved substantially, owing to its ordered boosting mechanism and the natural advantages of its symmetric tree structure in processing spectral gradients. Among the tree models, CatBoost achieved both the lowest MSE and the fastest inference time, verifying its inherent advantages in processing high-dimensional optical data with quasi-continuous features. This performance gain further demonstrates that the synergy between feature engineering and CatBoost's gradient bias suppression makes accurate real-time monitoring of dye concentration feasible.
The results of comparing the prediction models under the same preprocessing pipeline (sensitive band screening and spectral feature enhancement) are shown in Table 9. CatBoost demonstrated excellent predictive ability in estimating the wood dye ratio. On the independent test set (92 groups of samples), the MSE of CatBoost is 0.00271, 9.9% lower than that of XGBoost, 12.3% lower than that of random forest, and 27.1% lower than that of SVR. This advantage holds across all evaluation indicators: the MAE is 0.0349, versus 0.0353–0.0405 for the other models, and the MAPE is 3.13%, significantly better than the 3.24%–5.65% range of the other models. The average error of CatBoost is less than 6% relative to the true value, giving it good error tolerance in practical applications.
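The evaluation indicators quoted in Tables 7–9 are standard and can be computed as follows (a minimal sketch; the function name is illustrative, and MAPE is returned in percent to match the tables):

```python
import numpy as np

def regression_report(y_true, y_pred):
    """MSE, MAE and MAPE (in percent) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "MSE": float(np.mean(err ** 2)),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / y_true)) * 100.0),
    }
```

Note that MAPE is undefined for zero targets, so in this setting it is computed only on samples with nonzero dye concentration.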
Figure 9 demonstrates the high prediction accuracy of the GPU-accelerated CatBoost model across all three dye systems, with R2 values exceeding 0.95 (red: 0.9605; yellow: 0.9554; blue: 0.9541). The residual scatter plots reveal points randomly distributed around the zero line without systematic patterns or heteroscedasticity, indicating consistent prediction accuracy across the entire concentration range. The residual histograms approximate normal distributions with near-zero means and small standard deviations (σ < 0.02), confirming the absence of systematic bias. This performance can be attributed to CatBoost's ordered boosting mechanism, which captures the nonlinear spectral–concentration relationships while suppressing gradient estimation bias, a property particularly important for the small-sample scenario (n = 306) in this study. The uniform prediction accuracy across dye types further validates the model's robustness in handling the distinct absorption characteristics of each dye system, making it a reliable tool for industrial formulation prediction.
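The residual diagnostics summarized in Figure 9 (near-zero mean, small σ, high R2) reduce to a few statistics; a minimal sketch, with the function name as an assumption:

```python
import numpy as np

def residual_diagnostics(y_true, y_pred):
    """Residual mean, sample standard deviation and R2 for a fit."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    res = y_pred - y_true
    ss_res = float(np.sum(res ** 2))                      # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2)) # total sum of squares
    return {
        "mean": float(res.mean()),
        "sigma": float(res.std(ddof=1)),
        "R2": 1.0 - ss_res / ss_tot,
    }
```

A residual mean near zero and a σ well below the smallest concentration step are the quantitative counterparts of the "no systematic bias" reading of the histograms.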
The overall performance ranking is CatBoost > XGBoost > random forest > SVR, confirming the compatibility advantage of tree-based models in processing spectrally enhanced features. In particular, the ordered boosting mechanism adopted by CatBoost reduces gradient bias by 18.7% relative to the traditional gradient boosting implementation. In addition, CatBoost trains 36.4% faster than the baseline model and keeps the MAPE within 4%, making it a feasible solution for the <5% error tolerance required by industrial real-time monitoring systems.
4. Discussion
4.1. Analysis of Sensitive Band Mechanism
This study identified 400–450 nm, 550–600 nm, and 600–650 nm as the key sensitive bands for dye concentration prediction through random forest feature importance evaluation (Formulas (1)–(3); Figure 7), with importance scores of 0.330, 0.266, and 0.201, respectively. This result is highly consistent with the complementary-color absorption characteristics of the dye molecules: the 400–450 nm band mainly responds to the absorption of blue light by the reactive yellow dye (X-RG), whose azo group (–N=N–) has a strong absorption peak near 450 nm (Figure 4b); the 550–600 nm band corresponds to the absorption of green light by the reactive red dye (X-3B), whose anthraquinone structure produces characteristic absorption near 550 nm (Figure 4a); and the 600–650 nm band reflects the absorption of orange-red light by the reactive blue dye, related to the π→π* electronic transition of its copper phthalocyanine complex (Figure 4c).
As the dye concentration increases (0.25→1.00 g·L−1), the reflectance of these bands decreases systematically (Figure 4), confirming their concentration sensitivity. Notably, at concentrations above 0.75 g·L−1, the SEM images show that dye particles aggregate on the wood surface (Figure 8d), producing nonlinear scattering in the 600–650 nm band, which is consistent with the relatively low importance score of that band (0.201).
4.2. Advantages of CatBoost Model
CatBoost has demonstrated remarkable advantages in the task of wood dyeing formulation prediction, primarily due to its ordered boosting mechanism, symmetric tree structure, target-based encoding, and support for high-dimensional inputs. As illustrated in Figure 10, these four core features collectively endow CatBoost with the following benefits:
Strong Noise Suppression Capability:
CatBoost leverages gradient estimation based on time-ordered residuals, avoiding the bias introduced by global residual averaging in traditional gradient boosting trees. This approach significantly mitigates the impact of redundant noise in the 650–700 nm spectral range during model training. With a small dataset of 306 samples, CatBoost achieved a mean squared error (MSE) as low as 0.00271 (see Table 9), a 9.9% reduction compared to XGBoost (MSE = 0.00301), thereby enhancing both model robustness and prediction accuracy.
Superior Feature Interaction Handling:
By employing target-based encoding, CatBoost dynamically associates spectral band features with dye concentrations, replacing traditional one-hot encoding and handling high-dimensional continuous inputs more efficiently. This strategy not only enhances the model's ability to capture complex nonlinear interactions but also significantly improves training efficiency: training time was reduced to 3.56 s, 36.4% faster than the baseline model (see Table 8), substantially lowering computational costs.
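The ordered target statistic underlying this encoding can be illustrated in a few lines of pure Python. This is a deliberately simplified sketch of the idea, not CatBoost's actual implementation, which uses multiple random permutations and a different prior scheme:

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior=0.5):
    """Encode each sample with the mean target of the samples that
    precede it in a fixed permutation, so its own label never leaks
    into its encoding. `prior` smooths early samples (an assumption)."""
    enc = np.empty(len(categories), dtype=float)
    sums, counts = {}, {}
    for i, (c, t) in enumerate(zip(categories, targets)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + prior) / (n + 1)  # running mean with a smoothing prior
        sums[c] = s + t
        counts[c] = n + 1
    return enc
```

Because each encoding uses only earlier samples, the statistic is unbiased with respect to the sample's own label, which is the leakage-avoidance property exploited by ordered boosting.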
Enhanced Nonlinear Adaptability:
The symmetric tree splitting strategy of CatBoost allows precise identification and modeling of the light scattering effects that emerge when the dye concentration exceeds 0.75 g·L−1 (as shown in Figure 8d). This contributes to a consistently low mean absolute percentage error (MAPE) of 3.13% under k = 6 cross-validation. Furthermore, prediction fluctuation was reduced by 52% compared to the SVR model (see Table 7), ensuring greater stability and reliability of predictions.
Improved Model Interpretability:
CatBoost integrates SHAP (SHapley Additive exPlanations) to quantify the contribution of each spectral band to the prediction outcome, offering transparent insights for industrial applications. The RF–SHAP consistency analysis of the CatBoost model is shown in Figure 11. SHAP analysis revealed that the 400–450 nm, 550–600 nm, and 600–650 nm bands contributed 26.8%, 22.1%, and 16.3% to model performance, respectively. These findings align closely with the random forest-based feature selection results, with a high consistency of 89.3%, validating the effectiveness of the feature selection strategy. This interpretability allows industrial operators to better understand and trust the prediction mechanism.
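Band-level attribution of the kind shown in Figure 11 is normally computed with the shap package; the sketch below substitutes scikit-learn permutation importance as a dependency-free proxy, with the function name and percentage normalization as assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def band_contributions(X, y, names, seed=0):
    """Percent contribution of each feature to a fitted model,
    normalised to sum to 100 (a proxy for SHAP-style attribution)."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    imp = permutation_importance(model, X, y, n_repeats=5, random_state=seed)
    scores = np.clip(imp.importances_mean, 0.0, None)  # drop noise-negative scores
    total = scores.sum() or 1.0
    return {n: 100.0 * s / total for n, s in zip(names, scores)}
```

Comparing such a ranking against the random forest importance scores gives a consistency check analogous to the 89.3% RF–SHAP agreement reported above.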
In summary, CatBoost combines efficient noise suppression, excellent feature interaction processing, and powerful nonlinear modeling capability, making it particularly suitable for the high-dimensional, small-sample, continuous spectral features involved in this study and the preferred model for intelligent prediction of wood dyeing formulas.
4.3. Discussion on Industrial Application Potential
The dye ratio prediction method based on the CatBoost model proposed in this study achieves high-precision concentration inversion from high-dimensional spectral input in a multi-dye mixed system, with a mean absolute error (MAE) below 0.035 g·L−1 and a prediction time under 3.6 s, giving it strong real-time response capability and generalization stability. Combined with its automatic screening of key bands and feature compression capability, it can greatly reduce the data processing burden at the industrial acquisition end [27].
The CatBoost prediction model proposed in this study (MAE = 0.0349 g·L−1, response time 3.56 s) can be integrated with multispectral sensors to build a real-time closed-loop control system on a wood dyeing production line: by dynamically adjusting the dye liquor feed (PID parameter Kp = 0.8), the color difference ΔE can be held within 1.75 (meeting the high-end market standard of ΔE < 2.5 [3]), reducing dye consumption by 25% relative to empirical formulation and cutting the discharge of wastewater containing reactive dyes by 500 tons for an annual output of 10,000 tons of dyed veneer [28]. With a hardware cost below USD 1500, this solution is accessible to small and medium-sized forest enterprises, can increase the added value of fast-growing species such as Larix gmelinii by 300%, directly alleviates China's annual supply–demand gap of 20 million m3 of precious wood, and promotes the sustainable utilization of forestry resources.
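The closed loop sketched above reduces to a color-difference check plus a proportional correction. In the sketch below, the ΔE formula (CIE76, i.e., Euclidean distance in CIELAB) and the linear mapping from ΔE to feed-rate adjustment are simplifying assumptions; only the gain Kp = 0.8 and the 1.75 tolerance come from the text:

```python
import math

def delta_e(lab1, lab2):
    """CIE76 color difference between two (L*, a*, b*) triples."""
    return math.dist(lab1, lab2)

def feed_correction(current_delta_e, kp=0.8, tolerance=1.75):
    """Proportional dye-feed adjustment: zero while ΔE is inside the
    tolerance, otherwise scaled by the gain Kp."""
    if current_delta_e <= tolerance:
        return 0.0
    return kp * (current_delta_e - tolerance)
```

In a full deployment the proportional term would be complemented by integral and derivative terms, and the correction would be split across the three dye feeds according to the predicted formulation.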
However, the generalization of this approach to diverse wood species remains a critical consideration for broader industrial adoption. Different anatomical structures significantly influence dye–wood interactions and the resulting spectral responses. For instance, hardwoods with smaller vessel diameters and tyloses may exhibit shallower dye penetration than the studied Pinus sylvestris var. mongolica, potentially requiring recalibration of the sensitive bands identified here. The 400–450 nm band response, which showed the highest importance score (0.364) in our model, could shift by 20–30 nm in species with higher extractive content, as these compounds often absorb strongly in the blue spectral region [29]. Similarly, woods with porosity exceeding 65% may show increased spectral variance due to uneven dye distribution, potentially reducing prediction accuracy by 15–20% based on preliminary tests with fast-growing poplar samples.
The model's performance under varying grain orientations also warrants consideration. While our training data focused on longitudinal sections, industrial processing often involves mixed grain presentations. Radial and tangential surfaces exhibit different light scattering properties due to varying ray cell exposure, which could introduce prediction errors if not properly accounted for. Furthermore, industrial-scale implementation faces environmental variability, where temperature fluctuations (±10 °C) and humidity variations (±20% RH) may induce spectral drift, suggesting the need for adaptive calibration protocols or temperature-compensated acquisition systems [30]. Despite these limitations, the demonstrated framework provides a robust foundation for the single-species production scenarios common among SME forest enterprises, with potential for extension through transfer learning and species-specific model libraries in future work.
5. Conclusions
This study establishes a robust and interpretable framework for predicting wood dyeing formulations by integrating hyperspectral reflectance data (400–700 nm) with CatBoost modeling. The key findings, technical contributions, and future research directions are summarized as follows:
1. Key Findings
Accurate formulation prediction was achieved by combining hyperspectral data with an optimized CatBoost model, yielding strong performance under small-sample conditions (n = 306).
Key sensitive bands (400–450 nm, 550–600 nm, and 600–650 nm) were identified using random forest analysis, revealing a consistent decrease in reflectance with increasing dye concentration.
SEM analysis confirmed that nonlinear scattering at high dye concentrations (>0.75 g·L−1) results from dye particle aggregation, supporting the observed spectral response mechanism.
The CatBoost model outperformed XGBoost, random forest (RF), and support vector regression (SVR), achieving an MSE of 0.00271 g2·L−2, an MAE of 0.0349 g·L−1, and a MAPE of 3.13%. Its ordered boosting strategy reduced gradient bias by 18.7% and captured nonlinear spectral interactions more effectively.
2. Major Contributions
Multi-scale mechanism validation: A hybrid framework combining hyperspectral imaging (HSI) and scanning electron microscopy (SEM) was proposed to elucidate the link between microstructure and spectral behavior.
Feature compression strategy: A random forest–driven band selection method was introduced to reduce data dimensionality while preserving physically interpretable descriptors.
Novel algorithm application: This work represents the first successful application of CatBoost to wood dyeing formulation prediction, achieving practical outcomes including
Average color difference (ΔE) < 1.75;
Dye consumption reduction of approximately 25%.
3. Future Research Directions
Species generalization: Extend the proposed framework to additional wood species such as poplar and eucalyptus.
System scalability: Apply the model to diverse dye systems and alternative processing conditions.
Algorithmic enhancement: Investigate the extraction of higher-order spectral features and the integration of advanced strategies, including deep neural networks, transfer learning, and edge-computing architectures.
Industrial deployment: Develop real-time, closed-loop control solutions for intelligent wood dyeing systems in practical manufacturing environments.
Supplementary Materials
The following supporting information can be downloaded at:
https://www.mdpi.com/article/10.3390/f16081279/s1: Data S1: Complete raw hyperspectral dataset containing 306 samples with full spectral range; Data S2: Dataset containing only the sensitive bands (400–450 nm, 550–600 nm, 600–650 nm); Data S3: Combined dataset integrating sensitive bands with spectral characteristic parameters.
Author Contributions
X.G. was responsible for study conceptualization, research design (including determination of research direction, methodology, and experimental protocols), and oversight of the research process. R.X. drafted the initial manuscript. Z.H. performed the data analysis. S.C. reviewed the manuscript and assisted with submission procedures. X.C. conducted language editing. All authors have read and agreed to the published version of the manuscript.
Funding
This research was partially funded by the National Natural Science Foundation of China (grant number 32171691); the Manufacturing Innovation Talent Project supported by the Harbin Science and Technology Bureau (grant number CXRC20221110393); and the Open Research Grant of the Key Laboratory of Sustainable Forest Ecosystem Management, Ministry of Education, Northeast Forestry University (grant number KFJJ2023YB03).
Data Availability Statement
Data are available upon request from the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Boran, S.; Kirsal, Y.E.; Kamil, D. Comparative evaluation and comprehensive analysis of machine learning models for regression problems. Data Intell. 2022, 4, 620–652. [Google Scholar] [CrossRef]
- Liu, Y.; Song, K. Study on the smart dyeing and performance of poplar veneers modified by deep eutectic solvents. Forests 2024, 15, 2120. [Google Scholar] [CrossRef]
- Sharma, G. Digital Color Imaging Handbook; CRC Press: Boca Raton, FL, USA, 2003; pp. 320–322. [Google Scholar]
- Feng, L.; Caiting, C.; Zhiping, M. A novel approach for recipe prediction of fabric dyeing based on feature-weighted support vector regression and particle swarm optimisation. Color. Technol. 2022, 138, 495–508. [Google Scholar]
- Pereira Ribeiro Teodoro, L.; Estevão, R.; Santana, D.C.; Oliveira, I.C.d.; Lopes, M.T.G.; Azevedo, G.B.d.; Rojo Baio, F.H.; da Silva Junior, C.A.; Teodoro, P.E. Eucalyptus species discrimination using hyperspectral sensor data and machine learning. Forests 2024, 15, 39. [Google Scholar] [CrossRef]
- Liu, Y.M.; Li, Y.G.; Shi, L.; Li, Y.Y.; Liu, H. Detection of the stem-boring damage by pine shoot beetle (Tomicus spp.) to Yunnan pine (Pinus yunnanensis Franch.) using UAV hyperspectral data. Front. Plant Sci. 2025, 16, 1514580. [Google Scholar]
- Hu, X.; Jiang, F.; Qin, X.; Huang, S.; Meng, F.; Yu, L. Exploration of suitable spectral bands and indices for forest fire severity evaluation using ZY-1 hyperspectral data. Forests 2025, 16, 640. [Google Scholar] [CrossRef]
- Li, Y.; Chen, Q.; Huang, K.; Wang, Z. The accuracy improvement of sap flow prediction in Picea crassifolia Kom. based on the back-propagation neural network model. Hydrol. Process. 2022, 36, e14490. [Google Scholar] [CrossRef]
- Guan, X.; Wu, Y.; Yang, Q. Research on wood color detection algorithm optimized by improved particle swarm optimization. China For. Prod. Ind. 2024, 61, 1–7. [Google Scholar] [CrossRef]
- Guan, X.; Li, W.; Huang, Q.; Huang, J. Intelligent color matching model for wood dyeing using genetic algorithm and extreme learning machine. J. Intell. Fuzzy Syst. 2022, 42, 4907–4917. [Google Scholar] [CrossRef]
- Wu, M.; Guan, X.; Li, W.; Huang, Q. Color spectra algorithm of hyperspectral wood dyeing using particle swarm optimization. Wood Sci. Technol. 2020, 55, 49–66. [Google Scholar] [CrossRef]
- Guan, X.; Chen, X.; He, Z.; Cui, H. Predicting dye formulations and ultrasonic pretreatment power in wood staining with a SAT fusion-gated BiLSTM model optimized by improved dung beetle algorithm. Appl. Sci. 2025, 15, 1522. [Google Scholar] [CrossRef]
- Sarfarazi, S.; Mascolo, I.; Modano, M.; Guarracino, F. Application of artificial intelligence to support design and analysis of steel structures. Metals 2025, 15, 408. [Google Scholar] [CrossRef]
- Wang, Q.; Yan, C.; Zhang, Y.; Xu, Y.; Wang, X.; Cui, P. Numerical simulation and Bayesian optimization CatBoost prediction method for characteristic parameters of veneer roller pressing and defibering. Forests 2024, 15, 2173. [Google Scholar] [CrossRef]
- GB/T 1931-2009; Method of Test for Moisture Content of Wood. National Forestry and Grassland Administration: Beijing, China, 2009.
- Duzce University. Determination of Color-Changing Effects of Bleaching Chemicals on Some Heat-Treated Woods; Technical Report 5926; Duzce University: Duzce, Turkey, 2020. [Google Scholar]
- Zhang, L.; Li, S.; Wang, J. Deep signal-dependent denoising noise algorithm. Electronics 2023, 12, 1201. [Google Scholar] [CrossRef]
- Hu, J. A study of accounting teaching feature selection and importance assessment based on random forest algorithm. Appl. Math. Nonlinear Sci. 2024, 9, 1–17. [Google Scholar] [CrossRef]
- Schubert, M.; Luković, M.; Christen, H. Prediction of mechanical properties of wood fiber insulation boards as a function of machine and process parameters by random forest. Wood Sci. Technol. 2020, 54, 703–713. [Google Scholar] [CrossRef]
- Wang, Y.; Zhang, W.; Gao, R.; Jin, Z.; Wang, X. Recent advances in the application of deep learning methods to forestry. Wood Sci. Technol. 2021, 55, 1171–1202. [Google Scholar] [CrossRef]
- Li, J.; An, X.; Li, Q.; Wang, C.; Yu, H.; Zhou, X.; Geng, Y.-A. Application of XGBoost algorithm in the optimization of pollutant concentration. Atmos. Res. 2022, 276, 106238. [Google Scholar] [CrossRef]
- Xue, H.; Xu, X.; Meng, X. Variety classification and identification of maize seeds based on hyperspectral imaging method. Optoelectron. Lett. 2025, 21, 234–241. [Google Scholar] [CrossRef]
- Longo, L.B.; Brüchert, F.; Becker, G.; Sauter, U.H. Predicting Douglas-fir knot size in the stand: A random forest model based on CT and field measurements. Wood Sci. Technol. 2021, 56, 509–529. [Google Scholar] [CrossRef]
- Chen, S.; Wang, J.; Liu, Y.; Chen, Z.; Lei, Y. The relationship between color and mechanical properties of heat-treated wood predicted based on support vector machine model. Holzforschung 2022, 76, 994–1002. [Google Scholar] [CrossRef]
- Rimal, Y.; Sharma, N.; Alsadoon, A. The accuracy of machine learning models relies on hyperparameter tuning: Student result classification using random forest, randomized search, grid search, Bayesian, genetic, and Optuna algorithms. Multimed. Tools Appl. 2024, 83, 74349–74364. [Google Scholar] [CrossRef]
- Zhao, J.; Wang, J.; Anderson, N. Machine learning applications in forest and biomass supply chain management: A review. Int. J. For. Eng. 2024, 35, 371–380. [Google Scholar] [CrossRef]
- Li, J.; Liu, J.; Zhang, Y.; Huang, Y. Hyperspectral imaging technology in wood industry: A review. Wood Sci. Technol. 2022, 56, 499–518. [Google Scholar] [CrossRef]
- Yaseen, D.A.; Scholz, M. Textile dye wastewater characteristics and constituents of synthetic effluents: A critical review. Int. J. Environ. Sci. Technol. 2019, 16, 1193–1226. [Google Scholar] [CrossRef]
- Gao, C.; Cui, X.; Matsumura, J. Multidimensional exploration of wood extractives: A review of compositional analysis, decay resistance, light stability, and staining applications. Forests 2024, 15, 1782. [Google Scholar] [CrossRef]
- Popescu, C.-M.; Zeniya, N.; Endo, K.; Genkawa, T.; Matsuo-Ueda, M.; Obataya, E. Assessment of structural differences between water-extracted and non-extracted hydro-thermally treated spruce wood by NIR spectroscopy. Forests 2021, 12, 1689. [Google Scholar] [CrossRef]
Figure 1.
Workflow of the proposed hyperspectral-machine learning framework.
Figure 2.
Experimental workflow and molecular structures of dyes used.
Figure 3.
Spectral data preprocessing.
Figure 4.
Reflectance of dyes at different concentrations: (a) Red dye; (b) Yellow dye; (c) Blue dye.
Figure 5.
Flow chart of k-fold cross-validation technique.
Figure 6.
Flow chart of model establishment and optimization.
Figure 7.
Sensitive wavelength selection. (a) Average band importance for all dyes. (b) Band importance heatmap by dye type. (c) Feature importance distribution across wavelengths.
Figure 8.
Distribution of dye particles on dyed veneer surface at different concentrations of red dye: (a) 0.25 g·L−1, (b) 0.50 g·L−1, (c) 0.75 g·L−1, (d) 1.00 g·L−1.
Figure 9.
Residual plots illustrating the prediction errors for red, yellow, and blue dye concentrations based on the selected model.
Figure 10.
Advantages of the CatBoost model.
Figure 11.
CatBoost model RF-SHAP consistency analysis.
Table 1.
Spectral characteristic parameters.
No. | Parameter | Name | Definition and Algorithm |
---|---|---|---
1 | | Yellow edge area | Area enclosed by the first-order differential wave in the yellow edge range |
2 | | Blue edge amplitude | Maximum value of the first-order derivative spectrum from 490 to 530 nm |
3 | | Blue edge position | Corresponding wavelength position |
4 | | Blue edge area | Area enclosed by the first-order differential wave in the blue edge range |
5 | | Green peak reflectivity | Maximum value of the original spectrum from 510 to 560 nm |
6 | | Green peak position | Corresponding wavelength position |
7 | | Yellow edge position | Corresponding wavelength position |
8 | | Yellow edge amplitude | Maximum value of the first-order derivative spectrum from 560 to 640 nm |
9 | | Red trough position | Wavelength position of the minimum value from 640 to 680 nm |
10 | | Red trough reflectivity | Minimum value of the original spectrum from 640 to 680 nm |
Table 2.
Spectral characteristic parameters of single board surface under changing red dye concentration.
No. | Red | Yellow | Blue | Blue_Area | Green_Peak | Red_Valley |
---|---|---|---|---|---|---
1 | 0 | 0 | 0 | 1543.900 | 5795.841 | 4785.029 |
2 | 0 | 0.2 | 0.2 | 703.056 | 4253.786 | 3386.752 |
3 | 0 | 0.25 | 0.25 | 836.887 | 4762.195 | 3905.199 |
4 | 0 | 0.3 | 0.3 | 473.900 | 3524.336 | 2494.026 |
5 | 0.1 | 0.3 | 0.3 | −737.426 | 2075.853 | 1619.847 |
6 | 0.2 | 0.2 | 0.2 | −123.631 | 3030.012 | 2464.250 |
7 | 0.2 | 0.3 | 0.3 | −877.938 | 1417.168 | 1145.891 |
8 | 0.25 | 0 | 0 | −737.336 | 3105.538 | 2837.900 |
9 | 0.25 | 0.2 | 0.2 | −325.647 | 2690.024 | 2205.900 |
10 | 0.25 | 0.25 | 0.25 | −740.726 | 2340.624 | 1912.361 |
11 | 0.3 | 0.25 | 0.25 | −834.430 | 2005.415 | 1765.499 |
12 | 0.5 | 0 | 0 | −802.807 | 1256.341 | 1121.505 |
Table 3.
Spectral characteristic parameters of the single board surface under the change in yellow dye concentration.
No. | Red | Yellow | Blue | Blue_Area | Yellow_Area |
---|---|---|---|---|---
1 | 0 | 0 | 0 | 1543.900 | 2741.576 |
2 | 0 | 0.25 | 0 | 1998.026 | 2485.660 |
3 | 0 | 0.5 | 0 | 2307.927 | 2308.898 |
4 | 0.2 | 0 | 0.2 | −313.839 | −121.537 |
5 | 0.2 | 0.2 | 0.2 | −92.336 | −396.613 |
6 | 0.2 | 0.25 | 0.2 | −36.961 | −465.382 |
7 | 0.25 | 0 | 0.25 | −1127.801 | −1756.937 |
8 | 0.25 | 0.25 | 0.25 | −740.726 | −2164.924 |
9 | 0.25 | 0.3 | 0.25 | −611.453 | −2147.505 |
10 | 0.3 | 0 | 0.3 | −1054.631 | −858.439 |
11 | 0.3 | 0.1 | 0.3 | −739.481 | −1087.337 |
12 | 0.3 | 0.2 | 0.3 | −677.938 | −1307.374 |
Table 4.
Spectral characteristic parameters of the single board surface under the change in blue dye concentration.
No. | Red | Yellow | Blue | Blue_Amplitude | Blue_Area | Green_Peak | Red_Valley |
---|---|---|---|---|---|---|---
1 | 0 | 0 | 0 | 23.530 | 1543.900 | 5795.841 | 4785.029 |
2 | 0 | 0 | 0.25 | 26.776 | −598.353 | 3841.3670 | 2434.562 |
3 | 0 | 0 | 0.5 | 29.912 | −798.164 | 3763.970 | 2222.021 |
4 | 0.2 | 0.2 | 0 | 14.756 | 84.767 | 3053.554 | 2602.712 |
5 | 0.2 | 0.2 | 0.2 | 16.861 | −406.724 | 2561.284 | 2023.949 |
6 | 0.2 | 0.2 | 0.25 | 17.387 | −529.597 | 2438.217 | 1879.258 |
7 | 0.25 | 0.25 | 0 | 11.605 | −424.638 | 2568.751 | 2295.606 |
8 | 0.25 | 0.25 | 0.25 | 15.713 | −740.726 | 2340.624 | 1912.361 |
9 | 0.25 | 0.25 | 0.3 | 15.099 | −773.684 | 1524.036 | 1284.027 |
10 | 0.3 | 0.3 | 0 | 9.892 | −427.413 | 1418.971 | 1273.548 |
11 | 0.3 | 0.3 | 0.1 | 13.531 | −740.182 | 1381.778 | 1149.051 |
12 | 0.3 | 0.3 | 0.2 | 15.212 | −877.938 | 1350.353 | 1121.847 |
Table 5.
Hyperparameter optimization results of the model before data processing.
No. | Model | Hyperparameter | Optimal Value |
---|---|---|---|
1 | CatBoost | iterations | 300 |
 | | depth | 10 |
 | | learning_rate | 0.05 |
 | | border_count | 32 |
 | | l2_leaf_reg | 1 |
2 | XGBoost | n_estimators | 300 |
 | | max_depth | 9 |
 | | colsample_bytree | 0.7 |
 | | learning_rate | 0.05 |
 | | subsample | 0.7 |
3 | RF | max_depth | 20 |
 | | max_features | sqrt |
 | | min_samples_split | 2 |
 | | n_estimators | 300 |
4 | SVR | C | 0.1 |
 | | epsilon | 0.01 |
 | | gamma | 0.1 |
Table 6.
Hyperparameter optimization results of the model after data processing.
No. | Model | Hyperparameter | Optimal Value |
---|---|---|---|
1 | CatBoost | iterations | 1000 |
 | | depth | 10 |
 | | learning_rate | 0.1 |
 | | border_count | 64 |
 | | l2_leaf_reg | 9 |
2 | XGBoost | n_estimators | 200 |
 | | max_depth | 6 |
 | | colsample_bytree | 1.0 |
 | | learning_rate | 0.05 |
 | | subsample | 0.7 |
3 | RF | max_depth | 10 |
 | | max_features | None |
 | | min_samples_split | 2 |
 | | n_estimators | 200 |
4 | SVR | C | 0.1 |
 | | epsilon | 0.01 |
 | | gamma | 0.1 |
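The optimal values reported in Tables 5 and 6 are the kind of output a hyperparameter grid search produces. A minimal sketch of that workflow with scikit-learn, shown here for the RF and SVR models only: the data is synthetic stand-in data and the candidate ranges are illustrative, not the search spaces actually used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((120, 6))                                  # stand-in for spectral features
y = X @ rng.random(6) + 0.05 * rng.standard_normal(120)   # stand-in target

# Candidate grids mirror the hyperparameter names listed in Tables 5-6;
# the value ranges here are illustrative assumptions.
searches = {
    "RF": (RandomForestRegressor(random_state=0), {
        "n_estimators": [200, 300],
        "max_depth": [10, 20],
        "max_features": ["sqrt", None],
        "min_samples_split": [2, 4],
    }),
    "SVR": (SVR(), {
        "C": [0.1, 1.0],
        "epsilon": [0.01, 0.1],
        "gamma": [0.1, "scale"],
    }),
}

best = {}
for name, (model, grid) in searches.items():
    # Exhaustive search, scored by (negated) MSE under 4-fold CV
    gs = GridSearchCV(model, grid, cv=4, scoring="neg_mean_squared_error")
    gs.fit(X, y)
    best[name] = gs.best_params_
print(best)
```

The same pattern extends to CatBoost and XGBoost, since both expose scikit-learn-compatible estimators that `GridSearchCV` accepts directly.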
Table 7.
Performance evaluation of different k-folds.
No. | K-Fold Cross-Validation | Regression Model | MSE | MAPE | MAE | R2 |
---|---|---|---|---|---|---|
1 | k = 2 | CatBoost | 0.00332 | 3.785 | 0.0408 | 0.938 |
 | | XGBoost | 0.00385 | 4.112 | 0.0459 | 0.928 |
 | | RF | 0.00329 | 5.395 | 0.0402 | 0.932 |
 | | SVR | 0.00359 | 6.035 | 0.0417 | 0.925 |
2 | k = 4 | CatBoost | 0.00287 | 3.357 | 0.0369 | 0.948 |
 | | XGBoost | 0.00337 | 3.746 | 0.0417 | 0.940 |
 | | RF | 0.00327 | 4.034 | 0.0397 | 0.942 |
 | | SVR | 0.00378 | 5.642 | 0.0426 | 0.930 |
3 | k = 6 | CatBoost | 0.00271 | 3.134 | 0.0349 | 0.957 |
 | | XGBoost | 0.00301 | 3.240 | 0.0352 | 0.950 |
 | | RF | 0.00313 | 3.687 | 0.0378 | 0.945 |
 | | SVR | 0.00327 | 5.653 | 0.0405 | 0.935 |
4 | k = 8 | CatBoost | 0.00276 | 3.486 | 0.0363 | 0.952 |
 | | XGBoost | 0.00315 | 3.550 | 0.0395 | 0.947 |
 | | RF | 0.00326 | 3.949 | 0.0396 | 0.941 |
 | | SVR | 0.00327 | 5.984 | 0.0405 | 0.933 |
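The k-fold comparison in Table 7 amounts to averaging the four metrics over the held-out folds for each value of k. A minimal sketch, assuming scikit-learn's metric functions and synthetic stand-in data in place of the study's spectral features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((120, 6))                                        # stand-in features
y = X @ rng.random(6) + 1.0 + 0.05 * rng.standard_normal(120)   # offset keeps y away from 0 so MAPE stays stable

def kfold_scores(model, X, y, k):
    """Average MSE, MAPE, MAE and R2 over k held-out folds."""
    scores = {"MSE": [], "MAPE": [], "MAE": [], "R2": []}
    for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        scores["MSE"].append(mean_squared_error(y[test], pred))
        scores["MAPE"].append(mean_absolute_percentage_error(y[test], pred))
        scores["MAE"].append(mean_absolute_error(y[test], pred))
        scores["R2"].append(r2_score(y[test], pred))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

# Compare the fold counts evaluated in Table 7
results = {k: kfold_scores(RandomForestRegressor(n_estimators=100, random_state=0),
                           X, y, k)
           for k in (2, 4, 6, 8)}
print(results)
```

Note that `mean_absolute_percentage_error` returns a fraction; multiply by 100 to match the percentage convention used in the tables.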
Table 8.
Performance indicators of the model before and after data processing.
No. | Model | MSE (Before Processing) | MSE (After Processing) | Running Time (Before Processing) | Running Time (After Processing) |
---|---|---|---|---|---|
1 | CatBoost | 0.00303 | 0.00271 | 5.6 s | 3.56 s |
2 | XGBoost | 0.00353 | 0.00301 | 5.2 s | 3.85 s |
3 | RF | 0.00361 | 0.00309 | 5.0 s | 4.12 s |
4 | SVR | 0.00403 | 0.00372 | 4.35 s | 4.12 s |
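The running-time comparison in Table 8 can be reproduced in outline by timing `fit()` on raw versus preprocessed features. A minimal sketch that assumes standardization as the preprocessing step and uses synthetic data; the study's actual preprocessing pipeline and hardware may differ, so only the measurement pattern carries over:

```python
import time

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((500, 20))        # stand-in feature matrix
y = X @ rng.random(20)           # stand-in target

def fit_seconds(model, X, y):
    """Wall-clock seconds for a single model fit."""
    t0 = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - t0

svr_params = dict(C=0.1, epsilon=0.01, gamma=0.1)   # SVR settings from Table 6
raw_t = fit_seconds(SVR(**svr_params), X, y)
scaled_t = fit_seconds(SVR(**svr_params), StandardScaler().fit_transform(X), y)
print(f"raw: {raw_t:.3f} s, standardized: {scaled_t:.3f} s")
```

Single-run wall-clock timings are noisy; averaging several repetitions (e.g. with `timeit`) gives figures closer in spirit to the table's comparison.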
Table 9.
Evaluation performance of the model on the test set under the optimal parameters.
No. | Performance Indicator | CatBoost | XGBoost | RF | SVR |
---|---|---|---|---|---|
1 | MSE | 0.00271 | 0.00301 | 0.00309 | 0.00372 |
2 | MAE | 0.0349 | 0.0353 | 0.0378 | 0.0405 |
3 | MAPE (%) | 3.134 | 3.240 | 3.901 | 5.653 |
4 | R2 | 0.957 | 0.950 | 0.945 | 0.935 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).