Article

CatBoost-Optimized Hyperspectral Modeling for Accurate Prediction of Wood Dyeing Formulations

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
*
Author to whom correspondence should be addressed.
Forests 2025, 16(8), 1279; https://doi.org/10.3390/f16081279
Submission received: 4 July 2025 / Revised: 24 July 2025 / Accepted: 31 July 2025 / Published: 5 August 2025
(This article belongs to the Section Wood Science and Forest Products)

Abstract

This study proposes a CatBoost-enhanced hyperspectral modeling approach for accurate prediction of wood dyeing formulations. Using Pinus sylvestris var. mongolica veneer as the substrate, 306 samples with gradient dye concentrations were prepared, and their reflectance spectra (400–700 nm) were acquired. After noise reduction and sensitive band selection (400–450 nm, 550–600 nm, and 600–650 nm), spectral descriptors were extracted as model inputs. The CatBoost algorithm, optimized via k-fold cross-validation and grid search, outperformed XGBoost, random forest, and SVR in prediction accuracy, achieving MSE = 0.00271 and MAE = 0.0349. Scanning electron microscopy (SEM) revealed the correlation between dye particle distribution and spectral response, validating the model’s physical basis. This approach enables intelligent dye formulation control in industrial wood processing, reducing color deviation (ΔE < 1.75) and dye waste by approximately 25%.

1. Introduction

In recent years, with the increasing scarcity of high-quality natural wood resources, the annual supply–demand gap of precious wood in China has exceeded 20 million cubic meters, echoing the global trend reported by FAO (2023) for high-grade timber scarcity [1]. At the same time, the market demand for decorative wood products has continued to grow, exacerbating the structural contradiction between resources and the market. Wood dyeing technology, as a means to improve the decorativeness and added value of artificial forest materials, can impart the visual characteristics of precious hardwoods to fast-growing species, potentially increasing their market value by a factor of 3 to 5 [2]. In recent years, research teams in Europe and North America have explored standardized digital dyeing workflows for engineered wood, emphasizing spectral precision, color reproducibility, and environmental efficiency, yet studies integrating high-dimensional spectral data with robust ML algorithms in wood dyeing remain limited. Moreover, traditional dyeing methods mostly rely on empirical ratios and suffer from problems such as large color difference (ΔE > 5), uneven coloring, and poor recipe reproducibility, making it difficult to meet the requirements of high-end markets for product consistency and standardization [3].
In order to improve the accuracy and controllability of the color matching process, researchers have tried to combine computer-aided color matching with hyperspectral imaging (HSI) technology in recent years. By extracting the spectral response information of the material, dye concentration or color parameter inversion modeling is performed to avoid the phenomenon of “metameric heterochromaticity” [4]. HSI technology has been widely used in forestry fields such as wood species identification, pest and disease detection, and forest fire assessment due to its high resolution of material surface and microstructure [5,6,7], and it also provides a technical basis for high-dimensional feature input of dyeing formulas.
At the same time, machine learning methods have become the mainstream means of dyeing prediction models. Due to the gradient vanishing and local optimal problems, the accuracy of traditional backpropagation (BP) neural networks drops by 37% when the sample size is less than 500, and its generalization ability and stability are limited [8]. Although optimization strategies such as particle swarm optimization (PSO) and genetic algorithms can improve model stability [9,10], in industrial real-time scenarios, model convergence speed and interpretability are still difficult to ignore. In addition, physical modeling methods such as Stearns–Noechel’s full-band spectral color matching model have high theoretical accuracy, but the noise interference caused by band redundancy, feature collinearity and algorithm complexity limit its actual promotion in the dyeing industry [11].
In recent years, some studies have begun to explore the value of explainable machine learning in modeling complex material–process interactions in forestry and beyond [12]. Sarfarazi et al. [13], for example, proposed an interpretable AI framework that integrates machine learning with finite element modeling to predict structural responses in steel materials. Although their study focused on structural engineering, the methodology underscores the broader relevance of coupling data-driven models with physical insights to enhance predictive accuracy and model transparency—principles that are equally applicable to wood dyeing systems characterized by heterogeneous microstructures and nonlinear dye-material interactions. Among the emerging algorithms, CatBoost—a gradient boosting method based on symmetric trees and ordered boosting—has shown superior performance in high-dimensional, small-sample tasks. It has been successfully applied in wood property prediction, moisture content estimation, and species classification, often outperforming traditional models like XGBoost and Random Forest [14]. However, its application in wood dye formulation prediction remains underexplored, particularly for multi-dye systems requiring precise concentration inversion under complex spectral inputs.
Based on this, this paper proposes a hyperspectral intelligent prediction model based on the CatBoost algorithm to establish a predictive framework for multi-dye wood coloring formulation. The research objectives are as follows:
1.
Construct a CatBoost prediction model for spectral features: Use the CatBoost algorithm combined with random forests to screen sensitive bands and sort feature contributions to address issues such as information redundancy and spectral noise in high-dimensional spectral data;
2.
Compare and analyze the performance of various prediction algorithms: Compared with traditional algorithms such as XGBoost, random forest (RF), and support vector regression (SVR), the model is systematically evaluated in terms of mean square error (MSE), determination coefficient (R2), mean absolute error (MAE), and other indicators;
3.
Construct a multimodal verification mechanism: Introduce a joint characterization method of hyperspectral imaging and scanning electron microscopy (SEM) to verify the consistency and physical rationality of the prediction model from the perspective of the relationship between microscopic particle distribution and surface reflectivity.
The innovation of this study is that the CatBoost algorithm is introduced into the field of wood dyeing color matching prediction for the first time, and a set of intelligent dyeing modeling strategies with high precision, strong interpretability, and industrial scalability are constructed by combining hyperspectral data mining and multi-source characterization methods. The research results are expected to provide data support and technical paths for the intelligent transformation of wood dyeing technology and the high-value utilization of artificial forest resources.

2. Materials and Methods

The overall research process of this paper is shown in Figure 1, covering the following: (1) pretreatment of Scots pine veneer with hydrogen peroxide bleaching; (2) 306 groups of dye concentration gradient dyeing; (3) 400–700 nm hyperspectral reflectance acquisition and pretreatment; (4) sensitive band screening and spectral feature parameter extraction based on random forest; (5) CatBoost model optimization (compared with XGBoost/RF/SVR); (6) multimodal verification (statistical indicators + SEM micro-mechanism). The detailed methods of each link are shown below.

2.1. Experimental Materials and Data Collection

In this study, veneer of Pinus sylvestris var. mongolica, purchased from Harbin, Heilongjiang Province, China, was employed as the dyeing substrate. All specimens were cut along the longitudinal grain direction to dimensions of 30 mm × 15 mm × 1 mm. The initial moisture content was 12 ± 1%, determined in accordance with GB/T 1931-2009 [15]. Surface roughness was controlled at Ra = 6.3 ± 0.5 μm. Prior to pretreatment, all samples were conditioned for 7 days in a climate-controlled chamber at 20 ± 2 °C and 65 ± 5% relative humidity to ensure moisture equilibrium. The complete experimental procedure is illustrated in Figure 2, which includes both the workflow of the dyeing process and the molecular structures of the three reactive dyes employed: reactive red X-3B, reactive yellow X-RG, and reactive blue.
1.
Dyes and reagents
Dyes: Three reactive dyes were selected for this study based on their complementary absorption characteristics within the visible spectrum. Reactive yellow X-RG exhibits a primary absorption peak near 450 nm in the blue light region, with its azo group (–N=N–) showing strong absorption between 400 and 500 nm. Reactive red X-3B features a main absorption peak around 550 nm in the green light region, attributed to the anthraquinone structure’s characteristic absorption between 500 and 600 nm. Reactive blue demonstrates significant absorption near 640 nm in the orange-red region, where the copper phthalocyanine complex facilitates intense π→π* electronic transitions. All dyes were of analytical grade with purity ≥95%, supplied by Beijing Chemical Plant. This combination of dyes effectively covers the major regions of the visible spectrum and serves as a theoretical foundation for the subsequent screening of sensitive spectral bands using a random forest model.
Auxiliary reagents: 4% H2O2 solution (bleach, analytical grade), aqueous JFC penetrant (industrial grade), anhydrous Na2CO3 (fixing agent, analytical grade), and NaCl (dyeing accelerator, analytical grade).
2.
Bleaching and pretreatment
To remove impurities from wood and improve dye penetration, the veneer was treated in a bleaching solution consisting of 4% H2O2 (500 mL), Na2SiO3 (0.5 g, stabilizer), and Na3PO4 (0.5 g, buffer) in a 65 °C constant-temperature water bath for 2 h (liquid-to-material ratio 1:20), and the pH value was controlled at 10.5 ± 0.2. After treatment, the sample was fully rinsed with deionized water and dried at 60 °C to a moisture content of 6–8% (according to standard GB/T 1931-2009).
3.
Dyeing and fixation
In the dyeing stage, a total of 306 groups of different dye concentration ratios were designed. In each group, 500 mL of dye solution was prepared, 0.5 mL of aqueous JFC and 15 g·L−1 NaCl were added, and 6 veneers were placed in each group during the dyeing process. The pretreated samples were dyed in a 65 °C water bath for 2 h, and then 20 g·L−1 Na2CO3 was added for fixation for 30 min [16]. Finally, they were rinsed thoroughly with clean water and dried at 60 °C for 2 h to remove floating color.
4.
Hyperspectral data acquisition
The sample reflectance data was collected using the HySpex VNIR-1800 hyperspectral imaging system (Norsk Elektro Optikk, Oslo, Norway), with a spectral range of 400–1000 nm and a spectral resolution of 2.6 nm. The acquisition settings were as follows: integration time of 100 ms, frame rate of 30 fps, spatial resolution of 0.2 mm/pixel, and a signal-to-noise ratio greater than 300:1. Dark and white reference calibration was performed every 30 min to ensure spectral stability. All data were collected under controlled environmental conditions (23 ± 1 °C, 50 ± 5% RH).
According to the spectral response characteristics of the visible light band (400–700 nm) relevant to wood dyeing, the reflectance data in this range were extracted for subsequent analysis. The pixel average spectrum of the homogeneous area on the surface of each veneer sample was obtained using ENVI 5.3 software (Harris Geospatial Solutions, Broomfield, CO, USA). In total, 306 sets of hyperspectral reflectance spectra were obtained for model training and validation (see Supplementary Data S1 for the complete dataset).

2.2. Hyperspectral Data Processing and Feature Engineering

2.2.1. Spectral Data Denoising and Smoothing Methods

In order to improve the data reliability of model training, this study systematically preprocessed the collected hyperspectral reflectance data, mainly including three steps of outlier removal, noise smoothing, and scale normalization, as shown in Figure 3.
First, in order to eliminate outlier data that may appear in the measurement (caused by light source fluctuations, equipment jitter, etc.), the interquartile range (IQR) method is used to identify spectral anomalies, the threshold range is set to [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR], and samples outside this range are eliminated.
Subsequently, the Savitzky–Golay convolution smoothing algorithm (window width 11 band points, second-order polynomial fitting) is used to perform first-order guided denoising on the spectral curve, while retaining the spectral edge features and suppressing high-frequency noise interference. This algorithm has been widely used in plant and wood hyperspectral processing because of its good conformal ability to hyperspectral nonlinear response [17].
Finally, the reflectance values of all bands are linearly compressed to the [0, 1] interval through Min-Max Scaling to unify the data scale, avoid model weight offset caused by dimensional differences, and provide standardized input for subsequent sensitive band screening and machine learning modeling [18].
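The three preprocessing steps above can be sketched as follows; this is a minimal illustration assuming a (samples × bands) reflectance matrix, with the window width (11) and polynomial order (2) taken from the text. Function and variable names are illustrative, not from the study's actual code.

```python
# Sketch of the three-step spectral preprocessing: IQR outlier screening,
# Savitzky-Golay smoothing, and Min-Max scaling (illustrative names).
import numpy as np
from scipy.signal import savgol_filter

def preprocess_spectra(spectra: np.ndarray) -> np.ndarray:
    """spectra: (n_samples, n_bands) raw reflectance matrix."""
    # 1) IQR-based outlier screening on per-sample mean reflectance
    means = spectra.mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    iqr = q3 - q1
    mask = (means >= q1 - 1.5 * iqr) & (means <= q3 + 1.5 * iqr)
    kept = spectra[mask]

    # 2) Savitzky-Golay smoothing along the band axis
    # (window width 11 points, second-order polynomial, as in the text)
    smoothed = savgol_filter(kept, window_length=11, polyorder=2, axis=1)

    # 3) Min-Max scaling of each band to [0, 1]
    lo = smoothed.min(axis=0, keepdims=True)
    hi = smoothed.max(axis=0, keepdims=True)
    return (smoothed - lo) / np.where(hi - lo == 0, 1, hi - lo)
```

Applying outlier removal before smoothing keeps anomalous curves from leaking into the filter window, and scaling last ensures the [0, 1] range holds on the final data.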

2.2.2. Sensitive Band Screening

In the study of wood dyeing formulation prediction, although traditional approaches typically use the full spectral range of 400–700 nm as model input, not all bands exhibit equal sensitivity to changes in dye concentration. For instance, reactive yellow dye shows strong absorption characteristics around 450 nm (blue region), making the reflectance in this band highly informative for color matching prediction. In contrast, reflectance variations in the far-red or green regions tend to be more stable. Therefore, identifying key sensitive bands within the spectrum is crucial for improving prediction accuracy.
To screen these critical bands and assess their importance, this study introduces the random forest (RF) algorithm. RF has demonstrated strong stability and robustness in handling high-dimensional, nonlinear data, making it well-suited for hyperspectral modeling in this multi-dye mixture system [19]. In this work, the Gini Index is used as the basis for evaluating feature importance. The specific steps are as follows:
1.
Impurity Calculation
For each decision tree node $m$, the Gini Index is defined as
$$GI_m = 1 - \sum_{k=1}^{K} p_{mk}^2$$
where $p_{mk}$ is the proportion of samples in node $m$ belonging to concentration class $k$. This index reflects the impurity of the concentration distribution at the current node.
2.
Impurity Reduction by Splitting
When a feature $X_j$ (i.e., a specific spectral band) is used to split node $m$, the resulting impurity reduction is calculated as
$$\Delta GI_m = GI_m^{\mathrm{parent}} - \left( \frac{n_{\mathrm{left}}}{n_m} \times GI_m^{\mathrm{left}} + \frac{n_{\mathrm{right}}}{n_m} \times GI_m^{\mathrm{right}} \right)$$
where $GI_m^{\mathrm{left}}$ and $GI_m^{\mathrm{right}}$ represent the Gini indices of the left and right child nodes, respectively, and $n_{\mathrm{left}}$, $n_{\mathrm{right}}$ are the numbers of samples in the child nodes. This quantifies the effectiveness of feature $X_j$ in partitioning node $m$.
3.
Feature Importance Aggregation
The total feature importance for Χ j is computed by summing its contributions across all trees in the forest.
$$\omega_j = \frac{1}{N_{\mathrm{trees}}} \sum_{i=1}^{N_{\mathrm{trees}}} \sum_{m \in M_i(j)} \Delta GI_m$$
where $M_i(j)$ denotes the set of all nodes in the $i$-th tree that are split using feature $X_j$. The resulting importance scores are normalized so that $\sum_j \omega_j = 1$.
By incorporating Gini-based feature importance analysis, this method enables the identification of the most representative and sensitive spectral regions within the 400–700 nm range. The importance score ω j quantifies each band’s contribution to improving node purity in concentration-based classification, thus revealing the physical nature of spectral responses: bands with high ω j values stem from specific dye absorption mechanisms, where reflectance decreases monotonically with concentration; conversely, low ω j values typically indicate weak absorption regions dominated by scattering noise from the wood substrate. Excluding such noisy bands enhances model robustness and allows for a ~50% reduction in input dimensionality, optimizing industrial detection efficiency.
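As a concrete illustration, this Gini-importance screening can be reproduced with scikit-learn, whose `feature_importances_` attribute is the normalized mean impurity decrease (the $\omega_j$ above). The synthetic reflectance data and the ~80% cumulative cutoff below are illustrative stand-ins, not the study's dataset.

```python
# Minimal sketch of Gini-importance band screening with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_samples, n_bands = 306, 115
X = rng.random((n_samples, n_bands))               # stand-in reflectance
y = 1.0 - X[:, 20] + 0.1 * rng.random(n_samples)   # one "sensitive" band

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importance = rf.feature_importances_               # normalized: sums to 1

# keep the smallest set of bands whose cumulative importance reaches 80%
order = np.argsort(importance)[::-1]
cum = np.cumsum(importance[order])
selected = order[: int(np.searchsorted(cum, 0.80)) + 1]
```

Ranking bands by importance and truncating at a cumulative threshold is what yields the roughly 50% input-dimensionality reduction described above.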

2.2.3. Dye Concentration Optimization

Figure 4 shows the visible spectral reflectance change curves of Pinus sylvestris var. mongolica veneer under different concentrations of three dyes (0.25, 0.5, 0.75, and 1.0 g·L−1) for reactive red, reactive yellow and reactive blue. Preliminary experiments found that when the dye concentration exceeded 0.75 g·L−1, the spectral response showed obvious nonlinear characteristics. In order to explore its generation mechanism, this paper further combined scanning electron microscope (SEM) images to analyze the aggregation state of dye particles at high concentrations and their influence on the spectral response.

2.2.4. Spectral Feature Parameter Extraction

As the dye concentration changes, the characteristic parameters in the spectral data will also change accordingly. For example, the blue edge amplitude can reflect the absorption characteristics of the dye in the blue region, especially those dyes that have significant absorption of blue light. This parameter can be used to distinguish and quantify the concentration of dyes with significant absorption characteristics in the blue region. Green peak reflectance directly reflects the reflectance characteristics of the dye in the green area. Different dyes will change the reflectance of wood in this area, and green peak reflectance can be used to quantify this change. Therefore, selecting key spectral characteristic parameters helps to identify the presence and concentration of specific dyes. Some common spectral characteristic parameters and their corresponding meanings are shown in Table 1.

2.3. Research Methods

This study used four machine learning algorithms, CatBoost, XGBoost, random forest (RF), and support vector regression (SVR), to build a prediction model for wood dyeing formula. The dataset (306 groups in total) is divided into training sets (214 groups) and test sets (92 groups) in a ratio of 7:3 to evaluate the generalization ability of the model. At the same time, in order to improve the reliability of the model, 8-fold cross-validation (8-fold CV) is used to evaluate the training stability, and key hyperparameters are optimized based on grid search. Finally, model evaluation parameters are constructed to evaluate the algorithm model.

2.3.1. Category Gradient Boosting

CatBoost is an ensemble learning algorithm based on gradient boosting decision trees, which is designed for efficient processing of categorical features and small sample data scenarios. Compared with traditional GBDT algorithms (such as XGBoost and LightGBM), its core innovation lies in the symmetric tree structure and ordered boosting technology, which can effectively solve the noise interference and feature collinearity problems in high-dimensional spectral data, especially in scenarios with small samples or high proportion of categorical features [20].
The training process of CatBoost consists of multiple weak learners (decision trees). Each new decision tree is built based on the residual of the current model (the difference between the predicted value and the true value). Through multiple rounds of iteration, the overall performance is gradually improved.
In formulaic terms, CatBoost follows the standard additive scheme of gradient boosting, as shown in Formula (4):
$$\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta h_t(x)$$
where $\eta$ is the learning rate and $h_t(x)$ is the $t$-th weak learner fitted to the current residuals.
One of the most important innovations of CatBoost is its processing of categorical features. Traditional methods (such as one-hot encoding) are prone to high-dimensional sparse matrices, while CatBoost adopts a target encoding strategy to dynamically calculate the statistical correlation between categorical features and target values. Formula (5) is
$$\mathrm{TargetEnc}(x) = \frac{\sum_{i \in \mathrm{past}} y_i + \alpha}{N_{\mathrm{past}} + \alpha}$$
where $y_i$ is the target value of category $x$ in the training data, $N_{\mathrm{past}}$ is the number of past samples of that category, and $\alpha$ is a smoothing parameter that prevents overfitting when the frequency of a category is too low.
To solve the target leakage problem in traditional gradient boosting, CatBoost uses an ordered boosting strategy: when making a prediction for a sample, only the historical information from preceding samples is used to build the decision tree, thereby eliminating any dependence on future data. This strategy significantly improves the robustness of the model in small-sample scenarios.
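A pure-Python sketch of the ordered target-encoding idea: each sample is encoded using only the target statistics of samples that precede it in a fixed order. The `prior` term and the single fixed ordering are simplifications (CatBoost averages over several random permutations), so treat this as a conceptual illustration rather than CatBoost's exact implementation.

```python
# Ordered target encoding sketch: a sample's category statistic uses
# only "past" samples, avoiding target leakage. alpha smooths toward
# a prior value for rare categories (both names are illustrative).
def ordered_target_encoding(categories, targets, alpha=1.0, prior=0.5):
    sums, counts = {}, {}
    encoded = []
    for c, y in zip(categories, targets):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded.append((s + alpha * prior) / (n + alpha))  # past info only
        # update running statistics AFTER encoding the current sample
        sums[c] = s + y
        counts[c] = n + 1
    return encoded
```

The first occurrence of any category receives the prior, and later occurrences drift toward the category's running target mean.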
Thanks to the above design, CatBoost is particularly suitable for dealing with nonlinear feature selection, noise interference, and collinearity problems in high-dimensional spectral data, and has demonstrated superior performance in many fields such as image recognition, text classification, and hyperspectral regression. In this paper, it is applied to the wood dyeing formula prediction task to improve the model’s ability to model the complex mapping relationship between reflectance spectrum and dye concentration.

2.3.2. Extreme Gradient Boosting

XGBoost (Extreme Gradient Boosting) is an efficient ensemble learning algorithm based on the gradient boosting machine (GBM) framework. It iteratively constructs a sequence of decision trees and optimizes the prediction results in an additive manner [21]. It is widely used in regression and classification tasks. Its core idea is to iteratively train weak learners (decision trees) and combine their prediction results in an additive manner to gradually approach the optimal solution of the objective function. Compared with the traditional gradient boosting algorithm, XGBoost significantly improves the accuracy and training efficiency of the model by introducing regularization terms, parallel computing, and tree structure optimization strategies.
XGBoost’s objective function consists of a loss function and a regularization term, as shown in Equation (6), which aims to balance the model’s fitting ability and complexity
$$L^{(t)} = \sum_{i=1}^{n} l\left(Y_i,\ \hat{Y}_i^{(t-1)} + f_t(x_i)\right) + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2$$
where $l\left(Y_i, \hat{Y}_i^{(t-1)} + f_t(x_i)\right)$ is the loss term for the $t$-th iteration; $\gamma$ and $\lambda$ are the hyperparameters for controlling tree complexity and leaf-node weight regularization, respectively, to effectively suppress overfitting; $T$ is the number of leaf nodes in the tree; and $\omega_j$ is the weight of the $j$-th leaf.
XGBoost achieves multi-threaded parallelism in the feature split point evaluation stage through parallel computing, accelerating model training. At the same time, each round of iteration updates the model through additive training, namely
$$\hat{Y}_i^{(t)} = \hat{Y}_i^{(t-1)} + \eta f_t(x_i)$$
where η is the learning rate (default η = 0.1), which is used to suppress the contribution of a single tree and prevent overfitting.
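The additive update in Formula (7) can be illustrated with a minimal numpy boosting loop that fits depth-1 regression stumps to the current residuals. Real XGBoost additionally uses second-order gradients and the regularization terms of Formula (6), so this is only a conceptual sketch with illustrative names.

```python
# Minimal additive boosting: pred^(t) = pred^(t-1) + eta * f_t(x),
# where each f_t is a depth-1 stump fitted to the residuals.
import numpy as np

def fit_stump(x, r):
    """Best single-threshold split of x minimizing squared error on r."""
    best = (np.inf, 0.0, r.mean(), r.mean())
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def boost(x, y, n_rounds=50, eta=0.1):
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)   # fit the current residuals
        pred = pred + eta * stump(x)     # additive update, shrunk by eta
    return pred
```

With a small learning rate each tree contributes only a fraction of its fitted residual, which is exactly the overfitting safeguard described above.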

2.3.3. Random Forest

The basic principles of the random forest (RF) method were briefly introduced above and are not repeated here. In this study, the accuracy and generalization ability of the RF model are mainly governed by two parameters: the number of decision trees and the leaf-node size. Specifically, the number of trees was varied between 0 and 3000, and the leaf size (i.e., the minimum number of samples per leaf node) was set in the range of 1 to 10. Experimental verification shows that when the number of trees is increased to 200 and two different leaf-size settings are used, the model performance is more stable and the prediction results are more reliable [22].
In addition, in the modeling process of random forest, the bootstrap sampling strategy is adopted to randomly generate multiple training subsets (denoted as S1, S2, …, Sₙ), and train and generate corresponding decision trees (denoted as R1, R2, …, Rₙ). Finally, by integrating the prediction results of all decision trees, the prediction output of the overall model is achieved [23].

2.3.4. Basic Principles of SVR

Support vector regression (SVR) is an extended form of the support vector machine (SVM) designed for regression tasks. Its core idea is to construct an ε-insensitive band so that most sample points fall within this area, thereby achieving robust fitting while maintaining the sparsity of the model. The optimization goal of SVR is to find a hyperplane such that all sample points lie as close to it as possible [24]. That is, for a given sample set
$$Q = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
We hope to get the following:
$$\min_{\omega, b, \zeta_i, \zeta_i^*} \ \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$$
At the same time, the constraints are
$$Y_i - (\omega^T x_i + b) \le \varepsilon + \zeta_i, \qquad (\omega^T x_i + b) - Y_i \le \varepsilon + \zeta_i^*, \qquad \zeta_i, \zeta_i^* \ge 0, \quad i = 1, 2, \ldots, n$$
where $\omega$ is the model weight and $b$ is the bias term; $x_i$ is the feature vector after kernel-function mapping; $\zeta_i, \zeta_i^*$ are slack variables; $C$ is the regularization parameter; and $\varepsilon$ is the insensitive loss threshold.
In SVR, the deviation ε is allowed, and the loss function is
$$L(Y_i, f(x_i)) = \begin{cases} 0 & \text{if } |Y_i - f(x_i)| \le \varepsilon \\ |Y_i - f(x_i)| - \varepsilon & \text{otherwise} \end{cases}$$
When $|Y_i - f(x_i)| \le \varepsilon$, the sample is treated as lossless; when it falls outside the band, i.e., $|Y_i - f(x_i)| > \varepsilon$, it incurs the loss $|Y_i - f(x_i)| - \varepsilon$. An $\varepsilon$ that is too large reduces the model's sensitivity and leads to "under-learning", while an $\varepsilon$ that is too small leads to "over-learning".
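The ε-insensitive loss reduces to a one-line helper (illustrative naming): deviations within ε cost nothing, and larger ones cost linearly in the excess.

```python
# epsilon-insensitive loss: zero inside the band, linear outside it.
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    return max(0.0, abs(y_true - y_pred) - eps)
```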

2.3.5. Hyperparameter Optimization

In order to improve the generalization ability of the model and suppress the risk of overfitting, this study used k-fold cross-validation combined with grid search strategy for hyperparameter optimization [25].
Specifically, the entire dataset is first divided into a training set and a test set. Subsequently, k-fold cross-validation is performed on the training data, that is, the training set is evenly divided into k subsets (folds), one of which is selected as the validation set each time, and the remaining k − 1 subsets are used for model training. This process is repeated k times to ensure that each subset is used as a validation set for training and evaluation. By combining the validation performance indicators of each fold, the stability and generalization ability of the model can be effectively evaluated.
In the hyperparameter tuning stage, k-fold cross-validation and grid search strategies are combined to traverse the preset hyperparameter combinations, and a complete cross-validation evaluation is performed under each combination. Finally, a set of hyperparameters with the best performance on the validation set is selected for the final training and testing of the model. This method significantly improves the overall performance of the model and ensures the robustness and reliability of the results. The specific process is shown in Figure 5.
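The split-then-search procedure can be sketched with scikit-learn's `GridSearchCV`. The estimator and parameter grid below are illustrative stand-ins (the paper tunes CatBoost, XGBoost, RF, and SVR), while the 7:3 split and cv=8 follow the text; the data is synthetic.

```python
# Sketch of 7:3 split + 8-fold cross-validated grid search.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((306, 10))           # stand-in spectral features
y = X @ rng.random(10)              # stand-in dye concentrations

# 7:3 split as in the paper (214 training / 92 test samples)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 3]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      cv=8, scoring="neg_mean_squared_error")
search.fit(X_tr, y_tr)             # evaluates every grid cell over 8 folds
```

`search.best_params_` then holds the combination with the best mean validation score, which is refit on the full training set automatically.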

2.4. Model Evaluation

After testing the primary assumptions of the model, it is essential to evaluate the usefulness and predictive ability of the proposed approach [26]. To comprehensively assess the performance of the dye ratio prediction model, this study employed four commonly used statistical metrics: mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R2). Among them, MSE and MAE measure the numerical deviation between the predicted and actual values, MAPE provides a unit-free assessment of relative error, and R2 reflects the model’s ability to explain the overall variability in the data. The definitions of these metrics are as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|Y_i - \hat{Y}_i\right|$$
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| \times 100$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2}$$
Here, $Y_i$ denotes the true dye concentration of the $i$-th sample, $\hat{Y}_i$ is the predicted value generated by the model, $\bar{Y}$ is the mean of the true values, and $n$ represents the total number of samples in the test set.
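The four metrics translate directly into numpy (illustrative helper; note that MAPE assumes no zero concentrations in the test set):

```python
# MSE, MAE, MAPE (%), and R^2 for a set of predictions.
import numpy as np

def metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100   # undefined if any y_true == 0
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mse, mae, mape, r2
```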
In addition, to further evaluate the model’s fitting performance across different concentration ranges, the following sections incorporate residual plots for structural error analysis. Residual plots provide a visual means to assess whether systematic bias exists at extreme concentration levels, thereby helping to identify potential issues such as overfitting or underfitting. Through this multi-perspective evaluation framework, the model is assessed comprehensively in terms of numerical accuracy, relative error performance, and structural fit, ensuring its practical applicability and robustness in predicting dye ratios for wood coloration.
In summary, the main objective of this section is to validate the model’s performance using appropriate evaluation methods to support accurate dye ratio prediction. In this process, model construction and performance evaluation constitute the two critical phases of the modeling workflow. The overall methodological framework is illustrated in Figure 6.

3. Results

3.1. Evaluation of Spectral Data Denoising and Smoothing

In this study, 822 sets of hyperspectral samples (a total of 115 bands) of Pinus sylvestris var. mongolica were systematically preprocessed, including outlier detection, smoothing and denoising, and scale normalization. The results show that all reflectance data points are within the reasonable range of [244.48, 852.76], and no outliers are found, indicating that the original data collection quality is high. After processing with the Savitzky–Golay algorithm (window width 11 points), the standard deviation of the first five channels of the sample is reduced from 4.5077 to 3.7025, the noise is significantly weakened, and the spectral line profile is well maintained. After normalization, all data are successfully mapped to the [0, 1] interval, which effectively unifies the dimension and improves the data consistency, providing a stable and reliable input basis for subsequent modeling.

3.2. Key Band Screening Results Verification

In this study, the random forest algorithm was employed to analyze the feature importance of hyperspectral reflectance data within the 400–700 nm range, aiming to identify the key sensitive bands that contribute most to dye concentration prediction. As shown in Figure 7, the importance scores of the 400–450 nm, 550–600 nm, and 600–650 nm bands were 0.330, 0.266, and 0.201, respectively, with a cumulative contribution rate reaching 79.7%, significantly higher than the remaining bands. This “80/20 distribution” pattern supports excluding the less informative bands once the cumulative importance of the selected bands approaches 80%. A similar threshold-based strategy was adopted by Sarfarazi et al. in spectral prediction studies for materials [13], confirming the academic validity and practical value of this feature selection approach (detailed feature importance scores for all wavelengths are provided in Supplementary Data S2).
Notably, the 400–450 nm and 550–600 nm bands exhibited the highest importance scores, corresponding to the blue and yellow-green regions of the spectrum, respectively. This aligns well with the characteristic absorption behavior of the three primary dyes used. Meanwhile, the red region (600–650 nm) also demonstrated the third highest importance, further validating the intrinsic connection between the dyeing process and the trichromatic composition of visible light.
To further verify the physical rationality of these sensitive bands, three monochromatic dyes (reactive red, reactive yellow, and reactive blue) were used to dye Pinus sylvestris var. mongolica veneer in groups at concentration gradients of 0.25, 0.50, 0.75, and 1.00 g·L−1, and hyperspectral reflectance data were collected under each condition. The results showed that as dye concentration increased, reflectance in the 400–450 nm, 550–600 nm, and 600–650 nm bands decreased most markedly, exhibiting a strong concentration-to-reflectance response that is highly consistent with the feature importance analysis. Combining the feature importance evaluation with the experimental response trends, 400–450 nm, 550–600 nm, and 600–650 nm were determined as the key bands for dyeing ratio prediction in this study.
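The cumulative-importance cut-off described above can be sketched as a small helper. The function name `select_bands` and the numeric scores below are illustrative (the fourth and fifth values stand in for the aggregated remainder of the spectrum), not taken from the paper's code:

```python
import numpy as np

def select_bands(importances, band_names, threshold=0.80):
    """Keep the highest-importance bands until their cumulative importance
    first reaches `threshold` (the paper's ~80% cut-off)."""
    imp = np.asarray(importances, dtype=float)
    order = np.argsort(imp)[::-1]          # bands sorted by descending importance
    cum = np.cumsum(imp[order])
    k = int(np.searchsorted(cum, threshold)) + 1
    return [band_names[i] for i in order[:k]]
```

With synthetic scores in which the top three bands sum to just over 0.80, the helper returns exactly those three bands; with the paper's reported 79.7% cumulative score, a threshold marginally below 0.80 would be needed to reproduce the three-band selection.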

3.3. Analysis of Dye Concentration Optimization Results

As the dye concentration increases, the distribution of dye particles on the wood surface changes significantly. As shown in Figure 8, the particles formed by the red dye shift from dispersion to aggregation on the wood surface as the treatment concentration rises. Specifically, at low concentrations the particles are relatively sparse and evenly distributed; as the concentration increases, the particles gradually accumulate and the number of surface aggregates grows markedly, indicating that the adsorption capacity of the dye on the wood surface increases with concentration. Figure 8a–d corresponds to the surface microstructure of the dyed veneer at dye concentrations of 0.25, 0.50, 0.75, and 1.00 g·L−1, respectively. To avoid the influence of abnormal particle distribution under extreme concentration conditions on the spectral modeling results, the experimental concentration was kept within 0.80 g·L−1 to ensure the stability and predictability of the spectral response.

3.4. Analysis of Spectral Characteristic Parameter Extraction Results

To further understand how the spectral characteristic parameters vary with dye concentration, we first fixed the concentrations of two dyes in the mixed dye solution and then observed how the characteristic parameters changed with the concentration of the third dye. To ensure the generality of the observations, different gradient values were set for the two fixed dye concentrations. The results are shown in Table 2, Table 3 and Table 4.
The experimental results show that when the red dye concentration increases from 0.0 to 0.5 g·L−1, the blue edge area decreases nonlinearly from −313.8 to −677.9 under certain conditions, and the reflectivity of the green peak and red valley both show a systematic downward trend. In the process of increasing the yellow dye concentration, the yellow edge area demonstrated a continuous decline. For example, when the red/blue concentration is 0.3, the yellow edge area decreases from −858.4 to −1307.4. The increase in the blue dye concentration leads to a significant increase in the blue edge amplitude, showing a significant response to the short-wave band area (such as from 9.89 to 15.21), and multiple spectral characteristic parameters also show a coordinated change trend.
The above results show that the spectral characteristic parameters have good responsiveness to the change in dye concentration, especially in the typical concentration range. The parameters such as blue edge amplitude, green peak reflectance, and red valley reflectance can effectively capture the changes in dye ratio and serve as valuable inputs for the model. Based on these findings, we constructed an enhanced dataset by integrating the high-importance spectral characteristic parameters (Blue_Area, Green_Peak, Red_Valley) with the previously identified sensitive bands (400–450 nm, 550–600 nm, 600–650 nm), creating a comprehensive feature set for improved prediction accuracy (see Supplementary Data S3).
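The descriptors listed in Table 1 can be computed directly from a reflectance spectrum and its first derivative. The sketch below is an assumed implementation (the function and key names are ours); the wavelength windows follow the definitions in Table 1:

```python
import numpy as np

def _trapezoid(y, x):
    # Manual trapezoidal integration (avoids NumPy-version API differences).
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def spectral_params(wl, refl):
    """Table-1-style descriptors: blue edge amplitude/area, green peak, red valley."""
    d = np.gradient(refl, wl)              # first-derivative spectrum
    blue = (wl >= 490) & (wl <= 530)       # blue edge range
    green = (wl >= 510) & (wl <= 560)      # green peak range
    red = (wl >= 640) & (wl <= 680)        # red trough range
    return {
        "Blue_Amplitude": float(d[blue].max()),
        "Blue_Area": _trapezoid(d[blue], wl[blue]),
        "Green_Peak": float(refl[green].max()),
        "Red_Valley": float(refl[red].min()),
    }
```

Applied over a concentration series, these four values reproduce the kind of columns tabulated in Tables 2–4.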

3.5. Analysis of Hyperparameter Optimization Results

As the k-fold cross-validation procedure makes clear, the training dataset strongly influences the final hyperparameters, so appropriate values must be selected according to the data type and range. We used the data before and after preprocessing as model inputs and compared and analyzed the optimal hyperparameters obtained in each case. The hyperparameter optimization results before and after data processing are shown in Table 5 and Table 6, respectively.
Table 5 shows that, with the original data as input, the hyperparameters of the CatBoost, XGBoost, RF, and SVR models tend toward conservative configurations, such as lower learning rates, fewer iterations, and smaller depth parameters. In Table 6, the optimal hyperparameter combination of each model after data processing is generally more complex (e.g., greater depth and more iterations), reflecting stronger fitting ability and expressive potential under the optimized data conditions.
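The grid-search-within-k-fold procedure can be sketched generically. Everything below is an assumption-laden illustration: a closed-form ridge regressor stands in for CatBoost (so the sketch stays dependency-free), and all names (`grid_search_cv`, `ridge_fit_predict`) are ours, not the authors':

```python
import numpy as np
from itertools import product

def kfold_indices(n, k, seed=0):
    """Random, non-overlapping folds covering all n samples."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def ridge_fit_predict(X_tr, y_tr, X_te, alpha=1.0):
    # Stand-in learner (closed-form ridge); the paper tunes CatBoost here.
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)
    return X_te @ w

def grid_search_cv(X, y, grid, fit_predict, k=6, seed=0):
    """Exhaustive grid search scored by mean validation MSE over k folds."""
    folds = kfold_indices(len(y), k, seed)
    best_score, best_params = np.inf, None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        fold_mse = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([f for j, f in enumerate(folds) if j != i])
            pred = fit_predict(X[trn], y[trn], X[val], **params)
            fold_mse.append(np.mean((pred - y[val]) ** 2))
        score = float(np.mean(fold_mse))
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

Swapping the stand-in for a CatBoost fit-and-predict wrapper (or using scikit-learn's `GridSearchCV`) recovers the procedure the paper describes.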

3.6. K Cross-Validation Results Analysis

This study adopts a cross-validation framework spanning k = 2 to k = 10 and evaluates the impact of data preprocessing (including feature standardization and outlier removal) on the generalization ability of the model through a stratified resampling strategy based on target-variable quantiles. Models were then trained and tested at k = 2, 4, 6, and 8 using random, non-overlapping folds, and the corresponding performance indicators were recorded. Finally, stratified repeated cross-validation was used to evaluate model effectiveness, as shown in Table 7.
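Stratifying a regression target by quantile bins, as described above, can be sketched as follows. The binning granularity (quartiles) and the function name are our assumptions; the idea is only that every fold samples the full concentration range:

```python
import numpy as np

def stratified_regression_folds(y, k, n_bins=4, seed=0):
    """Assign each sample to one of k folds, balancing target-quantile
    bins across folds so every fold spans the target's range."""
    y = np.asarray(y, dtype=float)
    qs = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior quantiles
    bins = np.digitize(y, qs)                                 # quantile bin per sample
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for b in range(n_bins):
        members = np.where(bins == b)[0]
        rng.shuffle(members)
        fold_of[members] = np.arange(len(members)) % k        # round-robin per bin
    return fold_of
```

Compared with purely random folds, this keeps low- and high-concentration samples represented in every validation split, which matters for the small sample size reported here.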
The results show that the preprocessed CatBoost model achieved optimal predictive performance under 6-fold cross-validation (MSE = 0.00271 g2·L−2, MAPE = 3.134%, MAE = 0.0349 g·L−1), representing a 28.7% reduction in MSE compared to the untreated baseline (p < 0.01, paired t-test). As the number of folds increased to k = 8, the model’s mean absolute percentage error (MAPE) varied by less than 0.4 percentage points (from 3.13% at k = 6 to 3.49% at k = 8), indicating that hyperparameter optimization effectively suppressed overfitting. Ultimately, the relative prediction error of CatBoost stabilized within ±2.5% (i.e., the percentage deviation between the predicted and actual concentrations for individual samples), meeting the accuracy requirements for engineering cost estimation (ΔE < 1.75 corresponds to <5% concentration error).

3.7. Model Evaluation Results

Table 8 reports the mean squared error and running speed of the unoptimized models trained on the dataset before and after data processing. As the core algorithm of this study, CatBoost demonstrated excellent prediction accuracy and computational efficiency after spectral optimization. With sensitive band screening and feature descriptor extraction, the MSE of CatBoost fell by 10.6%, outperforming XGBoost and random forest in absolute error minimization. Notably, its training speed improved significantly, owing to its ordered boosting mechanism and the natural advantages of its symmetric tree structure in processing spectral gradients. Among the tree models, CatBoost achieved both the lowest MSE and the fastest inference time, verifying its inherent advantages in processing high-dimensional optical data with quasi-continuous features. This performance gain further shows that the synergy between feature engineering and CatBoost's gradient bias suppression makes accurate real-time monitoring of dye concentration feasible.
The prediction models were then compared under the same preprocessing pipeline (sensitive band screening and spectral feature enhancement), with results shown in Table 9. CatBoost demonstrated excellent prediction ability in estimating the wood dye ratio. On the independent test set (92 groups of samples), the MSE of CatBoost was 0.00271, which is 9.9% lower than that of XGBoost, 12.3% lower than that of random forest, and 27.1% lower than that of SVR. This advantage is reflected in all evaluation indicators: the MAE was 0.0349, versus 0.0353–0.0405 for the other models, and the MAPE was 3.13%, significantly better than the 3.24–5.65% range of the other models. The average error of CatBoost relative to the true value is below 6%, giving it good error tolerance in practical applications.
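For reference, the three evaluation indicators quoted throughout Sections 3.6 and 3.7 are computed as follows (a standard definition sketch, not the authors' code; the key names are ours):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MAE, and MAPE (%) for a set of concentration predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "MSE": float(np.mean(err ** 2)),                      # g^2.L^-2
        "MAE": float(np.mean(np.abs(err))),                   # g.L^-1
        "MAPE_%": float(np.mean(np.abs(err / y_true)) * 100),  # requires y_true != 0
    }
```

Note that MAPE is undefined for zero-concentration samples, so dye-free controls must be excluded (or handled separately) before computing it.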
Figure 9 demonstrates the exceptional prediction accuracy of the GPU-accelerated CatBoost model across all three dye systems, with R2 values exceeding 0.95 (red: 0.9605, yellow: 0.9554, blue: 0.9541). The residual scatter plots reveal randomly distributed points around the zero line without systematic patterns or heteroscedasticity, indicating that the model maintains consistent prediction accuracy across the entire concentration range. The residual histograms approximate normal distributions with near-zero means and minimal standard deviations (σ < 0.02), confirming the absence of systematic bias. This remarkable performance can be attributed to CatBoost’s ordered boosting mechanism, which effectively captures the nonlinear spectral-concentration relationships while suppressing gradient estimation bias, particularly crucial for the small-sample scenario (n = 306) in this study. The uniform prediction accuracy across different dye types further validates the model’s robustness in handling the distinct absorption characteristics of each dye system, making it a reliable tool for industrial formulation prediction.
The order of model performance is CatBoost > XGBoost > random forest > SVR, which verifies the compatibility advantage of tree-based models in processing spectral enhancement features. In particular, the ordered boosting mechanism adopted by CatBoost effectively reduces the gradient deviation by 18.7%, which is more accurate than the traditional implementation. In addition, CatBoost training time is 36.4% faster than that of the baseline model, and MAPE is controlled within 4%, making it a feasible solution for the error tolerance requirement of <5% in industrial real-time monitoring systems.

4. Discussion

4.1. Analysis of Sensitive Band Mechanism

This study identified 400–450 nm, 550–600 nm, and 600–650 nm as the key sensitive bands for dye concentration prediction through random forest feature importance evaluation (Formulas (1)–(3); Figure 7), with importance scores of 0.330, 0.266, and 0.201, respectively. This result is highly consistent with the complementary-color absorption characteristics of the dye molecules: the 400–450 nm band mainly responds to the absorption of blue light by the reactive yellow dye (X-RG), whose azo group (-N=N-) has a strong absorption peak at 450 nm (Figure 4b); the 550–600 nm band corresponds to the absorption of green light by the reactive red dye (X-3B), whose anthraquinone structure produces characteristic absorption at 550 nm (Figure 4a); and the 600–650 nm band reflects the absorption of orange-red light by the reactive blue dye, which is related to the π→π* electronic transition of its copper phthalocyanine complex (Figure 4c).
As the dye concentration increases (0.25→1.0 g·L−1), the reflectivity of the above bands decreases systematically (Figure 4), confirming its concentration sensitivity. It is worth noting that when the concentration is >0.75 g·L−1, the SEM image demonstrates that the dye particles aggregate on the wood surface (Figure 8d), resulting in nonlinear scattering in the 600–650 nm band, which is consistent with the relatively low importance score of this band (0.201).

4.2. Advantages of CatBoost Model

CatBoost has demonstrated remarkable advantages in the task of wood dyeing formulation prediction, primarily due to its ordered boosting mechanism, symmetric tree structure, and support for high-dimensional inputs. As illustrated in Figure 10, these four core features collectively endow CatBoost with the following benefits:
Strong Noise Suppression Capability:
CatBoost leverages gradient estimation based on time-ordered residuals, effectively avoiding the bias introduced by global residual averaging in traditional gradient boosting trees. This approach significantly mitigates the impact of redundant noise in the 650–700 nm spectral range during model training. Results show that with a small dataset of 306 samples, CatBoost achieved a mean squared error (MSE) as low as 0.00271 (see Table 9), representing a 9.9% reduction compared to XGBoost (MSE = 0.00301), thereby enhancing both model robustness and prediction accuracy.
Superior Feature Interaction Handling:
By employing target-based encoding, CatBoost dynamically associates spectral band features with dye concentrations, replacing traditional one-hot encoding and enabling more efficient handling of high-dimensional continuous inputs. This strategy not only enhances the model’s ability to capture complex nonlinear interactions but also significantly improves training efficiency. The training time was reduced to 3.56 s—36.4% faster than the baseline model (see Table 8)—substantially lowering computational costs.
Enhanced Nonlinear Adaptability:
The symmetric tree splitting strategy of CatBoost allows precise identification and modeling of light scattering effects that emerge when dye concentration exceeds 0.75 g/L (as shown in Figure 8d). This contributes to a consistently low mean absolute percentage error (MAPE) of 3.13% under k = 6 cross-validation. Furthermore, the prediction fluctuation was reduced by 52% compared to the SVR model (see Table 7), ensuring greater stability and reliability of predictions.
Improved Model Interpretability:
CatBoost integrates SHAP (SHapley Additive exPlanations) to quantify the contribution of each spectral band to the prediction outcome, offering transparent insights for industrial applications. CatBoost Model RF-SHAP Consistency Analysis is shown in Figure 11. SHAP analysis revealed that the spectral bands of 400–450 nm, 550–600 nm, and 600–650 nm contributed 26.8%, 22.1%, and 16.3% to the model’s performance, respectively. These findings align closely with the results of random forest-based feature selection, showing a high consistency of 89.3%, which validates the effectiveness of the feature selection strategy. This enhanced interpretability allows industrial operators to better understand and trust the prediction mechanisms.
In summary, CatBoost combines efficient noise suppression, excellent feature interaction processing, and powerful nonlinear modeling capability, making it particularly suitable for the high-dimensional, small-sample, continuous spectral features involved in this study and the preferred model for intelligent prediction of wood dyeing formulations.
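The RF–SHAP consistency figure quoted above (89.3%) is computed by an unspecified metric; one plausible way to quantify agreement between two importance vectors is cosine similarity, sketched below with the band-level scores reported in this section. This is our illustrative metric, not necessarily the authors' calculation:

```python
import numpy as np

def importance_consistency(imp_a, imp_b):
    """Cosine similarity (%) between two normalized feature-importance vectors."""
    a = np.asarray(imp_a, dtype=float)
    b = np.asarray(imp_b, dtype=float)
    a = a / a.sum()   # normalize so both vectors are comparable distributions
    b = b / b.sum()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))) * 100.0
```

Applied to the RF scores (0.330, 0.266, 0.201) and SHAP contributions (26.8%, 22.1%, 16.3%) of the three sensitive bands, this metric indicates near-perfect rank and magnitude agreement, consistent with the qualitative claim of high RF–SHAP consistency.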

4.3. Discussion on Industrial Application Potential

The dye ratio prediction method based on the CatBoost model proposed in this study achieves high-precision concentration inversion under high-dimensional spectral input in a multi-dye mixed system, with a mean absolute error (MAE) lower than 0.035 and a prediction time of less than 3.6 s. It has strong real-time response capabilities and generalization stability. Combined with its automatic screening of key bands and feature compression capabilities, it can greatly reduce the data processing pressure at the industrial acquisition end [27].
The CatBoost prediction model (MAE = 0.0349 g·L−1, response time 3.56 s) proposed in this study can integrate multispectral sensors to build a real-time closed-loop control system in the wood dyeing production line: by dynamically adjusting the dye liquor feed (PID parameter Kp = 0.8), the color difference ΔE is stably controlled within 1.75 (reaching the high-end market ΔE < 2.5 standard [3]), reducing the amount of dye by 25% (compared with the empirical formula), and reducing the emission of 500 tons of wastewater containing active dyes for an annual production of 10,000 tons of dyed veneer [28]. This solution empowers small and medium-sized forest enterprises with a hardware cost of < USD 1500, increases the added value of fast-growing materials such as Larix gmelinii by 300%, directly alleviates the annual supply and demand gap of 20 million m3 of precious wood in China, and promotes the sustainable utilization of forestry resources.
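The closed-loop dye-feed adjustment mentioned above (PID with Kp = 0.8) can be sketched as a toy simulation. Only Kp comes from the text; the integral gain, plant model, and all names are illustrative assumptions:

```python
def pid_step(error, state, kp=0.8, ki=0.05, kd=0.0, dt=1.0):
    """One discrete PID update; `state` carries (integral, previous_error)."""
    integral, prev = state
    integral += error * dt
    derivative = (error - prev) / dt
    u = kp * error + ki * integral + kd * derivative
    return u, (integral, error)

# Toy closed loop: drive a measured dye concentration toward a 1.0 g/L setpoint.
setpoint, conc, state = 1.0, 0.0, (0.0, 0.0)
for _ in range(300):
    u, state = pid_step(setpoint - conc, state)
    conc += 0.1 * u   # simplistic plant: feed-rate command u shifts concentration
```

In the envisioned production line, the "measurement" would be the CatBoost concentration prediction from the multispectral sensor, and the controller output would drive the dye liquor feed pump.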
However, the generalization of this approach to diverse wood species remains a critical consideration for broader industrial adoption. Different anatomical structures significantly influence dye–wood interactions and subsequent spectral responses. For instance, hardwoods with smaller vessel diameters and tyloses may exhibit reduced dye penetration depths compared to the studied Pinus sylvestris var. mongolica, potentially requiring recalibration of the sensitive bands identified in this study. The 400–450 nm band response, which showed the highest importance score (0.364) in our model, could shift by 20–30 nm in species with higher extractive content, as these compounds often exhibit strong absorption in the blue spectrum region [29]. Similarly, woods with porosity exceeding 65% may demonstrate increased spectral variance due to uneven dye distribution, potentially reducing prediction accuracy by 15–20% based on preliminary tests with fast-growing poplar samples.
The model’s performance under varying grain orientations also warrants consideration. While our training data focused on longitudinal sections, industrial processing often involves mixed grain presentations. Radial and tangential surfaces exhibit different light scattering properties due to varying ray cell exposure, which could introduce prediction errors if not properly accounted for. Furthermore, industrial-scale implementation faces environmental variability challenges, where temperature fluctuations (±10 °C) and humidity variations (±20% RH) may induce spectral drift, suggesting the need for adaptive calibration protocols or temperature-compensated acquisition systems [30]. Despite these limitations, the demonstrated framework provides a robust foundation for single-species production scenarios common in many SME forest enterprises, with the potential for extension through transfer learning approaches and species-specific model libraries in future developments.

5. Conclusions

This study establishes a robust and interpretable framework for predicting wood dyeing formulations by integrating hyperspectral reflectance data (400–700 nm) with CatBoost modeling. The key findings, technical contributions, and future research directions are summarized as follows:
1.
Core Findings
Accurate formulation prediction was achieved by combining hyperspectral data with an optimized CatBoost model, yielding strong performance under small-sample conditions (n = 306).
Key sensitive bands (400–450 nm, 550–600 nm, and 600–650 nm) were identified using random forest analysis, revealing a consistent decrease in reflectance with increasing dye concentration.
SEM analysis confirmed that nonlinear scattering at high dye concentrations (>0.75 g·L−1) results from dye particle aggregation, supporting the observed spectral response mechanism.
CatBoost model outperformed XGBoost, random forest (RF), and support vector regression (SVR), achieving an MSE of 0.00271 g2/L2, an MAE of 0.0349 g·L−1, and a MAPE of 3.13%. Its ordered boosting strategy reduced gradient bias by 18.7% and captured nonlinear spectral interactions more effectively.
2.
Major Contributions
Multi-scale mechanism validation: A hybrid framework combining hyperspectral imaging (HSI) and scanning electron microscopy (SEM) was proposed to elucidate the link between microstructure and spectral behavior.
Feature compression strategy: A random forest–driven band selection method was introduced to reduce data dimensionality while preserving physically interpretable descriptors.
Novel algorithm application: This work represents the first successful application of CatBoost to wood dyeing formulation prediction, achieving practical outcomes including:
Average color difference (ΔE) < 1.75;
Dye consumption reduction of approximately 25%.
3.
Future Research Directions
Species generalization: Extend the proposed framework to additional wood species such as poplar and eucalyptus.
System scalability: Apply the model to diverse dye systems and alternative processing conditions.
Algorithmic enhancement: Investigate the extraction of higher-order spectral features and the integration of advanced strategies, including deep neural networks, transfer learning, and edge-computing architectures.
Industrial deployment: Develop real-time, closed-loop control solutions for intelligent wood dyeing systems in practical manufacturing environments.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/f16081279/s1: Data S1: Complete raw hyperspectral dataset containing 306 samples with full spectral range; Data S2: Dataset containing only the sensitive bands (400–450 nm, 550–600 nm, 600–650 nm); Data S3: Combined dataset integrating sensitive bands with spectral characteristic parameters.

Author Contributions

X.G. was responsible for study conceptualization, research design (including determination of research direction, methodology, and experimental protocols), and oversight of the research process. R.X. drafted the initial manuscript. Z.H. performed the data analysis. S.C. reviewed the manuscript and assisted with submission procedures. X.C. conducted language editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the National Natural Science Foundation of China (grant number 32171691); the Manufacturing Innovation Talent Project supported by the Harbin Science and Technology Bureau (grant number CXRC20221110393); and the Open Research Grant of the Key Laboratory of Sustainable Forest Ecosystem Management, Ministry of Education, Northeast Forestry University (grant number KFJJ2023YB03).

Data Availability Statement

Data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Boran, S.; Kirsal, Y.E.; Kamil, D. Comparative evaluation and comprehensive analysis of machine learning models for regression problems. Data Intell. 2022, 4, 620–652. [Google Scholar] [CrossRef]
  2. Liu, Y.; Song, K. Study on the smart dyeing and performance of poplar veneers modified by deep eutectic solvents. Forests 2024, 15, 2120. [Google Scholar] [CrossRef]
  3. Sharma, G. Digital Color Imaging Handbook; CRC Press: Boca Raton, FL, USA, 2003; pp. 320–322. [Google Scholar]
  4. Feng, L.; Caiting, C.; Zhiping, M. A novel approach for recipe prediction of fabric dyeing based on feature-weighted support vector regression and particle swarm optimisation. Color. Technol. 2022, 138, 495–508. [Google Scholar]
  5. Pereira Ribeiro Teodoro, L.; Estevão, R.; Santana, D.C.; Oliveira, I.C.d.; Lopes, M.T.G.; Azevedo, G.B.d.; Rojo Baio, F.H.; da Silva Junior, C.A.; Teodoro, P.E. Eucalyptus species discrimination using hyperspectral sensor data and machine learning. Forests 2024, 15, 39. [Google Scholar] [CrossRef]
  6. Liu, Y.M.; Li, Y.G.; Shi, L.; Li, Y.Y.; Liu, H. Detection of the stem-boring damage by pine shoot beetle (Tomicus spp.) to Yunnan pine (Pinus yunnanensis Franch.) using UAV hyperspectral data. Front. Plant Sci. 2025, 16, 1514580. [Google Scholar]
  7. Hu, X.; Jiang, F.; Qin, X.; Huang, S.; Meng, F.; Yu, L. Exploration of suitable spectral bands and indices for forest fire severity evaluation using ZY-1 hyperspectral data. Forests 2025, 16, 640. [Google Scholar] [CrossRef]
  8. Li, Y.; Chen, Q.; Huang, K.; Wang, Z. The accuracy improvement of sap flow prediction in Picea crassifolia Kom. based on the back-propagation neural network model. Hydrol. Process. 2022, 36, e14490. [Google Scholar] [CrossRef]
  9. Guan, X.; Wu, Y.; Yang, Q. Research on wood color detection algorithm optimized by improved particle swarm optimization. China For. Prod. Ind. 2024, 61, 1–7. [Google Scholar] [CrossRef]
  10. Guan, X.; Li, W.; Huang, Q.; Huang, J. Intelligent color matching model for wood dyeing using genetic algorithm and extreme learning machine. J. Intell. Fuzzy Syst. 2022, 42, 4907–4917. [Google Scholar] [CrossRef]
  11. Wu, M.; Guan, X.; Li, W.; Huang, Q. Color spectra algorithm of hyperspectral wood dyeing using particle swarm optimization. Wood Sci. Technol. 2020, 55, 49–66. [Google Scholar] [CrossRef]
  12. Guan, X.; Chen, X.; He, Z.; Cui, H. Predicting dye formulations and ultrasonic pretreatment power in wood staining with a SAT fusion-gated BiLSTM model optimized by improved dung beetle algorithm. Appl. Sci. 2025, 15, 1522. [Google Scholar] [CrossRef]
  13. Sarfarazi, S.; Mascolo, I.; Modano, M.; Guarracino, F. Application of artificial intelligence to support design and analysis of steel structures. Metals 2025, 15, 408. [Google Scholar] [CrossRef]
  14. Wang, Q.; Yan, C.; Zhang, Y.; Xu, Y.; Wang, X.; Cui, P. Numerical simulation and Bayesian optimization CatBoost prediction method for characteristic parameters of veneer roller pressing and defibering. Forests 2024, 15, 2173. [Google Scholar] [CrossRef]
  15. GB/T 1931-2009; Method of Test for Moisture Content of Wood. National Forestry and Grassland Administration: Beijing, China, 2009.
  16. Duzce University. Determination of Color-Changing Effects of Bleaching Chemicals on Some Heat-Treated Woods; Technical Report 5926; Duzce University: Duzce, Turkey, 2020. [Google Scholar]
  17. Zhang, L.; Li, S.; Wang, J. Deep signal-dependent denoising noise algorithm. Electronics 2023, 12, 1201. [Google Scholar] [CrossRef]
  18. Hu, J. A study of accounting teaching feature selection and importance assessment based on random forest algorithm. Appl. Math. Nonlinear Sci. 2024, 9, 1–17. [Google Scholar] [CrossRef]
  19. Schubert, M.; Luković, M.; Christen, H. Prediction of mechanical properties of wood fiber insulation boards as a function of machine and process parameters by random forest. Wood Sci. Technol. 2020, 54, 703–713. [Google Scholar] [CrossRef]
  20. Wang, Y.; Zhang, W.; Gao, R.; Jin, Z.; Wang, X. Recent advances in the application of deep learning methods to forestry. Wood Sci. Technol. 2021, 55, 1171–1202. [Google Scholar] [CrossRef]
  21. Li, J.; An, X.; Li, Q.; Wang, C.; Yu, H.; Zhou, X.; Geng, Y.-A. Application of XGBoost algorithm in the optimization of pollutant concentration. Atmos. Res. 2022, 276, 106238. [Google Scholar] [CrossRef]
  22. Xue, H.; Xu, X.; Meng, X. Variety classification and identification of maize seeds based on hyperspectral imaging method. Optoelectron. Lett. 2025, 21, 234–241. [Google Scholar] [CrossRef]
  23. Longo, L.B.; Brüchert, F.; Becker, G.; Sauter, U.H. Predicting Douglas-fir knot size in the stand: A random forest model based on CT and field measurements. Wood Sci. Technol. 2021, 56, 509–529. [Google Scholar] [CrossRef]
  24. Chen, S.; Wang, J.; Liu, Y.; Chen, Z.; Lei, Y. The relationship between color and mechanical properties of heat-treated wood predicted based on support vector machine model. Holzforschung 2022, 76, 994–1002. [Google Scholar] [CrossRef]
  25. Rimal, Y.; Sharma, N.; Alsadoon, A. The accuracy of machine learning models relies on hyperparameter tuning: Student result classification using random forest, randomized search, grid search, Bayesian, genetic, and Optuna algorithms. Multimed. Tools Appl. 2024, 83, 74349–74364. [Google Scholar] [CrossRef]
  26. Zhao, J.; Wang, J.; Anderson, N. Machine learning applications in forest and biomass supply chain management: A review. Int. J. For. Eng. 2024, 35, 371–380. [Google Scholar] [CrossRef]
  27. Li, J.; Liu, J.; Zhang, Y.; Huang, Y. Hyperspectral imaging technology in wood industry: A review. Wood Sci. Technol. 2022, 56, 499–518. [Google Scholar] [CrossRef]
  28. Yaseen, D.A.; Scholz, M. Textile dye wastewater characteristics and constituents of synthetic effluents: A critical review. Int. J. Environ. Sci. Technol. 2019, 16, 1193–1226. [Google Scholar] [CrossRef]
  29. Gao, C.; Cui, X.; Matsumura, J. Multidimensional exploration of wood extractives: A review of compositional analysis, decay resistance, light stability, and staining applications. Forests 2024, 15, 1782. [Google Scholar] [CrossRef]
  30. Popescu, C.-M.; Zeniya, N.; Endo, K.; Genkawa, T.; Matsuo-Ueda, M.; Obataya, E. Assessment of structural differences between water-extracted and non-extracted hydro-thermally treated spruce wood by NIR spectroscopy. Forests 2021, 12, 1689. [Google Scholar] [CrossRef]
Figure 1. Workflow of the proposed hyperspectral-machine learning framework.
Figure 2. Experimental workflow and molecular structures of dyes used.
Figure 3. Spectral data preprocessing.
Figure 4. Reflectance of dyes at different concentrations: (a) Red dye; (b) Yellow dye; (c) Blue dye.
Figure 5. Flow chart of k-fold cross-validation technique.
Figure 6. Flow chart of model establishment and optimization.
Figure 7. Sensitive wavelength selection. (a) Average band importance for all dyes. (b) Band importance heatmap by dye type. (c) Feature importance distribution across wavelengths.
Figure 8. Distribution of dye particles on dyed veneer surface at different concentrations of red dye: (a) 0.25 g·L−1, (b) 0.50 g·L−1, (c) 0.75 g·L−1, (d) 1.00 g·L−1.
Figure 9. Residual plots illustrating the prediction errors for red, yellow, and blue dye concentrations based on the selected model.
Figure 10. Advantages of the CatBoost model.
Figure 11. CatBoost model RF-SHAP consistency analysis.
Table 1. Spectral characteristic parameters.

| No. | Parameter | Name | Definition and Algorithm |
|---|---|---|---|
| 1 | SD_y | Yellow edge area | Area enclosed by the first-order derivative spectrum over the yellow edge range |
| 2 | D_b | Blue edge amplitude | Maximum of the first-order derivative spectrum from 490 to 530 nm |
| 3 | λ_b | Blue edge position | Wavelength at which D_b occurs |
| 4 | SD_b | Blue edge area | Area enclosed by the first-order derivative spectrum over the blue edge range |
| 5 | R_g | Green peak reflectivity | Maximum of the original spectrum from 510 to 560 nm |
| 6 | λ_g | Green peak position | Wavelength at which R_g occurs |
| 7 | λ_y | Yellow edge position | Wavelength at which D_y occurs |
| 8 | D_y | Yellow edge amplitude | Maximum of the first-order derivative spectrum from 560 to 640 nm |
| 9 | λ_0 | Red trough position | Wavelength of the minimum value from 640 to 680 nm |
| 10 | R_0 | Red trough reflectivity | Minimum of the original spectrum from 640 to 680 nm |
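The descriptors in Table 1 reduce to two operations — differentiation of the spectrum and a windowed extremum search. A minimal pure-Python sketch (helper names and the toy spectrum are illustrative, not from the paper):

```python
# Sketch of the Table 1 descriptors: band limits follow the table
# (blue edge 490-530 nm, green peak 510-560 nm, red trough 640-680 nm).

def first_derivative(wavelengths, reflectance):
    """First-order derivative spectrum via forward differences."""
    return [
        (reflectance[i + 1] - reflectance[i]) / (wavelengths[i + 1] - wavelengths[i])
        for i in range(len(reflectance) - 1)
    ]

def band_extremum(wavelengths, values, lo, hi, mode=max):
    """Extremum of `values` within [lo, hi] nm, returned with its wavelength."""
    window = [(v, w) for w, v in zip(wavelengths, values) if lo <= w <= hi]
    value, position = mode(window)
    return value, position

# Toy 5 nm-resolution spectrum (placeholder reflectance, linear in wavelength)
wl = list(range(400, 701, 5))
refl = [0.1 + 0.001 * (w - 400) for w in wl]

deriv = first_derivative(wl, refl)
D_b, lam_b = band_extremum(wl[:-1], deriv, 490, 530)      # blue edge amplitude/position
R_g, lam_g = band_extremum(wl, refl, 510, 560)            # green peak reflectivity/position
R_0, lam_0 = band_extremum(wl, refl, 640, 680, mode=min)  # red trough reflectivity/position
```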
Table 2. Spectral characteristic parameters of the veneer surface under varying red dye concentration.

| No. | Red | Yellow | Blue | Blue_Area | Green_Peak | Red_Valley |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1543.900 | 5795.841 | 4785.029 |
| 2 | 0 | 0.2 | 0.2 | 703.056 | 4253.786 | 3386.752 |
| 3 | 0 | 0.25 | 0.25 | 836.887 | 4762.195 | 3905.199 |
| 4 | 0 | 0.3 | 0.3 | 473.900 | 3524.336 | 2494.026 |
| 5 | 0.1 | 0.3 | 0.3 | −737.426 | 2075.853 | 1619.847 |
| 6 | 0.2 | 0.2 | 0.2 | −123.631 | 3030.012 | 2464.250 |
| 7 | 0.2 | 0.3 | 0.3 | −877.938 | 1417.168 | 1145.891 |
| 8 | 0.25 | 0 | 0 | −737.336 | 3105.538 | 2837.900 |
| 9 | 0.25 | 0.2 | 0.2 | −325.647 | 2690.024 | 2205.900 |
| 10 | 0.25 | 0.25 | 0.25 | −740.726 | 2340.624 | 1912.361 |
| 11 | 0.3 | 0.25 | 0.25 | −834.430 | 2005.415 | 1765.499 |
| 12 | 0.5 | 0 | 0 | −802.807 | 1256.341 | 1121.505 |
Table 3. Spectral characteristic parameters of the veneer surface under varying yellow dye concentration.

| No. | Red | Yellow | Blue | Blue_Area | Yellow_Area |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1543.900 | 2741.576 |
| 2 | 0 | 0.25 | 0 | 1998.026 | 2485.660 |
| 3 | 0 | 0.5 | 0 | 2307.927 | 2308.898 |
| 4 | 0.2 | 0 | 0.2 | −313.839 | −121.537 |
| 5 | 0.2 | 0.2 | 0.2 | −92.336 | −396.613 |
| 6 | 0.2 | 0.25 | 0.2 | −36.961 | −465.382 |
| 7 | 0.25 | 0 | 0.25 | −1127.801 | −1756.937 |
| 8 | 0.25 | 0.25 | 0.25 | −740.726 | −2164.924 |
| 9 | 0.25 | 0.3 | 0.25 | −611.453 | −2147.505 |
| 10 | 0.3 | 0 | 0.3 | −1054.631 | −858.439 |
| 11 | 0.3 | 0.1 | 0.3 | −739.481 | −1087.337 |
| 12 | 0.3 | 0.2 | 0.3 | −677.938 | −1307.374 |
Table 4. Spectral characteristic parameters of the veneer surface under varying blue dye concentration.

| No. | Red | Yellow | Blue | Blue_Amplitude | Blue_Area | Green_Peak | Red_Valley |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 23.530 | 1543.900 | 5795.841 | 4785.029 |
| 2 | 0 | 0 | 0.25 | 26.776 | −598.353 | 3841.367 | 2434.562 |
| 3 | 0 | 0 | 0.5 | 29.912 | −798.164 | 3763.970 | 2222.021 |
| 4 | 0.2 | 0.2 | 0 | 14.756 | 84.767 | 3053.554 | 2602.712 |
| 5 | 0.2 | 0.2 | 0.2 | 16.861 | −406.724 | 2561.284 | 2023.949 |
| 6 | 0.2 | 0.2 | 0.25 | 17.387 | −529.597 | 2438.217 | 1879.258 |
| 7 | 0.25 | 0.25 | 0 | 11.605 | −424.638 | 2568.751 | 2295.606 |
| 8 | 0.25 | 0.25 | 0.25 | 15.713 | −740.726 | 2340.624 | 1912.361 |
| 9 | 0.25 | 0.25 | 0.3 | 15.099 | −773.684 | 1524.036 | 1284.027 |
| 10 | 0.3 | 0.3 | 0 | 9.892 | −427.413 | 1418.971 | 1273.548 |
| 11 | 0.3 | 0.3 | 0.1 | 13.531 | −740.182 | 1381.778 | 1149.051 |
| 12 | 0.3 | 0.3 | 0.2 | 15.212 | −877.938 | 1350.353 | 1121.847 |
Table 5. Hyperparameter optimization results of the models before data processing.

| No. | Model | Hyperparameter | Optimal Value |
|---|---|---|---|
| 1 | CatBoost | iterations | 300 |
|   |          | depth | 10 |
|   |          | learning_rate | 0.05 |
|   |          | border_count | 32 |
|   |          | l2_leaf_reg | 1 |
| 2 | XGBoost | n_estimators | 300 |
|   |         | max_depth | 9 |
|   |         | colsample_bytree | 0.7 |
|   |         | learning_rate | 0.05 |
|   |         | subsample | 0.7 |
| 3 | RF | max_depth | 20 |
|   |    | max_features | sqrt |
|   |    | min_samples_split | 2 |
|   |    | n_estimators | 300 |
| 4 | SVR | C | 0.1 |
|   |     | epsilon | 0.01 |
|   |     | gamma | 0.1 |
Table 6. Hyperparameter optimization results of the models after data processing.

| No. | Model | Hyperparameter | Optimal Value |
|---|---|---|---|
| 1 | CatBoost | iterations | 1000 |
|   |          | depth | 10 |
|   |          | learning_rate | 0.1 |
|   |          | border_count | 64 |
|   |          | l2_leaf_reg | 9 |
| 2 | XGBoost | n_estimators | 200 |
|   |         | max_depth | 6 |
|   |         | colsample_bytree | 1.0 |
|   |         | learning_rate | 0.05 |
|   |         | subsample | 0.7 |
| 3 | RF | max_depth | 10 |
|   |    | max_features | None |
|   |    | min_samples_split | 2 |
|   |    | n_estimators | 200 |
| 4 | SVR | C | 0.1 |
|   |     | epsilon | 0.01 |
|   |     | gamma | 0.1 |
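The optimal values in Tables 5 and 6 come from an exhaustive grid search. The sketch below shows only the enumeration pattern; the grid values beyond those reported and the scorer are placeholders (in practice the scorer would train the model and return its mean cross-validated MSE):

```python
import itertools

# Hypothetical CatBoost search grid spanning the optima reported in Tables 5 and 6
grid = {
    "iterations": [300, 500, 1000],
    "depth": [6, 8, 10],
    "learning_rate": [0.05, 0.1],
    "border_count": [32, 64],
    "l2_leaf_reg": [1, 3, 9],
}

def grid_search(grid, score_fn):
    """Score every parameter combination; the lowest score (e.g. CV MSE) wins."""
    best_params, best_score = None, float("inf")
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = score_fn(params)  # placeholder for train + cross-validate
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative scorer that simply rewards depth=10 and learning_rate=0.1
toy_score = lambda p: abs(p["depth"] - 10) + abs(p["learning_rate"] - 0.1)
best, _ = grid_search(grid, toy_score)
```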
Table 7. Performance evaluation under different k-fold settings.

| No. | k | Model | MSE | MAPE (%) | MAE | R² |
|---|---|---|---|---|---|---|
| 1 | 2 | CatBoost | 0.00332 | 3.785 | 0.0408 | 0.938 |
|   |   | XGBoost | 0.00385 | 4.112 | 0.0459 | 0.928 |
|   |   | RF | 0.00329 | 5.395 | 0.0402 | 0.932 |
|   |   | SVR | 0.00359 | 6.035 | 0.0417 | 0.925 |
| 2 | 4 | CatBoost | 0.00287 | 3.357 | 0.0369 | 0.948 |
|   |   | XGBoost | 0.00337 | 3.746 | 0.0417 | 0.940 |
|   |   | RF | 0.00327 | 4.034 | 0.0397 | 0.942 |
|   |   | SVR | 0.00378 | 5.642 | 0.0426 | 0.930 |
| 3 | 6 | CatBoost | 0.00271 | 3.134 | 0.0349 | 0.957 |
|   |   | XGBoost | 0.00301 | 3.240 | 0.0352 | 0.950 |
|   |   | RF | 0.00313 | 3.687 | 0.0378 | 0.945 |
|   |   | SVR | 0.00327 | 5.653 | 0.0405 | 0.935 |
| 4 | 8 | CatBoost | 0.00276 | 3.486 | 0.0363 | 0.952 |
|   |   | XGBoost | 0.00315 | 3.550 | 0.0395 | 0.947 |
|   |   | RF | 0.00326 | 3.949 | 0.0396 | 0.941 |
|   |   | SVR | 0.00327 | 5.984 | 0.0405 | 0.933 |
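A minimal sketch of the k-fold protocol behind Table 7 (contiguous folds and placeholder `fit`/`predict` callables; the paper's exact splitting implementation is not specified):

```python
def kfold_indices(n, k):
    """Partition range(n) into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_mse(X, y, fit, predict, k=6):
    """Mean squared error averaged over k held-out folds."""
    folds = kfold_indices(len(X), k)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        errors.append(sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx)) / len(test_idx))
    return sum(errors) / k
```

With k = 6, all four models reach their best scores in Table 7, which motivates the choice of 6-fold cross-validation.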
Table 8. Model performance before and after data processing.

| No. | Model | MSE (Before) | MSE (After) | Runtime (Before) | Runtime (After) |
|---|---|---|---|---|---|
| 1 | CatBoost | 0.00303 | 0.00271 | 5.6 s | 3.56 s |
| 2 | XGBoost | 0.00353 | 0.00301 | 5.2 s | 3.85 s |
| 3 | RF | 0.00361 | 0.00309 | 5.0 s | 4.12 s |
| 4 | SVR | 0.00403 | 0.00372 | 4.35 s | 4.12 s |
Table 9. Test-set performance of each model under its optimal parameters.

| No. | Metric | CatBoost | XGBoost | RF | SVR |
|---|---|---|---|---|---|
| 1 | MSE | 0.00271 | 0.00301 | 0.00309 | 0.00372 |
| 2 | MAE | 0.0349 | 0.0353 | 0.0378 | 0.0405 |
| 3 | MAPE (%) | 3.134 | 3.240 | 3.901 | 5.653 |
| 4 | R² | 0.957 | 0.950 | 0.945 | 0.935 |
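For reference, the four evaluation indices in Table 9 written out in plain Python (`y` holds measured and `yhat` predicted concentrations):

```python
def mse(y, yhat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    """Mean absolute percentage error; assumes no zero targets."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot
```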
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Guan, X.; Xue, R.; He, Z.; Chen, S.; Chen, X. CatBoost-Optimized Hyperspectral Modeling for Accurate Prediction of Wood Dyeing Formulations. Forests 2025, 16, 1279. https://doi.org/10.3390/f16081279
