Article

Quantitative Detection of Mixed Gas Infrared Spectra Based on Joint SAE and PLS Downscaling with XGBoost

1 State Grid Integrated Energy Service Group Co., Ltd., Beijing 100051, China
2 Anqing Power Supply Company of State Grid Anhui Electric Power Co., Ltd., Anqing 246000, China
3 China Electric Power Research Institute Co., Beijing 100051, China
4 College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(7), 2112; https://doi.org/10.3390/pr13072112
Submission received: 22 May 2025 / Revised: 23 June 2025 / Accepted: 28 June 2025 / Published: 3 July 2025

Abstract

Infrared spectral analysis of mixed gases is hampered by severe cross-interference between spectral peaks, redundant data dimensions, and the inefficiency of traditional dimensionality reduction methods. To address these bottlenecks, this paper studies a joint dimensionality reduction strategy that combines a stacked autoencoder (SAE) with partial least squares (PLS) and constructs an XGBoost regression model for quantitative detection. The experimental data come from the real infrared spectrum dataset of the National Institute of Standards and Technology (NIST) database and cover key industrial gases such as CO and CH4. Compared with traditional principal component analysis (PCA), whose reliance on the variance contribution rate leads to dimensional redundancy, and with PLS dimensionality reduction alone, whose dimension parameter must be determined by costly cross-validation, the SAE-PLS joint strategy has two advantages: first, the optimal reduced dimension is determined automatically by SAE's nonlinear compression mechanism, which overcomes the limitations of linear methods in extracting nonlinear spectral features; second, feature selection is carried out with the variable importance in projection (VIP) index of PLS, which markedly improves compression efficiency relative to SAE alone. The XGBoost model was selected for its adaptability to high-dimensional sparse data: its regularization terms and feature importance weighting mechanism suppress the interference of spectral noise. The experimental results show that the mean square error (MSE) on the test set is reduced to 0.012% (71.4% lower than that of random forest) and the coefficient of determination (R2) reaches 0.987. By integrating deep feature optimization and ensemble learning, this method provides an efficient and accurate new solution for industrial process gas monitoring.

1. Introduction

Accurate quantitative detection of mixed-gas composition is a core technical requirement of industrial process monitoring, environmental pollutant tracking, and production-safety early warning [1]. Infrared spectral analysis has become the mainstream detection method in this field owing to its non-invasive nature, high sensitivity, and rapid response [2,3]. However, the infrared spectra of multicomponent gas mixtures generally suffer from severe spectral peak cross-interference and baseline drift, which significantly degrade the prediction accuracy of traditional linear regression models [4]. In addition, the nonlinear characteristics of high-dimensional spectral data and noise interference (instrument drift, environmental disturbance) further aggravate the risks of the "curse of dimensionality" and model overfitting [5]. In industrial field applications, the gas monitoring systems of petrochemical enterprises still exhibit redundant feature dimensions after principal component analysis, which causes significant delays in model inference and makes it difficult to meet the time-sensitive requirements of real-time control [6]. Likewise, when environmental monitoring stations use the classical partial least squares model, the lack of a screening mechanism based on the variable importance in projection index allows noise bands to be wrongly included in the regression framework, producing an abnormal increase in the misjudgment rate of carbon dioxide concentration monitoring [7]. These pain points highlight the deficiencies of existing methods in efficiency, accuracy, and interpretability.
Traditional dimensionality reduction and modeling methods face two bottlenecks. First, linear dimensionality reduction relies on the variance contribution rate or manually set thresholds and therefore struggles to compress nonlinear spectral features effectively [8]; experiments show that, under the same variance threshold, the reconstruction error of PCA for the CO spectrum is 37% higher than that of a stacked autoencoder (SAE). Second, although machine learning models such as random forest and support vector regression (SVR) can fit nonlinear relationships, they are sensitive to high-dimensional sparse data and their parameter optimization is costly [9,10,11]. The research gap is the lack of a joint dimensionality reduction framework that combines nonlinear feature compression, efficient dimension selection, and industrial interpretability with a high-precision regression model.
To solve these problems, this study proposes a quantitative detection method for mixed gases based on SAE-PLS joint dimensionality reduction and XGBoost regression. SAE automatically determines the optimal reduced dimension through deep nonlinear compression, overcoming the efficiency bottleneck of traditional methods that rely on manual thresholds or cross-validation. Furthermore, by combining it with the dynamic screening capability of the PLS variable importance in projection (VIP) index, the match between the feature set and the vibration modes of the gas molecules is improved by 19.6%, resolving the ambiguous physical meaning of the SAE-PCA combination. The XGBoost model is selected for its regularization mechanism and feature importance weighting ability [12], which effectively suppress spectral noise and support GPU acceleration and embedded deployment. Experiments on real spectral data from the National Institute of Standards and Technology (NIST) show that the mean square error (MSE) of this method on the test set is as low as 0.012%, 71.4% lower than that of random forest, with a coefficient of determination R2 of 0.987.
The theoretical contribution of this study is a mixed-gas analysis paradigm of "deep nonlinear dimensionality reduction → interpretable feature selection → high-precision ensemble learning". Its technical value lies in the SAE-PLS-XGBoost collaborative optimization, which overcomes the trade-off between efficiency and accuracy in traditional methods.

2. Methods

2.1. Data Acquisition and Preprocessing

2.1.1. Data Generation

The data used in this study are derived from the third-generation infrared spectrum database released by the National Institute of Standards and Technology (NIST) in 2024. The database was collected using a synchrotron-radiation Fourier transform infrared spectrometer (SR-FTIR) and covers the 4000–400 cm−1 band [13,14]. Six target gases (CO, CO2, CH4, C2H2, C2H4, and C2H6) were selected as the objects of analysis. The gas specifications provided by the NIST database differ among the gases and are listed in Table 1. Although the original database provides high-precision transmission spectra for single components under standard conditions, its discrete, fixed concentration gradients cannot meet the requirements of quantitative analysis over a continuous concentration distribution.
In order to construct the infrared spectral dataset with the characteristics of concentration gradient and mixed components, a systematic data preprocessing process was designed. Firstly, the original spectrum was normalized based on the Lambert–Beer law [15], and the transmitted light intensity of different initial concentrations was uniformly converted to the reference value of 100% volume concentration. See Formula (1) for the Lambert–Beer law:
A = \varepsilon L C        (1)
where A is the absorbance, ε is the molar absorption coefficient, L is the optical path length, and C is the gas concentration. Then, a random concentration coefficient matrix constrained by ∑Ci = 1 was introduced, and the multicomponent mixture spectra were simulated with a linear superposition algorithm. To enhance the statistical completeness of the dataset, Monte Carlo random sampling was run for 5000 iterations [16], finally generating a spectral expansion set containing both single-component concentration gradients and multicomponent mixed systems. A randomly selected sample and its infrared spectrum are shown in Figure 1.
The dataset construction strategy effectively solves the problem of the high cost of obtaining multi-concentration mixed spectra in experiments and provides high-quality datasets with both physical authenticity and statistical diversity for subsequent model training.
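As an illustration of this construction step, the following minimal Python sketch shows how normalized single-gas spectra could be linearly superimposed under random concentration vectors satisfying ∑Ci = 1. The array names (pure_spectra) and the use of a Dirichlet draw to enforce the sum constraint are illustrative assumptions, not the exact NIST preprocessing code.

import numpy as np

def generate_mixture_set(pure_spectra, n_samples=5000, seed=0):
    """Linearly superimpose normalized single-gas absorbance spectra under
    random concentration vectors that satisfy sum(C_i) = 1."""
    rng = np.random.default_rng(seed)
    n_gases = pure_spectra.shape[0]
    # Dirichlet sampling yields non-negative coefficients that sum to 1
    concentrations = rng.dirichlet(np.ones(n_gases), size=n_samples)
    mixed_spectra = concentrations @ pure_spectra  # Lambert-Beer linear superposition
    return mixed_spectra, concentrations

# Placeholder: 6 normalized single-gas spectra sampled at 9000 wavenumber points
pure_spectra = np.random.rand(6, 9000)
X_mix, C = generate_mixture_set(pure_spectra, n_samples=5000)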

2.1.2. Data Enhancement

To bridge the gap between spectra collected in an ideal laboratory environment and those measured under industrial field conditions, this study uses a Gaussian white noise injection strategy to improve the realism of the dataset [17]. Because the spectral acquisition of the NIST database is performed in precision, temperature-controlled, vibration-isolated experimental devices, the data exhibit ultra-low noise and excellent baseline stability and therefore cannot reflect the spectral distortion caused by instrument fluctuations, environmental interference, and other factors in real industrial settings [18,19]. Accordingly, while preserving the physical laws of spectral absorption, the measurement deviations encountered under different working conditions are systematically simulated by superimposing random white noise of controllable intensity on the original absorbance data.
In the implementation, three enhancement levels (low, medium, and high) are defined according to the typical noise levels of industrial spectrometers, corresponding to laboratory-grade, standard industrial, and harsh-environment detection conditions [20,21]. In this paper, these correspond to signal-to-noise ratios (SNRs) of 20 dB, 15 dB, and 10 dB, respectively; the smaller the SNR, the greater the noise relative to the signal. Each noise level uses independent Gaussian distribution parameters that uniformly disturb the whole spectral band while retaining the overall shape of the characteristic absorption peaks. To avoid mechanical repetition of the noise template, a dynamic random-seed mechanism is introduced so that the noise pattern of every enhanced spectrum is unique [22]. The enhanced dataset includes not only the pure spectra under ideal conditions but also spectral variants with different noise characteristics, effectively expanding the coverage of the data distribution. A comparison of the infrared spectra of the same sample at different noise levels is shown in Figure 2.
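A minimal sketch of this noise-injection step is given below, assuming the absorbance spectrum is a NumPy array. The helper name add_noise_at_snr and the placeholder input are illustrative; only the SNR definition (signal power over noise power, expressed in dB) follows the description above.

import numpy as np

def add_noise_at_snr(spectrum, snr_db, rng=None):
    """Superimpose Gaussian white noise so that signal power / noise power
    equals 10 ** (snr_db / 10)."""
    if rng is None:
        rng = np.random.default_rng()          # fresh seed -> unique noise pattern
    signal_power = np.mean(spectrum ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=spectrum.shape)
    return spectrum + noise

clean_spectrum = np.random.rand(9000)          # placeholder for one normalized spectrum
# Laboratory-grade, standard industrial, and harsh-environment levels
noisy_variants = {snr: add_noise_at_snr(clean_spectrum, snr) for snr in (20, 15, 10)}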
This data enhancement strategy provides key data support for the subsequent machine learning models to adapt to complex working conditions: the algorithms are exposed to a variety of potential interference patterns during training, which significantly improves robustness after deployment.

2.2. SAE-PLS Joint Dimension Reduction Framework

The SAE-PLS joint dimensionality reduction framework proposed in this study is built on a two-stage collaborative mechanism of "dimension decision-making and supervised dimensionality reduction". It automatically determines the optimal number of reduced dimensions through the nonlinear feature analysis capability of a stacked autoencoder (SAE) [23] and, based on this dimension parameter, constructs a partial least squares (PLS) supervised projection model [24] on the original spectral data, thereby achieving accurate feature-space compression of high-dimensional infrared spectra and supporting industrial gas concentration prediction. A schematic diagram of the SAE-PLS joint dimensionality reduction framework is shown in Figure 3.
SAE is used only as a dimension decision-making tool: it dynamically extracts the intrinsic dimensionality of the data through analysis of the autoencoding reconstruction error, avoiding manual intervention. PLS acts directly on the original spectral data, using the optimal dimension output by SAE to guide the construction of the latent-variable space; this retains the physical interpretability of the original spectrum while strengthening the correlation between features and concentration variables through supervised learning.
As the dimension decision module, the core task of the stacked autoencoder (SAE) is to mine the essential dimensionality of the spectral data through nonlinear autoencoding reconstruction. The dimension of the original data refers to the horizontal axis of the spectrum, i.e., the wavenumber, whose physical meaning is the energy (frequency) of the infrared light; the wavenumber is the reciprocal of the wavelength and is directly related to the energy required for transitions between molecular vibrational/rotational energy levels. The input is the normalized matrix of original infrared absorbance spectra. After normalization, features are learned by a two-layer autoencoder built with the PyTorch (1.12.1) framework. The encoder uses fully connected layers with ReLU activation: the first layer maps the original high-dimensional spectrum to a 64-dimensional nonlinear hidden space, and the second layer compresses it further to the dynamically optimized encoding_dim dimension. The decoder reconstructs the original spectrum through a symmetric structure, and the mean square error (MSE) is used as the loss function to drive self-supervised training. To determine the optimal reduced dimension automatically, a dimension search strategy based on early stopping is proposed. In the implementation, the preset search interval is 1 to 50 dimensions; the autoencoder is trained for each candidate dimension while the reconstruction loss on the validation set is monitored. When the loss reduction falls below the threshold (0.01) for five consecutive training cycles, the early-stop mechanism is triggered and the current dimension is marked as a potential optimum. The minimum dimension at which the reconstruction error converges is then selected by a loss-plateau detection algorithm (experiments show that the original dimensionality can be reduced by 99.85%). This strategy overcomes the inefficiency of traditional PCA, which relies on the variance contribution rate, and of PLS cross-validation, and is especially suitable for extracting nonlinear dimensional features in the overlapping spectral-peak regions of mixed gases.
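The following PyTorch sketch illustrates the dimension-search idea under simplifying assumptions: it collapses the per-dimension early stopping and the loss-plateau detection into a single sweep over candidate bottleneck widths, and the optimizer, epoch count, and learning rate are illustrative choices not specified in the text.

import torch
import torch.nn as nn

def reconstruction_loss(X, d, hidden=64, epochs=200, lr=1e-3):
    """Train a two-layer autoencoder with bottleneck width d; return final MSE."""
    p = X.shape[1]
    model = nn.Sequential(                         # encoder followed by decoder
        nn.Linear(p, hidden), nn.ReLU(), nn.Linear(hidden, d),
        nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, p),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    X_t = torch.as_tensor(X, dtype=torch.float32)
    loss = None
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X_t), X_t)
        loss.backward()
        optimizer.step()
    return loss.item()

def search_optimal_dim(X, d_max=50, patience=5, threshold=0.01):
    """Sweep bottleneck widths 1..d_max and stop once the reconstruction loss
    has not improved by more than `threshold` for `patience` consecutive widths."""
    best_loss, stall, optimal_dim = float("inf"), 0, 1
    for d in range(1, d_max + 1):
        loss = reconstruction_loss(X, d)
        if best_loss - loss > threshold:
            best_loss, optimal_dim, stall = loss, d, 0
        else:
            stall += 1
            if stall >= patience:                  # loss plateau detected
                break
    return optimal_dim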
The optimal dimension parameter determined by SAE is taken as the number of latent variables of the PLS model, and supervised dimensionality reduction is then performed directly on the original spectral data. Specifically, the PLS regression model is constructed with the standardized spectral matrix as input and the gas concentrations as response variables. PLS iteratively extracts latent variables highly correlated with concentration by maximizing the covariance between the input and response variables, which can be described by the following expressions [25]:
\mathrm{Cov}(X, Y) = \frac{1}{n-1}\,(X - \bar{X})^{T}(Y - \bar{Y})        (2)
\max_{\omega}\ \mathrm{Cov}(X\omega, Y)        (3)
T = X\omega        (4)
Formula (2) is the covariance formula, where X is the normalized spectral matrix and Y is the gas-concentration response variable; Formula (3) is the PLS objective function, where ω is the weight vector of the latent variable; Formula (4) is the latent-variable extraction formula, where T is the extracted latent variable matrix. The number of latent variables is set directly by the optimal_dim parameter output by SAE, avoiding the computational burden of traditional PLS, which must be tuned repeatedly through cross-validation. During dimensionality reduction, PLS projects the original high-dimensional spectrum into a low-dimensional latent-variable space while preserving the global spectral features related to concentration. The pseudocode of the SAE-PLS joint dimensionality reduction algorithm is given in Algorithm 1:
Algorithm 1. # SAE-PLS Joint Dimensionality Reduction Algorithm (pseudocode)
def SAE_PLS(X, Y):
    # Phase 1: SAE dimension decision.
    # SAE_DimSearcher and PlateauDetector denote the dimension-search and
    # loss-plateau-detection components described above; p is the number of
    # wavenumber points and d the candidate bottleneck dimension.
    optimal_dim = SAE_DimSearcher(
        encoder=Sequential(Linear(p, 64), ReLU(), Linear(64, d)),
        decoder=Sequential(Linear(d, 64), ReLU(), Linear(64, p)),
        early_stop=PlateauDetector(patience=5, threshold=0.01),
    ).fit(X)

    # Phase 2: PLS supervised projection on the original spectra.
    # scikit-learn's fit_transform returns (X scores, Y scores); only the
    # X scores are passed on to the downstream regressor.
    pls = PLSRegression(n_components=optimal_dim)
    X_latent, _ = pls.fit_transform(X, Y)
    return X_latent, pls
The final dimensionality reduction result is used as the input to the XGBoost model, and the technical synergy is reflected at three levels: first, the nonlinear coding of SAE overcomes the limitation of linear dimensionality reduction methods in modeling the interaction of spectral absorption peaks; second, the statistical dimensionality reduction mechanism of PLS provides an optimized feature space for ensemble learning by maximizing the covariance between the features and the concentration variables; finally, XGBoost performs gradient boosting in the compressed feature space, and its regularization parameters effectively suppress the risk of overfitting caused by spectral baseline drift. Five-fold cross-validation confirms that the joint dimensionality reduction strategy shortens the model training time to 1 s and reduces the standard deviation of the prediction accuracy from 0.041% to 0.012%, verifying the robustness advantage of the method.

2.3. Model Parameter Setting

To realize quantitative detection from the mixed-gas infrared spectral data, this study compared the performance of various regression models, including support vector machine regression (SVR), partial least squares regression (PLSR), decision tree regression (DTR), random forest regression (RFR), AdaBoost regression (ABR), extreme random tree regression (ETR), bagging regression (BGR), gradient-enhanced random forest regression (GBRFR), and XGBoost regression (XGBR). The parameters of all models were optimized for the characteristics of infrared spectral data, such as high-dimensional features, noise interference, and cross-sensitivity.
For support vector machine regression (SVR), the radial basis function (RBF) is used as the kernel, the regularization parameter C is set to 1.0, and the kernel coefficient gamma is tuned to 0.01 by grid search to balance model complexity and generalization ability. The key parameter of partial least squares regression (PLSR) is the number of latent variables, whose optimal value is determined to be 8 by cross-validation in order to fully extract the linear characteristics of the spectral data while suppressing collinearity. Decision tree regression (DTR) uses the CART algorithm; the maximum tree depth is limited to 5 and the minimum number of samples per leaf node is set to 10 to avoid overfitting. The numbers of base learners for random forest regression (RFR) and gradient-enhanced random forest regression (GBRFR) are set to 200 and 150, respectively, the maximum depth of a single tree is fixed at 6, and the feature-splitting criterion is the mean square error (MSE); model diversity is enhanced by bootstrapping.
AdaBoost regression (ABR) uses a decision tree as the base learner, with a learning rate of 0.1 and 200 iterations; it improves the fit to difficult samples by dynamically adjusting the sample weights. Extreme random tree regression (ETR) uses a fully randomized splitting strategy; the number of trees is 200, and the proportion of randomly selected splitting features is the square root of the spectral feature dimension, which reduces variance and improves robustness. Bagging regression (BGR) uses a decision tree as its base model, samples and features are randomly drawn at a ratio of 80%, and the ensemble size is 150; overfitting is suppressed by aggregating multiple groups of weak learners.
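For reference, a hedged scikit-learn sketch of the baseline configurations described above is shown below. Class choices such as GradientBoostingRegressor standing in for the gradient-enhanced random forest (GBRFR) are assumptions, and any parameter not stated in the text is left at its library default.

from sklearn.svm import SVR
from sklearn.cross_decomposition import PLSRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              ExtraTreesRegressor, BaggingRegressor,
                              GradientBoostingRegressor)

baseline_models = {
    "SVR":   SVR(kernel="rbf", C=1.0, gamma=0.01),
    "PLSR":  PLSRegression(n_components=8),
    "DTR":   DecisionTreeRegressor(max_depth=5, min_samples_leaf=10),
    "RFR":   RandomForestRegressor(n_estimators=200, max_depth=6,
                                   criterion="squared_error"),
    # Stand-in for the gradient-enhanced random forest (GBRFR) described above
    "GBRFR": GradientBoostingRegressor(n_estimators=150, max_depth=6),
    "ABR":   AdaBoostRegressor(learning_rate=0.1, n_estimators=200),
    "ETR":   ExtraTreesRegressor(n_estimators=200, max_features="sqrt"),
    "BGR":   BaggingRegressor(n_estimators=150, max_samples=0.8, max_features=0.8),
}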
XGBoost regression (XGBR) is the core model, and its parameter settings fully account for the nonlinear characteristics and noise distribution of the spectral data. The learning rate is set to 0.05 to balance convergence speed and accuracy, the maximum tree depth is 6, the subsample ratio and feature sampling ratio are 0.8 and 0.7, respectively, and the regularization coefficients lambda and alpha are set to 1.5 and 0.5 to suppress overfitting. Early stopping is used during training: if the validation loss does not decrease for 10 consecutive rounds, iteration is terminated [26]. The hyperparameters of all models are tuned through Bayesian optimization and 5-fold cross-validation, and the final parameter combinations achieve the best balance between mean square error (MSE) and coefficient of determination (R2) on the validation set.
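A minimal sketch of this XGBoost configuration is given below, assuming the scikit-learn-style XGBRegressor interface (xgboost ≥ 1.6) and placeholder arrays standing in for the SAE-PLS-reduced features and the concentration targets; the number of boosting rounds is an illustrative assumption.

import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Placeholders for the SAE-PLS-reduced features (12 dimensions) and concentrations
X_latent = np.random.rand(5000, 12)
y = np.random.rand(5000)

X_train, X_val, y_train, y_val = train_test_split(X_latent, y, test_size=0.2,
                                                  random_state=42)

model = XGBRegressor(
    learning_rate=0.05,        # balances convergence speed and accuracy
    max_depth=6,
    subsample=0.8,             # row (sample) sampling ratio
    colsample_bytree=0.7,      # feature sampling ratio
    reg_lambda=1.5,            # L2 regularization
    reg_alpha=0.5,             # L1 regularization
    n_estimators=500,          # upper bound; early stopping picks the round count
    early_stopping_rounds=10,  # stop if validation loss stalls for 10 rounds
    eval_metric="rmse",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)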

2.4. Performance Evaluation Index

In evaluating the quantitative detection model for mixed-gas infrared spectra, this study uses the mean square error (MSE) and the coefficient of determination (R2) as the core evaluation indexes, which measure the prediction accuracy and generalization performance of the model from the two perspectives of absolute error and explained data variability. The MSE quantifies the absolute deviation of the gas-concentration prediction by averaging the squared errors between the predicted and true values. The formula for the mean square error is shown in Equation (5):
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_{i} - \hat{Y}_{i}\right)^{2}        (5)
where Y_i is the true concentration, Ŷ_i is the predicted concentration, and n is the sample size. Its sensitivity to large errors can effectively expose the stability defects of the model in extreme concentration ranges, providing a direct basis for improving model robustness [27]. For example, in mixed-gas detection, the spectral features of different components may overlap because of cross-interference; in this case, small fluctuations of the MSE reflect the model's ability to decouple the complex spectrum.
R2 measures the ability of the model to explain data variability from a statistical perspective. Comparing the prediction error with the inherent variance of the data reveals the strength of the correlation between spectral characteristics and gas concentration [28]. The R2 formula is shown in Equation (6):
R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(Y_{i} - \hat{Y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(Y_{i} - \bar{Y}\right)^{2}}        (6)
where Ȳ is the mean of the true concentrations. In mixed-gas analysis, the concentration changes of different components may exhibit non-independent, nonlinearly coupled relationships, and a high R2 indicates that the model can capture such complex correlations, providing theoretical support for the simultaneous prediction of multicomponent concentrations. For example, when gas absorption peaks shift because of temperature or pressure fluctuations, the stability of R2 indirectly verifies the adaptability of the model to dynamic changes in spectral characteristics.
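For illustration, both indexes can be computed directly from predicted and true concentrations, for example with scikit-learn; the toy values below are placeholders, and the MSE is multiplied by 100 only to match the percentage reporting used in this paper.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.20, 0.35, 0.15, 0.30])    # placeholder true concentrations
y_pred = np.array([0.21, 0.33, 0.16, 0.29])    # placeholder predicted concentrations

mse = mean_squared_error(y_true, y_pred)       # Equation (5)
r2 = r2_score(y_true, y_pred)                  # Equation (6)
print(f"MSE = {mse * 100:.3f}%  R2 = {r2:.3f}")  # MSE reported as a percentage here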
The combined use of MSE and R2 realizes a multi-angle evaluation of model performance: MSE focuses on the absolute control of prediction error, which is suitable for quantitative analysis with strict requirements for detection accuracy in industrial scenes; R2 emphasizes the mining ability of the model for the inherent laws of spectral data and can extract stable concentration-sensitive features from high-dimensional and high-noise spectral data. The complementarity of the two is particularly important in mixed gas detection—MSE helps identify the systematic deviation caused by sensor drift or environmental interference, while R2 reveals the potential optimization direction of feature extraction and model architecture design by analyzing the degree of interpretation of data variability by the model.
To sum up, the introduction of MSE and R2 not only conforms to the characteristics of high-dimensional, strong noise, and nonlinear coupling of mixed gas infrared spectrum data, but also provides a double guarantee for the reliability verification of industrial detection systems. The theoretical connotation and application value of the two together support the conclusion of the superiority of the XGBoost model in complex spectral quantitative analysis in this study and provide a reusable evaluation paradigm for algorithm improvement and cross-scene migration in subsequent research.

3. Results and Discussion

3.1. Quantitative Evaluation of Dimensionality Reduction Effect

In this experiment, the performance difference between the SAE-PLS joint dimensionality reduction strategy and traditional methods is analyzed, and the quantitative evaluation is carried out using three dimensions: dimension compression efficiency, computational timeliness, and downstream model adaptability. In the experiment, 5000 groups of infrared spectrum samples containing CO, CH4, and other gases from the NIST database were used to divide the training set and test set according to the ratio of 8:2. All dimensionality reduction methods were based on the same data division for fairness verification.
In terms of dimension compression efficiency, SAE automatically optimizes the reduced dimensionality through a nonlinear coding network, and its compression efficiency is significantly better than that of the linear method (as shown in Figure 4). The experimental results show that the compressed dimension of the original spectral data obtained by SAE is stable at d′ = 12 (a compression rate of 99.83%), whereas PCA still needs to retain d′ = 2747 dimensions (a compression rate of 61.52%) to preserve a 95% variance contribution rate. The deep nonlinear activation function (ReLU) of SAE shows a stronger feature aggregation ability in the overlapping regions of spectral peaks, which effectively alleviates the dimensional redundancy caused by local covariance matrix mismatch in linear methods.
Computational efficiency is evaluated by comparing the time consumed for dimensionality reduction with the time consumed for model training. As shown in Figure 5, the integrated SAE-PLS approach requires approximately 25 min for end-to-end processing, a substantial efficiency gain over the conventional methods. The nonlinear encoder of SAE requires 24 min to compress the spectra, the projection screening of PLS requires 0.5 min, and the training of the XGBoost model is reduced to 0.5 s because the dimensionality is reduced to 12. In comparison, PLS alone consumes more than 1000 min in total because its number of latent variables must be screened by cross-validation. Although PCA alone has a shorter compression time, it was not adopted as the dimensionality reduction method in this study because of the excessive number of retained dimensions (2747); this raised the XGBoost training time to about 1 min and the training time of the remaining models to as much as 80 min. This result demonstrates that SAE-PLS optimizes the allocation of computational resources while preserving information integrity by combining nonlinear compression with physically grounded feature screening.
The downstream model adaptation is indirectly verified by XGBoost regression performance. As demonstrated in Table 2, the mean square error (MSE) of the SAE-PLS downscaled feature set on the test set is 0.012%, which is significantly better than that of AE and SAE. This result is attributed to the dual optimization of the joint strategy, which involves the nonlinear compression of SAE, thereby suppressing the high-frequency noise introduced by the spectral baseline drift. Concurrently, VIP screening of PLS strengthens the physical correlation between the features and the target variables.
Combining the evaluation results of the above three dimensions: although PLS dimensionality reduction alone achieves a downstream model fit as good as that of SAE-PLS, its computational cost is far higher than that of the other dimensionality reduction strategies. The joint SAE-PLS strategy is therefore the optimal choice.

3.2. Model Performance Comparison

The objective of this experiment was to verify the synergistic advantage of combining the joint SAE-PLS dimensionality reduction strategy with XGBoost regression (XGBR). To this end, nine models were compared on the dimensionality-reduced feature set for quantitative detection performance: support vector machine regression (SVR), K-nearest neighbor regression (KNNR), decision tree regression (DTR), random forest regression (RFR), AdaBoost regression (ABR), extreme random tree regression (ETR), bagging regression (BGR), gradient-enhanced random forest regression (GBRFR), and XGBR. The experiments used a dataset of 5000 sets of mixed-gas infrared spectra generated by randomly combining multiple single-gas spectra from the NIST database. The data were divided into training sets (80%) and test sets (20%) for all models, and the hyperparameters were optimized through a combination of grid search and 5-fold cross-validation. The performance evaluation metrics were the mean square error (MSE), the coefficient of determination (R2), and the model training time.
With regard to prediction accuracy (Table 3), XGBR achieved an MSE of 0.012% on the test set, significantly lower than that of the competing models. Compared with random forest regression (RFR, MSE = 0.042%) and decision tree regression (DTR, MSE = 0.035%), the error of XGBR is reduced by 71.43% and 65.71%, respectively. This advantage is attributable to the regularization mechanism and feature importance weighting strategy of XGBoost, which effectively suppress overfitting caused by spectral noise. The conventional SVR model has an MSE of 0.429% (R2 = 0.506), reflecting its inability to capture the nonlinear characteristics of the spectral data. AdaBoost regression, another ensemble model, performed suboptimally with an MSE of 0.142%; its iterative sample-weight update strategy is poorly matched to the sparse spectral data, so the contributions of key features are diluted within the model.
With regard to model efficiency (Figure 6), extreme random tree regression (ETR) and bagging regression (BGR) require 1.7 s and 1.9 s of training time, respectively, owing to their parallel architectures and random subspace sampling mechanisms; however, their prediction accuracy and stability are significantly lower than those of XGBoost regression (MSE = 0.075% for ETR and MSE = 0.048% for BGR). The XGBR model, optimized by pre-sorting and histogram approximation, keeps the training time at 0.7 s, 92.31% faster than the traditional random forest (9.1 s for RFR). Even though its number of adjustable hyperparameters (12) is considerably larger than that of RFR (8), XGBR maintains high computational efficiency. This result underscores the ability of XGBR to combine high accuracy with low latency in industrial real-time monitoring scenarios.

3.3. Robustness Test

To verify the anti-interference ability of the SAE-PLS-XGBoost hybrid model in complex industrial scenarios, this section systematically evaluates the quantitative detection stability of the model under different signal-to-noise ratio (SNR) conditions. Controllable noise is injected into the NIST infrared spectral data, with Gaussian white noise chosen to simulate the interference that may exist in real gas monitoring. The noise levels are divided into three grades of 10 dB, 15 dB, and 20 dB according to the SNR, and the performance decay trends of the models SAE-PLS-XGBoost (abbreviated S-PLS-X), SAE-PLS-RF (S-PLS-RF), PCA-XGBoost (PCA-X), PCA-DTR (PCA-DT), and SAE-PCA-XGBoost (S-PCA-X) are compared.
The experimental results (Table 4) show that SAE-PLS-XGBoost still maintains high prediction accuracy in a high-noise environment (SNR = 10 dB), with an MSE of 0.023%, which is 43.3% and 26.9% lower than that of PCA-XGBoost (MSE = 0.034%) and SAE-PLS-RF (MSE = 0.064%), respectively. Analysis of the performance decay trend shows that the MSE of SAE-PLS-XGBoost grows by only 0.011% in the high-noise environment, significantly less than that of the other models. This advantage stems from the dual anti-noise mechanism of the joint dimensionality reduction strategy: the deep nonlinear coding network of SAE filters high-frequency random noise through sparse activation constraints, while the VIP feature screening of PLS reduces the distortion of the feature distributions by eliminating low-correlation bands caused by baseline drift.

4. Conclusions and Outlook

4.1. Conclusions

In this study, a hybrid modeling framework based on SAE-PLS joint dimensionality reduction and XGBoost regression was developed to address the bottlenecks of spectral peak interference, dimensional redundancy, and the insufficient efficiency of traditional methods in the quantitative detection of mixed-gas infrared spectra, and its superiority was verified on NIST standard datasets. Experiments demonstrate that the joint SAE-PLS strategy significantly improves the characterization efficiency of high-dimensional spectral data through the dual mechanisms of nonlinear compression and interpretable feature screening. Compared with the traditional PCA method, the dimensionality reduction rate is improved by 38.3% (from 61.5% to 99.8%), providing physically meaningful and more information-dense feature sets for subsequent modeling. The XGBoost regression model, constrained by regularization and equipped with a feature importance weighting mechanism, exhibits an MSE of 0.012% on the test set, a 71.4% reduction compared with benchmark models such as random forest. Notably, the model remains effective in complex noise environments, including Gaussian white noise and impulse interference, underscoring its robustness and versatility. The method integrates deep feature learning, statistical dimensionality reduction, and ensemble optimization, realizing the complete "compression-screening-modeling" pipeline for spectral data, and provides a solution with high accuracy, strong noise immunity, and low computational latency for the online monitoring of industrial process gases.

4.2. Outlook

Although the SAE-PLS-XGBoost hybrid model shows significant advantages in the quantitative detection of mixed-gas infrared spectra, future research should deepen the exploration in the following directions. First, the current noise interference test is based mainly on simulated data; measured spectra with multi-source heterogeneous noise (e.g., temperature drift, dust scattering) from real industrial scenarios should be introduced to further verify the robustness of the model. Second, optimizing the network structure of the SAE may further improve the pertinence of nonlinear feature extraction, and combining the VIP screening of PLS with graph neural networks (GNNs) to build a topological correlation model between spectral bands could capture cross-wavelength interaction effects more accurately. Third, although the real-time performance of the model already meets the needs of most industrial scenarios, distributed training frameworks based on federated learning or model lightweighting techniques should be explored for very large spectral databases to adapt to the resource constraints of edge computing devices. Fourth, promoting the embedded integration of the algorithm with miniaturized infrared sensors and developing gas monitoring terminals with hardware-software synergy will be key to moving from theoretical research to engineering application. Fifth, the generalization ability of the current model is limited by the lack of diversity of the training samples; prediction errors increase significantly under extreme concentration distributions and for untrained new gas-mixture combinations, so cross-scenario spectral libraries should be integrated through transfer learning or a dynamic weighting mechanism to improve robustness in actual deployment.

Author Contributions

Methodology, X.Z.; Software, Z.X.; Formal analysis, X.B.; Investigation, X.B.; Resources, F.Z.; Writing—original draft, X.Z.; Writing—review & editing, B.W.; Supervision, B.W.; Project administration, H.Q.; Funding acquisition, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of State Grid Corporation of China, grant number 5400-202320578A-3-2-ZN.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Authors Xichao Zhou and Yong Peng were employed by the State Grid Integrated Energy Service Group Co., Ltd. Authors Baigen Wang, Xingjiang Bao, and Hongtao Qi were employed by the Anqing Power Supply Company of State Grid Anhui Electric Power Co., Ltd. Author Zishang Xu was employed by the China Electric Power Research Institute Co. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Li, J.; Ma, Y.; Duan, Z.; Zhang, Y.; Duan, X.; Liu, B.; Yuan, Z.; Wu, Y.; Jiang, Y.; Tai, H. Local dynamic neural network for quantitative analysis of mixed gases. Sens. Actuators B Chem. 2024, 404, 135230.
  2. Liu, N.; Xu, L.; Zhou, S.; Zhang, L.; Li, J. Simultaneous detection of multiple atmospheric components using an NIR and MIR laser hybrid gas sensing system. ACS Sens. 2020, 5, 3607–3616.
  3. Chen, H.; Xu, L.; Ai, W.; Lin, B.; Feng, Q.; Cai, K. Kernel functions embedded in support vector machine learning models for rapid water pollution assessment via near-infrared spectroscopy. Sci. Total Environ. 2020, 714, 136765.
  4. Wang, L.; Zhang, T.; Zhang, Q.; Wei, Y.; Liu, T.; Hou, Z.; Qiu, B.; Sun, M. Suppression of cross-interference in the absorption spectra of gas mixtures. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2025, 327, 125352.
  5. Liang, J.G.; Jiang, Y.; Wu, J.K.; Wang, C.; von Gratowski, S.; Gu, X.; Pan, L. Multiplex-gas detection based on non-dispersive infrared technique: A review. Sens. Actuators A Phys. 2023, 356, 114318.
  6. Ju, W.; Lu, C.; Liu, C.; Jiang, W.; Zhang, Y.; Hong, F. Rapid Identification of Atmospheric Gaseous Pollutants Using Fourier-Transform Infrared Spectroscopy Combined with Independent Component Analysis. J. Spectrosc. 2020, 2020, 8920732.
  7. Torres, L.F.; Damascena, M.A.; Alves, M.M.A.; Santos, K.S.; Franceschi, E.; Dariva, C.; Barros, V.A.; Melo, D.C.; Borges, G.R. Use of near-infrared spectroscopy for the online monitoring of natural gas composition (hydrocarbons, water and CO2 content) at high pressure. Vib. Spectrosc. 2024, 131, 103653.
  8. Hasan, B.M.S.; Abdulazeez, A.M. A review of principal component analysis algorithm for dimensionality reduction. J. Soft Comput. Data Min. 2021, 2, 20–30.
  9. Prasetiyowati, M.I.; Maulidevi, N.U.; Surendro, K. Feature selection to increase the random forest method performance on high dimensional data. Int. J. Adv. Intell. Inform. 2020, 6, 303–312.
  10. Gaye, B.; Zhang, D.; Wulamu, A. Improvement of support vector machine algorithm in big data background. Math. Probl. Eng. 2021, 2021, 5594899.
  11. Wilson, A.; Anwar, M.R. The Future of Adaptive Machine Learning Algorithms in High-Dimensional Data Processing. Int. Trans. Artif. Intell. 2024, 3, 97–107.
  12. Huang, L.; Liu, Y.; Huang, W.; Dong, Y.; Ma, H.; Wu, K.; Guo, A. Combining random forest and XGBoost methods in detecting early and mid-term winter wheat stripe rust using canopy level hyperspectral measurements. Agriculture 2022, 12, 74.
  13. Enders, A.A.; North, N.M.; Fensore, C.M.; Velez-Alvarez, J.; Allen, H.C. Functional group identification for FTIR spectra using image-based machine learning models. Anal. Chem. 2021, 93, 9711–9718.
  14. Ji, H.; Deng, H.; Lu, H.; Zhang, Z. Predicting a molecular fingerprint from an electron ionization mass spectrum with deep neural networks. Anal. Chem. 2020, 92, 8649–8653.
  15. Wang, X.; Shi, X.; Duan, H.; Zhou, L.; Wu, H.; Lu, J.; Davidrajuh, R. Research on infrared optical CO detection based on BP neural network algorithm. In Proceedings of the Third International Conference on Algorithms, Microchips, and Network Applications (AMNA 2024), Xi'an, China, 8–10 March 2024; SPIE: Bellingham, WA, USA, 2024; Volume 13171, pp. 136–143.
  16. Xiang, Z.; Chen, J.; Bao, Y.; Li, H. An active learning method combining deep neural network and weighted sampling for structural reliability analysis. Mech. Syst. Signal Process. 2020, 140, 106684.
  17. Liu, C.; Yang, S.X.; Li, X.; Li, H. Noise level penalizing robust Gaussian process regression for NIR spectroscopy quantitative analysis. Chemom. Intell. Lab. Syst. 2020, 201, 104014.
  18. Kistenev, Y.V.; Skiba, V.E.; Prischepa, V.V.; Borisov, A.V.; Vrazhnov, D.A. Gas-mixture IR absorption spectra denoising using deep learning. J. Quant. Spectrosc. Radiat. Transf. 2024, 313, 108825.
  19. Sun, J.; Tian, L.; Chang, J.; Kolomenskii, A.A.; Schuessler, H.A.; Xia, J.; Feng, C.; Zhang, S. Adaptively optimized gas analysis model with deep learning for near-infrared methane sensors. Anal. Chem. 2022, 94, 2321–2332.
  20. Cao, Y.; Cheng, X.; Xu, Z.; Tian, X.; Cheng, G.; Peng, F.; Wang, J. High precision and sensitivity detection of gas measurement by laser wavelength modulation spectroscopy implementing an optical fringe noise suppression method. Opt. Lasers Eng. 2023, 166, 107570.
  21. Elaraby, S.; Abuelenin, S.M.; Moussa, A.; Sabry, Y.M. Deep Learning on Synthesized Sensor Characteristics and Transmission Spectra Enabling MEMS-Based Spectroscopic Gas Analysis beyond the Fourier Transform Limit. Foundations 2021, 1, 304–317.
  22. Naughton, N.; Sun, J.; Tekinalp, A.; Parthasarathy, T.; Chowdhary, G.; Gazzola, M. Elastica: A compliant mechanics environment for soft robotic control. IEEE Robot. Autom. Lett. 2021, 6, 3389–3396.
  23. Jang, H.D.; Kwon, S.; Nam, H.; Chang, D.E. Semi-Supervised Autoencoder for Chemical Gas Classification with FTIR Spectrum. Sensors 2024, 24, 3601.
  24. Zhou, Y.; Jiang, M.; Dou, W.; Meng, D.; Wang, C.; Wang, J.; Wang, X.; Sun, L.; Jiang, S.; Chen, F.; et al. Narrow-band multi-component gas analysis based on photothermal spectroscopy and partial least squares regression method. Sens. Actuators B Chem. 2023, 377, 133029.
  25. Menduni, G.; Zifarelli, A.; Sampaolo, A.; Patimisco, P.; Giglio, M.; Amoroso, N.; Wu, H.; Dong, L.; Bellotti, R.; Spagnolo, V. High-concentration methane and ethane QEPAS detection employing partial least squares regression to filter out energy relaxation dependence on gas matrix composition. Photoacoustics 2022, 26, 100349.
  26. Wang, Q.; Zou, X.; Chen, Y.; Zhu, Z.; Yan, C.; Shan, P.; Wang, S.; Fu, Y. XGBoost algorithm assisted multi-component quantitative analysis with Raman spectroscopy. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 323, 124917.
  27. Dey, P.; Saurabh, K.; Kumar, C.; Pandit, D.; Chaulya, S.K.; Ray, S.K.; Prasad, G.M.; Mandal, S.K. t-SNE and variational auto-encoder with a bi-LSTM neural network-based model for prediction of gas concentration in a sealed-off area of underground coal mines. Soft Comput. 2021, 25, 14183–14207.
  28. Wang, T.; Zhang, H.; Wu, Y.; Chen, X.; Chen, X.; Zeng, M.; Yang, J.; Su, Y.; Hu, N.; Yang, Z. Classification and concentration prediction of VOCs with high accuracy based on an electronic nose using an ELM-ELM integrated algorithm. IEEE Sens. J. 2022, 22, 14458–14469.
Figure 1. Partial data infrared spectrum.
Figure 2. Infrared spectrum comparison of the same data at different noise levels.
Figure 3. SAE-PLS schematic diagram of joint dimension reduction framework.
Figure 4. Comparison chart of dimension compression efficiency.
Figure 5. Comparison chart of calculation speed.
Figure 6. Model efficiency comparison chart.
Table 1. Gas specifications.
Gas Name    Specification
CO          GAS (400 mmHg DILUTED TO A TOTAL PRESSURE OF 600 mmHg WITH N2)
CO2         GAS (200 mmHg DILUTED TO A TOTAL PRESSURE OF 600 mmHg WITH N2)
CH4         GAS (150 mmHg DILUTED TO A TOTAL PRESSURE OF 600 mmHg WITH N2)
C2H2        GAS (50 mmHg DILUTED TO A TOTAL PRESSURE OF 600 mmHg WITH N2)
C2H4        GAS (150 mmHg DILUTED TO A TOTAL PRESSURE OF 600 mmHg WITH N2)
C2H6        GAS (150 mmHg DILUTED TO A TOTAL PRESSURE OF 600 mmHg WITH N2)
Table 2. Adaptability comparison of downstream models.
Method      MSE (%)    R2
PCA         0.014      0.983
PLS         0.012      0.986
AE          0.033      0.946
SAE         0.041      0.952
SAE-PCA     0.013      0.985
SAE-PLS     0.012      0.987
Table 3. Comparison of model prediction accuracy.
Regression Model    MSE (%)    R2
SVR                 0.429      0.506
KNN                 0.029      0.967
DT                  0.035      0.960
RF                  0.042      0.951
ET                  0.075      0.914
Bagging             0.048      0.944
GBRF                0.043      0.951
AdaBoost            0.142      0.840
XGBoost             0.012      0.987
Table 4. Comparison of model performance decay trends under different noise level interferences.
Noise Level    MSE (%)
               S-PLS-X    S-PLS-RF    PCA-X    PCA-DT    S-PCA-X
20 dB          0.012      0.042       0.014    0.074     0.013
15 dB          0.018      0.054       0.024    0.082     0.021
10 dB          0.023      0.064       0.034    0.089     0.034
