1. Introduction
Raman spectroscopy, characterized by its high resolution, high accuracy, and rapid real-time analysis capability, is widely employed for qualitative and quantitative analysis in fields such as materials science, food and pharmaceuticals, medical diagnostics, environmental monitoring, and mineral identification [1,2,3,4,5,6,7,8]. However, Raman data analysis faces multiple challenges, among which baseline drift is a primary concern. Due to the influences of the environment, instrumentation, and the sample itself, Raman spectra frequently exhibit baseline drift. This phenomenon degrades the signal-to-noise ratio, distorts the true shape of spectral peaks, and leads to inaccuracies in the calculation of characteristic parameters such as peak height and peak area, thereby causing significant interference in subsequent qualitative and quantitative analyses [9,10]. Therefore, prior to in-depth analysis, it is essential to remove this background signal through baseline correction algorithms to extract pure, component-related information from complex spectral data. Effectively eliminating the fluorescent background and enhancing data reliability is a prerequisite for accurate band decomposition and all subsequent chemometric analyses.
A variety of baseline correction algorithms are available, including polynomial fitting [11,12,13,14], segmented fitting [15], moving window smoothing [16], wavelet transform [17,18], morphological methods [19,20,21], penalized least squares [22], and deep learning [23]. In the processing of Raman spectra, different baseline correction algorithms exhibit distinct performance characteristics. The derivative method alters the original spectral shape, which can significantly impact subsequent quantitative accuracy. Polynomial fitting employs low-order polynomials to fit the spectrum, yet its parameters require careful selection. In segmented fitting algorithms, the choice of segment points often demands manual intervention, hindering automated baseline correction. The moving window smoothing method iteratively reduces peak intensities but tends to overestimate the baseline, performing particularly poorly in regions with overlapping peaks. Wavelet transform-based correction primarily removes low-frequency components from the spectrum, and its effectiveness heavily depends on the selection of the decomposition scale and the mother wavelet. Morphological algorithms, rooted in image processing, often produce non-smooth baselines and may lead to loss of peak information. Deep learning approaches require extensive training data and involve complex parameter tuning.
The penalized least squares (PLS) algorithm has become one of the most widely used methods for baseline correction due to its high computational efficiency, no need for peak detection, minimal need for prior knowledge, and strong adaptability. The PLS framework was originally introduced by Whittaker in 1922 [24]. Building on this work, various baseline correction algorithms have been developed by adopting different weighting strategies. Eilers et al. proposed the asymmetric least squares (AsLS) method [25]. He et al. subsequently incorporated a first-order derivative constraint to enhance the asymmetric least squares algorithm, resulting in an improved version (IAsLS) [26]. Zhang et al. introduced the adaptive iterative reweighted penalized least squares (airPLS) algorithm [27], which addressed the issue of selecting the asymmetry parameter and made the baseline correction process more automated. Xu et al. proposed the doubly reweighted penalized least squares (drPLS) method [28], which imposes a penalty term to constrain the smoothness term. Baek et al. developed the asymmetric reweighted penalized least squares (arPLS) method [29], which adaptively determines weights using a generalized logistic function to mitigate the influence of noise. Guo et al. fully considered the energy distribution of the signal above and below the fitted baseline, effectively addressing the issue of local over-smoothing [30]. Zhang et al. presented the adaptive smoothing parameter penalized least squares (asPLS) approach [31], which controls the smoothness level through a scaling coefficient.
Although existing PLS-based algorithms have made notable progress, they still suffer from three key limitations. First, their weight update strategies often fail to adequately account for heterogeneous noise distribution, which can lead to the loss of weak broad peaks or overestimation of peak intensities. Second, the use of a single smoothing parameter limits their ability to adapt to varying signal characteristics across different spectral regions, making it difficult to balance baseline smoothness with peak fidelity. Third, the lack of a post-processing mechanism grounded in the physical nature of Raman spectra may introduce non-physical negative signals. To address these limitations, this study proposes the aisPLS algorithm, which incorporates an optimized weight update strategy, an adaptive iterative smoothing vector, and a physically constrained post-processing procedure. These enhancements significantly improve the accuracy and robustness of Raman spectral baseline correction. Comprehensive validation using both simulated and experimentally measured spectra demonstrates that the proposed algorithm achieves excellent performance across varying noise levels and concentration conditions, thereby providing reliable technical support for high-precision quantitative Raman spectroscopy.
2. Baseline Correction Results and Analysis of the Simulation Spectrum
In order to evaluate the baseline correction performance of the improved algorithm, this study first conducted validation using simulated Raman spectra. Given that the true baseline of actual Raman spectra is difficult to obtain accurately, making it challenging to reliably assess the correction results, simulated spectra were employed to verify the performance of the proposed method. By constructing simulated data that mathematically model the baseline, noise, and pure spectral signals, it becomes possible to accurately calculate the error between the corrected baseline and the true baseline, thereby providing a reliable basis for comparing the performance of different algorithms.
2.1. Simulation of the Raman Spectrum
The simulated spectral data are modeled as three components: the pure spectrum (pr), the baseline (bl), and the noise (ns). The pure spectral component is simulated using Gaussian and Lorentzian peaks, incorporating various types of peaks such as overlapping peaks of different shapes, sharp peaks, and weak broad peaks embedded within the noise signal to further evaluate the stability of the algorithm. The baseline component consists of background noise following a Gaussian distribution, along with trigonometric functions, linear functions, exponential functions, and others. Additionally, random noise with varying signal-to-noise ratios is introduced via stochastic functions, enabling a comparison of the baseline correction algorithm’s performance across different SNR conditions.
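The Gaussian function is expressed as follows:

$$G(x) = A \exp\left(-\frac{(x - x_0)^2}{2\sigma^2}\right) \tag{1}$$

The Lorentzian function is expressed as follows:

$$L(x) = \frac{A\,\gamma^2}{(x - x_0)^2 + \gamma^2} \tag{2}$$

where $A$ is the peak height, $x_0$ is the peak position, and $\sigma$ and $\gamma$ set the widths of the Gaussian and Lorentzian peaks, respectively. The pure spectrum pr is calculated as a composite of Gaussian and Lorentzian peak functions. The simulated spectrum covers a wavenumber range from 0 cm−1 to 1600 cm−1. The simulated data are generated according to Equation (3):

$$\mathbf{y} = \mathrm{pr} + \mathrm{bl} + \mathrm{ns} \tag{3}$$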
The baseline component is constructed by combining three types of functions (linear, sinusoidal, and Gaussian), as shown in Equation (4):
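In experimentally measured Raman signals, noise is often present. To simulate the noise encountered in practical spectral data, random functions are employed to introduce noise at signal-to-noise ratios of 20 dB, 40 dB, and 60 dB. The signal-to-noise ratio is defined as follows:

$$\mathrm{SNR} = 10 \log_{10}\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right) \tag{5}$$

where $P_{\mathrm{signal}}$ and $P_{\mathrm{noise}}$ denote the mean power of the noise-free signal and of the noise, respectively.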
The resulting simulated spectrum is shown in Figure 1.
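The construction described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' generator: the peak positions, heights, widths, and baseline coefficients are all assumed values chosen for demonstration, and the SNR is taken here as the power ratio between the noise-free spectrum (pure spectrum plus baseline) and the noise, which is also an assumption.

```python
import numpy as np

def gaussian(x, height, center, sigma):
    """Gaussian peak with the given height, center, and standard deviation."""
    return height * np.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def lorentzian(x, height, center, gamma):
    """Lorentzian peak with the given height, center, and half width at half maximum."""
    return height * gamma ** 2 / ((x - center) ** 2 + gamma ** 2)

def simulate_spectrum(snr_db, n_points=1601, seed=0):
    """Return x, pure spectrum, baseline, noise, and their sum (all values illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1600.0, n_points)  # wavenumber axis in cm^-1

    # Pure spectrum: a sharp peak, two overlapping peaks, and a weak broad peak.
    pure = (gaussian(x, 10.0, 300.0, 8.0)
            + gaussian(x, 6.0, 700.0, 15.0)
            + lorentzian(x, 5.0, 740.0, 12.0)
            + gaussian(x, 0.5, 1200.0, 60.0))

    # Baseline: linear + sinusoidal + broad Gaussian terms (coefficients assumed).
    baseline = (2.0 + 0.002 * x
                + 1.0 * np.sin(2.0 * np.pi * x / 1600.0)
                + gaussian(x, 3.0, 900.0, 400.0))

    # White Gaussian noise rescaled so the realized SNR equals snr_db.
    clean = pure + baseline
    noise = rng.standard_normal(n_points)
    target_power = np.mean(clean ** 2) / 10.0 ** (snr_db / 10.0)
    noise *= np.sqrt(target_power / np.mean(noise ** 2))

    return x, pure, baseline, noise, clean + noise

x, pure, baseline, noise, spectrum = simulate_spectrum(snr_db=20)
```

Because the true baseline is returned separately, any estimated baseline can be compared against it directly, which is the reason for using simulated data in the first place.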
2.2. Comparison of Different Baseline Correction Algorithms
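Table 1 lists the optimal smoothing parameter λ obtained through cross-validation. This procedure ensures that the baseline can effectively suppress background and noise while preserving valid spectral features to the greatest extent, thereby providing a reliable data foundation for subsequent quantitative analysis. The root mean square error (RMSE) is adopted as the evaluation metric, and its calculation formula is defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(z_i - \mathrm{bl}_i\right)^2} \tag{6}$$

where $z_i$ and $\mathrm{bl}_i$ are the estimated and true baseline values at the $i$-th point, and $N$ is the number of spectral points.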
As evidenced by the quantitative evaluation results in Table 2, the proposed aisPLS algorithm consistently achieves the best baseline estimation performance across various signal-to-noise ratio (SNR) conditions, yielding the lowest root mean square error (RMSE) among all compared methods. This not only demonstrates the stable improvement capability of aisPLS in diverse noise environments, but also highlights its particularly pronounced advantage in low-SNR scenarios, where the algorithm more effectively maintains estimation accuracy despite weak signals and significant noise interference.
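For reference, the RMSE between an estimated baseline and the known simulated baseline can be computed as in the short sketch below. The function name and argument names are illustrative and not taken from the paper.

```python
import numpy as np

def baseline_rmse(estimated_baseline, true_baseline):
    """Root mean square error between an estimated and a true baseline."""
    est = np.asarray(estimated_baseline, dtype=float)
    ref = np.asarray(true_baseline, dtype=float)
    if est.shape != ref.shape:
        raise ValueError("baselines must have the same shape")
    return float(np.sqrt(np.mean((est - ref) ** 2)))
```

A lower value means the estimated baseline tracks the true background more closely, which is how the algorithms in Table 2 are ranked.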
aisPLS exhibits superior precision in signal decomposition, which is supported by the comparative visualization of baseline correction effects from the four algorithms in Figure 2. The algorithm more clearly separates the baseline component from the effective signal, notably avoiding the common issue of “weak-broad-peak loss” encountered with other methods while effectively suppressing overestimation of strong peaks. In terms of overall fitting quality, aisPLS achieves a better balance between baseline smoothness and spectral fidelity. The estimated baseline curve more closely follows the true background variation, and the corrected spectral profile approximates the ideal state without noticeable distortion or artificial oscillation.
Both quantitative metrics and visual analysis confirm the superior overall performance of aisPLS, particularly its high correction reliability under strong noise interference, which provides robust support for its application in complex real-world scenarios.
3. Baseline Correction Results and Analysis for Experimental Raman Spectra
The simulation results in the preceding section have demonstrated the effectiveness of the proposed aisPLS algorithm. To further assess its applicability and reliability in practical detection scenarios, validation experiments were conducted using experimentally measured Raman spectra. Both mineral and organic solution samples were analyzed. The mineral Raman spectra, characterized by weak signals and severe baseline drift, were employed to evaluate the algorithm’s capability in preserving weak spectral features and its accuracy in baseline correction. Meanwhile, the organic solution spectra, despite exhibiting relatively strong signals, still presented noticeable baseline drift, and were thus used to investigate the algorithm’s impact on improving subsequent quantitative analysis.
3.1. Raman Spectra of Minerals
The experimental setup employed a prototype Raman spectrometer. The samples tested were polished thin sections of peridotite and pyroxene. The excitation wavelength was 532 nm, with a spectral acquisition range of 200–1200 cm−1 and a spectral resolution of 5 cm−1. The working distance was set to 30 mm. Each spectrum was acquired with an integration time of 1 s, and 20 acquisitions were averaged to obtain the final result, thereby minimizing random noise interference.
Raman spectra of pyroxene, forsterite, and fayalite were measured experimentally, and the correction results of four baseline correction algorithms were compared. As can be seen from the raw spectra in Figure 3, all three mineral samples exhibit severe baseline drift along with weak signal intensity, with some characteristic peaks being obscured by the baseline drift and noise.
Taking pyroxene in Figure 3a as an example, in the 200–600 cm−1 region, the airPLS algorithm exhibits insufficient baseline correction, with the corrected baseline still showing noticeable convex residual drift. In contrast, the arPLS and asPLS algorithms overestimate the baseline, causing the spectrum to be excessively pulled downward. This not only weakens the intensity of the characteristic peak at 682 cm−1, but also leads to peak distortion. In the weak signal region of 800–850 cm−1, arPLS and asPLS even submerge the effective signals entirely, failing to preserve spectral details. By comparison, the aisPLS algorithm proposed in this study demonstrates superior correction performance: it effectively eliminates baseline drift while preserving the original morphology of characteristic peaks, without excessive smoothing or peak suppression. Meanwhile, key Raman characteristic peak information at 340 cm−1, 373 cm−1, and 1015 cm−1 is preserved to the greatest extent. These characteristic peaks are essential for the qualitative identification of minerals, and their integrity directly determines the reliability of subsequent analysis.
The baseline correction results for forsterite and fayalite exhibited the same pattern observed in pyroxene. The three comparison algorithms, airPLS, arPLS, and asPLS, all suffered from either over-correction or under-correction, failing to achieve both baseline flatness and preservation of characteristic peak integrity. In contrast, the aisPLS algorithm proposed in this study consistently and effectively eliminated baseline drift while accurately retaining all key characteristic peak information, demonstrating excellent adaptability.
3.2. Raman Spectra of Organic Solutions
The fiber optic spectrometer used in the experiment was the DQPro model manufactured by Shanghai RuHai Optoelectronics Technology Co., Ltd. (Shanghai, China), which was equipped with an immersion Raman probe (model RPB4-H). The test liquids were ethanol and acetonitrile mixed at different ratios. Anhydrous ethanol was provided by Tianjin Fuyu Fine Chemical Co., Ltd. (Tianjin, China), while acetonitrile and distilled water were sourced from Xi’an Tianmao Baoding Biotechnology Co., Ltd. (Xi’an, China). All chemicals were used as received without further purification. The excitation wavelength was set at 785 nm, with a spectral acquisition range of 200 cm−1–3200 cm−1 and a spectral resolution of approximately 5 cm−1. To minimize random noise interference, 50 spectral acquisitions were performed for each sample, with an integration time of 1 s per acquisition. The average of these 50 scans was taken as the raw Raman spectrum for each sample. In addition, single-scan spectral data were also collected to simulate real-world rapid detection scenarios.
Figure 4 presents the Raman spectra of 20% ethanol and acetonitrile along with the baseline correction results obtained by different algorithms, including both 50-scan averaged and single-scan measurements, to evaluate the performance of the aisPLS algorithm under varying noise levels. As shown in Figure 4a, the 50-scan averaged ethanol spectrum exhibits significant baseline drift in the 200–250 cm−1 region. Among the four algorithms compared, aisPLS demonstrates superior correction performance in this region. Moreover, in the characteristic peak region of 1200–1450 cm−1, airPLS and arPLS still suffer from under-correction, which may adversely affect subsequent peak analysis. A similar phenomenon can also be observed in Figure 4b.
Figure 4c,d display the single-scan Raman spectra of ethanol and acetonitrile, which simulate rapid detection scenarios. Because these spectra are acquired in a single 1 s scan without averaging, they exhibit relatively low signal-to-noise ratios. Under such conditions, aisPLS still performs effective baseline correction, and the corrected spectra retain clearer characteristic peak information compared to those processed by other algorithms, indicating that aisPLS remains robust across different signal-to-noise ratio levels.
Figure 5 and Figure 6 illustrate the correction results of the aisPLS algorithm applied to Raman spectra of ethanol and acetonitrile at various concentrations. Figure 5a,b show the raw spectra obtained from 50-scan averaging, while Figure 5c,d present the corresponding corrected spectra. Figure 6a,b display the raw single-scan spectra, with Figure 6c,d showing the spectra after correction. The results demonstrate that aisPLS effectively eliminates baseline drift and restores both the original shape and intensity of characteristic peaks. Across different concentrations and signal-to-noise ratios, aisPLS consistently maintains reliable baseline correction performance for both ethanol and acetonitrile Raman spectra.
Under constant conditions, there exists a linear relationship between Raman scattering intensity and component concentration. Therefore, the improvement in quantitative analysis capability achieved by baseline correction can be evaluated through the linear fitting error (R²) of characteristic peak heights. Table 3 compares the R² values of key characteristic peak heights after correction by the four algorithms. The proposed aisPLS method achieves the smallest fitting errors across all characteristic peaks, with R² values of 0.0591 for the C–O bond in ethanol and 0.2194 for the C≡N bond in acetonitrile, significantly outperforming the other methods. This indicates that aisPLS effectively preserves concentration-related quantitative information, facilitating subsequent quantitative analysis.
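The linearity check described above can be sketched as a straight-line fit of characteristic peak height against concentration. The helper below is an illustration only: its name is hypothetical, and it reports both the coefficient of determination and the residual RMSE of the fit, since either may be used as a measure of how well the corrected peak heights follow a line.

```python
import numpy as np

def linear_fit_quality(concentration, peak_height):
    """Fit peak_height = slope * concentration + intercept and report fit quality."""
    c = np.asarray(concentration, dtype=float)
    h = np.asarray(peak_height, dtype=float)
    slope, intercept = np.polyfit(c, h, 1)        # least-squares straight line
    residual = h - (slope * c + intercept)
    ss_res = np.sum(residual ** 2)                # residual sum of squares
    ss_tot = np.sum((h - h.mean()) ** 2)          # total sum of squares
    r_squared = 1.0 - ss_res / ss_tot
    fit_rmse = np.sqrt(np.mean(residual ** 2))
    return slope, intercept, r_squared, fit_rmse

# Hypothetical peak heights read from baseline-corrected spectra (not measured data).
conc = [0.1, 0.2, 0.3, 0.4, 0.5]
heights = [0.31, 0.49, 0.72, 0.90, 1.08]
slope, intercept, r_squared, fit_rmse = linear_fit_quality(conc, heights)
```

Residual baseline drift adds a concentration-independent offset or curvature to the peak heights, which shows up as larger residuals in this fit.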
To further validate the practical application value of the proposed algorithm, we established a partial least squares regression model for quantitative analysis based on single-scan Raman spectral data (in this context, PLS denotes partial least squares rather than the penalized least squares used for baseline correction). Compared with the conventional practice of using averaged spectra from multiple scans, modeling with single-measurement data more accurately reflects the algorithm’s performance in real-world rapid detection scenarios. Model performance was comprehensively evaluated using the root mean square error of cross-validation (RMSECV) and the coefficient of determination for prediction (Q²). The former reflects the absolute prediction error of the model, while the latter measures the model’s ability to explain variations in sample concentration.
As shown in Table 4, the key performance indicators demonstrate that the model constructed from data corrected by the aisPLS algorithm achieves the lowest RMSECV values (0.0374 for ethanol and 0.0362 for acetonitrile) and the highest Q² values (0.9828 for ethanol and 0.9839 for acetonitrile). It is particularly noteworthy that these results were obtained without spectral averaging, by directly processing the raw single-scan signals, further highlighting the ability of the aisPLS algorithm to better preserve the information in the original data. The experimental results clearly indicate that the aisPLS baseline correction algorithm can significantly improve the prediction accuracy and stability of subsequent quantitative analysis models. By effectively extracting high-fidelity spectral features, the algorithm provides solid and reliable technical support for high-precision, rapid quantitative analysis of practical samples using Raman spectroscopy.
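To make the two metrics concrete, the sketch below fits a single-response partial least squares regression (PLS1, NIPALS-style) with NumPy and evaluates it by leave-one-out cross-validation. It is an assumption-laden illustration: the paper does not state the cross-validation scheme or the number of latent variables, the function names are hypothetical, and the spectra are synthetic.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a PLS1 regression model; returns (x_mean, y_mean, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)            # weight vector
        t = Xr @ w                        # scores
        tt = t @ t
        p = Xr.T @ t / tt                 # X loadings
        c = (yr @ t) / tt                 # y loading
        Xr = Xr - np.outer(t, p)          # deflate X
        yr = yr - c * t                   # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

def loo_cv_metrics(X, y, n_components=1):
    """Leave-one-out RMSECV and Q2 (1 - PRESS / total sum of squares)."""
    preds = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls1_fit(X[keep], y[keep], n_components)
        preds[i] = pls1_predict(model, X[i:i + 1])[0]
    press = np.sum((y - preds) ** 2)
    rmsecv = np.sqrt(press / len(y))
    q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
    return rmsecv, q2

# Synthetic single-peak spectra whose height scales with concentration.
rng = np.random.default_rng(0)
conc = np.linspace(0.1, 1.0, 10)
axis = np.arange(200)
peak = np.exp(-((axis - 100) ** 2) / (2 * 10.0 ** 2))
X = conc[:, None] * peak[None, :] + 0.01 * rng.standard_normal((10, 200))
rmsecv, q2 = loo_cv_metrics(X, conc)
```

In practice X would hold the baseline-corrected single-scan spectra, so any residual baseline left by the correction step propagates directly into RMSECV and Q².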
4. Methodology
4.1. The Penalized Least Squares Methods
The PLS algorithm constructs its objective function by jointly constraining the similarity between the signal and the fitted baseline, along with the smoothness of the baseline itself. Assuming the signal sequence is represented as $\mathbf{y} = (y_1, y_2, \ldots, y_N)$ and the fitted smooth sequence is $\mathbf{z} = (z_1, z_2, \ldots, z_N)$, the fidelity term $F$ is introduced to quantify the deviation between the fitted sequence and the original signal sequence:

$$F = \sum_{i=1}^{N} (y_i - z_i)^2 = \lVert \mathbf{y} - \mathbf{z} \rVert^2 \tag{1}$$
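The roughness measure $R$ is employed to quantify the smoothness of the fitted sequence $\mathbf{z}$. Although second-order differences are often adopted in practical applications, the first-order difference is used here to simplify the expression:

$$R = \sum_{i=2}^{N} (z_i - z_{i-1})^2 = \lVert \mathbf{D}\mathbf{z} \rVert^2 \tag{2}$$

Here, $\Delta z_i = z_i - z_{i-1}$, and $\mathbf{D}$ is defined as the derivative of the identity matrix, such that $\mathbf{D}\mathbf{z} = \Delta\mathbf{z}$. With the smoothing parameter $\lambda$ balancing fidelity against roughness, the cost function $Q$ is formulated as follows:

$$Q = \lVert \mathbf{y} - \mathbf{z} \rVert^2 + \lambda \lVert \mathbf{D}\mathbf{z} \rVert^2 \tag{3}$$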
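During the computational procedure, the baseline estimation problem is transformed into minimizing the cost function, which is typically achieved by setting the partial derivative with respect to $\mathbf{z}$ to zero:

$$\frac{\partial Q}{\partial \mathbf{z}} = -2(\mathbf{y} - \mathbf{z}) + 2\lambda \mathbf{D}^{\mathsf{T}}\mathbf{D}\mathbf{z} = 0 \tag{4}$$

The final expression for the fitted baseline $\mathbf{z}$ is derived as:

$$\mathbf{z} = (\mathbf{I} + \lambda \mathbf{D}^{\mathsf{T}}\mathbf{D})^{-1}\mathbf{y} \tag{5}$$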
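where $\mathbf{I}$ represents the identity matrix. The standard PLS algorithm does not account for the regional effect of fidelity on the objective function. To address this limitation, the AsLS algorithm introduces a weight vector $\mathbf{w}$, which assigns greater weights to points likely to belong to the baseline and smaller weights to points likely to belong to peaks:

$$Q = \sum_{i=1}^{N} w_i (y_i - z_i)^2 + \lambda \sum_{i=2}^{N} (z_i - z_{i-1})^2 = (\mathbf{y} - \mathbf{z})^{\mathsf{T}}\mathbf{W}(\mathbf{y} - \mathbf{z}) + \lambda \mathbf{z}^{\mathsf{T}}\mathbf{D}^{\mathsf{T}}\mathbf{D}\mathbf{z} \tag{6}$$

where $\mathbf{W} = \mathrm{diag}(\mathbf{w})$. The corrected formulation for the fitted baseline $\mathbf{z}$ is given by:

$$\mathbf{z} = (\mathbf{W} + \lambda \mathbf{D}^{\mathsf{T}}\mathbf{D})^{-1}\mathbf{W}\mathbf{y} \tag{7}$$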
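The weighted solution amounts to one linear solve. The sketch below uses dense NumPy matrices for readability and a second-order difference matrix by default; both choices are assumptions made for illustration, and a sparse solver would normally be preferred for long spectra. The function name is hypothetical.

```python
import numpy as np

def weighted_pls_baseline(y, weights, lam, order=2):
    """Solve (W + lam * D^T D) z = W y for the fitted baseline z."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Differencing the identity matrix row-wise gives the difference matrix D.
    D = np.diff(np.eye(y.size), n=order, axis=0)
    return np.linalg.solve(np.diag(w) + lam * D.T @ D, w * y)

# A straight line carries no second-difference penalty, so it is returned
# unchanged, even when a spike is hidden from the fit by a zero weight.
line = 1.0 + 0.5 * np.arange(50)
spiked = line.copy()
spiked[25] += 100.0
w = np.ones(50)
w[25] = 0.0
z = weighted_pls_baseline(spiked, w, lam=1e4)
```

Setting a weight to zero removes that point from the fidelity term, so the baseline there is determined only by the smoothness penalty and the neighboring points. This is the mechanism all the reweighting variants build on.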
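By modifying the iterative strategy for updating the weight vector and the termination criteria of the iteration process, various variants of the PLS algorithm have been developed. In the airPLS algorithm, the weight vector at iteration step $t$ is selected according to the following criterion:

$$w_i^{t} = \begin{cases} 0, & y_i \ge z_i^{t-1} \\ \exp\!\left(\dfrac{t\,(y_i - z_i^{t-1})}{\lvert \mathbf{d}^{t} \rvert}\right), & y_i < z_i^{t-1} \end{cases} \tag{8}$$

where $\mathbf{d}^{t}$ denotes a vector composed of the elements of $\mathbf{y} - \mathbf{z}^{t-1}$ for which $y_i - z_i^{t-1} < 0$. The iteration termination condition is defined as follows:

$$\lvert \mathbf{d}^{t} \rvert < 0.001 \times \lvert \mathbf{y} \rvert \tag{9}$$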
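A compact sketch of the airPLS iteration is given below, under several assumptions: dense matrices, second-order differences, weights set to zero where the signal lies on or above the current baseline and to exp(t·(y_i − z_i)/|d|) where it lies below, and |d| taken as the absolute sum of the negative residuals. Some publicly available implementations use the absolute value of the residual in the exponent, so the sign convention here should be treated as an assumption rather than as the definitive form.

```python
import numpy as np

def airpls_baseline(y, lam=1e5, max_iter=15, tol=1e-3):
    """Illustrative airPLS-style baseline estimate (dense, second-order penalty)."""
    y = np.asarray(y, dtype=float)
    D = np.diff(np.eye(y.size), n=2, axis=0)
    penalty = lam * D.T @ D
    w = np.ones_like(y)
    for t in range(1, max_iter + 1):
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        d = y - z
        below = d < 0                        # points under the current baseline
        d_norm = np.abs(d[below].sum())      # |d^t|
        # Stop when the negative residuals are negligible (or too few to refit).
        if d_norm < tol * np.abs(y).sum() or below.sum() < 2:
            break
        # Peak points get zero weight; points below the baseline keep a weight in (0, 1].
        w = np.where(below, np.exp(t * np.minimum(d, 0.0) / d_norm), 0.0)
    return z

# Noise-free test signal: linear baseline plus one Gaussian peak (illustrative).
x = np.arange(200, dtype=float)
true_baseline = 1.0 + 0.005 * x
signal = true_baseline + 10.0 * np.exp(-((x - 100.0) ** 2) / 50.0)
estimate = airpls_baseline(signal)
```

Because peak points are excluded outright, the fit is driven only by points at or below the current estimate, which is why the method needs no explicit peak detection.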
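In the arPLS algorithm, the weight vector is updated using the following rule:

$$w_i = \begin{cases} \mathrm{logistic}(y_i - z_i,\, m_{d^-},\, \sigma_{d^-}), & y_i \ge z_i \\ 1, & y_i < z_i \end{cases} \tag{10}$$

where $\mathbf{d}^{-}$ is defined over the range where $y_i - z_i < 0$, with $m_{d^-}$ and $\sigma_{d^-}$ representing the mean and standard deviation of $\mathbf{d}^{-}$, respectively. The logistic function is defined as:

$$\mathrm{logistic}(d, m, \sigma) = \frac{1}{1 + \exp\!\left(2\,\dfrac{d - (-m + 2\sigma)}{\sigma}\right)} \tag{11}$$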
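The arPLS weighting can be sketched in the same style. The version below assumes dense matrices, second-order differences, and a stopping rule based on the relative change of the weight vector (the `ratio` parameter); the clipping of the exponent is only a numerical safeguard added for this illustration.

```python
import numpy as np

def logistic_weight(d, m, sigma):
    """Generalized logistic weight: near 1 for small residuals, near 0 for peaks."""
    arg = 2.0 * (d - (-m + 2.0 * sigma)) / sigma
    return 1.0 / (1.0 + np.exp(np.clip(arg, -500.0, 500.0)))

def arpls_baseline(y, lam=1e5, max_iter=50, ratio=1e-3):
    """Illustrative arPLS-style baseline estimate (dense, second-order penalty)."""
    y = np.asarray(y, dtype=float)
    D = np.diff(np.eye(y.size), n=2, axis=0)
    penalty = lam * D.T @ D
    w = np.ones_like(y)
    for _ in range(max_iter):
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        d = y - z
        d_neg = d[d < 0]                     # residuals below the baseline
        if d_neg.size < 2 or d_neg.std() == 0.0:
            break
        m, sigma = d_neg.mean(), d_neg.std()
        w_new = np.where(d >= 0, logistic_weight(d, m, sigma), 1.0)
        if np.linalg.norm(w - w_new) / np.linalg.norm(w) < ratio:
            break
        w = w_new
    return z

# Noisy test signal: linear baseline plus one Gaussian peak (illustrative).
rng = np.random.default_rng(1)
x = np.arange(200, dtype=float)
true_baseline = 1.0 + 0.005 * x
signal = (true_baseline + 10.0 * np.exp(-((x - 100.0) ** 2) / 50.0)
          + 0.05 * rng.standard_normal(x.size))
estimate = arpls_baseline(signal)
```

Unlike the hard zero used by airPLS, the logistic function lets points slightly above the baseline keep most of their weight, so noise around the baseline is treated almost symmetrically.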
4.2. The Proposed Method: aisPLS
The baseline in Raman spectra primarily consists of inherent system noise and background noise. The inherent system noise includes the instrument response function and the dark current of the detector. The system response function typically manifests as the low-frequency component of the spectral curve, which may contain slowly varying tilts or curvatures, while the dark current of the charge-coupled device (CCD) usually presents as positive values. Background noise comprises fluorescence background, scattered light background, and the physical background of the sample, among other sources. The heterogeneity of noise distribution leads to shortcomings in conventional weighting strategies: airPLS directly sets the weights of peak regions to zero, which tends to underestimate the baseline and overestimate peak intensities; arPLS assumes a symmetric noise distribution, resulting in insufficient capability to identify weak and broad peaks, and is prone to over-fitting or under-fitting.
To address the above issues, this study proposes a refined weight update strategy, with two core improvements: first, the application of the 3-sigma rule to eliminate outliers in $\mathbf{d}^{-}$, followed by recalculation of the mean m and standard deviation σ, thereby avoiding interference from outliers in weight assignment; second, the introduction of a second-derivative discriminant factor to achieve refined classification of data points near the baseline, balancing noise suppression with the retention of weak peaks. The factor is defined as the ratio of the second derivative of the fitted baseline to that of the original signal, and it captures the abruptness of signal changes at each point. The final weight update rule is formulated as follows:
The iteration termination condition is similar to (9), ensuring the convergence of the algorithm.
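The two ingredients of the refined weighting can be illustrated separately. The sketch below shows a single-pass 3-sigma rejection on the negative residuals and a pointwise second-derivative ratio computed with finite differences. It does not reproduce the final weight update rule; the function names, the single rejection pass, and the handling of near-zero denominators are assumptions made for this example.

```python
import numpy as np

def trimmed_negative_stats(d):
    """Mean and standard deviation of the negative residuals after 3-sigma rejection."""
    d = np.asarray(d, dtype=float)
    d_neg = d[d < 0]
    m, s = d_neg.mean(), d_neg.std()
    kept = d_neg[np.abs(d_neg - m) <= 3.0 * s]   # drop points beyond 3 sigma
    return kept.mean(), kept.std()

def second_derivative_ratio(z, y, eps=1e-12):
    """Ratio of the second differences of the fitted baseline z to those of the signal y."""
    z2 = np.diff(np.asarray(z, dtype=float), n=2)
    y2 = np.diff(np.asarray(y, dtype=float), n=2)
    # Where the signal has (numerically) zero curvature the ratio is set to 0.
    return np.divide(z2, y2, out=np.zeros_like(z2), where=np.abs(y2) > eps)

# One gross outlier (-50) would otherwise dominate the mean and standard deviation.
residuals = np.array([-0.9, -1.1] * 10 + [-50.0, 0.4, 2.0])
m, sigma = trimmed_negative_stats(residuals)
```

Without the rejection step, the outlier in this example inflates the standard deviation by roughly two orders of magnitude, which would push the logistic threshold far from the baseline noise level.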
For the smoothing parameter λ, its value typically ranges from 10 to 10⁸, and the specific value must be determined through experimental validation. A larger λ value results in a smoother baseline, whereas a smaller value makes the baseline more susceptible to peak-induced fluctuations. The selection of λ reflects the relationship between the baseline and the signal, directly influencing the accuracy of baseline correction. The final determination of λ requires a combination of theoretical guidance and practical experimental optimization. In the asPLS algorithm, the ratio of the difference between the fitted vector and the spectral signal to its maximum value is used to update λ. However, such an update mechanism may lead to varying λ values even within peak regions, thereby distorting the spectral peak shapes.
To address the above issues, we propose an adaptive iterative smoothing parameter penalized least squares baseline correction algorithm (aisPLS). In aisPLS, a parameter β is introduced as the smoothing parameter adaptation rate, controlling the magnitude of amplification or reduction in each λi during iterations. The choice of β directly affects the algorithm’s convergence speed, baseline fitting stability, and the preservation of weak peaks. Throughout the iterative process, λi in peak regions is designed to exhibit an initially slow then accelerating exponential growth, which ensures the integrity of spectral peaks while allowing the baseline in non-peak regions to converge as closely as possible to the true values.
Furthermore, the scalar smoothing parameter λ is extended into a smoothing vector
. The initial value is set as
. After initialization, the value of
at each iteration step t is updated according to the following expression:
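The update expression itself is not reproduced in the sketch below. It only illustrates, under stated assumptions, how a smoothing vector can replace the scalar λ in the linear system: each second difference receives its own penalty, so the system becomes (W + Dᵀ diag(λ) D) z = W y. Whether aisPLS attaches λ_i to the differences in exactly this way is an assumption, and the function name is hypothetical.

```python
import numpy as np

def pls_baseline_vector_lambda(y, weights, lam_vec):
    """Solve (W + D^T diag(lam_vec) D) z = W y with one penalty per second difference."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    lam_vec = np.asarray(lam_vec, dtype=float)
    D = np.diff(np.eye(y.size), n=2, axis=0)      # shape (N - 2, N)
    if lam_vec.shape != (y.size - 2,):
        raise ValueError("lam_vec needs one entry per second difference (N - 2)")
    penalty = D.T @ (lam_vec[:, None] * D)        # D^T diag(lam_vec) D
    return np.linalg.solve(np.diag(w) + penalty, w * y)

# With a constant vector the result reduces to the scalar-lambda solution.
n = 60
x = np.arange(n, dtype=float)
y = np.sin(x / 10.0) + 0.01 * x
w = np.ones(n)
z_vec = pls_baseline_vector_lambda(y, w, np.full(n - 2, 1e3))
D = np.diff(np.eye(n), n=2, axis=0)
z_scalar = np.linalg.solve(np.diag(w) + 1e3 * D.T @ D, w * y)
```

Raising the entries of the vector in peak regions stiffens the baseline there without forcing the same stiffness on the rest of the spectrum, which is the motivation given above for moving from a scalar to a vector.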
During the baseline correction process using penalized least squares, the corrected Raman spectrum may contain sub-zero signals. Since Raman spectroscopy records the scattering intensity of incident light by the sample, with intensity values representing energy changes in molecular vibrational or rotational modes—which correspond to physically measurable quantities—these values should inherently be non-negative. Based on the slope characteristics of Gaussian and Lorentzian peaks, this study designs a systematic post-processing algorithm. Specifically, the algorithm begins by zeroing all negative-intensity data points. For positive-intensity data points, if the preceding data point was negative, the algorithm evaluates whether the difference between the current positive value and zero is below the standard deviation while further analyzing its variation trend: if three consecutive sampling points fail to exhibit a monotonic increasing or decreasing trend, zeroing is applied. This algorithm ensures spectral fidelity and enhances the accuracy of subsequent Raman spectral analysis. Based on the above discussion, the pseudocode of aisPLS is summarized as Algorithm 1.
Algorithm 1. Flow of aisPLS
Input: spectral data y, smoothing parameter λ, maximum relative error, maximum iteration count T, smoothing parameter adaptation rate β
Output: baseline z, pure spectrum
(1) Initialization: weight matrix W, second-order difference matrix D, smoothing parameter matrix λ, maximum iteration count T = 200, maximum relative error, smoothing parameter adaptation rate β = 2
(2) Weight update:
    (a) calculate the fitted baseline z according to Equation (7)
    (b) compute the baseline error d = y − z, extract the elements of d less than 0 to form the vector d−, apply the 3σ rule to remove outliers, then recalculate the mean m and standard deviation σ
    (c) update the weight matrix W according to Equation (13)
(3) Simultaneous smoothing parameter update: based on the classification from the weight matrix in Equation (13), update the smoothing matrix λ according to Equation (14)
(4) Termination condition check: check the termination condition according to Equation (9). If satisfied, proceed to Step (5); otherwise, return to Step (2)
(5) Algorithm termination: output the baseline z and the pure spectrum, applying the following post-processing to the pure spectrum:
    (a) if a corrected intensity is negative, set it to 0
    (b) if a corrected intensity is positive, check whether the preceding point was negative and whether the value is below the standard deviation; if both are satisfied, and three consecutive sampling points exhibit no monotonic increasing or decreasing trend, set the corresponding output to 0
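The post-processing in Step (5) can be sketched as below. Several details are assumptions made for this illustration: the standard deviation is passed in as a noise estimate `sigma`, "preceding point was negative" refers to the value before any zeroing, and the three consecutive points are taken as the current point and the two that follow it.

```python
import numpy as np

def postprocess_corrected(corrected, sigma):
    """Zero negative intensities and small, non-monotonic positives that follow them."""
    corrected = np.asarray(corrected, dtype=float)
    out = corrected.copy()
    negative = corrected < 0
    out[negative] = 0.0                              # step (a)
    for i in range(1, corrected.size - 2):           # step (b)
        if corrected[i] > 0 and negative[i - 1] and corrected[i] < sigma:
            a, b, c = corrected[i], corrected[i + 1], corrected[i + 2]
            monotonic = (a < b < c) or (a > b > c)
            if not monotonic:                        # no rising or falling peak edge
                out[i] = 0.0
    return out

example = np.array([1.0, -0.2, 0.05, -0.1, 0.03, 0.5, 1.2, 2.0])
cleaned = postprocess_corrected(example, sigma=0.1)
```

In this example the value 0.05 is zeroed because it sits between two negative points, whereas 0.03 is kept because it is the start of a monotonic rise into a peak.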
5. Conclusions
This study proposes an improved baseline correction algorithm for Raman spectroscopy (aisPLS), which introduces three key enhancements to the traditional penalized least squares method: First, an outlier detection and elimination mechanism is incorporated into the weight update process, improving the robustness of the algorithm through statistical analysis. Second, a dual discrimination strategy is designed for data points near the baseline, effectively preserving weak spectral features. Additionally, the smoothing parameter is innovatively extended into an iteratively updated smoothing vector, while a physically constrained post-processing module is integrated to ensure the corrected results are both mathematically sound and physically meaningful.
Simulation experiments demonstrate that the proposed algorithm outperforms existing mainstream methods across various noise levels, with particularly pronounced improvements under low signal-to-noise ratio conditions. In practical spectral validation, the algorithm successfully eliminated baseline drift in ethanol and acetonitrile solutions at different concentrations, achieving Q² values above 0.98 for the PLS quantitative models of both substances. The design concept of the algorithm exhibits strong generalizability and can be extended to baseline correction tasks for other spectroscopic techniques such as infrared and fluorescence spectroscopy.
Future work will focus on optimizing the adaptive adjustment mechanism for the smoothing parameter, expanding the algorithm’s applicability to extreme scenarios such as multi-component systems and strong fluorescence backgrounds, and exploring its integration with deep learning methods to develop a more powerful spectral preprocessing framework.