1. Introduction
Near-infrared spectroscopy (NIRS, 780–2500 nm) is widely applied in food, pharmaceutical, and agricultural analytics because its overtone and combination bands of O–H, C–H, and N–H functional groups provide molecular-level chemical information [1,2,3]. When combined with machine learning or deep learning algorithms, NIRS enables rapid, nondestructive prediction of fruit quality [4,5,6], tablet formulation properties [7], soil nutrients [8,9], and blood oxygen saturation [10]. However, the reliability of these applications critically depends on spectral quality, and real NIRS measurements are often contaminated by multiple noise sources.
Unlike images or standard time-series signals, NIR spectra contain long-range baseline structures, narrow absorption peaks, and weak interband correlations. Noise, therefore, acts destructively: light-source drift, detector thermal noise, sample scattering, and environmental fluctuations distort peak intensities and shift baselines, reducing the signal-to-noise ratio (SNR) and impairing chemical interpretability. In practical usage, even small perturbations near the absorption peaks can propagate to large errors in downstream regression or classification. Thus, spectral denoising is not merely a preprocessing step but a prerequisite for preserving chemically meaningful features and preventing cumulative degradation in subsequent models.
Classical preprocessing techniques, such as Savitzky–Golay smoothing [11], Standard Normal Variate (SNV) [12,13,14], Multiplicative Scatter Correction (MSC) [15,16], and PCA-based denoising [17], employ fixed linear transformations and assume local smoothness or global stationarity. These assumptions hold only for mild, uniformly distributed noise. When disturbances are nonlinear, wavelength-specific, correlated, or mixed, these methods frequently oversmooth informative peak regions, distort the spectral morphology, and treat all wavelengths equally, even though chemically relevant signatures are concentrated in only a few bands.
Recent deep learning methods have improved denoising by capturing nonlinear dependencies. Autoencoders and their variants recover latent representations to reconstruct spectra [18,19,20,21], whereas convolutional autoencoders enhance local feature extraction from adjacent spectral windows [22]. Convolutional Neural Network–Long Short-Term Memory (CNN–LSTM) hybrids further incorporate temporal modeling along the wavelength dimension to improve robustness under structured noise [23]. Attention-based denoisers extend cross-band interactions, enabling global token communication and improved preservation of fine spectral structures [24,25,26,27,28,29]. Evidence from other sensing domains also supports this direction: graph-based neural models maintain robust gesture recognition from noisy 3D point clouds [30], and similar architectures outperform handcrafted methods in human motion modeling under uncertain sensor conditions [31]. These findings indicate that deep networks can extract noise-resilient structures from high-dimensional measurements, which is desirable for NIR spectral denoising.
However, these methods all perform denoising implicitly; noise suppression emerges only as a byproduct of minimizing the reconstruction error. This leads to two inherent limitations: (i) noise is never explicitly estimated or disentangled from the spectral signal, and (ii) reconstruction objectives penalize high-frequency deviations, causing oversmoothing of the chemically informative peaks.
In NIR analysis, peak distortion is substantially more harmful than baseline fluctuations because the peak amplitude and position encode functional group information. Implicit denoising models typically reduce MSE while degrading spectral interpretability and chemical utility. Furthermore, most supervised approaches require large corpora of paired clean–noisy spectra, which are rarely available for commercial NIR instruments. Real measurements include baseline drift, correlated noise, intensity fluctuations, and occasional wavelength failures. To avoid these issues, existing studies [8,32] commonly inject simplified Gaussian noise, which poorly approximates realistic multisource degradation patterns and fails to reflect spectral acquisition mechanisms.
To address these challenges, we constructed a controllable pseudo-supervised framework that generates paired training data by injecting synthetic noise into real spectra. We defined four representative perturbation types (constant, stripe, uniform, and correlated), each corresponding to a typical degradation source in NIRS: baseline shift, random intensity fluctuations, detector bandwise interference, and cross-band variations. Their intensities and distributions were parameterized to enable reproducible experiments across degradation levels. A hybrid configuration combines all four to emulate multisource contamination and serves as a stress test for model robustness.
Building on this framework, we propose BiLSTM-FuseNet, a denoising architecture tailored to the NIR spectral structure. Unlike conventional AE or LSTM architectures that indirectly suppress noise, BiLSTM-FuseNet performs explicit residual noise estimation: a lightweight MLP branch learns a nonlinear mapping from the input spectra to the noise component, which is subsequently subtracted from the original input. In parallel, a bidirectional LSTM (BiLSTM) backbone captures global wavelength continuity, whereas a convolution–BiLSTM fusion path strengthens local spectral dependencies. This dual-path design enables the preservation of narrow functional absorption regions and the stabilization of broad baseline structures, yielding robust reconstructions even under mixed or correlated degradations. The main contributions of this study are as follows:
We introduce a multitype artificial noise-mixing strategy to simulate real-world composite noise conditions.
We develop a spectral denoising training framework based on synthetically paired spectra that enables strong generalization to unseen noise types.
We systematically evaluate the model performance under varying noise types and intensities and compare it with both traditional and deep learning-based baselines.
The experimental results demonstrate that our method achieves a superior reconstruction performance in high-noise environments, thereby exhibiting strong practical applicability and research value.
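The dual-path design described above can be sketched in PyTorch as follows. Layer widths, the fusion head, and the single-pass residual subtraction are illustrative assumptions, not the exact BiLSTM-FuseNet configuration reported in the paper:

```python
import torch
import torch.nn as nn

class DualPathDenoiser(nn.Module):
    """Minimal sketch of the dual-path idea behind BiLSTM-FuseNet.
    Hidden sizes and the fusion scheme are illustrative guesses."""

    def __init__(self, n_bands, hidden=64):
        super().__init__()
        # Lightweight MLP branch: explicit residual noise estimation.
        self.noise_mlp = nn.Sequential(
            nn.Linear(n_bands, hidden), nn.ReLU(), nn.Linear(hidden, n_bands)
        )
        # Global path: BiLSTM over the wavelength axis for long-range continuity.
        self.bilstm = nn.LSTM(1, hidden, batch_first=True, bidirectional=True)
        # Local path: convolution for narrow absorption features.
        self.conv = nn.Conv1d(1, hidden, kernel_size=5, padding=2)
        self.head = nn.Linear(2 * hidden + hidden, 1)

    def forward(self, x):                          # x: (batch, n_bands)
        denoised = x - self.noise_mlp(x)           # subtract estimated noise
        g, _ = self.bilstm(denoised.unsqueeze(-1)) # (batch, n_bands, 2*hidden)
        l = self.conv(denoised.unsqueeze(1)).transpose(1, 2)  # local features
        return self.head(torch.cat([g, l], dim=-1)).squeeze(-1)

x = torch.randn(4, 228)                            # e.g. 228 soil bands
out = DualPathDenoiser(228)(x)
print(out.shape)
```

The explicit subtraction step is what distinguishes this design from reconstruction-only autoencoders: the MLP branch is supervised to model the noise itself rather than leaving suppression as a byproduct of the reconstruction loss.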
3. Experimental Results and Discussion
3.1. Experimental Settings
3.1.1. Dataset
Pharmaceutical tablet dataset [34]: This study utilized the tablet dataset, which consists of 654 pharmaceutical tablets and is well suited for chemometric training and algorithm testing. The measured spectral range was 600–1898 nm with a resolution of 2 nm. Each sample was labeled according to its weight, hardness, and active pharmaceutical ingredient (API) content. This dataset is available at
https://eigenvector.com/wp-content/uploads/2019/06/nir_shootout_2002.mat_.zip (accessed on 26 May 2025).
Anhui-NIR-Soil Dataset [8]: The soil spectral dataset used in this study was the Anhui-NIR-Soil Dataset, which was mainly collected from Huangshan and Shitai counties in southern Anhui Province, China. The dataset was designed to support urban soil environmental quality monitoring and to provide basic data on soil functionality and environmental conditions for sustainable development in Anhui Province. Soil samples were collected using the "five-point diagonal sampling method," where each sample point was taken from a depth of 0–20 cm and five points were combined into one composite sample. A total of 188 composite samples were collected, each weighing approximately 1.5 kg. The collected soil samples were air-dried, pulverized, and screened through a 20-mesh sieve in the laboratory to obtain the required particle size. Each sample was then subjected to near-infrared (NIR) spectroscopic scanning and a series of chemical property measurements to form a structured dataset. The NIR spectral range for each sample was 901.57–1701.18 nm, totaling 228 bands. The nitrogen, phosphorus, and potassium content of each sample was measured.
3.1.2. Implementation Details
The training configuration and the detailed synthetic noise models used in the experiments are summarized in Table 1 and Table 2, respectively. We used 4-fold cross-validation to partition the Pharmaceutical Tablet and Anhui-NIR-Soil datasets. The batch size was 32, the number of training epochs was 1000, the optimizer was AdamW, the initial learning rate was 0.001, and the learning rate was scheduled with cosine annealing. All experiments were implemented in PyTorch on Python 3.10.
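A minimal sketch of this training configuration follows; the linear model and random data are stand-ins, and only the fold count, batch size, optimizer, and scheduler mirror the settings above:

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

EPOCHS, BATCH, LR = 1000, 32, 1e-3               # settings from the text

spectra = np.random.rand(188, 228).astype(np.float32)  # placeholder data
folds = list(KFold(n_splits=4, shuffle=True, random_state=0).split(spectra))

model = torch.nn.Linear(228, 228)                # stand-in denoiser
opt = torch.optim.AdamW(model.parameters(), lr=LR)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)

for epoch in range(3):                           # 3 epochs shown for brevity
    train_idx, _ = folds[0]
    batch = torch.from_numpy(spectra[train_idx[:BATCH]])
    loss = torch.nn.functional.mse_loss(model(batch), batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()                                 # cosine-annealed learning rate

print(round(sched.get_last_lr()[0], 6))
```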
For synthetic noise generation, σ controls the intensity of the constant and uniform noise. Stripe noise was injected into 33% of the spectral bands, and 5–15% of the samples were corrupted with stripe artifacts. The stripe amplitude followed a fixed uniform distribution, whereas the other stripe-related parameters were kept constant; σ only controls the standard deviation of the additive Gaussian noise applied to the contaminated spectral bands. For correlated noise, σ is defined as the absolute value of the parameters β and η. We fixed η at 0.15 and varied β to adjust the correlated-noise strength. Mixed noise simultaneously includes all four types, and σ represents the global mixed-noise level. For example, when σ = 0.1, the spectra contain constant noise with intensity 0.1, uniform noise at the same level, stripe noise with amplitude 0.1, and correlated noise with β = 0.1. The effective perturbation is approximately four times the magnitude of the single-noise setting.
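The four perturbation types can be sketched as follows. The uniform bounds, the stripe band selection, and the AR(1)-style recursion for correlated noise are plausible constructions consistent with the description above, not the paper's exact parameterization:

```python
import numpy as np

def add_constant(x, sigma):
    """Baseline shift: a single offset added to every band."""
    return x + sigma

def add_uniform(x, sigma, rng):
    """Random intensity fluctuations; U(-sigma, sigma) bounds are an assumption."""
    return x + rng.uniform(-sigma, sigma, size=x.shape)

def add_stripes(x, sigma, rng, band_frac=0.33):
    """Band-wise interference: Gaussian noise on ~33% of the bands."""
    noisy = x.copy()
    bands = rng.choice(x.shape[-1], int(band_frac * x.shape[-1]), replace=False)
    noisy[..., bands] += rng.normal(0.0, sigma, size=(*x.shape[:-1], len(bands)))
    return noisy

def add_correlated(x, beta, rng, eta=0.15):
    """Cross-band variation via an AR(1)-style recursion (assumed form)."""
    e = rng.normal(0.0, 1.0, size=x.shape)
    n = np.zeros_like(x)
    n[..., 0] = eta * e[..., 0]
    for j in range(1, x.shape[-1]):
        n[..., j] = beta * n[..., j - 1] + eta * e[..., j]
    return x + n

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 6, 228))           # toy 228-band spectrum
mixed = add_correlated(add_stripes(add_uniform(add_constant(clean, 0.1),
                                               0.1, rng), 0.1, rng), 0.1, rng)
print(mixed.shape)
```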
It should also be noted that the clean spectra used in training and evaluation are not strictly noise-free in a physical sense but represent the closest available approximations to noise-reduced ground truth.
3.1.3. Evaluation Metrics
The average MSE is used as the loss during training. SNR, RMSE, Pearson's correlation coefficient R, and the coefficient of determination R2 were used as the denoising evaluation metrics. The SNR measures the signal-to-noise ratio between the model outputs and the reference spectra; larger values indicate better noise suppression. The RMSE evaluates the average reconstruction error between the reconstructed and reference spectra; smaller values indicate better spectral restoration. Pearson's correlation coefficient R measures the linear correlation between the reconstructed and reference spectra; the closer it is to 1, the better the restoration. The coefficient of determination R2 indicates the extent to which the model output explains the variance of the reference spectrum: a zero or negative value indicates that reconstruction failed, whereas positive values closer to 1 indicate better restoration.
The formulas for calculating the MSE loss and the assessment indicators are as follows:

$$\mathrm{MSE}=\frac{1}{NB}\sum_{i=1}^{N}\sum_{j=1}^{B}\left(x_{ij}-\hat{x}_{ij}\right)^{2},\qquad \mathrm{RMSE}=\sqrt{\mathrm{MSE}},$$

$$\mathrm{SNR}=10\log_{10}\frac{\sum_{i=1}^{N}\sum_{j=1}^{B}x_{ij}^{2}}{\sum_{i=1}^{N}\sum_{j=1}^{B}\left(x_{ij}-\hat{x}_{ij}\right)^{2}},$$

$$R=\frac{\mathrm{Cov}\left(x,\hat{x}\right)}{\sqrt{\mathrm{Var}\left(x\right)\,\mathrm{Var}\left(\hat{x}\right)}},\qquad R^{2}=1-\frac{\sum_{i=1}^{N}\sum_{j=1}^{B}\left(x_{ij}-\hat{x}_{ij}\right)^{2}}{\sum_{i=1}^{N}\sum_{j=1}^{B}\left(x_{ij}-\bar{x}\right)^{2}},$$

where $x_{ij}$ denotes the true spectral intensity of the $i$th sample in the $j$th band, $\hat{x}_{ij}$ denotes the reconstructed spectral value of the $i$th sample in the $j$th band, $\bar{x}$ denotes the mean of the true spectra over all bands, $\bar{\hat{x}}$ denotes the mean of the reconstructed spectra, $N$ denotes the number of samples, and $B$ denotes the number of spectral bands. In the correlation coefficient $R$, $\mathrm{Cov}(x,\hat{x})$ represents the covariance between the reconstructed and true spectra, and $\mathrm{Var}(x)$ and $\mathrm{Var}(\hat{x})$ denote their variances.
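These metrics can be computed as follows; the SNR normalization used here (total reference-signal power over residual power) is a standard choice and an assumption about the paper's exact definition:

```python
import numpy as np

def snr_db(ref, est):
    """Output SNR in dB: reference power over residual power (assumed form)."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def rmse(ref, est):
    """Root-mean-square reconstruction error."""
    return np.sqrt(np.mean((ref - est) ** 2))

def pearson_r(ref, est):
    """Pearson correlation between reference and reconstruction."""
    return np.corrcoef(ref.ravel(), est.ravel())[0, 1]

def r2(ref, est):
    """Coefficient of determination; <= 0 means reconstruction failed."""
    ss_res = np.sum((ref - est) ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    return 1 - ss_res / ss_tot

ref = np.sin(np.linspace(0, 6, 228))
est = ref + 0.01 * np.random.default_rng(0).normal(size=228)
print(round(rmse(ref, est), 4), round(r2(ref, est), 4))
```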
To validate the effectiveness of the spectral lines after denoising, we used a Support Vector Regression (SVR) model to predict the target attributes in the downstream task. Regression performance was evaluated using RMSE, R, and R2. The downstream evaluation metrics are calculated as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}},\qquad R=\frac{\mathrm{Cov}\left(y,\hat{y}\right)}{\sqrt{\mathrm{Var}\left(y\right)\,\mathrm{Var}\left(\hat{y}\right)}},\qquad R^{2}=1-\frac{\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{N}\left(y_{i}-\bar{y}\right)^{2}},$$

where $N$ denotes the total number of test samples, $y_{i}$ denotes the true label value of the $i$th sample, $\hat{y}_{i}$ denotes the model-predicted value of the $i$th sample, and $\bar{y}$ denotes the average of all true labels. In the correlation coefficient $R$, $\mathrm{Cov}(y,\hat{y})$ represents the covariance between the predicted and true values, whereas $\mathrm{Var}(y)$ and $\mathrm{Var}(\hat{y})$ denote their variances.
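A minimal sketch of this downstream check with scikit-learn's SVR; the data, kernel, and C value are placeholders rather than the study's settings:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(188, 228))                   # stand-in "denoised" spectra
y = X[:, 50] * 2.0 + rng.normal(0, 0.1, 188)      # synthetic target property

# Fit SVR on a training split and score the held-out samples.
model = SVR(kernel="rbf", C=10.0).fit(X[:150], y[:150])
pred = model.predict(X[150:])
print(round(float(np.sqrt(mean_squared_error(y[150:], pred))), 3),
      round(float(r2_score(y[150:], pred)), 3))
```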
3.2. Visualization of Denoising Results Under Synthetic Noise
To visually evaluate the denoising capability of the different methods, Figure 2, Figure 3 and Figure 4 present the reconstructed spectra under three representative scenarios: constant noise, mixed synthetic noise, and mixed noise on the Anhui-NIR-Soil dataset. In Figure 2 and Figure 3, five denoising approaches are included (PCA denoiser, Autoencoder, CNN denoiser, BiLSTM denoiser, and the proposed BiLSTM-FuseNet), allowing a comprehensive comparison across different noise conditions. In contrast, Figure 4 displays only three representative models (Autoencoder, BiLSTM denoiser, and BiLSTM-FuseNet), as the additional methods yield visually redundant spectral curves and do not provide meaningful insight for comparative analysis.
From Figure 2, it can be observed that PCA denoising fails to reconstruct the spectral profile and leaves a considerable amount of residual noise. Autoencoder- and CNN-based denoisers suppress large fluctuations but oversmooth the signal, causing the loss of the characteristic absorption peaks. The BiLSTM denoiser captures most of the peak structures; however, compared with our proposed BiLSTM-FuseNet, it exhibits weaker reconstruction fidelity in chemically meaningful regions, particularly at approximately 1206, 1392, 1570, and 1676 nm. These wavelength regions are closely associated with fundamental physicochemical absorptions in the NIR domain: 1200–1400 nm corresponds to the overtone absorption of C–H and N–H bonds; 1400–1600 nm corresponds to the second overtone of O–H stretching; and 900–1100 nm is influenced by polysaccharide-related scattering and matrix effects. In contrast, BiLSTM-FuseNet preserves both the global spectral trend and fine peak–valley structures in these chemically relevant regions, indicating that global temporal modeling combined with local feature extraction enables the network to recover physically meaningful spectral information rather than merely smoothing noise.
As shown in Figure 3, the mixed-noise setting introduces both amplitude perturbations and structured distortions, making the reconstruction task more challenging. Principal component analysis (PCA) and convolutional neural network (CNN)-based denoisers primarily suppress low-frequency fluctuations and fail to recover the chemically relevant peaks. The BiLSTM denoiser restores the overall spectral trend but attenuates feature bands within 1400–1600 nm (second overtone of O–H stretching), which are crucial for characterizing excipient composition and hydration. In contrast, BiLSTM-FuseNet preserves both the peak height and valley curvature in these regions, indicating superior retention of O–H molecular signatures and other physicochemical structures. This suggests that combining global bidirectional modeling with local feature extraction enables the network to retain chemically meaningful spectral responses rather than smoothing them as noise.
For the Anhui-NIR-Soil dataset in Figure 4, the spectra contain very few sharp peaks and primarily exhibit smooth broadband variations. In this case, the autoencoder tends to oversmooth the curve and deviates noticeably from the reference. By contrast, both the BiLSTM denoiser and BiLSTM-FuseNet maintain the overall spectral shape well, with BiLSTM-FuseNet providing the closest match to the ground truth and the best preservation of the global spectral trend.
In conclusion, BiLSTM-FuseNet agrees most closely with the ground truth, indicating the strongest retention of the global spectral trend. Collectively, these visual results show that the proposed BiLSTM-FuseNet not only suppresses noise but also preserves chemically informative spectral structures, outperforming methods that merely smooth or approximate the spectral curve.
Table 3.
Reconstruction performance comparison of different denoising methods under various noise types (Tablet dataset; values reported as mean ± standard deviation).
| Noise | Input SNR | Method | RMSE ↓ | Output SNR ↑ | Pearson R ↑ | R2 ↑ |
|---|---|---|---|---|---|---|
| Constant (σ = 0.5) | 0.65 ± 0.01 | PCA | 0.180 ± 0.001 ‡ | 1.92 ± 0.02 ‡ | 0.661 ± 0.003 ‡ | – |
| | | AutoEncoder [20] | 0.097 ± 0.000 ‡ | 3.44 ± 0.05 ‡ | 0.787 ± 0.003 ‡ | – |
| | | BiLSTM [23] | 0.031 ± 0.002 † | 14.79 ± 1.34 † | 0.987 ± 0.002 † | 0.44 ± 0.14 |
| | | Ours | 0.026 ± 0.001 | 18.44 ± 0.56 | 0.992 ± 0.000 | 0.64 ± 0.04 |
| Stripes (σ = 0.5) | 0.64 ± 0.01 | PCA | 0.181 ± 0.001 ‡ | 1.92 ± 0.03 ‡ | 0.656 ± 0.004 ‡ | – |
| | | AutoEncoder | 0.097 ± 0.001 ‡ | 3.41 ± 0.07 ‡ | 0.782 ± 0.005 ‡ | – |
| | | BiLSTM | 0.030 ± 0.002 † | 15.04 ± 1.14 † | 0.988 ± 0.003 | 0.49 ± 0.11 |
| | | Ours | 0.025 ± 0.001 | 18.44 ± 0.23 | 0.992 ± 0.000 | 0.65 ± 0.02 |
| Uniform (σ = [0, 0.5]) | 1.12 ± 0.02 | PCA | 0.104 ± 0.001 ‡ | 3.37 ± 0.06 ‡ | 0.834 ± 0.002 ‡ | – |
| | | AutoEncoder | 0.053 ± 0.001 ‡ | 6.20 ± 0.13 ‡ | 0.940 ± 0.001 ‡ | – |
| | | BiLSTM | 0.023 ± 0.002 ‡ | 18.65 ± 0.84 ‡ | 0.993 ± 0.001 † | 0.71 ± 0.03 † |
| | | Ours | 0.018 ± 0.001 | 22.22 ± 0.43 | 0.995 ± 0.000 | 0.81 ± 0.02 |
| Correlated (β = 0.5, η = 0.15) | 1.05 ± 0.02 | PCA | 0.124 ± 0.001 ‡ | 2.80 ± 0.05 ‡ | 0.785 ± 0.002 ‡ | – |
| | | AutoEncoder | 0.062 ± 0.001 ‡ | 5.38 ± 0.10 ‡ | 0.918 ± 0.001 ‡ | – |
| | | BiLSTM | 0.017 ± 0.002 | 23.08 ± 2.60 | 0.995 ± 0.001 | 0.84 ± 0.02 |
| | | Ours | 0.015 ± 0.000 | 27.14 ± 0.86 | 0.997 ± 0.000 | 0.89 ± 0.01 |
| Mix (σ = 0.5) | 0.39 ± 0.01 | PCA | 0.299 ± 0.001 ‡ | 1.16 ± 0.02 ‡ | 0.473 ± 0.004 ‡ | – |
| | | AutoEncoder | 0.123 ± 0.001 ‡ | 2.70 ± 0.03 ‡ | 0.618 ± 0.006 ‡ | – |
| | | BiLSTM | 0.033 ± 0.001 | 15.60 ± 0.44 † | 0.989 ± 0.001 † | 0.41 ± 0.04 |
| | | Ours | 0.033 ± 0.002 | 16.24 ± 0.67 | 0.990 ± 0.001 | 0.43 ± 0.05 |
3.3. Robustness Evaluation Under Different Noise Types and Intensities
The visual observations in Figure 2, Figure 3 and Figure 4 indicate that BiLSTM-FuseNet better preserves chemically meaningful peak–valley structures, especially under mixed and correlated noise. To statistically validate these observations, we conducted a comprehensive quantitative comparison across different noise types and intensities, as summarized in Table 3, Table 4 and Table 5. In our proposed model, the number of residual iterations K was set to 12 for the Pharmaceutical Tablet dataset and 6 for the Anhui-NIR-Soil dataset.
Table 3 reports the reconstruction performance on the Tablet dataset across five noise types. Overall, our BiLSTM-FuseNet achieved the lowest RMSE and the highest SNR and R2 in all settings, significantly outperforming PCA and AutoEncoder (paired Student's t-test, p < 0.05 or p < 0.01). Under constant and stripe noise, PCA fails to reconstruct meaningful spectral profiles, whereas BiLSTM-FuseNet maintains an RMSE of around 0.025–0.026 and an SNR of around 18.4 dB, indicating strong resistance to baseline shift and localized perturbations. For correlated noise (β = 0.5), the advantage becomes more pronounced: SNR = 27.14 dB for our model vs. 23.08 dB for BiLSTM and 5.38 dB for AutoEncoder, demonstrating superior modeling of cross-band dependencies. In the mixed-noise condition, PCA and AutoEncoder collapse (negative R2), whereas BiLSTM and BiLSTM-FuseNet remain stable. Notably, only BiLSTM-FuseNet retains positive R2, with a Pearson correlation of approximately 0.64, indicating that its reconstructed spectra remain chemically consistent rather than being merely smoothed waveforms.
Table 4 compares the denoising performance of the different methods with increasing mixed-noise intensity. BiLSTM-FuseNet consistently outperformed PCA and AutoEncoder across all noise levels, with statistically significant improvements in RMSE, SNR, and R2 (paired Student's t-test, p < 0.05). Compared with the BiLSTM baseline, the advantage of BiLSTM-FuseNet is noise-dependent. At low noise levels (σ = 0.1–0.25), both models achieved comparable performance, and no statistically significant difference was observed. As the noise intensity increases (σ ≥ 0.5), BiLSTM-FuseNet demonstrates a significantly lower RMSE and higher SNR and R2 (p < 0.05), while the BiLSTM baseline begins to deteriorate, particularly at σ = 1.0 (R2 = 0.07). These results indicate that the proposed dual-path architecture not only suppresses random noise but also retains chemically meaningful spectral structures under challenging noise conditions, where recurrent reconstruction alone becomes unstable.
To evaluate cross-dataset generalization, mixed-noise experiments with varying intensities were performed on the Anhui-NIR-Soil dataset (Table 5). Under mild noise (σ = 0.1), all models achieved comparable performance, and the differences between AutoEncoder, CNN, BiLSTM, and BiLSTM-FuseNet were not statistically significant (p > 0.05), indicating that low-level perturbations can be handled by standard reconstruction models. As the noise increases (σ = 0.25–0.50), the latent-compression and convolutional baselines begin to degrade rapidly: the R2 of both AutoEncoder and CNN drops from 0.73 to 0.15, implying that latent compression alone fails to preserve the spectral structure. In contrast, the BiLSTM baseline remained stable in this range, and BiLSTM-FuseNet achieved the best reconstruction, with statistically significant improvements in SNR and R2 over both AutoEncoder and CNN (p < 0.05). When the noise becomes severe (σ ≥ 0.75), shallow methods collapse completely: CNN and AutoEncoder produce negative R2 values and very low output SNRs (<4 dB), indicating structural reconstruction failure. The BiLSTM baseline maintained partial recovery capability, but its performance deteriorated at σ = 1.0 (RMSE = 0.064, R2 = 0.27). In comparison, BiLSTM-FuseNet achieved the lowest RMSE and highest SNR and R2 across all noise intensities and remained effective even at σ = 1.0 (R2 = 0.29), where all other models failed. This result demonstrates that explicit residual noise modeling and hierarchical feature fusion substantially improve denoising robustness under strong spectral degradation.
Overall, BiLSTM-FuseNet demonstrated noise-dependent superiority: performance differences were negligible under mild noise but became statistically significant at medium and high noise levels, indicating that the proposed dual-path architecture is particularly effective in heavily degraded spectral conditions.
3.4. Ablation Studies
3.4.1. Study on Encoder Replacement in BiLSTM-FuseNet
To understand why BiLSTM-FuseNet outperforms the baseline denoisers with respect to reconstruction accuracy and statistical robustness, we performed an ablation study that systematically varied the encoder design. We replaced the BiLSTM encoder with different backbone networks while keeping all training and optimization settings the same. The performance differences are listed in Table 6. Under constant noise (σ = 0.5), GRU and BiLSTM achieve the lowest reconstruction error (RMSE = 0.025), while BiLSTM obtains the highest output SNR (19.09 dB vs. 19.03 dB for GRU), indicating that bidirectional sequence modeling offers a slight but consistent advantage in preserving local spectral transitions. This superiority was not statistically significant at this noise level (p > 0.05), suggesting that simple recurrent units can handle uniform perturbations when the noise structure remains homogeneous.
When the noise became more heterogeneous (Mix, σ = 0.5), the differences across encoders became statistically meaningful (p < 0.05). BiLSTM achieved the best reconstruction across all metrics (RMSE = 0.034, SNR = 16.32 dB, R2 = 0.39), outperforming unidirectional recurrent models and transformer-based encoders. These results suggest that bidirectional dependency modeling is more tolerant to structured noise because it retains cross-band relationships that are degraded in unidirectional recurrent or attention-based encoders.
Although the encoder comparison clarifies the performance advantage of BiLSTM in complex noise scenarios, it is also necessary to consider whether such benefits incur computational or deployment costs. From Table 7, we observe that lightweight baseline models such as AutoEncoder and CNN denoiser have extremely small parameter sizes (6.0 K and 31.0 K) and sub-millisecond latency, indicating their suitability for resource-limited environments. However, their limited capacity restricts their ability to recover complex spectral structures. In contrast, recurrent architectures provide stronger temporal dependency modeling, yet their sequential computation causes a noticeable increase in latency (e.g., BiLSTM denoiser: 134.0 K parameters and 46.31 ms). FuseNet-based variants further increase complexity because of multi-branch feature fusion. Although Transformer-FuseNet achieves a favorable balance between model size (116.6 K) and inference speed (18.21 ms), the recurrent versions, particularly BiLSTM-FuseNet, exhibit the highest computational cost (204.0 K parameters and 474.12 ms), as bidirectional recurrence cannot be efficiently parallelized. Nevertheless, this complexity correlates with consistently superior denoising performance across all noise conditions, reflecting the benefit of enhanced spectral dependency modeling. All latency measurements were performed on an Intel Core i7-12700K CPU with a batch size of 1, representing simplified real-time inference rather than optimized GPU execution. If practical deployment requires a trade-off between denoising accuracy and throughput, Transformer-FuseNet provides a reasonable compromise.
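Parameter counts and CPU latencies of this kind can be measured with a routine like the following; a stand-in model is profiled here, and only the batch-size-1 CPU protocol mirrors the text:

```python
import time
import torch

def profile(model, n_bands=650, runs=20):
    """Parameter count and rough CPU latency at batch size 1."""
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, n_bands)
    with torch.no_grad():
        model(x)                                  # warm-up pass
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
    latency_ms = (time.perf_counter() - t0) / runs * 1000
    return n_params, latency_ms

# A tiny stand-in denoiser, only for demonstrating the measurement.
tiny = torch.nn.Sequential(torch.nn.Linear(650, 64), torch.nn.ReLU(),
                           torch.nn.Linear(64, 650))
params, ms = profile(tiny)
print(params, round(ms, 3))
```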
3.4.2. Comparison of K Values Across Iterations
Beyond the encoder choice, the denoising behavior of BiLSTM-FuseNet is also affected by the refinement depth of the residual module, which is controlled by the iteration number K. To study this effect, we conducted experiments on two representative NIR datasets with different intrinsic spectral characteristics. Although the same mixed-noise configuration was applied, the model responses varied substantially across datasets. As shown in Figure 5, the model shows only minor improvement at small iteration counts, whereas both SNR and Pearson R increase steadily as K increases, indicating that multistage refinement progressively suppresses stochastic perturbations and restores the spectral structure.
For the Tablet dataset, the model achieves optimal performance at K = 10, where the denoised SNR reaches its maximum and the Pearson R is also near its peak. Beyond this point, an increase in the number of iterations led to degradation. This suggests that moderate refinement helps recover the local spectral features of pharmaceutical samples (e.g., water-related or functional group absorption peaks), whereas excessive iterations introduce oversmoothing, thereby attenuating subtle yet meaningful spectral structures.
By contrast, the Anhui soil dataset attained its optimal iteration count at approximately K = 6. Soil spectra primarily exhibit slowly varying broadband patterns with fewer sharp absorption peaks; therefore, fewer iterations are required to restore the dominant structures. Additional iterations tend to oversmooth the spectra and diminish informative details.
Overall, the denoising iteration number exhibited a clear inverted-U behavior: moderate iteration counts removed noise and reinforced key spectral features, whereas excessive refinement destroyed local absorption structures, causing simultaneous declines in both SNR and Pearson R. Hence, the optimal value of K is closely tied to the intrinsic spectral characteristics of the dataset and should be determined adaptively for different tasks.
3.5. Comparison of Denoising Performance Under Synthetic and Real Noise
The ablation studies presented in Section 3.4 show that the denoising behavior of BiLSTM-FuseNet depends on the architectural design, refinement depth, and encoder selection. However, all prior evaluations were performed under controllable synthetic noise, where the degradation was analytically defined and followed idealized assumptions. We therefore evaluated the model using real acquisition noise extracted from repeated spectra obtained with two commercial NIR spectrometers to assess whether these advantages carry over to real measurements.
Specifically, the Pharmaceutical Tablet dataset was acquired using two NIR spectrometers manufactured by Foss (USA) [34]. We extracted the real measurement noise from the differences between the measurements of the same samples obtained by the two instruments. The extraction process consisted of (i) normalizing both datasets to the same range, (ii) computing the measurement differences of identical samples across the two spectrometers, (iii) removing systematic inter-instrument bias by subtracting the mean difference at each wavelength, and (iv) treating the remaining residuals as real acquisition noise.
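Steps (i)-(iv) can be sketched as follows; min-max normalization in step (i) is an assumption, as the exact normalization used is not stated:

```python
import numpy as np

def extract_real_noise(spec_a, spec_b):
    """Extract acquisition noise from paired measurements of the same
    samples on two instruments, following steps (i)-(iv) above."""
    # (i) normalize both datasets to the same range (min-max assumed)
    def norm(s):
        return (s - s.min()) / (s.max() - s.min())
    a, b = norm(spec_a), norm(spec_b)
    # (ii) per-sample measurement differences across instruments
    diff = a - b
    # (iii) remove systematic inter-instrument bias at each wavelength
    residual = diff - diff.mean(axis=0, keepdims=True)
    # (iv) the remaining residuals are treated as real acquisition noise
    return residual

rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 6, 650))[None, :].repeat(10, axis=0)
noise_est = extract_real_noise(base + 0.01 * rng.normal(size=base.shape),
                               base + 0.01 * rng.normal(size=base.shape))
print(noise_est.shape)
```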
To ensure a fair comparison, we first injected synthetic noise (five types) into the clean spectra and treated it as the reference degradation level.
Then, the amplitude of the real instrument noise extracted from repeated measurements was scaled by a factor α so that the resulting noisy spectra reached the same SNR level as the synthetic noise:

$$\tilde{x}=x+\alpha\,n_{\mathrm{real}},$$

where α was iteratively adjusted until

$$\mathrm{SNR}\left(x,\alpha\,n_{\mathrm{real}}\right)=\mathrm{SNR}_{\mathrm{syn}}.$$

This procedure guarantees that the two noise sources are evaluated under matched noise severity while preserving the statistical structure of the real noise (non-Gaussianity, wavelength correlation, spectral drift, etc.).
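Under a power-ratio SNR definition, the scaling factor α admits a closed form, which the sketch below uses in place of the iterative adjustment described above (the closed form is our simplification, not the paper's procedure):

```python
import numpy as np

def match_noise_level(clean, real_noise, target_snr_db):
    """Scale real noise by alpha so that SNR(clean, alpha*noise)
    equals the target SNR of the synthetic-noise reference."""
    p_sig = np.mean(clean ** 2)
    p_noise = np.mean(real_noise ** 2)
    # Closed form: SNR_dB = 10*log10(p_sig / (alpha^2 * p_noise))
    alpha = np.sqrt(p_sig / (p_noise * 10 ** (target_snr_db / 10)))
    return clean + alpha * real_noise, alpha

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 6, 650))
noisy, alpha = match_noise_level(clean, rng.normal(size=650), 10.0)
achieved = 10 * np.log10(np.mean(clean ** 2) /
                         np.mean((noisy - clean) ** 2))
print(round(achieved, 2))
```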
From Table 8, we observe that, under a matched noisy SNR, the model generally achieves a higher denoised SNR and R2 under synthetic noise. In contrast, the performance under real noise is relatively low, which indicates that the mismatch between the synthetic noise distribution and real instrument noise makes denoising more challenging. Nevertheless, the model consistently improved the SNR under all real-noise settings, confirming that BiLSTM-FuseNet remains effective for practical acquisition noise and provides tangible engineering value in real deployment scenarios.
3.6. Model Sensitivity to Spectral Peaks and Characteristic Bands
Real instrument noise is inherently more complex than synthetic perturbations, and BiLSTM-FuseNet can restore meaningful spectral structures under such conditions. However, the denoising performance in spectroscopy cannot be fully assessed using global metrics alone, because RMSE, SNR, and R2 evaluate the overall fidelity but do not reveal whether the model preserves chemically informative features. In near-infrared spectra, functional information is concentrated in absorption peaks that correspond to molecular vibrations, whereas most non-peak intervals reflect the baseline drift or matrix background. A denoising model can achieve high global accuracy while unintentionally smoothing peak amplitudes, shifting peak positions, or degrading narrow absorption bands.
To examine whether BiLSTM-FuseNet captures chemically relevant regions rather than uniformly smoothing noise, we analyzed its spectral sensitivity from two complementary perspectives: (1) the relative attention allocation between peak and non-peak spectral regions and (2) the internal feature sampling behavior of the convolutional extraction branch.
3.6.1. Comparative Response Analysis Between Peak and Non-Peak Regions
The peak regions contain chemically informative spectral signatures, and the BiLSTM naturally exhibits an implicit attention-like mechanism that can be trained to process such features. We first trained a BiLSTM-FuseNet model on mixed-noise spectra with an SNR of 0.39 (σ = 0.5) without applying any weighting between the peak and non-peak regions. The peak-to-non-peak attention ratio was then evaluated for the 20 samples, as shown in Figure 6a. Under the same conditions, we further trained another BiLSTM-FuseNet model in which the loss of the peak regions was weighted three times higher than that of the non-peak regions. The peak-to-non-peak attention ratio was evaluated for the same 20 samples, and the results are shown in Figure 6b.
For both models, the attention ratio was slightly below 1, indicating that the model devoted marginally more attention to non-peak regions, although the difference was small. During denoising training, the supervision targets are clean spectra; therefore, the model focuses on reconstructing a globally consistent spectrum rather than prioritizing any specific peak band, as would be necessary in downstream prediction tasks. In pure denoising, the model emphasizes global spectral characteristics, and even when the peak regions are explicitly up-weighted, the learned attention ratio converges toward a uniform distribution.
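The two diagnostics used above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: we assume a boolean peak mask derived from the reference spectra, a weighted MSE with the stated 3× peak weight, and an attention ratio defined as mean attention inside peaks over mean attention outside them.

```python
import numpy as np

def peak_weighted_mse(pred, target, peak_mask, peak_weight=3.0):
    """Reconstruction loss where errors in peak regions count peak_weight
    times more than errors in non-peak regions (peak_mask is boolean)."""
    w = np.where(peak_mask, peak_weight, 1.0)
    return np.sum(w * (pred - target) ** 2) / np.sum(w)

def attention_ratio(attn, peak_mask):
    """Peak-to-non-peak attention ratio: values near 1 indicate
    near-uniform attention across the spectrum."""
    return attn[peak_mask].mean() / attn[~peak_mask].mean()
```

A perfectly uniform attention map yields a ratio of exactly 1, which is the behavior the denoising models converge toward in Figure 6.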
These results clearly demonstrate that without task-specific supervision, the model gravitates toward uniform attention, thereby confirming the necessity of downstream guidance to extract discriminative peak-related spectral information.
3.6.2. Analysis and Discussion of Spectral Feature Sampling
Although our model adopts a fixed downsampling stride of two, the convolutional layers with ReLU activation still provide an inherent feature selection mechanism. Convolution kernels produce high-magnitude responses that pass through the ReLU layer and are preserved, thereby retaining rich high-frequency details and effectively acting at a high sampling rate. In contrast, weak or negative responses are suppressed or set to zero by the ReLU, which corresponds to a lower sampling rate that filters out redundant information. Thus, the model can dynamically allocate information within the spectral features according to the local characteristics of the input signal.
As shown in Figure 7, we visualized the feature activation intensity of the local extraction branch (convolutional layers) and compared it with the averaged input spectrum. The results indicated that the feature sampling density was strongly correlated with the amplitudes of the spectral peaks. This suggests that the model automatically "focused" on peak-dense regions, thereby achieving an implicit form of adaptive spectral processing.
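The activation-intensity analysis can be sketched as below. This is a simplified one-layer surrogate, not the model's actual convolutional branch: the kernels here are placeholders, and the intensity is the ReLU response averaged over kernels at each wavelength position.

```python
import numpy as np

def relu_activation_intensity(spectrum, kernels):
    """Convolve a 1-D spectrum with each kernel, apply ReLU, and average
    the activation magnitude per wavelength position. ReLU zeroes weak or
    negative responses, so the result reflects where the layer effectively
    samples densely."""
    activations = []
    for k in kernels:
        response = np.convolve(spectrum, k, mode="same")
        activations.append(np.maximum(response, 0.0))  # ReLU
    return np.mean(activations, axis=0)
```

With smoothing-type kernels, the intensity tracks the spectral amplitude closely, reproducing the strong correlation between sampling density and peak amplitude observed in Figure 7.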
3.7. Comparison Experiments on Downstream Regression Tasks
The experiments outlined in Section 3.6 suggest that BiLSTM-FuseNet reconstructs spectra with an emphasis on global consistency rather than local peak features, which is consistent with its whole-spectrum reconstruction objective. However, internal behavior does not directly reflect practical utility. To assess whether the denoising gains translate into usable analytical performance, we also performed downstream regression experiments on tablet quality attributes.
Three tablet properties, namely weight, hardness, and active pharmaceutical ingredient (API) content, were predicted using Support Vector Regression (SVR). The inputs included the original noisy spectra, the BiLSTM-denoised spectra, the BiLSTM-FuseNet-denoised spectra, and the clean reference spectra, and the evaluation metrics were the RMSE, R2, and Pearson's correlation coefficient (R).
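The evaluation loop for one input variant can be sketched as follows. This is a minimal example using scikit-learn with assumed hyperparameters (RBF kernel, C = 10, standardized inputs); the paper does not specify its SVR settings, so treat these as placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_svr(X_train, y_train, X_test, y_test):
    """Fit SVR on spectra (noisy, denoised, or clean) and report the three
    metrics used in Table 9: RMSE, R2, and Pearson's R."""
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    r = np.corrcoef(y_test, pred)[0, 1]  # Pearson's correlation coefficient
    return rmse, r2, r
```

Running the same routine on each of the four input variants (noisy, BiLSTM-denoised, BiLSTM-FuseNet-denoised, clean) yields directly comparable rows of the results table.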
As shown in Table 9, the tablet weight prediction under mixed noise (σ = 0.25) performed poorly when SVR was trained on the original noisy spectra (R2 = 0.079; R = 0.291). When using the spectra denoised by the BiLSTM baseline, R2 improved to 0.209. BiLSTM-FuseNet further boosted the performance to R2 = 0.218, representing a statistically significant improvement over the noisy baseline (p < 0.01), and provided the closest prediction to the clean-spectra benchmark (R2 = 0.393). These results indicate that denoising effectively restores the spectral characteristics relevant to tablet mass estimation.
For the hardness prediction, the original noisy spectra yielded very weak predictive power (R2 = 0.029). BiLSTM-denoised inputs improved the performance to R2 = 0.135, whereas BiLSTM-FuseNet achieved R2 = 0.119, performing slightly below BiLSTM but still markedly above the noisy baseline (p < 0.05). The clean reference spectra achieved the best performance (R2 = 0.397), suggesting that hardness prediction is moderately robust to noise but remains sensitive to residual distortions. The gap between denoised and clean performance implies that hardness-related spectral variations likely reside in confined regions, where even subtle errors may be amplified.
Prediction of API content proved to be the most challenging. Using noisy spectra, the SVR exhibited R2 = 0.033 and R = 0.247, indicating almost no predictive utility. The denoising approaches produced small but statistically significant gains (R2 = 0.070 for BiLSTM; R2 = 0.074 for BiLSTM-FuseNet, both p < 0.05). Nonetheless, the performance remained well below that of the clean-spectra model (R2 = 0.585). This behavior implies that API estimation is sensitive to subtle, fine-grained spectral features, which are extremely susceptible to interference and difficult to recover without fully restoring peak-related information.
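The significance claims above (p < 0.01, p < 0.05) can be checked with a paired test on per-sample prediction errors. The paper does not name the test used, so this sketch assumes a paired t-test on squared errors; the function name and this choice of test are ours.

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_error_test(y_true, pred_noisy, pred_denoised):
    """Paired t-test on per-sample squared errors: a positive t statistic
    with small p indicates that denoised inputs yield significantly lower
    prediction error than noisy inputs on the same test samples."""
    err_noisy = (y_true - pred_noisy) ** 2
    err_denoised = (y_true - pred_denoised) ** 2
    t_stat, p_value = ttest_rel(err_noisy, err_denoised)
    return t_stat, p_value
```

Because the comparison is paired (same test samples under both input variants), this test is more sensitive than comparing aggregate RMSE values alone.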
Overall, these results confirm that spectral denoising substantially enhances downstream regression performance, particularly for global quantity indicators, such as tablet weight. Although improvements in hardness and API content are modest, denoised inputs consistently outperform their noisy counterparts, demonstrating the practical transferability and applied value of the proposed BiLSTM-FuseNet in real analytical workflows.
4. Conclusions
In this study, we propose a spectral denoising method based on artificial noise addition and weakly supervised learning, which enables high-quality spectral restoration without requiring real noise labels or paired clean/noisy measurements. By simulating different types and intensities of noise contamination, we constructed a unified self-supervised training framework in which an autoencoder structure learns the mapping from noise-contaminated spectra back to the original spectral profiles.
The experimental results indicate that the proposed approach is robust and generalizes well across different noise types and SNR levels. In particular, with mixed noise at σ = 0.5 (input SNR = 0.39), the proposed BiLSTM-FuseNet achieves an 88.9% reduction in reconstruction error compared to PCA (RMSE 0.033 vs. 0.299) and a slightly higher Pearson correlation than the vanilla BiLSTM baseline (+0.001). Moreover, the coefficient of determination increased from 0.411 to 0.430, reflecting a 4.6% improvement in predictive consistency.
Despite its excellent denoising performance, the proposed method has several limitations. First, BiLSTM-FuseNet relies on the numerical fidelity of the reference spectra; in downstream tablet hardness prediction, we observed performance degradation when the "clean" reference still contained weak noise or non-ideal artifacts. This indicates that an excessively faithful reconstruction of suboptimal reference signals may propagate undesirable spectral details to subsequent models. Second, the model implicitly captures global spectral consistency rather than emphasizing the characteristic peak bands, which limits interpretability and reduces discriminative ability when spectral peaks are subtle or sparsely distributed.
Future studies should explore stronger regularization strategies to alleviate these challenges. Contrastive learning can be incorporated to decouple key structural features from weak background signals, thereby reducing the reliance on ideal labels. Furthermore, hybrid attention or spectral prior-guided mechanisms may help bias the feature extraction towards chemically meaningful bands. Finally, extending this method to cross-instrument adaptation, real production spectra, and larger datasets will further validate its practical value in industrial spectral analytics.