Applied Sciences
  • Article
  • Open Access

14 November 2025

OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition

The Institute of Cyberspace Security, Zhejiang University of Technology, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Computing and Artificial Intelligence

Abstract

Automatic speech recognition (ASR) technology faces the dual challenges of model complexity and noise robustness when deployed on terminal devices (e.g., mobile devices, embedded systems). To meet the demand for lightweight and high-performance models on terminal devices, we propose a lightweight end-to-end speech recognition model, OS-Denseformer (Omni-Scale-Denseformer). The core of this model lies in its lightweight design and noise adaptability: multi-scale acoustic features are efficiently extracted through a multi-sampling structure to enhance noise robustness; the proposed OS-Conv module improves local feature extraction capability while significantly reducing the number of parameters, enhancing computational efficiency, and lowering model complexity; and the proposed normalization function, ExpNorm, normalizes the model output, facilitating more accurate parameter optimization during training. Finally, we employ distinct loss functions across different training stages, using Minimum Bayes Risk (MBR) joint optimization to determine the optimal weighting scheme that directly minimizes the character error rate (CER). Experimental results on public datasets such as AISHELL-1 demonstrate that, under a high-noise environment of −15 dB, the CER of OS-Denseformer is reduced by 9.95%, 7.97%, and 4.85% compared to the benchmark models Squeezeformer, Conformer, and Zipformer, respectively. Additionally, the model parameter count is reduced by 53.35%, 10.27%, and 27.66%, while the giga floating-point operations (GFLOPs) are decreased by 67.51%, 66.51%, and 13.82%, respectively. Deployment on resource-constrained mobile devices shows that, compared to Conformer, OS-Denseformer reduces memory usage by 10.79% and inference latency by 61.62%.

1. Introduction

With the rapid advancement of artificial intelligence technologies and the widespread adoption of Internet of Things (IoT) applications, ASR is now extensively deployed on a wide range of smart and embedded devices, including smartphones, wearables, smart home systems, and industrial IoT nodes [,,]. Consequently, the efficient deployment of high-performance ASR models on resource-constrained edge devices has become imperative. This capability has significant potential to enable more intelligent applications by providing greater ubiquity, enhancing real-time capabilities, and improving privacy protection [,,].
However, the practical deployment of ASR models on edge devices faces two primary challenges. First, there is a fundamental tension between model complexity and resource constraints. Mainstream high-performance ASR models are typically large and computationally intensive, requiring resources that far exceed the limited processing capacity, storage space, and power of edge devices. Second, environmental noise is a major obstacle. The operational environments of edge devices are complex and unpredictable; factors such as background noise, reverberation, and far-field conditions can severely degrade speech signal quality, leading to significantly reduced recognition accuracy and model robustness. Consequently, developing ASR models that are both lightweight and noise-robust is a critical research challenge at the intersection of speech processing and embedded artificial intelligence.
Thus, we introduce and validate an efficient, robust end-to-end ASR framework that advances current speech recognition paradigms. By synergistically integrating specialized components, our framework enables accurate multi-scale speech processing in noisy environments while maintaining computational efficiency on resource-constrained edge devices.
This paper proposes OS-Denseformer, a lightweight end-to-end automatic speech recognition model, to address the challenge of robust speech recognition on edge devices operating in complex acoustic environments. The model employs a cascaded architecture comprising three key components: dilated convolutional downsampling, anti-noise feature-encoding blocks, and upsampling. This design enables multi-scale acoustic representation fusion to mitigate interference from redundant information, enhances the discriminability of acoustic features in noisy speech, and significantly reduces computational complexity. Additionally, we designed an enhanced local feature extraction module, termed OS-Conv. This module integrates a residual OS block, a densely connected depthwise separable convolution, and an ExpNorm function to improve the network’s capability to filter out non-semantic fragments. The residual OS block enables efficient receptive field modeling, while the densely connected depthwise separable convolution structure promotes feature reusability and reduces the number of parameters. Furthermore, the ExpNorm function helps reduce computational overhead during training and improve convergence stability. We employ a phased training strategy with a loss function that leverages MBR to directly align the optimization objective with the final evaluation metric. To enhance generalization in specific noisy environments, we adopt a transfer learning strategy that involves pre-training on clean speech before fine-tuning on noisy speech.
Experimental results on public datasets—including AISHELL-1, QUT-NOISE, and Noisex92—demonstrate that OS-Denseformer achieves superior performance compared to mainstream ASR models in both noise robustness and lightweight characteristics. Furthermore, deployment tests demonstrate that OS-Denseformer maintains robust performance with a compact footprint across diverse mobile hardware platforms.

3. Proposed OS-Denseformer

3.1. OS-Denseformer Encoder

Figure 1 illustrates the overall architecture of the OS-Denseformer encoder. It uses a cascade of feature extraction modules designed to effectively extract multi-scale acoustic features from noisy speech and enhance the resulting representations. The pipeline comprises three main components: dilated convolutional downsampling, an anti-noise feature-encoding block, and upsampling. Each anti-noise feature-encoding block operates on the downsampled, low-sampling-rate signals, significantly reducing the computational load during processing. In the figure, each dashed box represents a distinct feature extraction module.
Figure 1. OS-Denseformer structure.
The encoder employs a cascade of six feature extraction modules to capture multi-scale speech features. The first module contains two stacked anti-noise feature-encoding blocks. The remaining five modules each follow a downsampling–encoding–upsampling pattern. They contain varying numbers of anti-noise feature-encoding blocks: 2, 3, 4, 3, and 2, respectively. Their embedding dimensions are 192, 256, 384, 512, 384, and 256, respectively, with the central module having the largest dimension. This architecture places more blocks in the higher-dimensional intermediate modules, forming a symmetric distribution. This design avoids the representation saturation problem inherent in single-sampling-rate architectures like the Conformer. Each feature extraction module uses a connector to fuse its output features back to its input features at a 50 Hz sampling rate. Finally, the sampling rate is reduced from 50 Hz down to 25 Hz by a convolutional downsampling layer, yielding the encoder’s final output.
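For concreteness, the module layout described above can be summarized as a small configuration table. The sketch below is illustrative rather than the released code: only the block counts and embedding dimensions come from the text, and the field names are hypothetical.

```python
# Illustrative layout of the six cascaded feature extraction modules.
# Only the block counts and embedding dimensions come from the text;
# the field names are hypothetical.
ENCODER_MODULES = [
    {"blocks": 2, "dim": 192, "resample": False},  # module 1: no down/upsampling
    {"blocks": 2, "dim": 256, "resample": True},   # modules 2-6 follow the
    {"blocks": 3, "dim": 384, "resample": True},   # downsampling-encoding-
    {"blocks": 4, "dim": 512, "resample": True},   # upsampling pattern,
    {"blocks": 3, "dim": 384, "resample": True},   # symmetric around the
    {"blocks": 2, "dim": 256, "resample": True},   # widest central module
]
assert sum(m["blocks"] for m in ENCODER_MODULES) == 16  # 16 encoding blocks in total
```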

3.2. Sampling Module and Connector

To improve the modeling of fused speech-noise features, we adapt an image-processing technique for sampling-rate transformation within the anti-noise feature-encoding block. In this analogy, the speech frame information is conceptualized as a grayscale image, where the single channel of the speech frame corresponds to that of a grayscale image, the FBank feature dimension (typically 80) defines the image height, and the variable number of speech frames (determined by sampling rate and utterance length) defines the image width.
The workflow of the sampling module is illustrated in Figure 2. We propose a downsampling module that integrates average pooling with dilated convolution. First, average pooling serves as an effective anti-aliasing filter by attenuating signal components above the target Nyquist frequency, thereby substantially reducing spectral aliasing in subsequent processing stages. The subsequent dilated convolution then performs feature extraction from the low-resolution representation while systematically expanding the receptive field. Crucially, unlike fixed averaging operations, the dilated convolution employs learnable kernel parameters that enable the model to adaptively discover optimal feature integration patterns directly from training data. The downsampled vectors then undergo high-level semantic feature modeling in the anti-noise feature-encoding block at this lower sampling rate.
Figure 2. Feature extraction module.
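For illustration, a minimal PyTorch sketch of this downsampling module is given below. It combines average pooling along the time axis with a dilated convolution; the 2× rate reduction, kernel size, and dilation are illustrative choices rather than the exact settings listed in Table 1.

```python
import torch
import torch.nn as nn


class DilatedConvDownsample(nn.Module):
    """Sketch of the downsampling module: average pooling acts as a simple
    anti-aliasing filter along the time axis, and a dilated convolution with
    learnable kernels integrates features at the lower rate."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        # Halve the frame rate along time (assumed 2x reduction).
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        pad = dilation * (kernel_size - 1) // 2
        # Learnable feature integration on the low-rate representation.
        self.dilated_conv = nn.Conv1d(channels, channels, kernel_size,
                                      padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames), channels = embedding dimension.
        return self.dilated_conv(self.pool(x))


if __name__ == "__main__":
    x = torch.randn(4, 256, 200)           # 4 utterances, 256-dim, 200 frames
    y = DilatedConvDownsample(256)(x)
    print(y.shape)                         # torch.Size([4, 256, 100])
```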
Finally, the sampling rate is restored to 50 Hz through an upsampling module to reestablish temporal alignment. We employ bilinear interpolation combined with CNN post-processing as the upsampling module. Bilinear interpolation inherently functions as a weak anti-imaging filter, attenuating a portion of the imaging frequencies. Subsequently, a CNN-based residual learning strategy is applied to eliminate artifacts such as residual imaging frequencies from the interpolation and recover high-frequency details.
As shown in Figure 3, the X and Y axes represent positions in the FBank, while the P axis denotes the spectral energy value. The spectral energy value at the interpolation point is calculated from the known values at four coordinates: (x_1, y_1), (x_1, y_2), (x_2, y_1), and (x_2, y_2). First, interpolation is performed along the x-direction, between p_11 and p_21 and between p_12 and p_22:
p_1 = \frac{x_2 - x}{x_2 - x_1} p_{11} + \frac{x - x_1}{x_2 - x_1} p_{21},
p_2 = \frac{x_2 - x}{x_2 - x_1} p_{12} + \frac{x - x_1}{x_2 - x_1} p_{22},
where p_1 and p_2 denote the interpolated spectral energy values at points (x, y_1) and (x, y_2), respectively, as obtained from the initial x-direction interpolation. Subsequently, the final interpolated value p is calculated by performing interpolation along the y-direction between p_1 and p_2, as given by the following expression:
p = \frac{y_2 - y}{y_2 - y_1} p_1 + \frac{y - y_1}{y_2 - y_1} p_2,
where p represents the final spectral energy value at the target interpolation point (x, y). This bilinear interpolation process within the FBank domain upsamples the feature representation from a lower sampling rate back to 50 Hz, thereby preserving acoustic information that might be lost after aggressive downsampling. Finally, a CNN is used for post-processing to mitigate interpolation artifacts and restore high-frequency details in the upsampled features.
Figure 3. Bilinear interpolation for FBank.
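A corresponding sketch of the upsampling module is shown below, assuming a 2× rate restoration: bilinear interpolation stretches the time axis of the FBank-like feature map, and a small CNN predicts a residual correction that suppresses imaging artifacts. The CNN depth and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BilinearUpsample(nn.Module):
    """Sketch of the upsampling module: bilinear interpolation on the
    (feature x time) plane restores the frame rate, then a small CNN learns
    a residual correction for imaging artifacts."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.post = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, frames). Treat it as a single-channel image of
        # height feat_dim and width frames, and stretch only the time axis.
        up = F.interpolate(x.unsqueeze(1), scale_factor=(1, self.scale),
                           mode="bilinear", align_corners=False).squeeze(1)
        # Residual learning: the CNN predicts only a correction term.
        return up + self.post(up)
```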
As shown in Figure 1, the proposed OS-Denseformer incorporates a learnable connector to fuse features. This operation is formulated as follows:
z = c \odot x + (1 - c) \odot y,
where c is a learnable parameter, x is the feature embedding after processing, y is the original feature embedding, and ⊙ denotes element-wise multiplication. This connective structure adaptively adjusts the parameter c during training to control the relative contribution of the transformed features x and the original features y, acting as a learned gating mechanism.
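A minimal sketch of this connector is given below; whether c is a scalar or a per-channel vector is not specified in the text, so a single scalar gate is assumed here.

```python
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Learned gate between transformed features x and original features y:
    z = c * x + (1 - c) * y. A scalar c is an assumption."""

    def __init__(self):
        super().__init__()
        self.c = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.c * x + (1.0 - self.c) * y
```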

3.3. Anti-Noise Feature-Encoding Block

To capture multi-scale representations, the anti-noise feature-encoding block processes feature vectors from different sampling rates. As shown in Figure 4, each encoding block comprises a 4-head self-attention module, two two-layer feed-forward networks (FFNs), an OS-Conv module, and the ExpNorm normalization function. The self-attention mechanism projects the 512-dimensional input into four parallel heads of 128 dimensions each. The FFN uses an inner dimension of 2048 with sigmoid activation. These identical blocks (N = 16) are stacked to form the complete encoder.
Figure 4. Anti-noise feature-encoding block structure.
The feed-forward modules apply independent nonlinear transformations to each position in the feature sequence, enhancing the model’s capacity to learn complex patterns. The multi-head self-attention module employs a self-attention mechanism with relative positional encoding to capture global dependencies within the input. The OS-Conv module handles local feature extraction through its Omni-scale block (OS block) [] and convolutional components. Finally, the ExpNorm function normalizes the outputs to stabilize training. Given an input feature vector f i to the i-th encoding block, the output r i is computed as follows:
\tilde{f}_i = f_i + \tfrac{1}{2}\,\mathrm{FFN}(f_i),
f_i' = \tilde{f}_i + \mathrm{MHSA}(\tilde{f}_i),
f_i'' = f_i' + \mathrm{OSConv}(f_i'),
r_i = \mathrm{ExpNorm}\!\left(f_i'' + \tfrac{1}{2}\,\mathrm{FFN}(f_i'')\right),
where FFN(·) is the feed-forward module, MHSA(·) is the multi-head self-attention module, OSConv(·) is the OS-Conv module, and ExpNorm(·) is the ExpNorm normalization function.
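The forward pass above can be sketched as follows. This is an illustrative skeleton only: a plain convolution stands in for the OS-Conv module and LayerNorm for ExpNorm (both are detailed later in this section), and PyTorch's standard MultiheadAttention replaces the relative-position variant used in the paper.

```python
import torch
import torch.nn as nn


class AntiNoiseBlock(nn.Module):
    """Skeleton of one anti-noise feature-encoding block: half-step FFN ->
    MHSA -> (OS-)Conv -> half-step FFN -> normalization, each residual."""

    def __init__(self, dim: int = 512, heads: int = 4, ffn_dim: int = 2048):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(dim, ffn_dim), nn.Sigmoid(),
                                  nn.Linear(ffn_dim, dim))
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=15, padding=7)  # stand-in for OS-Conv
        self.ffn2 = nn.Sequential(nn.Linear(dim, ffn_dim), nn.Sigmoid(),
                                  nn.Linear(ffn_dim, dim))
        self.norm = nn.LayerNorm(dim)  # stand-in for ExpNorm (Section 3.3)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, frames, dim)
        f = f + 0.5 * self.ffn1(f)
        f = f + self.mhsa(f, f, f, need_weights=False)[0]
        f = f + self.conv(f.transpose(1, 2)).transpose(1, 2)
        return self.norm(f + 0.5 * self.ffn2(f))
```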
In noisy environments, the superposition of speech and noise often causes acoustic models to misclassify non-speech segments as speech. This introduces non-semantic information as valid features, degrading recognition accuracy. To address this, our OS-Denseformer incorporates an OS-Conv module to enhance local feature extraction, improving upon Conformer architectures that rely on a single convolutional module. As shown in Figure 5, this enhanced OS-Conv module consists of three OS blocks with residual connections and five densely connected depthwise separable convolution modules, collectively enhancing the network’s capacity to model complex acoustic features. When processing features from diverse noise types, the network adaptively approximates the optimal receptive field. The five depthwise separable convolution modules (kernel size = 15) enable deep abstract feature extraction. This design mitigates gradient vanishing and explosion problems caused by increasing network depth and improves feature reuse via maximized feature interactions [].
Figure 5. OS-Conv module.
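For illustration, the densely connected depthwise separable convolution stack described above can be sketched as below. The OS blocks that precede it are omitted, and the 1×1 projections used to keep the channel width fixed across the dense concatenations are an assumption.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) convolution
    followed by a 1x1 pointwise convolution."""

    def __init__(self, channels: int, kernel_size: int = 15):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class DenselyConnectedDSConv(nn.Module):
    """Sketch of five densely connected depthwise separable convolutions:
    each layer sees the concatenation of all earlier outputs, projected back
    to the working width (the 1x1 projection is an assumption)."""

    def __init__(self, channels: int, num_layers: int = 5):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv1d(channels * (i + 1), channels, 1) for i in range(num_layers))
        self.layers = nn.ModuleList(
            DepthwiseSeparableConv(channels) for _ in range(num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for proj, layer in zip(self.projs, self.layers):
            out = layer(proj(torch.cat(feats, dim=1)))  # reuse all earlier features
            feats.append(out)
        return feats[-1]
```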
For the normalization of the anti-noise feature-encoding block's output, we introduce the ExpNorm function, motivated by RMSNorm, as a replacement for the standard Conformer's LayerNorm; it reduces computational overhead and improves training stability. The mathematical definition of ExpNorm is as follows:
\mathrm{ExpNorm}(x) = \frac{x}{\mathrm{RMS}(x) + \epsilon} \times e^{\gamma},
where x is the input, γ is a scaling parameter, RMS(·) is the root mean square operator, and ϵ serves as a numerical stabilizer to prevent division-by-zero errors. The key differences from LayerNorm are that ExpNorm uses the root mean square instead of the mean and variance, and it employs a single scalar γ rather than a per-channel vector, significantly reducing computational overhead. Furthermore, we observed that during early training stages, the optimizer tends to drive γ toward zero to mitigate the negative impact of untrained modules. This causes γ to oscillate near zero, resulting in frequent gradient sign changes. Such inconsistent gradient directions can trap the module in poor local optima that are difficult to escape, thereby impeding training convergence. In ExpNorm, we employ an exponential parameterization for γ, which ensures the scaling parameter remains consistently positive, thus effectively circumventing the aforementioned issues.
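A direct implementation of this definition is straightforward; the sketch below assumes normalization over the last (feature) dimension and initializes γ to zero (i.e., an initial scale of one), which is an assumption rather than the paper's setting.

```python
import torch
import torch.nn as nn


class ExpNorm(nn.Module):
    """ExpNorm as defined above: RMS normalization with a single scalar scale
    parameterized as exp(gamma), keeping the effective scale strictly positive."""

    def __init__(self, eps: float = 1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # exp(0) = 1 at initialization
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the feature dimension (last axis).
        rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
        return x / (rms + self.eps) * torch.exp(self.gamma)
```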

3.4. Loss Function

To overcome the limitations of conventional loss functions in noisy acoustic environments, we introduce a loss function that directly aligns model optimization with the CER, the primary evaluation metric. This alignment is achieved through a joint optimization framework based on MBR [].
The training process is divided into two stages. First, a conventional hybrid loss L Stage 1 is used for initial training, allowing the model to learn initial acoustic–linguistic mappings. This stage provides a stable starting point for subsequent MBR training, mitigating potential instability from direct MBR training from a random initialization. The Stage 1 loss function is defined as follows:
L_{\mathrm{Stage1}} = \lambda L_{\mathrm{CTC}} + (1 - \lambda) L_{\mathrm{AED}},
where L_Stage1 denotes the conventional hybrid loss, comprising the connectionist temporal classification (CTC) loss L_CTC, the attention-based encoder–decoder (AED) loss L_AED, and a balancing hyperparameter λ ∈ [0, 1].
Given that CTC’s alignment is vital for early training stability and the Attention mechanism’s semantic modeling dominates later, we employ a dynamic scheduling strategy for λ , defined as follows:
\lambda = \max\!\left(\lambda_{\min},\; \lambda_{\mathrm{init}} \cdot \left(1 - \frac{e}{E_{\mathrm{total}}}\right)\right),
where λ_min and λ_init denote the minimum and initial CTC weight coefficients, e is the current training epoch, and E_total is the total number of epochs.
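This schedule is simple to implement; the sketch below uses placeholder initial and minimum values rather than the settings in Table 1.

```python
def ctc_weight(epoch: int, total_epochs: int,
               lam_init: float = 0.5, lam_min: float = 0.1) -> float:
    """Dynamic CTC weight: lambda = max(lam_min, lam_init * (1 - epoch / total_epochs)).
    The default lam_init and lam_min values here are illustrative placeholders."""
    return max(lam_min, lam_init * (1.0 - epoch / total_epochs))
```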
Stage 2 replaces the AED loss with MBR optimization. For each training sample (x, y_ref), the warm-started decoder generates N candidate sequences {y_i}_{i=1}^{N} through stochastic sampling of the output distribution. The risk R(y_i, y_ref) for each candidate y_i is defined as its CER with respect to the reference y_ref. The expected risk is approximated by a finite-sample weighted average over the N candidates:
L_{\mathrm{MBR}} \approx \sum_{i=1}^{N} \frac{P(y_i \mid x)}{\sum_{j=1}^{N} P(y_j \mid x)} \, \mathrm{CER}(y_i, y_{\mathrm{ref}}),
where P(y_i | x) is the probability of candidate sequence y_i computed from input x by the AED decoder.
Because CER(y_i, y_ref) is a non-differentiable function, we employ the REINFORCE algorithm to estimate the gradient of the MBR loss L_MBR. The gradient estimation procedure is detailed below:
\nabla_{\theta} L_{\mathrm{MBR}} \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \mathrm{CER}(y_i, y_{\mathrm{ref}}) - b \right] \nabla_{\theta} \log P(y_i \mid x),
where b is the baseline used to reduce the variance of the gradient estimate. It is calculated as the average risk over all candidate sequences for all utterances in the current mini-batch, a common practice termed standard batch-average risk.
The final Stage 2 loss is a combination of the CTC and MBR losses:
L_{\mathrm{Stage2}} = \lambda L_{\mathrm{CTC}} + (1 - \lambda) L_{\mathrm{MBR}},
where the CTC weight coefficient λ takes the same value as defined in Stage 1.
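To make the Stage 2 objective concrete, the sketch below computes the normalized-weight MBR risk for one utterance together with a REINFORCE surrogate whose gradient matches the estimator above; candidate sampling and CER computation are assumed to happen elsewhere, and this is an illustration rather than the training code.

```python
import torch


def mbr_loss_and_surrogate(log_probs: torch.Tensor, cers: torch.Tensor):
    """Sketch of the Stage-2 MBR objective for a single utterance.

    log_probs: (N,) decoder log-probabilities log P(y_i | x) of the N sampled
               candidates (assumed to carry gradients).
    cers:      (N,) character error rates CER(y_i, y_ref), non-differentiable.
    """
    # Normalized candidate weights P(y_i|x) / sum_j P(y_j|x).
    weights = torch.softmax(log_probs, dim=0)
    expected_risk = (weights.detach() * cers).sum()     # value of L_MBR for logging
    baseline = cers.mean()                              # variance-reduction baseline b
    # REINFORCE surrogate: d/d(theta) mean((CER - b) * log P) matches the estimator.
    surrogate = ((cers - baseline).detach() * log_probs).mean()
    return expected_risk, surrogate
```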

4. Experimental Results and Analysis

4.1. Experimental Data and Preprocessing

Clean speech data were sourced from the AISHELL-1 Mandarin speech database [], with noise samples drawn from the QUT-NOISE [] and Noisex92 [] datasets. The QUT-NOISE dataset encompasses three primary acoustic scenes—domestic, office, and public and urban environments—which are further categorized into nine distinct noise types. The NOISEX-92 dataset includes eight noise categories, such as white noise, HF channel noise, and factory workshop noise. We selected an equal proportion of samples from each noise category to ensure balanced representation and prevent bias toward any specific noise type during training.
The noise-augmented training set was constructed by mixing clean speech with environmental noise at prescribed signal-to-noise ratios (SNRs), as outlined in Figure 6. The procedure was as follows: First, the clean speech training set was divided equally into seven subsets. Next, we randomly extracted noise segments matching the duration of the clean speech, maintaining the equal sampling proportions across noise categories noted above. These segments were then mixed with the corresponding clean speech subsets at SNRs of −15 dB, −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, and 15 dB. Finally, all noise-mixed samples across the different SNR conditions were combined to form the noise-augmented training dataset.
Figure 6. Synthesis method for noise-augmented training datasets.
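The SNR-controlled mixing step can be illustrated with the following sketch, which scales a noise segment so that the clean-to-noise power ratio matches the target SNR. It is a generic example rather than the released data-preparation script.

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean utterance with an equal-length noise segment at a target SNR (dB)."""
    assert clean.shape == noise.shape
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```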
The model was initially pre-trained for 150 epochs on the clean AISHELL-1 dataset to establish a robust baseline. Subsequently, it was fine-tuned for 50 epochs on noisy speech synthesized from AISHELL-1 and QUT-NOISE. Robustness was evaluated on two distinct noisy test sets: AISHELL-1 with QUT-NOISE noise and AISHELL-1 with Noisex92 noise. This yields two evaluation scenarios: the first tests performance on matched noise (QUT-NOISE), while the second assesses generalization to unmatched noise (Noisex92).
We emphasize that all mixed datasets were constructed by strictly following the original official data partitions of each source dataset. Crucially, speakers appearing in the official AISHELL-1 test set were excluded from both training and validation partitions. Additionally, noise segments from QUT-NOISE and Noisex92 were rigorously separated to ensure completely disjoint sets for training, validation, and testing.

4.2. Experimental Environment and Parameter Settings

All experiments were run on a server equipped with a 64-bit Ubuntu OS, an Intel Xeon Platinum 8352V CPU, an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM), and 32 GB of DDR4 RAM. The models were developed in Python 3.10 with the PyTorch 2.9.0 framework. Comprehensive hyperparameter settings are listed in Table 1. To improve generalization, we employed a parameter-averaging strategy, where the parameters from the top-10 performing epochs were averaged to produce the final model.
Table 1. Hyperparameter settings.
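The parameter-averaging step can be sketched as follows, assuming each of the top-10 checkpoints is stored as a plain state_dict of floating-point tensors; the file layout is an assumption.

```python
import torch


def average_checkpoints(paths):
    """Average model parameters from several checkpoints (e.g., the top-10
    epochs by validation score) into a single state_dict."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}
```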

4.3. Experimental Verification and Analysis

To evaluate the proposed OS-Denseformer, we conducted a comprehensive set of experiments: (1) a performance comparison with mainstream ASR models; (2) an analysis of noisy-data fine-tuning on robustness; and (3) ablation studies on key architectural components.

4.3.1. Comparison and Analysis with Mainstream Models

We evaluated OS-Denseformer by comparing it against state-of-the-art speech recognition models; results are presented in Table 2. Models were categorized by size into three groups: L (large), M (medium), and S (small).
Table 2. Performance of different models on clean and noisy speech datasets.
Table 2 presents two principal findings. (1) Model Scale Enhances Robustness: Results indicate a clear trend across the models above: noise robustness improves with scale. The Conformer illustrates this, with CER on matched noise decreasing from 13.79% (S, 10.3 M) and 11.42% (M, 32.6 M) to 10.83% (L, 125.6 M), and on unmatched noise from 13.98% and 11.66% to 11.32%. This comes at a high computational cost, with GFLOPs rising from 30.5 (S) and 79.0 (M) to 294.1 (L). (2) OS-Denseformer’s Performance Superiority: OS-Denseformer demonstrates compelling performance advantages. It achieves a lower CER than Transformer, Conformer-L, Squeezeformer-L, and Zipformer-L while reducing parameters by 147.9 M, 12.9 M, 128.9 M, and 43.1 M and GFLOPs by 544.2, 206.6, 215.7, and 26.8, respectively. This validates OS-Denseformer’s dual strengths in efficiency and accuracy for noisy speech recognition.
While Speech-Mamba achieves competitive performance on clean speech, its CER degrades substantially more than all other models under noisy conditions. This indicates that the state-space modeling approach currently lacks the robustness of optimized CNN-Transformer hybrid architectures when processing complex, non-stationary environmental noise.
As shown in Figure A3 (Appendix A), the scatter plot of GFLOPs versus CER across model architectures illustrates that OS-Denseformer occupies a position on the Pareto frontier within the lower-left quadrant—demonstrating its optimal balance between computational efficiency and recognition accuracy.
Figure 7 presents the CER performance of various models under different SNR conditions. A performance comparison was conducted between OS-Denseformer and the large-scale Conformer, Squeezeformer, and Zipformer. As expected, a consistent trend in decreasing CER with increasing SNR is observed for all models. OS-Denseformer consistently achieves the lowest CER across the SNR range, with its superiority being most pronounced under low-SNR conditions. To quantify the performance advantage across the −15 dB to 15 dB SNR range, we computed the Area Under the Curve (AUC) for each model. OS-Denseformer achieved the lowest AUC of 407.0, reflecting marked improvements of 22.5%, 19.2%, and 12.4% over the Conformer-L (525.4), Squeezeformer-L (503.6), and Zipformer-L (464.5), respectively.
Figure 7. CER of models under varying SNR.
The performance advantage of OS-Denseformer grows as the SNR declines, highlighting its superior noise robustness. At −15 dB, the CER improvements reach 9.95%, 7.97%, and 4.85% relative to Squeezeformer-L, Conformer-L, and Zipformer-L, respectively. These findings underscore the model’s exceptional robustness in low-SNR conditions.

4.3.2. Model Fine-Tuning Results and Analysis

We assessed the effect of noise-augmented fine-tuning using the following four experimental conditions: (1) Baseline: The original pre-trained model (no fine-tuning); (2) Clean FT: Fine-tuning on clean speech; (3) Unmatched Noise FT: Fine-tuning on Noisex92; (4) Matched Noise FT: Fine-tuning on QUT-NOISE.
We assessed all four model variants on a test set constructed by mixing the AISHELL-1 dataset with QUT-NOISE noise.
As shown in Figure 8, the CER results across different SNR conditions reveal a clear performance hierarchy: the pre-trained model without fine-tuning showed the highest CER, followed by the model fine-tuned only on clean speech. The model fine-tuned with noise-augmented data achieved better performance in unmatched noise conditions and the best performance in matched noise conditions. The comparative analysis revealed that fine-tuning with noise-augmented datasets most effectively improves speech recognition accuracy under strong noise interference. Specifically, when fine-tuned on matched noise data, the model achieved CER reductions of 1.37%, 2.89%, and 4.1% at −5 dB, −10 dB, and −15 dB SNR, respectively, compared to the model fine-tuned on clean speech.
Figure 8. Impact of model fine-tuning on anti-noise performance.
The results further confirm that OS-Denseformer maintains remarkably stable performance on unmatched noise types, with no significant degradation. Compared to matched noise conditions, the CER on unseen noise increased by only 0.59%, 1.17%, and 2.1% at −5 dB, −10 dB, and −15 dB, respectively. Despite these minor increments, the model’s CER remained substantially lower than both the baseline and the model fine-tuned only on clean speech. These findings confirm that noise-augmented fine-tuning—especially with matched noise—effectively enhances the robustness of OS-Denseformer in challenging acoustic environments.

4.3.3. Ablation Experiments and Analysis

An ablation study was conducted to assess the individual contributions of key modules in the OS-Denseformer architecture. Following a controlled-variable methodology, only one target component was altered per experiment while maintaining all other elements fixed. The corresponding results are summarized in Table 3.
Table 3. Performance of different modified modules.
Ablation results in Table 3 delineate each component’s contribution:
Multi-sampling Architecture: This design reduces CER versus a single downsampling baseline, while concurrently reducing parameters by 24.2% (from 148.6 M to 112.7 M) and GFLOPs substantially, affirming its efficient acoustic feature extraction.
Downsampling and Upsampling Module: Replacing the proposed downsampling and upsampling modules with simple frame averaging and duplication directly increases CER. This finding provides empirical validation that our sampling architecture successfully mitigates the performance degradation typically induced by spectral distortion during acoustic signal processing. As supplementary material, Table A1 in the Appendix A quantifies the spectral distortion introduced by different sampling modules before and after rate transformation.
OS-Conv Module: Ablating the OS block raised CER, confirming its critical role in noisy speech modeling.
Normalization Function: Replacing ExpNorm with LayerNorm raised computational cost, validating ExpNorm’s efficiency advantage for inference. Additionally, we conducted comparative evaluations of the training convergence behavior and CER performance for models employing ExpNorm versus RMSNorm. The corresponding results are presented in Figure A1 and Figure A2 in the Appendix A.
MBR Loss Function: Substituting the two-stage MBR loss with the conventional CTC + AED loss function results in a noticeable increase in CER. This result further validates that our two-stage MBR framework directly minimizes the expected edit distance between reference texts and N-best hypothesis sets. By evaluating the relative quality among hypotheses rather than relying on absolute probability estimates, MBR effectively guides the model toward noise-robust representations when acoustic features are corrupted. The online decoding process generates diverse N-best candidates, enabling information integration across multiple hypotheses. Through this ensemble-like approach, the model captures essential speech patterns even when some hypotheses contain noise artifacts, thereby reducing the detrimental effects of noise on individual predictions. Additionally, we conducted a sensitivity analysis on the hyperparameter N, with the results presented in Table A2 in the Appendix A.
To quantify temporal redundancy in the multi-sampling architecture, we use cosine similarity as a metric. The similarity was measured between outputs of adjacent encoding blocks, with higher values denoting increased feature redundancy. Figure 9 shows that the multi-sampling architecture maintains substantially lower cosine similarity compared to the single downsampling structure, particularly in deeper layers. This effect stems from the architecture’s inherent multi-scale design, which promotes diverse feature learning across sampling rates and reduces redundancy. Consequently, the design mitigates temporal redundancy, sharpens salient feature extraction, and reduces inference cost.
Figure 9. CosSim of feature embeddings across model architectures.
A key observation from Figure 9 is the increase in temporal redundancy when processing noisy versus clean speech. The multi-sampling architecture, however, generates feature representations from noisy inputs that are more aligned with those from clean speech. This suggests that the multi-scale mechanism better captures the underlying speech structure amidst noise. Consequently, it is less prone to incorporating non-speech noise into features, leading to improved robustness.
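The redundancy metric used above can be computed as in the following sketch, which averages the cosine similarity between the flattened outputs of adjacent encoding blocks; it assumes the two outputs being compared share the same shape (i.e., blocks within one module).

```python
import torch
import torch.nn.functional as F


def adjacent_block_similarity(block_outputs):
    """Mean cosine similarity between outputs of adjacent encoding blocks.

    block_outputs: list of tensors of shape (batch, frames, dim), all with the
    same shape. Higher values indicate more redundant features."""
    sims = []
    for a, b in zip(block_outputs[:-1], block_outputs[1:]):
        sims.append(F.cosine_similarity(a.flatten(1), b.flatten(1), dim=1).mean())
    return torch.stack(sims)
```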

5. Mobile Deployment and Performance Analysis

The resource constraints and power sensitivity of mobile devices, which differ fundamentally from server environments, necessitate thorough system-level evaluation. We deployed OS-Denseformer on smartphones and evaluated its performance using four key metrics: (1) Loading Time: The duration from loading the model into memory until it is ready for inference, affecting cold-start performance; (2) Memory Consumption: Peak memory consumption during operation; (3) Time Delay per Word: The latency for streaming inference to generate text from input audio; (4) Power Consumption: Energy used during model execution.
Our mobile end-to-end speech recognition system implements a complete low-latency pipeline that converts continuous audio signals into text output. A lightweight GRU-based Voice Activity Detection (VAD) module processes the raw 16 kHz single-channel audio stream in real-time. When speech onset is detected, the VAD initiates processing and enables the downstream pipeline. Detected speech segments are forwarded to the feature extraction module to produce streaming log-Mel spectrograms. These features are fed to the acoustic model for streaming encoding, and finally, an on-device decoder incorporating a quantized MobileBERT generates recognized text using beam search.
The PyTorch-trained model was first converted to the ONNX format using ONNX Runtime Mobile. To accommodate mobile resource constraints, we applied static quantization, converting FP32 weights to INT8. The impact of quantization on storage, inference latency, and accuracy was then tested for both OS-Denseformer and Conformer on a Xiaomi 10 smartphone; the results are presented in Table 4. Within the table cells, Inference Latency and Memory Consumption are reported in the “Median/Mean/P95” format.
Table 4. Comparison of model performance indicators.
Experimental results confirm that quantization reduces storage requirements by approximately 75% while significantly accelerating inference. In comparative evaluations, OS-Denseformer demonstrates significantly lower memory usage and achieves a reduced CER while cutting inference latency by more than half compared to Conformer.
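The export-and-quantization pipeline described above can be sketched with ONNX Runtime's post-training static quantization API, as below. The input name, dummy shapes, file names, and the INT8 choice for both weights and activations are illustrative assumptions, not the exact deployment configuration.

```python
import torch
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static


class FbankCalibrationReader(CalibrationDataReader):
    """Feeds representative FBank features (numpy arrays) for calibration.
    The ONNX input name 'fbank' is an assumption."""

    def __init__(self, samples):
        self._iter = iter([{"fbank": s} for s in samples])

    def get_next(self):
        return next(self._iter, None)


def export_and_quantize(model: torch.nn.Module, calibration_samples,
                        fp32_path="model_fp32.onnx", int8_path="model_int8.onnx"):
    """Export a trained model to ONNX, then apply static post-training INT8
    quantization with ONNX Runtime. Shapes and paths are illustrative."""
    dummy = torch.randn(1, 80, 200)  # (batch, FBank bins, frames)
    torch.onnx.export(model, dummy, fp32_path,
                      input_names=["fbank"], output_names=["enc_out"],
                      dynamic_axes={"fbank": {2: "frames"}})
    quantize_static(fp32_path, int8_path,
                    FbankCalibrationReader(calibration_samples),
                    activation_type=QuantType.QInt8,
                    weight_type=QuantType.QInt8)
```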
The cross-platform compatibility of the OS-Denseformer system was evaluated on three smartphones: Xiaomi 10, Xiaomi 12S, and Vivo X100 Pro. The corresponding performance benchmarks are provided in Table 5. OS-Denseformer maintains stable and efficient performance across diverse mobile platforms, without introducing CER spikes or excessive power drain due to variations in hardware and software configurations.
Table 5. Smartphone performance comparison.
We evaluated quantization robustness in four common noisy environments: server rooms, busy streets, subway platforms, and riverside areas. Table 6 compares CER of OS-Denseformer and Conformer under both FP32 and INT8 precision.
Table 6. Quantization performance comparison in real-world noisy environments.
Our experimental results reveal three principal findings: (1) OS-Denseformer consistently surpasses Conformer across all tested environments under both FP32 and INT8 precision, achieving superior CER performance. (2) The quantization-induced CER increase is substantially more pronounced for Conformer (+1.37% to +2.67%) than for OS-Denseformer (+1.07% to +1.88%), confirming superior quantization robustness. (3) OS-Denseformer demonstrates reduced variance across noise conditions (e.g., 15.27 ± 0.92% vs. 16.05 ± 0.98% in subway environments), indicating more stable performance.
These results demonstrate OS-Denseformer’s superior robustness in maintaining accuracy under both quantization constraints and diverse acoustic conditions, underscoring its strong suitability for deployment on resource-constrained devices where computational efficiency and environmental adaptability are paramount.
We have validated the model’s compliance with core mobile deployment requirements—accuracy, latency, and memory footprint. However, thermal stability under prolonged usage and its potential impact on sustained performance remain critical factors requiring thorough empirical evaluation in real-world product environments to ensure product-ready robustness.

6. Conclusions and Future Work

This paper introduces OS-Denseformer, a lightweight and noise-robust end-to-end model for ASR designed to tackle the persistent challenges of computational complexity and performance degradation in noisy environments. Its multi-sampling architecture achieves a favorable efficiency–robustness trade-off by hierarchically compressing speech signals into multi-scale representations, thereby reducing computational demand. The proposed OS-Conv module further enhances local feature extraction and promotes parameter efficiency through a feature reuse mechanism. The proposed phased MBR loss function directly integrates the CER evaluation metric into the training process, thereby enhancing parameter optimization. Comprehensive evaluations demonstrate that OS-Denseformer achieves superior recognition accuracy over leading models, including Transformer, Conformer, and Squeezeformer, with significantly lower computational and memory footprints in noisy settings.
While this study focuses on model-domain lightweight noise robustness, future work will enhance edge adaptability through: (1) developing a runtime model selection strategy guided by real-time acoustic environment profiling (noise type and SNR); (2) implementing a dynamic sub-module orchestration framework that optimally balances recognition accuracy with resource constraints. In addition, we will (3) explore synergistic integration with lightweight front-end speech enhancement modules to further improve performance in challenging acoustic conditions; and (4) extend the evaluation to multilingual scenarios, particularly English, to validate the generalizability of the proposed approach.

Author Contributions

S.Q. was primarily responsible for manuscript writing, with the experimental design and numerical simulation conducted under the guidance of L.Q. and Q.W.; data collection and manuscript refinement were handled by S.Q. and M.L., while L.Q. and M.L. provided key assistance in data analysis and conducted a comprehensive analysis of the results. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the National Natural Science Foundation of China (NSFC), Grant Numbers 62122069 and 620714314.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available: the AISHELL-1 dataset is accessible at https://www.aishelltech.com/kysjcp (accessed on 1 November 2025), the QUT-NOISE dataset can be obtained from https://research.qut.edu.au/saivt/databases/qut-noise-databases-and-protocols/ (accessed on 1 November 2025), and the Noisex92 dataset is available at http://spib.linse.ufsc.br/noise.html (accessed on 1 November 2025). The data augmentation scripts are publicly available at https://github.com/queshiqi/OS-Denseformer (accessed on 1 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. ExpNorm

Figure A1. A comparative analysis of the training convergence behavior between models employing ExpNorm and RMSNorm as normalization functions.
Figure A2. CER performance of models utilizing ExpNorm and RMSNorm.

Appendix A.2. Sampling Module

Table A1. Comparative analysis of post-sampling spectral distortion.

Method                           LSD (dB)      Spectral Energy Error (MSE)   High-Freq Preservation (%)   Low-Freq Preservation (%)
Baseline Methods
   Down: Frame averaging         2.85 ± 0.23   0.156 ± 0.021                 72.3 ± 4.1                   88.7 ± 3.2
   Up: Frame repetition          3.12 ± 0.31   0.182 ± 0.028                 68.9 ± 5.2                   85.4 ± 4.1
Proposed Methods
   Down: Avg pool + dilated conv 1.92 ± 0.15   0.087 ± 0.012                 86.5 ± 3.3                   94.2 ± 2.1
   Up: Bilinear + CNN post-proc  1.78 ± 0.14   0.074 ± 0.009                 89.1 ± 2.8                   96.3 ± 1.8

Appendix A.3. MBR Loss Function

Table A2. Analysis of the Impact of MBR Candidate Number N on Development Set CER.

Candidate Number N   Dev Set CER (%)   Relative Change   Training Time (Hours)
5                    8.92              –                 4.6
10                   8.45              ↓ 0.47            5.8
15                   8.23              ↓ 0.22            6.2
20                   8.21              ↓ 0.02            8.3
30                   8.20              ↓ 0.01            10.1
The results show that while increasing the number of candidate sequences N initially leads to a substantial reduction in CER, the performance gains diminish significantly once N exceeds 15, even though training time continues to rise. We therefore selected N = 15 as the optimal operating point, representing the performance saturation threshold where further computational investment yields negligible accuracy improvements.

Appendix A.4. GFLOPs-CER Scatter Plot

Figure A3. Scatter plot of GFLOPs versus CER for different models.

Appendix A.5. Model Architecture Details

Table A3. Comparison of baseline model architecture details.

Model             Enc. Layers   Att. Heads   Att. Dim.   FFN Dim.   Conv. Kernel
OS-Denseformer    16            4            512         2048       15
Conformer-S       8             4            144         576        15
Conformer-M       12            6            256         1024       15
Conformer-L       16            8            512         2048       15
Squeezeformer-S   10            4            192         768        15
Squeezeformer-M   14            6            384         1536       15
Squeezeformer-L   18            8            512         2048       15
Zipformer-S       12            4            192         768        15
Zipformer-M       16            6            384         1536       15
Zipformer-L       20            8            512         2048       15
Speech-Mamba      24            –            512         2048       –

References

  1. Kheddar, H.; Hemis, M.; Himeur, Y.; Megias, D.; Amira, A. Deep Learning for Steganalysis of Diverse Data Types: A Review of Methods, Taxonomy, Challenges and Future Directions. Neurocomputing 2024, 581, 127528. [Google Scholar] [CrossRef]
  2. Nedjah, N.; Bonilla, A.D.; de Macedo Mourelle, L. Automatic Speech Recognition of Portuguese Phonemes Using Neural Networks Ensemble. Expert Syst. Appl. 2023, 229, 120378. [Google Scholar] [CrossRef]
  3. O’Shaughnessy, D. Trends and Developments in Automatic Speech Recognition Research. Comput. Speech Lang. 2024, 83, 101538. [Google Scholar] [CrossRef]
  4. Kheddar, H.; Himeur, Y.; Al-Maadeed, S.; Amira, A.; Bensaali, F. Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization. Knowl.-Based Syst. 2023, 277, 110851. [Google Scholar] [CrossRef]
  5. Kheddar, H.; Hemis, M.; Himeur, Y. Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
  6. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
  7. Zhang, W.; Zhai, M.; Huang, Z.; Liu, C.; Li, W.; Cao, Y. Towards End-to-End Speech Recognition with Deep Multipath Convolutional Neural Networks. In Proceedings of the International Conference on Intelligent Robotics and Applications, Shenyang, China, 8–11 August 2019; pp. 332–341. [Google Scholar]
  8. Miyazaki, K.; Murata, M.; Koriyama, T. Structured State Space Decoder for Speech Recognition and Synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  9. Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A Comparative Study on Transformer vs RNN in Speech Applications. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, China, 15–18 December 2019; pp. 449–456. [Google Scholar]
  10. Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; Kumar, S. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7829–7833. [Google Scholar]
  11. Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F.; et al. Transformer-Based Acoustic Modeling for Hybrid Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6874–6878. [Google Scholar]
  12. Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 5036–5040. [Google Scholar]
  13. Peng, Y.; Dalmia, S.; Lane, I.; Watanabe, S. Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 18–21 February 2022; pp. 17627–17643. [Google Scholar]
  14. Kim, K.; Wu, F.; Peng, Y.; Pan, J.; Sridhar, P.; Han, K.J.; Watanabe, S. E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Macau, China, 9–12 January 2023; pp. 84–91. [Google Scholar]
  15. Kim, S.; Gholami, A.; Shaw, A.; Lee, N.; Mangalam, K.; Malik, J.; Mahoney, M.W.; Keutzer, K. Squeezeformer: An Efficient Transformer for Automatic Speech Recognition. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 9361–9373. [Google Scholar]
  16. Yao, Z.; Guo, L.; Yang, X.; Kang, W.; Kuang, F.; Yang, Y.; Jin, Z.; Lin, L.; Povey, D. Zipformer: A Faster and Better Encoder for Automatic Speech Recognition. arXiv 2023, arXiv:2310.11230. [Google Scholar]
  17. Gao, X.; Chen, N.F. Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Macau, China, 2–5 December 2024; pp. 1–8. [Google Scholar]
  18. Gao, Z.; Zhang, S.; McLoughlin, I.; Yan, Z. Paraformer: Fast and Accurate Parallel Transformer for Non-Autoregressive End-to-End Speech Recognition. arXiv 2022, arXiv:2206.08317. [Google Scholar]
  19. Xu, K.-T.; Xie, F.-L.; Tang, X.; Hu, Y. FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration. arXiv 2025, arXiv:2501.14350. [Google Scholar]
  20. Yu, W.; Tang, C.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; Zhang, C. Connecting Speech Encoder and Large Language Model for ASR. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12637–12641. [Google Scholar]
  21. Gajecki, T.; Zhang, Y.; Nogueira, W. A Deep Denoising Sound Coding Strategy for Cochlear Implants. IEEE Trans. Biomed. Eng. 2023, 70, 2700–2709. [Google Scholar] [CrossRef] [PubMed]
  22. Yang, L.-P.; Fu, Q.-J. Spectral Subtraction-Based Speech Enhancement for Cochlear Implant Patients in Background Noise. J. Acoust. Soc. Am. 2005, 117, 1001–1004. [Google Scholar] [CrossRef] [PubMed]
  23. Guevara, N.; Bozorg-Grayeli, A.; Bebear, J.-P.; Ardoint, M.; Saaï, S.; Gnansia, D.; Hoen, M.; Romanet, P.; Lavieille, J.-P. The Voice Track Multiband Single-Channel Modified Wiener-Filter Noise Reduction System for Cochlear Implants: Patients’ Outcomes and Subjective Appraisal. Int. J. Audiol. 2016, 55, 431–438. [Google Scholar] [CrossRef] [PubMed]
  24. Sun, X.; Xia, R.; Li, J.; Yan, Y. A Deep Learning Based Binaural Speech Enhancement Approach with Spatial Cues Preservation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5766–5770. [Google Scholar]
  25. Nian, Z.; Tu, Y.-H.; Du, J.; Lee, C.-H. A Progressive Learning Approach to Adaptive Noise and Speech Estimation for Speech Enhancement and Noisy Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6913–6917. [Google Scholar]
  26. Bu, S.; Zhao, Y.; Zhao, T.; Wang, S.; Han, M. Modeling Speech Structure to Improve TF Masks for Speech Enhancement and Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2705–2715. [Google Scholar] [CrossRef]
  27. Pizarro, M.; Kolossa, D.; Fischer, A. Robustifying Automatic Speech Recognition by Extracting Slowly Varying Features. In Proceedings of the SPSC 2021, Virtual, 10–12 November 2021; pp. 37–41. [Google Scholar]
  28. Burchi, M.; Timofte, R. Audio-Visual Efficient Conformer for Robust Speech Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2258–2267. [Google Scholar]
  29. Tang, W.; Long, G.; Liu, L.; Zhou, T.; Blumenstein, M.; Jiang, J. Omni-Scale CNNs: A Simple and Effective Kernel Size Configuration for Time Series Classification. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  30. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  31. Weng, C.; Yu, C.; Cui, J.; Zhang, C.; Yu, D. Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 966–970. [Google Scholar]
  32. Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline. In Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Hsinchu, China, 17–19 October 2017; pp. 1–5. [Google Scholar]
  33. Dean, D.; Sridharan, S.; Vogt, R.; Mason, M. The QUT-NOISE-TIMIT Corpus for Evaluation of Voice Activity Detection Algorithms. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), Makuhari, Japan, 26–30 September 2010; pp. 3110–3113. [Google Scholar]
  34. Varga, A.; Steeneken, H.J.M. Assessment for Automatic Speech Recognition: II. NOISEX-92: A Database and an Experiment to Study the Effect of Additive Noise on Speech Recognition Systems. Speech Commun. 1993, 12, 247–251. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
