OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition

Que, Shiqi; Qian, Liping; Li, Mingqing; Wang, Qian

doi:10.3390/app152212096

Open AccessArticle

OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition

The Institute of Cyberspace Security, Zhejiang University of Technology, Hangzhou 310023, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(22), 12096; https://doi.org/10.3390/app152212096

Submission received: 20 October 2025 / Revised: 7 November 2025 / Accepted: 11 November 2025 / Published: 14 November 2025

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

Automatic speech recognition (ASR) technology faces the dual challenges of model complexity and noise robustness when deployed on terminal devices (e.g., mobile devices, embedded systems). To meet the demand for lightweight and high-performance models in terminal devices, we propose a lightweight end-to-end speech recognition model, OS-Denseformer (Omni-Scale-Denseformer). The core of this model lies in its lightweight design and noise adaptability: multi-scale acoustic features are efficiently extracted through a multi-sampling structure to enhance noise robustness; the proposed OS-Conv module improves local feature extraction capability while significantly reducing the number of parameters, enhancing computational efficiency, and lowering model complexity; the proposed normalization function, ExpNorm, normalizes the model output, facilitating more accurate parameter optimization during model training. Finally, we employ distinct loss functions across different training stages, using Minimum Bayes Risk (MBR) joint optimization to determine the optimal weighting scheme that directly minimizes the character error rate (CER). Experimental results on public datasets such as AISHELL-1 demonstrate that, under a high-noise environment of −15 dB, the CER of the OS-Denseformer model is reduced by 9.95%, 7.97%, and 4.85% compared to the benchmark models Squeezeformer, Conformer, and Zipformer, respectively. Additionally, the model parameter count is reduced by 53.35%, 10.27%, and 27.66%, while the giga floating-point operations per second (GFLOPs) are decreased by 67.51%, 66.51%, and 13.82%, respectively. Deployment on resource-constrained mobile devices demonstrates that, compared to Conformer, OS-Denseformer reduced memory usage by 10.79% and decreased inference latency by 61.62%.

Keywords:

automatic speech recognition; lightweight; multi-sampling; anti-noise; robustness; edge devices

1. Introduction

With the rapid advancement of artificial intelligence technologies and the widespread adoption of Internet of Things (IoT) applications, ASR is now extensively deployed on a wide range of smart and embedded devices, including smartphones, wearables, smart home systems, and industrial IoT nodes [1,2,3]. Consequently, the efficient deployment of high-performance ASR models on resource-constrained edge devices has become imperative. This capability has significant potential to enable more intelligent applications by providing greater ubiquity, enhancing real-time capabilities, and improving privacy protection [4,5,6].

However, the practical deployment of ASR models on edge devices faces two primary challenges. First, there is a fundamental tension between model complexity and resource constraints. Mainstream high-performance ASR models are typically large and computationally intensive, requiring resources that far exceed the limited processing capacity, storage space, and power of edge devices. Second, environmental noise is a major obstacle. The operational environments of edge devices are complex and unpredictable; factors such as background noise, reverberation, and far-field conditions can severely degrade speech signal quality, leading to significantly reduced recognition accuracy and model robustness. Consequently, developing ASR models that are both lightweight and noise-robust is a critical research challenge at the intersection of speech processing and embedded artificial intelligence.

Thus, we introduce and validate an efficient, robust end-to-end ASR framework that advances current speech recognition paradigms. By synergistically integrating specialized components, our framework enables accurate multi-scale speech processing in noisy environments while maintaining computational efficiency on resource-constrained edge devices.

This paper proposes OS-Denseformer, a lightweight end-to-end automatic speech recognition model, to address the challenge of robust speech recognition on edge devices operating in complex acoustic environments. The model employs a cascaded architecture comprising three key components: dilated convolutional downsampling, anti-noise feature-encoding blocks, and upsampling. This design enables multi-scale acoustic representation fusion to mitigate interference from redundant information, enhances the discriminability of acoustic features in noisy speech, and significantly reduces computational complexity. Additionally, we designed an enhanced local feature extraction module, termed OS-Conv. This module integrates a residual OS block, a densely connected depthwise separable convolution, and an ExpNorm function to improve the network’s capability to filter out non-semantic fragments. The residual OS block enables efficient receptive field modeling, while the densely connected depthwise separable convolution structure promotes feature reusability and reduces the number of parameters. Furthermore, the ExpNorm function helps reduce computational overhead during training and improve convergence stability. We employ a phased training strategy with a loss function that leverages MBR to directly align the optimization objective with the final evaluation metric. To enhance generalization in specific noisy environments, we adopt a transfer learning strategy that involves pre-training on clean speech before fine-tuning on noisy speech.

Experimental results on public datasets—including AISHELL-1, QUT-NOISE, and Noisex92—demonstrate that OS-Denseformer achieves superior performance compared to mainstream ASR models in both noise robustness and lightweight characteristics. Furthermore, deployment tests demonstrate that OS-Denseformer maintains robust performance with a compact footprint across diverse mobile hardware platforms.

2. Related Work

2.1. End-to-End ASR

End-to-end ASR models have been widely studied in recent years and have gradually become the mainstream. Zhang et al. [7] first proposed integrating a deep convolutional neural network structure into an end-to-end ASR framework. Subsequently, Miyazaki et al. [8] unified the decoding processes of speech recognition and synthesis through a unified state-space modeling approach. Inspired by the success of the Transformer architecture in natural language processing, researchers have successfully integrated the attention mechanism into end-to-end speech recognition models [9,10,11]. Both Convolutional Neural Networks (CNNs) and the Transformer architecture have become core components of end-to-end ASR systems. Nevertheless, CNNs are limited in capturing global features, whereas the self-attention mechanism in Transformers is less proficient at capturing fine-grained local patterns. The Conformer [12] bridges this gap by combining convolutional layers with self-attention modules. This hybrid model excels in modeling both global and local features, leading to significant performance improvements.

Recent efforts to reduce computational cost and enhance performance have focused on modifying the Conformer architecture. Peng et al. [13] introduced the Branchformer, which employs a dual-path architecture with parallel branches dedicated to extracting global dependencies and local patterns, respectively. E-Branchformer [14] builds upon this foundation by integrating dynamic gating and learnable convolutional modules to improve adaptability during local modeling of different speech segments. Kim et al. [15] proposed Squeezeformer, which simplifies the Conformer’s macaron structure into a more Transformer-like block and restructures the overall framework into a U-Net architecture to reduce computational overhead. Similarly, Yao et al. [16] introduced Zipformer, which also uses a U-Net-like structure and processes information at multiple sampling rates across its layers. Gao et al. [17] developed Speech-Mamba, a model that integrates a selective state space model (Mamba) with the Transformer to more efficiently capture long-range dependencies in speech, leading to faster inference. In a different approach, Gao et al. [18] proposed Paraformer, a non-autoregressive ASR system that uses parallel decoding to accelerate inference speed.

Our work is substantially inspired by Zipformer’s multi-sampling architecture, which effectively reduces computational costs while extracting fused speech–noise features from noisy signals. However, Zipformer’s rate transformation modules are relatively simplistic: downsampling uses averaging of adjacent frames, while upsampling simply duplicates frames. This downsampling approach suffers from severe aliasing artifacts, provides limited noise robustness, and restricts the receptive field to immediate neighbors, thereby failing to capture longer-range contextual dependencies. The upsampling method lacks anti-aliasing filters, introducing significant mirror frequency components in the spectral domain—another critical limitation we address.

Notably, contemporary speech recognition systems continue to rely predominantly on the CTC + AED framework as their core optimization objective [12,13,14,15,16]. However, in high-noise environments, acoustic feature distortion disperses the probability mass of correct transcriptions across multiple noisy hypotheses, causing these models to frequently converge to suboptimal solutions that capture noise artifacts rather than essential speech characteristics. Overcoming this fundamental limitation represents a central focus of our work.

Breakthroughs in large language models (LLMs) have spurred exploration into their integration with ASR systems. For instance, Xu et al. [19] proposed FireRedASR, which cascades an ASR system with an LLM to correct recognition errors—including homophones, technical terms, and colloquial ellipses. In another approach, Yu et al. [20] compressed frame-level speech encoder features into semantic unit sequences compatible with LLMs and aligned speech with text via dynamic duration prediction. Despite their powerful contextual modeling and semantic error correction capabilities, the massive size of LLMs is incompatible with the limited memory resources of edge devices. Moreover, a cloud-edge collaborative deployment strategy for LLMs introduces network dependency, along with inherent latency and privacy concerns. Consequently, employing LLMs to assist edge ASR systems faces significant technical bottlenecks, rendering this approach unsuitable for the scenarios targeted in this study.

2.2. Noise-Robust Speech Recognition

In practical scenarios, ASR systems are often degraded by ambient noise. The complex interplay of noise and speech distorts the original speech frames, introducing significant redundant information that interferes with the acoustic model during feature extraction [21].

To mitigate this, researchers have developed noise-robust speech recognition techniques, which fall into two principal categories: signal-domain processing and model-domain processing. Signal-domain methods treat speech enhancement as an independent front-end module, aiming to isolate clean information from noisy signals. Common techniques include spectral subtraction [22], Wiener filtering [23], and deep learning-based speech enhancement [24,25,26]. However, these approaches exhibit limited generalization across diverse and non-stationary noise types and often introduce detrimental front-end distortions. They also require real-time buffering of multiple audio frames to compute spectral masks or reconstruct waveforms, imposing significant memory and computational burdens that render them unsuitable for edge devices. In contrast, model-domain methods directly improve the ASR model’s inherent robustness by adapting its architecture or training procedure, enabling it to learn noise-resistant representations implicitly. For example, Pizarro et al. [27] used slow feature analysis to enhance noise robustness, and Burchi et al. [28] employed a multimodal strategy that incorporated visual information from lip movements. Because these techniques achieve robustness through end-to-end inference within a single model, their computational characteristics are better suited for deployment on resource-constrained edge devices.

Given the above, we focus on model-domain strategies, using the Conformer as our reference model. We use the Conformer as a reference model for our analysis. Its local feature extraction relies heavily on convolutional modules that use depthwise separable convolutions. This design is suboptimal for local feature extraction in high-noise conditions. Secondly, the Conformer uses a uniform stack of blocks operating at a single sampling rate, which fails to capture multi-scale acoustic features in noise and can lead to the accumulation of redundant information. Furthermore, under noisy conditions, the hybrid CTC/AED approach becomes highly susceptible to distortion of local acoustic features, consequently leading to inefficient optimization and training instability.To overcome these drawbacks, OS-Denseformer introduces three key innovations: (1) the OS-Conv module, an enhanced replacement for the depthwise separable convolution; (2) a multi-sampling architecture for enriched feature diversity and reduced computational overhead; and (3) a phased MBR loss function for direct optimization of CER.

3. Proposed OS-Denseformer

3.1. OS-Denseformer Encoder

Figure 1 illustrates the overall architecture of the OS-Denseformer encoder. It uses a cascaded feature extraction module designed to effectively extract multi-scale acoustic features from noisy speech and enhance the resulting representations. The pipeline comprises three main components: dilated convolutional downsampling, an anti-noise feature-encoding block, and upsampling. Each anti-noise feature-encoding block operates on the downsampled, low-sampling-rate signals, significantly reducing the computational load during processing. In the figure, dashed boxes represents a distinct feature extraction module.

The encoder employs a cascade of six feature extraction modules to capture multi-scale speech features. The first module contains two stacked anti-noise feature-encoding blocks. The remaining five modules each follow a downsampling–encoding–upsampling pattern. They contain varying numbers of anti-noise feature-encoding blocks: 2, 3, 4, 3, and 2, respectively. Their embedding dimensions are 192, 256, 384, 512, 384, and 256, respectively, with the central module having the largest dimension. This architecture places more blocks in the higher-dimensional intermediate modules, forming a symmetric distribution. This design avoids the representation saturation problem inherent in single-sampling-rate architectures like the Conformer. Each feature extraction module uses a connector to fuse its output features back to its input features at a 50 Hz sampling rate. Finally, the sampling rate is reduced from 50 Hz down to 25 Hz by a convolutional downsampling layer, yielding the encoder’s final output.

3.2. Sampling Module and Connector

To improve the modeling of fused speech-noise features, we adapt an image-processing technique for sampling-rate transformation within the anti-noise feature-encoding block. In this analogy, the speech frame information is conceptualized as a grayscale image, where the single channel of the speech frame corresponds to that of a grayscale image, the FBank feature dimension (typically 80) defines the image height, and the variable number of speech frames (determined by sampling rate and utterance length) defines the image width.

The workflow of the sampling module is illustrated in Figure 2. We propose a downsampling module that integrates average pooling with dilated convolution. First, average pooling serves as an effective anti-aliasing filter by attenuating signal components above the target Nyquist frequency, thereby substantially reducing spectral aliasing in subsequent processing stages. The subsequent dilated convolution then performs feature extraction from the low-resolution representation while systematically expanding the receptive field. Crucially, unlike fixed averaging operations, the dilated convolution employs learnable kernel parameters that enable the model to adaptively discover optimal feature integration patterns directly from training data. The downsampled vectors then undergo high-level semantic feature modeling in the anti-noise feature-encoding block at this lower sampling rate.

Finally, the sampling rate is restored to 50 Hz through an upsampling module to reestablish temporal alignment. We employ bilinear interpolation combined with CNN post-processing as the upsampling module. Bilinear interpolation inherently functions as a weak anti-imaging filter, attenuating a portion of the imaging frequencies. Subsequently, a CNN-based residual learning strategy is applied to eliminate artifacts such as residual imaging frequencies from the interpolation and recover high-frequency details.

As shown in Figure 3, the X and Y axes represent positions in the FBank, while the P axis denotes the spectral energy value. The spectral energy value at the interpolation point is calculated based on known values at four coordinates:

(x_{1}, y_{1})

,

(x_{1}, y_{2})

,

(x_{2}, y_{1})

, and

(x_{2}, y_{2})

. First, interpolation is performed along the x-direction between

p_{11}

and

p_{12}

and

p_{21}

and

p_{22}

:

p_{1} = \frac{x_{2} - x}{x_{2} - x_{1}} p_{11} + \frac{x - x_{1}}{x_{2} - x_{1}} p_{21},

(1)

p_{2} = \frac{x_{2} - x}{x_{2} - x_{1}} p_{12} + \frac{x - x_{1}}{x_{2} - x_{1}} p_{22},

(2)

where

p_{1}

and

p_{2}

denote the interpolated spectral energy values at points

(x, y_{1})

and

(x, y_{2})

, respectively, as obtained from the initial x-direction interpolation. Subsequently, the final interpolated value p is calculated by performing interpolation along the y-direction between

p_{1}

and

p_{2}

, as given by the following expression:

p = \frac{y_{2} - y}{y_{2} - y_{1}} p_{1} + \frac{y - y_{1}}{y_{2} - y_{1}} p_{2},

(3)

where p represents the final spectral energy value at the target interpolation point

(x, y)

. This bilinear interpolation process within the FBank domain upsamples the feature representation from a lower sampling rate back to 50 Hz, thereby preserving acoustic information that might be lost after aggressive downsampling. Finally, a CNN is used for post-processing to mitigate interpolation artifacts and restore high-frequency details in the upsampled features.

As shown in Figure 1, the proposed OS-Denseformer incorporates a learnable connector to fuse features. This operation is formulated as follows:

z = c ⊙ x + (1 - c) ⊙ y,

(4)

where c is a learnable parameter, x is the feature embedding after processing, y is the original feature embedding, and ⊙ denotes element-wise multiplication. This connective structure adaptively adjusts the parameter c during training to control the relative contribution of the transformed features x and the original features y, acting as a learned gating mechanism.

3.3. Anti-Noise Feature-Encoding Block

To capture multi-scale representations, the anti-noise feature-encoding block processes feature vectors from different sampling rates. As shown in Figure 4, each encoding block comprises a 4-head self-attention module, two two-layer feed-forward networks (FFNs), an OS-Conv module, and the ExpNorm normalization function. The self-attention mechanism projects the 512-dimensional input into four parallel heads of 128 dimensions each. The FFN uses an inner dimension of 2048 with sigmoid activation. These identical blocks (N = 16) are stacked to form the complete encoder.

The feed-forward modules apply independent nonlinear transformations to each position in the feature sequence, enhancing the model’s capacity to learn complex patterns. The multi-head self-attention module employs a self-attention mechanism with relative positional encoding to capture global dependencies within the input. The OS-Conv module handles local feature extraction through its Omni-scale block (OS block) [29] and convolutional components. Finally, the ExpNorm function normalizes the outputs to stabilize training. Given an input feature vector

f_{i}

to the i-th encoding block, the output

r_{i}

is computed as follows:

\{\begin{matrix} {\tilde{f}}_{i} = f_{i} + \frac{1}{2} F F N (f_{i}) \\ f_{i}^{'} = {\tilde{f}}_{i} + M H S A ({\tilde{f}}_{i}) \\ f_{i}^{″} = f_{i}^{'} + O S C o n v (f_{i}^{'}) \\ r_{i} = E x p N o r m (f_{i}^{″} + \frac{1}{2} F F N (f_{i}^{″})) \end{matrix},

(5)

where

F F N (\cdot)

is the feed-forward module,

M H S A (\cdot)

is the multi-head self-attention module,

O S C o n v (\cdot)

is the OS-Conv module, and

E x p N o r m (\cdot)

is the ExpNorm normalization function.

In noisy environments, the superposition of speech and noise often causes acoustic models to misclassify non-speech segments as speech. This introduces non-semantic information as valid features, degrading recognition accuracy. To address this, our OS-Denseformer incorporates an OS-Conv module to enhance local feature extraction, improving upon Conformer architectures that rely on a single convolutional module. As shown in Figure 5, this enhanced OS-Conv module consists of three OS blocks with residual connections and five densely connected depthwise separable convolution modules, collectively enhancing the network’s capacity to model complex acoustic features. When processing features from diverse noise types, the network adaptively approximates the optimal receptive field. The five depthwise separable convolution modules (kernel size = 15) enable deep abstract feature extraction. This design mitigates gradient vanishing and explosion problems caused by increasing network depth and improves feature reuse via maximized feature interactions [30].

For the normalization of the anti-noise feature-encoding block’s output, we introduce the ExpNorm function, motivated by RMSNorm, as a replacement for the standard Conformer’s LayerNorm to enhance computational efficiency. It reduces computational overhead and improves training stability. The mathematical definition of ExpNorm is as follows:

E x p N o r m (x) = \frac{x}{R M S (x) + ϵ} \times e^{γ},

(6)

where x is the input,

γ

is a scaling parameter,

R M S (\cdot)

is the root mean square operator, and

ϵ

serves as a numerical stabilizer to prevent division by zero errors. The key differences from LayerNorm are that ExpNorm uses the root mean square instead of the mean and variance, and it employs a single scalar

γ

rather than a per-channel vector, significantly reducing computational overhead. Furthermore, we observed that during early training stages, the optimizer tends to drive

γ

toward zero to mitigate the negative impact of untrained modules. This causes

γ

to oscillate near zero, resulting in frequent gradient sign changes. Such inconsistent gradient directions can trap the module in poor local optima that are difficult to escape, thereby impeding training convergence. In ExpNorm, we employ an exponential parameterization for

γ

, which ensures the scaling parameter remains consistently positive, thus effectively circumventing the aforementioned issues.

3.4. Loss Function

To overcome the limitations of conventional loss functions in noisy acoustic environments, we introduce a loss function that directly aligns model optimization with the CER, the primary evaluation metric. This alignment is achieved through a joint optimization framework based on MBR [31].

The training process is divided into two stages. First, a conventional hybrid loss

L_{Stage 1}

is used for initial training, allowing the model to learn initial acoustic–linguistic mappings. This stage provides a stable starting point for subsequent MBR training, mitigating potential instability from direct MBR training from a random initialization. The Stage 1 loss function is defined as follows:

L_{Stage 1} = λ L_{CTC} + (1 - λ) L_{AED}

(7)

where

L_{Stage 1}

denotes the conventional hybrid loss, comprising the connectionist temporal classification (CTC) loss

L_{CTC}

, the attention-based encoder–decoder (AED) loss

L_{AED}

, and a balancing hyperparameter

λ \in [0, 1]

.

Given that CTC’s alignment is vital for early training stability and the Attention mechanism’s semantic modeling dominates later, we employ a dynamic scheduling strategy for

λ

, defined as follows:

λ = max (λ_{\min}, λ_{init} \cdot (1 - \frac{e}{E_{total}}))

(8)

where

λ_{\min}

and

λ_{init}

denote the minimum and initial CTC weight coefficients, e is the current training epoch, and

E_{total}

is the total number of epochs.

Stage 2 replaces the AED loss with MBR optimization. For each training sample

(x, y_{ref})

, the warm-started decoder generates N candidate sequences

{y_{i}}_{i = 1}^{N}

through stochastic sampling of the output distribution. The risk

R (y_{i}, y_{ref})

for each candidate

y_{i}

is defined as its CER with respect to the reference

y_{ref}

. The expected risk is approximated by a finite-sample weighted average over the N candidates:

L_{MBR} \approx \sum_{i = 1}^{N} \frac{P (y_{i} | x)}{\sum_{j = 1}^{N} P (y_{j} | x)} C E R (y_{i}, y_{ref})

(9)

where

P (y_{i} | x)

is the probability of candidate sequence

y_{i}

computed from import x by the AED decoder.

Because

C E R (y_{i}, y_{ref})

is a non-differentiable function, we employ the REINFORCE algorithm to estimate the gradient of the MBR loss

L_{MBR}

. The gradient estimation procedure is detailed below:

\nabla_{θ} L_{MBR} \approx \frac{1}{N} \sum_{i = 1}^{N} (C E R (y_{i}, y_{ref}) - b) \nabla_{θ} log P (y_{i} | x)

(10)

where b is the baseline used to reduce the variance of the gradient estimate. It is calculated as the average risk over all candidate sequences for all utterances in the current mini-batch, a common practice termed standard batch-average risk.

Stage 2 final loss is a combination of the CTC and MBR losses:

L_{Stage 2} = λ L_{CTC} + (1 - λ) L_{MBR}

(11)

where the CTC weight coefficients

λ

uses the same value as defined in Stage 1.

4. Experimental Results and Analysis

4.1. Experimental Data and Preprocessing

Clean speech data were sourced from the AISHELL-1 Mandarin speech database [32], with noise samples drawn from the QUT-NOISE [33] and Noisex92 [34] datasets. The QUT-NOISE dataset encompasses three primary acoustic scenes—domestic, office, and public and urban environments—which are further categorized into nine distinct noise types. The NOISEX-92 dataset includes eight noise categories, such as white noise, HF channel noise, and factory workshop noise. We selected an equal proportion of samples from each noise category to ensure balanced representation and prevent bias toward any specific noise type during training.

The noise-augmented training set was constructed by mixing clean speech with environmental noise at prescribed signal-to-noise ratios (SNRs), as outlined in Figure 6. The procedure was as follows: First, the clean speech training set was divided equally into seven subsets. Next, we randomly extracted noise segments matching the duration of the clean speech. To ensure balanced representation across all noise conditions and prevent bias toward any specific noise type, we maintained equal sampling proportions from each noise category during this process. These segments were then mixed with the corresponding clean speech subsets at SNRs of −15 dB, −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, and 15 dB. Finally, all noise-mixed samples across different SNR conditions were combined to form the final noise-augmented training dataset.

The model was initially pre-trained for 150 epochs on the clean AISHELL-1 dataset to establish a robust baseline. Subsequently, it was fine-tuned for 50 epochs on noisy speech synthesized from AISHELL-1 and QUT-NOISE. Robustness was evaluated on two distinct noisy test sets: AISHELL-1 with QUT-NOISE noise and AISHELL-1 with Noisex92 noise. This yields two evaluation scenarios: the first tests performance on matched noise (QUT-NOISE), while the second assesses generalization to unmatched noise (Noisex92).

We emphasize that all mixed datasets were constructed by strictly following the original official data partitions of each source dataset. Crucially, speakers appearing in the official AISHELL-1 test set were excluded from both training and validation partitions. Additionally, noise segments from QUT-NOISE and Noisex92 were rigorously separated to ensure completely disjoint sets for training, validation, and testing.

4.2. Experimental Environment and Parameter Settings

All experiments were run on a server equipped with a 64-bit Ubuntu OS, an Intel Xeon Platinum 8352V CPU, an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM), and 32 GB of DDR4 RAM. The models were developed in Python 3.10 with the PyTorch 2.9.0 framework. Comprehensive hyperparameter settings are listed in Table 1. To improve generalization, we employed a parameter-averaging strategy, where the parameters from the top-10 performing epochs were averaged to produce the final model.

4.3. Experimental Verification and Analysis

To evaluate the proposed OS-Denseformer, we conducted a comprehensive set of experiments: (1) a performance comparison with mainstream ASR models; (2) an analysis of noisy-data fine-tuning on robustness; and (3) ablation studies on key architectural components.

4.3.1. Comparison and Analysis with Mainstream Models

We evaluated OS-Denseformer by comparing it against state-of-the-art speech recognition models; results are presented in Table 2. Models were categorized by size into three groups: L (large), M (medium), and S (small).

Table 2 presents two principal findings. (1) Model Scale Enhances Robustness: Results indicate a clear trend within models above, which is that noise robustness improves with scale. The Conformer illustrates this, with CER on matched noise decreasing from 13.79% (S, 10.3 M) and 11.42% (M, 32.6 M) to 10.83% (L, 125.6 M), and on unmatched noise from 13.98%, 11.66% to 11.32%. This comes at a high computational cost, with GFLOPs rising from 30.5 (S), 79.0 (M), to 294.1 (L). (2) OS-Denseformer’s Performance Superiority: OS-Denseformer demonstrates compelling performance advantages. It achieves the lowest CER against Transformer, Conformer-L, Squeezeformer-L, and Zipformer-L, while reducing parameters by 147.9 M, 12.9 M, 128.9 M, and 43.1 M and GFLOPs by 544.2, 206.6, 215.7, and 26.8, respectively. This validates OS-Denseformer’s dual strengths in efficiency and accuracy for noisy speech recognition.

While Speech-Mamba achieves competitive performance on clean speech, its CER degrades substantially more than all other models under noisy conditions. This indicates that the state-space modeling approach currently lacks the robustness of optimized CNN-Transformer hybrid architectures when processing complex, non-stationary environmental noise.

As shown in Figure A3 (Appendix A), the scatter plot of GFLOPs versus CER across model architectures illustrates that OS-Denseformer occupies a position on the Pareto frontier within the lower-left quadrant—demonstrating its optimal balance between computational efficiency and recognition accuracy.

Figure 7 presents the CER performance of various models under different SNR conditions. A performance comparison was conducted between OS-Denseformer and the large-scale Conformer, Squeezeformer, and Zipformer. As expected, a consistent trend in decreasing CER with increasing SNR is observed for all models. OS-Denseformer consistently achieves the lowest CER across the SNR range, with its superiority being most pronounced under low-SNR conditions. To quantify the performance advantage across the −15 dB to 15 dB SNR range, we computed the Area Under the Curve (AUC) for each model. OS-Denseformer achieved the lowest AUC of 407.0, reflecting marked improvements of 22.5%, 19.2%, and 12.4% over the Conformer-L (525.4), Squeezeformer-L (503.6), and Zipformer-L (464.5), respectively.

The performance advantage of OS-Denseformer grows as the SNR declines, highlighting its superior noise robustness. At −15 dB, the CER improvements reach 9.95%, 7.97%, and 4.85% relative to Squeezeformer-L, Conformer-L, and Zipformer-L, respectively. These findings underscore the model’s exceptional robustness in low-SNR conditions.

4.3.2. Model Fine-Tuning Results and Analysis

We assessed the effect of noise-augmented fine-tuning using the following four experimental conditions: (1) Baseline: The original pre-trained model (no fine-tuning); (2) Clean FT: Fine-tuning on clean speech; (3) Unmatched Noise FT: Fine-tuning on Noisex92; (4) Matched Noise FT: Fine-tuning on QUT-NOISE.

We assessed all four model variants on a test set constructed by mixing the AISHELL-1 dataset with QUT-NOISE noise.

As shown in Figure 8, the CER results across different SNR conditions reveal a clear performance hierarchy: the pre-trained model without fine-tuning showed the highest CER, followed by the model fine-tuned only on clean speech. The model fine-tuned with noise-augmented data achieved better performance in unmatched noise conditions and the best performance in matched noise conditions. The comparative analysis revealed that fine-tuning with noise-augmented datasets most effectively improves speech recognition accuracy under strong noise interference. Specifically, when fine-tuned on matched noise data, the model achieved CER reductions of 1.37%, 2.89%, and 4.1% at −5 dB, −10 dB, and −15 dB SNR, respectively, compared to the model fine-tuned on clean speech.

The results further confirm that OS-Denseformer maintains remarkably stable performance on unmatched noise types, with no significant degradation. Compared to matched noise conditions, the CER on unseen noise increased by only 0.59%, 1.17%, and 2.1% at −5 dB, −10 dB, and −15 dB, respectively. Despite these minor increments, model’s CER remained substantially lower than both the baseline and the model fine-tuned only on clean speech. These findings confirm that noise-augmented fine-tuning—especially with matched noise—effectively enhances the robustness of OS-Denseformer in challenging acoustic environments.

4.3.3. Ablation Experiments and Analysis

An ablation study was conducted to assess the individual contributions of key modules in the OS-Denseformer architecture. Following a controlled-variable methodology, only one target component was altered per experiment while maintaining all other elements fixed. The corresponding results are summarized in Table 3.

Ablation results in Table 3 delineate each component’s contribution:

Multi-sampling Architecture: This design reduces CER versus a single downsampling baseline, while concurrently reducing parameters by 24.2% (from 148.6 M to 112.7 M) and GFLOPs substantially, affirming its efficient acoustic feature extraction.

Downsampling and Upsampling Module: Replacing the proposed downsampling and upsampling modules with simple frame averaging and duplication directly increases CER. This finding provides empirical validation that our sampling architecture successfully mitigates the performance degradation typically induced by spectral distortion during acoustic signal processing. As supplementary material, Table A1 in the Appendix A quantifies the spectral distortion introduced by different sampling modules before and after rate transformation.

OS-Conv Module: Ablating the OS block raised CER, confirming its critical role in noisy speech modeling.

Normalization Function: Replacing ExpNorm with LayerNorm raised computational cost, validating ExpNorm’s efficiency advantage for inference. Additionally, we conducted comparative evaluations of the training convergence behavior and CER performance for models employing ExpNorm versus RMSNorm. The corresponding results are presented in Figure A1 and Figure A2 in the Appendix A.

MBR Loss Function: Substituting the two-stage MBR loss with the conventional CTC + AED loss function results in a noticeable increase in CER. This result further validates that our two-stage MBR framework directly minimizes the expected edit distance between reference texts and N-best hypothesis sets. By evaluating the relative quality among hypotheses rather than relying on absolute probability estimates, MBR effectively guides the model toward noise-robust representations when acoustic features are corrupted. The online decoding process generates diverse N-best candidates, enabling information integration across multiple hypotheses. Through this ensemble-like approach, the model captures essential speech patterns even when some hypotheses contain noise artifacts, thereby reducing the detrimental effects of noise on individual predictions. Additionally, we conducted a sensitivity analysis on the hyperparameter N, with the results presented in Table A2 in the Appendix A.

To quantify temporal redundancy in the multi-sampling architecture, we use cosine similarity as a metric. The similarity was measured between outputs of adjacent encoding blocks, with higher values denoting increased feature redundancy. Figure 9 shows that the multi-sampling architecture maintains substantially lower cosine similarity compared to the single downsampling structure, particularly in deeper layers. This effect stems from the architecture’s inherent multi-scale design, which promotes diverse feature learning across sampling rates and reduces redundancy. Consequently, the design mitigates temporal redundancy, sharpens salient feature extraction, and reduces inference cost.

A key observation from Figure 9 is the increase in temporal redundancy when processing noisy versus clean speech. The multi-sampling architecture, however, generates feature representations from noisy inputs that are more aligned with those from clean speech. This suggests that the multi-scale mechanism better captures the underlying speech structure amidst noise. Consequently, it is less prone to incorporating non-speech noise into features, leading to improved robustness.

5. Mobile Deployment and Performance Analysis

The resource constraints and power sensitivity of mobile devices, which differ fundamentally from server environments, necessitate thorough system-level evaluation. We deployed OS-Denseformer on smartphones and evaluated its performance using four key metrics: (1) Loading Time: The duration from loading the model into memory until it is ready for inference, affecting cold-start performance; (2) Memory Consumption: Peak memory consumption during operation; (3) Time Delay per Word: The latency for streaming inference to generate text from input audio; (4) Power Consumption: Energy used during model execution.

Our mobile end-to-end speech recognition system implements a complete low-latency pipeline that converts continuous audio signals into text output. A lightweight GRU-based Voice Activity Detection (VAD) module processes the raw 16 kHz single-channel audio stream in real-time. When speech onset is detected, the VAD initiates processing and enables the downstream pipeline. Detected speech segments are forwarded to the feature extraction module to produce streaming log-Mel spectrograms. These features are fed to the acoustic model for streaming encoding, and finally, an on-device decoder incorporating a quantized MobileBERT generates recognized text using beam search.

The PyTorch-trained model was first converted to the ONNX format using ONNX Runtime Mobile. To accommodate mobile resource constraints, we applied static quantization, converting FP32 weights to INT8. The impact of quantization on storage, inference latency and accuracy was then tested for both OS-Denseformer and Conformer on a Xiaomi 10 smartphone, the results are presented in Table 4. Within the table cells, Inference Latency and Memory Consumption are reported in the “Median/Mean/P95” format.

Experimental results confirm that quantization reduces storage requirements by approximately 75% while significantly accelerating inference. In comparative evaluations, OS-Denseformer demonstrates significantly lower memory usage and achieves a reduced CER while cutting inference latency by more than half compared to Conformer.

The cross-platform compatibility of the OS-Denseformer system was evaluated on three smartphones: Xiaomi 10, Xiaomi 12S, and Vivo X100 Pro. The corresponding performance benchmarks are provided in Table 5. OS-Denseformer maintains stable and efficient performance across diverse mobile platforms, without introducing CER spikes or excessive power drain due to variations in hardware and software configurations.

We evaluated quantization robustness in four common noisy environments: server rooms, busy streets, subway platforms, and riverside areas. Table 6 compares CER of OS-Denseformer and Conformer under both FP32 and INT8 precision.

Our experimental results reveal three principal findings:(1) OS-Denseformer consistently surpasses Conformer across all tested environments under both FP32 and INT8 precision, achieving superior CER performance. (2) Quantization-induced CER increase is substantially more pronounced for Conformer (+1.37% to +2.67%) than for OS-Denseformer (+1.07% to +1.88%), confirming superior quantization robustness. (3) OS-Denseformer demonstrates reduced variance across noise conditions (e.g., 15.27 ± 0.92% vs. 16.05 ± 0.98% in subway environments), indicating more stable performance.

These results demonstrate OS-Denseformer’s superior robustness in maintaining accuracy under both quantization constraints and diverse acoustic conditions, underscoring its strong suitability for deployment on resource-constrained devices where computational efficiency and environmental adaptability are paramount.

We have validated the model’s compliance with core mobile deployment requirements—accuracy, latency, and memory footprint. However, thermal stability under prolonged usage and its potential impact on sustained performance remain critical factors requiring thorough empirical evaluation in real-world product environments to ensure product-ready robustness.

6. Conclusions and Future Works

This paper introduces OS-Denseformer, a lightweight and noise-robust end-to-end model for ASR designed to tackle the persistent challenges of computational complexity and performance degradation in noisy environments. Its multi-sampling architecture achieves a favorable efficiency–robustness trade-off by hierarchically compressing speech signals into multi-scale representations, thereby reducing computational demand. The proposed OS-Conv module further enhances local feature extraction and promotes parameter efficiency through a reuse mechanism. The proposed phased MBR loss function directly integrates CER evaluation metric into the training process, thereby enhancing parameter optimization. Comprehensive evaluations demonstrate that OS-Denseformer achieves superior recognition accuracy over leading models, including Transformer, Conformer, and Squeezeformer, with significantly lower computational and memory footprints in noisy settings.

While this study focuses on model-domain lightweight noise robustness, future works will enhance edge adaptability through: (1) developing a runtime model selection strategy guided by real-time acoustic environment profiling (noise type and SNR); (2) implementing a dynamic sub-module orchestration framework that optimally balances recognition accuracy with resource constraints. In addition, we will (3) explore synergistic integration with lightweight front-end speech enhancement modules to further improve performance in challenging acoustic conditions; (4) extend the evaluation to multilingual scenarios, particularly in English, to validate the generalizability of the proposed approach.

Author Contributions

S.Q. was primarily responsible for manuscript writing, with the experimental design and numerical simulation conducted under the guidance of L.Q. and Q.W.; data collection and manuscript refinement were handled by S.Q. and M.L., while L.Q. and M.L. provided key assistance in data analysis and conducted a comprehensive analysis of the results. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the National Natural Science Foundation of China (NSFC), Grant Numbers 62122069 and 620714314.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available: the AISHELL-1 dataset is accessible at https://www.aishelltech.com/kysjcp (accessed on 1 November 2025), the QUT-NOISE dataset can be obtained from https://research.qut.edu.au/saivt/databases/qut-noise-databases-and-protocols/ (accessed on 1 November 2025), and the Noisex92 dataset is available at http://spib.linse.ufsc.br/noise.html (accessed on 1 November 2025). The data augmentation scripts are publicly available at https://github.com/queshiqi/OS-Denseformer (accessed on 1 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. ExpNorm

Figure A1. A comparative analysis of the training convergence behavior between models employing ExpNorm and RMSNorm as normalization functions.

Figure A2. CER performance of models utilizing ExpNorm and RMSNorm.

Appendix A.2. Sampling Module

Table A1. Comparative analysis of post-sampling spectral distortion.

Method	Metrics
Method	LSD (dB)	Spectral Energy Error (MSE)	High-Freq Preservation (%)	Low-Freq Preservation (%)
Baseline Methods
Down: Frame averaging	$2.85 \pm 0.23$	$0.156 \pm 0.021$	$72.3 \pm 4.1$	$88.7 \pm 3.2$
Up: Frame repetition	$3.12 \pm 0.31$	$0.182 \pm 0.028$	$68.9 \pm 5.2$	$85.4 \pm 4.1$
Proposed Methods
Down: Avg pool + dilated conv	$1.92 \pm 0.15$	$0.087 \pm 0.012$	$86.5 \pm 3.3$	$94.2 \pm 2.1$
Up: Bilinear + CNN post-proc	$1.78 \pm 0.14$	$0.074 \pm 0.009$	$89.1 \pm 2.8$	$96.3 \pm 1.8$

Appendix A.3. MBR Loss Function

Table A2. Analysis of the Impact of MBR Candidate Number N on Development Set CER.

Candidate	Dev Set	Relative	Training
Number N	CER (%)	Change	Time (Hours)
5	8.92	-	4.6
10	8.45	↓ 0.47	5.8
15	8.23	↓ 0.22	6.2
20	8.21	↓ 0.02	8.3
30	8.20	↓ 0.01	10.1

The results show that while increasing the number of candidate sequences N initially leads to a substantial reduction in CER, the performance gains diminish significantly once N exceeds 15, even though training time continues to rise. We therefore selected N = 15 as the optimal operating point, representing the performance saturation threshold where further computational investment yields negligible accuracy improvements.

Appendix A.4. GFLOPs-CER Scatter Plot

Figure A3. Scatter plot of GFLOPs versus CER for different models.

Appendix A.5. Model Architecture Details

Table A3. Comparison of baseline model architecture details.

Model	Enc. Layers	Att. Heads	Att. Dim.	FFN Dim.	Conv. Kernel
OS-Denseformer	16	4	512	2048	15
Conformer-S	8	4	144	576	15
Conformer-M	12	6	256	1024	15
Conformer-L	16	8	512	2048	15
Squeezeformer-S	10	4	192	768	15
Squeezeformer-M	14	6	384	1536	15
Squeezeformer-L	18	8	512	2048	15
Zipformer-S	12	4	192	768	15
Zipformer-M	16	6	384	1536	15
Zipformer-L	20	8	512	2048	15
Speech-Mamba	24	–	512	2048	–

References

Kheddar, H.; Hemis, M.; Himeur, Y.; Megias, D.; Amira, A. Deep Learning for Steganalysis of Diverse Data Types: A Review of Methods, Taxonomy, Challenges and Future Directions. Neurocomputing 2024, 581, 127528. [Google Scholar] [CrossRef]
Nedjah, N.; Bonilla, A.D.; de Macedo Mourelle, L. Automatic Speech Recognition of Portuguese Phonemes Using Neural Networks Ensemble. Expert Syst. Appl. 2023, 229, 120378. [Google Scholar] [CrossRef]
O’Shaughnessy, D. Trends and Developments in Automatic Speech Recognition Research. Comput. Speech Lang. 2024, 83, 101538. [Google Scholar] [CrossRef]
Kheddar, H.; Himeur, Y.; Al-Maadeed, S.; Amira, A.; Bensaali, F. Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization. Knowl.-Based Syst. 2023, 277, 110851. [Google Scholar] [CrossRef]
Kheddar, H.; Hemis, M.; Himeur, Y. Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
Zhang, W.; Zhai, M.; Huang, Z.; Liu, C.; Li, W.; Cao, Y. Towards End-to-End Speech Recognition with Deep Multipath Convolutional Neural Networks. In Proceedings of the International Conference on Intelligent Robotics and Applications, Shenyang, China, 8–11 August 2019; pp. 332–341. [Google Scholar]
Miyazaki, K.; Murata, M.; Koriyama, T. Structured State Space Decoder for Speech Recognition and Synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A Comparative Study on Transformer vs RNN in Speech Applications. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, China, 15–18 December 2019; pp. 449–456. [Google Scholar]
Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; Kumar, S. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7829–7833. [Google Scholar]
Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F.; et al. Transformer-Based Acoustic Modeling for Hybrid Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6874–6878. [Google Scholar]
Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 5036–5040. [Google Scholar]
Peng, Y.; Dalmia, S.; Lane, I.; Watanabe, S. Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 18–21 February 2022; pp. 17627–17643. [Google Scholar]
Kim, K.; Wu, F.; Peng, Y.; Pan, J.; Sridhar, P.; Han, K.J.; Watanabe, S. E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Macau, China, 9–12 January 2023; pp. 84–91. [Google Scholar]
Kim, S.; Gholami, A.; Shaw, A.; Lee, N.; Mangalam, K.; Malik, J.; Mahoney, M.W.; Keutzer, K. Squeezeformer: An Efficient Transformer for Automatic Speech Recognition. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 9361–9373. [Google Scholar]
Yao, Z.; Guo, L.; Yang, X.; Kang, W.; Kuang, F.; Yang, Y.; Jin, Z.; Lin, L.; Povey, D. Zipformer: A Faster and Better Encoder for Automatic Speech Recognition. arXiv 2023, arXiv:2310.11230. [Google Scholar]
Gao, X.; Chen, N.F. Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Macau, China, 2–5 December 2024; pp. 1–8. [Google Scholar]
Gao, Z.; Zhang, S.; McLoughlin, I.; Yan, Z. Paraformer: Fast and Accurate Parallel Transformer for Non-Autoregressive End-to-End Speech Recognition. arXiv 2022, arXiv:2206.08317. [Google Scholar]
Xu, K.-T.; Xie, F.-L.; Tang, X.; Hu, Y. FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration. arXiv 2025, arXiv:2501.14350. [Google Scholar]
Yu, W.; Tang, C.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; Zhang, C. Connecting Speech Encoder and Large Language Model for ASR. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12637–12641. [Google Scholar]
Gajecki, T.; Zhang, Y.; Nogueira, W. A Deep Denoising Sound Coding Strategy for Cochlear Implants. IEEE Trans. Biomed. Eng. 2023, 70, 2700–2709. [Google Scholar] [CrossRef] [PubMed]
Yang, L.-P.; Fu, Q.-J. Spectral Subtraction-Based Speech Enhancement for Cochlear Implant Patients in Background Noise. J. Acoust. Soc. Am. 2005, 117, 1001–1004. [Google Scholar] [CrossRef] [PubMed]
Guevara, N.; Bozorg-Grayeli, A.; Bebear, J.-P.; Ardoint, M.; Saaï, S.; Gnansia, D.; Hoen, M.; Romanet, P.; Lavieille, J.-P. The Voice Track Multiband Single-Channel Modified Wiener-Filter Noise Reduction System for Cochlear Implants: Patients’ Outcomes and Subjective Appraisal. Int. J. Audiol. 2016, 55, 431–438. [Google Scholar] [CrossRef] [PubMed]
Sun, X.; Xia, R.; Li, J.; Yan, Y. A Deep Learning Based Binaural Speech Enhancement Approach with Spatial Cues Preservation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5766–5770. [Google Scholar]
Nian, Z.; Tu, Y.-H.; Du, J.; Lee, C.-H. A Progressive Learning Approach to Adaptive Noise and Speech Estimation for Speech Enhancement and Noisy Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6913–6917. [Google Scholar]
Bu, S.; Zhao, Y.; Zhao, T.; Wang, S.; Han, M. Modeling Speech Structure to Improve TF Masks for Speech Enhancement and Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2705–2715. [Google Scholar] [CrossRef]
Pizarro, M.; Kolossa, D.; Fischer, A. Robustifying Automatic Speech Recognition by Extracting Slowly Varying Features. In Proceedings of the SPSC 2021, Virtual, 10–12 November 2021; pp. 37–41. [Google Scholar]
Burchi, M.; Timofte, R. Audio-Visual Efficient Conformer for Robust Speech Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2258–2267. [Google Scholar]
Tang, W.; Long, G.; Liu, L.; Zhou, T.; Blumenstein, M.; Jiang, J. Omni-Scale CNNs: A Simple and Effective Kernel Size Configuration for Time Series Classification. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Weng, C.; Yu, C.; Cui, J.; Zhang, C.; Yu, D. Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 966–970. [Google Scholar]
Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline. In Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Hsinchu, China, 17–19 October 2017; pp. 1–5. [Google Scholar]
Dean, D.; Sridharan, S.; Vogt, R.; Mason, M. The QUT-NOISE-TIMIT Corpus for Evaluation of Voice Activity Detection Algorithms. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), Makuhari, Japan, 26–30 September 2010; pp. 3110–3113. [Google Scholar]
Varga, A.; Steeneken, H.J.M. Assessment for Automatic Speech Recognition: II. NOISEX-92: A Database and an Experiment to Study the Effect of Additive Noise on Speech Recognition Systems. Speech Commun. 1993, 12, 247–251. [Google Scholar] [CrossRef]

Figure 1. OS-Denseformer structure.

Figure 2. Feature extraction module.

Figure 3. Bilinear interpolation for FBank.

Figure 4. Anti-noise feature-encoding block structure.

Figure 5. OS-Conv module.

Figure 6. Synthesis method for noise-augmented training datasets.

Figure 7. CER of models under varying SNR.

Figure 8. Impact of model fine-tuning on anti-noise performance.

Figure 9. CosSim of feature embeddings across model architectures.

Table 1. Hyperparameter settings.

Parameters	Values
Optimizer	Adam
L2 Weight Decay Coefficient	1.0 × 10⁻⁶
Dropout	0.1
Initial Learning Rate	0.001
FBank Feature Dimension	80
Batch Size	32
Number of Attention Heads	4
Maximum Training Epochs	150 + 50
Warm-up Steps	5000
Initial CTC Weight Coefficient	0.7
Minimum CTC Weight Coefficient	0.3
Numerical Stabilizer $ϵ$	1.0 × 10⁻⁸

Table 2. Performance of different models on clean and noisy speech datasets.

Model	CER (%)						Params (M)	GFLOPs
Model	Dev-Clean	Dev-Matched Noise	Dev-Unmatched Noise	Test-Clean	Test-Matched Noise	Test-Unmatched Noise	Params (M)	GFLOPs
Transformer	6.09 ± 0.21	10.28 ± 0.38	10.53 ± 0.42	6.36 ± 0.23	10.73 ± 0.41	11.04 ± 0.45	260.6	631.7
Speech-mamba	5.43 ± 0.18	10.84 ± 0.45	11.32 ± 0.48	6.03 ± 0.22	11.63 ± 0.49	12.15 ± 0.52	76.0	90.2
Conformer-S	8.83 ± 0.32	13.42 ± 0.55	13.63 ± 0.58	9.42 ± 0.35	13.79 ± 0.59	13.98 ± 0.61	10.3	30.5
Conformer-M	6.37 ± 0.22	10.84 ± 0.41	11.02 ± 0.44	6.80 ± 0.25	11.42 ± 0.46	11.66 ± 0.49	32.6	79.0
Conformer-L	5.91 ± 0.20	10.03 ± 0.36	10.21 ± 0.39	6.03 ± 0.21	10.83 ± 0.40	11.32 ± 0.43	125.6	294.1
Squeezeformer-S	6.74 ± 0.24	11.06 ± 0.43	11.37 ± 0.47	7.21 ± 0.27	11.50 ± 0.46	11.84 ± 0.50	18.2	34.2
Squeezeformer-M	5.41 ± 0.17	9.84 ± 0.35	9.98 ± 0.37	5.94 ± 0.20	10.54 ± 0.38	10.83 ± 0.41	57.3	90.2
Squeezeformer-L	5.29 ± 0.16	9.48 ± 0.32	9.62 ± 0.34	5.80 ± 0.18	10.28 ± 0.36	10.58 ± 0.38	241.6	303.2
Zipformer-S	5.38 ± 0.15	9.51 ± 0.31	9.63 ± 0.33	5.63 ± 0.17	10.13 ± 0.35	10.43 ± 0.37	25.2	43.8
Zipformer-M	5.25 ± 0.14	8.81 ± 0.28	9.02 ± 0.30	5.53 ± 0.16	9.34 ± 0.31	9.58 ± 0.33	67.8	71.3
Zipformer-L	5.16 ± 0.13	8.58 ± 0.26	8.73 ± 0.28	5.39 ± 0.15	9.03 ± 0.29	9.28 ± 0.31	155.8	114.3
OS-Denseformer	5.13 ± 0.12	8.45 ± 0.24	8.63 ± 0.26	5.34 ± 0.14	8.87 ± 0.27	9.13 ± 0.29	112.7	87.5

Table 3. Performance of different modified modules.

Ablation	CER (%)				Params (M)
Ablation	Dev-Clean	Dev-Noise	Test-Clean	Test-Noise	Params (M)
OS-Denseformer	5.19	8.86	5.43	8.98	112.7
Multi-sampling Structure	5.54	9.21	5.78	9.34	148.6
Sampling Module	5.43	9.04	5.58	9.26	109.2
OS block	5.26	8.92	5.48	9.17	108.5
Densely Connected Convolution Module	5.26	8.96	5.50	9.06	114.9
ExpNorm	5.32	8.98	5.52	9.08	112.7
MBR Loss Function	5.47	9.13	5.58	9.28	112.7

Table 4. Comparison of model performance indicators.

Evaluation Indicator	OS-Denseformer		Conformer
Evaluation Indicator	Before Quant.	After Quant.	Before Quant.	After Quant.
Parameter Size (M)	114.8	29.2	126.8	32.9
Inference Latency (ms)	370/386/420	160/165/180	950/988/1050	420/430/470
Memory Consumption (M)	550/568.1/600	230/232.4/250	590/603.8/650	255/260.5/280
CER (%)	8.87	9.93	10.83	12.32

Table 5. Smartphone performance comparison.

Smartphone	Xiaomi 10	Xiaomi 12S	Vivo X100 Pro
Processor	Snapdragon 865	Snapdragon 8+	Dimensity 9300
Runtime Memory (M)	8192	12,288	16,384
Battery Capacity (mAh)	4780	4500	5400
Loading Time (ms)	173	156	132
Averaged Memory Consumption (M)	232.4	233.6	233.8
Maximum Memory Consumption (M)	319.4	320.7	320.1
Inference Latency (ms)	165	160	159
CER (%)	9.93	9.84	9.86
Averaged Memory Consumption (mAh)	80.61	77.63	77.41

Table 6. Quantization performance comparison in real-world noisy environments.

Environment	OS-Denseformer			Conformer
Environment	FP32 CER (%)	INT8 CER (%)	CER Increase	FP32 CER (%)	INT8 CER (%)	CER Increase
Server Room	$8.75 \pm 0.32$	$9.82 \pm 0.41$	$+ 1.07$	$9.15 \pm 0.38$	$10.52 \pm 0.47$	$+ 1.37$
Busy Street	$12.43 \pm 0.68$	$13.89 \pm 0.75$	$+ 1.46$	$13.12 \pm 0.72$	$15.03 \pm 0.84$	$+ 1.91$
Subway Platform	$15.27 \pm 0.92$	$17.15 \pm 1.03$	$+ 1.88$	$16.05 \pm 0.98$	$18.72 \pm 1.15$	$+ 2.67$
Riverside	$10.86 \pm 0.45$	$12.04 \pm 0.58$	$+ 1.18$	$11.42 \pm 0.52$	$13.18 \pm 0.66$	$+ 1.76$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Que, S.; Qian, L.; Li, M.; Wang, Q. OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition. Appl. Sci. 2025, 15, 12096. https://doi.org/10.3390/app152212096

AMA Style

Que S, Qian L, Li M, Wang Q. OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition. Applied Sciences. 2025; 15(22):12096. https://doi.org/10.3390/app152212096

Chicago/Turabian Style

Que, Shiqi, Liping Qian, Mingqing Li, and Qian Wang. 2025. "OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition" Applied Sciences 15, no. 22: 12096. https://doi.org/10.3390/app152212096

APA Style

Que, S., Qian, L., Li, M., & Wang, Q. (2025). OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition. Applied Sciences, 15(22), 12096. https://doi.org/10.3390/app152212096

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

OS-Denseformer: A Lightweight End-to-End Noise-Robust Method for Chinese Speech Recognition

Abstract

1. Introduction

2. Related Work

2.1. End-to-End ASR

2.2. Noise-Robust Speech Recognition

3. Proposed OS-Denseformer

3.1. OS-Denseformer Encoder

3.2. Sampling Module and Connector

3.3. Anti-Noise Feature-Encoding Block

3.4. Loss Function

4. Experimental Results and Analysis

4.1. Experimental Data and Preprocessing

4.2. Experimental Environment and Parameter Settings

4.3. Experimental Verification and Analysis

4.3.1. Comparison and Analysis with Mainstream Models

4.3.2. Model Fine-Tuning Results and Analysis

4.3.3. Ablation Experiments and Analysis

5. Mobile Deployment and Performance Analysis

6. Conclusions and Future Works

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A

Appendix A.1. ExpNorm

Appendix A.2. Sampling Module

Appendix A.3. MBR Loss Function

Appendix A.4. GFLOPs-CER Scatter Plot

Appendix A.5. Model Architecture Details

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI