Review

Integrating Speech Recognition into Intelligent Information Systems: From Statistical Models to Deep Learning

1 College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518122, China
2 College of Physics and Opto-Electronic Engineering, Shenzhen University, Shenzhen 518060, China
* Author to whom correspondence should be addressed.
Informatics 2025, 12(4), 107; https://doi.org/10.3390/informatics12040107
Submission received: 23 August 2025 / Revised: 28 September 2025 / Accepted: 29 September 2025 / Published: 4 October 2025
(This article belongs to the Section Machine Learning)

Abstract

Automatic speech recognition (ASR) has advanced rapidly, evolving from early template-matching systems to modern deep learning frameworks. This review systematically traces ASR’s technological evolution across four phases: the template-based era, statistical modeling approaches, the deep learning revolution, and the emergence of large-scale models under diverse learning paradigms. We analyze core technologies such as hidden Markov models (HMMs), Gaussian mixture models (GMMs), recurrent neural networks (RNNs), and recent architectures including Transformer-based models and Wav2Vec 2.0. Beyond algorithmic development, we examine how ASR integrates into intelligent information systems, analyzing real-world applications in healthcare, education, smart homes, enterprise systems, and automotive domains with attention to deployment considerations and system design. We also address persistent challenges—noise robustness, low-resource adaptation, and deployment efficiency—while exploring emerging solutions such as multimodal fusion, privacy-preserving modeling, and lightweight architectures. Finally, we outline future research directions to guide the development of robust, scalable, and intelligent ASR systems for complex, evolving environments.

1. Introduction

Automatic speech recognition (ASR), one of the cornerstones of Artificial Intelligence (AI), converts speech signals into text. Its core principles are based on acoustic analysis and language modeling to extract meaning from speech, enabling computers to interpret and process natural language. ASR has evolved significantly over decades, progressing from early rule-based and template-matching methods to statistical modeling frameworks, and most recently to deep-learning-based end-to-end systems. This transformation stems from advances in computational power, access to vast datasets, and breakthroughs in machine learning algorithms, with recent studies highlighting the pivotal role of neural architectures like Transformers and self-supervised learning in driving these advancements [1,2,3]. More recently, studies have spotlighted the ascendancy of large models, such as federated-learning-based ASR systems for privacy-preserving applications and lightweight architectures tailored for edge computing [4,5,6].
ASR technology is widely applied in various sectors, including commercial, industrial, agricultural, medical, and educational domains [7,8,9,10]. Its applications, such as intelligent voice assistants, automatic subtitle generation, smart home controls, and medical diagnostic tools, enhance human–computer interaction, expand access to information, and support accessibility through assistive technologies. Despite notable progress, evidenced by higher accuracy and broader use cases, ASR faces ongoing challenges. These include dependence on large annotated datasets, limited performance in noisy environments, difficulties with low-resource languages, and the need for efficient model inference. Recent ASR surveys have primarily adopted technology-centric perspectives, focusing on specific algorithmic domains. Kheddar et al. (2024) [11] concentrated on advanced deep learning paradigms including transfer learning, federated learning, and reinforcement learning, while comprehensive reviews by Dhanjal and Singh (2024) [12] emphasized neural network architectures spanning 2015–2021. The survey by Zahorian and Karnjanadecha (2023) [13] examined design principles for exploiting discriminative information in speech signals, and recent work by Khapra (2024) [14] focused on end-to-end system implementations across various domains. While these works, along with influential surveys by Kumar et al. (2018), Prabhavalkar et al. (2023), and Malik et al. (2021) [15,16,17], have substantially advanced our understanding of ASR technologies, they predominantly examine algorithms in isolation from their deployment contexts.
This survey adopts a systems integration perspective that differs from existing reviews. Rather than focusing solely on algorithmic performance, we examine how ASR technologies function within real-world intelligent information systems. We analyze ASR integration across four key domains: healthcare, education, smart homes, and enterprise environments. Our approach encompasses technologies ranging from traditional statistical models to modern self-supervised frameworks like wav2vec 2.0, HuBERT, and WavLM. This integration-focused methodology reveals practical considerations often overlooked in algorithm-centric reviews: deployment constraints, system architecture requirements, privacy considerations, and multimodal fusion needs.
This review systematically maps ASR’s evolutionary trajectory, offering insights into paradigm shifts, unresolved challenges, and future research directions. It categorizes ASR’s development into four distinct phases:
(1) Early Exploration and Template Matching Era: Initial efforts to develop ASR technology.
(2) Statistical Model-Driven Period: Emergence of statistical frameworks as a key driver of progress.
(3) Deep Learning Revolution Phase: Deep learning’s transformative impact on ASR development.
(4) Era of Large Models under Different Learning Paradigms: Growth of innovative paradigms within large-scale architectures.
This review examines ASR’s technological evolution by analyzing core methodologies, technical trajectories, and real-world outcomes across these phases. It traces the field’s journey from early experimentation to today’s cutting-edge innovations. The review pursues three primary objectives:
  • Identify Trends: Systematically uncover domain-specific technological trends in ASR development across different learning paradigms and application domains.
  • Evaluate Challenges: Critically assess persistent technical bottlenecks hindering progress, including both algorithmic limitations and practical deployment challenges in intelligent information systems.
  • Propose Directions: Offer forward-looking perspectives for future theoretical and applied research, emphasizing the integration of ASR within complex, evolving intelligent system environments.
By bridging historical insights with contemporary technological challenges, this work constructs a robust analytical framework that traces ASR’s evolution from early template matching to modern large-scale models. This comprehensive technological survey encompasses ASR literature spanning from foundational approaches (1950s) to contemporary developments (2025), with particular emphasis on paradigmatic shifts and intelligent systems integration. Literature selection prioritized seminal works that introduced fundamental concepts, landmark papers demonstrating significant performance improvements, and recent developments in self-supervised learning and large-scale models, primarily drawn from major venues (ICASSP, INTERSPEECH, ICLR, NeurIPS, etc.), databases (IEEE Xplore, ACM Digital Library, arXiv), and industry reports tracking technological advances. This survey serves as a critical reference for researchers to understand paradigmatic shifts, identify research gaps, and advance theoretical foundations in ASR development within intelligent information systems.

2. Early Exploration and Template Matching Era

Research on ASR dates back to the 1950s. In 1952, Davis et al. [18] developed the Audrey system, which recognized spoken digits from a single speaker with 97–99% accuracy. Audrey segmented speech into two frequency bands and used axis-crossing counters to track frequency variations, comparing the results against standard digit templates.
Despite its limitation to ten digits, Audrey demonstrated the feasibility of machine-based speech recognition. Subsequent efforts included phonetic typewriters at Princeton and Massachusetts Institute of Technology (MIT) [19,20,21], and foundational techniques such as dynamic programming (DP) and linear predictive coding (LPC) [22], laying the groundwork for future ASR systems.
In the 1960s, Japan advanced ASR hardware development. Tokyo Telecommunications Laboratory built a vowel recognizer [23], while Kyoto University created the Sonotype phonetic typewriter [24]. In the U.S., International Business Machines Corporation (IBM)’s Shoebox system gained public attention at the 1962 World’s Fair, recognizing 16 English words including arithmetic commands, which helped catalyze DARPA’s early investments in ASR research.
During the 1970s, the focus shifted to isolated word recognition, which enabled technical breakthroughs despite limited practical use. Itakura [25] introduced a minimum prediction residual method combining LPC with dynamic programming, enabling more structured signal alignment. Later, Sakoe and Chiba [26] optimized this into the dynamic time warping (DTW) algorithm, improving alignment efficiency through constraints like slope limits and symmetric forms.
While isolated word recognition dominated early research, its limitations in handling natural speech spurred interest in continuous speech recognition. With improved computational resources, researchers turned to statistical approaches—most notably hidden Markov models (HMMs) [27]—which provided a probabilistic framework for modeling variable-length speech sequences and supported the transition to more practical ASR systems.

3. Statistical-Model-Driven Period

Early successes in isolated word recognition encouraged research into more flexible speech input methods, leading to connected and continuous speech recognition [28]. Connected word recognition served as a bridge between isolated and continuous speech recognition, allowing natural-paced speech with detectable pauses. However, its reliance on clear word boundaries limited its application in fluent speech. Efforts mainly focused on refining matching algorithms. At Bell Labs, Myers et al. improved the DTW algorithm and proposed a level-building technique [29,30]. Lee and Rabiner introduced a frame-synchronous search algorithm [31], Sakoe and Chiba developed a DP-based time-normalization method [26], and Bridle et al. proposed a one-pass approach [32]—all contributing to the advancement of connected word recognition.
Continuous speech recognition aimed to handle natural, uninterrupted speech. The Harpy system by Carnegie Mellon University [33] was an early milestone, using finite state machines for constrained vocabulary recognition. During the 1980s, statistical methods, particularly hidden Markov models (HMMs) and Gaussian mixture models (GMMs), replaced pattern-matching techniques. HMMs modeled speech as a time series using parameters A, B, and π to capture temporal dependencies and dynamic changes [34]. Without requiring explicit segmentation, HMMs could infer word boundaries by maximizing state sequence probabilities, enabling modeling at phoneme, word, or sub-phoneme levels. Efficient algorithms like the forward-backward and Viterbi methods ensured computational tractability [34,35,36,37].
However, the discrete-state assumptions of HMMs limited their ability to model continuous features such as MFCCs. To overcome this, Rabiner proposed the continuous-density HMM (CD-HMM) framework [38], incorporating GMMs into state emission modeling. This enabled modeling in continuous feature spaces and robust, extensible training via the Baum–Welch algorithm and maximum likelihood estimation (MLE). Maximum a posteriori (MAP) estimation further improved performance, reducing the word error rate (WER) by 10–25% on benchmark corpora [39].
The GMM-HMM combination laid the foundation for early consumer systems like IBM ViaVoice [40]. Nonetheless, separate training of acoustic and language models caused error accumulation. Subsequent efforts focused on joint optimization of GMM-HMM and statistical language models (e.g., n-gram), which proved essential for advancing continuous speech recognition.

4. Deep Learning Revolution Phase

Although the GMM-HMM framework has achieved landmark breakthroughs driven by advanced joint optimization and big data, becoming a mainstream, widely adopted method in speech recognition, its limitations have gradually surfaced. These include:
(1) Insufficient Feature Representation Ability [41,42]
The GMM has limited capability to model high-dimensional, nonlinear speech features. In addition, it lacks robustness in complex noisy environments and requires a large number of mixture components to represent intricate acoustic distributions.
(2) Defects in Context Modeling [43]
The Markov assumption in HMM focuses on local state transitions, constraining its capacity to capture long-term dependencies in speech signals.
(3) Dependence on Artificial Features
GMM-HMM relies on manually designed acoustic features (such as MFCC and PLP), which may lose key information from the original speech signal.
(4) Limitations of the n-gram Language Model [44]
Due to the Markov assumption, n-gram language models overlook long-range semantic dependencies, reducing their effectiveness in resolving ambiguity within complex contexts.
The limitations of GMM-HMM prompted researchers to recognize that achieving speech recognition capable of transcribing daily spoken language required a new technological revolution. Consequently, four key efforts advanced speech recognition toward deep learning:
(1) Initial Exploration of Shallow Neural Networks [45,46]
In the 1990s, Bourlard and others integrated artificial neural networks (ANNs) with HMM. However, constrained by limited computing resources and training efficiency, this approach did not substantially outperform the traditional GMM-HMM model.
(2) Theoretical Reserves for Deep Learning [47]
In 2006, Hinton’s team introduced a layer-by-layer pre-training strategy for the deep belief network (DBN), addressing optimization challenges in deep networks, though not yet applied to speech tasks.
(3) The Maturity of Big Data and GPU Computing Power [48]
In 2009, the general-purpose GPU computing framework (CUDA 2.0) and the availability of thousand-hour corpora unleashed the potential of deep networks.
(4) The Germination of End-to-End Learning
Graves et al. (2004) [49] incorporated long short-term memory (LSTM) into speech sequence modeling, suggesting an alternative to the traditional HMM alignment mechanism.
In 2012, researchers from four institutions (Microsoft Research, Google, the University of Toronto, and IBM) jointly and systematically expounded the hybrid DNN-HMM model, which integrates deep neural networks with hidden Markov models [42]. Tested on the Switchboard task, this model achieved a 30% relative reduction in WER compared to GMM systems, highlighting the capacity of deep networks for feature abstraction. At this juncture, elements from the statistical era, such as discriminative objectives and the WFST decoding framework, merged with the new deep learning paradigm, marking the transition of speech recognition into a phase of data-driven, end-to-end optimization.

4.1. Introduction of Deep Neural Networks (DNNs)

In the GMM-HMM framework, the goal is to map a segment of speech to the most likely word sequence $\hat{Y}$, typically formulated as maximizing $P(Y \mid X)$, where $X$ is the observed acoustic signal. Direct computation is intractable, but by modeling the joint probability and applying Bayes’ rule, this is transformed into:
$$\hat{Y} = \operatorname*{argmax}_{Y} \left\{ P(X \mid Y)\, P(Y) \right\}$$
where $P(X \mid Y)$ denotes the acoustic model and $P(Y)$ the language model. The acoustic model estimates the likelihood of the observed features given a hypothesized word sequence, while the language model provides prior probabilities over word sequences, helping resolve ambiguities (e.g., “peace talks” vs. “peas talks”).
In the GMM-HMM system, this is implemented by decomposing the audio into frame-level features, mapping them to phonetic states through pre-processing (e.g., phoneme or triphone modeling), and modeling the emission probability $P(x \mid s)$, i.e., the probability of generating frame $x$ from state $s$. In contrast, a DNN outputs the posterior probability $P(s \mid x)$ through discriminative learning, representing “the probability of a state given a speech frame”. This posterior probability is then converted into an emission probability via Bayes’ formula,
$$P(x \mid s) = \frac{P(s \mid x)\, P(x)}{P(s)} \propto \frac{P(s \mid x)}{P(s)}$$
since $P(x)$ is constant for a given input. Here, $P(s)$ is typically estimated from training data. This transformation avoids explicit modeling of the feature distribution and enables the use of powerful discriminative models. By keeping the HMM topology and decoding structures unchanged, DNNs effectively replace GMMs as acoustic model components while preserving compatibility with the existing framework.
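In practice, the conversion from posteriors to emission scores reduces to a subtraction in the log domain. The minimal sketch below (an illustrative PyTorch example, not tied to any particular toolkit; the state priors are assumed to come from forced-alignment counts) shows how a hybrid DNN-HMM system could obtain scaled log-likelihoods from frame-level DNN log-posteriors:

```python
import torch

def scaled_log_likelihoods(log_posteriors: torch.Tensor,
                           log_state_priors: torch.Tensor) -> torch.Tensor:
    """Turn frame-level DNN log-posteriors log P(s|x) into scaled log-likelihoods
    log P(s|x) - log P(s), the quantity a hybrid DNN-HMM decoder consumes in
    place of GMM emission scores (log P(x) is a constant offset per frame)."""
    # log_posteriors: (num_frames, num_states), rows are log-softmax outputs
    # log_state_priors: (num_states,), e.g., from hypothetical forced-alignment counts
    return log_posteriors - log_state_priors

if __name__ == "__main__":
    T, S = 200, 3000                               # frames, tied states (illustrative sizes)
    log_post = torch.log_softmax(torch.randn(T, S), dim=-1)
    counts = torch.randint(1, 1000, (S,)).float()  # placeholder alignment counts
    log_priors = torch.log(counts / counts.sum())
    print(scaled_log_likelihoods(log_post, log_priors).shape)  # torch.Size([200, 3000])
```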
DNNs extract hierarchical representations of input data via neurons in multiple hidden layers, which enables the effective capture of complex nonlinear relationships within speech signals, as illustrated in Figure 1. Successive layers progressively refine these representations, transforming raw speech features into high-order discriminative patterns that better handle ambiguity and noise compared to GMMs. However, when replacing GMMs with DNNs for modeling HMM emission probabilities, an inevitable critical challenge arises in the form of text alignment, which involves the precise mapping of speech signals in audio to their corresponding text. Unlike GMMs, which inherently provide frame-level state alignments during Baum–Welch training, DNNs require supervised alignment information. Thus, the standard approach involves first training a GMM-HMM system to obtain frame-level state alignments, which are then used to train the DNNs, thereby constructing a complete DNN-HMM speech recognition system.
In early speech recognition systems, the term DNN typically denoted the multilayer perceptron (MLP). However, as deep learning technologies evolved, the definition of DNNs gradually expanded to include other architectures (Nassif et al., 2019) [50], such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). These variants enhance the modeling capacity of acoustic models.

4.2. Application of Recurrent Neural Networks (RNNs)

RNNs are powerful tools for processing sequential data. In 1982, Hopfield (1982) [51] proposed the Hopfield network, an early form of RNNs. Later, Rumelhart et al. (1985) [52] introduced the concept of the “backpropagation network”, which provided the first description of RNNs in the modern sense. In Section 4, when summarizing the limitations of GMM-HMM, we noted that traditional statistical models rely on independent feature vectors when modeling sequential data, thereby neglecting the temporal dynamics and contextual dependencies inherent in speech signals.
However, early RNN applications in speech recognition did not achieve high accuracy, generally maintaining levels between 48% and 54% [53], and exhibited high data loss rates. This was mainly due to the gradient vanishing and gradient exploding issues encountered during RNN training [54]. RNNs are trained using backpropagation through time (BPTT), which unfolds the network over time steps and computes the gradients of the loss function with respect to the model parameters. During backpropagation, gradients are recursively propagated through the network via the chain rule. Even with appropriate initialization of activation functions and weights, if the eigenvalues of the weight matrix are less than one, the gradients may decay exponentially over time, hindering the model’s ability to capture long-term dependencies; this phenomenon is referred to as the vanishing gradient problem. Conversely, when the eigenvalues exceed one, the gradients may grow exponentially, resulting in excessively large weight updates and unstable training; this issue is commonly known as the exploding gradient problem. Because speech signals are long-duration sequential data that require capturing extended contextual information and are further influenced by factors such as speaker accent and speech rate, the large number of BPTT steps amplifies the likelihood of gradient vanishing or explosion.
To address these issues, Hochreiter and Schmidhuber (1997) [55] proposed the LSTM network, which effectively enhances the neural network’s ability to model long-term dependencies by introducing a gating mechanism. The core of LSTM lies in its cell state and linear propagation properties, which directly mitigate the risks of gradient decay and explosion during BPTT.
Building upon this advancement, larger-scale supervised ASR systems emerged. One representative example is Baidu’s Deep Speech series [48,56], with Deep Speech 2 [56] being among the earliest large-scale end-to-end ASR models. It combined convolutional layers, LSTM-based recurrent layers, and fully connected layers, trained with the CTC loss function. By leveraging massive datasets, extensive data augmentation, and optimization techniques such as batch normalization [57] and SortaGrad, Deep Speech 2 achieved state-of-the-art results in both English and Mandarin, in some cases matching or surpassing human transcription accuracy. Its efficient system design and deployment enabled low-latency, high-throughput online ASR services, setting a new benchmark for practical end-to-end speech recognition.
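To make these remedies concrete, the sketch below shows a minimal bidirectional LSTM acoustic model in PyTorch together with gradient-norm clipping, the standard practical safeguard against exploding gradients during BPTT. All sizes and hyperparameters are illustrative, and the placeholder loss merely stands in for the CTC objective introduced in Section 4.4:

```python
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    """Minimal bidirectional LSTM acoustic model mapping feature frames to
    per-frame label scores (e.g., characters for CTC training)."""
    def __init__(self, n_feats=80, hidden=320, n_layers=3, n_labels=29):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=n_layers,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_labels)

    def forward(self, feats):              # feats: (batch, time, n_feats)
        out, _ = self.lstm(feats)
        return self.proj(out)              # (batch, time, n_labels)

model = BiLSTMAcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(4, 150, 80)            # dummy batch of spectral features
optimizer.zero_grad()
logits = model(feats)
loss = logits.pow(2).mean()                # placeholder loss for illustration only
loss.backward()
# Gradient clipping limits the update magnitude when gradients explode during BPTT.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```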
Deep bidirectional LSTM, when combined with end-to-end training by Graves et al. (2013) [43], achieved a remarkable reduction in the phoneme error rate (PER) to 17.7% on the TIMIT dataset [58], outperforming prior methods at that time. However, traditional LSTM models had issues such as a large parameter count, high computational cost, and extended training time. To address these challenges, Cho et al. (2014) [59] introduced the gated recurrent unit (GRU), a streamlined model equipped with an update gate and a reset gate. GRU successfully decreased computational complexity, enhanced training efficiency, and delivered performance similar to LSTM across various tasks [59,60,61,62], making it suitable for resource-constrained scenarios or applications requiring rapid iteration.
Subsequently, Graves and Jaitly (2014) [63] developed a speech recognition system using a deep bidirectional LSTM recurrent neural network integrated with the connectionist temporal classification (CTC) objective function. This system recorded a WER of 27.3% on the Wall Street Journal corpus. When combined with a baseline system, it managed to lower the error rate to 6.7%. Although this system still retained some modular elements, it established a basis for later end-to-end speech recognition systems. Table 1 provides a detailed comparison of the differences between LSTM and GRU (unidirectional and bidirectional).
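To make the LSTM/GRU contrast in Table 1 more tangible, the following sketch (with purely illustrative layer sizes) counts the parameters of unidirectional and bidirectional variants; the roughly 25% saving of the GRU follows from its having three weight transforms per direction (reset gate, update gate, candidate) versus the LSTM’s four (input, forget, and output gates plus candidate):

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

d_in, d_hidden = 80, 320                   # assumed feature and hidden sizes
for name, rnn in [
    ("LSTM (uni)", nn.LSTM(d_in, d_hidden)),
    ("LSTM (bi)",  nn.LSTM(d_in, d_hidden, bidirectional=True)),
    ("GRU (uni)",  nn.GRU(d_in, d_hidden)),
    ("GRU (bi)",   nn.GRU(d_in, d_hidden, bidirectional=True)),
]:
    print(f"{name:12s} {count_params(rnn):,} parameters")
# At equal width, the GRU uses roughly 3/4 of the recurrent parameters of the LSTM,
# which translates into lower computational cost and faster training.
```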

4.3. Fusion of Convolutional Neural Networks (CNNs)

CNNs originally excelled in image processing due to their convolutional and pooling layers, which extract features invariant to translations and other transformations. In ASR, speech signals are transformed into two-dimensional spectrograms (time and frequency axes) using methods like the short-time Fourier transform (STFT) or Mel spectrograms, which aligns with the structure of image data. This similarity facilitated the application of CNNs to speech signals, leading researchers to explore their potential in acoustic modeling [64,65]. Abdel-Hamid et al. (2014) [66] extended the weight-sharing concept from time-delay neural networks [67] by introducing limited weight sharing: they applied separate weight sets across frequency bands, with only convolutional units linked to the same pooling unit sharing weights. They utilized Mel-frequency spectral coefficient (MFSC) features, structuring the speech signal for CNN processing by assigning static features, first-order derivatives, and second-order derivatives from a 15-frame context window to the red, green, and blue channels, respectively. Figure 2 presents the specific form. This method achieved a 6% to 10% reduction in error rate compared to DNNs across TIMIT phoneme recognition and large-vocabulary speech recognition tasks in speech search scenarios.
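The channel arrangement described above can be reproduced with standard signal-processing utilities. The sketch below (a simplified illustration using torchaudio; it applies an ordinary weight-shared 2D convolution rather than the limited weight sharing of the original work, and all sizes are assumptions) stacks static log-Mel features with their first- and second-order derivatives as three input channels:

```python
import torch
import torchaudio

# Assumed: a 1-second, 16 kHz waveform; in practice this would come from torchaudio.load(...)
waveform = torch.randn(1, 16000)

# Log-Mel filterbank ("MFSC-like") features: (channel, n_mels, time)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                           hop_length=160, n_mels=40)(waveform)
log_mel = torch.log(mel + 1e-6)

# First- and second-order derivatives computed along the time axis
delta = torchaudio.functional.compute_deltas(log_mel)
delta2 = torchaudio.functional.compute_deltas(delta)

# Stack static/delta/delta-delta as three input channels, analogous to the
# R/G/B arrangement described above: (3, n_mels, time)
feats = torch.cat([log_mel, delta, delta2], dim=0)

# A small weight-shared 2D convolution over the (frequency, time) plane
conv = torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3), padding=1)
out = conv(feats.unsqueeze(0))             # (1, 32, n_mels, time)
print(feats.shape, out.shape)
```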
The work of Abdel-Hamid et al. (2014) [66] laid the foundation for the application of CNNs in ASR, with a focus on one-dimensional convolution. One year later, Sainath et al. (2015) [68] extended this approach by adopting two-dimensional convolutions to jointly model temporal and frequency correlations. They applied CNNs to three large-vocabulary continuous speech recognition (LVCSR) tasks. By integrating a CNN/DNN hybrid architecture, incorporating speaker adaptation, using rectified linear units (ReLU), and applying dropout techniques, they enhanced the performance across multiple datasets. Specifically, their model reduced the WER by 12% to 14% compared to DNNs. The recorded WERs were 13.6% for a 50 h broadcast news task, 12.7% for a 400 h broadcast news task, and 10.7% for the 300 h switchboard task.
In the following years, CNNs attracted increasing attention from researchers in the field of speech recognition due to their strong ability to extract local features from speech signals and effectively process the acoustic feature space. This growing interest propelled the continuous development of CNN-based ASR technology. As research advanced, CNNs were gradually applied to more challenging scenarios, such as dialect recognition and low-resource speech transcription [69]. For instance, in dialect recognition, Kurdish comprises three major dialects: Northern Kurdish, Central Kurdish, and Hawrami. Ghafoor et al. (2021) [70] achieved an average prediction accuracy of 95.53% when using a 1D CNN for Kurdish dialect classification. Moreover, researchers found that CNNs show strong compatibility with other deep learning models, such as RNNs and Transformers [71,72]. Hybrid architectures that combine CNNs with these models have demonstrated remarkable adaptability and high accuracy in various speech recognition tasks.

4.4. Connectionist Temporal Classification (CTC) and Listen, Attend and Spell (LAS)

In the field of ASR, the advent of end-to-end technology was a transformative milestone, breaking away from the traditional paradigms of speech recognition systems and ushering in a new development era. It is important to note, however, that end-to-end speech recognition did not emerge overnight but was built upon years of accumulated technological advancements. Indeed, it is difficult to pinpoint a single paper or experiment as the origin of end-to-end speech recognition. When looking back on the evolution of these technologies, the LSTM and GRU discussed in Section 4.2 already began to demonstrate their potential in the early stages of end-to-end speech recognition. Their remarkable ability to capture the complex temporal dependencies within speech sequences led to their innovative application in end-to-end models, thereby laying a solid foundation for subsequent breakthroughs and emerging as key drivers in the development of end-to-end speech recognition technology. This section provides a brief introduction to two landmark end-to-end techniques, namely CTC and LAS. Additional technologies that have contributed to the advancement of end-to-end speech recognition are summarized in Table 2. This table provides readers with an intuitive overview of the overall development trajectory and landscape of these advancements.
In 2006, Graves et al. (2006) [73] introduced the CTC technique to address the alignment issue between speech signals and text sequences. In traditional speech recognition, achieving precise alignment between speech and text typically requires extensive manual labeling. CTC overcomes this challenge by introducing a “blank label” to extend the original label set. This allows the model to automatically learn the alignment during training. Subsequently, the model determines the output sequence by merging repeated labels and removing the blank symbols, thus enabling the direct prediction and output of text sequences from the input speech signal. Figure 3 illustrates how CTC performs automatic alignment. This approach significantly simplifies the training process.
When given an input speech feature sequence $X = (x_1, x_2, \ldots, x_T)$ and a target text sequence $Y = (y_1, y_2, \ldots, y_U)$, CTC estimates the probability of the target sequence by summing over all possible alignment paths. At its core, CTC performs a comprehensive search over all valid mappings from input frames to output labels, enabling it to capture the likelihood of a sequence without requiring explicit alignment. The resulting loss function is defined as $-\log P(Y \mid X)$. Since directly optimizing $P(Y \mid X)$ is challenging due to its nonlinear nature, the logarithmic form is adopted. By leveraging the monotonicity of the logarithm, maximizing $P(Y \mid X)$ becomes equivalent to minimizing $-\log P(Y \mid X)$. This transformation facilitates gradient-based optimization and makes the training process more tractable in practice.
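In modern toolkits this objective is available directly. The sketch below (dimensions and label inventory are illustrative) computes the CTC loss, i.e., $-\log P(Y \mid X)$ averaged over a dummy batch, from per-frame log-probabilities using PyTorch’s built-in nn.CTCLoss, with index 0 reserved for the blank label:

```python
import torch
import torch.nn as nn

# Suppose an acoustic encoder has produced per-frame log-probabilities over
# a character set of size C (index 0 reserved for the CTC blank).
T, N, C, U = 120, 2, 30, 18                # frames, batch, labels, max target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # (time, batch, classes)

targets = torch.randint(1, C, (N, U))       # dummy target label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([U, U - 4])   # variable-length targets

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)  # = -log P(Y|X), batch average
loss.backward()                             # gradients flow back into the encoder
print(float(loss))
```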
Bahdanau et al. (2014) [74] proposed an innovative technique called the alignment model, which later became known as the attention mechanism. Originally introduced in neural machine translation tasks to address the information loss issue in traditional encoder-decoder frameworks, this mechanism aligns the input and output sequences, allowing the model to more effectively capture the correspondence between the source and target languages and thereby significantly enhancing translation accuracy and fluency. Building on this technology, an end-to-end model called Listen, Attend and Spell (LAS) was proposed by Chan et al. (2016) [75]. LAS consists primarily of an encoder recurrent neural network, known as the listener, and a decoder recurrent neural network, referred to as the speller.
The listener utilizes a pyramidal bidirectional long short-term memory network (pBLSTM). This network systematically reduces the temporal resolution through a multi-layer structure. In each layer, the pBLSTM halves the time steps of the input sequence and cumulatively compresses the time dimension by a factor of eight across all layers. This pyramidal structure preserves the sequential information of the speech signal while decreasing the computational complexity for subsequent processing. The speller, which typically consists of recurrent neural network architectures (such as LSTM or GRU), effectively models sequence dependencies. Based on the speech features processed by the listener and a context vector obtained from the attention mechanism, the speller generates the corresponding character sequence at each time step, thereby converting speech into text.
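The time reduction performed by the listener is easy to express in code. The following sketch (a simplified pBLSTM layer; feature and hidden dimensions are assumptions rather than those of the original paper) halves the temporal resolution at each layer by concatenating adjacent frames, so that three stacked layers compress the time axis by a factor of eight:

```python
import torch
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pBLSTM layer: concatenate each pair of adjacent frames, halving the
    time resolution, then run a bidirectional LSTM over the shorter sequence."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                        # x: (batch, time, input_dim)
        b, t, d = x.shape
        if t % 2 == 1:                            # drop a trailing frame if needed
            x = x[:, :-1, :]
            t -= 1
        x = x.reshape(b, t // 2, d * 2)           # merge adjacent frames
        out, _ = self.blstm(x)
        return out                                # (batch, time/2, 2*hidden_dim)

# Three stacked layers reduce the time dimension by a factor of eight, as in the LAS listener.
listener = nn.Sequential(
    PyramidalBLSTMLayer(40, 256),
    PyramidalBLSTMLayer(512, 256),
    PyramidalBLSTMLayer(512, 256),
)
feats = torch.randn(2, 400, 40)                   # (batch, frames, filterbank dims)
print(listener(feats).shape)                      # torch.Size([2, 50, 512])
```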
Table 2. Technologies that contribute to the development of end-to-end speech recognition technology.

| Technology | Description | Contribution |
| --- | --- | --- |
| Recurrent Neural Aligner [76] | Based on an encoder-decoder framework, it introduces blank labels to define probability distributions and is trained using approximate dynamic programming and sampling techniques. | Achieves accuracy comparable to CTC word models on YouTube video transcription tasks; bidirectional models perform well in mobile dictation tasks. |
| Neural Transducer [77] | Makes incremental predictions based on partial input and output sequences and is trained using dynamic programming algorithms. | Suitable for online tasks, with performance in the TIMIT phoneme recognition task on par with state-of-the-art models. |
| RNN Transducer [78] | Comprises transcription and prediction networks, extends the output space to define distributions, and is trained using the forward-backward algorithm with beam search during testing. | Achieves low error rates in the TIMIT phoneme recognition task, effectively integrating acoustic and linguistic information. |
| Monotonic Chunkwise Attention [79] | Adaptively segments the input sequence into chunks, applies soft attention within each chunk, supports online decoding with linear time complexity, and can handle local reordering. | Delivers state-of-the-art performance in online speech recognition and significantly enhances performance in document summarization tasks. ² |

² Performance improvements are particularly notable in handling long-form documents where monotonic attention constraints align well with the sequential nature of summarization.

5. The Era of Large-Scale Models in Different Learning Paradigms

5.1. Necessity of Large-Scale Models

Deep learning has profoundly transformed speech recognition, triggering a far-reaching revolution over the past decade. This has significantly propelled the development of artificial intelligence and simplified the architectures of ASR systems. A comparison of Figure 4a,b shows that modern ASR systems grounded in deep learning not only have simpler architectures but also deliver remarkable improvements in performance.
During the era when supervised learning methods dominated the field of speech recognition, model training was highly reliant on large quantities of manually labeled data. This was a process that entailed substantial costs in terms of both time and money. For example, creating a high-quality speech recognition dataset that encompasses a wide variety of accents, language styles, and scenarios requires a significant investment of manpower and resources. The construction of the original TIMIT speech dataset, for instance, cost approximately USD 1.5 million [80]. Moreover, with the further development of the big data era and the global spread of the Internet, the volume of data has experienced explosive growth [81]. The vast quantity of unlabeled data harbors rich information [82]. However, traditional supervised learning methods, which rely heavily on extensive manual labeling, are unable to fully utilize this unlabeled data [83], resulting in substantial resource wastage. Coupled with the rapid development in sectors such as manufacturing and services, the demand for AI systems with improved generalization capabilities and the ability to handle complex tasks has been continuously rising. Consequently, pure deep learning has become inadequate, leading to the emergence of large-scale models that draw on different learning paradigms.

5.2. The Preeminent Leadership of the Transformer

The Transformer architecture [84] has revolutionized sequence modeling tasks, particularly in ASR. Unlike traditional RNN-based approaches that process audio sequences sequentially, Transformer’s self-attention mechanism enables parallel processing of the entire input sequence, making it exceptionally suitable for capturing long-range dependencies in speech signals—a critical requirement for handling coarticulation effects and contextual dependencies that span across multiple phonemes or words. In ASR applications, Transformer models typically adopt an encoder-decoder framework where the encoder processes acoustic feature sequences (such as Mel-filterbank features or raw waveforms) and the decoder generates corresponding text sequences. Each encoder layer consists of multi-head self-attention and feed-forward sublayers, connected through residual connections [85] and followed by layer normalization [86]. This design enhances training stability and alleviates gradient vanishing problems, which is particularly important for processing long speech sequences. The self-attention mechanism allows the model to dynamically focus on different temporal regions of the speech signal, enabling effective modeling of acoustic variability across speakers, speaking rates, and phonetic contexts.
Beyond the standard attention mechanisms, ASR systems have adopted various specialized attention variants to address speech-specific challenges, as summarized in Table 3. These include additive attention, which enables nonlinear combination of queries and keys for explicit alignment modeling; dot-product attention, which provides efficient computation with normalized scaling; multi-head attention for parallel modeling in multiple subspaces to capture diverse acoustic features; location-aware attention, which incorporates previous attention weights to enhance monotonic alignment between audio and text sequences; and adaptive attention, which dynamically adjusts attention ranges or weights to optimize computational efficiency for variable-length speech inputs.
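At the core of all these variants is the same operation: every frame queries every other frame and aggregates a weighted context. The sketch below (illustrative sizes, a single layer only) runs multi-head self-attention over a sequence of acoustic frames with PyTorch’s nn.MultiheadAttention; a full encoder would stack many such layers with feed-forward sublayers, residual connections, and normalization:

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
frames = torch.randn(1, 200, d_model)            # (batch, time, d_model) encoder input

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
# Self-attention: queries, keys, and values are all the same frame sequence.
context, attn_weights = mha(frames, frames, frames)

print(context.shape)        # torch.Size([1, 200, 256]) -- contextualized frames
print(attn_weights.shape)   # torch.Size([1, 200, 200]) -- each frame's weights over all frames
```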
The computational parallelism of Transformer, its scalability in model capacity, and its natural compatibility with large-scale pretraining have collectively facilitated its successful transition from NLP to ASR. However, despite its outstanding performance, Transformer also has inherent limitations, which present challenges in its application to ASR:
(1) High Computational Demand: The attention mechanism in Transformer scales quadratically with sequence length, leading to a significant increase in computational cost when processing long speech inputs.
(2) Insufficient Capture of Short-Term Local Features: Local characteristics of speech signals, such as the instantaneous variations of phonemes, are crucial for accurate recognition. However, pure Transformer models are less effective in capturing these fine-grained temporal features compared to CNNs.
(3) Limitations to Offline ASR Tasks Due to Architectural Features: The architectural requirements of encoder-decoder-based Transformers, which necessitate the complete speech utterance as input, limit their application to offline ASR tasks and make them challenging to deploy in streaming scenarios that require real-time output generation, such as generating transcriptions shortly after each spoken word [88].
To address the inherent limitations of the Transformer in ASR applications, researchers have proposed various improved solutions. First, the convolution-augmented Transformer (Conformer) incorporates convolutional modules, leveraging the strength of CNNs in local feature extraction while retaining the Transformer’s global attention mechanism; this allows the model to consider both global and local information and effectively improves speech recognition performance. Beyond architectural modifications, sparse attention [89] and linear attention [90] have emerged as promising approaches to mitigate the computational overhead of self-attention. Notably, efficient Transformer variants such as Performer [91] and Linformer [92] demonstrate unique advantages in computational efficiency. Linformer introduces a low-rank projection matrix, reducing the time and space complexity of self-attention from the conventional $O(n^2)$ to approximately $O(n)$. Performer, on the other hand, employs random feature mapping techniques to achieve a similar $O(n)$ complexity.
Additionally, ContextNet [93] introduces depthwise separable convolutions in the encoder and incorporates global context information into convolution layers by adding squeeze-and-excitation modules, effectively capturing fine-grained speech details. Branchformer [94] employs a parallel branch structure where one branch uses global self-attention to model long-range dependencies while the other branch utilizes an MLP module with convolutional gating (cgMLP) to extract local patterns, enabling effective modeling of dependencies across various temporal ranges. Transformer Transducer [95] combines Transformer encoders with the RNN-T framework, achieving frame-synchronous real-time decoding by limiting the left context for self-attention in Transformer layers, making decoding computationally tractable for streaming with only slight accuracy degradation. These remarkable optimization achievements and architectural innovations effectively address the key challenges of pure Transformer models in computational efficiency, local feature modeling, and real-time processing, making the application of Transformer in the field of ASR more promising and providing new ideas for solving computational resource bottlenecks when processing long-sequence speech data.
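The streaming idea behind Transformer Transducer-style encoders can be illustrated with a simple attention mask. The sketch below (a hypothetical helper, not code from any of the cited systems) builds a Boolean mask that forbids attention to future frames and to frames beyond a fixed left context, which is what makes frame-synchronous decoding computationally tractable:

```python
import torch

def limited_context_mask(seq_len: int, left_context: int) -> torch.Tensor:
    """Boolean attention mask where True marks positions that must NOT be attended to:
    each frame may look at itself and at most `left_context` frames to its left,
    and never at future frames (causal, streaming-friendly attention)."""
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)   # rel[i, j] = j - i (key index minus query index)
    return (rel > 0) | (rel < -left_context)

mask = limited_context_mask(seq_len=6, left_context=2)
print(mask.int())
# Such a mask can be passed as `attn_mask` to a self-attention layer so that decoding
# proceeds frame by frame without waiting for the full utterance.
```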
Speech signals, as continuous sequential data, contain rich long-range dependencies, such as the semantic coherence within a sentence or the overall patterns of speech prosody. Unlike RNNs, which propagate information recursively, the Transformer has the capacity to directly capture the relationships between any two positions in the input sequence. This capacity enables the Transformer to stand out in modeling long-distance dependencies in speech, capturing crucial features like intonation and rhythm that are essential for ASR. In contrast, although CNNs perform commendably in extracting local features, their receptive fields are inherently constrained by the size of the convolutional kernels, making them less proficient in capturing distant associations. However, the Transformer’s dominance in ASR did not materialize overnight; instead, it was gradually established through a series of landmark models. In 2018, Dong et al. [96] introduced Speech-Transformer along with a 2D-attention mechanism, marking the first application of the Transformer architecture to end-to-end speech recognition tasks. To prevent overfitting, they implemented several strategies during training, including using the Adam optimizer and adjusting the learning rate. This model achieved a WER of 10.9% on the Wall Street Journal (WSJ) dataset [97]. Remarkably, the entire training process was completed in just 1.2 days on an NVIDIA K80, which was significantly faster than the contemporary RNN-based sequence-to-sequence models. These promising results prompted researchers to recognize the immense potential of Transformer architecture for speech recognition. Subsequently, in 2020, the proposal of the Conformer model [71] firmly established its position in the field of speech recognition.
As previously stated, CNNs are highly proficient in extracting local features, while Transformers are skilled at capturing global dependencies. The Conformer model combines convolutional modules with attention mechanisms. In doing so, it retains sensitivity to short-term local features (such as the spectral details within speech frames) and maintains the advantages of long-sequence modeling. The core of the Conformer’s design is its encoder architecture, which effectively strikes a balance between these two aspects. The detailed structure of the encoder is depicted in Figure 5.
In this architecture, the multi-head self-attention module is responsible for capturing global dependencies, while the convolutional module serves to extract local features. The integration of these two modules allows for the effective modeling of both local and global dependencies within audio sequences. Without utilizing an external language model, Conformer achieved a WER of 4.3% on the test-other dataset of the LibriSpeech [98] benchmark and a WER of 2.1% on the test dataset. When an external language model was incorporated, the WERs decreased to 3.9% and 1.9%, respectively, significantly outperforming the best RNN-based models of the same period. Subsequently, Google’s Transformer-based ASR models and Meta AI (formerly known as Facebook)’s speech processing applications also widely adopted this architecture. The adoption of Conformer by these leading technology companies further strengthened the dominant position of the Transformer in the ASR field [99,100,101].
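A simplified Conformer block can be written compactly. The sketch below follows the macaron structure described above (half-step feed-forward, self-attention, convolution module, half-step feed-forward, final normalization) but omits relative positional encoding, dropout, and other refinements of the original model; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Simplified Conformer block: half-step FFN, multi-head self-attention,
    convolution module, half-step FFN, final LayerNorm."""
    def __init__(self, d_model=256, n_heads=4, conv_kernel=31, ffn_mult=4):
        super().__init__()
        self.ffn1 = self._ffn(d_model, ffn_mult)
        self.norm_att = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(                       # convolution module
            nn.Conv1d(d_model, 2 * d_model, 1),          # pointwise conv
            nn.GLU(dim=1),                               # gating
            nn.Conv1d(d_model, d_model, conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),  # depthwise conv
            nn.BatchNorm1d(d_model),
            nn.SiLU(),                                   # Swish activation
            nn.Conv1d(d_model, d_model, 1),              # pointwise conv
        )
        self.ffn2 = self._ffn(d_model, ffn_mult)
        self.norm_out = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model, mult):
        return nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, mult * d_model),
                             nn.SiLU(), nn.Linear(mult * d_model, d_model))

    def forward(self, x):                                # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)                       # half-step feed-forward
        a = self.norm_att(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global self-attention
        c = self.norm_conv(x).transpose(1, 2)            # (batch, d_model, time) for Conv1d
        x = x + self.conv(c).transpose(1, 2)             # local feature extraction
        x = x + 0.5 * self.ffn2(x)                       # half-step feed-forward
        return self.norm_out(x)

block = ConformerBlockSketch()
print(block(torch.randn(2, 100, 256)).shape)             # torch.Size([2, 100, 256])
```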
The emergence of the Transformer is undoubtedly a revolutionary change with profound implications in the field of ASR, reshaping the modeling paradigm of the entire field. Its robust parallel computing capabilities, remarkable model capacity expansion performance, and innate adaptability to large-scale pre-training have rendered it an ideal architecture for large-scale models across different learning paradigms.

5.3. The Critical Role of Learning Paradigms in the Development of Large ASR Models

In recent years, with the emergence of large-scale models, the performance of ASR has been remarkably enhanced. These large-scale models are typically trained on extensive datasets. This not only significantly boosts the recognition accuracy of ASR systems but also propels a profound evolution of learning paradigms. The integration of large-scale data and powerful computing resources enables ASR systems to perform outstandingly in diverse and intricate scenarios, such as speech recognition in noisy environments, separation of multi-person conversations, and handling of dialects and accents. Compared with traditional models, they demonstrate greater robustness and adaptability. The learning paradigms commonly employed in large ASR models can be broadly categorized into supervised learning, semi-supervised learning, and self-supervised learning. A schematic diagram of their structure is shown in Figure 6. In this section, we will explore in depth the application of different learning paradigms in large ASR models and conduct a detailed analysis of these models.

5.3.1. Large ASR Models Under Supervised Learning

Supervised learning is a classical training approach that utilizes labeled data to train a model, allowing it to learn the mapping relationship between inputs and outputs from the input data. In the field of ASR, this process is reflected as a strong reliance on the input speech and its corresponding text labels. The input speech signals contain abundant acoustic information, while the text labels act as crucial supervisory signals, guiding the model to deeply learn the correspondence between speech and text. The model optimizes its own parameters by minimizing the discrepancy between the predicted output and the true label, thereby continuously enhancing the accuracy and reliability of speech recognition. During the training process, we quantify the discrepancy between the predicted output and the true label as a loss function. There are several common loss functions in ASR, as shown in Table 4.
Gradient descent algorithms based on the loss function (such as stochastic gradient descent, SGD) [102] update the model parameters through backpropagation, thus gradually improving the accuracy of recognition. The cross-entropy loss is widely used in ASR because it can effectively measure the difference between probability distributions and is suitable for multi-class classification tasks (such as phoneme or word recognition).
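A single supervised update step therefore combines a loss from Table 4 with a gradient-based optimizer. The minimal sketch below (a frame-level phoneme classifier with illustrative sizes, not a full ASR pipeline) pairs cross-entropy with SGD:

```python
import torch
import torch.nn as nn

# One supervised training step: cross-entropy between predicted distributions
# and reference labels, followed by an SGD parameter update.
n_feats, n_phones = 40, 48
model = nn.Sequential(nn.Linear(n_feats, 256), nn.ReLU(), nn.Linear(256, n_phones))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(32, n_feats)               # a mini-batch of labeled frames
labels = torch.randint(0, n_phones, (32,))     # reference phoneme/state labels

optimizer.zero_grad()
loss = criterion(model(feats), labels)         # discrepancy between prediction and label
loss.backward()                                # backpropagation
optimizer.step()                               # gradient-descent parameter update
```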
Supervised learning was among the earliest learning paradigms used in speech recognition. The experiments reported in the Gaussian mixture model–hidden Markov model (GMM–HMM) and Conformer papers discussed earlier were all conducted under supervised learning.
However, despite its contributions, supervised learning, as a benchmark paradigm for early large-scale models, has inherent limitations that impede its full adaptation to the era of big data and extensive information integration. The performance of large-scale models based on supervised learning hinges on the scale and quality of the labeled data [82]. Annotating speech-text alignment data incurs extremely high costs. Especially for low-resource languages, professional domains (such as medicine and law), or dialectal scenarios, the scarcity of data and annotation noise (such as transcription errors) will directly undermine the performance of the model. Additionally, the relationship between annotation scale and model accuracy typically follows a power-law trend rather than being proportional, meaning that each incremental gain in accuracy requires exponentially more labeled data [103]. For example, if annotating 10,000 h of data can elevate the model accuracy to 90%, increasing it to 93% may demand an additional tens of thousands of hours or even more of annotated data.
Another crucial limitation lies in generalization capability. When the actual application scenario diverges from the distribution of the training data (such as variations in accents, differences in background noise, or disparities in devices), recognition performance may decline significantly. For instance, dataset shift [104], in which the training and testing distributions differ, undermines the reliability and generalization ability of the model, making its performance unstable between the training environment and the deployment environment.
In addition to the intrinsic limitations of supervised learning, the growing demand for improved ASR performance, together with significant technological advancements such as the maturation of pre-training techniques and the increasing emphasis on multimodal integration, has prompted researchers to explore novel learning paradigms for ASR. Nevertheless, this does not signify the abandonment of supervised learning. On the contrary, by optimizing model architecture, supervised ASR models can still attain state-of-the-art (SOTA) performance within specific linguistic domains. A prominent illustration is the recently open-sourced FireRedASR series developed by FireRedTeam. This series encompasses two ASR models trained within a supervised learning framework [105]. Among them, the large-scale FireRed-LLM model has outperformed Seed-ASR [106] in public Chinese ASR benchmark tests, thereby establishing itself as a new SOTA model. Figure 7 illustrates the performance of different models on publicly available Chinese ASR benchmark tests.
Moreover, FireRed-LLM has exhibited superior performance on the KeSpeech dialect benchmark and lyrics transcription tasks, surpassing both commercial and open-source baseline models. Notably, on the LibriSpeech test set, FireRed-LLM achieved a WER comparable to that of Whisper-Large-v3. Despite the fact that Whisper was trained on over 680,000 h of speech data, FireRed-LLM was trained on merely 70,000 h. For the WER of the English corpus, we provide a more detailed presentation in Table 5 and Table 6.

5.3.2. Large ASR Models Under Semi-Supervised Learning

While some equate semi-supervised learning with weakly supervised learning, others emphasize their differences [128,129]. After reviewing the literature [130,131], we regard semi-supervised learning as a subset of weak supervision that has evolved into a relatively independent paradigm. In this paper, we acknowledge its origin but treat it as a distinct category.
Weak supervision broadly includes three scenarios:
  • Incomplete supervision: Only part of the data is labeled, e.g., a few labeled texts among massive unlabeled social media data.
  • Imprecise supervision: Annotations are coarse or high-level, such as labeling a bird image simply as “bird” without species information.
  • Inaccurate supervision: Labels contain errors or noise, as in mislabeling medical images due to human mistakes.
Incomplete supervision overlaps with semi-supervision. However, over time, semi-supervised learning has matured into a robust framework with well-defined theories and algorithms. Merz et al. (1992) [132] first coined the term “semi-supervised,” and since then, numerous methods have emerged, shaping it as a standalone research field.
The seminal book by Chapelle et al. (2009) [133] offers a comprehensive summary of semi-supervised learning, covering generative models, low-density separation, graph-based methods, two-step learning, real-world applications, and future directions. It formalized the paradigm and bridged theoretical and practical gaps.
With the rise of deep learning, strategies like consistency regularization and pseudo-labeling significantly advanced semi-supervised learning. Consistency regularization improves model generalization by enforcing stable predictions under input perturbations. The pseudo-label method proposed by Lee (2013) [134] bootstraps from labeled data, using model predictions on unlabeled data as training targets—which is particularly effective for deep networks.
Although originally designed for general tasks, these methods have notably boosted ASR performance. For example, SpecAugment [135], which applies time and frequency masking, significantly improved results on LibriSpeech 960h and Switchboard 300h when used with LAS models.
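SpecAugment itself is straightforward to apply with common audio libraries. The sketch below (the number and width of masks are assumptions, not the exact policy of the original paper) masks random frequency bands and time spans of a log-Mel spectrogram using torchaudio’s built-in transforms:

```python
import torch
import torchaudio

# SpecAugment-style masking on a log-Mel spectrogram.
log_mel = torch.randn(1, 80, 300)              # (channel, n_mels, frames), placeholder features

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)

augmented = log_mel
for _ in range(2):                             # apply two masks of each kind (illustrative policy)
    augmented = freq_mask(augmented)           # zero out a random band of Mel channels
    augmented = time_mask(augmented)           # zero out a random span of frames
print(augmented.shape)                         # shape unchanged, content partially masked
```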
While most large-scale ASR research emphasizes unsupervised pre-training with unlabeled data [117,136,137], OpenAI’s Whisper [138] takes a novel semi-supervised (or weakly supervised) approach. It leverages 680,000 h of multilingual and multitask data with minimal preprocessing and trains a Transformer encoder-decoder using a sequence-to-sequence strategy. This design streamlines the pipeline and yields robust zero-shot performance across ASR and speech translation, approaching human-level accuracy. Analysis shows that scaling model size and data volume improves performance, though gains diminish over time.
Though pure pseudo-label-based large-scale semi-supervised ASR models remain rare, related approaches such as noisy student training (NST) [139] have proven effective. NST typically involves training a teacher model, generating pseudo-labels for unlabeled data, filtering them by confidence, and combining them with labeled data to train a student model. SpecAugment-based data augmentation is also used to enhance generalization [109,113].
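The NST loop can be summarized in a few lines of Python. In the hypothetical sketch below, teacher_transcribe and train_student are placeholders for a trained teacher model and a student training routine; only the confidence-filtering and data-pooling logic is shown:

```python
# Hypothetical sketch of one noisy-student generation: a teacher transcribes
# unlabeled audio, low-confidence pseudo-labels are discarded, and the surviving
# pairs are pooled with the labeled set to train the student.

def noisy_student_round(teacher_transcribe, unlabeled_audio, labeled_pairs,
                        train_student, confidence_threshold=0.9):
    pseudo_pairs = []
    for audio in unlabeled_audio:
        text, confidence = teacher_transcribe(audio)   # hypothesis plus confidence score
        if confidence >= confidence_threshold:          # keep only confident pseudo-labels
            pseudo_pairs.append((audio, text))
    # Data augmentation (e.g., SpecAugment as above) would be applied inside training.
    return train_student(labeled_pairs + pseudo_pairs)
```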

5.3.3. Large ASR Models Under Self-Supervised Learning

Similar to the confusion between semi-supervised learning and weakly supervised learning, self-supervised learning and unsupervised learning are often used interchangeably. In fact, self-supervised learning can be considered as a special type of unsupervised learning approach. Unsupervised learning refers to a method of learning from data without explicit given labels or target values, aiming to uncover the natural structures, patterns, or regularities within the data.
Conversely, self-supervised learning generates supervisory signals by leveraging the inherent structures and correlations within the data itself, thereby enabling the training of the model. For instance, in natural language processing, certain words in a sentence can be replaced with masks, and the model is then tasked with predicting the masked words [140]. In the field of image processing, operations such as rotation and cropping can be applied to images to enhance the model’s ability to understand and learn image features [141,142]. In the domain of speech recognition, operations like adding noise, downsampling, or partial deletion can be performed on the speech, and the model is made to attempt to restore the original clean speech. Through this approach, the model can learn the intrinsic features and patterns of the data.
It is worth noting that both self-supervised learning and semi-supervised learning are typically employed in conjunction with different learning paradigms to fully leverage their respective advantages and address complex practical issues. In practical engineering applications, the collaborative framework of self-supervised learning and supervised learning has emerged as an important paradigm for enhancing the effectiveness of models [143].
The core logic is as follows. First, self-supervised learning is employed for pre-training on a large volume of unlabeled data, allowing the model to acquire the ability to learn general feature representations. Subsequently, supervised fine-tuning is performed with a relatively small amount of labeled data to optimize domain adaptation. As an illustration, in the field of computer vision, self-supervised pre-training on millions of unlabeled images enables the model to construct a multi-dimensional visual feature space. Subsequent fine-tuning on specific tasks (such as a subset of ImageNet) can remarkably enhance the classification accuracy. In the domain of speech processing, self-supervised pre-training based on long-duration unlabeled audio data (such as the Wav2Vec 2.0 architecture), in combination with the fine-tuning of a limited number of labeled speech samples, can effectively enhance the robustness of the ASR system. This combined strategy not only reduces the labeling cost but also accomplishes the dual optimization of the model’s generalization ability and task accuracy.
In 2019, researchers at Facebook AI (now Meta AI) released the wav2vec model [136], a self-supervised learning model for speech recognition. Its appearance marked the beginning of self-supervised learning’s prominence in ASR: the model achieved remarkable results by pre-training on a large amount of unlabeled speech and then fine-tuning on a small amount of labeled data. A year later, the team released wav2vec 2.0 [117], which introduced more efficient self-supervised training methods (such as contrastive learning and quantized representations) and achieved performance close to that of supervised learning on benchmark datasets such as LibriSpeech. This release further accelerated the adoption of self-supervised ASR models. Subsequently, HuBERT [118] and WavLM [119] pushed the development of self-supervised large models in ASR further. Self-supervised learning has since become the dominant trend for large ASR models.
Wav2vec 2.0, as a representative self-supervised ASR model, incorporates several advances from previous research. It draws inspiration from BERT’s masked language modeling and masks part of the speech representations in the latent space, and it employs Gumbel-Softmax [144] and product quantization [145] to jointly learn representations of discrete speech units. Architecturally, wav2vec 2.0 encodes the raw speech input with a multi-layer convolutional neural network, in line with earlier work, and then models the context of the speech representations with a Transformer to exploit information across the sequence. During pre-training, the model optimizes a contrastive loss (together with a diversity penalty on the quantizer codebook); for downstream recognition, a CTC loss is added during supervised fine-tuning. The innovation of wav2vec 2.0 lies in learning contextual representations and quantized units end to end, breaking the limitations of the traditional two-step approach and delivering superior performance. Experiments show that wav2vec 2.0 performs exceptionally well in low-resource settings on LibriSpeech: with only 10 min of labeled data for fine-tuning, it reaches WERs of 4.8/8.2 on the test-clean/test-other sets.
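The sketch below isolates, at toy scale, two ingredients highlighted above: Gumbel-Softmax selection of discrete codebook entries, and a contrastive objective in which the context vector at a masked position must identify its quantized target among distractors. The dimensions, codebook size, distractor sampling, and the omission of the diversity penalty are simplifications of the published objective.

```python
# Toy Gumbel-Softmax quantizer plus an InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, D, CODES = 50, 64, 32          # frames, feature dim, codebook size (toy values)

latents = torch.randn(T, D)        # stands in for CNN-encoded speech frames
context = torch.randn(T, D)        # stands in for Transformer outputs at masked positions
codebook = torch.nn.Parameter(torch.randn(CODES, D))
to_logits = torch.nn.Linear(D, CODES)

# 1) Gumbel-Softmax quantization: differentiable selection of one codebook
#    entry per frame (hard=True yields straight-through one-hot vectors).
code_logits = to_logits(latents)
one_hot = F.gumbel_softmax(code_logits, tau=1.0, hard=True)
quantized = one_hot @ codebook                      # (T, D) quantized targets

# 2) Contrastive loss: each context vector must pick out its own quantized
#    frame among K distractors sampled from other frames.
def contrastive_loss(ctx, q, k_distractors=10, temperature=0.1):
    losses = []
    for t in range(ctx.shape[0]):
        distract_idx = torch.randint(0, q.shape[0], (k_distractors,))
        candidates = torch.cat([q[t:t + 1], q[distract_idx]])      # positive first
        sims = F.cosine_similarity(ctx[t:t + 1], candidates) / temperature
        losses.append(F.cross_entropy(sims.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

loss = contrastive_loss(context, quantized)
print(f"toy contrastive loss: {loss.item():.3f}")
```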
HuBERT (hidden-unit BERT) was released by Meta AI in 2021. It represents a significant advance in self-supervised speech representation learning by addressing a chicken-and-egg problem in wav2vec 2.0’s approach: wav2vec 2.0 requires discrete targets for masked prediction yet must learn those targets jointly with the representations. HuBERT instead adopts an iterative refinement strategy. The model begins with simple acoustic unit discovery (such as k-means clustering on MFCC features) to generate initial discrete targets, and then progressively refines these targets over multiple generations of training. This iterative process enables HuBERT to learn increasingly sophisticated acoustic representations without having to refine the representation network and a quantization codebook in parallel. Experimental results show that HuBERT outperforms wav2vec 2.0 across multiple benchmarks, particularly in low-resource scenarios, where it maintains competitive performance with significantly less labeled data. The design also facilitates transfer learning, making it effective for adapting to new domains and languages.
WavLM (wave language model) was developed by Microsoft in 2022. WavLM extends the self-supervised learning paradigm to address the challenges of complex acoustic environments and multi-speaker scenarios. Unlike previous models that primarily focused on clean speech recognition, WavLM incorporates explicit training objectives for handling overlapping speech, noise robustness, and speaker separation tasks. The model introduces several architectural innovations, including gated relative position bias and utterance-mixing training strategies that enable it to learn representations robust to acoustic variations commonly encountered in real-world applications. WavLM’s training procedure incorporates diverse speech conditions during pre-training, including simulated noisy environments and multi-speaker overlaps, which significantly enhances its downstream performance on challenging tasks such as speaker verification, speech separation, and robust ASR. The model demonstrates particular strength in scenarios involving background noise, reverberation, and speaker diarization, making it highly suitable for deployment in practical intelligent information systems where clean audio conditions cannot be guaranteed. Furthermore, WavLM’s representations have shown remarkable transferability across various speech processing tasks, establishing it as a versatile foundation model for comprehensive speech understanding applications.
These developments in self-supervised learning represent a maturation of the paradigm, moving beyond simple masked prediction tasks toward more sophisticated objectives that capture the nuanced challenges of real-world speech processing. The success of HuBERT and WavLM has influenced subsequent research directions, inspiring the development of even larger and more capable models that continue to push the boundaries of what is achievable with self-supervised speech representation learning.

5.4. The Selection and Trade-Off of Different Learning Paradigms

In the domain of ASR, the choice of training paradigm for large-scale models directly influences both performance and implementation cost. As a traditional approach, supervised learning relies heavily on large-scale labeled speech–text paired data. In scenarios where pronunciation is clear and grammar is standard, supervised learning performs excellently; however, its high labeling cost restricts cross-domain generalization. In tasks involving minority languages or dialects in particular, supervised learning encounters severe data bottlenecks. English-centric ASR models are a typical example: they achieve high accuracy on the well-labeled LibriSpeech dataset, but their performance degrades markedly in low-resource language scenarios. For instance, although OpenAI’s Whisper model achieves near state-of-the-art accuracy on well-annotated datasets such as LibriSpeech, its WER increases substantially when evaluated on low-resource languages like Bengali and Punjabi, reflecting the persistent challenges of cross-lingual generalization in supervised ASR systems.
Semi-supervised learning and self-supervised learning provide novel approaches to surmounting the limitations of labeled data. Semi-supervised learning combines a relatively small amount of labeled data with a vast quantity of unlabeled data; through techniques such as self-training and pseudo-labeling, it iteratively optimizes the model and significantly reduces annotation costs. Self-supervised learning, exemplified by wav2vec 2.0, instead pretrains general representations by exploiting the intrinsic patterns within speech signals, such as masked prediction and contrastive learning, and only requires fine-tuning on the final task to adapt efficiently to downstream applications. Although both paradigms reduce reliance on human annotation, they also face technical challenges, such as sensitivity to noise and bias in negative sample selection. Therefore, when selecting a model, it is necessary to balance three key factors: data volume and labeling cost, computational resources, and task complexity. Figure 8 illustrates how the demands of the three learning paradigms differ along these dimensions.
Data volume and labeling cost: Supervised learning relies on large-scale, high-quality labeled data to build an accurate speech recognition model, but manually labeling speech is time consuming, labor intensive, and costly, which limits the scale and speed of data acquisition. Semi-supervised learning takes an alternative route: by combining a small amount of labeled data with a large amount of unlabeled data, it eases the labeling burden and reduces overall cost. Self-supervised learning removes the dependence on manual labels entirely, automatically mining learning signals from the structure and features of the raw speech itself, but it demands very large quantities of raw data.
Computational resources: Because supervised learning processes only labeled data, its computational requirements are comparatively low, although model performance is bounded by the scale and quality of the labels. Semi-supervised learning must allocate compute between pseudo-label generation, where the model is run over the unlabeled data, and the training stage, where both labeled and pseudo-labeled data contribute to parameter updates. Self-supervised learning requires a large upfront investment of computation to pre-train a base model on large amounts of raw data; in the later fine-tuning stage, because the base model has already learned rich speech features, only modest computational resources are needed to optimize it for specific tasks.
Task complexity: For highly complex tasks such as speech recognition in noisy environments or multilingual mixed scenarios, supervised learning is constrained by the scenarios and languages covered in the labeled dataset and struggles to adapt to real, variable conditions, leading to substantial performance drops. Self-supervised and semi-supervised learning, by effectively exploiting unlabeled data, can learn broader and more general speech patterns and therefore exhibit stronger adaptability and robustness in complex tasks.
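Purely as an illustration of how these three factors might be weighed in practice, the toy helper below encodes the discussion as a decision rule; the thresholds and labels are hypothetical assumptions, not empirical recommendations.

```python
def suggest_paradigm(labeled_hours: float, unlabeled_hours: float,
                     gpu_budget_hours: float, noisy_or_multilingual: bool) -> str:
    """Illustrative heuristic mapping the three factors discussed above
    (data/labeling cost, compute, task complexity) to a training paradigm.
    All thresholds are hypothetical examples."""
    if labeled_hours >= 1000 and not noisy_or_multilingual:
        return "supervised training on labeled data"
    if unlabeled_hours >= 10 * labeled_hours and gpu_budget_hours >= 1000:
        return "self-supervised pre-training + light supervised fine-tuning"
    if unlabeled_hours > labeled_hours:
        return "semi-supervised pseudo-labeling (e.g., noisy student)"
    return "supervised training with data augmentation"

print(suggest_paradigm(labeled_hours=50, unlabeled_hours=5000,
                       gpu_budget_hours=2000, noisy_or_multilingual=True))
```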

6. Frontier Technology Research and Future Directions

In recent years, with the vigorous development of large models, ASR has entered a period of rapid progress. The technology is currently evolving along three key dimensions: advancing contextual understanding to capture the complete meaning behind an utterance, improving resource utilization to achieve strong performance under limited computing power, and broadening its application scope by integrating ASR into more complex and dynamic real-world scenarios. Through this process, the technology is gradually approaching the capabilities of human speech interaction.

6.1. ASR Deployment Architectures: Edge, Cloud, and Hybrid Approaches

Among the three key dimensions of ASR evolution, improving resource utilization under limited-computing-power conditions represents a fundamental challenge that directly impacts system deployment strategies. The transition from laboratory research to real-world applications necessitates careful architectural decisions that balance performance, privacy, and resource constraints. As ASR systems become increasingly sophisticated and are deployed across diverse intelligent information systems, the choice of deployment architecture has emerged as a critical factor determining system effectiveness, user experience, and operational feasibility. Recent research has identified three primary deployment paradigms, each addressing different aspects of the resource utilization challenge while offering distinct characteristics and trade-offs in practical implementations.

6.1.1. Edge Deployment Architecture

Edge deployment maintains all processing locally on user devices, ensuring maximum privacy protection because speech data never leave the device (Figure 9). Xu et al. [146] demonstrated that advanced Conformer-based end-to-end streaming ASR systems can be deployed on resource-constrained devices such as smartphones, smart wearables, and other smart home automation devices through model architecture adaptations, neural network graph transformations, and numerical optimizations. Their system ran more than 5.26 times faster than real time (a real-time factor, RTF, of 0.19) on smart wearables while minimizing energy consumption and achieving state-of-the-art accuracy. This approach enables Transformer-based, server-free AI applications to operate fully offline, saving cloud computing resources while providing stronger user privacy guarantees.
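The real-time factor quoted above is simply the ratio of processing time to audio duration, so an RTF of 0.19 corresponds to roughly 5.26 times faster than real time; the brief sketch below, with example timing values, makes the arithmetic explicit.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing divided by the duration of the audio."""
    return processing_seconds / audio_seconds

# Example values: 10 s of audio processed in 1.9 s.
rtf = real_time_factor(processing_seconds=1.9, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, speed-up over real time = {1 / rtf:.2f}x")  # 0.19 -> ~5.26x
```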

6.1.2. Cloud-Centric Architecture

Cloud deployment leverages centralized computational resources to achieve optimal recognition performance using large-scale models (Figure 10). Modern cloud-based ASR systems utilize sophisticated architectures like AssemblyAI’s Conformer-2, which is trained on 1.1 million hours of English audio data to provide improvements on proper nouns, alphanumerics, and robustness to noise [147]. The typical pipeline involves device-based audio capture, secure transmission to cloud infrastructure, server-side ASR processing with potential language model rescoring, and return of results to the client device. This architecture enables the use of large multimodal models but introduces latency dependencies and requires careful privacy boundary management.

6.1.3. Hybrid Edge–Cloud Architecture

Hybrid architectures attempt to balance the advantages of both approaches through intelligent workload distribution (Figure 11). Miao et al. [148] explored an online hybrid CTC/attention end-to-end ASR architecture that replaces all offline components of the conventional CTC/attention architecture with streaming counterparts, including stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention and a dynamic waiting joint decoding (DWJD) algorithm to collect CTC and attention predictions online. The AS-ASR framework by Bao et al. [149] demonstrates how lightweight models based on Whisper-tiny can be optimized specifically for edge deployment, providing a scalable, efficient solution for real-world disordered speech recognition. The selection among these architectures depends on application requirements including latency tolerance, privacy constraints, computational budget, and accuracy demands. Modern intelligent information systems increasingly benefit from deep learning advances that have made ASR architectures such as QuartzNet, Citrinet, and Conformer capable of accurately processing varying dialects and accents, with breakthroughs in multilingual ASR helping move algorithms from cloud to on-device deployments that save money, protect privacy, and speed up inference [150].
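One common way to realize such a hybrid, sketched below with hypothetical edge_decode and cloud_decode placeholders, is to run a lightweight on-device recognizer first and escalate to a cloud model only when its confidence falls below a threshold and policy permits off-device processing; the confidence measure and threshold are illustrative assumptions rather than any published design.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float   # e.g., mean token posterior from the decoder

def edge_decode(audio: bytes) -> Hypothesis:
    """Placeholder for a lightweight on-device (streaming) recognizer."""
    return Hypothesis(text="turn on the lights", confidence=0.62)

def cloud_decode(audio: bytes) -> Hypothesis:
    """Placeholder for a large server-side model reached over the network."""
    return Hypothesis(text="turn on the living room lights", confidence=0.97)

def hybrid_recognize(audio: bytes, threshold: float = 0.80,
                     allow_cloud: bool = True) -> str:
    """Escalate to the cloud only when the edge hypothesis is not confident
    enough and the privacy policy permits off-device processing."""
    local = edge_decode(audio)
    if local.confidence >= threshold or not allow_cloud:
        return local.text
    return cloud_decode(audio).text

print(hybrid_recognize(b"\x00" * 16000))
```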

6.2. Multimodal Fusion

Multimodal fusion has emerged as one of the most prominent research directions, aiming to enhance the robustness and accuracy of ASR systems by integrating diverse information sources such as audio, visual, and auxiliary textual data, as shown in Figure 12.
Building on the growing interest in multimodal fusion for ASR, several recent studies have explored how integrating visual and auxiliary information can enhance recognition performance. Notable approaches include MLCA-AVSR [151], which introduces a multi-layer cross-attention fusion strategy to exploit audio and visual information at different hierarchical levels for joint learning. The method performed strongly in the MISP2022-AVSR challenge, achieving a concatenated minimum-permutation character error rate (cpCER) of 30.57% on the evaluation set [152]. Another representative method is uncertainty-aware dynamic fusion (UADF), which dynamically calibrates and integrates acoustic information during ASR decoding, effectively reducing the WER in noisy environments.
Several other recent works further illustrate the effectiveness of visual and auxiliary information in ASR. AVFormer  [153] injects visual embeddings and lightweight adapters into a frozen speech model, achieving audiovisual fusion with low training cost. This method demonstrates strong generalization with limited weakly labeled video data and attains zero-shot AV-ASR SOTA performance on How2, VisSpeech, and Ego4D, while maintaining high performance on LibriSpeech. Similarly, VHASR  [154] integrates scene-related visual information, such as hotwords extracted from images, using a dual-stream architecture for feature-level fusion, showing notable improvements in image-assisted ASR tasks on datasets like Flickr8k, ADE20k, COCO, and OpenImages. AVATAR  [155] uses full visual frames (not limited to lip regions) together with audio input in a seq2seq Transformer framework, employing a “word masking” strategy to encourage reliance on visual cues, resulting in enhanced robustness in real-world video scenarios, particularly under noisy conditions.
The discriminative multi-modality speech recognition approach [156] adopts a unique two-stage strategy: first utilizing visual information such as lip movement features to assist in separating the target speech and reduce interference, and then conducting joint modeling through a multi-modal sub-network. This method has achieved remarkable results on the LRS3–TED  [157] and LRW datasets, demonstrating the strong potential of integrating visual cues. In addition, MaLa–ASR  [158] leverages unique domain-specific auxiliary information from conference room slides—such as extracted keywords—and skillfully incorporates it into the ASR process, significantly reducing the WER on the SlideSpeech dataset  [159]. Collectively, these examples illustrate that well-designed multimodal fusion strategies can effectively improve recognition accuracy, particularly in challenging acoustic or domain-specific scenarios.
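To ground these fusion strategies, the sketch below implements a single generic cross-attention block in PyTorch in which audio frames attend to visual features; it is an illustrative simplification, not a reimplementation of any of the cited systems, and the dimensions and sequence lengths are arbitrary.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Audio queries attend to visual keys/values; the attended visual
    context is added back to the audio stream (generic illustration)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + attended)

# Toy batch: 100 audio frames and 25 video frames, both projected to 256-d.
audio_feats = torch.randn(2, 100, 256)
visual_feats = torch.randn(2, 25, 256)
fused = CrossModalFusion()(audio_feats, visual_feats)
print(fused.shape)   # torch.Size([2, 100, 256]) -> passed on to the ASR encoder/decoder
```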

6.3. Breakthroughs in Edge ASR Technology

As AI develops at a rapid pace, integrating large models into edge devices to bring intelligence to everyday objects has become an inevitable trend. Against this backdrop, endowing edge devices with the ability to accurately “understand speech” has become a cutting-edge research direction in ASR. This direction focuses on achieving high-precision recognition under limited computing resources and on breakthroughs in edge ASR technology that can meet the requirements of multi-language, multi-scenario tasks. Table 7 lists lightweight ASR models developed in recent years; all of them can run on edge devices with sufficient compute capability.
To address these challenges, various research teams have proposed innovative solutions. Xu et al. (2023) [146] designed a comprehensive optimization framework encompassing model architecture adaptation, neural network graph transformation, and numerical optimization techniques. This framework enables Conformer-based end-to-end streaming ASR systems to operate efficiently on resource-constrained devices, significantly improving both recognition speed and accuracy. Additionally, it provides a theoretical foundation for numerical stabilization, ensuring the model’s reliable performance. Qin et al. (2024) [160] introduced the Tiny-Align framework, which takes a novel approach by designing a Transformer-based projector and related mechanisms to achieve efficient cross-modal alignment between ASR and large language models. Moreover, they incorporated an instruction injection mechanism to further enhance system performance. Notably, this work represents the first high-efficiency ASR-LLM alignment study specifically tailored for resource-constrained edge devices, marking a pioneering advancement in the field. Manepalli et al. (2021) [161] proposed the DYN-ASR method, which leverages language and accent recognition techniques to dynamically select monolingual ASR models. This approach effectively reduces resource consumption while significantly improving multilingual recognition performance, offering a novel perspective for edge ASR applications in multilingual scenarios.
Table 7. The performance of lightweight ASR models in recent years on LibriSpeech.
Model | WER (%, Test-Clean) | WER (%, Test-Other) | Params
Squeezeformer-XS [162] | 3.74 | 9.09 | 9 M
ContextNet-S [93] | 2.9 | 7.0 | 10.8 M
QuartzNet-15x5 [163] | 3.9 | 11.28 | 19 M
Zipformer-S [164] | 2.42 | 5.73 | 23.3 M
Moonshine Tiny [165] | 4.52 | 11.7 | 27.1 M
whisper.tiny.en [138] | 5.66 | 15.45 | 37.8 M
E-Branchformer (B) [120] | 2.49 | 5.61 | 41.12 M

6.4. Combination with Spiking Neural Networks (SNNs)

Spiking neural networks (SNNs), recognized as the third generation of neural networks, offer distinct advantages in modeling biological neural systems. Unlike traditional ANNs that process continuous activation values, SNNs use discrete action potentials, or “spikes,” which better mimic biological neuronal communication and confer higher biological plausibility.
The foundational concept of SNNs was introduced by Maass (1997) [166], who demonstrated their computational benefits using spiking neuron models like integrate-and-fire. SNNs can achieve efficient performance with fewer neurons in tasks such as pattern matching and coincidence detection. With the rise of deep learning, deep SNNs have garnered increasing interest for their potential in complex applications. Tavanaei et al. (2019) [167] reviewed the use of deep SNNs in areas like image and speech recognition. SNNs support both supervised and unsupervised learning paradigms, with spike-timing-dependent plasticity (STDP) [168] playing a central role. STDP enables synaptic weight updates based on spike timing, allowing dynamic adaptation.
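A minimal leaky integrate-and-fire simulation, sketched below in NumPy with arbitrary parameter values, illustrates the event-driven behavior described here: the membrane potential integrates the input current, leaks toward rest, and emits a discrete spike when it crosses a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

def lif_neuron(input_current, dt=1.0, tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire: integrate input, leak toward rest,
    spike and reset when the membrane potential crosses threshold."""
    v, spikes = 0.0, []
    for i in input_current:
        v += dt * (-v + i) / tau          # leaky integration
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset                   # reset after the spike
        else:
            spikes.append(0)
    return np.array(spikes)

# Noisy input drive (arbitrary units) over 200 time steps.
current = 1.5 + 0.5 * rng.standard_normal(200)
spike_train = lif_neuron(current)
print(f"{spike_train.sum()} spikes over {len(spike_train)} time steps")
```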
While traditional ANNs have been applied to speech recognition, they are often limited by high computational and energy costs. SNNs, leveraging event-driven computation and sparse spike activity, provide a biologically inspired and energy-efficient alternative.
Auge et al. (2021) [169] proposed an end-to-end SNN architecture that processes raw audio with resonant neurons, eliminating the need for MFCC preprocessing. This design achieves comparable accuracy in keyword detection while significantly reducing energy consumption. Wu et al. (2020) [170] extended SNNs to large vocabulary continuous speech recognition via a tandem learning framework, achieving ANN-level performance on TIMIT and LibriSpeech datasets with only 10 time steps and reduced synaptic operations, demonstrating efficient inference. Wang et al. (2023) [171] further advanced the field by introducing the DyTr-SNN model, which incorporates dynamic neurons and spike-based Transformers. Their model improves PER, robustness, and efficiency on LJSpeech, underscoring the value of biologically inspired mechanisms in temporal sequence processing.
Overall, SNNs show great promise for speech recognition due to their unique spiking computation and energy efficiency. Nevertheless, challenges remain in training algorithm optimization and neuron model enhancement. Future work should focus on developing more effective training strategies and incorporating richer biological features to improve performance and scalability in practical applications.

7. Speech Recognition Applications in Intelligent Information Systems

The integration of ASR technology into intelligent information systems has revolutionized human–computer interaction across numerous domains, fundamentally transforming how users access, process, and manipulate information. This transformation extends beyond simple voice-to-text conversion, encompassing sophisticated multimodal interfaces that combine speech recognition with contextual understanding, real-time processing, and adaptive learning capabilities. The deployment of ASR in intelligent information systems presents unique challenges and opportunities that differ significantly from standalone speech recognition applications, requiring careful consideration of system architecture, performance optimization, and user experience design. As illustrated in Table 8 and Table 9, different ASR model families present distinct trade-offs in terms of accuracy, computational requirements, robustness, and deployment feasibility, necessitating careful selection based on specific system requirements and constraints.
Table 8. Comprehensive ASR model family comparison for intelligent information systems.
Model Family | WER (%) | RTF | SNR Robustness | Languages | Memory | Streaming | Edge Deploy
Traditional RNN (DeepSpeech, CTC-RNN) | 5–12 | 0.2–0.4 | <10 dB | 1–5 | 1–2 GB | Yes | Feasible
CNN-based (QuartzNet, ContextNet) | 2–4 | 0.1–0.3 | 10–20 dB | 1–10 | 1–4 GB | Yes | Optimal
Hybrid CNN-RNN (Jasper, CitriNet) | 3–6 | 0.2–0.4 | 5–15 dB | 1–20 | 2–4 GB | Yes | Feasible
Self-Supervised SSL (wav2vec2, HuBERT, WavLM) | 1.8–3 | 0.3–0.5 | 0–10 dB | 1–10 | 2–5 GB | Partial | Challenging
Transformer Encoder (Conformer, Branchformer) | 1.8–2 | 0.2–0.4 | −5 to 15 dB | 20–50 | 3–6 GB | Variable | Feasible
Enhanced Transformer (E-Branchformer, Zipformer) | 1.8–2.4 | 0.2–0.4 | −10 to 20 dB | 20–50 | 3–6 GB | Variable | Feasible
Sequence-to-Sequence (LAS, Transformer S2S) | 2–8 | 0.4–1.0 | −5 to 15 dB | 20–99 | 4–8 GB | Partial | Challenging
Large Multimodal (Whisper, Universal ASR) | 2.4–15 | 0.5–2.0 | −20 to 40 dB | 99+ | 1–10 GB | No | Limited
Efficient/Mobile (SqueezeFormer, Moonshine) | 3–12 | 0.05–0.2 | 5–20 dB | 1–5 | <1–2 GB | Yes | Optimal
Streaming-Optimized (Streaming Conformer, RNN-T) | 2.5–5 | 0.1–0.3 | 0–15 dB | 10–50 | 1–3 GB | Yes | Feasible
Cross-Modal Fusion (Audio-Visual, Multi-modal) | 1.5–3 | 0.8–1.5 | −15 to 30 dB | 10–99 | 8–15 GB | Partial | Impractical
Column notes: WER (%) is the LibriSpeech test-clean range; RTF is the real-time factor; SNR Robustness is the SNR threshold at which performance degrades; Languages is the number of supported languages; Memory is GPU memory; Streaming indicates streaming capability; Edge Deploy indicates edge deployment feasibility.
Table 9. ASR paradigm selection decision matrix for intelligent information systems.
ASR Paradigm | WER Range | Latency (ms) | RTF | Noise Robustness | Languages | Memory (GB) | Deployment Scenario
CTC-based Models (DeepSpeech, QuartzNet) | 3–8% | 50–200 | 0.1–0.3 | Medium (>5 dB SNR) | 1–20 | 1–3 | Edge devices, streaming applications
RNN-Transducer (Conformer-RNN-T) | 2–5% | 80–300 | 0.2–0.4 | High (0–15 dB) | 1–50 | 2–4 | Real-time streaming, mobile applications
Conformer Encoder (Transformer-based) | 1.8–3% | 100–400 | 0.2–0.4 | Very high (−5 to 20 dB) | 20–99 | 3–6 | High-accuracy offline, server deployment
SSL Pretrained (wav2vec2, HuBERT) | 1.5–2.5% | 200–600 | 0.3–0.6 | Excellent (−10 to 25 dB) | 10–50 | 4–8 | Research, fine-tuning for specific domains
Whisper-class (Large Multimodal) | 2–15% (varies) | 500–2000 | 0.5–2.0 | Excellent (−20 to 40 dB) | 99+ | 1–10 (varies by size) | Multilingual, robust general-purpose
Usage Guidelines: Select CTC for low-latency edge deployment; RNN-T for streaming with accuracy balance; Conformer for highest accuracy server deployment; SSL models for domain adaptation; Whisper-class for robust multilingual applications. Consider privacy requirements (edge vs. cloud), computational budget, and accuracy tolerance when selecting paradigms.

7.1. Privacy, Security, and Safety Considerations

The deployment of ASR systems in intelligent information systems introduces critical privacy and security challenges that require systematic mitigation strategies. Voice data represent highly sensitive biometric information, necessitating comprehensive protection mechanisms across the entire processing pipeline.

7.1.1. Privacy-Preserving Deployment Strategies

Modern ASR deployments increasingly adopt privacy-preserving architectures to address data sensitivity concerns. Edge deployment maintains complete local processing, eliminating cloud transmission risks [146]. Federated learning approaches enable collaborative model training without centralizing sensitive audio data [4,5]. Large-scale models demonstrate that robust ASR performance can be achieved with minimal preprocessing requirements [138]. Secure smart home implementations show that offline ASR systems can operate effectively while protecting user privacy [172].

7.1.2. Security and Safety Mechanisms

ASR systems require multiple layers of security protection. Self-supervised learning models such as wav2vec 2.0 and HuBERT [117,118] enable training on unlabeled data, reducing dependence on potentially compromised labeled datasets. Multi-modal fusion approaches [151,153] enhance verification capabilities through cross-modal consistency checks, providing additional security layers against spoofing attacks.

7.1.3. Concrete Mitigation Strategies

Based on current research and deployment practices, three essential mitigation strategies emerge:
  • Edge-First Processing Architecture: Deploy ASR processing locally on edge devices to minimize data transmission risks [146]. Smart home automation systems demonstrate that secure, offline voice processing can achieve high performance while maintaining strong privacy protection [172].
  • Federated Learning with Privacy Preservation: Implement federated ASR training frameworks that enable collaborative model improvement without exposing individual voice samples [4,5]; a minimal aggregation sketch follows this list. This approach allows organizations to benefit from collective learning while maintaining strict data governance requirements.
  • Multi-Modal Security Framework: Integrate ASR with complementary modalities to provide enhanced authentication and liveness detection [151,153]. Cross-modal consistency checks between audio and visual information significantly improve system robustness against synthetic speech attacks.
These strategies collectively address fundamental privacy, security, and safety requirements for deploying ASR technology in intelligent information systems while maintaining operational effectiveness and user trust.
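For the federated strategy referenced in the second bullet, the sketch below shows a generic federated-averaging (FedAvg) round in PyTorch: each client updates a copy of the model on its private data, and only model weights, weighted by client data size, are aggregated centrally. The tiny linear model and random features are placeholders, not the specific frameworks cited above.

```python
import torch
import torch.nn as nn

def make_model():
    # Tiny stand-in for an acoustic model; only its weights ever leave a client.
    return nn.Linear(40, 10)

def local_update(global_state, features, labels, lr=0.01, steps=5):
    """Run a few SGD steps on a client's private data; return new weights."""
    model = make_model()
    model.load_state_dict(global_state)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(features), labels).backward()
        opt.step()
    return model.state_dict(), len(features)

def federated_average(client_states_and_sizes):
    """Weighted average of client weights (FedAvg); raw audio never moves."""
    total = sum(n for _, n in client_states_and_sizes)
    return {key: sum(state[key] * (n / total) for state, n in client_states_and_sizes)
            for key in client_states_and_sizes[0][0]}

torch.manual_seed(0)
global_state = make_model().state_dict()
clients = [(torch.randn(64, 40), torch.randint(0, 10, (64,))) for _ in range(3)]
updates = [local_update(global_state, x, y) for x, y in clients]
global_state = federated_average(updates)
print({k: tuple(v.shape) for k, v in global_state.items()})
```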

7.2. Healthcare Information Systems

Healthcare represents one of the most impactful application domains for ASR-enabled intelligent information systems, where the technology addresses critical challenges in clinical documentation, patient care efficiency, and medical data accessibility. Modern healthcare ASR systems must handle domain-specific terminology, maintain strict accuracy requirements for patient safety, and integrate seamlessly with electronic health record (EHR) systems while ensuring compliance with privacy regulations such as HIPAA.
Clinical documentation systems powered by ASR technology have demonstrated remarkable improvements in physician productivity and patient care quality. Recent comprehensive analyses have shown that healthcare ASR systems face unique challenges including high WER in multi-speaker clinical environments, the need for specialized medical vocabulary processing, and requirements for robust performance across diverse acoustic conditions [173]. Modern clinical ASR implementations incorporate real-time correction mechanisms, contextual understanding of medical procedures, and seamless integration with electronic health record systems. These systems intelligently extract clinical entities such as symptoms, diagnoses, and treatment plans from spoken narratives, automatically populating structured data fields in the EHR and triggering appropriate clinical workflows.
Recent developments in healthcare ASR have focused on addressing multilingual challenges and improving accuracy for diverse patient populations. New datasets such as VietMed [174] and PriMock57 [175] have been developed to support research in medical conversation transcription, though significant challenges remain in handling real-world clinical environments with multiple speakers, background noise, and domain-specific terminology [176]. Privacy-preserving approaches and specialized medical conversation models are being developed to address both accuracy and confidentiality requirements in clinical settings.

7.3. Educational Technology Systems

The integration of ASR into educational information systems has opened new frontiers in personalized learning, accessibility, and automated assessment. These systems leverage speech recognition to create more inclusive learning environments, provide real-time feedback on language acquisition, and enable hands-free interaction with educational content.
Language learning applications represent a particularly successful implementation of ASR in educational contexts. Recent studies have demonstrated that ASR-based language learning systems can significantly improve English pronunciation skills among non-native speakers [177]. These systems utilize sophisticated feedback mechanisms, with some providing global corrective feedback while others offer detailed phonetic corrections, both showing measurable improvements in word-level and sentence-level pronunciation after sustained practice periods.
Contemporary research has shown that ASR technology with peer correction mechanisms can enhance second language pronunciation and speaking skills more effectively than traditional teacher-led instruction [178]. Modern educational ASR systems are being designed to provide immediate, personalized feedback that creates less anxiety-provoking learning environments, particularly beneficial for students with higher levels of anxiety around foreign language speaking. Recent applications focus on blended learning approaches that combine ASR technology with human instruction to optimize pronunciation learning outcomes [179].

7.4. Smart Home and IoT Ecosystems

The proliferation of smart home devices and Internet of things (IoT) ecosystems has created unprecedented opportunities for ASR integration, enabling natural voice control of connected devices and services. Modern smart home systems powered by ASR technology provide intuitive interfaces for controlling lighting, climate, security, entertainment, and household appliances through natural language commands.
The global voice and speech recognition market has experienced remarkable growth, with projections indicating expansion from USD 14.8 billion in 2024 to USD 61.27 billion by 2033, growing at a CAGR of 17.1% [180]. This growth is particularly driven by the increasing integration of AI-powered voice assistants in smart home environments. Recent developments have focused on privacy-preserving local voice processing solutions, with projects like Home Assistant’s “Year of Voice” initiative demonstrating significant advances in on-device speech recognition [181].
Modern smart home ASR systems increasingly employ local processing architectures to address privacy concerns and reduce latency. Recent implementations utilize technologies like OpenAI’s Whisper for local speech-to-text conversion and Piper for text-to-speech synthesis, enabling fully offline voice control capabilities [182]. A comprehensive study on secure smart home automation systems demonstrated that offline ASR implementations using optimized models can achieve accuracy rates comparable to cloud-based services while providing enhanced security and faster response times [172]. These systems typically require only 50% of available RAM and less than 40% of CPU resources for simultaneous operation, making them viable for resource-constrained edge devices.

7.5. Enterprise and Business Intelligence Systems

ASR technology has transformed enterprise information systems by enabling voice-driven analytics, hands-free data entry, and natural language queries against business databases. These applications are particularly valuable in scenarios where traditional input methods are impractical or inefficient, such as field operations, manufacturing environments, and mobile workforce management.
The enterprise adoption of ASR technology has been accelerated by advances in accuracy and multilingual capabilities, with leading providers like Speechmatics achieving 18% improvement in accuracy across 50+ languages in their latest Ursa 2 model [183]. Voice-enabled business intelligence platforms now allow executives and analysts to query complex datasets using natural language, democratizing access to data insights across organizational hierarchies. Enterprise-grade ASR solutions increasingly focus on real-time transcription and voice analytics, with companies like Uniphore demonstrating how ASR can unlock business intelligence from conversational data to drive strategic decision making [184].
Modern enterprise ASR implementations emphasize accuracy, scalability, and integration capabilities. Leading platforms like Azure AI Speech and Google Cloud Speech-to-Text provide enterprise-grade solutions that support multiple languages, custom vocabulary, and domain-specific adaptations [185,186]. These systems enable applications ranging from automated call center analytics to voice-controlled inventory management, with SOC 2 Type 2 compliance ensuring enterprise-level security and data protection standards [187].

7.6. Automotive and Transportation Systems

The automotive industry has embraced ASR technology as a critical component of intelligent transportation systems, enhancing both safety and user experience through hands-free operation of vehicle systems and services. Modern vehicles incorporate sophisticated speech recognition capabilities that enable drivers to control navigation, communication, entertainment, and vehicle settings without taking their hands off the wheel or eyes off the road.
The automotive voice recognition market, valued at $3.7 billion in 2024, is projected to grow at a CAGR of 10.6% through 2034, driven by increasing demand for connected cars and advanced infotainment systems [188]. Recent industry developments have seen major automotive manufacturers integrating advanced conversational AI capabilities, with Volkswagen debuting its ChatGPT-enhanced digital voice assistant at CES 2024, powered by Cerence, and Amazon collaborating with BMW to showcase a new car voice assistant that combines Alexa with large language models and vehicle-relevant data [189].
The ICASSP 2024 In-Car Multi-Channel ASR Challenge highlighted the unique technical challenges of automotive ASR, including multi-speaker scenarios, road noise, and varying acoustic environments within vehicles [190]. The challenge, which collected over 100 h of multi-channel speech data recorded inside vehicles, demonstrated that specialized ASR systems can achieve character error rates as low as 13.16% in automotive environments. Advanced automotive ASR systems now support sophisticated features such as multi-zone recognition (up to six sound zones), parallel instruction processing, cross-zone inheritance, and fully offline voice capabilities [191]. These developments enable more natural and efficient in-vehicle interactions while maintaining safety standards for hands-free operation.

8. Representative Deployment Blueprint

As a representative example of practical deployment, a reference configuration is summarized for an on-device voice UI targeting end-to-end latency below 200 ms, utilizing a Conformer-Transducer architecture with INT8 quantization. This configuration is intended for illustrative purposes and does not constitute a comprehensive deployment benchmark. Table 10 shows a representative deployment configuration.

8.1. System Flow and Reference Configuration Table

Microphone Input --> Acoustic Frontend (Feature Extraction)
--> Conformer-Transducer (INT8 Inference)
--> Streaming Decoder
--> Text Output
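Read as code, the flow above corresponds to a chunked streaming loop such as the following sketch; the FeatureExtractor, QuantizedTransducer, and StreamingDecoder classes are hypothetical stand-ins for the blocks in the diagram rather than a real INT8 transducer stack.

```python
import numpy as np

class FeatureExtractor:
    """Placeholder acoustic frontend: frames audio into 10 ms hops of log-energy."""
    def __call__(self, chunk):
        frames = chunk.reshape(-1, 160)                      # 10 ms frames at 16 kHz
        return np.log(np.maximum((frames ** 2).mean(axis=1), 1e-10))

class QuantizedTransducer:
    """Placeholder for the INT8 Conformer-Transducer: emits one 'token id'
    per frame, standing in for transducer label posteriors."""
    def infer(self, feats, state=None):
        return (feats > feats.mean()).astype(int), state

class StreamingDecoder:
    """Placeholder streaming decoder that accumulates non-blank token ids."""
    def __init__(self):
        self.tokens = []
    def update(self, token_ids):
        self.tokens.extend(int(t) for t in token_ids if t != 0)
    def partial_text(self):
        return f"<{len(self.tokens)} tokens so far>"

frontend, model, decoder = FeatureExtractor(), QuantizedTransducer(), StreamingDecoder()
state = None
rng = np.random.default_rng(0)
for _ in range(10):                                          # ten 100 ms chunks
    chunk = rng.standard_normal(1600).astype(np.float32)     # stands in for microphone audio
    feats = frontend(chunk)
    token_ids, state = model.infer(feats, state)
    decoder.update(token_ids)
    print(decoder.partial_text())
```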

8.2. Supporting Evidence from Prior Work

Recent studies support the feasibility of achieving sub-200 ms latency with quantized Conformer-Transducer models on-device:
  • Sub-8-bit Quantization: Zhen et al. [192] report that sub-8-bit quantization for Conformer/RNN-Transducer models reduces user-perceived latency by up to 31.75% compared to standard 8-bit QAT.
  • Mixed-Precision Conformer: Ding et al. [193] achieve a 5× model size reduction using mixed 4-bit/8-bit QAT with only minor accuracy loss, improving real-time inference performance.
  • Streaming FastConformer: NVIDIA’s FastConformer with cache-based inference [194] demonstrates significant latency reductions for streaming ASR compared to baseline Conformer models.
  • Sherpa-ONNX INT8 Models: The Sherpa-ONNX project provides open-source Conformer-Transducer INT8 models for streaming ASR on edge devices, illustrating practical deployment scenarios [195].
While none of these works report an identical end-to-end latency benchmark on a specific mobile device, the combined evidence suggests that the target latency is plausible under favorable hardware and optimization conditions.
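As a simplified illustration of the kind of quantization these works rely on, the snippet below applies PyTorch post-training dynamic INT8 quantization to a small linear stack standing in for a transducer's prediction and joint networks and compares serialized sizes; quantization-aware training and the sub-8-bit schemes of [192,193] require more specialized tooling.

```python
import io
import torch
import torch.nn as nn

# Small stand-in module; a real target would be the Conformer-Transducer's
# linear-heavy components (prediction/joint networks, attention projections).
model_fp32 = nn.Sequential(
    nn.Linear(512, 2048), nn.ReLU(),
    nn.Linear(2048, 512), nn.ReLU(),
    nn.Linear(512, 1024),
)

# Post-training dynamic quantization: weights stored in INT8,
# activations quantized on the fly at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

def serialized_mb(model):
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {serialized_mb(model_fp32):.1f} MB, int8: {serialized_mb(model_int8):.1f} MB")

# Sanity-check inference on a dummy frame (speed depends heavily on hardware).
x = torch.randn(1, 512)
with torch.no_grad():
    _ = model_int8(x)
```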

8.3. Mini-Playbook

This section provides a practical reference for deploying ASR systems across diverse scenarios. Table 8 and Table 9 present a decision matrix mapping ASR paradigms to deployment constraints. To complement this, we provide a Mini-Playbook with concrete steps for two key deployment settings.

9. Critical Open Challenges and Future Directions

Despite substantial progress in ASR, several open challenges continue to limit the reliability and applicability of current systems across real-world scenarios. Below, we synthesize key challenges documented in recent literature and highlight implications for future research.

9.1. Performance Disparities in Low-Resource Languages

State-of-the-art multilingual ASR systems achieve near-human performance on high-resource languages such as English, with WER below 2% on clean speech. However, performance drops drastically for low-resource languages, often exceeding 40% error rates [196,197]. Recent studies on parameter-efficient adaptation methods demonstrate promising gains even under severely limited resources [198], yet these methods still struggle when faced with languages exhibiting extreme typological diversity (e.g., agglutinative or tonal languages) or scarce linguistic resources. Addressing this gap will require not only algorithmic innovations but also the design of fair evaluation metrics and sustainable data collection pipelines in collaboration with local communities.

9.2. Robustness Under Acoustic and Conversational Variability

ASR performance continues to degrade significantly under adverse acoustic conditions, including far-field, noisy, and reverberant environments [199], as well as multi-speaker conversational scenarios commonly encountered in meeting transcription and in-vehicle speech applications [200]. Existing robustness techniques remain insufficient for handling the full spectrum of real-world variability, highlighting the need for models that can adapt seamlessly to challenging acoustic dynamics without prohibitive computational costs.

9.3. Catastrophic Forgetting in Continual Learning Settings

When deployed in dynamic environments, ASR systems often suffer catastrophic forgetting: adapting to new domains or speakers can degrade performance on previously learned tasks [201]. Although recent continual learning approaches mitigate this problem, they often involve trade-offs between adaptation speed and knowledge retention, with limited theoretical guarantees for safety-critical domains such as healthcare [173].

9.4. On-Device Deployment Constraints

Resource-constrained edge devices impose strict limits on model size, latency, and energy consumption. Existing model compression and quantization techniques face fundamental trade-offs between efficiency and accuracy, limiting the deployment of advanced ASR architectures on embedded platforms without sacrificing recognition quality.

9.5. Domain-Specific Adaptation Bottlenecks

Specialized domains, such as healthcare, demand accurate recognition of domain-specific terminology, accented speech, and atypical acoustic patterns [173,174,175]. Despite the release of medical speech datasets, current domain adaptation methods struggle to efficiently incorporate domain knowledge while preserving performance on general speech.

9.6. Implications for Future Research

These challenges collectively underscore three urgent research directions:
Architectural Innovation: New ASR architectures are needed to balance robustness, efficiency, and adaptability, potentially integrating modular continual learning, self-supervised pretraining, and efficient adaptation layers.
Evaluation Reform: Future benchmarks should capture low-resource, multi-speaker, and domain-specific scenarios, ensuring that metrics reflect linguistic diversity and deployment constraints.
Resource/Accuracy Trade-off Optimization: Systematic exploration of compression, quantization, and low-rank adaptation techniques is essential to reconcile computational efficiency with recognition accuracy on edge devices.
Addressing these challenges requires interdisciplinary collaboration, from linguistics and human–computer interaction to systems engineering, ensuring that future ASR systems achieve both technical excellence and global inclusivity.

10. Conclusions

This survey systematically traces the evolution of ASR through four pivotal phases: early template matching, statistical modeling, the deep learning revolution, and the era of large-scale models. It highlights transformative technologies such as Transformers, self-supervised learning, and end-to-end architectures. Early advancements in dynamic programming and HMMs laid the groundwork, while deep learning shifted the paradigm toward data-driven feature extraction, enabling breakthroughs in noise robustness and long-range dependency modeling via networks like LSTMs, CNNs, and the Transformer-based Conformer. Recent years have seen the rise of large models under diverse learning paradigms, such as self-supervised Wav2Vec 2.0 and semi-supervised Whisper, which leverage massive unlabeled data to address labeling costs and improve low-resource language performance.
The comprehensive analysis of ASR applications across healthcare, educational, smart home, enterprise, and automotive domains demonstrates the technology’s transformative impact on intelligent information systems, highlighting both the achievements and the ongoing challenges in real-world deployments. From clinical documentation systems that improve physician productivity to voice-enabled smart homes that enhance user privacy through local processing, ASR technology has fundamentally changed how humans interact with intelligent systems across diverse sectors.
Despite remarkable progress, significant challenges remain, including reliance on large annotated datasets, limited robustness in noisy environments, and computational inefficiency for edge deployment. Future research directions focus on multimodal fusion to integrate audio-visual cues, lightweight architectures for resource-constrained devices, and biologically inspired approaches such as spiking neural networks for energy-efficient inference. The emergence of privacy-preserving techniques, federated learning approaches, and real-time edge processing solutions addresses growing concerns about data security and computational sustainability.
By systematically bridging technical innovations with real-world demands across diverse application domains, ASR continues to evolve toward more natural human–computer interaction, promising advancements in contextual understanding, multilingual adaptability, and sustainable deployment in intelligent information systems. This evolution positions ASR as a cornerstone technology for the next generation of intelligent systems that seamlessly integrate speech understanding with complex decision-making processes across multiple domains.

Author Contributions

Conceptualization, C.W. and Y.P.; methodology, C.W.; validation, H.W., C.W. and Y.P.; formal analysis, C.W.; investigation, C.W.; resources, L.N.; data curation, H.W.; writing—original draft preparation, C.W.; writing—review and editing, Y.P., H.W. and L.N.; visualization, C.W.; supervision, Y.P.; project administration, L.N.; funding acquisition, Y.P. and L.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created.

Acknowledgments

During the preparation of this manuscript/study, the author(s) used ChatGPT-4o for the purpose of generating Figure 9. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Acronym and Terminology Glossary

Key Acronyms and Technical Terms in ASR
Acronym | Full Term | Definition
AM | Acoustic Model | Component mapping audio features to phonetic units
ANN | Artificial Neural Network | Computational model inspired by biological neural networks
ASR | Automatic Speech Recognition | Technology that converts spoken language into text
CER | Character Error Rate | Character-level error metric, used especially for Asian languages
CNN | Convolutional Neural Network | Network using convolution operations for local feature extraction
CTC | Connectionist Temporal Classification | Alignment-free training objective for sequence labeling
DNN | Deep Neural Network | Multi-layer neural network with >3 hidden layers
GMM | Gaussian Mixture Model | Probabilistic model using multiple Gaussian distributions
GRU | Gated Recurrent Unit | Simplified alternative to LSTM with fewer parameters
HMM | Hidden Markov Model | Statistical model for temporal sequence modeling
LAS | Listen, Attend and Spell | Encoder-decoder architecture with attention mechanism
LM | Language Model | Statistical model predicting word sequence probabilities
LSTM | Long Short-Term Memory | RNN variant addressing the vanishing gradient problem
MAP | Maximum A Posteriori | Statistical estimation method incorporating prior knowledge
MBR | Minimum Bayes Risk | Training objective minimizing expected error rate
MFCC | Mel-Frequency Cepstral Coefficients | Traditional acoustic features based on human auditory perception
MLE | Maximum Likelihood Estimation | Statistical method finding parameters maximizing likelihood
MLP | Multi-Layer Perceptron | Feedforward neural network with multiple hidden layers
PER | Phoneme Error Rate | Phoneme-level error metric for phoneme recognition
RNN | Recurrent Neural Network | Network with recurrent connections for sequence processing
RNN-T | RNN Transducer | Sequence-to-sequence model for streaming ASR
RTF | Real-Time Factor | Ratio of processing time to audio duration
SNR | Signal-to-Noise Ratio | Measure of signal quality relative to background noise
SNN | Spiking Neural Network | Third-generation neural network using discrete spike events
SOTA | State-of-the-Art | Current best performance on benchmark tasks
SSL | Self-Supervised Learning | Learning paradigm using unlabeled data with pretext tasks
STFT | Short-Time Fourier Transform | Time-frequency analysis technique for audio signals
WER | Word Error Rate | Primary error metric: percentage of incorrectly recognized words

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  2. Li, J.; Wu, Y.; Gaur, Y.; Wang, C.; Zhao, R.; Liu, S. On the comparison of popular end-to-end models for large scale speech recognition. arXiv 2020, arXiv:2005.14327. [Google Scholar]
  3. Tsai, Y.-H.H.; Ma, M.Q.; Yang, M.; Zhao, H.; Morency, L.-P.; Salakhutdinov, R. Self-supervised representation learning with relative predictive coding. arXiv 2021, arXiv:2103.11275. [Google Scholar] [CrossRef]
  4. Guliani, D.; Beaufays, F.; Motta, G. Training speech recognition models with federated learning: A quality/cost framework. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3080–3084. [Google Scholar]
  5. Nguyen, T.; Mdhaffar, S.; Tomashenko, N.; Bonastre, J.-F.; Estève, Y. Federated learning for ASR based on wav2vec 2.0. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  6. Xu, M.; Song, C.; Tian, Y.; Agrawal, N.; Granqvist, F.; van Dalen, R.; Zhang, X.; Argueta, A.; Han, S.; Deng, Y. Training large-vocabulary neural language models by private federated learning for resource-constrained devices. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  7. Hegdepatil, P.; Davuluri, K. Business intelligence based novel marketing strategy approach using automatic speech recognition and text summarization. In Proceedings of the 2021 2nd International Conference on Computing and Data Science (CDS), Stanford, CA, USA, 28–29 January 2021; pp. 595–602. [Google Scholar]
  8. Mohan, A.; Rose, R.; Ghalehjegh, S.H.; Umesh, S. Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain. Speech Commun. 2014, 56, 167–180. [Google Scholar] [CrossRef]
  9. Seljan, S.; Dunđer, I. Combined automatic speech recognition and machine translation in business correspondence domain for English-Croatian. Int. J. Ind. Syst. Eng. 2014, 8, 1980–1986. [Google Scholar]
  10. Vajpai, J.; Bora, A. Industrial applications of automatic speech recognition systems. Int. J. Eng. Res. Appl. 2016, 6, 88–95. [Google Scholar]
  11. Kheddar, H.; Hemis, M.; Himeur, Y.; Megías, D.; Amira, A. Automatic speech recognition using advanced deep learning approaches: A survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
  12. Dhanjal, A.S.; Singh, W. A comprehensive survey on automatic speech recognition using neural networks. Multimed. Tools Appl. 2024, 83, 23367–23412. [Google Scholar] [CrossRef]
  13. Zahorian, S.; Karnjanadecha, M. Trends and developments in automatic speech recognition research. Comput. Speech Lang. 2023, 84, 101572. [Google Scholar]
  14. Khapra, C. A survey on end-to-end speech recognition systems. Int. J. Comput. Inf. Technol. 2024, 5, 100–110. [Google Scholar] [CrossRef]
  15. Kumar, A.; Verma, S.; Mangla, H. A survey of deep learning techniques in speech recognition. In Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 12–13 October 2018; pp. 179–185. [Google Scholar]
  16. Malik, M.; Malik, M.K.; Mehmood, K.; Makhdoom, I. Automatic speech recognition: A survey. Multimed. Tools Appl. 2021, 80, 9411–9457. [Google Scholar] [CrossRef]
  17. Prabhavalkar, R.; Hori, T.; Sainath, T.N.; Schlüter, R.; Watanabe, S. End-to-end speech recognition: A survey. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 325–351. [Google Scholar] [CrossRef]
  18. Davis, K.H.; Biddulph, R.; Balashek, S. Automatic recognition of spoken digits. J. Acoust. Soc. Am. 1952, 24, 637–642. [Google Scholar] [CrossRef]
  19. Forgie, J.W.; Forgie, C.D. Results obtained from a vowel recognition computer program. J. Acoust. Soc. Am. 1959, 31, 1480–1489. [Google Scholar] [CrossRef]
  20. Olson, H.F.; Belar, H. Phonetic typewriter. J. Acoust. Soc. Am. 1956, 28, 1072–1081. [Google Scholar] [CrossRef]
  21. Olson, H.F.; Belar, H. Phonetic typewriter III. J. Acoust. Soc. Am. 1961, 33, 1610–1615. [Google Scholar] [CrossRef]
  22. Atal, B.S.; Hanauer, S.L. Speech Analysis and Synthesis by Linear Prediction of the Speech Wave. J. Acoust. Soc. Am. 1971, 50, 637–655. [Google Scholar] [CrossRef]
  23. Suzuki, J. Recognition of Japanese vowels. J. Radio Res. Lab. 1961, 8, 193–211. [Google Scholar]
  24. Sakai, T.; Doshita, S. Phonetic typewriter. J. Acoust. Soc. Am. 1961, 33 (Suppl. 11), 1664. [Google Scholar] [CrossRef]
  25. Itakura, F. Minimum prediction residual principle applied to speech recognition. IEEE Trans. Acoust. Speech Signal Process. 1975, 23, 67–72. [Google Scholar] [CrossRef]
  26. Sakoe, H.; Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 43–49. [Google Scholar] [CrossRef]
  27. Klatt, D.H. Review of the ARPA speech understanding project. J. Acoust. Soc. Am. 1977, 62, 1345–1366. [Google Scholar] [CrossRef]
  28. Rabiner, L.; Juang, B.-H. Fundamentals of Speech Recognition; Prentice-Hall, Inc.: Englewood Cliffs, NJ, USA, 1993; pp. 380–410. ISBN 0130151572. [Google Scholar]
29. Myers, C.; Rabiner, L. A level building dynamic time warping algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 284–297. [Google Scholar] [CrossRef]
  30. Myers, C.; Rabiner, L.; Rosenberg, A. An investigation of the use of dynamic time warping for word spotting and connected speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Denver, CO, USA, 9–11 April 1980; pp. 173–177. [Google Scholar]
  31. Lee, C.-H.; Rabiner, L.R. A frame-synchronous network search algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 1649–1658. [Google Scholar] [CrossRef]
  32. Bridle, J.S.; Brown, M.; Chamberlain, R. An Algorithm for Connected Word Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Paris, France, 3–5 April 1982; pp. 899–902. [Google Scholar]
  33. Lowerre, B.T. The Harpy Speech Recognition System; Carnegie Mellon University: Pittsburgh, PA, USA, 1976. [Google Scholar]
  34. Rabiner, L.; Juang, B. An introduction to hidden Markov models. IEEE ASSP Mag. 1986, 3, 4–16. [Google Scholar] [CrossRef]
35. Forney, G.D. The Viterbi algorithm. Proc. IEEE 1973, 61, 268–278. [Google Scholar] [CrossRef]
  36. Juang, B.H.; Rabiner, L.R. Hidden Markov models for speech recognition. Technometrics 1991, 33, 251–272. [Google Scholar] [CrossRef]
  37. Lou, H.-L. Implementing the Viterbi algorithm. IEEE Signal Process. Mag. 1995, 12, 42–52. [Google Scholar] [CrossRef]
  38. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  39. Gauvain, J.L.; Lee, C.-H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 1994, 2, 291–298. [Google Scholar] [CrossRef]
  40. Milvus. What Is the History of Speech Recognition Technology? Available online: https://milvus.io/ai-quick-reference/what-is-the-history-of-speech-recognition-technology (accessed on 20 August 2025).
  41. Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 2011, 20, 30–42. [Google Scholar] [CrossRef]
  42. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
  43. Graves, A.; Mohamed, A.-r.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
  44. Sutskever, I.; Martens, J.; Hinton, G.E. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 1017–1024. [Google Scholar]
  45. Bourlard, H.A.; Morgan, N. Connectionist Speech Recognition: A Hybrid Approach; Springer Science & Business Media: New York, NY, USA, 2012; Volume 247, pp. 155–182. ISBN 1461532108. [Google Scholar]
  46. Trentin, E.; Gori, M. A survey of hybrid ANN/HMM models for automatic speech recognition. Neurocomputing 2001, 37, 91–126. [Google Scholar] [CrossRef]
  47. Hinton, G.E.; Osindero, S.; Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  48. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A. Deep speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567. [Google Scholar]
  49. Graves, A.; Eck, D.; Beringer, N.; Schmidhuber, J. Biologically plausible speech recognition with LSTM neural nets. In Proceedings of the Biologically Inspired Approaches to Advanced Information Technology: First International Workshop (BioADIT), Lausanne, Switzerland, 29–30 January 2004; pp. 127–136. [Google Scholar]
  50. Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
  51. Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558. [Google Scholar] [CrossRef]
  52. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation; Institute for Cognitive Science, University of California: San Diego, CA, USA, 1985. [Google Scholar]
  53. Yuan, Q.; Dai, Y.; Li, G. Exploration of English speech translation recognition based on the LSTM RNN algorithm. Neural Comput. Appl. 2023, 35, 24961–24970. [Google Scholar] [CrossRef]
  54. Bengio, Y.; Simard, P.; Frasconi, P. Learning Long-Term Dependencies with Gradient Descent Is Difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
  55. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  56. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G. Deep Speech 2: End-to-end Speech Recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 173–182. [Google Scholar]
  57. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  58. Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Tech. Rep. N 1993, 93, 27403. [Google Scholar]
  59. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  60. Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar] [CrossRef]
  61. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  62. Tombaloğlu, B.; Erdem, H. Turkish speech recognition techniques and applications of recurrent units (LSTM and GRU). Gazi Univ. J. Sci. 2021, 34, 1035–1049. [Google Scholar] [CrossRef]
  63. Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; pp. 1764–1772. [Google Scholar]
  64. Hau, D.; Chen, K. Exploring hierarchical speech representations with a deep convolutional neural network. In Proceedings of the 11th Annual Workshop on Computational Intelligence (UKCI), Manchester, UK, 7–9 September 2011; pp. 31–37. [Google Scholar]
  65. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  66. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional Neural Networks for Speech Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [Google Scholar] [CrossRef]
  67. Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. Phoneme Recognition Using Time-Delay Neural Networks. Backpropagation; Psychology Press: New York, NY, USA, 2013; pp. 35–61. ISBN 978-1-84872-863-9. [Google Scholar]
  68. Sainath, T.N.; Kingsbury, B.; Saon, G.; Soltau, H.; Mohamed, A.; Dahl, G.; Ramabhadran, B. Deep convolutional neural networks for large-scale speech tasks. Neural Netw. 2015, 64, 39–48. [Google Scholar] [CrossRef]
  69. Shon, S.; Ali, A.; Glass, J. Convolutional neural networks and language embeddings for end-to-end dialect recognition. arXiv 2018, arXiv:1803.04567. [Google Scholar]
  70. Ghafoor, K.J.; Rawf, K.M.H.; Abdulrahman, A.O.; Taher, S.H. Kurdish dialect recognition using 1D CNN. ARO Sci. J. Koya Univ. 2021, 9, 10–14. [Google Scholar] [CrossRef]
  71. Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar] [CrossRef]
  72. Passricha, V.; Aggarwal, R.K. A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition. J. Intell. Syst. 2019, 29, 1261–1274. [Google Scholar] [CrossRef]
  73. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
  74. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  75. Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964. [Google Scholar]
  76. Sak, H.; Shannon, M.; Rao, K.; Beaufays, F. Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1298–1302. [Google Scholar]
  77. Jaitly, N.; Le, Q.V.; Vinyals, O.; Sutskever, I.; Sussillo, D.; Bengio, S. An online sequence-to-sequence model using partial conditioning. Adv. Neural Inf. Process. Syst. 2016, 29, 5074–5082. [Google Scholar]
  78. Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711. [Google Scholar] [CrossRef]
  79. Chiu, C.-C.; Raffel, C. Monotonic chunkwise attention. arXiv 2017, arXiv:1712.05382. [Google Scholar]
  80. Chanchaochai, N.; Cieri, C.; Debrah, J.; Liberman, M.; Graff, D.; Lee, J.; Walker, K.; Walter, T.; Wu, J. GlobalTIMIT: Acoustic-Phonetic Datasets for the World’s Languages. In Proceedings of the INTERSPEECH 2018, Hyderabad, India, 2–6 September 2018; pp. 192–196. [Google Scholar]
81. Reinsel, D.; Gantz, J.; Rydning, J. The Digitization of the World from Edge to Core; International Data Corporation (IDC): Framingham, MA, USA, 2018; pp. 1–28. [Google Scholar]
  82. Halevy, A.; Norvig, P.; Pereira, F. The unreasonable effectiveness of data. IEEE Intell. Syst. 2009, 24, 8–12. [Google Scholar] [CrossRef]
  83. Najafabadi, M.M.; Villanustre, F.; Khoshgoftaar, T.M.; Seliya, N.; Wald, R.; Muharemagic, E. Deep learning applications and challenges in big data analytics. J. Big Data 2015, 2, 1–21. [Google Scholar] [CrossRef]
  84. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  85. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778. [Google Scholar]
  86. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  87. Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. Adv. Neural Inf. Process. Syst. 2015, 28, 577–585. [Google Scholar]
  88. Moritz, N.; Hori, T.; Le, J. Streaming automatic speech recognition with the transformer model. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 6074–6078. [Google Scholar]
  89. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar] [CrossRef]
  90. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
  91. Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
  92. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
  93. Han, W.; Zhang, Z.; Zhang, Y.; Yu, J.; Chiu, C.-C.; Qin, J.; Gulati, A.; Pang, R.; Wu, Y. Contextnet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv 2020, arXiv:2005.03191. [Google Scholar] [CrossRef]
  94. Peng, Y.; Dalmia, S.; Lane, I.; Watanabe, S. Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 17627–17643. [Google Scholar]
  95. Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; Kumar, S. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7829–7833. [Google Scholar]
  96. Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
  97. Paul, D.B.; Baker, J. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Speech and Natural Language: Proceedings of a Workshop, Harriman, New York, NY, USA, 23–26 February 1992; p. 357. [Google Scholar]
  98. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
  99. Hugging Face. Wav2Vec2-Conformer. Available online: https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer (accessed on 14 August 2025).
  100. Chen, Z.; Ramabhadran, B.; Biadsy, F.; Zhang, X.; Chen, Y.; Jiang, L.; Moreno, P.J. Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech. In Proceedings of the INTERSPEECH 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4828–4832. [Google Scholar]
  101. Google Cloud. Migrate from classic to Conformer Models. Available online: https://cloud.google.com/speech-to-text/docs/conformer-migration (accessed on 20 August 2025).
  102. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
  103. Gu, Y.; Shivakumar, P.G.; Kolehmainen, J.; Brusco, P.; Sim, K.C.; Ramabhadran, B.; Picheny, M. Scaling Laws for Discriminative Speech Recognition Rescoring Models. arXiv 2023, arXiv:2306.15815. [Google Scholar] [CrossRef]
  104. Subbaswamy, A.; Saria, S. Counterfactual Normalization: Proactively Addressing Dataset Shift Using Causal Mechanisms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Monterey, CA, USA, 6–10 August 2018; pp. 947–957. [Google Scholar]
  105. Xu, K.-T.; Xie, F.-L.; Tang, X.; Hu, Y. FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration. arXiv 2025, arXiv:2501.14350. [Google Scholar]
  106. Bai, Y.; Chen, J.; Chen, J.; Chen, W.; Chen, Z.; Ding, C.; Dong, L.; Dong, Q.; Du, Y.; Gao, K. Seed-asr: Understanding Diverse Speech and Contexts with LLM-Based Speech Recognition. arXiv 2024, arXiv:2407.04675. [Google Scholar]
  107. Shakhadri, S.A.G.; Kr, K.; Angadi, K.B. Samba-asr state-of-the-art speech recognition leveraging structured state-space models. arXiv 2025, arXiv:2501.02832. [Google Scholar]
  108. Hwang, D. FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information. arXiv 2024, arXiv:2405.12807. [Google Scholar] [CrossRef]
  109. Zhang, Y.; Qin, J.; Park, D.S.; Han, W.; Chiu, C.-C.; Pang, R.; Le, Q.V.; Wu, Y. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv 2020, arXiv:2010.10504. [Google Scholar]
  110. Chung, Y.-A.; Zhang, Y.; Han, W.; Chiu, C.-C.; Qin, J.; Pang, R.; Wu, Y. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 244–250. [Google Scholar]
  111. Rekesh, D.; Koluguri, N.R.; Kriman, S.; Majumdar, S.; Noroozi, V.; Huang, H.; Hrinchuk, O.; Puvvada, K.; Kumar, A.; Balam, J. Fast conformer with linearly scalable attention for efficient speech recognition. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023; pp. 1–8. [Google Scholar]
  112. Xu, Q.; Baevski, A.; Likhomanenko, T.; Tomasello, P.; Conneau, A.; Collobert, R.; Synnaeve, G.; Auli, M. Self-training and pre-training are complementary for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 6–11 June 2021; pp. 3030–3034. [Google Scholar]
  113. Park, D.S.; Zhang, Y.; Jia, Y.; Han, W.; Chiu, C.-C.; Li, B.; Wu, Y.; Le, Q.V. Improved noisy student training for automatic speech recognition. arXiv 2020, arXiv:2005.09629. [Google Scholar] [CrossRef]
  114. Chan, W.; Park, D.; Lee, C.; Zhang, Y.; Le, Q.; Norouzi, M. Speechstew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network. arXiv 2021, arXiv:2104.02133. [Google Scholar] [CrossRef]
  115. Pan, J.; Shapiro, J.; Wohlwend, J.; Han, K.J.; Lei, T.; Ma, T. ASAPP-ASR: Multistream CNN and self-attentive SRU for SOTA speech recognition. arXiv 2020, arXiv:2005.10469. [Google Scholar]
  116. Fathullah, Y.; Wu, C.; Shangguan, Y.; Jia, J.; Xiong, W.; Mahadeokar, J.; Liu, C.; Shi, Y.; Kalinli, O.; Seltzer, M. Multi-head state space model for speech recognition. arXiv 2023, arXiv:2305.12498. [Google Scholar] [CrossRef]
  117. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  118. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  119. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  120. Kim, K.; Wu, F.; Peng, Y.; Pan, J.; Sridhar, P.; Han, K.J.; Watanabe, S. E-branchformer: Branchformer with enhanced merging for speech recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2022; pp. 84–91. [Google Scholar]
  121. Yao, Z.; Kang, W.; Yang, X.; Kuang, F.; Guo, L.; Zhu, H.; Jin, Z.; Li, Z.; Lin, L.; Povey, D. CR-CTC: Consistency regularization on CTC for improved speech recognition. arXiv 2024, arXiv:2410.05101. [Google Scholar] [CrossRef]
  122. Akmal, H.M.; Chao, X.; Mehdi, R. Transformer-based ASR incorporating time-reduction layer and fine-tuning with self-knowledge distillation. arXiv 2021, arXiv:2103.09903. [Google Scholar]
  123. Liu, C.; Zhang, F.; Le, D.; Kim, S.; Saraf, Y.; Zweig, G. Improving RNN transducer based ASR with auxiliary tasks. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 172–179. [Google Scholar]
  124. Baevski, A.; Hsu, W.-N.; Xu, Q.; Babu, A.; Gu, J.; Auli, M. Data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 1298–1312. [Google Scholar]
  125. Xu, Q.; Likhomanenko, T.; Kahn, J.; Hannun, A.; Synnaeve, G.; Collobert, R. Iterative pseudo-labeling for speech recognition. arXiv 2020, arXiv:2005.09267. [Google Scholar] [CrossRef]
  126. Synnaeve, G.; Xu, Q.; Kahn, J.; Likhomanenko, T.; Grave, E.; Pratap, V.; Sriram, A.; Liptchinsky, V.; Collobert, R. End-to-end asr: From supervised to semi-supervised learning with modern architectures. arXiv 2019, arXiv:1911.08460. [Google Scholar]
  127. Zhang, F.; Wang, Y.; Zhang, X.; Liu, C.; Saraf, Y.; Zweig, G. Faster, simpler and more accurate hybrid asr systems using wordpieces. arXiv 2020, arXiv:2005.09150. [Google Scholar] [CrossRef]
  128. Nartey, O.T.; Yang, G.; Asare, S.K.; Wu, J.; Frempong, L.N. Robust semi-supervised traffic sign recognition via self-training and weakly-supervised learning. Sensors 2020, 20, 2684. [Google Scholar] [CrossRef]
  129. Souly, N.; Spampinato, C.; Shah, M. Semi and weakly supervised semantic segmentation using generative adversarial network. arXiv 2017, arXiv:1703.09695. [Google Scholar] [CrossRef]
  130. Ren, Z.; Wang, S.; Zhang, Y. Weakly supervised machine learning. CAAI Trans. Intell. Technol. 2023, 8, 549–580. [Google Scholar] [CrossRef]
  131. Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2018, 5, 44–53. [Google Scholar] [CrossRef]
  132. Merz, C.J.; Clair, D.C.S.; Bond, W.E. Semi-supervised adaptive resonance theory (smart2). In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Baltimore, MD, USA, 7–11 June 1992; Volume 3, pp. 851–856. [Google Scholar]
  133. Chapelle, O.; Scholkopf, B.; Zien, A. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Trans. Neural Netw. 2009, 20, 542. [Google Scholar] [CrossRef]
  134. Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the Workshop on Challenges in Representation Learning, Atlanta, GA, USA, 16–21 June 2013; p. 896. [Google Scholar]
  135. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar] [CrossRef]
  136. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
  137. Zhang, Y.; Park, D.S.; Han, W.; Qin, J.; Gulati, A.; Shor, J.; Jansen, A.; Xu, Y.; Huang, Y.; Wang, S. Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE J. Sel. Top. Signal Process. 2022, 16, 1519–1532. [Google Scholar] [CrossRef]
  138. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
  139. Xie, Q.; Luong, M.-T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
  140. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  141. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  142. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  143. Bucci, S.; D’Innocente, A.; Liao, Y.; Carlucci, F.M.; Caputo, B.; Tommasi, T. Self-Supervised Learning across Domains. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5516–5528. [Google Scholar] [CrossRef]
  144. Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144. [Google Scholar]
  145. Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 117–128. [Google Scholar] [CrossRef]
  146. Xu, M.; Jin, A.; Wang, S.; Su, M.; Ng, T.; Mason, H.; Han, S.; Lei, Z.; Deng, Y.; Huang, Z. Conformer-based speech recognition on extreme edge-computing devices. arXiv 2023, arXiv:2312.10359. [Google Scholar]
  147. AssemblyAI. Conformer-2: A State-of-the-Art Speech Recognition Model Trained on 1.1M hours of Data. AssemblyAI Technical Blog, 2023. Available online: https://www.assemblyai.com/blog/conformer-2/ (accessed on 15 January 2025).
  148. Miao, H.; Cheng, G.; Zhang, P.; Yan, Y. Online Hybrid CTC/attention End-to-End Automatic Speech Recognition Architecture. arXiv 2023, arXiv:2307.02351. [Google Scholar] [CrossRef]
  149. Bao, C.; Huo, C.; Chen, Q.; Gao, C. AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition. arXiv 2025, arXiv:2506.06566. [Google Scholar]
  150. NVIDIA. What Is Automatic Speech Recognition? NVIDIA Technical Blog, 2023. Available online: https://developer.nvidia.com/blog/essential-guide-to-automatic-speech-recognition-technology/ (accessed on 15 January 2025).
  151. Wang, H.; Guo, P.; Zhou, P.; Xie, L. Mlca-avsr: Multi-layer cross attention fusion based audio-visual speech recognition. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 8150–8154. [Google Scholar]
  152. Chen, C.; Li, R.; Hu, Y.; Siniscalchi, S.M.; Chen, P.-Y.; Chng, E.; Yang, C.-H.H. It’s never too late: Fusing acoustic information into large language models for automatic speech recognition. arXiv 2024, arXiv:2402.05457. [Google Scholar]
  153. Seo, P.H.; Nagrani, A.; Schmid, C. Avformer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 22922–22931. [Google Scholar]
  154. Hu, J.; Li, Z.; Wang, P.; Wang, J.; Li, X.; Zhao, W. VHASR: A Multimodal Speech Recognition System with Vision Hotwords. arXiv 2023, arXiv:2410.00822. [Google Scholar]
  155. Gabeur, V.; Seo, P.H.; Nagrani, A.; Schmid, C.; Vedaldi, A. Avatar: Unconstrained Audiovisual Speech Recognition. arXiv 2022, arXiv:2206.07684. [Google Scholar] [CrossRef]
  156. Xu, B.; Lu, C.; Guo, Y.; Wang, J. Discriminative multi-modality speech recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14433–14442. [Google Scholar]
  157. Afouras, T.; Chung, J.S.; Zisserman, A. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv 2018, arXiv:1809.00496. [Google Scholar]
  158. Yang, G.; Ma, Z.; Yu, F.; Gao, Z.; Zhang, S.; Chen, X. Mala-asr: Multimedia-assisted llm-based asr. arXiv 2024, arXiv:2406.05839. [Google Scholar]
  159. Wang, H.; Yu, F.; Shi, X.; Wang, Y.; Zhang, S.; Li, M. SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11076–11080. [Google Scholar]
  160. Qin, R.; Liu, D.; Xu, G.; Yan, Z.; Xu, C.; Hu, Y.; Hu, X.S.; Xiong, J.; Shi, Y. Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge. arXiv 2024, arXiv:2411.13766. [Google Scholar] [CrossRef]
  161. Manepalli, S.G.; Whitenack, D.; Nemecek, J. DYN-ASR: Compact, multilingual speech recognition via spoken language and accent identification. In Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA, 14–31 July 2021; pp. 830–835. [Google Scholar]
  162. Kim, S.; Gholami, A.; Shaw, A.; Lee, N.; Mangalam, K.; Malik, J.; Mahoney, M.W.; Keutzer, K. Squeezeformer: An efficient transformer for automatic speech recognition. Adv. Neural Inf. Process. Syst. 2022, 35, 9361–9373. [Google Scholar]
  163. Kriman, S.; Beliaev, S.; Ginsburg, B.; Huang, J.; Kuchaiev, O.; Lavrukhin, V.; Leary, R.; Li, J.; Zhang, Y. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 6124–6128. [Google Scholar]
  164. Yao, Z.; Guo, L.; Yang, X.; Kang, W.; Kuang, F.; Yang, Y.; Jin, Z.; Lin, L.; Povey, D. Zipformer: A faster and better encoder for automatic speech recognition. arXiv 2023, arXiv:2310.11230. [Google Scholar]
  165. Jeffries, N.; King, E.; Kudlur, M.; Nicholson, G.; Wang, J.; Warden, P. Moonshine: Speech Recognition for Live Transcription and Voice Commands. arXiv 2024, arXiv:2410.15608. [Google Scholar] [CrossRef]
  166. Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Netw. 1997, 10, 1659–1671. [Google Scholar] [CrossRef]
  167. Tavanaei, A.; Ghodrati, M.; Kheradpisheh, S.R.; Masquelier, T.; Maida, A. Deep learning in spiking neural networks. Neural Netw. 2019, 111, 47–63. [Google Scholar] [CrossRef]
  168. Caporale, N.; Dan, Y. Spike Timing–Dependent Plasticity: A Hebbian Learning Rule. Annu. Rev. Neurosci. 2008, 31, 25–46. [Google Scholar] [CrossRef]
  169. Auge, D.; Hille, J.; Kreutz, F.; Mueller, E.; Knoll, A. End-to-End Spiking Neural Network for Speech Recognition Using Resonating Input Neurons. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Bratislava, Slovakia, 14–17 September 2021; pp. 245–256. [Google Scholar]
  170. Wu, J.; Yılmaz, E.; Zhang, M.; Li, H.; Tan, K.C. Deep spiking neural networks for large vocabulary automatic speech recognition. Front. Neurosci. 2020, 14, 199. [Google Scholar] [CrossRef]
  171. Wang, Q.; Zhang, T.; Han, M.; Wang, Y.; Zhang, D.; Xu, B. Complex dynamic neurons improved spiking transformer network for efficient automatic speech recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 102–109. [Google Scholar]
  172. Irugalbandara, C.; Naseem, A.S.; Perera, S.; Kiruthikan, S.; Logeeshan, V. A Secure and Smart Home Automation System with Speech Recognition and Power Measurement Capabilities. Sensors 2023, 23, 5784. [Google Scholar] [CrossRef]
  173. Kumar, Y. A Comprehensive Analysis of Speech Recognition Systems in Healthcare: Current Research Challenges and Future Prospects. SN Comput. Sci. 2024, 5, 137. [Google Scholar] [CrossRef]
  174. Le-Duc, K. Vietmed: A dataset and benchmark for automatic speech recognition of vietnamese in the medical domain. arXiv 2024, arXiv:2404.05659. [Google Scholar] [CrossRef]
  175. Korfiatis, A.P.; Moramarco, F.; Sarac, R.; Cuendet, M.A.; Chary, M.; Velupillai, S.; Nenadic, G.; Gkotsis, G. Primock57: A Dataset of Primary Care Mock Consultations. arXiv 2022, arXiv:2204.00333. [Google Scholar] [CrossRef]
  176. Adedeji, A.; Sanni, M.; Ayodele, E.; Joshi, S.; Olatunji, T. The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? arXiv 2025, arXiv:2501.15310. [Google Scholar] [CrossRef]
  177. Bashori, M.; van Hout, R.; Strik, H.; Cucchiarini, C. I Can Speak: Improving English Pronunciation through Automatic Speech Recognition-Based Language Learning Systems. Innov. Lang. Learn. Teach. 2024, 18, 443–461. [Google Scholar] [CrossRef]
  178. Sun, W. The Impact of Automatic Speech Recognition Technology on Second Language Pronunciation and Speaking Skills of EFL Learners: A Mixed Methods Investigation. Front. Psychol. 2023, 14, 1210187. [Google Scholar] [CrossRef]
  179. Cai, Y. The Application of Automatic Speech Recognition Technology in English as Foreign. In Proceedings of the 2nd International Conference on Humanities, Wisdom Education and Service Management (HWESM 2023), Xi’an, China, 14–16 July 2023; Volume 760, p. 356. [Google Scholar]
  180. Straits Research. Voice and Speech Recognition Market Size, Share & Trends Analysis Report by Function (Speech Recognition, Voice Recognition), by Technology (Artificial Intelligence Based, Non-Artificial Intelligence Based), by Vertical (Automotive, Enterprise, Consumer, BFSI, Government, Retail, Healthcare, Military, Legal, Education) and by Region (North America, Europe, APAC, Middle East and Africa, LATAM) Forecasts, 2025–2033; Report Code: SRTE2654DR. Available online: https://straitsresearch.com/report/voice-and-speech-recognition-market (accessed on 23 August 2025).
  181. Paulus Schoutsen. 2023: Home Assistant’s Year of Voice. Available online: https://www.home-assistant.io/blog/2022/12/20/year-of-voice/ (accessed on 23 August 2025).
  182. Schoutsen, P. Year of the Voice-Chapter 2: Let’s Talk. Home Assistant Blog, 27 April 2023. Available online: https://www.home-assistant.io/blog/2023/04/27/year-of-the-voice-chapter-2/ (accessed on 23 August 2025).
  183. Steadman, L.; Williams, W. Ursa 2: Elevating Speech Recognition Across 50+ Languages. Available online: https://www.speechmatics.com/company/articles-and-news/ursa-2-elevating-speech-recognition-across-52-languages (accessed on 23 August 2025).
  184. Uniphore. What Is Automatic Speech Recognition (ASR)? Available online: https://www.uniphore.com/glossary/automatic-speech-recognition/ (accessed on 23 August 2025).
  185. Microsoft. Speech to Text documentation–Tutorials, API Reference. Azure AI Services. Available online: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-speech-to-text (accessed on 23 August 2025).
  186. Google Cloud. Speech-to-Text documentation. Google Cloud Documentation. Available online: https://cloud.google.com/speech-to-text/docs (accessed on 23 August 2025).
  187. Voicegain. Speech-to-Text APIs. Available online: https://www.voicegain.ai/speech-to-text-apis (accessed on 23 August 2025).
  188. Wadhwani, P. Automotive Voice Recognition Market Analysis: Market Size, Share & Forecasts 2023–2032. Global Market Insights. Available online: https://www.gminsights.com/industry-analysis/automotive-voice-recognition-market (accessed on 23 August 2025).
  189. Behera, R. Advances in Automotive Voice Recognition Systems Redefining the In-Car Experience. Allied Market Research Blog, 20 May 2024. Available online: https://blog.alliedmarketresearch.com/latest-technologies-in-automotive-voice-recognition-systems-1972 (accessed on 23 August 2025).
  190. Wang, H.; Guo, P.; Li, Y.; Zhang, A.; Sun, J.; Xie, L.; Chen, W.; Zhou, P.; Bu, H.; Xu, X.; et al. ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Seoul, Republic of Korea, 14–19 April 2024; pp. 63–64. [Google Scholar]
  191. ResearchInChina. Automotive Voice Industry Review 2023–2024. AutoTech News, 27 December 2023. Available online: https://autotech.news/automotive-voice-industry-review-2023-2024/ (accessed on 23 August 2025).
  192. Zhen, K.; Radfar, M.; Nguyen, H.; Strimel, G.P.; Susanj, N.; Mouchtaris, A. Sub-8-bit quantization for on-device speech recognition: A regularization-free approach. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 13–20. [Google Scholar]
  193. Ding, S.; Meadowlark, P.; He, Y.; Lew, L.; Agrawal, S.; Rybakov, O. 4-bit Conformer with Native Quantization Aware Training for Efficient Speech Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 1458–1462. [Google Scholar]
194. Noroozi, V.; Majumdar, S.; Kumar, A.; Balam, J.; Ginsburg, B. Stateful Conformer with Cache-Based Inference for Streaming Automatic Speech Recognition. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12041–12045. [Google Scholar]
  195. K2-FSA Team. Sherpa-ONNX: Streaming Conformer-Transducer Models for On-Device ASR. Available online: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/conformer-transducer-models.html (accessed on 22 September 2025).
  196. Gupta, A.; Parulekar, A.; Chattopadhyay, S.; Jyothi, P. Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR. arXiv 2024, arXiv:2410.13445. [Google Scholar]
  197. Liu, Z.; Venkateswaran, N.; Le Ferrand, E.; Prud’hommeaux, E. How Important is a Language Model for Low-resource ASR? In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 206–213. [Google Scholar]
  198. Mainzinger, J.; Levow, G.-A. Fine-Tuning ASR models for Very Low-Resource Languages: A Study on Mvskoke. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Bangkok, Thailand, 11–16 August 2024; pp. 76–82. [Google Scholar]
  199. Ranjan, S.; Tripathi, A.; Kumar, K.; Hansen, J.H.L. Curriculum Learning based approaches for robust end-to-end far-field speech recognition. Speech Commun. 2021, 132, 17–27. [Google Scholar] [CrossRef]
  200. Dai, Y.; Liu, S.; Bataev, V.; Shi, Y.; Chen, X.; Wang, H.; Bu, H.; Li, S. AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition. arXiv 2024, arXiv:2505.23036. [Google Scholar]
  201. Wang, Z.; Hou, F.; Wang, R. CLRL-Tuning: A Novel Continual Learning Approach for Automatic Speech Recognition. In Proceedings of the INTERSPEECH 2023, Dublin, Ireland, 20–24 August 2023; pp. 4583–4587. [Google Scholar]
Figure 1. Deep neural network–hidden Markov model (DNN–HMM) system architecture for automatic speech recognition (ASR). The input observation consists of a speech waveform (left) and its corresponding time–frequency representation in the form of a Mel-spectrogram (middle, frequency in Hz, time in seconds). The deep neural network (DNN) predicts posterior probabilities for each hidden Markov model (HMM) state $s_i$, where transition probabilities $a_{s_i, s_j}$ model state dynamics.
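As a brief note on how the posteriors in Figure 1 enter decoding, hybrid systems conventionally convert the DNN state posteriors into scaled likelihoods by dividing by the state priors before Viterbi search over the HMM states (a textbook relation shown here for context, not a detail stated in the figure):

$$ p(x_t \mid s_i) \;\propto\; \frac{P(s_i \mid x_t)}{P(s_i)} $$

where $x_t$ is the acoustic observation at frame $t$ and $P(s_i)$ is the prior of state $s_i$ estimated from the training alignments.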
Figure 2. Organization of independent feature maps, where at each time step in a 15-frame context window, the static features, first-order derivatives, and second-order derivatives serve as the red, green, and blue channels, respectively.
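To make the channel layout in Figure 2 concrete, the short Python sketch below assembles such a three-channel feature map (static log-Mel features plus first- and second-order derivatives) and slices a 15-frame context window; the 40-mel configuration, the synthetic waveform, and the chosen centre frame are illustrative assumptions rather than settings reported in the reviewed work.

```python
# Sketch: static + delta + delta-delta features stacked as channels (Figure 2).
# Assumes librosa and numpy are installed; the waveform here is synthetic.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(2 * sr).astype(np.float32)              # 2 s placeholder waveform
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                     n_fft=400, hop_length=160)  # 25 ms window, 10 ms shift
static = librosa.power_to_db(mel)                           # static log-Mel features
delta1 = librosa.feature.delta(static, order=1)             # first-order derivatives
delta2 = librosa.feature.delta(static, order=2)             # second-order derivatives

feats = np.stack([static, delta1, delta2], axis=0)          # (3, 40, T): RGB-like channels
t = 50                                                      # centre frame (illustrative)
window = feats[:, :, t - 7:t + 8]                           # 15-frame context, shape (3, 40, 15)
print(window.shape)
```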
Figure 3. For the input speech “hello”, CTC identifies the most probable sequence as [h, h, ϵ, e, ϵ, l, l, ϵ, l, o, ϵ]. By merging consecutive repeated characters (such as the two ‘h’s and two ‘l’s) and removing the blank labels, the target sequence is obtained.
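The collapse rule illustrated in Figure 3 takes only a few lines of Python. The toy function below (a sketch of the collapse step only, not the full CTC forward–backward computation) merges consecutive repeated symbols and then drops blanks.

```python
# Toy illustration of the CTC collapse rule from Figure 3.
BLANK = "ϵ"

def ctc_collapse(path):
    """Map an alignment path to its label sequence: merge repeats, drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:   # keep the first symbol of each run, skip blanks
            out.append(sym)
        prev = sym
    return "".join(out)

path = ["h", "h", BLANK, "e", BLANK, "l", "l", BLANK, "l", "o", BLANK]
print(ctc_collapse(path))                  # -> "hello"
```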
Figure 4. Traditional (a) and modern (b) automatic speech recognition (ASR) system architectures.
Figure 5. The audio encoder first downsamples the input through a convolutional layer and then stacks multiple Convolution-augmented Transformer (Conformer) blocks. Each Conformer block contains four main modules: a feed-forward module, a multi-head self-attention module, a convolution module, and a second feed-forward module. The two feed-forward modules sandwich the multi-head self-attention and convolution modules, and a final layer normalization (LayerNorm) closes the block. The internal structures of the feed-forward, multi-head self-attention, and convolution modules are also shown in the figure.
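To make the block structure described in Figure 5 concrete, the PyTorch sketch below implements one Conformer block with the two half-step feed-forward modules, multi-head self-attention, a convolution module, and a final LayerNorm. The model dimension, number of heads, kernel size, activation choices, and the omission of dropout and relative positional encoding are simplifying assumptions for illustration, not the configuration of any published model.

```python
# Simplified Conformer block (Figure 5): FFN/2 -> MHSA -> Conv -> FFN/2 -> LayerNorm.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Pointwise conv + GLU, depthwise conv, BatchNorm, activation, pointwise conv."""
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                     # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)      # -> (batch, d_model, time) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        def feed_forward():
            return nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model),
                                 nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ff1 = feed_forward()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ff2 = feed_forward()
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                                    # x: (batch, frames, d_model)
        x = x + 0.5 * self.ff1(x)                            # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]    # multi-head self-attention
        x = x + self.conv(x)                                 # convolution module
        x = x + 0.5 * self.ff2(x)                            # second half-step feed-forward
        return self.final_norm(x)                            # closing layer normalization

x = torch.randn(2, 100, 256)                                 # (batch, frames, features)
print(ConformerBlock()(x).shape)                             # torch.Size([2, 100, 256])
```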
Figure 6. The detailed architecture of (a) supervised learning, (b) semi-supervised learning, and (c) self-supervised learning.
Figure 7. Performance of seven models on public Chinese ASR benchmark tests. Although Whisper performs strongly across a wide range of speech recognition tasks, its Mandarin Chinese accuracy lags behind models that have been specifically optimized for Chinese speech recognition.
Figure 8. Relative resource demands of each learning paradigm: the farther outward a point lies, the higher the corresponding demand. Only the relative demand profile of each individual paradigm is depicted; combinations of learning paradigms are not addressed here.
Figure 9. Edge deployment architecture showing complete on-device processing pipeline from audio capture to text output, with all components (feature extraction, ASR inference, and optional language model rescoring) running locally on the device.
Figure 10. Cloud-centric architecture illustrating the separation between device-side audio capture and cloud-based processing, where large-scale ASR models and language model rescoring are performed on remote servers.
Figure 11. Hybrid edge–cloud architecture demonstrating dual processing paths that enable both immediate local output and enhanced cloud-processed results, allowing dynamic adaptation based on network conditions and performance requirements.
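The dynamic adaptation mentioned in the Figure 11 caption ultimately reduces to a routing decision on the device. The sketch below is purely illustrative: the thresholds, field names, and the confidence-based criterion are assumptions introduced here to show one plausible policy, not a mechanism described in the surveyed systems.

```python
# Illustrative routing policy for a hybrid edge-cloud ASR pipeline (Figure 11).
from dataclasses import dataclass

@dataclass
class Conditions:
    rtt_ms: float           # measured network round-trip time
    on_metered_link: bool   # user is on a metered/cellular connection
    edge_confidence: float  # confidence of the on-device hypothesis (0-1)

def should_refine_in_cloud(c: Conditions,
                           max_rtt_ms: float = 150.0,
                           min_confidence: float = 0.85) -> bool:
    """Emit the edge hypothesis immediately; request a cloud re-decode only
    when the network is fast enough and the edge output looks uncertain."""
    if c.on_metered_link:
        return False                            # keep traffic local
    if c.rtt_ms > max_rtt_ms:
        return False                            # network too slow for refinement
    return c.edge_confidence < min_confidence   # refine only uncertain outputs

print(should_refine_in_cloud(Conditions(rtt_ms=60, on_metered_link=False,
                                        edge_confidence=0.70)))   # True
```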
Figure 12. Multimodal automatic speech recognition framework. The system processes three input modalities—lip movement features from video, auxiliary textual information (e.g., captions), and audio waveforms—to generate text transcriptions.
Table 1. Comparison of bidirectional LSTM, bidirectional GRU, LSTM, and GRU.
| Comparison Dimension | Bidirectional LSTM | Bidirectional GRU | Unidirectional LSTM | Unidirectional GRU |
| --- | --- | --- | --- | --- |
| Application Scenarios | Tasks highly dependent on both preceding and succeeding contextual information in the sequence. | Tasks that require capturing bidirectional information under limited computational resources or with real-time constraints. | Tasks where modeling long-term dependencies is critical and ample data is available. | Scenarios with resource constraints or high real-time requirements. |
| Advantages | Captures bidirectional dependencies and exhibits strong memory capacity. | Captures bidirectional dependencies with relatively higher computational efficiency. | Strong capability in modeling long-term dependencies. | Simple and efficient, suitable for relatively short sequences. |
| Disadvantages | High computational cost, complex architecture, and slow training speed. | Slightly weaker memory capacity compared to LSTM. | Higher parameter count and increased computational complexity. | Inferior ability to capture long-sequence dependencies compared to LSTM. 1 |
1 This limitation stems from the simplified gating mechanism of GRU compared to LSTM’s separate forget and input gates.
Table 3. Commonly used attention mechanisms in ASR.
| Mechanism | Features | Key Formula Component | Variable Explanation |
| --- | --- | --- | --- |
| Additive Attention | Nonlinear combination of query and key, explicit alignment | $e_{i,j} = V_a^{T} \tanh(w_a q_i + U_a k_j)$ | $q_i$: query; $k_j$: key; $V_a$, $w_a$, $U_a$: learnable weights |
| Dot-Product Attention | Efficient dot-product computation, normalized scaling | $e_{i,j} = \frac{q_i^{T} k_j}{\sqrt{d_k}}$ | $q_i$: query; $k_j$: key; $d_k$: dimension of key |
| Multi-Head Attention | Parallel modeling in multiple subspaces, captures diverse features | $\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}$ | $\mathrm{head}_n$: single attention head; $W^{O}$: output projection matrix |
| Location-Aware Attention [87] | Incorporates previous attention weights, enhances monotonic alignment | $e_{i,j} = V_a^{T} \tanh(w_a q_i + U_a k_j + V_a f_{i,j})$ | $q_i$: query; $k_j$: key; $f_{i,j}$: previous attention info; $V_a$, $w_a$, $U_a$: learnable weights |
| Adaptive Attention | Dynamically adjusts range or weights, optimizes efficiency | Dot-product + dynamic masking | See dot-product attention; dynamic masking adapts attention weights |
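As a minimal illustration of the dot-product entry in Table 3, the following PyTorch sketch computes scaled dot-product attention directly from the formula $e_{i,j} = q_i^{T} k_j / \sqrt{d_k}$; the tensor shapes are arbitrary, and no learned projections or masking are included.

```python
# Scaled dot-product attention computed directly from the Table 3 formula.
import math
import torch

def dot_product_attention(q, k, v):
    """q: (T_q, d_k), k: (T_k, d_k), v: (T_k, d_v) -> context vectors (T_q, d_v)."""
    scores = q @ k.transpose(0, 1) / math.sqrt(k.size(-1))   # e_{i,j} = q_i^T k_j / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)                  # normalize over key positions
    return weights @ v                                       # weighted sum of values

q, k, v = torch.randn(5, 64), torch.randn(12, 64), torch.randn(12, 64)
print(dot_product_attention(q, k, v).shape)                  # torch.Size([5, 64])
```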
Table 4. Common loss functions in supervised ASR.
| Loss Function | Principle | Formula | Variable Explanation |
| --- | --- | --- | --- |
| Cross-Entropy Loss | Measures the difference between the predicted probability distribution and the true label distribution. | $L = -\frac{1}{n}\sum_{i=1}^{n} y_i \log(\hat{y}_i)$ | $n$: number of samples; $y_i$: true label; $\hat{y}_i$: predicted probability for class $i$ |
| CTC Loss | Automatically learns the alignment between speech feature sequences and text labels by summing over all possible alignment paths. | $L = -\log \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x)$ | $T$: input length; $\pi$: alignment path; $B^{-1}(y)$: set of paths mapping to label $y$; $p(\pi_t \mid x)$: probability of label at time $t$ |
| Attention Loss | Supervises the learning of attention mechanisms to focus on important parts of the input speech, often combined with cross-entropy loss and regularized/constrained attention weights. | $L = \sum_{i=1}^{n} H(\mathrm{attention}_i)$ | $n$: number of attention heads or time steps; $H(\cdot)$: entropy; $\mathrm{attention}_i$: attention weight distribution at step $i$ |
| RNN-T Loss | Optimizes sequence transduction models by jointly modeling the acoustic and label sequence without requiring pre-aligned data. | $L = -\log \sum_{\pi \in A(y)} \prod_{t,u} p(\pi_{t,u} \mid x)$ | $t$: acoustic time step; $u$: label index; $\pi$: alignment path; $A(y)$: set of valid alignments for $y$; $p(\pi_{t,u} \mid x)$: probability at $(t, u)$ |
| Minimum Bayes Risk (MBR) Loss | Minimizes the expected error rate (e.g., WER) by weighting hypotheses according to their posterior probabilities. | $L_{\mathrm{MBR}} = \sum_{h \in \mathcal{H}} P(h \mid x) \cdot \mathrm{Cost}(h, y)$ | $\mathcal{H}$: hypothesis space; $P(h \mid x)$: posterior probability of hypothesis $h$; $\mathrm{Cost}(h, y)$: cost between $h$ and reference $y$ |
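For the CTC row in Table 4, the sketch below shows one way to use PyTorch's built-in torch.nn.CTCLoss, which marginalizes over all valid alignment paths internally as in the formula above; the tensor sizes, vocabulary size, and target lengths are invented for illustration.

```python
# Usage sketch for CTC loss with PyTorch (shapes and labels are illustrative).
import torch
import torch.nn as nn

T, N, C = 50, 4, 29                       # frames, batch size, labels incl. blank (index 0)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)    # (T, N, C) layout expected by CTCLoss
targets = torch.randint(1, C, (N, 10))    # reference label sequences (no blank symbol)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                 # sums over all valid alignment paths internally
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                           # gradients flow back to the logits
print(float(loss))
```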
Table 5. Top 20 models on LibriSpeech Test-Clean leaderboard with training context.
| Model | WER (%) | Language Model † | Streaming ‡ | Labeled Data (h) § | Unlabeled Data (h) ¶ | Year |
| --- | --- | --- | --- | --- | --- | --- |
| SAMBA ASR [107] | 1.17 | N/A | N | 15.46k h | - | 2025 |
| FAdam [108] | 1.34 | N/A | N/A | N/A | N/A | 2024 |
| Conformer + Wav2vec 2.0 + SpecAugment NST [109] | 1.4 | Y | N | 960 h | 60k h (Libri-Light) | 2020 |
| w2v-BERT XXL [110] | 1.4 | Y | N | 960 h | 60k h (Libri-Light) | 2021 |
| parakeet-rnnt-1.1b [111] | 1.46 | N | Y | 64k h | - | 2023 |
| Conv + Transformer + wav2vec2.0 [112] | 1.5 | Y | N | 960 h | 53k h (LibriVox) | 2020 |
| ContextNet + SpecAugment NST [113] | 1.7 | Y | N/A | 960 h | 60k h (Libri-Light) | 2020 |
| SpeechStew (1B) [114] | 1.7 | N | N/A | 5.14k h | - | 2021 |
| Multistream CNN + Self-Attentive SRU [115] | 1.75 | Y | N/A | 960 h | N/A | 2020 |
| Stateformer [116] | 1.76 | N | N/A | N/A | N/A | 2023 |
| wav2vec 2.0 with Libri-Light [117] | 1.8 | Y | N | 960 h | 53.2k h (LibriVox) | 2020 |
| HuBERT with Libri-Light [118] | 1.8 | N/A | N | 960 h | 60k h (Libri-Light) | 2021 |
| WavLM Large [119] | 1.8 | Y | N/A | 960 h | 94k h (multi-source) | 2021 |
| E-Branchformer (L) [120] | 1.81 | Y | N/A | 960 h | N/A | 2022 |
| Zipformer + pruned transducer [121] | 1.88 | N | Y | 1k+ h | - | 2024 |
| ContextNet (L) [93] | 1.9 | Y | N/A | 1k+ h | N/A | 2020 |
| Conformer (L) [71] | 1.9 | Y | N | 960 h | N/A | 2020 |
| Transformer + Time reduction + SKD [122] | 1.9 | Y | N | 960 h | N/A | 2021 |
| ContextNet (M) [93] | 2 | Y | N/A | 960 h | N/A | 2020 |
| Transformer Transducer [123] | 2 | N | Y | N/A | N/A | 2020 |
† External language model usage: Y = yes, N = no, N/A = not specified in literature; ‡ real-time streaming capability; § hours of labeled training data; ¶ hours of unlabeled data for pre-training; NST = noisy student training; SKD = self-knowledge distillation.
Table 6. Top 20 models on LibriSpeech Test-Other leaderboard with training context.
| Model | WER (%) | Language Model † | Streaming ‡ | Labeled Data (h) § | Unlabeled Data (h) ¶ | Year |
| --- | --- | --- | --- | --- | --- | --- |
| SAMBA ASR [107] | 2.48 | N/A | N | 15.46k h | - | 2025 |
| FAdam [108] | 2.49 | N/A | N/A | N/A | N/A | 2024 |
| w2v-BERT XXL [110] | 2.5 | Y | N | 960 h | 60k h (Libri-Light) | 2021 |
| Conformer + Wav2vec 2.0 + SpecAugment NST [109] | 2.6 | Y | N | 960 h | 60k h (Libri-Light) | 2020 |
| HuBERT with Libri-Light [118] | 2.9 | N/A | N | 960 h | 60k h (Libri-Light) | 2021 |
| wav2vec 2.0 with Libri-Light [117] | 3.0 | Y | N | 960 h | 53.2k h (LibriVox) | 2020 |
| Conv + Transformer + wav2vec2.0 [112] | 3.1 | Y | N | 960 h | 53k h (LibriVox) | 2020 |
| WavLM Large [119] | 3.2 | Y | N/A | 960 h | 94k h (multi-source) | 2021 |
| SpeechStew (1B) [114] | 3.3 | N | N/A | 5.14k h | - | 2021 |
| ContextNet + SpecAugment NST [113] | 3.4 | Y | N/A | 960 h | 60k h (Libri-Light) | 2020 |
| E-Branchformer (L) [120] | 3.65 | Y | N/A | 960 h | N/A | 2022 |
| data2vec [124] | 3.7 | N/A | N | N/A | N/A | 2022 |
| Conv + Transformer AM + Pseudo-Labeling [125] | 3.83 | Y | N/A | 960 h | N/A | 2020 |
| Conformer (L) [71] | 3.9 | Y | N | 960 h | N/A | 2020 |
| Zipformer + pruned transducer [121] | 3.95 | N | Y | 1k+ h | - | 2024 |
| SpeechStew (100M) [114] | 4.0 | N | N/A | 960 h + multi-domain | - | 2021 |
| wav2vec 2.0 [117] | 4.1 | Y | N | 960 h | 53.2k h (LibriVox) | 2020 |
| ContextNet (L) [93] | 4.1 | Y | N/A | 1k+ h | N/A | 2020 |
| Conv + Transformer AM [126] | 4.11 | Y | N/A | 960 h | N/A | 2019 |
| CTC + Transformer LM rescoring [127] | 4.20 | Y | N/A | 960 h | - | 2020 |
† External language model usage: Y = yes, N = no, N/A = not specified in literature; ‡ real-time streaming capability; § hours of labeled training data; ¶ hours of unlabeled data for pre-training; NST = noisy student training.
Table 10. Representative deployment configuration for on-device <200 ms voice UI using Conformer-Transducer + INT8 quantization.
| Stage | Configuration | Notes |
| --- | --- | --- |
| Hardware | Mid-range mobile SoC/ARM Cortex-A + INT8 inference accelerator | Representative device; latency varies with hardware and threads |
| Model Architecture | Conformer-Transducer (small/medium) | Streaming capability enabled; end-to-end ASR pipeline |
| Quantization | INT8 weights + activations | Post-training quantization or QAT for minimal accuracy drop |
| Feature Extraction | 16 kHz, 25 ms window, 10 ms frame shift | Optimized DSP/C++ kernels for low-latency front-end |
| Decoder | Greedy/simplified beam search | Streaming joiner to maintain real-time performance |
| Threads/Memory | 2–4 threads, cached feature blocks | Balance latency vs. throughput |
| Latency Metric | Estimated <200 ms end-to-end | Based on prior work, not directly measured here |
| Accuracy Trade-off | <2% WER increase vs. FP32 baseline | As reported in related quantization studies |
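As a hedged companion to the quantization row in Table 10, the sketch below applies PyTorch dynamic INT8 quantization to a stand-in module. Dynamic quantization converts only the weights ahead of time and quantizes activations on the fly, so the static post-training quantization or QAT mentioned in the table would additionally require calibration data or fine-tuning; the placeholder model is not an actual Conformer-Transducer.

```python
# Dynamic INT8 quantization of a placeholder encoder stack (illustrative only).
import torch
import torch.nn as nn

model = nn.Sequential(                     # stand-in for an ASR encoder
    nn.Linear(80, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128),
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize Linear weights to INT8
)

x = torch.randn(1, 80)                     # one frame of 80-dim features (illustrative)
print(quantized(x).shape)                  # torch.Size([1, 128])
```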
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
