Integrating Speech Recognition into Intelligent Information Systems: From Statistical Models to Deep Learning
Abstract
1. Introduction
- (1)
- Early Exploration and Template Matching Era: Initial efforts to develop ASR technology.
- (2)
- Statistical Model-Driven Period: Emergence of statistical frameworks as a key driver of progress.
- (3)
- Deep Learning Revolution Phase: Deep learning’s transformative impact on ASR development.
- (4)
- Era of Large Models under Different Learning Paradigms: Growth of innovative paradigms within large-scale architectures.
- Identify Trends: Systematically uncover domain-specific technological trends in ASR development across different learning paradigms and application domains.
- Evaluate Challenges: Critically assess persistent technical bottlenecks hindering progress, including both algorithmic limitations and practical deployment challenges in intelligent information systems.
- Propose Directions: Offer forward-looking perspectives for future theoretical and applied research, emphasizing the integration of ASR within complex, evolving intelligent system environments.
2. Early Exploration and Template Matching Era
3. Statistical-Model-Driven Period
4. Deep Learning Revolution Phase
- (1)
- Limited Modeling Capacity of the GMM: The GMM struggles to model high-dimensional, nonlinear speech features; it also lacks robustness in complex noisy environments and requires a large number of mixture components to represent intricate acoustic distributions.
- (2)
- Defects in Context Modeling [43]: The Markov assumption in HMM focuses on local state transitions, constraining its capacity to capture long-term dependencies in speech signals.
- (3)
- Dependence on Handcrafted Features: GMM-HMM relies on manually designed acoustic features (such as MFCC and PLP), which may discard key information from the original speech signal (see the feature-extraction sketch after this list).
- (4)
- Limitations of the n-gram Language Model [44]: Due to the Markov assumption, n-gram language models overlook long-range semantic dependencies, reducing their effectiveness in resolving ambiguity within complex contexts.
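As a concrete illustration of the handcrafted front end criticized in point (3), the following minimal sketch computes MFCC features with the librosa package (an assumed dependency; the file path is hypothetical):

```python
# Minimal sketch of the classic GMM-HMM front end, assuming librosa is installed
# and a local recording "utterance.wav" exists (hypothetical path).
import librosa
import numpy as np

# Load audio at 16 kHz, the sample rate typically assumed by ASR front ends.
waveform, sample_rate = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per 25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(
    y=waveform,
    sr=sample_rate,
    n_mfcc=13,
    n_fft=400,       # 25 ms analysis window at 16 kHz
    hop_length=160,  # 10 ms hop
)

# Delta and delta-delta coefficients are usually appended to recover some dynamics
# lost by the frame-level representation.
features = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])
print(features.shape)  # (39, num_frames)
```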
- (1)
- Early ANN–HMM Hybrids: In the 1990s, Bourlard and colleagues integrated artificial neural networks (ANNs) with HMMs. Constrained by limited computing resources and training efficiency, however, this approach did not substantially outperform the traditional GMM-HMM model.
- (2)
- Theoretical Groundwork for Deep Learning [47]: In 2006, Hinton’s team introduced a layer-wise pre-training strategy for deep belief networks (DBNs), easing the optimization of deep networks, though it was not yet applied to speech tasks.
- (3)
- The Maturity of Big Data and GPU Computing Power [48]: In 2009, the general-purpose GPU computing framework (CUDA 2.0) and the availability of thousand-hour corpora unleashed the potential of deep networks.
- (4)
- The Germination of End-to-End Learning: Graves et al. (2004) [49] incorporated long short-term memory (LSTM) networks into speech sequence modeling, suggesting an alternative to the traditional HMM alignment mechanism.
4.1. Introduction of Deep Neural Networks (DNNs)
4.2. Application of Recurrent Neural Networks (RNNs)
4.3. Fusion of Convolutional Neural Networks (CNNs)
4.4. Connectionist Temporal Classification (CTC) and Listen, Attend and Spell (LAS)
Technology | Description | Contribution |
---|---|---|
Recurrent Neural Aligner [76] | Based on an encoder-decoder framework, it introduces blank labels to define probability distributions and is trained using approximate dynamic programming and sampling techniques. | Achieves accuracy comparable to CTC word models on YouTube video transcription tasks; bidirectional models perform well in mobile dictation tasks. |
Neural Transducer [77] | Makes incremental predictions based on partial input and output sequences and is trained using dynamic programming algorithms. | Suitable for online tasks, with performance in the TIMIT phoneme recognition task on par with state-of-the-art models. |
RNN Transducer [78] | Comprises a transcription network and a prediction network, extends the output space with a blank symbol to define output distributions, and is trained using the forward-backward algorithm, with beam search during testing. | Achieves low error rates in the TIMIT phoneme recognition task, effectively integrating acoustic and linguistic information. |
Monotonic Chunkwise Attention [79] | Adaptively segments the input sequence into chunks, applies soft attention within each chunk, supports online decoding with linear time complexity, and can handle local reordering. | Delivers state-of-the-art performance in online speech recognition and significantly enhances performance in document summarization tasks. |
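To make the alignment-free objective behind these models concrete, the following minimal sketch computes a CTC loss with PyTorch (an assumed dependency); the tensor shapes and labels are illustrative only:

```python
# Minimal CTC training-step sketch, assuming PyTorch; values are illustrative.
import torch
import torch.nn as nn

T, N, C = 100, 2, 30  # time steps, batch size, vocabulary size (index 0 = blank)

# Stand-in for per-frame encoder outputs; a real model would produce these.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

targets = torch.randint(low=1, high=C, size=(N, 20))       # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)      # encoder frames per item
target_lengths = torch.full((N,), 20, dtype=torch.long)    # labels per item

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow through the soft alignment, no frame-level labels needed
```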
5. The Era of Large-Scale Models in Different Learning Paradigms
5.1. Necessity of Large-Scale Models
5.2. The Preeminent Leadership of the Transformer
- (1)
- High Computational Demand: The attention mechanism in the Transformer scales quadratically with sequence length, leading to a significant increase in computational cost when processing long speech inputs (a brief cost estimate follows this list).
- (2)
- Insufficient Capture of Short-Term Local Features: Local characteristics of speech signals, such as the instantaneous variations of phonemes, are crucial for accurate recognition. However, pure Transformer models are less effective in capturing these fine-grained temporal features compared to CNNs.
- (3)
- Architectural Confinement to Offline ASR: Because encoder-decoder Transformers require the complete speech utterance as input, they are largely confined to offline ASR and are difficult to deploy in streaming scenarios that demand real-time output, such as producing transcriptions shortly after each spoken word [88].
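A back-of-the-envelope calculation, sketched below under illustrative frame-rate and subsampling assumptions, shows how quickly the attention score matrix from point (1) grows with utterance length:

```python
# Illustration of quadratic attention cost; frame rate, subsampling factor, and
# precision are assumptions chosen only to make the arithmetic concrete.
def attention_matrix_mib(seconds, frames_per_second=100, subsampling=4, bytes_per_value=4):
    frames = seconds * frames_per_second // subsampling   # frames after front-end subsampling
    return frames * frames * bytes_per_value / (1024 ** 2)  # one head, one layer

for sec in (10, 30, 120):
    print(f"{sec:>4d} s utterance: {attention_matrix_mib(sec):8.1f} MiB per attention map")
# Quadrupling the utterance length multiplies the score matrix by roughly 16x.
```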
5.3. The Critical Role of Learning Paradigms in the Development of Large ASR Models
5.3.1. Large ASR Models Under Supervised Learning
5.3.2. Large ASR Models Under Semi-Supervised Learning
- Incomplete supervision: Only part of the data is labeled, e.g., a few labeled texts among massive unlabeled social media data (a pseudo-labeling sketch follows this list).
- Imprecise supervision: Annotations are coarse or high-level, such as labeling a bird image simply as “bird” without species information.
- Inaccurate supervision: Labels contain errors or noise, as in mislabeling medical images due to human mistakes.
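One widely used response to incomplete supervision is self-training with pseudo-labels; the sketch below assumes a hypothetical seed model exposing a transcribe() method that returns a transcript and a confidence score, and is only meant to show the selection loop:

```python
# Minimal self-training (pseudo-labeling) sketch for the incomplete-supervision case.
# `asr_model` and its transcribe() API are hypothetical; the threshold is illustrative.
def pseudo_label(asr_model, unlabeled_audio, confidence_threshold=0.9):
    """Return (audio, transcript) pairs the seed model is confident about."""
    selected = []
    for audio in unlabeled_audio:
        transcript, confidence = asr_model.transcribe(audio)  # hypothetical API
        if confidence >= confidence_threshold:
            selected.append((audio, transcript))
    return selected

# The selected pairs are mixed with the small labeled set and the model is retrained;
# repeating this loop gives iterative pseudo-labeling (self-training).
```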
5.3.3. Large ASR Models Under Self-Supervised Learning
5.4. The Selection and Trade-Off of Different Learning Paradigms
6. Frontier Technology Research and Future Directions
6.1. ASR Deployment Architectures: Edge, Cloud, and Hybrid Approaches
6.1.1. Edge Deployment Architecture
6.1.2. Cloud-Centric Architecture
6.1.3. Hybrid Edge–Cloud Architecture
6.2. Multimodal Fusion
6.3. Breakthroughs in Edge ASR Technology
Model | WER (%, Test-Clean) | WER (%, Test-Other) | Params |
---|---|---|---|
Squeezeformer-XS [162] | 3.74 | 9.09 | 9 M |
ContextNet-S [93] | 2.9 | 7.0 | 10.8 M |
QuartzNet-15x5 [163] | 3.9 | 11.28 | 19 M |
Zipformer-S [164] | 2.42 | 5.73 | 23.3 M |
Moonshine Tiny [165] | 4.52 | 11.7 | 27.1 M |
whisper.tiny.en [138] | 5.66 | 15.45 | 37.8 M |
E-Branchformer (B) [120] | 2.49 | 5.61 | 41.12 M |
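For orientation, the smallest Whisper variant listed above can be run with the open-source openai-whisper package (an assumed dependency); the audio path is hypothetical and the snippet is a minimal sketch rather than an optimized edge deployment:

```python
# Minimal CPU inference sketch with the openai-whisper package; "meeting.wav" is a
# hypothetical local file. tiny.en corresponds to the ~37.8 M-parameter row above.
import whisper

model = whisper.load_model("tiny.en")                   # small enough for CPU-only devices
result = model.transcribe("meeting.wav", fp16=False)    # fp16=False for CPU inference
print(result["text"])
```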
6.4. Combination with Spiking Neural Networks (SNNs)
7. Speech Recognition Applications in Intelligent Information Systems
Model Family | WER (%) † | RTF ‡ | SNR Robust § | Languages ‖ | Memory ¶ | Streaming # | Edge Deploy †† |
---|---|---|---|---|---|---|---|
Traditional RNN (DeepSpeech, CTC-RNN) | 5–12% | 0.2–0.4 | <10 dB | 1–5 | 1–2 GB | Yes | Feasible |
CNN-based (QuartzNet, ContextNet) | 2–4% | 0.1–0.3 | 10–20 dB | 1–10 | 1–4 GB | Yes | Optimal |
Hybrid CNN-RNN (Jasper, CitriNet) | 3–6% | 0.2–0.4 | 5–15 dB | 1–20 | 2–4 GB | Yes | Feasible |
Self-Supervised SSL (wav2vec2, HuBERT, WavLM) | 1.8–3% | 0.3–0.5 | 0–10 dB | 1–10 | 2–5 GB | Partial | Challenging |
Transformer Encoder (Conformer, Branchformer) | 1.8–2% | 0.2–0.4 | −5 to 15 dB | 20–50 | 3–6 GB | Variable | Feasible |
Enhanced Transformer (E-Branchformer, Zipformer) | 1.8–2.4% | 0.2–0.4 | −10 to 20 dB | 20–50 | 3–6 GB | Variable | Feasible |
Sequence-to-Sequence (LAS, Transformer S2S) | 2–8% | 0.4–1.0 | −5 to 15 dB | 20–99 | 4–8 GB | Partial | Challenging |
Large Multimodal (Whisper, Universal ASR) | 2.4–15% | 0.5–2.0 | −20 to 40 dB | 99+ | 1–10 GB | No | Limited |
Efficient/Mobile (SqueezeFormer, Moonshine) | 3–12% | 0.05–0.2 | 5–20 dB | 1–5 | <1–2 GB | Yes | Optimal |
Streaming-Optimized (Streaming Conformer, RNN-T) | 2.5–5% | 0.1–0.3 | 0–15 dB | 10–50 | 1–3 GB | Yes | Feasible |
Cross-Modal Fusion (Audio-Visual, Multi-modal) | 1.5–3% | 0.8–1.5 | −15 to 30 dB | 10–99 | 8–15 GB | Partial | Impractical |
ASR Paradigm | WER Range | Latency (ms) | RTF | Noise Robust | Languages | Memory (GB) | Deployment Scenario |
---|---|---|---|---|---|---|---|
CTC-based Models (DeepSpeech, QuartzNet) | 3–8% | 50–200 | 0.1–0.3 | Medium (>5 dB SNR) | 1–20 | 1–3 | Edge devices, streaming applications |
RNN-Transducer (Conformer-RNN-T) | 2–5% | 80–300 | 0.2–0.4 | High (0–15 dB) | 1–50 | 2–4 | Real-time streaming, mobile applications |
Conformer Encoder (Transformer-based) | 1.8–3% | 100–400 | 0.2–0.4 | Very High (−5 to 20 dB) | 20–99 | 3–6 | High-accuracy offline, server deployment |
SSL Pretrained (wav2vec2, HuBERT) | 1.5–2.5% | 200–600 | 0.3–0.6 | Excellent (−10 to 25 dB) | 10–50 | 4–8 | Research, fine-tuning for specific domains |
Whisper-class (Large Multimodal) | 2–15% (varies) | 500–2000 | 0.5–2.0 | Excellent (−20 to 40 dB) | 99+ | 1–10 (varies by size) | Multilingual, robust general-purpose |
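The WER and RTF columns in these tables are typically computed as sketched below, assuming the jiwer package for the edit-distance computation; the transcripts and timings are illustrative:

```python
# Minimal sketch of WER and RTF computation, assuming the jiwer package.
import time
import jiwer

reference = "turn on the living room lights"
hypothesis = "turn on the living room light"

wer = jiwer.wer(reference, hypothesis)   # word error rate: (S + D + I) / N
print(f"WER = {wer:.2%}")                # one substitution over six words, about 16.67%

audio_duration = 3.2                     # seconds of speech (illustrative)
start = time.perf_counter()
# ... run the ASR model on the utterance here ...
elapsed = time.perf_counter() - start
rtf = elapsed / audio_duration           # real-time factor: processing time / audio duration
print(f"RTF = {rtf:.3f}  (values below 1.0 are faster than real time)")
```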
7.1. Privacy, Security, and Safety Considerations
7.1.1. Privacy-Preserving Deployment Strategies
7.1.2. Security and Safety Mechanisms
7.1.3. Concrete Mitigation Strategies
- Federated Learning with Privacy Preservation: Implement federated ASR training frameworks that enable collaborative model improvement without exposing individual voice samples [4,5]. This approach allows organizations to benefit from collective learning while maintaining strict data governance requirements.
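A minimal sketch of the federated-averaging step underlying such frameworks is shown below, assuming PyTorch clients with identical model architectures; in this setting only parameter updates, never raw audio, would leave the device:

```python
# Minimal FedAvg aggregation sketch, assuming PyTorch models with identical architectures.
import copy

def federated_average(client_models, client_weights):
    """Weighted average of client state_dicts; weights are typically per-client data sizes."""
    total = float(sum(client_weights))
    averaged = copy.deepcopy(client_models[0].state_dict())
    for name, tensor in averaged.items():
        if tensor.is_floating_point():  # average trainable tensors, leave integer buffers as-is
            averaged[name] = sum(
                (w / total) * m.state_dict()[name]
                for m, w in zip(client_models, client_weights)
            )
    return averaged

# Usage (illustrative): global_model.load_state_dict(federated_average(clients, sizes))
```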
7.2. Healthcare Information Systems
7.3. Educational Technology Systems
7.4. Smart Home and IoT Ecosystems
7.5. Enterprise and Business Intelligence Systems
7.6. Automotive and Transportation Systems
8. Representative Deployment Blueprint
8.1. System Flow and Reference Configuration Table
8.2. Supporting Evidence from Prior Work
- Sub-8-bit Quantization: Zhen et al. [192] report that sub-8-bit quantization for Conformer/RNN-Transducer models reduces user-perceived latency by up to 31.75% compared to standard 8-bit QAT.
- Mixed-Precision Conformer: Ding et al. [193] achieve a model size reduction using mixed 4-bit/8-bit QAT with only minor accuracy loss, improving real-time inference performance.
- Streaming FastConformer: NVIDIA’s FastConformer with cache-based inference [194] demonstrates significant latency reductions for streaming ASR compared to baseline Conformer models.
- Sherpa-ONNX INT8 Models: The Sherpa-ONNX project provides open-source Conformer-Transducer INT8 models for streaming ASR on edge devices, illustrating practical deployment scenarios [195].
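As a simpler, framework-level illustration of the same direction, the sketch below applies post-training dynamic INT8 quantization in PyTorch; it is not the sub-8-bit or mixed-precision QAT of [192,193], only a baseline that already shrinks weights and speeds up CPU inference:

```python
# Minimal post-training dynamic quantization sketch, assuming PyTorch;
# the LSTM stack is a stand-in for a real ASR encoder.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=80, hidden_size=512, num_layers=2, batch_first=True)

quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.LSTM, nn.Linear},   # layer types whose weights are stored in INT8
    dtype=torch.qint8,
)
# Weights are dequantized on the fly at inference time; the usual outcome is a
# roughly 4x smaller checkpoint and faster CPU inference at a small accuracy cost.
```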
8.3. Mini-Playbook
9. Critical Open Challenges and Future Directions
9.1. Performance Disparities in Low-Resource Languages
9.2. Robustness Under Acoustic and Conversational Variability
9.3. Catastrophic Forgetting in Continual Learning Settings
9.4. On-Device Deployment Constraints
9.5. Domain-Specific Adaptation Bottlenecks
9.6. Implications for Future Research
10. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Acronym and Terminology Glossary
Key Acronyms and Technical Terms in ASR
Acronym | Full Term | Definition |
---|---|---|
AM | Acoustic Model | Component mapping audio features to phonetic units |
ANN | Artificial Neural Network | Computational model inspired by biological neural networks |
ASR | Automatic Speech Recognition | Technology that converts spoken language into text |
CER | Character Error Rate | Character-level accuracy metric, especially for Asian languages |
CNN | Convolutional Neural Network | Network using convolution operations for local feature extraction |
CTC | Connectionist Temporal Classification | Alignment-free training objective for sequence labeling |
DNN | Deep Neural Network | Multi-layer neural network with >3 hidden layers |
GMM | Gaussian Mixture Model | Probabilistic model using multiple Gaussian distributions |
GRU | Gated Recurrent Unit | Simplified alternative to LSTM with fewer parameters |
HMM | Hidden Markov Model | Statistical model for temporal sequence modeling |
LAS | Listen, Attend and Spell | Encoder-decoder architecture with attention mechanism |
LM | Language Model | Statistical model predicting word sequence probabilities |
LSTM | Long Short-Term Memory | RNN variant addressing vanishing gradient problem |
MAP | Maximum A Posteriori | Statistical estimation method incorporating prior knowledge |
MBR | Minimum Bayes Risk | Training objective minimizing expected error rate |
MFCC | Mel-Frequency Cepstral Coefficients | Traditional acoustic features based on human auditory perception |
MLE | Maximum Likelihood Estimation | Statistical method finding parameters maximizing likelihood |
MLP | Multi-Layer Perceptron | Feedforward neural network with multiple hidden layers |
PER | Phoneme Error Rate | Phoneme-level accuracy metric for phoneme recognition |
RNN | Recurrent Neural Network | Network with recurrent connections for sequence processing |
RNN-T | RNN Transducer | Sequence-to-sequence model for streaming ASR |
RTF | Real-Time Factor | Ratio of processing time to audio duration |
SNR | Signal-to-Noise Ratio | Measure of signal quality relative to background noise |
SNN | Spiking Neural Network | Third-generation neural network using discrete spike events |
SOTA | State-of-the-Art | Current best performance on benchmark tasks |
SSL | Self-Supervised Learning | Learning paradigm using unlabeled data with pretext tasks |
STFT | Short-Time Fourier Transform | Time-frequency analysis technique for audio signals |
WER | Word Error Rate | Primary accuracy metric: percentage of incorrectly recognized words |
References
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Li, J.; Wu, Y.; Gaur, Y.; Wang, C.; Zhao, R.; Liu, S. On the comparison of popular end-to-end models for large scale speech recognition. arXiv 2020, arXiv:2005.14327. [Google Scholar]
- Tsai, Y.-H.H.; Ma, M.Q.; Yang, M.; Zhao, H.; Morency, L.-P.; Salakhutdinov, R. Self-supervised representation learning with relative predictive coding. arXiv 2021, arXiv:2103.11275. [Google Scholar] [CrossRef]
- Guliani, D.; Beaufays, F.; Motta, G. Training speech recognition models with federated learning: A quality/cost framework. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3080–3084. [Google Scholar]
- Nguyen, T.; Mdhaffar, S.; Tomashenko, N.; Bonastre, J.-F.; Estève, Y. Federated learning for ASR based on wav2vec 2.0. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Xu, M.; Song, C.; Tian, Y.; Agrawal, N.; Granqvist, F.; van Dalen, R.; Zhang, X.; Argueta, A.; Han, S.; Deng, Y. Training large-vocabulary neural language models by private federated learning for resource-constrained devices. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Hegdepatil, P.; Davuluri, K. Business intelligence based novel marketing strategy approach using automatic speech recognition and text summarization. In Proceedings of the 2021 2nd International Conference on Computing and Data Science (CDS), Stanford, CA, USA, 28–29 January 2021; pp. 595–602. [Google Scholar]
- Mohan, A.; Rose, R.; Ghalehjegh, S.H.; Umesh, S. Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain. Speech Commun. 2014, 56, 167–180. [Google Scholar] [CrossRef]
- Seljan, S.; Dunđer, I. Combined automatic speech recognition and machine translation in business correspondence domain for English-Croatian. Int. J. Ind. Syst. Eng. 2014, 8, 1980–1986. [Google Scholar]
- Vajpai, J.; Bora, A. Industrial applications of automatic speech recognition systems. Int. J. Eng. Res. Appl. 2016, 6, 88–95. [Google Scholar]
- Kheddar, H.; Hemis, M.; Himeur, Y.; Megías, D.; Amira, A. Automatic speech recognition using advanced deep learning approaches: A survey. Inf. Fusion 2024, 109, 102422. [Google Scholar] [CrossRef]
- Dhanjal, A.S.; Singh, W. A comprehensive survey on automatic speech recognition using neural networks. Multimed. Tools Appl. 2024, 83, 23367–23412. [Google Scholar] [CrossRef]
- Zahorian, S.; Karnjanadecha, M. Trends and developments in automatic speech recognition research. Comput. Speech Lang. 2023, 84, 101572. [Google Scholar]
- Khapra, C. A survey on end-to-end speech recognition systems. Int. J. Comput. Inf. Technol. 2024, 5, 100–110. [Google Scholar] [CrossRef]
- Kumar, A.; Verma, S.; Mangla, H. A survey of deep learning techniques in speech recognition. In Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 12–13 October 2018; pp. 179–185. [Google Scholar]
- Malik, M.; Malik, M.K.; Mehmood, K.; Makhdoom, I. Automatic speech recognition: A survey. Multimed. Tools Appl. 2021, 80, 9411–9457. [Google Scholar] [CrossRef]
- Prabhavalkar, R.; Hori, T.; Sainath, T.N.; Schlüter, R.; Watanabe, S. End-to-end speech recognition: A survey. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 325–351. [Google Scholar] [CrossRef]
- Davis, K.H.; Biddulph, R.; Balashek, S. Automatic recognition of spoken digits. J. Acoust. Soc. Am. 1952, 24, 637–642. [Google Scholar] [CrossRef]
- Forgie, J.W.; Forgie, C.D. Results obtained from a vowel recognition computer program. J. Acoust. Soc. Am. 1959, 31, 1480–1489. [Google Scholar] [CrossRef]
- Olson, H.F.; Belar, H. Phonetic typewriter. J. Acoust. Soc. Am. 1956, 28, 1072–1081. [Google Scholar] [CrossRef]
- Olson, H.F.; Belar, H. Phonetic typewriter III. J. Acoust. Soc. Am. 1961, 33, 1610–1615. [Google Scholar] [CrossRef]
- Atal, B.S.; Hanauer, S.L. Speech Analysis and Synthesis by Linear Prediction of the Speech Wave. J. Acoust. Soc. Am. 1971, 50, 637–655. [Google Scholar] [CrossRef]
- Suzuki, J. Recognition of Japanese vowels. J. Radio Res. Lab. 1961, 8, 193–211. [Google Scholar]
- Sakai, T.; Doshita, S. Phonetic typewriter. J. Acoust. Soc. Am. 1961, 33 (Suppl. 11), 1664. [Google Scholar] [CrossRef]
- Itakura, F. Minimum prediction residual principle applied to speech recognition. IEEE Trans. Acoust. Speech Signal Process. 1975, 23, 67–72. [Google Scholar] [CrossRef]
- Sakoe, H.; Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 43–49. [Google Scholar] [CrossRef]
- Klatt, D.H. Review of the ARPA speech understanding project. J. Acoust. Soc. Am. 1977, 62, 1345–1366. [Google Scholar] [CrossRef]
- Rabiner, L.; Juang, B.-H. Fundamentals of Speech Recognition; Prentice-Hall, Inc.: Englewood Cliffs, NJ, USA, 1993; pp. 380–410. ISBN 0130151572. [Google Scholar]
- Myers, C.; Rabiner, L. A level building dynamic time warping algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 284–297. [Google Scholar] [CrossRef]
- Myers, C.; Rabiner, L.; Rosenberg, A. An investigation of the use of dynamic time warping for word spotting and connected speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Denver, CO, USA, 9–11 April 1980; pp. 173–177. [Google Scholar]
- Lee, C.-H.; Rabiner, L.R. A frame-synchronous network search algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 1649–1658. [Google Scholar] [CrossRef]
- Bridle, J.S.; Brown, M.; Chamberlain, R. An Algorithm for Connected Word Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Paris, France, 3–5 April 1982; pp. 899–902. [Google Scholar]
- Lowerre, B.T. The Harpy Speech Recognition System; Carnegie Mellon University: Pittsburgh, PA, USA, 1976. [Google Scholar]
- Rabiner, L.; Juang, B. An introduction to hidden Markov models. IEEE ASSP Mag. 1986, 3, 4–16. [Google Scholar] [CrossRef]
- Forney, G.D. The Viterbi algorithm. Proc. IEEE 1973, 61, 268–278. [Google Scholar] [CrossRef]
- Juang, B.H.; Rabiner, L.R. Hidden Markov models for speech recognition. Technometrics 1991, 33, 251–272. [Google Scholar] [CrossRef]
- Lou, H.-L. Implementing the Viterbi algorithm. IEEE Signal Process. Mag. 1995, 12, 42–52. [Google Scholar] [CrossRef]
- Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
- Gauvain, J.L.; Lee, C.-H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 1994, 2, 291–298. [Google Scholar] [CrossRef]
- Milvus. What Is the History of Speech Recognition Technology? Available online: https://milvus.io/ai-quick-reference/what-is-the-history-of-speech-recognition-technology (accessed on 20 August 2025).
- Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 2011, 20, 30–42. [Google Scholar] [CrossRef]
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
- Graves, A.; Mohamed, A.-r.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
- Sutskever, I.; Martens, J.; Hinton, G.E. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 1017–1024. [Google Scholar]
- Bourlard, H.A.; Morgan, N. Connectionist Speech Recognition: A Hybrid Approach; Springer Science & Business Media: New York, NY, USA, 2012; Volume 247, pp. 155–182. ISBN 1461532108. [Google Scholar]
- Trentin, E.; Gori, M. A survey of hybrid ANN/HMM models for automatic speech recognition. Neurocomputing 2001, 37, 91–126. [Google Scholar] [CrossRef]
- Hinton, G.E.; Osindero, S.; Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
- Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A. Deep speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567. [Google Scholar]
- Graves, A.; Eck, D.; Beringer, N.; Schmidhuber, J. Biologically plausible speech recognition with LSTM neural nets. In Proceedings of the Biologically Inspired Approaches to Advanced Information Technology: First International Workshop (BioADIT), Lausanne, Switzerland, 29–30 January 2004; pp. 127–136. [Google Scholar]
- Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
- Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558. [Google Scholar] [CrossRef]
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation; Institute for Cognitive Science, University of California: San Diego, CA, USA, 1985. [Google Scholar]
- Yuan, Q.; Dai, Y.; Li, G. Exploration of English speech translation recognition based on the LSTM RNN algorithm. Neural Comput. Appl. 2023, 35, 24961–24970. [Google Scholar] [CrossRef]
- Bengio, Y.; Simard, P.; Frasconi, P. Learning Long-Term Dependencies with Gradient Descent Is Difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G. Deep Speech 2: End-to-end Speech Recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 173–182. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Tech. Rep. N 1993, 93, 27403. [Google Scholar]
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
- Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar] [CrossRef]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
- Tombaloğlu, B.; Erdem, H. Turkish speech recognition techniques and applications of recurrent units (LSTM and GRU). Gazi Univ. J. Sci. 2021, 34, 1035–1049. [Google Scholar] [CrossRef]
- Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; pp. 1764–1772. [Google Scholar]
- Hau, D.; Chen, K. Exploring hierarchical speech representations with a deep convolutional neural network. In Proceedings of the 11th Annual Workshop on Computational Intelligence (UKCI), Manchester, UK, 7–9 September 2011; pp. 31–37. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional Neural Networks for Speech Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [Google Scholar] [CrossRef]
- Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. Phoneme Recognition Using Time-Delay Neural Networks. Backpropagation; Psychology Press: New York, NY, USA, 2013; pp. 35–61. ISBN 978-1-84872-863-9. [Google Scholar]
- Sainath, T.N.; Kingsbury, B.; Saon, G.; Soltau, H.; Mohamed, A.; Dahl, G.; Ramabhadran, B. Deep convolutional neural networks for large-scale speech tasks. Neural Netw. 2015, 64, 39–48. [Google Scholar] [CrossRef]
- Shon, S.; Ali, A.; Glass, J. Convolutional neural networks and language embeddings for end-to-end dialect recognition. arXiv 2018, arXiv:1803.04567. [Google Scholar]
- Ghafoor, K.J.; Rawf, K.M.H.; Abdulrahman, A.O.; Taher, S.H. Kurdish dialect recognition using 1D CNN. ARO Sci. J. Koya Univ. 2021, 9, 10–14. [Google Scholar] [CrossRef]
- Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar] [CrossRef]
- Passricha, V.; Aggarwal, R.K. A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition. J. Intell. Syst. 2019, 29, 1261–1274. [Google Scholar] [CrossRef]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964. [Google Scholar]
- Sak, H.; Shannon, M.; Rao, K.; Beaufays, F. Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1298–1302. [Google Scholar]
- Jaitly, N.; Le, Q.V.; Vinyals, O.; Sutskever, I.; Sussillo, D.; Bengio, S. An online sequence-to-sequence model using partial conditioning. Adv. Neural Inf. Process. Syst. 2016, 29, 5074–5082. [Google Scholar]
- Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711. [Google Scholar] [CrossRef]
- Chiu, C.-C.; Raffel, C. Monotonic chunkwise attention. arXiv 2017, arXiv:1712.05382. [Google Scholar]
- Chanchaochai, N.; Cieri, C.; Debrah, J.; Liberman, M.; Graff, D.; Lee, J.; Walker, K.; Walter, T.; Wu, J. GlobalTIMIT: Acoustic-Phonetic Datasets for the World’s Languages. In Proceedings of the INTERSPEECH 2018, Hyderabad, India, 2–6 September 2018; pp. 192–196. [Google Scholar]
- Rydning, D.R.-J.G.; Gantz, J. The digitization of the world from edge to core. Framingham: Int. Data Corp. 2018, 16, 1–28. [Google Scholar]
- Halevy, A.; Norvig, P.; Pereira, F. The unreasonable effectiveness of data. IEEE Intell. Syst. 2009, 24, 8–12. [Google Scholar] [CrossRef]
- Najafabadi, M.M.; Villanustre, F.; Khoshgoftaar, T.M.; Seliya, N.; Wald, R.; Muharemagic, E. Deep learning applications and challenges in big data analytics. J. Big Data 2015, 2, 1–21. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
- Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. Adv. Neural Inf. Process. Syst. 2015, 28, 577–585. [Google Scholar]
- Moritz, N.; Hori, T.; Le, J. Streaming automatic speech recognition with the transformer model. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 6074–6078. [Google Scholar]
- Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar] [CrossRef]
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L. Rethinking attention with performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
- Han, W.; Zhang, Z.; Zhang, Y.; Yu, J.; Chiu, C.-C.; Qin, J.; Gulati, A.; Pang, R.; Wu, Y. Contextnet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv 2020, arXiv:2005.03191. [Google Scholar] [CrossRef]
- Peng, Y.; Dalmia, S.; Lane, I.; Watanabe, S. Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 17627–17643. [Google Scholar]
- Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; Kumar, S. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7829–7833. [Google Scholar]
- Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
- Paul, D.B.; Baker, J. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Speech and Natural Language: Proceedings of a Workshop, Harriman, New York, NY, USA, 23–26 February 1992; p. 357. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Hugging Face. Wav2Vec2-Conformer. Available online: https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer (accessed on 14 August 2025).
- Chen, Z.; Ramabhadran, B.; Biadsy, F.; Zhang, X.; Chen, Y.; Jiang, L.; Moreno, P.J. Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech. In Proceedings of the INTERSPEECH 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4828–4832. [Google Scholar]
- Google Cloud. Migrate from classic to Conformer Models. Available online: https://cloud.google.com/speech-to-text/docs/conformer-migration (accessed on 20 August 2025).
- Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
- Gu, Y.; Shivakumar, P.G.; Kolehmainen, J.; Brusco, P.; Sim, K.C.; Ramabhadran, B.; Picheny, M. Scaling Laws for Discriminative Speech Recognition Rescoring Models. arXiv 2023, arXiv:2306.15815. [Google Scholar] [CrossRef]
- Subbaswamy, A.; Saria, S. Counterfactual Normalization: Proactively Addressing Dataset Shift Using Causal Mechanisms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Monterey, CA, USA, 6–10 August 2018; pp. 947–957. [Google Scholar]
- Xu, K.-T.; Xie, F.-L.; Tang, X.; Hu, Y. FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration. arXiv 2025, arXiv:2501.14350. [Google Scholar]
- Bai, Y.; Chen, J.; Chen, J.; Chen, W.; Chen, Z.; Ding, C.; Dong, L.; Dong, Q.; Du, Y.; Gao, K. Seed-asr: Understanding Diverse Speech and Contexts with LLM-Based Speech Recognition. arXiv 2024, arXiv:2407.04675. [Google Scholar]
- Shakhadri, S.A.G.; Kr, K.; Angadi, K.B. Samba-asr state-of-the-art speech recognition leveraging structured state-space models. arXiv 2025, arXiv:2501.02832. [Google Scholar]
- Hwang, D. FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information. arXiv 2024, arXiv:2405.12807. [Google Scholar] [CrossRef]
- Zhang, Y.; Qin, J.; Park, D.S.; Han, W.; Chiu, C.-C.; Pang, R.; Le, Q.V.; Wu, Y. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv 2020, arXiv:2010.10504. [Google Scholar]
- Chung, Y.-A.; Zhang, Y.; Han, W.; Chiu, C.-C.; Qin, J.; Pang, R.; Wu, Y. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 244–250. [Google Scholar]
- Rekesh, D.; Koluguri, N.R.; Kriman, S.; Majumdar, S.; Noroozi, V.; Huang, H.; Hrinchuk, O.; Puvvada, K.; Kumar, A.; Balam, J. Fast conformer with linearly scalable attention for efficient speech recognition. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023; pp. 1–8. [Google Scholar]
- Xu, Q.; Baevski, A.; Likhomanenko, T.; Tomasello, P.; Conneau, A.; Collobert, R.; Synnaeve, G.; Auli, M. Self-training and pre-training are complementary for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 6–11 June 2021; pp. 3030–3034. [Google Scholar]
- Park, D.S.; Zhang, Y.; Jia, Y.; Han, W.; Chiu, C.-C.; Li, B.; Wu, Y.; Le, Q.V. Improved noisy student training for automatic speech recognition. arXiv 2020, arXiv:2005.09629. [Google Scholar] [CrossRef]
- Chan, W.; Park, D.; Lee, C.; Zhang, Y.; Le, Q.; Norouzi, M. Speechstew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network. arXiv 2021, arXiv:2104.02133. [Google Scholar] [CrossRef]
- Pan, J.; Shapiro, J.; Wohlwend, J.; Han, K.J.; Lei, T.; Ma, T. ASAPP-ASR: Multistream CNN and self-attentive SRU for SOTA speech recognition. arXiv 2020, arXiv:2005.10469. [Google Scholar]
- Fathullah, Y.; Wu, C.; Shangguan, Y.; Jia, J.; Xiong, W.; Mahadeokar, J.; Liu, C.; Shi, Y.; Kalinli, O.; Seltzer, M. Multi-head state space model for speech recognition. arXiv 2023, arXiv:2305.12498. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
- Kim, K.; Wu, F.; Peng, Y.; Pan, J.; Sridhar, P.; Han, K.J.; Watanabe, S. E-branchformer: Branchformer with enhanced merging for speech recognition. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2022; pp. 84–91. [Google Scholar]
- Yao, Z.; Kang, W.; Yang, X.; Kuang, F.; Guo, L.; Zhu, H.; Jin, Z.; Li, Z.; Lin, L.; Povey, D. CR-CTC: Consistency regularization on CTC for improved speech recognition. arXiv 2024, arXiv:2410.05101. [Google Scholar] [CrossRef]
- Akmal, H.M.; Chao, X.; Mehdi, R. Transformer-based ASR incorporating time-reduction layer and fine-tuning with self-knowledge distillation. arXiv 2021, arXiv:2103.09903. [Google Scholar]
- Liu, C.; Zhang, F.; Le, D.; Kim, S.; Saraf, Y.; Zweig, G. Improving RNN transducer based ASR with auxiliary tasks. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 172–179. [Google Scholar]
- Baevski, A.; Hsu, W.-N.; Xu, Q.; Babu, A.; Gu, J.; Auli, M. Data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 1298–1312. [Google Scholar]
- Xu, Q.; Likhomanenko, T.; Kahn, J.; Hannun, A.; Synnaeve, G.; Collobert, R. Iterative pseudo-labeling for speech recognition. arXiv 2020, arXiv:2005.09267. [Google Scholar] [CrossRef]
- Synnaeve, G.; Xu, Q.; Kahn, J.; Likhomanenko, T.; Grave, E.; Pratap, V.; Sriram, A.; Liptchinsky, V.; Collobert, R. End-to-end asr: From supervised to semi-supervised learning with modern architectures. arXiv 2019, arXiv:1911.08460. [Google Scholar]
- Zhang, F.; Wang, Y.; Zhang, X.; Liu, C.; Saraf, Y.; Zweig, G. Faster, simpler and more accurate hybrid asr systems using wordpieces. arXiv 2020, arXiv:2005.09150. [Google Scholar] [CrossRef]
- Nartey, O.T.; Yang, G.; Asare, S.K.; Wu, J.; Frempong, L.N. Robust semi-supervised traffic sign recognition via self-training and weakly-supervised learning. Sensors 2020, 20, 2684. [Google Scholar] [CrossRef]
- Souly, N.; Spampinato, C.; Shah, M. Semi and weakly supervised semantic segmentation using generative adversarial network. arXiv 2017, arXiv:1703.09695. [Google Scholar] [CrossRef]
- Ren, Z.; Wang, S.; Zhang, Y. Weakly supervised machine learning. CAAI Trans. Intell. Technol. 2023, 8, 549–580. [Google Scholar] [CrossRef]
- Zhou, Z.-H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2018, 5, 44–53. [Google Scholar] [CrossRef]
- Merz, C.J.; Clair, D.C.S.; Bond, W.E. Semi-supervised adaptive resonance theory (smart2). In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Baltimore, MD, USA, 7–11 June 1992; Volume 3, pp. 851–856. [Google Scholar]
- Chapelle, O.; Scholkopf, B.; Zien, A. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Trans. Neural Netw. 2009, 20, 542. [Google Scholar] [CrossRef]
- Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the Workshop on Challenges in Representation Learning, Atlanta, GA, USA, 16–21 June 2013; p. 896. [Google Scholar]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar] [CrossRef]
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
- Zhang, Y.; Park, D.S.; Han, W.; Qin, J.; Gulati, A.; Shor, J.; Jansen, A.; Xu, Y.; Huang, Y.; Wang, S. Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE J. Sel. Top. Signal Process. 2022, 16, 1519–1532. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Xie, Q.; Luong, M.-T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
- Bucci, S.; D’Innocente, A.; Liao, Y.; Carlucci, F.M.; Caputo, B.; Tommasi, T. Self-Supervised Learning across Domains. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5516–5528. [Google Scholar] [CrossRef]
- Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144. [Google Scholar]
- Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 117–128. [Google Scholar] [CrossRef]
- Xu, M.; Jin, A.; Wang, S.; Su, M.; Ng, T.; Mason, H.; Han, S.; Lei, Z.; Deng, Y.; Huang, Z. Conformer-based speech recognition on extreme edge-computing devices. arXiv 2023, arXiv:2312.10359. [Google Scholar]
- AssemblyAI. Conformer-2: A State-of-the-Art Speech Recognition Model Trained on 1.1M hours of Data. AssemblyAI Technical Blog, 2023. Available online: https://www.assemblyai.com/blog/conformer-2/ (accessed on 15 January 2025).
- Miao, H.; Cheng, G.; Zhang, P.; Yan, Y. Online Hybrid CTC/attention End-to-End Automatic Speech Recognition Architecture. arXiv 2023, arXiv:2307.02351. [Google Scholar] [CrossRef]
- Bao, C.; Huo, C.; Chen, Q.; Gao, C. AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition. arXiv 2025, arXiv:2506.06566. [Google Scholar]
- NVIDIA. What Is Automatic Speech Recognition? NVIDIA Technical Blog, 2023. Available online: https://developer.nvidia.com/blog/essential-guide-to-automatic-speech-recognition-technology/ (accessed on 15 January 2025).
- Wang, H.; Guo, P.; Zhou, P.; Xie, L. Mlca-avsr: Multi-layer cross attention fusion based audio-visual speech recognition. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 8150–8154. [Google Scholar]
- Chen, C.; Li, R.; Hu, Y.; Siniscalchi, S.M.; Chen, P.-Y.; Chng, E.; Yang, C.-H.H. It’s never too late: Fusing acoustic information into large language models for automatic speech recognition. arXiv 2024, arXiv:2402.05457. [Google Scholar]
- Seo, P.H.; Nagrani, A.; Schmid, C. Avformer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 22922–22931. [Google Scholar]
- Hu, J.; Li, Z.; Wang, P.; Wang, J.; Li, X.; Zhao, W. VHASR: A Multimodal Speech Recognition System with Vision Hotwords. arXiv 2023, arXiv:2410.00822. [Google Scholar]
- Gabeur, V.; Seo, P.H.; Nagrani, A.; Schmid, C.; Vedaldi, A. Avatar: Unconstrained Audiovisual Speech Recognition. arXiv 2022, arXiv:2206.07684. [Google Scholar] [CrossRef]
- Xu, B.; Lu, C.; Guo, Y.; Wang, J. Discriminative multi-modality speech recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14433–14442. [Google Scholar]
- Afouras, T.; Chung, J.S.; Zisserman, A. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv 2018, arXiv:1809.00496. [Google Scholar]
- Yang, G.; Ma, Z.; Yu, F.; Gao, Z.; Zhang, S.; Chen, X. Mala-asr: Multimedia-assisted llm-based asr. arXiv 2024, arXiv:2406.05839. [Google Scholar]
- Wang, H.; Yu, F.; Shi, X.; Wang, Y.; Zhang, S.; Li, M. SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11076–11080. [Google Scholar]
- Qin, R.; Liu, D.; Xu, G.; Yan, Z.; Xu, C.; Hu, Y.; Hu, X.S.; Xiong, J.; Shi, Y. Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge. arXiv 2024, arXiv:2411.13766. [Google Scholar] [CrossRef]
- Manepalli, S.G.; Whitenack, D.; Nemecek, J. DYN-ASR: Compact, multilingual speech recognition via spoken language and accent identification. In Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA, 14–31 July 2021; pp. 830–835. [Google Scholar]
- Kim, S.; Gholami, A.; Shaw, A.; Lee, N.; Mangalam, K.; Malik, J.; Mahoney, M.W.; Keutzer, K. Squeezeformer: An efficient transformer for automatic speech recognition. Adv. Neural Inf. Process. Syst. 2022, 35, 9361–9373. [Google Scholar]
- Kriman, S.; Beliaev, S.; Ginsburg, B.; Huang, J.; Kuchaiev, O.; Lavrukhin, V.; Leary, R.; Li, J.; Zhang, Y. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 6124–6128. [Google Scholar]
- Yao, Z.; Guo, L.; Yang, X.; Kang, W.; Kuang, F.; Yang, Y.; Jin, Z.; Lin, L.; Povey, D. Zipformer: A faster and better encoder for automatic speech recognition. arXiv 2023, arXiv:2310.11230. [Google Scholar]
- Jeffries, N.; King, E.; Kudlur, M.; Nicholson, G.; Wang, J.; Warden, P. Moonshine: Speech Recognition for Live Transcription and Voice Commands. arXiv 2024, arXiv:2410.15608. [Google Scholar] [CrossRef]
- Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Netw. 1997, 10, 1659–1671. [Google Scholar] [CrossRef]
- Tavanaei, A.; Ghodrati, M.; Kheradpisheh, S.R.; Masquelier, T.; Maida, A. Deep learning in spiking neural networks. Neural Netw. 2019, 111, 47–63. [Google Scholar] [CrossRef]
- Caporale, N.; Dan, Y. Spike Timing–Dependent Plasticity: A Hebbian Learning Rule. Annu. Rev. Neurosci. 2008, 31, 25–46. [Google Scholar] [CrossRef]
- Auge, D.; Hille, J.; Kreutz, F.; Mueller, E.; Knoll, A. End-to-End Spiking Neural Network for Speech Recognition Using Resonating Input Neurons. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Bratislava, Slovakia, 14–17 September 2021; pp. 245–256. [Google Scholar]
- Wu, J.; Yılmaz, E.; Zhang, M.; Li, H.; Tan, K.C. Deep spiking neural networks for large vocabulary automatic speech recognition. Front. Neurosci. 2020, 14, 199. [Google Scholar] [CrossRef]
- Wang, Q.; Zhang, T.; Han, M.; Wang, Y.; Zhang, D.; Xu, B. Complex dynamic neurons improved spiking transformer network for efficient automatic speech recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 102–109. [Google Scholar]
- Irugalbandara, C.; Naseem, A.S.; Perera, S.; Kiruthikan, S.; Logeeshan, V. A Secure and Smart Home Automation System with Speech Recognition and Power Measurement Capabilities. Sensors 2023, 23, 5784. [Google Scholar] [CrossRef]
- Kumar, Y. A Comprehensive Analysis of Speech Recognition Systems in Healthcare: Current Research Challenges and Future Prospects. SN Comput. Sci. 2024, 5, 137. [Google Scholar] [CrossRef]
- Le-Duc, K. Vietmed: A dataset and benchmark for automatic speech recognition of vietnamese in the medical domain. arXiv 2024, arXiv:2404.05659. [Google Scholar] [CrossRef]
- Korfiatis, A.P.; Moramarco, F.; Sarac, R.; Cuendet, M.A.; Chary, M.; Velupillai, S.; Nenadic, G.; Gkotsis, G. Primock57: A Dataset of Primary Care Mock Consultations. arXiv 2022, arXiv:2204.00333. [Google Scholar] [CrossRef]
- Adedeji, A.; Sanni, M.; Ayodele, E.; Joshi, S.; Olatunji, T. The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? arXiv 2025, arXiv:2501.15310. [Google Scholar] [CrossRef]
- Bashori, M.; van Hout, R.; Strik, H.; Cucchiarini, C. I Can Speak: Improving English Pronunciation through Automatic Speech Recognition-Based Language Learning Systems. Innov. Lang. Learn. Teach. 2024, 18, 443–461. [Google Scholar] [CrossRef]
- Sun, W. The Impact of Automatic Speech Recognition Technology on Second Language Pronunciation and Speaking Skills of EFL Learners: A Mixed Methods Investigation. Front. Psychol. 2023, 14, 1210187. [Google Scholar] [CrossRef]
- Cai, Y. The Application of Automatic Speech Recognition Technology in English as Foreign. In Proceedings of the 2nd International Conference on Humanities, Wisdom Education and Service Management (HWESM 2023), Xi’an, China, 14–16 July 2023; Volume 760, p. 356. [Google Scholar]
- Straits Research. Voice and Speech Recognition Market Size, Share & Trends Analysis Report by Function (Speech Recognition, Voice Recognition), by Technology (Artificial Intelligence Based, Non-Artificial Intelligence Based), by Vertical (Automotive, Enterprise, Consumer, BFSI, Government, Retail, Healthcare, Military, Legal, Education) and by Region (North America, Europe, APAC, Middle East and Africa, LATAM) Forecasts, 2025–2033; Report Code: SRTE2654DR. Available online: https://straitsresearch.com/report/voice-and-speech-recognition-market (accessed on 23 August 2025).
- Paulus Schoutsen. 2023: Home Assistant’s Year of Voice. Available online: https://www.home-assistant.io/blog/2022/12/20/year-of-voice/ (accessed on 23 August 2025).
- Schoutsen, P. Year of the Voice-Chapter 2: Let’s Talk. Home Assistant Blog, 27 April 2023. Available online: https://www.home-assistant.io/blog/2023/04/27/year-of-the-voice-chapter-2/ (accessed on 23 August 2025).
- Steadman, L.; Williams, W. Ursa 2: Elevating Speech Recognition Across 50+ Languages. Available online: https://www.speechmatics.com/company/articles-and-news/ursa-2-elevating-speech-recognition-across-52-languages (accessed on 23 August 2025).
- Uniphore. What Is Automatic Speech Recognition (ASR)? Available online: https://www.uniphore.com/glossary/automatic-speech-recognition/ (accessed on 23 August 2025).
- Microsoft. Speech to Text documentation–Tutorials, API Reference. Azure AI Services. Available online: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/index-speech-to-text (accessed on 23 August 2025).
- Google Cloud. Speech-to-Text documentation. Google Cloud Documentation. Available online: https://cloud.google.com/speech-to-text/docs (accessed on 23 August 2025).
- Voicegain. Speech-to-Text APIs. Available online: https://www.voicegain.ai/speech-to-text-apis (accessed on 23 August 2025).
- Wadhwani, P. Automotive Voice Recognition Market Analysis: Market Size, Share & Forecasts 2023–2032. Global Market Insights. Available online: https://www.gminsights.com/industry-analysis/automotive-voice-recognition-market (accessed on 23 August 2025).
- Behera, R. Advances in Automotive Voice Recognition Systems Redefining the In-Car Experience. Allied Market Research Blog, 20 May 2024. Available online: https://blog.alliedmarketresearch.com/latest-technologies-in-automotive-voice-recognition-systems-1972 (accessed on 23 August 2025).
- Wang, H.; Guo, P.; Li, Y.; Zhang, A.; Sun, J.; Xie, L.; Chen, W.; Zhou, P.; Bu, H.; Xu, X.; et al. ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Seoul, Republic of Korea, 14–19 April 2024; pp. 63–64. [Google Scholar]
- ResearchInChina. Automotive Voice Industry Review 2023–2024. AutoTech News, 27 December 2023. Available online: https://autotech.news/automotive-voice-industry-review-2023-2024/ (accessed on 23 August 2025).
- Zhen, K.; Radfar, M.; Nguyen, H.; Strimel, G.P.; Susanj, N.; Mouchtaris, A. Sub-8-bit quantization for on-device speech recognition: A regularization-free approach. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; pp. 13–20. [Google Scholar]
- Ding, S.; Meadowlark, P.; He, Y.; Lew, L.; Agrawal, S.; Rybakov, O. 4-bit Conformer with Native Quantization Aware Training for Efficient Speech Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 1458–1462. [Google Scholar]
- Noroozi, V.; Majumdar, S.; Kumar, A.; Balam, J.; Ginsburg, B. Stateful Conformer with Cache-Based Inference for Streaming Automatic Speech Recognition. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, 14–19 April 2024; pp. 12041–12045. [Google Scholar]
- K2-FSA Team. Sherpa-ONNX: Streaming Conformer-Transducer Models for On-Device ASR. Available online: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/conformer-transducer-models.html (accessed on 22 September 2025).
- Gupta, A.; Parulekar, A.; Chattopadhyay, S.; Jyothi, P. Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR. arXiv 2024, arXiv:2410.13445. [Google Scholar]
- Liu, Z.; Venkateswaran, N.; Le Ferrand, E.; Prud’hommeaux, E. How Important is a Language Model for Low-resource ASR? In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 206–213. [Google Scholar]
- Mainzinger, J.; Levow, G.-A. Fine-Tuning ASR models for Very Low-Resource Languages: A Study on Mvskoke. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Bangkok, Thailand, 11–16 August 2024; pp. 76–82. [Google Scholar]
- Ranjan, S.; Tripathi, A.; Kumar, K.; Hansen, J.H.L. Curriculum Learning based approaches for robust end-to-end far-field speech recognition. Speech Commun. 2021, 132, 17–27. [Google Scholar] [CrossRef]
- Dai, Y.; Liu, S.; Bataev, V.; Shi, Y.; Chen, X.; Wang, H.; Bu, H.; Li, S. AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition. arXiv 2025, arXiv:2505.23036. [Google Scholar]
- Wang, Z.; Hou, F.; Wang, R. CLRL-Tuning: A Novel Continual Learning Approach for Automatic Speech Recognition. In Proceedings of the INTERSPEECH 2023, Dublin, Ireland, 20–24 August 2023; pp. 4583–4587. [Google Scholar]
Comparison Dimension | Bidirectional LSTM | Bidirectional GRU | Unidirectional LSTM | Unidirectional GRU |
---|---|---|---|---|
Application Scenarios | Tasks highly dependent on both preceding and succeeding contextual information in the sequence. | Tasks that require capturing bidirectional information under limited computational resources or with real-time constraints. | Tasks where modeling long-term dependencies is critical and ample data is available. | Scenarios with resource constraints or high real-time requirements. |
Advantages | Captures bidirectional dependencies and exhibits strong memory capacity. | Captures bidirectional dependencies with relatively higher computational efficiency. | Strong capability in modeling long-term dependencies. | Simple and efficient, suitable for relatively short sequences. |
Disadvantages | High computational cost, complex architecture, and slow training speed. | Slightly weaker memory capacity compared to LSTM. | Higher parameter count and computational complexity than GRU. | Inferior ability to capture long-sequence dependencies compared to LSTM. |
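To make the cost differences summarized in the table above concrete, the following minimal sketch (our illustration, not code from any surveyed system) instantiates the four recurrent variants in PyTorch with assumed dimensions (80-dimensional filterbank input, 512 hidden units) and prints their trainable parameter counts; bidirectionality roughly doubles the cost, and the GRU’s three gates make it cheaper than the four-gate LSTM.

```python
# Minimal sketch (not from the surveyed systems): comparing parameter counts
# of the four recurrent variants discussed above, using PyTorch layers with
# illustrative dimensions (80-dim filterbank features, 512 hidden units).
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

variants = {
    "Bidirectional LSTM":  nn.LSTM(80, 512, bidirectional=True, batch_first=True),
    "Bidirectional GRU":   nn.GRU(80, 512, bidirectional=True, batch_first=True),
    "Unidirectional LSTM": nn.LSTM(80, 512, bidirectional=False, batch_first=True),
    "Unidirectional GRU":  nn.GRU(80, 512, bidirectional=False, batch_first=True),
}

for name, layer in variants.items():
    # GRU uses 3 gates vs. 4 for LSTM, and bidirectionality doubles the cost,
    # mirroring the accuracy/efficiency trade-offs listed in the table.
    print(f"{name:>20s}: {param_count(layer):,} parameters")
```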
Mechanism | Features | Key Formula Component | Variable Explanation |
---|---|---|---|
Additive Attention | Nonlinear combination of query and key, explicit alignment | $e = v^{\top}\tanh(W_q q + W_k k)$ | $q$: query; $k$: key; $W_q$, $W_k$, $v$: learnable weights |
Dot-Product Attention | Efficient dot-product computation, normalized scaling | $\mathrm{score}(q, k) = \frac{q^{\top} k}{\sqrt{d_k}}$ | $q$: query; $k$: key; $d_k$: dimension of key |
Multi-Head Attention | Parallel modeling in multiple subspaces, captures diverse features | $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$ | $\mathrm{head}_i$: single attention head; $W^{O}$: output projection matrix |
Location-Aware Attention [87] | Incorporates previous attention weights, enhances monotonic alignment | $e = v^{\top}\tanh(W q + V k + U f)$ | $q$: query; $k$: key; $f$: previous attention info; $W$, $V$, $U$, $v$: learnable weights |
Adaptive Attention | Dynamically adjusts range or weights, optimizes efficiency | Dot-product + dynamic masking | See dot-product attention; dynamic masking adapts attention weights |
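As a concrete illustration of the dot-product and adaptive rows above, the short NumPy sketch below (toy shapes assumed by us, not taken from any cited system) computes scaled dot-product attention, with an optional mask standing in for the dynamic masking used by adaptive variants.

```python
# Minimal NumPy sketch of scaled dot-product attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # q^T k / sqrt(d_k), as in the table
    if mask is not None:                      # dynamic masking ("adaptive" row)
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)        # attention distribution alpha
    return weights @ V, weights

# Toy example: 5 encoder frames, 3 decoder steps, 8-dim keys/values.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
context, alpha = scaled_dot_product_attention(Q, K, V)
print(context.shape, alpha.shape)  # (3, 8) (3, 5)
```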
Loss Function | Principle | Formula | Variable Explanation |
---|---|---|---|
Cross-Entropy Loss | Measures the difference between the predicted probability distribution and the true label distribution. | $L_{\mathrm{CE}} = -\sum_{i=1}^{n} y_i \log \hat{y}_i$ | $n$: number of samples; $y_i$: true label; $\hat{y}_i$: predicted probability for class $i$ |
CTC Loss | Automatically learns the alignment between speech feature sequences and text labels by summing over all possible alignment paths. | $L_{\mathrm{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x)$ | $T$: input length; $\pi$: alignment path; $\mathcal{B}^{-1}(y)$: set of paths mapping to label $y$; $p(\pi_t \mid x)$: probability of label $\pi_t$ at time $t$ |
Attention Loss | Supervises the learning of attention mechanisms to focus on important parts of the input speech, often combined with cross-entropy loss and regularized/constrained attention weights. | $L_{\mathrm{att}} = \frac{1}{n}\sum_{i=1}^{n} H(\alpha_i)$ | $n$: number of attention heads or time steps; $H$: entropy; $\alpha_i$: attention weight distribution at step $i$ |
RNN-T Loss | Optimizes sequence transduction models by jointly modeling the acoustic and label sequence without requiring pre-aligned data. | $L_{\mathrm{RNN\text{-}T}} = -\log \sum_{\pi \in \mathcal{A}(y)} \prod_{(t,u) \in \pi} p(\pi_{t,u} \mid x)$ | $t$: acoustic time step; $u$: label index; $\pi$: alignment path; $\mathcal{A}(y)$: set of valid alignments for $y$; $p(\pi_{t,u} \mid x)$: probability at $(t, u)$ |
Minimum Bayes Risk (MBR) Loss | Minimizes the expected error rate (e.g., WER) by weighting hypotheses according to their posterior probabilities. | $L_{\mathrm{MBR}} = \sum_{h \in \mathcal{H}} P(h \mid x)\, C(h, y)$ | $\mathcal{H}$: hypothesis space; $P(h \mid x)$: posterior probability of hypothesis $h$; $C(h, y)$: cost between $h$ and reference $y$ |
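The practical difference between the frame-level cross-entropy row and the alignment-free CTC row lies in what each loss expects as supervision. The hedged PyTorch sketch below uses random toy tensors with the built-in nn.CrossEntropyLoss and nn.CTCLoss; it only illustrates the differing input requirements (frame-aligned labels versus label sequences plus lengths) and is not code from any surveyed system.

```python
# Toy comparison of cross-entropy vs. CTC supervision (illustrative only):
# T=50 frames, C=30 output symbols including blank, batch of 2 utterances.
import torch
import torch.nn as nn

T, N, C = 50, 2, 30                      # frames, batch size, symbol inventory
logits = torch.randn(T, N, C)            # acoustic model outputs (random here)

# Cross-entropy: needs one hard label per frame, i.e., a pre-computed alignment.
frame_labels = torch.randint(0, C, (T, N))
ce = nn.CrossEntropyLoss()(logits.reshape(T * N, C), frame_labels.reshape(T * N))

# CTC: no frame alignment required; it sums over all paths in B^{-1}(y) internally.
log_probs = logits.log_softmax(dim=-1)   # CTCLoss expects log-probabilities
targets = torch.randint(1, C, (N, 10))   # label sequences (index 0 reserved for blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)(log_probs, targets,
                                              input_lengths, target_lengths)
print(float(ce), float(ctc))
```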
Model | WER (%) | Language Model † | Streaming ‡ | Labeled Data (h) § | Unlabeled Data (h) ‖ | Year |
---|---|---|---|---|---|---|
SAMBA ASR [107] | 1.17 | N/A | N | 15.46k h | - | 2025 |
FAdam [108] | 1.34 | N/A | N/A | N/A | N/A | 2024 |
Conformer + Wav2vec 2.0 + SpecAugment NST [109] | 1.4 | Y | N | 960 h | 60k h (Libri-Light) | 2020 |
w2v-BERT XXL [110] | 1.4 | Y | N | 960 h | 60k h (Libri-Light) | 2021 |
parakeet-rnnt-1.1b [111] | 1.46 | N | Y | 64k h | - | 2023 |
Conv + Transformer + wav2vec2.0 [112] | 1.5 | Y | N | 960 h | 53k h (LibriVox) | 2020 |
ContextNet + SpecAugment NST [113] | 1.7 | Y | N/A | 960 h | 60k h (Libri-Light) | 2020 |
SpeechStew (1B) [114] | 1.7 | N | N/A | 5.14k h | - | 2021 |
Multistream CNN + Self-Attentive SRU [115] | 1.75 | Y | N/A | 960 h | N/A | 2020 |
Stateformer [116] | 1.76 | N | N/A | N/A | N/A | 2023 |
wav2vec 2.0 with Libri-Light [117] | 1.8 | Y | N | 960 h | 53.2k h (LibriVox) | 2020 |
HuBERT with Libri-Light [118] | 1.8 | N/A | N | 960 h | 60k h (Libri-Light) | 2021 |
WavLM Large [119] | 1.8 | Y | N/A | 960 h | 94k h (multi-source) | 2021 |
E-Branchformer (L) [120] | 1.81 | Y | N/A | 960 h | N/A | 2022 |
Zipformer+pruned transducer [121] | 1.88 | N | Y | 1k+ h | - | 2024 |
ContextNet (L) [93] | 1.9 | Y | N/A | 1k+ h | N/A | 2020 |
Conformer (L) [71] | 1.9 | Y | N | 960 h | N/A | 2020 |
Transformer+Time reduction+SKD [122] | 1.9 | Y | N | 960 h | N/A | 2021 |
ContextNet (M) [93] | 2.0 | Y | N/A | 960 h | N/A | 2020
Transformer Transducer [123] | 2.0 | N | Y | N/A | N/A | 2020
Model | WER (%) | Language Model † | Streaming ‡ | Labeled Data (h) § | Unlabeled Data (h) ‖ | Year |
---|---|---|---|---|---|---|
SAMBA ASR [107] | 2.48 | N/A | N | 15.46k h | - | 2025 |
FAdam [108] | 2.49 | N/A | N/A | N/A | N/A | 2024 |
w2v-BERT XXL [110] | 2.5 | Y | N | 960 h | 60k h (Libri-Light) | 2021 |
Conformer + Wav2vec 2.0 + SpecAugment NST [109] | 2.6 | Y | N | 960 h | 60k h (Libri-Light) | 2020 |
HuBERT with Libri-Light [118] | 2.9 | N/A | N | 960 h | 60k h (Libri-Light) | 2021 |
wav2vec 2.0 with Libri-Light [117] | 3.0 | Y | N | 960 h | 53.2k h (LibriVox) | 2020 |
Conv + Transformer + wav2vec2.0 [112] | 3.1 | Y | N | 960 h | 53k h (LibriVox) | 2020 |
WavLM Large [119] | 3.2 | Y | N/A | 960 h | 94k h (multi-source) | 2021 |
SpeechStew (1B) [114] | 3.3 | N | N/A | 5.14k h | - | 2021 |
ContextNet + SpecAugment NST [113] | 3.4 | Y | N/A | 960 h | 60k h (Libri-Light) | 2020 |
E-Branchformer (L) [120] | 3.65 | Y | N/A | 960 h | N/A | 2022 |
data2vec [124] | 3.7 | N/A | N | N/A | N/A | 2022 |
Conv + Transformer AM + Pseudo-Labeling [125] | 3.83 | Y | N/A | 960 h | N/A | 2020 |
Conformer (L) [71] | 3.9 | Y | N | 960 h | N/A | 2020 |
Zipformer+pruned transducer [121] | 3.95 | N | Y | 1k+ h | - | 2024 |
SpeechStew (100M) [114] | 4.0 | N | N/A | 960 h + multi-domain | - | 2021 |
wav2vec 2.0 [117] | 4.1 | Y | N | 960 h | 53.2k h (LibriVox) | 2020 |
ContextNet (L) [93] | 4.1 | Y | N/A | 1k+ h | N/A | 2020 |
Conv + Transformer AM [126] | 4.11 | Y | N/A | 960 h | N/A | 2019 |
CTC + Transformer LM rescoring [127] | 4.20 | Y | N/A | 960 h | - | 2020 |
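Both benchmark tables above report word error rate. As a reminder of how this metric is obtained, the short sketch below (our illustrative code, not taken from any cited benchmark) computes WER as word-level edit distance (substitutions, deletions, and insertions) normalized by the reference length.

```python
# Hedged helper (illustration only): word error rate as word-level
# Levenshtein distance divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(f"{100 * word_error_rate('speech recognition is fun', 'speech recognition fun'):.2f}%")
# -> 25.00% (one deleted word out of four reference words)
```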
Stage | Configuration | Notes |
---|---|---|
Hardware | Mid-range mobile SoC/ARM Cortex-A + INT8 inference accelerator | Representative device; latency varies with hardware and threads |
Model Architecture | Conformer-Transducer (small/medium) | Streaming capability enabled; end-to-end ASR pipeline |
Quantization | INT8 weights + activations | Post-training quantization or QAT for minimal accuracy drop |
Feature Extraction | 16 kHz, 25 ms window, 10 ms frame shift | Optimized DSP/C++ kernels for low-latency front-end |
Decoder | Greedy/simplified beam search | Streaming joiner to maintain real-time performance |
Threads/Memory | 2–4 threads, cached feature blocks | Balance latency vs. throughput |
Latency Metric | Estimated <200 ms end-to-end | Based on prior work, not directly measured here |
Accuracy Trade-off | <2% WER increase vs. FP32 baseline | As reported in related quantization studies |
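The quantization stage in the configuration above can be sketched, under stated assumptions, with PyTorch’s post-training dynamic quantization utility. Note that the INT8 weights-plus-activations setup listed in the table would normally rely on static quantization or QAT with calibration data; the toy encoder below is a hypothetical stand-in (not a real Conformer-Transducer) and only illustrates INT8 weight quantization of linear layers.

```python
# Hedged sketch of the quantization stage above, using PyTorch's post-training
# *dynamic* quantization as a stand-in for the full INT8 pipeline in the table.
import torch
import torch.nn as nn

# Hypothetical stand-in for a small streaming encoder block (not a real
# Conformer-Transducer); dimensions are illustrative only.
toy_encoder = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 512),
)

quantized = torch.quantization.quantize_dynamic(
    toy_encoder, {nn.Linear}, dtype=torch.qint8
)

# Rough size comparison: FP32 parameters vs. the INT8-packed linear weights.
fp32_bytes = sum(p.numel() * 4 for p in toy_encoder.parameters())
print(f"FP32 parameter size: {fp32_bytes / 1024:.1f} KiB")
print(quantized)  # DynamicQuantizedLinear modules replace nn.Linear
```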
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).