Sensors
  • Article
  • Open Access

29 June 2023

A Speech Recognition Method Based on Domain-Specific Datasets and Confidence Decision Networks

School of Electrical and Control Engineering, North China University of Technology, Beijing 100041, China
* Author to whom correspondence should be addressed.
This article belongs to the Topic Complex Systems and Artificial Intelligence

Abstract

This paper proposes a speech recognition method based on a domain-specific language speech network (DSL-Net) and a confidence decision network (CD-Net). The method automatically trains on a domain-specific dataset, uses pre-trained model parameters for transfer learning, and obtains a domain-specific speech model. Importance sampling weights were set for the trained domain-specific speech model, which was then integrated with the speech model trained on the benchmark dataset. This integration automatically expands the lexical content of the model to accommodate the input speech based on the lexicon and language model. The adaptation addresses the out-of-vocabulary words that are likely to arise in most realistic scenarios and draws on external knowledge sources to extend the existing language model. In doing so, the approach enhances the adaptability of the language model in new domains or scenarios and improves the prediction accuracy of the model. For domain-specific vocabulary recognition, a deep fully convolutional neural network (DFCNN) and a connectionist temporal classification (CTC)-based approach were employed to achieve effective recognition of domain-specific vocabulary. Furthermore, a confidence-based classifier was added to enhance the accuracy and robustness of the overall approach. In the experiments, the method was tested on a proprietary domain audio dataset and compared with an automatic speech recognition (ASR) system trained on a large-scale dataset. Experimental verification showed that the model improved accuracy from 82% to 91% in the medical domain. The inclusion of domain-specific datasets yielded a 5% to 7% improvement over the baseline, while the introduction of model confidence improved the baseline by a further 3% to 5%. These findings demonstrate the significance of incorporating domain-specific datasets and model confidence in advancing speech recognition technology.

1. Introduction

Speech recognition technology, which converts speech signals into text, is becoming increasingly important. As its range of applications expands, it has become a key tool for enhancing human productivity and efficiency by replacing traditional keyboard and mouse operations with spoken input, enabling individuals to perform tasks such as office work and learning more efficiently [1]. Against this background, speech recognition is receiving growing attention, with both academia and industry actively promoting its development and improvement.
In recent years, there have been significant advancements in ASR, with deep learning techniques emerging as a key development. Deep learning models such as convolutional neural networks (CNN) [2] and recurrent neural networks (RNN) [3] have become prominent methods for speech recognition. Additionally, end-to-end [4] speech recognition has simplified the architecture of ASR systems, improving their performance and efficiency. Machine learning methods, such as transfer learning [5], have also been employed for phoneme or word selection [6], especially in noisy and complex environments. These technological advancements have expanded the applications of speech recognition. With the advancement of technology and the growing need for human-computer interaction [7], speech recognition has found widespread use across various fields. Domain-specific speech recognition, for instance, converts speech signals into text or commands, facilitating natural, efficient, and convenient human-computer interaction [8]. Keyword recognition methods [9] are effective for phrase recognition in domain-specific contexts, but they require prior identification of keywords and are not suitable for handling unknown speech content. Unlike traditional Hidden Markov Model (HMM)-based speech recognition methods that only consider short-term transition relationships between consecutive frames, our study proposes a novel approach called DFCNN+CTC for speech recognition. By incorporating Deep Fully Convolutional Neural Networks (DFCNN) with the Connectionist Temporal Classification (CTC) framework, our method captures and leverages longer temporal dependencies, overcoming the limitations of HMM-based methods. Domain-specific speech recognition techniques [10] aim to develop specialized speech recognition models for specific professional domains, such as medical [11], legal, and financial fields. In domain-specific speech recognition, the recognition performance can be enhanced through the use of custom language models [12] or domain-adaptive methods [13,14]. These approaches employ techniques such as data augmentation [15], domain-adaptive training, and domain-specific model fine-tuning. Custom language models can be trained using a corpus of domain-specific vocabulary or adjusted by combining a general-purpose language model with domain-specific vocabulary. This helps reduce domain differences and improves recognition accuracy. In domain-specific contexts, the terms and expressions used in speech inputs differ from those in general-purpose speech recognition techniques. This disparity poses challenges for accurately recognizing domain-specific content. To address this issue, this paper proposes a method for creating a speech network using domain-specific datasets.
In this paper, we propose a joint modeling [16] approach in which the acoustic and language models collaborate in parsing and transcribing the speech signal, leading to more accurate recognition results. The main contributions of this work can be summarized as follows:
(1)
We propose an acoustic modeling approach that combines the speech spectrogram, DFCNN, and CTC. This approach utilizes the rich information provided by the speech spectrogram, the powerful feature extraction capability of DFCNN, and the sequence modeling capabilities of CTC without alignment. By handling speech signals of varying lengths, we achieved improved speech recognition results;
(2)
To address unfamiliar words in new domains, we present a comprehensive system based on N-gram technology and the construction of domain-specific datasets incorporated into language models, enabling speech recognition in new domains (a minimal interpolation sketch follows this list);
(3)
To optimize the model, we propose a speech confidence-based determination method that dynamically adjusts the use of the language model, thereby enhancing the accuracy of the speech recognition model;
(4)
We designed experimental comparisons using different domain datasets to verify the effectiveness of the proposed method in addressing domain-specific speech recognition. These experiments follow a step-by-step incremental model experimental approach.
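As a concrete illustration of contribution (2), the toy sketch below (our own, not the authors' implementation) shows the basic idea of extending a general-purpose N-gram language model with counts from a domain-specific dataset so that domain terms absent from the general corpus still receive probability mass. The mini-corpora, the interpolation weight `lam`, and the smoothing constant `alpha` are illustrative assumptions.

```python
# Minimal sketch: interpolating a general-purpose bigram model with counts from a
# hypothetical domain-specific (medical) corpus, as a stand-in for the N-gram
# vocabulary extension described in contribution (2).
from collections import Counter, defaultdict

def bigram_counts(sentences):
    """Count bigrams (with a start token) from tokenized sentences."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for prev, cur in zip(["<s>"] + tokens, tokens):
            counts[prev][cur] += 1
    return counts

def interpolated_prob(word, prev, general, domain, lam=0.7, alpha=1.0):
    """P(word | prev) as a linear interpolation of two add-alpha bigram models."""
    def add_alpha(counts):
        vocab = {w for ctr in counts.values() for w in ctr} | {word}
        total = sum(counts[prev].values())
        return (counts[prev][word] + alpha) / (total + alpha * len(vocab))
    return lam * add_alpha(general) + (1.0 - lam) * add_alpha(domain)

# Hypothetical toy corpora: a general corpus and a domain-specific corpus.
general = bigram_counts([["the", "patient", "is", "here"], ["the", "doctor", "is", "here"]])
domain = bigram_counts([["the", "patient", "has", "arrhythmia"], ["arrhythmia", "is", "treated"]])

# A domain term unseen in the general corpus still receives probability mass.
print(interpolated_prob("arrhythmia", "has", general, domain))
```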

4. Analysis of Experimental Results

4.1. Acoustic Model Comparison

Speech recognition is a significant research area within the field of artificial intelligence. The most widely used method for speech feature extraction is the Mel Frequency Cepstral Coefficient (MFCC) method, which converts speech signals into MFCC feature vectors that are then used for speech recognition with a GMM-HMM model. While the MFCC method demonstrates good recognition performance in practical applications, it has some limitations: it struggles with long-duration sequence data and lacks location sensitivity. To address these limitations, deep learning techniques based on speech spectrograms have gradually been introduced into the field of speech recognition. The purpose of this experiment is to compare the mainstream MFCC approach with the speech spectrogram, DFCNN, and CTC-based approach, in order to identify a high-accuracy model that can serve as a benchmark for subsequent network construction. As shown in Table 1, we selected TensorFlow 1.14 as the primary deep learning framework and used it to construct the speech recognition models. We utilized the TIMIT speech dataset, which consists of 6300 recorded American English sentences from 630 speakers, each reading 10 sentences. Of these, 3000 sentences were randomly selected for the training set, 1000 for the validation set, and another 1000 for the test set. Additionally, we employed the Chinese speech dataset THCHS-30, which is likewise divided into training, validation, and test sets.
Table 1. Speech recognition experimental data table.
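As an illustration of the two front-ends compared in this experiment, the sketch below computes MFCC vectors (the GMM-HMM baseline features) and a log-magnitude spectrogram (the DFCNN input) for one utterance. The use of librosa and the example TIMIT path are our assumptions; the paper does not name its feature extraction toolkit.

```python
# Illustrative sketch of the two acoustic front-ends compared here.
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=13):
    """13-dimensional MFCC frames, the classic GMM-HMM front-end."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # shape: (frames, n_mfcc)

def log_spectrogram(path, n_fft=400, hop_length=160):
    """Log-magnitude spectrogram (25 ms window, 10 ms hop at 16 kHz) for the DFCNN."""
    y, sr = librosa.load(path, sr=16000)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return librosa.amplitude_to_db(spec, ref=np.max).T  # shape: (frames, 1 + n_fft // 2)

# Example usage on a hypothetical TIMIT utterance path:
# mfcc = mfcc_features("timit/train/dr1/fcjf0/sa1.wav")
# spec = log_spectrogram("timit/train/dr1/fcjf0/sa1.wav")
```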
For our experiments, we utilized a computer with a specific configuration and installed the necessary software and libraries for TensorFlow 1.14 on it. We also employed a GPU acceleration card to enhance the training and inference efficiency. In order to ensure the reproducibility of the experiments, we meticulously recorded the experimental settings, including hyperparameter values such as learning rate, optimizer type, and number of iterations. Furthermore, we conducted reasonable parameter tuning and cross-validation.
This paper employed the word error rate (WER) as the metric for evaluating speech recognition accuracy. The edit distance is calculated by aligning the transcription results with the correct reference text; a lower word error rate indicates a higher accuracy of the speech recognition system. The same procedure was applied to all speech samples in the test dataset to determine the word error rate for the entire dataset, which was then converted to accuracy. The word error rate is calculated using Formula (4):
\[ \mathrm{WER} = \frac{\#\mathrm{Substitutions} + \#\mathrm{Deletions} + \#\mathrm{Insertions}}{\#\mathrm{Reference\ Words}} \tag{4} \]
Accuracy was then derived from the word error rate by accumulating all of the edit operations recorded during the edit-distance alignment.
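The following is a minimal sketch (ours, not the authors' scoring script) of the WER computation in Formula (4): substitutions, deletions, and insertions are counted via an edit-distance alignment and normalized by the reference length.

```python
# Minimal word-error-rate sketch based on edit-distance alignment.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical reference/hypothesis pair: one deletion out of five words -> WER = 0.2.
print(word_error_rate("the patient has a fever", "the patient has fever"))
```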
On the English TIMIT dataset, the mainstream MFCC method combined with a GMM-HMM model was used for training and testing; the resulting speech recognition accuracy on the test set was 76.5%. For the speech spectrogram, DFCNN, and CTC-based method, the speech signal was first transformed into a speech spectrogram, and the extracted feature vectors were then fed into the CTC model for training and recognition. This method achieved an accuracy of 87.2% on the test set, 10.7 percentage points higher than the mainstream MFCC method.
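The sketch below is a rough tf.keras reconstruction of such a DFCNN-style acoustic model, not the authors' exact architecture: stacked convolution and pooling blocks over the spectrogram, a per-frame softmax over the output units, and CTC as the training objective. The layer sizes, the number of frequency bins, and the output vocabulary size are illustrative assumptions.

```python
# Rough sketch of a DFCNN + CTC acoustic model in tf.keras (sizes are illustrative).
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as K

def build_dfcnn_ctc(n_freq_bins=200, n_classes=1423):
    # Spectrogram input: (time, frequency, 1); the time length is variable.
    spec = layers.Input(shape=(None, n_freq_bins, 1), name="spectrogram")
    x = spec
    for filters in (32, 64, 128):                       # three conv blocks, DFCNN-style
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)    # halves both time and frequency
    # After three 2x2 poolings the frequency axis is n_freq_bins // 8 with 128 channels.
    x = layers.Reshape((-1, (n_freq_bins // 8) * 128))(x)
    x = layers.Dense(256, activation="relu")(x)
    probs = layers.Dense(n_classes + 1, activation="softmax", name="ctc_probs")(x)  # +1 blank

    # CTC loss wired in through a Lambda layer, the usual Keras pattern.
    labels = layers.Input(shape=(None,), dtype="int32", name="labels")
    input_len = layers.Input(shape=(1,), dtype="int32", name="input_length")   # time // 8
    label_len = layers.Input(shape=(1,), dtype="int32", name="label_length")
    loss = layers.Lambda(lambda a: K.ctc_batch_cost(a[0], a[1], a[2], a[3]), name="ctc_loss")(
        [labels, probs, input_len, label_len])
    return Model(inputs=[spec, labels, input_len, label_len], outputs=loss)

model = build_dfcnn_ctc()
# The model output already is the CTC loss, so the compiled loss just passes it through.
model.compile(optimizer="adam", loss=lambda y_true, y_pred: y_pred)
```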
On the Chinese THCHS-30 test set, the accuracy of the MFCC-based method was approximately 80%, while the accuracy of the speech spectrogram, DFCNN, and CTC-based method exceeded 90%.
Figure 6 compares the mainstream speech recognition experiments with the spectrogram-based results, where panel (a) shows the validation results on the TIMIT dataset and panel (b) shows the validation results on the THCHS-30 dataset. The following conclusions can be drawn. The mainstream MFCC method demonstrates a certain level of recognition ability, but it is weak at handling long-duration sequence data and lacks location sensitivity. In contrast, the methods based on the speech spectrogram, DFCNN, and CTC better characterize the speech signal, retain temporal sequence information, and capture spectral information, and therefore achieve more accurate speech recognition. They also exhibit stronger robustness and perform better in the presence of interference factors such as background noise and slurred speech.
Figure 6. Comparison results of speech recognition experiments.

4.2. ASR System Experimental Results

To address challenges such as the difficulty of processing domain-specific words, this paper investigated an ASR system based on speech spectrograms and proposed two incremental improvement methods: the use of domain-specific datasets, and a confidence-based model decision built on those domain-specific datasets.
The first method is an approach based on a domain-specific dataset speech network (DSL-Net) that automatically trains on domain-specific datasets, yielding domain-specific speech models. It uses ensemble learning to merge the domain-specific speech model with the speech model obtained by training on the benchmark dataset. This process automatically expands the lexical content of the ASR model, improves the adaptability of the language model in new domains or scenarios, and increases the prediction accuracy of the model.
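A hedged sketch of this merging step is shown below: the benchmark model and the domain-specific model each score the candidate transcriptions, and the scores are combined with an importance weight before the best hypothesis is chosen. The weight value, the score floor, and the scoring interface are illustrative assumptions, not the paper's exact implementation.

```python
# Toy fusion of benchmark-model and domain-model hypothesis scores (log-probabilities).
def fuse_hypotheses(benchmark_scores, domain_scores, domain_weight=0.4, floor=-30.0):
    """Both arguments map candidate transcriptions to log-probability scores."""
    fused = {}
    for hyp in set(benchmark_scores) | set(domain_scores):
        base = benchmark_scores.get(hyp, floor)   # floor stands in for "not proposed"
        dom = domain_scores.get(hyp, floor)
        # Linear interpolation in log space as a simple stand-in for the integration step.
        fused[hyp] = (1.0 - domain_weight) * base + domain_weight * dom
    return max(fused, key=fused.get)

# Hypothetical candidates for one utterance: the domain model rescues a medical term.
benchmark = {"the patient has a rhythm here": -4.1, "the patient has arrhythmia": -6.0}
domain = {"the patient has arrhythmia": -2.2}
print(fuse_hypotheses(benchmark, domain))  # -> "the patient has arrhythmia"
```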
The second method, a confidence-based model decision built on domain-specific datasets, combines the deep fully convolutional neural network and the connectionist temporal classification (CTC)-based speech recognition algorithm with a dedicated confidence-based classifier. During training, the classifier takes the confidence of the dual-channel model as its label input; it predicts not only the sample's class but also uses the accuracy rate as an evaluation metric. This effectively enhances the robustness and accuracy of the overall method, enabling the ASR system to select the optimal solution from the dual-channel model for output and thus improving recognition accuracy.
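The sketch below illustrates the decision step only: each channel (the benchmark ASR model and the domain-adapted model) returns a transcription plus a confidence score, and a classifier trained on those scores decides which channel's output to keep. A scikit-learn logistic regression is used here as a stand-in for the paper's convolutional confidence classifier, and the training pairs are made-up placeholders.

```python
# Confidence-based channel selection with a logistic-regression stand-in classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [benchmark confidence, domain confidence] -> 1 if the
# domain channel gave the correct transcription on that utterance, else 0.
X_train = np.array([[0.92, 0.40], [0.55, 0.88], [0.60, 0.91], [0.95, 0.35], [0.48, 0.82]])
y_train = np.array([0, 1, 1, 0, 1])

decider = LogisticRegression().fit(X_train, y_train)

def decide(benchmark_out, domain_out):
    """Each argument is a (transcription, confidence) pair; return the chosen transcription."""
    feats = np.array([[benchmark_out[1], domain_out[1]]])
    return domain_out[0] if decider.predict(feats)[0] == 1 else benchmark_out[0]

print(decide(("the patient has a rhythm here", 0.58), ("the patient has arrhythmia", 0.86)))
```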
In this paper, a series of experiments were conducted to verify the effectiveness of the proposed method. Two datasets were utilized: a medical symptom speech dataset and the Chinese speech dataset THCHS-30. To demonstrate the effectiveness of each network in the speech recognition system, we adopted a progressive model experimentation approach, where network models were incrementally added. This step-by-step process allowed us to assess the contribution of each network and evaluate the overall performance of the system. Table 2 illustrates the features utilized by different methods on the different datasets.
Table 2. Data sheet for model experiments.
The specific experimental results, evaluated using the word error rate and character error rate, are as follows:
Experiment 1: A basic ASR system was constructed based on the speech spectrogram, utilizing conventional image-based acoustic features. These features were combined with GMM-HMM models to create a speech recognition system. The experimental results indicate that the speech recognition system performs well in recognizing common words. However, it tends to have false recognitions when encountering uncommon words, resulting in poor performance for domain-specific speech recognition. This limitation is attributed to the lack of specialized domain knowledge in the system’s recognition model;
Experiment 2: To enhance the recognition accuracy of the system, this paper explored the incorporation of domain-specific datasets. Specifically, domain-specific audio datasets are utilized to create a dedicated dataset for the domain. The experimental results demonstrated an improvement in the system’s recognition accuracy upon incorporating domain-specific datasets. This improvement can be attributed to the fact that the domain-specific dataset encompasses information on the specific features of the domain-related speech signals. Consequently, it enables better adaptation to the domain-specific speech dataset, thereby enhancing the accuracy of speech recognition;
Experiment 3: In the above model, a confidence-based model decision method was introduced, which selects the optimal outcome from the two candidate results based on the confidence level of the dual-channel model. A convolutional neural network was employed to construct the confidence model, and experimental verification demonstrated significant improvements in the system’s recognition accuracy, highlighting the effectiveness of this approach.
The experimental results, as shown in Table 3, indicated that the gradual incorporation of domain-specific datasets and the utilization of a confidence-based model decision method led to significant enhancements in the recognition accuracy of the ASR system for the two datasets. In the first set of experiments, the addition of domain-specific datasets resulted in an approximately 6% improvement in recognition accuracy on the medical dataset test set compared to the baseline system, indicating a notable enhancement in the performance of the ASR system. In the second set of experiments, the integration of the confidence-based model decision method into the domain-specific dataset system led to an approximately 4% improvement in recognition accuracy on the test set compared to the benchmark system, highlighting the further enhancement in performance achieved by incorporating the confidence-based model decision method.
Table 3. Comparison table of model experimental results.
To mitigate the impact of random errors on the experimental results, this paper used a paired t-test to determine whether there are significant differences between the sample groups. For each of the three systems, ten groups of domain-specific speech samples were randomly selected and recognition experiments were conducted. The accuracy rates obtained for each system are recorded in Table 4, with label A representing ASR, B representing ASR+DSL-Net, and C representing ASR+DSL-Net+CD-Net. The paired t-test makes it possible to assess whether there is a statistically significant difference in the performance of these systems on the same sample set.
Table 4. Paired sample t-test results.
The significance level was set at 0.05 with $v = 9$ degrees of freedom; from the t-distribution table, $t_{0.05,9} = 1.833$. The calculations gave $S_{d_{A-B}} = 0.04147$, $t_1 = 2.38 > t_{0.05,9}$ and $S_{d_{B-C}} = 0.04613$, $t_2 = 2.41 > t_{0.05,9}$. These experimental verifications confirmed that the differences were statistically significant, indicating significant variations in the accuracy rates among the three systems.
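For reference, the paired test can be reproduced as sketched below. The use of scipy.stats is our tooling assumption, and the accuracy arrays are made-up placeholders rather than the actual entries of Table 4, so only the procedure, not the numbers, is illustrated.

```python
# Paired t-test sketch for the three systems (placeholder accuracies, not Table 4 data).
import numpy as np
from scipy import stats

acc_A = np.array([0.80, 0.85, 0.78, 0.83, 0.81, 0.79, 0.86, 0.77, 0.84, 0.82])  # ASR
acc_B = np.array([0.86, 0.84, 0.85, 0.88, 0.83, 0.87, 0.89, 0.80, 0.90, 0.85])  # +DSL-Net
acc_C = np.array([0.90, 0.88, 0.89, 0.92, 0.87, 0.91, 0.93, 0.85, 0.93, 0.89])  # +CD-Net

t_AB, p_AB = stats.ttest_rel(acc_B, acc_A)   # paired test, v = n - 1 = 9 degrees of freedom
t_BC, p_BC = stats.ttest_rel(acc_C, acc_B)
print(f"A vs B: t = {t_AB:.2f}, p = {p_AB:.4f}")
print(f"B vs C: t = {t_BC:.2f}, p = {p_BC:.4f}")
# With the paper's one-sided threshold t_{0.05,9} = 1.833, t > 1.833 indicates a
# significant improvement at the 0.05 level.
```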
To further investigate the impact of domain-specific datasets on proprietary domains within this approach, various sizes of loaded feature sets were utilized to validate the model’s recognition effectiveness. As depicted in Figure 7, it is evident that the accuracy of speech recognition gradually improves with an increase in the loading of the feature set. Although the validation set for each iteration is randomized and the dataset may undergo perturbations, resulting in a decrease in accuracy for certain segments, the overall accuracy shows an upward trend. This suggests that the overall efficacy of the method remains unaffected by domain specificity and that the accuracy of speech recognition can be further enhanced by augmenting the size of the domain-specific dataset.
Figure 7. Trend graph of load/correct rate.
The speech recognition approach proposed in this study incorporates a variety of advanced techniques, including DSL-Net, CD-Net, DFCNN, CTC, and confidence-based classifiers, with the aim of enhancing the accuracy and adaptability of speech recognition. The method is particularly advantageous for training on and transferring to domain-specific datasets, and its effectiveness has been demonstrated in diverse domains when combined with large-scale datasets. This holds significant practical importance for addressing the challenges of speech recognition in real-world application scenarios.

5. Conclusions

The speech recognition approach proposed in this paper is based on a domain-specific dataset speech network and a confidence decision network. The aim of this approach is to enhance existing language models by incorporating external knowledge so that unknown words and language rules are handled better. Compared with current approaches in the field of speech recognition, particular attention is given to Transformer-based models and the listen, attend and spell (LAS) model. Transformer-based models offer better parallel computation than traditional RNN models, using self-attention and positional encoding to capture long-range dependencies; however, they may encounter challenges with long speech inputs, particularly in the case of large vocabularies and complex utterances. In contrast, the method presented in this paper leverages domain-specific datasets and a confidence-based model decision approach optimized for domain-specific speech inputs, achieving significant accuracy improvements on domain-specific speech recognition tasks. The LAS model is an end-to-end speech recognition model based on an attention mechanism, trained directly from acoustic inputs to character-level outputs, whereas the primary innovations of this paper are the introduction of domain-specific datasets and a confidence-based model decision method. The domain-specific datasets provide more precise training samples, leading to enhanced accuracy and performance of the system, and the confidence-based decision method improves recognition accuracy and robustness by selecting the candidate result with the higher confidence level. In future research, we plan to further optimize the construction and use of domain-specific datasets and to explore other possible combinations and uses of feature sets to further enhance the performance and adaptability of the ASR system.

Author Contributions

Conceptualization, Z.D.; methodology, Q.D. and W.Z.; validation, Q.D.; writing—original draft preparation, Q.D. and M.Z.; writing—review and editing, Q.D. and Z.D.; visualization, Q.D.; supervision, Z.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ramadan, R.A. Retraction Note: Detecting Adversarial Attacks on Audio-Visual Speech Recognition Using Deep Learning Method; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  2. Yu, J. Mobile Communication Voice Enhancement Under Convolutional Neural Networks and the Internet of Things. Intell. Autom. Soft Comput. 2023, 37, 777–797. [Google Scholar] [CrossRef]
  3. Youa, Y.; Sun, X. Research on dialect speech recognition based on DenseNet-CTC. Acad. J. Comput. Inf. Sci. 2023, 6, 23–27. [Google Scholar]
  4. Lin, Y.; Wang, L.; Dang, J.; Li, S.; Ding, C. End-to-End Articulatory Modeling for Dysarthric Articulatory Attribute Detection. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7349–7353. [Google Scholar]
  5. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  6. Currey, A.; Illina, I.; Fohr, D. Dynamic Adjustment of Language Models for Automatic Speech Recognition Using Word Similarity. In Proceedings of the 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, USA, 13–16 December 2016; pp. 426–432. [Google Scholar]
  7. Cuesta, L.; Manuel, J. Contributions to the Contextualization of Human-Machine Spoken Interaction Systems. Telecomunicacion. Ph.D. Thesis, Universiti Putra Malaysia, Serdang, Malaysia, 2013. [Google Scholar]
  8. Ibrahim, Y.A.; Odiketa, J.C.; Ibiyemi, T.S. Preprocessing technique in automatic speech recognition for human computer interaction: An overview. Ann. Comput. Sci. Ser. 2017, 15, 186–191. [Google Scholar]
  9. Qiu, B. Keyword Detection of Japanese Media Teaching Based on Support Vector Machines and Speech Detection. Mob. Inf. Syst. 2022, 2022, 6095859. [Google Scholar] [CrossRef]
  10. Errattahi, R.; El Hannani, A.; Ouahmane, H. Automatic speech recognition errors detection and correction: A review. Procedia Comput. Sci. 2018, 128, 32–37. [Google Scholar] [CrossRef]
  11. Kumar, Y.; Koul, A.; Mahajan, S. A deep learning approaches and fastai text classification to predict 25 medical diseases from medical speech utterances, transcription and intent. Soft Comput. 2022, 26, 8253–8272. [Google Scholar] [CrossRef]
  12. Zhang, J.; Wushouer, M.; Tuerhong, G.; Wang, H. Semi-Supervised Learning for Robust Emotional Speech Synthesis with Limited Data. Appl. Sci. 2023, 13, 5724. [Google Scholar] [CrossRef]
  13. Jafarlou, M.Z. Domain-Specific Model Differencing for graphical Domain-Specific Languages. In Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, Montreal, QC, Canada, 23–28 October 2022; pp. 205–208. [Google Scholar]
  14. Xue, W.; Cucchiarini, C.; van Hout, R.; Strik, H. Measuring the intelligibility of dysarthric speech through automatic speech recognition in a pluricentric language. Speech Commun. 2023, 148, 23–30. [Google Scholar] [CrossRef]
  15. Robert, N.R.; Anija, S.B.; Samuel, F.J.; Kala, K.; Preethi, J.J.; Kumar, M.S.; Edison, M.J. ILeHCSA: An internet of things enabled smart home automation scheme with speech enabled controlling options using machine learning strategy. Int. J. Adv. Technol. Eng. Explor. 2021, 8, 1695. [Google Scholar]
  16. Taha, M.; Azarov, E.S.; Likhachov, D.S.; Petrovsky, A.A. An efficient speech generative model based on deterministic/stochastic separation of spectral envelopes. Dokl. BGUIR 2020, 18, 23–29. [Google Scholar] [CrossRef]
  17. Valin, J.-M. A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement. In Proceedings of the 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), Vancouver, BC, Canada, 29–31 August 2018; pp. 1–5. [Google Scholar]
  18. Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 44, 8717–8727. [Google Scholar] [CrossRef] [PubMed]
  19. Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Bengio, C.L.Y.; Courville, A. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. arXiv 2017, arXiv:1701.02720. [Google Scholar]
  20. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
  21. Vergin, R.; O’Shaughnessy, D.; Farhat, A. Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition. IEEE Trans. Speech Audio Process. 1999, 7, 525–532. [Google Scholar] [CrossRef]
  22. Coucheiro-Limeres, A.; Ferreiros-López, J.; Fernández-Martínez, F.; Córdoba, R. A dynamic term discovery strategy for automatic speech recognizers with evolving dictionaries. Expert Syst. Appl. 2021, 176, 114860. [Google Scholar] [CrossRef]
  23. Sitaula, C.; He, J.; Priyadarshi, A.; Tracy, M.; Kavehei, O.; Hinder, M.; Withana, A.; McEwan, A.; Marzbanrad, F. Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 1853–1864. [Google Scholar] [CrossRef]
  24. Burne, L.; Sitaula, C.; Priyadarshi, A.; Tracy, M.; Kavehei, O.; Hinder, M.; Withana, A.; McEwan, A.; Marzbanrad, F. Ensemble Approach on Deep and Handcrafted Features for Neonatal Bowel Sound Detection. IEEE J. Biomed. Health Inf. 2023, 27, 2603–2613. [Google Scholar] [CrossRef] [PubMed]
  25. Imran, Z.; Grooby, E.; Sitaula, C.; Malgi, V.; Aryal, S.; Marzbanrad, F. A Fusion of Handcrafted Features and Deep Learning Classifiers for Heart Murmur Detection. In Proceedings of the 2022 Computing in Cardiology Conference (CinC), Tampere, Finland, 4–7 September 2022. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
