Automatic Speech Disfluency Detection Using wav2vec2.0 for Different Languages with Variable Lengths
Abstract
1. Introduction
- This paper addresses the shortage of Chinese disfluent speech data by creating the PSC-PS-DF dataset, which covers four types of disfluency: interjections, blocks, prolongations, and repetitions.
- This paper develops a classification network for automatic speech disfluency detection that combines a CNN with a Transformer and uses context embeddings from the pre-trained wav2vec2.0 model. The network outperforms the baseline models in both detection accuracy and training time, even when trained on limited data.
- Because the length of disfluent speech varies in practical detection scenarios, this paper improves the model based on the entropy invariance of attention mechanisms so that its results generalize to speech of different lengths: the model still performs well even when the training and test data contain disfluent segments of different lengths.
- To ensure that the proposed model can achieve good disfluency detection results in different language environments, this paper conducts experiments on the self-built PSC-PS-DF dataset for Chinese and the open-source SEP-28k dataset for English disfluent speech. The results demonstrate the potential of the proposed model to detect speech disfluency in various language environments.
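The length-generalization idea above rests on keeping the entropy of the attention distribution roughly constant as the sequence length changes. A minimal pure-Python sketch of one common form of this "length-scaled" attention is shown below: the logits are multiplied by log(n)/log(n_train) on top of the usual 1/√d factor, so the factor is 1 at the training length and grows for longer inputs. The exact scaling constant used in the paper may differ, and `n_train` here is an illustrative stand-in for the length seen during training.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_weights(q, keys, d, n_train=None):
    """Dot-product attention weights for one query against n keys.

    Standard scaling uses 1/sqrt(d). Following the entropy-invariance
    argument, the logits are additionally multiplied by
    log(n) / log(n_train), so that the entropy of the attention
    distribution stays roughly constant when the test-time length n
    differs from the training length n_train.
    """
    n = len(keys)
    scale = 1.0 / math.sqrt(d)
    if n_train is not None:
        scale *= math.log(n) / math.log(n_train)  # length-scaling factor
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    return softmax(logits)
```

When n equals n_train the extra factor is exactly 1, so training behaves like standard scaled dot-product attention; only at mismatched lengths does the correction kick in.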
2. Related Work
2.1. Feature Extraction
2.2. Disfluency Detection with Machine Learning
2.3. Disfluency Detection with Deep Learning
3. Proposed Method
3.1. Model Architecture
3.2. Entropy Invariance of Attention Mechanisms
4. Experiments
4.1. Datasets
4.1.1. SEP-28k
4.1.2. PSC-PS-DF
4.2. Basic Settings
5. Results and Analysis
5.1. Evaluation on Limited Data
5.2. Comparison with Baseline Models
5.3. Ablation Study
5.4. Length-Scaled Attention
6. Conclusions
- The accuracy of speech disfluency detection methods depends directly on the amount of data, but collecting and labeling disfluent data is expensive; more efficient data processing and feature extraction methods can therefore yield reliable detection results from limited data.
- The wav2vec2.0 model is effective at extracting disfluent speech features. In this paper, only the context representation of its last hidden layer was used as the model input. Future research could fine-tune the wav2vec2.0 model for more effective detection, or substitute other pre-trained speech models such as HuBERT and WavLM to obtain better results.
- This paper conducts experiments on the self-built Chinese dataset and the open-source English SEP-28k dataset to verify that the model can be applied in different language environments. In the future, we can evaluate this model on more languages to verify its reliability.
- In the future, we can expand our work from single-task scenarios to multi-classification scenarios, not only detecting disfluency but also distinguishing between different types of speech disfluency events.
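As a shape-level illustration of the feature-extraction step discussed above: wav2vec2.0 encodes a clip into a sequence of 768-dimensional frame embeddings (the last hidden layer), which the classifier then consumes. The snippet below only models the tensor shapes with random numbers; the wav2vec2.0 call and the CNN/Transformer classifier are replaced by stand-ins (random features and mean pooling), and the 320-sample frame stride is the nominal rate of the base model (the real model emits a frame or so fewer per clip due to edge effects).

```python
import random

random.seed(0)

SAMPLE_RATE = 16_000
FRAME_STRIDE = 320   # wav2vec2.0 emits roughly one frame per 20 ms of audio
FEATURE_DIM = 768    # hidden size of the base wav2vec2.0 model

def extract_features(num_samples):
    # Stand-in for wav2vec2.0: real code would feed the waveform to a
    # pretrained model and read its last hidden state; here we only
    # model the output shape (n_frames x 768).
    n_frames = num_samples // FRAME_STRIDE
    return [[random.gauss(0, 1) for _ in range(FEATURE_DIM)]
            for _ in range(n_frames)]

def mean_pool(frames):
    # Collapse the variable-length frame sequence into one clip vector.
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

feats = extract_features(3 * SAMPLE_RATE)  # a 3 s clip, as in SEP-28k
clip_vec = mean_pool(feats)                # one 768-dim clip embedding
```

A fixed-size clip embedding like `clip_vec` is what a downstream binary disfluency classifier would operate on; in the paper the pooling is replaced by the CNN–Transformer stack.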
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Full Name |
---|---|
CNN | Convolutional Neural Network |
PSC | Putonghua Shuiping Ceshi |
MFCC | Mel Frequency Cepstral Coefficient |
LPC | Linear Predictive Coding |
LPCC | Linear Prediction Cepstral Coefficient |
PLP | Perceptual Linear Prediction |
ANN | Artificial Neural Network |
HMM | Hidden Markov Model |
SVM | Support Vector Machine |
KNN | K-Nearest Neighbor |
LDA | Linear Discriminant Analysis |
DTW | Dynamic Time Warping |
MLP | Multilayer Perceptron |
BLSTM | Bidirectional Long Short-Term Memory |
CT-Transformer | Controllable Time-delay Transformer |
TDNN | Time-delay Neural Network |
LSTM | Long Short-Term Memory |
Disfluency Labels | Definition | Examples |
---|---|---|
Sound repetitions (Snd) | Repetitions of syllables | I (wh-wh-) whispered a secret |
Word repetitions (WP) | Repetitions of words | I know (know) a secret |
Prolongations (Pro) | Extended syllables | I kn(nnnnn)ow |
Interjections (Intrj) | Filler words or non-words | I (um) know a (uh) secret |
Blocks (Bl) | Long stuttered pauses | I know (pause) a secret |
Disfluency Labels | Train | Dev | Test | Total | Train Data Size in Minutes |
---|---|---|---|---|---|
WP | 958 | 96 | 320 | 1374 | 48 |
Pro | 576 | 58 | 218 | 852 | 29 |
Intrj | 442 | 45 | 146 | 633 | 22 |
Bl | 1086 | 109 | 360 | 1555 | 54 |
Hyperparameters | Setting |
---|---|
Learning rate | 1 × 10 |
Batch size | 512 |
Optimizer | Adam |
Loss function | CrossEntropyLoss |
Audio feature dimension | 768 |
Attention dimension/number of heads | 50/10 |
CNN hidden layer dimension | 50 |
wav2vec2.0 (Chinese) | TencentGameMate/chinese-wav2vec2-base |
wav2vec2.0 (English) | facebook/wav2vec2-base-960h |
Disfluency | Data Fraction | Data Size in Minutes | F1 | Precision | Recall | Accuracy (%) |
---|---|---|---|---|---|---|
Snd | 1/1 | 75 | 0.90 | 0.82 | 0.99 | 89.08 |
Snd | 1/2 | 37 | 0.81 | 0.74 | 0.89 | 78.88 |
Snd | 1/4 | 19 | 0.74 | 0.70 | 0.77 | 72.33 |
WP | 1/1 | 148 | 0.90 | 0.84 | 0.96 | 86.76 |
WP | 1/2 | 74 | 0.79 | 0.78 | 0.81 | 76.36 |
WP | 1/4 | 37 | 0.75 | 0.74 | 0.77 | 70.17 |
Pro | 1/1 | 75 | 0.89 | 0.82 | 0.98 | 88.11 |
Pro | 1/2 | 37 | 0.78 | 0.71 | 0.87 | 75.97 |
Pro | 1/4 | 19 | 0.72 | 0.68 | 0.77 | 70.39 |
Intrj | 1/1 | 248 | 0.84 | 0.87 | 0.81 | 83.82 |
Intrj | 1/2 | 124 | 0.83 | 0.85 | 0.81 | 78.18 |
Intrj | 1/4 | 62 | 0.80 | 0.84 | 0.76 | 76.90 |
Bl | 1/1 | 45 | 0.79 | 0.77 | 0.81 | 78.57 |
Bl | 1/2 | 22 | 0.72 | 0.68 | 0.78 | 70.41 |
Bl | 1/4 | 11 | 0.69 | 0.64 | 0.73 | 66.33 |
Disfluency | Data Size in Minutes | F1 | Precision | Recall | Accuracy (%) |
---|---|---|---|---|---|
WP | 48 | 0.98 | 0.99 | 0.98 | 98.44 |
Pro | 29 | 0.84 | 0.73 | 0.98 | 81.73 |
Intrj | 22 | 0.94 | 0.97 | 0.92 | 94.52 |
Bl | 54 | 0.99 | 0.99 | 0.99 | 98.89 |
Disfluency | Model | F1 | Precision | Recall | Accuracy (%) |
---|---|---|---|---|---|
Snd | LSTM | 0.66 | 0.65 | 0.67 | 65.05 |
Snd | MLP | 0.70 | 0.68 | 0.72 | 69.17 |
Snd | DisfluencyNet | 0.72 | 0.67 | 0.79 | 70.00 |
Snd | Ours | 0.74 | 0.70 | 0.77 | 72.33 |
WP | LSTM | 0.72 | 0.71 | 0.74 | 67.70 |
WP | MLP | 0.71 | 0.72 | 0.69 | 69.68 |
WP | DisfluencyNet | 0.71 | 0.75 | 0.66 | 71.00 |
WP | Ours | 0.75 | 0.74 | 0.77 | 70.17 |
Pro | LSTM | 0.60 | 0.64 | 0.60 | 60.44 |
Pro | MLP | 0.63 | 0.62 | 0.65 | 62.14 |
Pro | DisfluencyNet | 0.73 | 0.80 | 0.76 | 75.70 |
Pro | Ours | 0.72 | 0.68 | 0.77 | 70.39 |
Intrj | LSTM | 0.69 | 0.67 | 0.71 | 70.00 |
Intrj | MLP | 0.68 | 0.81 | 0.59 | 72.73 |
Intrj | DisfluencyNet | 0.79 | 0.79 | 0.79 | 74.50 |
Intrj | Ours | 0.80 | 0.84 | 0.76 | 76.90 |
Bl | LSTM | 0.49 | 0.50 | 0.49 | 50.00 |
Bl | MLP | 0.56 | 0.53 | 0.59 | 53.06 |
Bl | DisfluencyNet | 0.58 | 0.54 | 0.61 | 55.00 |
Bl | Ours | 0.69 | 0.64 | 0.73 | 66.33 |
Disfluency | Model | F1 | Precision | Recall | Accuracy (%) |
---|---|---|---|---|---|
WP | LSTM | 0.97 | 0.97 | 0.97 | 97.19 |
WP | MLP | 0.98 | 0.98 | 0.98 | 97.81 |
WP | DisfluencyNet | 0.98 | 0.99 | 0.98 | 98.44 |
WP | Ours | 0.98 | 0.99 | 0.98 | 98.44 |
Pro | LSTM | 0.82 | 0.71 | 0.95 | 78.44 |
Pro | MLP | 0.82 | 0.71 | 0.97 | 78.44 |
Pro | DisfluencyNet | 0.83 | 0.83 | 0.83 | 82.57 |
Pro | Ours | 0.84 | 0.73 | 0.98 | 81.73 |
Intrj | LSTM | 0.92 | 0.94 | 0.90 | 92.47 |
Intrj | MLP | 0.93 | 0.96 | 0.90 | 93.15 |
Intrj | DisfluencyNet | 0.93 | 0.94 | 0.92 | 93.15 |
Intrj | Ours | 0.94 | 0.97 | 0.92 | 94.52 |
Bl | LSTM | 0.95 | 0.94 | 0.96 | 95.00 |
Bl | MLP | 0.96 | 0.95 | 0.98 | 96.11 |
Bl | DisfluencyNet | 0.97 | 0.98 | 0.97 | 97.50 |
Bl | Ours | 0.99 | 0.99 | 0.99 | 98.89 |
Disfluency | Model | F1 | Precision | Recall | Accuracy (%) |
---|---|---|---|---|---|
Snd | w/o CNN | 0.69 | 0.69 | 0.69 | 68.69 |
Snd | w/o Transformer | 0.66 | 0.68 | 0.65 | 67.23 |
Snd | Ours | 0.74 | 0.70 | 0.77 | 72.33 |
WP | w/o CNN | 0.71 | 0.66 | 0.76 | 69.18 |
WP | w/o Transformer | 0.70 | 0.68 | 0.71 | 68.31 |
WP | Ours | 0.75 | 0.74 | 0.77 | 70.17 |
Pro | w/o CNN | 0.68 | 0.63 | 0.73 | 65.53 |
Pro | w/o Transformer | 0.62 | 0.61 | 0.63 | 61.65 |
Pro | Ours | 0.72 | 0.68 | 0.77 | 70.39 |
Intrj | w/o CNN | 0.77 | 0.83 | 0.71 | 74.91 |
Intrj | w/o Transformer | 0.67 | 0.71 | 0.63 | 71.45 |
Intrj | Ours | 0.80 | 0.84 | 0.76 | 76.90 |
Bl | w/o CNN | 0.61 | 0.56 | 0.67 | 57.14 |
Bl | w/o Transformer | 0.56 | 0.56 | 0.55 | 56.12 |
Bl | Ours | 0.69 | 0.64 | 0.73 | 66.33 |
Disfluency | Model | F1 | Precision | Recall | Accuracy (%) |
---|---|---|---|---|---|
WP | w/o CNN | 0.98 | 0.98 | 0.98 | 97.81 |
WP | w/o Transformer | 0.97 | 0.98 | 0.97 | 97.50 |
WP | Ours | 0.98 | 0.99 | 0.98 | 98.44 |
Pro | w/o CNN | 0.84 | 0.73 | 0.98 | 80.73 |
Pro | w/o Transformer | 0.82 | 0.70 | 0.97 | 77.98 |
Pro | Ours | 0.84 | 0.73 | 0.98 | 81.73 |
Intrj | w/o CNN | 0.94 | 0.96 | 0.92 | 93.84 |
Intrj | w/o Transformer | 0.94 | 0.97 | 0.90 | 93.84 |
Intrj | Ours | 0.94 | 0.97 | 0.92 | 94.52 |
Bl | w/o CNN | 0.95 | 0.96 | 0.94 | 95.00 |
Bl | w/o Transformer | 0.97 | 0.96 | 0.97 | 96.67 |
Bl | Ours | 0.99 | 0.99 | 0.99 | 98.89 |
Disfluency | Type | Train All, Test All | Train 1/2, Test All | Train All, Test 1/2 |
---|---|---|---|---|
Snd | Length-scaled | 0.74 | 0.66 | 0.72 |
Snd | w/o length-scaled | 0.74 | 0.67 | 0.72 |
WP | Length-scaled | 0.75 | 0.53 | 0.43 |
WP | w/o length-scaled | 0.71 | 0.39 | 0.52 |
Pro | Length-scaled | 0.72 | 0.67 | 0.70 |
Pro | w/o length-scaled | 0.70 | 0.67 | 0.71 |
Intrj | Length-scaled | 0.80 | 0.79 | 0.50 |
Intrj | w/o length-scaled | 0.79 | 0.76 | 0.79 |
Bl | Length-scaled | 0.69 | 0.58 | 0.69 |
Bl | w/o length-scaled | 0.67 | 0.58 | 0.68 |
Disfluency | Type | Train All, Test All | Train 1/2, Test All | Train All, Test 1/2 |
---|---|---|---|---|
WP | Length-scaled | 0.98 | 0.97 | 0.98 |
WP | w/o length-scaled | 0.97 | 0.94 | 0.97 |
Pro | Length-scaled | 0.84 | 0.84 | 0.77 |
Pro | w/o length-scaled | 0.79 | 0.83 | 0.70 |
Intrj | Length-scaled | 0.94 | 0.94 | 0.88 |
Intrj | w/o length-scaled | 0.93 | 0.93 | 0.86 |
Bl | Length-scaled | 0.99 | 0.96 | 0.99 |
Bl | w/o length-scaled | 0.98 | 0.97 | 0.98 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, J.; Wumaier, A.; Wei, D.; Guo, S. Automatic Speech Disfluency Detection Using wav2vec2.0 for Different Languages with Variable Lengths. Appl. Sci. 2023, 13, 7579. https://doi.org/10.3390/app13137579