Search Results (28)

Search Parameters:
Keywords = connectionist temporal classification (CTC)

15 pages, 1909 KiB  
Article
Helium Speech Recognition Method Based on Spectrogram with Deep Learning
by Yonghong Chen, Shibing Zhang and Dongmei Li
Big Data Cogn. Comput. 2025, 9(5), 136; https://doi.org/10.3390/bdcc9050136 - 20 May 2025
Viewed by 503
Abstract
With the development of the marine economy and the increase in marine activities, deep saturation diving has gained significant attention. Helium speech communication is indispensable for saturation diving operations and is a critical technology for deep saturation diving, serving as the sole communication method to ensure the smooth execution of such operations. This study introduces deep learning into helium speech recognition and proposes a spectrogram-based dual-model helium speech recognition method. First, we extract the spectrogram features from the helium speech. Then, we combine a deep fully convolutional neural network with connectionist temporal classification (CTC) to form an acoustic model, in which the spectrogram features of helium speech are used as an input to convert speech signals into phonetic sequences. Finally, a maximum entropy hidden Markov model (MEMM) is employed as the language model to convert the phonetic sequences to word outputs, which is regarded as a dynamic programming problem. We use a Viterbi algorithm to find the optimal path to decode the phonetic sequences to word sequences. The simulation results show that the method can effectively recognize helium speech with a recognition rate of 97.89% for isolated words and 95.99% for continuous helium speech.
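
As a rough illustration of the decoding step described above, the sketch below implements a generic Viterbi dynamic program in Python/NumPy; the word-state space, transition matrix, and emission scores are placeholder assumptions, not the authors' MEMM.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most-probable state path for an HMM-style word model.
    log_init:  (S,)   log prior over word states
    log_trans: (S, S) log transition scores between word states
    log_emit:  (T, S) log score of each decoded phonetic chunk under each state
    """
    T, S = log_emit.shape
    dp = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans   # scores[i, j]: best path ending in i, then moving to j
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_emit[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):                  # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy run: 3 word states, 4 phonetic observations
rng = np.random.default_rng(0)
print(viterbi(np.log(np.full(3, 1 / 3)),
              np.log(rng.dirichlet(np.ones(3), size=3)),
              np.log(rng.dirichlet(np.ones(3), size=4))))
```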

31 pages, 6120 KiB  
Article
Enhancing Security of Online Interfaces: Adversarial Handwritten Arabic CAPTCHA Generation
by Ghady Alrasheed and Suliman A. Alsuhibany
Appl. Sci. 2025, 15(6), 2972; https://doi.org/10.3390/app15062972 - 10 Mar 2025
Viewed by 1004
Abstract
With the increasing online activity of Arabic speakers, the development of effective CAPTCHAs (Completely Automated Public Turing Tests to Tell Computers and Humans Apart) tailored for Arabic users has become crucial. Traditional CAPTCHAs, however, are increasingly vulnerable to machine learning-based attacks. To address this challenge, we introduce a method for generating adversarial handwritten Arabic CAPTCHAs that remain user-friendly yet difficult for machines to solve. Our approach involves synthesizing handwritten Arabic words using a simulation technique, followed by the application of five adversarial perturbation techniques: Expectation Over Transformation (EOT), Scaled Gaussian Translation with Channel Shifts (SGTCS), Jacobian-based Saliency Map Attack (JSMA), Immutable Adversarial Noise (IAN), and Connectionist Temporal Classification (CTC). Evaluation results demonstrate that JSMA provides the highest level of security, with 30% of meaningless-word CAPTCHAs remaining completely unrecognized by automated systems, a figure that falls to 6.66% for meaningful words. From a usability perspective, JSMA also achieves the highest accuracy rates, with 75.6% for meaningless words and 90.6% for meaningful words. Our work presents an effective strategy for enhancing the security of Arabic websites and online interfaces against bot attacks, contributing to the advancement of CAPTCHA systems.
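
The CTC-based perturbation listed above can be thought of as nudging the CAPTCHA image in the direction that increases a text recognizer's CTC loss. The PyTorch sketch below shows that idea as a one-step FGSM-style attack on a toy recognizer; the model, class count, and step size are assumptions for illustration and do not reproduce the paper's generation pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRecognizer(nn.Module):
    """Stand-in CAPTCHA reader: maps a (B, 1, 32, W) image to per-column class logits."""
    def __init__(self, num_classes=30):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, 3, padding=1)
        self.proj = nn.Linear(32 * 32, num_classes)

    def forward(self, x):                        # x: (B, 1, 32, W)
        f = F.relu(self.conv(x))                 # (B, 32, 32, W)
        f = f.permute(3, 0, 1, 2).flatten(2)     # (W, B, 32*32): one timestep per column
        return self.proj(f).log_softmax(dim=-1)  # (T, B, C) log-probabilities for CTC

def ctc_perturb(model, image, target, target_len, eps=0.03):
    """One FGSM-style step that increases the recognizer's CTC loss on the image,
    making the CAPTCHA harder for the model to read while staying visually similar."""
    image = image.clone().requires_grad_(True)
    log_probs = model(image)                                   # (T, B, C)
    T, B, _ = log_probs.shape
    input_len = torch.full((B,), T, dtype=torch.long)
    loss = nn.CTCLoss(blank=0)(log_probs, target, input_len, target_len)
    loss.backward()
    return (image + eps * image.grad.sign()).clamp(0.0, 1.0).detach()

model = TinyRecognizer()
img = torch.rand(1, 1, 32, 128)                  # toy CAPTCHA image
target = torch.randint(1, 30, (1, 5))            # toy 5-character label sequence
adv_img = ctc_perturb(model, img, target, torch.tensor([5]))
```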

18 pages, 585 KiB  
Article
Improving Diacritical Arabic Speech Recognition: Transformer-Based Models with Transfer Learning and Hybrid Data Augmentation
by Haifa Alaqel and Khalil El Hindi
Information 2025, 16(3), 161; https://doi.org/10.3390/info16030161 - 20 Feb 2025
Viewed by 1664
Abstract
Diacritical Arabic (DA) refers to Arabic text with diacritical marks that guide pronunciation and clarify meanings, making their recognition crucial for accurate linguistic interpretation. These diacritical marks (short vowels) significantly influence meaning and pronunciation, and their accurate recognition is vital for the effectiveness of automatic speech recognition (ASR) systems, particularly in applications requiring high semantic precision, such as voice-enabled translation services. Despite its importance, leveraging advanced machine learning techniques to enhance ASR for diacritical Arabic has remained underexplored. A key challenge in developing DA ASR is the limited availability of training data. This study introduces a transformer-based approach leveraging transfer learning and data augmentation to address these challenges. Using a cross-lingual speech representation (XLSR) model pretrained on 53 languages, we fine-tune it on DA and integrate connectionist temporal classification (CTC) with transformers for improved performance. Data augmentation techniques, including volume adjustment, pitch shift, speed alteration, and hybrid strategies, further mitigate data limitations, significantly reducing word error rates (WER). Our methods achieve a WER of 12.17%, outperforming traditional ASR systems and setting a new benchmark for DA ASR. These findings demonstrate the potential of advanced machine learning to address longstanding challenges in DA ASR and enhance its accuracy.
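
The augmentation strategies mentioned in the abstract (volume, pitch, speed, and hybrids) can be sketched with librosa as below; the gain, semitone, and rate values are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
wave = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)       # stand-in utterance

louder  = np.clip(wave * 1.5, -1.0, 1.0)                          # volume adjustment
shifted = librosa.effects.pitch_shift(wave, sr=sr, n_steps=2)     # pitch shift (+2 semitones)
faster  = librosa.effects.time_stretch(wave, rate=1.1)            # speed alteration (10% faster)
hybrid  = librosa.effects.time_stretch(                           # hybrid: stacked perturbations
    librosa.effects.pitch_shift(louder, sr=sr, n_steps=-1), rate=0.9)
```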

16 pages, 1512 KiB  
Article
An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation
by Yi Qin and Feifan Yu
Sensors 2025, 25(2), 341; https://doi.org/10.3390/s25020341 - 9 Jan 2025
Viewed by 866
Abstract
The coal mining industry in Northern Shaanxi is robust, with a prevalent use of the local dialect, known as “Shapu”, characterized by a distinct Northern Shaanxi accent. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model for the North Shaanxi dialect, leveraging the Conformer architecture. To tailor the model to the coal mining context, we developed a specialized corpus reflecting the phonetic characteristics of the dialect and its usage in the industry. We investigated feature extraction techniques suitable for the North Shaanxi dialect, focusing on the unique pronunciation of initial consonants and vowels. A preprocessing module was designed to accommodate the dialect’s rapid speech tempo and polyphonic nature, enhancing recognition performance. To enhance the decoder’s text generation capability, we replaced the Conformer decoder with a Transformer architecture. Additionally, to mitigate the computational demands of the model, we incorporated Connectionist Temporal Classification (CTC) joint training for optimization. The experimental results on our self-established voice dataset for the Northern Shaanxi coal mining industry demonstrate that the proposed Conformer–Transformer–CTC model achieves a 9.2% and 10.3% reduction in the word error rate compared to the standalone Conformer and Transformer models, respectively, confirming the advancement of our method. The next step will involve researching how to improve the performance of dialect speech recognition by integrating external language models and extracting pronunciation features of different dialects, thereby achieving better recognition results.
(This article belongs to the Section Intelligent Sensors)
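
Joint CTC training of an attention-based encoder-decoder is usually realized as a weighted sum of the two losses. The PyTorch sketch below shows one common formulation; the weight value and tensor shapes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss(ignore_index=-100)   # -100 marks padded target positions
lam = 0.3                                                 # CTC weight; a typical, assumed value

def joint_ctc_attention_loss(ctc_log_probs, ctc_input_lens, dec_logits, targets, target_lens):
    """ctc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
       dec_logits:    (B, L, V) logits from the attention (Transformer) decoder
       targets:       (B, L)    token ids, padded with -100"""
    ctc_targets = targets.clamp(min=0)        # padding is ignored by CTC via target_lens
    loss_ctc = ctc_criterion(ctc_log_probs, ctc_targets, ctc_input_lens, target_lens)
    loss_att = att_criterion(dec_logits.transpose(1, 2), targets)   # (B, V, L) vs (B, L)
    return lam * loss_ctc + (1.0 - lam) * loss_att
```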

20 pages, 1150 KiB  
Article
MPSA-Conformer-CTC/Attention: A High-Accuracy, Low-Complexity End-to-End Approach for Tibetan Speech Recognition
by Changlin Wu, Huihui Sun, Kaifeng Huang and Long Wu
Sensors 2024, 24(21), 6824; https://doi.org/10.3390/s24216824 - 24 Oct 2024
Viewed by 1758
Abstract
This study addresses the challenges of low accuracy and high computational demands in Tibetan speech recognition by investigating the application of end-to-end networks. We propose a decoding strategy that integrates Connectionist Temporal Classification (CTC) and Attention mechanisms, capitalizing on the benefits of automatic alignment and attention weight extraction. The Conformer architecture is utilized as the encoder, leading to the development of the Conformer-CTC/Attention model. This model first extracts global features from the speech signal using the Conformer, followed by joint decoding of these features through CTC and Attention mechanisms. To mitigate convergence issues during training, particularly with longer input feature sequences, we introduce a Probabilistic Sparse Attention mechanism within the joint CTC/Attention framework. Additionally, we implement a maximum entropy optimization algorithm for CTC, effectively addressing challenges such as increased path counts, spike distributions, and local optima during training. We designate the proposed method as the MaxEnt-Optimized Probabilistic Sparse Attention Conformer-CTC/Attention Model (MPSA-Conformer-CTC/Attention). Experimental results indicate that our improved model achieves a word error rate reduction of 10.68% and 9.57% on self-constructed and open-source Tibetan datasets, respectively, compared to the baseline model. Furthermore, the enhanced model not only reduces memory consumption and training time but also improves generalization capability and accuracy.
(This article belongs to the Special Issue New Trends in Biometric Sensing and Information Processing)
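
The paper's maximum-entropy optimization for CTC is not reproduced here, but the general idea of discouraging spiky CTC output distributions can be illustrated with a simple per-frame entropy bonus added to the standard CTC loss, as in the hedged PyTorch sketch below.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def entropy_regularized_ctc(log_probs, targets, input_lens, target_lens, beta=0.1):
    """log_probs: (T, B, V) log-probabilities from the acoustic encoder.
    Adds a bonus for high per-frame entropy so training is discouraged from
    collapsing into spiky, overconfident frame distributions. Illustrative only;
    this is not the paper's MaxEnt optimization algorithm."""
    base = ctc(log_probs, targets, input_lens, target_lens)
    probs = log_probs.exp()
    frame_entropy = -(probs * log_probs).sum(dim=-1).mean()   # averaged over frames and batch
    return base - beta * frame_entropy
```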

18 pages, 4420 KiB  
Article
Machine Learning Approach for Arabic Handwritten Recognition
by A. M. Mutawa, Mohammad Y. Allaho and Monirah Al-Hajeri
Appl. Sci. 2024, 14(19), 9020; https://doi.org/10.3390/app14199020 - 6 Oct 2024
Cited by 2 | Viewed by 3745
Abstract
Text recognition is an important area of the pattern recognition field. Natural language processing (NLP) and pattern recognition have been utilized efficiently in script recognition. Much research has been conducted on handwritten script recognition. However, the research on the Arabic language for handwritten text recognition has received little attention compared with other languages. Therefore, it is crucial to develop a new model that can recognize Arabic handwritten text. Most of the existing models used to recognize Arabic text are based on traditional machine learning techniques. Therefore, we implemented a new model using deep machine learning techniques by integrating two deep neural networks. In the new model, the architecture of the Residual Network (ResNet) model is used to extract features from raw images. Then, the Bidirectional Long Short-Term Memory (BiLSTM) and connectionist temporal classification (CTC) are used for sequence modeling. Our system improved the recognition rate of Arabic handwritten text compared to other models of a similar type, with a character error rate of 13.2% and a word error rate of 27.31%. In conclusion, the domain of Arabic handwritten recognition is advancing swiftly with the use of sophisticated deep learning methods.
(This article belongs to the Special Issue Applied Intelligence in Natural Language Processing)
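
The ResNet-plus-BiLSTM-plus-CTC pipeline described above follows the familiar CRNN pattern. The PyTorch sketch below shows a minimal version; the layer sizes, input height, and class count are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN feature extractor + BiLSTM + CTC output head, in the spirit of the
    ResNet/BiLSTM/CTC pipeline described above; layer sizes are illustrative."""
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                                    # halve height and width
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                               # halve height only
        )
        self.rnn = nn.LSTM(128 * 8, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)   # assumes 32-pixel-high input
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                        # x: (B, 1, 32, W) grayscale line images
        f = self.cnn(x)                          # (B, 128, 8, W/2)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (B, W/2, 128*8) time-major features
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(dim=-1)  # (B, T, C) log-probabilities for CTC

model = CRNN(num_classes=120)                    # assumed class count: character set + CTC blank
print(model(torch.rand(2, 1, 32, 256)).shape)    # torch.Size([2, 128, 120])
```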

10 pages, 585 KiB  
Technical Note
Text-Independent Phone-to-Audio Alignment Leveraging SSL (TIPAA-SSL) Pre-Trained Model Latent Representation and Knowledge Transfer
by Noé Tits, Prernna Bhatnagar and Thierry Dutoit
Acoustics 2024, 6(3), 772-781; https://doi.org/10.3390/acoustics6030042 - 29 Aug 2024
Cited by 1 | Viewed by 1879
Abstract
In this paper, we present a novel approach for text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method leverages a self-supervised model (Wav2Vec2) fine-tuned for phoneme recognition using a Connectionist Temporal Classification (CTC) loss, a dimension reduction model and a frame-level phoneme classifier trained using forced-alignment labels (using Montreal Forced Aligner) to produce multi-lingual phonetic representations, thus requiring minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work, but the design of the system makes it easily adaptable to other languages.
(This article belongs to the Special Issue Developments in Acoustic Phonetic Research)
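
The frame-level phonetic representations the method builds on can be obtained from any CTC-fine-tuned Wav2Vec2 checkpoint. The sketch below, using the Hugging Face transformers library, extracts per-frame phoneme posteriors; the checkpoint name is a publicly available example, not the authors' fine-tuned model.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Example public phoneme-CTC checkpoint; the paper fine-tunes its own Wav2Vec2 model.
ckpt = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt).eval()

speech = torch.zeros(16000).numpy()                    # placeholder: one second of 16 kHz audio
inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs["input_values"]).logits      # (1, T_frames, num_phonemes)

frame_posteriors = logits.softmax(dim=-1)              # per-frame phoneme posteriors
frame_labels = frame_posteriors.argmax(dim=-1)         # frame labels before CTC collapse
print(frame_labels.shape)
```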

13 pages, 2651 KiB  
Article
Speech Recognition for Air Traffic Control Utilizing a Multi-Head State-Space Model and Transfer Learning
by Haijun Liang, Hanwen Chang and Jianguo Kong
Aerospace 2024, 11(5), 390; https://doi.org/10.3390/aerospace11050390 - 14 May 2024
Cited by 1 | Viewed by 1803
Abstract
In the present study, a novel end-to-end automatic speech recognition (ASR) framework, namely, ResNeXt-Mssm-CTC, has been developed for air traffic control (ATC) systems. This framework is built upon the Multi-Head State-Space Model (Mssm) and incorporates transfer learning techniques. Residual Networks with Cardinality (ResNeXt) employ multi-layered convolutions with residual connections to augment the extraction of intricate feature representations from speech signals. The Mssm is endowed with specialized gating mechanisms, which incorporate parallel heads that acquire knowledge of both local and global temporal dynamics in sequence data. Connectionist temporal classification (CTC) is utilized in the context of sequence labeling, eliminating the requirement for forced alignment and accommodating labels of varying lengths. Moreover, the utilization of transfer learning has been shown to improve performance on the target task by leveraging knowledge acquired from a source task. The experimental results indicate that the model proposed in this study exhibits superior performance compared to other baseline models. Specifically, when pretrained on the Aishell corpus, the model achieves a minimum character error rate (CER) of 7.2% and 8.3%. Furthermore, when applied to the ATC corpus, the CER is reduced to 5.5% and 6.7%.

25 pages, 2228 KiB  
Article
Improving Text-Independent Forced Alignment to Support Speech-Language Pathologists with Phonetic Transcription
by Ying Li, Bryce Johannas Wohlan, Duc-Son Pham, Kit Yan Chan, Roslyn Ward, Neville Hennessey and Tele Tan
Sensors 2023, 23(24), 9650; https://doi.org/10.3390/s23249650 - 6 Dec 2023
Cited by 1 | Viewed by 3282
Abstract
Problem: Phonetic transcription is crucial in diagnosing speech sound disorders (SSDs) but is susceptible to transcriber experience and perceptual bias. Current forced alignment (FA) tools, which annotate audio files to determine spoken content and its placement, often require manual transcription, limiting their effectiveness. Method: We introduce a novel, text-independent forced alignment model that autonomously recognises individual phonemes and their boundaries, addressing these limitations. Our approach leverages an advanced, pre-trained wav2vec 2.0 model to segment speech into tokens and recognise them automatically. To accurately identify phoneme boundaries, we utilise an unsupervised segmentation tool, UnsupSeg. Labelling of segments employs nearest-neighbour classification with wav2vec 2.0 labels, before connectionist temporal classification (CTC) collapse, determining class labels based on maximum overlap. Additional post-processing, including overfitting cleaning and voice activity detection, is implemented to enhance segmentation. Results: We benchmarked our model against existing methods using the TIMIT dataset for normal speakers and, for the first time, evaluated its performance on the TORGO dataset containing SSD speakers. Our model demonstrated competitive performance, achieving a harmonic mean score of 76.88% on TIMIT and 70.31% on TORGO. Implications: This research presents a significant advancement in the assessment and diagnosis of SSDs, offering a more objective and less biased approach than traditional methods. Our model’s effectiveness, particularly with SSD speakers, opens new avenues for research and clinical application in speech pathology.
(This article belongs to the Special Issue Artificial Intelligence in Medical Sensors II)
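
The maximum-overlap labelling of unsupervised segments can be sketched in a few lines of Python: each segment receives the pre-collapse frame label that covers most of its frames. Variable names and shapes below are illustrative assumptions, not the paper's exact interface.

```python
from collections import Counter

def label_segments(frame_labels, boundaries):
    """Assign each unsupervised segment the frame label that overlaps it most.
    frame_labels: per-frame class ids (before CTC collapse), e.g. wav2vec 2.0 argmax labels
    boundaries:   (start, end) frame indices from an unsupervised segmenter such as UnsupSeg
    """
    labels = []
    for start, end in boundaries:
        counts = Counter(frame_labels[start:end])          # overlap measured in frames
        labels.append(counts.most_common(1)[0][0] if counts else None)
    return labels

# toy example: 10 frames, two segments
print(label_segments([3, 3, 3, 7, 7, 7, 7, 0, 0, 0], [(0, 3), (3, 7)]))   # -> [3, 7]
```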

20 pages, 3255 KiB  
Article
Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition
by Rina Buoy, Masakazu Iwamura, Sovila Srun and Koichi Kise
J. Imaging 2023, 9(11), 248; https://doi.org/10.3390/jimaging9110248 - 15 Nov 2023
Cited by 1 | Viewed by 2966
Abstract
Connectionist temporal classification (CTC) is a favored decoder in scene text recognition (STR) for its simplicity and efficiency. However, most CTC-based methods utilize one-dimensional (1D) vector sequences, usually derived from a recurrent neural network (RNN) encoder. This results in the absence of an explainable 2D spatial relationship between the predicted characters and corresponding image regions, essential for model explainability. On the other hand, 2D attention-based methods enhance recognition accuracy and offer character location information via cross-attention mechanisms, linking predictions to image regions. However, these methods are more computationally intensive compared with the 1D CTC-based methods. To achieve both low latency and model explainability via character localization using a 1D CTC decoder, we propose a marginalization-based method that processes 2D feature maps and predicts a sequence of 2D joint probability distributions over the height and class dimensions. Based on the proposed method, we newly introduce an association map that aids in character localization and model prediction explanation. This map parallels the role of a cross-attention map, as seen in computationally intensive attention-based architectures. With the proposed method, we consider a ViT-CTC STR architecture that uses a 1D CTC decoder and a pretrained vision Transformer (ViT) as a 2D feature extractor. Our ViT-CTC models were trained on synthetic data and fine-tuned on real labeled sets. These models outperform the recent state-of-the-art (SOTA) CTC-based methods on benchmarks in terms of recognition accuracy. Compared with the baseline Transformer-decoder-based models, our ViT-CTC models offer a speed boost up to 12 times regardless of the backbone, with a maximum 3.1% reduction in total word recognition accuracy. In addition, both qualitative and quantitative assessments of character locations estimated from the association map align closely with those from the cross-attention map and ground-truth character-level bounding boxes.
(This article belongs to the Section Document Analysis and Processing)
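
A simplified reading of the marginalization idea is sketched below in PyTorch: each image column gets a joint distribution over height and character classes, which is marginalized over height for the 1D CTC decoder and over classes for a coarse height-localization map. Tensor shapes are assumptions; the paper's association map is built on the same joint distribution but is not reproduced exactly here.

```python
import torch

def marginalize_height(logits_2d):
    """logits_2d: (B, C, H, W) class-and-height scores for each image column.
    Returns per-column class probabilities for a 1D CTC decoder and a per-column
    height distribution usable for rough character localization."""
    B, C, H, W = logits_2d.shape
    joint = logits_2d.permute(0, 3, 1, 2).reshape(B, W, C * H)
    joint = joint.softmax(dim=-1).reshape(B, W, C, H)    # joint P(class, height | column)
    class_probs = joint.sum(dim=-1)                       # marginalize height -> (B, W, C)
    height_probs = joint.sum(dim=2)                       # marginalize class  -> (B, W, H)
    return class_probs, height_probs

class_probs, height_probs = marginalize_height(torch.randn(1, 40, 8, 32))
print(class_probs.shape, height_probs.shape)              # (1, 32, 40) (1, 32, 8)
```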

14 pages, 3012 KiB  
Article
Bidirectional Representations for Low-Resource Spoken Language Understanding
by Quentin Meeus, Marie-Francine Moens and Hugo Van hamme
Appl. Sci. 2023, 13(20), 11291; https://doi.org/10.3390/app132011291 - 14 Oct 2023
Cited by 1 | Viewed by 1340
Abstract
Speech representation models lack the ability to efficiently store semantic information and require fine tuning to deliver decent performance. In this research, we introduce a transformer encoder–decoder framework with a multiobjective training strategy, incorporating connectionist temporal classification (CTC) and masked language modeling (MLM) objectives. This approach enables the model to learn contextual bidirectional representations. We evaluate the representations in a challenging low-resource scenario, where training data is limited, necessitating expressive speech embeddings to compensate for the scarcity of examples. Notably, we demonstrate that our model’s initial embeddings outperform comparable models on multiple datasets before fine tuning. Fine tuning the top layers of the representation model further enhances performance, particularly on the Fluent Speech Command dataset, even under low-resource conditions. Additionally, we introduce the concept of class attention as an efficient module for spoken language understanding, characterized by its speed and minimal parameter requirements. Class attention not only aids in explaining model predictions but also enhances our understanding of the underlying decision-making processes. Our experiments cover both English and Dutch languages, offering a comprehensive evaluation of our proposed approach.
(This article belongs to the Special Issue Deep Learning for Speech Processing)

13 pages, 1367 KiB  
Article
JSUM: A Multitask Learning Speech Recognition Model for Jointly Supervised and Unsupervised Learning
by Nurmemet Yolwas and Weijing Meng
Appl. Sci. 2023, 13(9), 5239; https://doi.org/10.3390/app13095239 - 22 Apr 2023
Cited by 3 | Viewed by 2213
Abstract
In recent years, the end-to-end speech recognition model has emerged as a popular alternative to the traditional Deep Neural Network-Hidden Markov Model (DNN-HMM). This approach maps acoustic features directly onto text sequences via a single network architecture, significantly streamlining the model construction process. However, the training of end-to-end speech recognition models typically necessitates a significant quantity of supervised data to achieve good performance, which poses a challenge in low-resource conditions. The use of unsupervised representations significantly reduces this necessity. Recent research has focused on end-to-end techniques employing joint Connectionist Temporal Classification (CTC) and attention mechanisms, with some also concentrating on unsupervised representation learning. This paper proposes a joint supervised and unsupervised multi-task learning model (JSUM). Our approach leverages the unsupervised pre-trained wav2vec 2.0 model as a shared encoder that integrates the joint CTC-Attention network and the generative adversarial network into a unified end-to-end architecture. Our method provides a new low-resource language speech recognition solution that optimally utilizes supervised and unsupervised datasets by combining CTC, attention, and generative adversarial losses. Furthermore, our proposed approach is suitable for both monolingual and cross-lingual scenarios.
(This article belongs to the Special Issue Audio, Speech and Language Processing)

16 pages, 2087 KiB  
Article
Efficient Conformer for Agglutinative Language ASR Model Using Low-Rank Approximation and Balanced Softmax
by Ting Guo, Nurmemet Yolwas and Wushour Slamu
Appl. Sci. 2023, 13(7), 4642; https://doi.org/10.3390/app13074642 - 6 Apr 2023
Cited by 4 | Viewed by 2873
Abstract
Recently, the performance of end-to-end speech recognition has been further improved based on the proposed Conformer framework, which has also been widely used in the field of speech recognition. However, the Conformer model is mostly applied to very widespread languages, such as Chinese and English, and rarely applied to speech recognition of Central and West Asian agglutinative languages. The Conformer end-to-end speech recognition model has a large number of network parameters, so its structure is complex and it consumes more resources. At the same time, we found that there is a long-tail problem in Kazakh, i.e., the distribution of high-frequency words and low-frequency words is not uniform, which makes the recognition accuracy of the model low. For these reasons, we made the following improvements to the Conformer baseline model. First, we constructed a low-rank multi-head self-attention encoder and decoder using low-rank approximation decomposition to reduce the number of parameters of the multi-head self-attention module and the model’s storage space. Second, to alleviate the long-tail problem in Kazakh, the original softmax function was replaced by a balanced softmax function in the Conformer model. Third, we use connectionist temporal classification (CTC) as an auxiliary task to speed up model training and build a multi-task lightweight but efficient Conformer speech recognition model with hybrid CTC/Attention. To evaluate the effectiveness of the proposed model, we conduct experiments on the open-source Kazakh language dataset without any external language model; the number of parameters is reduced by 7.4% and the storage space by 13.5 MB, while the training speed and word error rate remain basically unchanged.
(This article belongs to the Section Acoustics and Vibrations)
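
The low-rank approximation used to shrink the attention projections can be illustrated with a truncated SVD: a dense projection matrix is replaced by two thin factors. The PyTorch sketch below is a generic factorization, not necessarily the paper's exact decomposition scheme.

```python
import torch

def low_rank_factorize(weight, rank):
    """Replace a dense projection W (d_out x d_in) with two thin matrices
    A (d_out x r) and B (r x d_in) from a truncated SVD, so W ~= A @ B."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]           # fold singular values into A
    B = Vh[:rank, :]
    return A, B

W = torch.randn(512, 512)                # e.g. one multi-head self-attention projection matrix
A, B = low_rank_factorize(W, rank=64)
print((W - A @ B).norm() / W.norm())     # relative approximation error
# parameter count: 512*512 = 262144  ->  512*64 + 64*512 = 65536
```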

25 pages, 6500 KiB  
Article
Using Phase-Sensitive Optical Time Domain Reflectometers to Develop an Alignment-Free End-to-End Multitarget Recognition Model
by Nachuan Yang, Yongjun Zhao, Fuqiang Wang and Jinyang Chen
Electronics 2023, 12(7), 1617; https://doi.org/10.3390/electronics12071617 - 29 Mar 2023
Cited by 5 | Viewed by 2218
Abstract
Pattern recognition methods can effectively identify vibration signals collected by a phase-sensitive optical time-domain reflectometer (Φ-OTDR) and improve the accuracy of alarms. An alignment-free end-to-end multi-vibration event detection method based on Φ-OTDR is proposed, effectively detecting different vibration events in different frequency bands. Pulse accumulation and pulse cancellers determine the location of vibration events, and the local differential detection method demodulates the time-domain variation signals of the vibration events. After the signal time-frequency features are extracted with a sliding window, a convolutional neural network (CNN) further extracts the signal features, and a bidirectional long short-term memory network (Bi-LSTM) analyzes the temporal relationships within each group of signal features. Finally, connectionist temporal classification (CTC) is used to label the unsegmented sequence data, achieving detection of multiple vibration targets in a single pass. Experiments show that when this method is used to process the 8563 collected multi-vibration acoustic sensing signals, which span 5 different frequency bands, the system achieves an F1 score of 99.49% with a single detection time of 2.2 ms, and the highest frequency response is 1 kHz. The method can quickly and efficiently identify multiple vibration signals when a single demodulated acoustic sensing signal contains multiple vibration events.
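
The sliding-window time-frequency extraction step can be illustrated with a plain spectrogram computation. The sketch below uses SciPy on a synthetic two-tone trace; the sampling rate, window length, and overlap are illustrative assumptions, not the paper's acquisition parameters.

```python
import numpy as np
from scipy import signal

fs = 10000                                   # assumed sampling rate of the demodulated trace
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 400 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)   # toy two-tone vibration signal

# Sliding-window time-frequency features: an STFT spectrogram with 50%-overlapping windows.
freqs, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=256, noverlap=128)
log_features = np.log1p(Sxx)                 # (frequency bins, time frames), input to the CNN stage
print(log_features.shape)
```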

18 pages, 2795 KiB  
Article
A Helium Speech Unscrambling Algorithm Based on Deep Learning
by Yonghong Chen and Shibing Zhang
Information 2023, 14(3), 189; https://doi.org/10.3390/info14030189 - 17 Mar 2023
Cited by 3 | Viewed by 2428
Abstract
Helium speech, the language spoken by divers in the deep sea who breathe a high-pressure helium–oxygen mixture, is almost unintelligible. To accurately unscramble helium speech, a neural network based on deep learning is proposed. First, an isolated helium speech corpus and a continuous helium speech corpus in a normal atmosphere are constructed, and an algorithm to automatically generate label files is proposed. Then, a convolution neural network (CNN), connectionist temporal classification (CTC) and a transformer are combined into a speech recognition network. Finally, an optimization algorithm is proposed to improve the recognition of continuous helium speech, which combines depth-wise separable convolution (DSC), a gated linear unit (GLU) and a feedforward neural network (FNN). The experimental results show that the accuracy of the algorithm, upon combining the CNN, CTC and the transformer, is 91.38%, and the optimization algorithm improves the accuracy of continuous helium speech recognition by 9.26%.
(This article belongs to the Special Issue Intelligent Information Processing for Sensors and IoT Communications)
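
For readers unfamiliar with the CTC component shared by this and the other listed papers, the sketch below shows the basic shape contract of a CTC training step in PyTorch on dummy tensors; the vocabulary size and sequence lengths are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Dummy shapes: 50 frames, batch of 4, 28 output classes (blank = 0), 10-label targets.
T, B, V, S = 50, 4, 28, 10
log_probs = torch.randn(T, B, V).log_softmax(dim=-1).requires_grad_()
targets = torch.randint(1, V, (B, S))                  # labels never use the blank id
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)

criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                        # gradients would flow back into the acoustic model
print(float(loss))
```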
