Search Results (9)

Search Parameters:
Keywords = hybrid CTC/attention

19 pages, 19265 KB  
Article
A Novel Microfluidic Platform for Circulating Tumor Cell Identification in Non-Small-Cell Lung Cancer
by Tingting Tian, Shanni Ma, Yan Wang, He Yin, Tiantian Dang, Guangqi Li, Jiaming Li, Weijie Feng, Mei Tian, Jinbo Ma and Zhijun Zhao
Micromachines 2025, 16(10), 1136; https://doi.org/10.3390/mi16101136 - 1 Oct 2025
Viewed by 255
Abstract
Circulating tumor cells (CTCs) are crucial biomarkers for lung cancer metastasis and recurrence, garnering significant clinical attention. Despite this, efficient and cost-effective detection methods remain scarce. Consequently, there is an urgent demand for the development of highly sensitive CTC detection technologies to enhance lung cancer diagnosis and treatment. This study utilized microspheres and A549 cells to model CTCs, assessing the impact of acoustic field forces on cell viability and proliferation and confirming capture efficiency. Subsequently, CTCs from the peripheral blood of patients with lung cancer were captured and identified using fluorescence in situ hybridization, and the results were compared to the immunomagnetic bead method to evaluate the differences between the techniques. Finally, epidermal growth factor receptor (EGFR) mutation analysis was conducted on CTC-positive samples. The findings showed that acoustic microfluidic technology effectively captures microspheres, A549 cells, and CTCs without compromising cell viability or proliferation. Moreover, EGFR mutation analysis successfully identified mutation types in four samples, establishing a basis for personalized targeted therapy. In conclusion, acoustic microfluidic technology preserves cell viability while efficiently capturing CTCs. When integrated with EGFR mutation analysis, it provides robust support for the precise diagnosis and treatment of lung cancer as well as personalized drug therapy.
(This article belongs to the Special Issue Application of Microfluidic Technology in Bioengineering)

17 pages, 2077 KB  
Article
Nonlinear Regularization Decoding Method for Speech Recognition
by Jiang Zhang, Liejun Wang, Yinfeng Yu and Miaomiao Xu
Sensors 2024, 24(12), 3846; https://doi.org/10.3390/s24123846 - 14 Jun 2024
Cited by 4 | Viewed by 1448
Abstract
Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, the issue of error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built upon the Transformer architecture, which tends to be complex and unfriendly to small datasets. Hence, we propose a Nonlinear Regularization Decoding Method for Speech Recognition. Firstly, we introduce a nonlinear Transformer decoder, breaking away from traditional left-to-right or right-to-left decoding orders and enabling associations between any characters, which mitigates the limitations of Transformer architectures on small datasets. Secondly, we propose a novel regularization attention module to optimize the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the challenge of overly large model parameters. The experimental results indicate that our model performs well. Compared to the baseline, our model achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
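The joint objective shared by these CTC/Transformer hybrid decoders is standard; a minimal PyTorch sketch of the interpolated loss (the weight `ctc_weight`, tensor shapes, and vocabulary size below are illustrative assumptions, not values from the paper):

```python
# Minimal sketch of the standard hybrid CTC/attention training objective.
# ctc_weight, shapes, and vocab size are illustrative; padding and
# sos/eos handling are omitted for brevity.
import torch
import torch.nn as nn

vocab_size, ctc_weight = 32, 0.3
ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss()

def hybrid_loss(encoder_logits,   # (batch, frames, vocab) from the shared encoder
                decoder_logits,   # (batch, tokens, vocab) from the attention decoder
                targets,          # (batch, tokens) label ids, no blanks
                frame_lens, target_lens):
    # CTC branch expects (frames, batch, vocab) log-probabilities.
    log_probs = encoder_logits.log_softmax(dim=-1).transpose(0, 1)
    l_ctc = ctc_criterion(log_probs, targets, frame_lens, target_lens)
    # Attention branch is plain token-level cross-entropy.
    l_att = att_criterion(decoder_logits.reshape(-1, vocab_size), targets.reshape(-1))
    # Interpolated objective used by hybrid CTC/attention systems.
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att

# Toy call: 2 utterances of 50 frames and 10 target tokens each.
enc = torch.randn(2, 50, vocab_size)
dec = torch.randn(2, 10, vocab_size)
tgt = torch.randint(1, vocab_size, (2, 10))
print(hybrid_loss(enc, dec, tgt, torch.full((2,), 50), torch.full((2,), 10)))
```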

16 pages, 2087 KB  
Article
Efficient Conformer for Agglutinative Language ASR Model Using Low-Rank Approximation and Balanced Softmax
by Ting Guo, Nurmemet Yolwas and Wushour Slamu
Appl. Sci. 2023, 13(7), 4642; https://doi.org/10.3390/app13074642 - 6 Apr 2023
Cited by 4 | Viewed by 3245
Abstract
Recently, the performance of end-to-end speech recognition has been further improved by the Conformer framework, which is now widely used in the field. However, the Conformer model has mostly been applied to widespread languages such as Chinese and English, and rarely to the agglutinative languages of Central and West Asia. The Conformer end-to-end model also has many network parameters, so its structure is complex and it consumes considerable resources. At the same time, we found a long-tail problem in Kazakh, i.e., the distribution of high-frequency and low-frequency words is not uniform, which lowers the recognition accuracy of the model. For these reasons, we made the following improvements to the Conformer baseline model. First, we constructed a low-rank multi-head self-attention encoder and decoder using low-rank approximation decomposition to reduce the number of parameters of the multi-head self-attention module and the model's storage space. Second, to alleviate the long-tail problem in Kazakh, we replaced the original softmax function with a balanced softmax function in the Conformer model. Third, we used connectionist temporal classification (CTC) as an auxiliary task to speed up model training and built a multi-task, lightweight but efficient Conformer speech recognition model with hybrid CTC/Attention. To evaluate the effectiveness of the proposed model, we conducted experiments on an open-source Kazakh dataset without an external language model; the number of parameters was reduced by 7.4% and the storage space by 13.5 MB, while the training speed and word error rate remained essentially unchanged.
(This article belongs to the Section Acoustics and Vibrations)
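The abstract does not spell out the low-rank decomposition; one common reading, sketched below with assumed dimensions and rank, is to factor each d x d self-attention projection into two thin matrices, which is where the parameter and storage savings come from:

```python
# Sketch of replacing a full d x d attention projection with a low-rank
# factorization (rank and dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Approximates nn.Linear(d_model, d_model) with two thin matrices."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)   # d_model * rank params
        self.up = nn.Linear(rank, d_model, bias=True)      # rank * d_model params

    def forward(self, x):
        return self.up(self.down(x))

d_model, rank = 256, 64
full = nn.Linear(d_model, d_model)       # 256*256 + 256 = 65,792 parameters
low_rank = LowRankLinear(d_model, rank)  # 2*256*64 + 256 = 33,024 parameters
x = torch.randn(8, 100, d_model)         # (batch, frames, features)
print(full(x).shape, low_rank(x).shape)  # both torch.Size([8, 100, 256])
```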

17 pages, 2364 KB  
Article
Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
by Zeyu Ren, Nurmemet Yolwas, Wushour Slamu, Ronghe Cao and Huiru Wang
Sensors 2022, 22(19), 7319; https://doi.org/10.3390/s22197319 - 27 Sep 2022
Cited by 12 | Viewed by 3737
Abstract
Unlike traditional models, the end-to-end (E2E) ASR model does not require resources such as a pronunciation dictionary; its system is built from a single neural network and achieves performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely used for Central Asian languages such as Turkish and Uzbek. We extend the dataset by adding noise to the original audio and using speed perturbation. To improve the performance of an E2E agglutinative-language speech recognition system, we propose a new feature extractor, MSPC, which uses convolution kernels of different sizes to extract and fuse features at different scales. The experimental results show that this structure is superior to VGGnet. In addition, the attention module is improved. By using the CTC objective function in training and the BERT model to initialize the language model in the decoding stage, the proposed method accelerates the convergence of the model and improves the accuracy of speech recognition. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other dataset increase by 2.42% and 2.96%, respectively. We apply the model structure to the Common Voice Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method is close to the advanced E2E systems.
(This article belongs to the Section Intelligent Sensors)
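The MSPC extractor itself is not detailed in the abstract; a hypothetical sketch of the general idea of parallel convolution branches with different kernel sizes whose outputs are fused (kernel sizes and channel counts are assumptions):

```python
# Hypothetical multi-scale convolutional front end in the spirit of MSPC:
# parallel branches with different kernel sizes, outputs concatenated and fused.
import torch
import torch.nn as nn

class MultiScaleFrontEnd(nn.Module):
    def __init__(self, in_ch=1, out_ch=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, stride=2, padding=k // 2),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        # Fuse the concatenated multi-scale features back to out_ch channels.
        self.fuse = nn.Conv2d(out_ch * len(kernel_sizes), out_ch, kernel_size=1)

    def forward(self, spec):                  # spec: (batch, 1, time, mel_bins)
        feats = [branch(spec) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(4, 1, 200, 80)                # 4 utterances, 200 frames, 80 mels
print(MultiScaleFrontEnd()(x).shape)          # torch.Size([4, 64, 100, 40])
```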

23 pages, 4802 KB  
Article
Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language
by Abdinabi Mukhamadiyev, Ilyos Khujayarov, Oybek Djuraev and Jinsoo Cho
Sensors 2022, 22(10), 3683; https://doi.org/10.3390/s22103683 - 12 May 2022
Cited by 63 | Viewed by 9435
Abstract
Communication has been an important aspect of human life, civilization, and globalization for thousands of years. Biometric analysis, education, security, healthcare, and smart cities are only a few examples of speech recognition applications. Most studies have concentrated mainly on English, Spanish, Japanese, or Chinese, disregarding other low-resource languages, such as Uzbek, and leaving their analysis open. In this paper, we propose an End-to-End Deep Neural Network-Hidden Markov Model speech recognition model and a hybrid Connectionist Temporal Classification (CTC)-attention network for the Uzbek language and its dialects. The proposed approach reduces training time and improves speech recognition accuracy by effectively using the CTC objective function in attention model training. We evaluated the linguistic and lay native speaker performances on the Uzbek language dataset, which was collected as a part of this study. Experimental results show that the proposed model achieved a word error rate of 14.3% using 207 h of recordings as the Uzbek language training dataset.
(This article belongs to the Section Sensor Networks)
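The reported 14.3% figure is the standard Levenshtein-based word error rate; a minimal reference implementation for readers who want to reproduce the metric (the example sentences below are invented):

```python
# Word error rate = word-level edit distance / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("men maktabga boraman", "men maktab boraman"))  # 0.333...
```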

16 pages, 3354 KB  
Article
Automatic Detection of Chewing and Swallowing
by Akihiro Nakamura, Takato Saito, Daizo Ikeda, Ken Ohta, Hiroshi Mineno and Masafumi Nishimura
Sensors 2021, 21(10), 3378; https://doi.org/10.3390/s21103378 - 12 May 2021
Cited by 14 | Viewed by 5252
Abstract
A series of eating behaviors, including chewing and swallowing, is considered crucial to the maintenance of good health. However, most such behaviors occur within the human body, and highly invasive methods such as X-rays and fiberscopes must be used to collect accurate behavioral data. A simpler measurement method is needed in the healthcare and medical fields; hence, the present study develops a method to automatically recognize a series of eating behaviors from the sounds produced during eating. The automatic detection of left chewing, right chewing, front biting, and swallowing was tested with the hybrid CTC/attention model, which uses sound recorded through two-channel microphones under the ear and weakly labeled data as training data to detect the balance of chewing and swallowing. N-gram-based data augmentation was first performed on the weakly labeled data to generate many weakly labeled eating sounds and augment the training data. Detection performance was improved by the hybrid CTC/attention model, which can learn context. In addition, the study confirmed similar detection performance for open and closed foods.
(This article belongs to the Special Issue Acoustic Event Detection and Sensing)
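The n-gram augmentation step is only outlined in the abstract; a rough sketch of one way it could work, sampling new event-label sequences from bigram statistics of the weakly labeled data (the label set and tiny corpus are invented, and concatenating the matching audio clips is left out):

```python
# Rough sketch of n-gram-based augmentation over weakly labeled event
# sequences: learn bigram transitions between eating events, then sample
# new label sequences to pair with concatenated audio clips.
import random
from collections import defaultdict

corpus = [
    ["front_bite", "left_chew", "right_chew", "left_chew", "swallow"],
    ["front_bite", "right_chew", "right_chew", "swallow"],
]

# Collect bigram successor counts from the weakly labeled sequences.
successors = defaultdict(list)
for seq in corpus:
    for prev, nxt in zip(seq, seq[1:]):
        successors[prev].append(nxt)

def sample_sequence(length=6, start="front_bite"):
    seq = [start]
    while len(seq) < length and successors[seq[-1]]:
        seq.append(random.choice(successors[seq[-1]]))
    return seq

print(sample_sequence())  # e.g. ['front_bite', 'left_chew', 'right_chew', ...]
```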

18 pages, 1681 KB  
Article
Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model
by Yong-Hyeok Lee, Dong-Won Jang, Jae-Bin Kim, Rae-Hong Park and Hyung-Min Park
Appl. Sci. 2020, 10(20), 7263; https://doi.org/10.3390/app10207263 - 17 Oct 2020
Cited by 20 | Viewed by 5821
Abstract
Since the attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM) or has replaced the LSTM in the Transformer model to overcome the sequence-to-sequence (seq2seq) problems of the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between audio and visual modalities. Because the audio carries richer information than the video of the lips, it is hard to train AVSR attention with balanced modalities. In order to raise the role of the visual modality to the level of the audio modality by fully exploiting the input information when learning attention, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector computed from a video query and a video context vector computed from an audio query. Furthermore, we introduce a connectionist temporal classification (CTC) loss in combination with our attention-based model to enforce the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved at least a 7.3% average relative improvement in word error rate (WER) over competing methods based on the Transformer model.
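A hedged sketch of the dual cross-modality idea as described: audio queries attend over video frames and video queries attend over audio frames, and the two context streams are fused (dimensions, the resampling step, and the fusion layer are assumptions):

```python
# Sketch of dual cross-modality (DCM) attention: cross-attention in both
# directions, then a simple fusion (all sizes are illustrative assumptions).
import torch
import torch.nn as nn

class DualCrossModalityAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.audio_query = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_query = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, audio, video):          # (batch, T_a, d), (batch, T_v, d)
        # Video context vector from an audio query, audio context from a video query.
        video_ctx, _ = self.audio_query(audio, video, video)   # (batch, T_a, d)
        audio_ctx, _ = self.video_query(video, audio, audio)   # (batch, T_v, d)
        # Resample the video-query context to the audio time axis (assumption).
        audio_ctx = nn.functional.interpolate(
            audio_ctx.transpose(1, 2), size=video_ctx.size(1)).transpose(1, 2)
        return self.fuse(torch.cat([video_ctx, audio_ctx], dim=-1))

a, v = torch.randn(2, 120, 256), torch.randn(2, 30, 256)
print(DualCrossModalityAttention()(a, v).shape)   # torch.Size([2, 120, 256])
```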

24 pages, 1703 KB  
Article
End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC/Attention Architecture
by Long Zhang, Ziping Zhao, Chunmei Ma, Linlin Shan, Huazhi Sun, Lifen Jiang, Shiwen Deng and Chang Gao
Sensors 2020, 20(7), 1809; https://doi.org/10.3390/s20071809 - 25 Mar 2020
Cited by 44 | Viewed by 8226
Abstract
Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides a new opportunity to update APED algorithms. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. The improved ASR system was then used in the APED task for Mandarin, and good results were obtained. This new APED method makes forced alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and it will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that in terms of accuracy metrics, our proposed system based on the improved hybrid CTC/attention architecture is close to the state-of-the-art ASR system based on the deep neural network–deep neural network (DNN–DNN) architecture, and performs better on the F-measure metrics, which are especially suited to the requirements of the APED task.
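Hybrid CTC/attention systems of this kind typically score hypotheses with an interpolation of the two branch log-probabilities; a minimal sketch in which the fixed weight `lam` stands in for the paper's adaptive parameter (all numbers below are made up):

```python
# Joint CTC/attention scoring of candidate transcriptions.
def joint_score(logp_ctc: float, logp_att: float, lam: float = 0.3) -> float:
    """log P(Y|X) ~ lam * log P_ctc(Y|X) + (1 - lam) * log P_att(Y|X)."""
    return lam * logp_ctc + (1.0 - lam) * logp_att

# Pick the best of several candidate hypotheses (log-probs are invented).
hypotheses = {
    "ni hao":    (-4.2, -3.9),
    "ni hao ma": (-5.0, -3.1),
    "li hao":    (-6.3, -4.8),
}
best = max(hypotheses, key=lambda h: joint_score(*hypotheses[h]))
print(best)   # "ni hao ma"
```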

14 pages, 1071 KB  
Article
Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
by Long Wu, Ta Li, Li Wang and Yonghong Yan
Appl. Sci. 2019, 9(21), 4639; https://doi.org/10.3390/app9214639 - 31 Oct 2019
Cited by 10 | Viewed by 5219
Abstract
As demonstrated in the hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective in solving the misalignment problem of the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to the hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism into the CTC branch. “Time-restricted” means that the attention mechanism is computed over a limited window of frames to the left and right. In this study, we first explore time-restricted location-aware attention CTC/Attention, establishing the proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce the time-restricted self-attention CTC/Attention, which can better model the long-range dependencies among frames. Experiments on the Wall Street Journal (WSJ), Augmented Multi-party Interaction (AMI), and Switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we combine a trained neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend on the CHiME-4 dataset. The reduction in word error rate (WER) and the increase in perceptual evaluation of speech quality (PESQ) confirm the effectiveness of this framework.
(This article belongs to the Section Computing and Artificial Intelligence)
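The time-restricted constraint amounts to a band-shaped attention mask; a small PyTorch sketch with an assumed window size of 5 frames:

```python
# Band mask so that each frame attends only to frames within +/- w positions.
import torch
import torch.nn as nn

def band_mask(seq_len: int, w: int) -> torch.Tensor:
    """True entries are *blocked*, matching nn.MultiheadAttention's attn_mask."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > w

frames = torch.randn(50, 4, 256)                      # (time, batch, features)
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4)
out, weights = attn(frames, frames, frames, attn_mask=band_mask(50, w=5))
# A frame in the middle of the sequence attends to 2*w + 1 = 11 frames.
print(out.shape, (weights[0, 25] > 0).sum().item())   # torch.Size([50, 4, 256]) 11
```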
