Intelligent Speech and Acoustic Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (20 August 2020) | Viewed by 67532

Special Issue Editor


Prof. Jong Won Shin
Guest Editor
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Korea
Interests: speech enhancement; voice activity detection; source localization; acoustic echo cancellation; speech emotion recognition

Special Issue Information

Dear Colleagues,

Speech and acoustic signals are among the most natural modalities for conveying information to and from people. In the last decade, machine learning techniques have greatly enhanced the performance of speech and acoustic signal processing, while traditional signal processing approaches still provide value, especially in resource-limited settings. This Special Issue is dedicated to recent advances in intelligent speech and acoustic signal processing based on signal processing and/or machine learning approaches, in areas including but not limited to speech/speaker/language/emotion/prosody recognition, speech synthesis, far-field acoustic signal processing, speech enhancement, acoustic echo cancellation, source localization, spoken term detection, keyword detection, active noise cancellation, language processing, forensics/security/privacy/obfuscation, spoken language understanding, and music analysis and processing.

Prof. Jong Won Shin
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Speech signal processing
  • Acoustic signal processing
  • Microphone array processing
  • Language processing
  • Machine learning
  • Statistical signal processing

Published Papers (23 papers)

Research

12 pages, 822 KiB  
Article
Dual-Mic Speech Enhancement Based on TF-GSC with Leakage Suppression and Signal Recovery
by Hansol Kim and Jong Won Shin
Appl. Sci. 2021, 11(6), 2816; https://doi.org/10.3390/app11062816 - 22 Mar 2021
Cited by 4 | Viewed by 2581
Abstract
The transfer function-generalized sidelobe canceller (TF-GSC) is one of the most popular structures for the adaptive beamformer used in multi-channel speech enhancement. Although the TF-GSC has shown decent performance, a certain amount of steering error is inevitable, which causes leakage of speech components through the blocking matrix (BM) and distortion in the fixed beamformer (FBF) output. In this paper, we propose to suppress the leaked signal in the output of the BM and restore the desired signal in the FBF output of the TF-GSC. To reduce the risk of attenuating speech in the adaptive noise canceller (ANC), the speech component in the output of the BM is suppressed by applying a gain function similar to the square-root Wiener filter, assuming that a certain portion of the desired speech leaks into the BM output. Additionally, we propose to restore the attenuated desired signal in the FBF output by adding back some of the microphone signal components, depending on how the microphone signals are related to the FBF and BM outputs. The experimental results showed that the proposed TF-GSC outperformed the conventional TF-GSC in terms of perceptual evaluation of speech quality (PESQ) scores under various noise conditions and directions of arrival of the desired and interfering sources.
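As a rough sketch (not code from the paper), the following shows the kind of square-root-Wiener-like gain described above being applied to a blocking-matrix output spectrum; the PSD estimates, variable names, and the gain floor are illustrative assumptions.

```python
import numpy as np

def suppress_bm_leakage(bm_spec, leak_psd_est, noise_psd_est, floor=0.1):
    """Attenuate leaked speech in a blocking-matrix (BM) output spectrum.

    bm_spec       : complex STFT frame of the BM output, shape (n_bins,)
    leak_psd_est  : estimated PSD of the speech leaked into the BM output
    noise_psd_est : estimated PSD of the noise component in the BM output
    floor         : lower bound on the gain, acknowledging that some
                    desired-speech leakage is always present (assumption)
    """
    # Square-root-Wiener-like gain: pass noise, attenuate leaked speech.
    gain = np.sqrt(noise_psd_est / (noise_psd_est + leak_psd_est + 1e-12))
    gain = np.maximum(gain, floor)
    return gain * bm_spec

# Toy usage with random spectra (illustration only).
rng = np.random.default_rng(0)
n_bins = 257
bm_frame = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
leak_psd = np.abs(rng.standard_normal(n_bins)) * 0.5
noise_psd = np.abs(rng.standard_normal(n_bins)) + 0.5
cleaned = suppress_bm_leakage(bm_frame, leak_psd, noise_psd)
print(cleaned.shape)
```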

12 pages, 1497 KiB  
Article
Relationship of Cepstral Peak Prominence-Smoothed and Long-Term Average Spectrum with Auditory–Perceptual Analysis
by Angélica Emygdio da Silva Antonetti, Larissa Thais Donalonso Siqueira, Maria Paula de Almeida Gobbo, Alcione Ghedini Brasolotto and Kelly Cristina Alves Silverio
Appl. Sci. 2020, 10(23), 8598; https://doi.org/10.3390/app10238598 - 01 Dec 2020
Cited by 15 | Viewed by 3376
Abstract
Cepstral peak prominence-smoothed (CPPs) and long-term average spectrum (LTAS) are robust measures that represent the glottal source and source-filter interactions, respectively. Until now, little has been known about how physiological events impact auditory–perceptual characteristics in the objective measures of CPPs and LTAS (alpha ratio; L1–L0). Thus, this paper aims to analyze the relationship between such acoustic measures and auditory–perceptual analysis and then determine which acoustic measure best represents voice quality. We analyzed 53 voice samples of vocally healthy participants (vocally healthy group, VHG) and 49 voice samples of participants with behavioral dysphonia (dysphonic group, DG). Each voice sample was composed of a sustained vowel /a/ and connected speech. CPPs seems to be the best predictor of voice deviation in both studied populations because there were moderate to strong negative correlations with general degree, breathiness, roughness, and strain (auditory–perceptual parameters). L1–L0 is related to breathiness (moderate negative correlations); hence, it provides information about air leakage through the closed glottis, assisting phonatory efficiency analysis.
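For readers unfamiliar with the cepstral measure, the sketch below computes a simplified cepstral peak prominence for a single frame; it omits the time/quefrency smoothing that turns CPP into CPPs, and the window, F0 search range, and regression baseline are generic assumptions rather than the authors' settings.

```python
import numpy as np

def cepstral_peak_prominence(frame, fs, f0_range=(60.0, 330.0)):
    """Simplified cepstral peak prominence (CPP) for one speech frame.

    The cepstrum is taken as the inverse FFT of the log-magnitude spectrum;
    CPP is the height of the cepstral peak in the plausible-F0 quefrency
    range above a linear regression baseline fitted over that range.
    (The time/quefrency smoothing that yields CPPs is omitted.)
    """
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.fft(windowed)) + 1e-12
    cepstrum = np.abs(np.fft.ifft(np.log(spectrum)))
    quefrency = np.arange(len(cepstrum)) / fs                 # in seconds
    lo, hi = 1.0 / f0_range[1], 1.0 / f0_range[0]
    region = (quefrency >= lo) & (quefrency <= hi)
    peak_idx = int(np.argmax(np.where(region, cepstrum, -np.inf)))
    slope, intercept = np.polyfit(quefrency[region], cepstrum[region], 1)
    baseline = slope * quefrency[peak_idx] + intercept
    return cepstrum[peak_idx] - baseline

# Toy usage: a 40 ms frame of a noisy 120 Hz tone at 16 kHz.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.05 * np.random.default_rng(0).standard_normal(t.size)
print(round(float(cepstral_peak_prominence(frame, fs)), 3))
```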

12 pages, 661 KiB  
Article
Deep Learning-Based Portable Device for Audio Distress Signal Recognition in Urban Areas
by Jorge Felipe Gaviria, Alejandra Escalante-Perez, Juan Camilo Castiblanco, Nicolas Vergara, Valentina Parra-Garces, Juan David Serrano, Andres Felipe Zambrano and Luis Felipe Giraldo
Appl. Sci. 2020, 10(21), 7448; https://doi.org/10.3390/app10217448 - 23 Oct 2020
Cited by 9 | Viewed by 3572
Abstract
Real-time automatic identification of audio distress signals in urban areas is a task that can improve response times in smart-city emergency alert systems. The main challenge in this problem lies in finding a model that is able to accurately recognize these types of signals in the presence of background noise and allows for real-time processing. In this paper, we present the design of a portable and low-cost device for accurate audio distress signal recognition in real urban scenarios based on deep learning models. As real audio distress recordings in urban areas have not been collected and made publicly available so far, we first constructed a database in which audios were recorded in urban areas using a low-cost microphone. Using this database, we trained a deep multi-headed 2D convolutional neural network that processes temporal and frequency features to accurately recognize audio distress signals in noisy environments, with a significant performance improvement over other methods from the literature. We then deployed and assessed the trained convolutional neural network model on a Raspberry Pi that, along with the low-cost microphone, constituted a device for accurate real-time audio recognition. The source code and database are publicly available.
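A minimal, hypothetical sketch of a multi-headed 2D convolutional network over a spectrogram-like input, with one head emphasizing temporal context and one emphasizing frequency context; all layer sizes and input shapes are assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class TwoHeadAudioCNN(nn.Module):
    """Sketch of a multi-headed 2D CNN over a log-mel spectrogram.

    One head uses kernels elongated along time, the other along frequency;
    their pooled features are concatenated before classification.
    """
    def __init__(self, n_classes=2):
        super().__init__()
        self.time_head = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.freq_head = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(9, 3), padding=(4, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(2 * 16 * 4 * 4, n_classes)

    def forward(self, x):              # x: (batch, 1, n_mels, n_frames)
        a = self.time_head(x).flatten(1)
        b = self.freq_head(x).flatten(1)
        return self.classifier(torch.cat([a, b], dim=1))

model = TwoHeadAudioCNN()
logits = model(torch.randn(8, 1, 64, 128))    # e.g. 64 mel bands, 128 frames
print(logits.shape)                            # torch.Size([8, 2])
```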

24 pages, 3054 KiB  
Article
A Preprocessing Strategy for Denoising of Speech Data Based on Speech Segment Detection
by Seung-Jun Lee and Hyuk-Yoon Kwon
Appl. Sci. 2020, 10(20), 7385; https://doi.org/10.3390/app10207385 - 21 Oct 2020
Cited by 7 | Viewed by 3686
Abstract
In this paper, we propose a preprocessing strategy for denoising of speech data based on speech segment detection. A computationally efficient design for speech denoising is necessary to develop a scalable method for large-scale data sets. This has become even more important as deep learning-based methods have been developed, because they show high performance in general but require significant computational cost. The basic idea of the proposed method is to use speech segment detection to exclude non-speech segments before denoising. Speech segment detection can exclude, at negligible cost, the non-speech segments that would otherwise be removed in the much more expensive denoising process, while maintaining the accuracy of denoising. First, we devise a framework to choose the best preprocessing method for denoising based on speech segment detection for a target environment. For this, we simulate the environments for denoising using different levels of signal-to-noise ratio (SNR) and multiple evaluation metrics. The framework finds the speech segment detection method best tailored to a target environment according to the performance evaluation of speech segment detection methods. Next, we investigate the accuracy of the speech segment detection methods extensively. We conduct a performance evaluation of five speech segment detection methods with different levels of SNR and evaluation metrics. In particular, we show that we can adjust the trade-off between the precision and recall of each method by controlling a parameter. Finally, we incorporate the best speech segment detection method for a target environment into the denoising process. Through extensive experiments, we show that the accuracy of the proposed scheme is comparable to or even better than that of WaveNet-based denoising, one of the recent advanced denoising methods based on deep neural networks, in terms of multiple denoising metrics, i.e., SNR, STOI, and PESQ, while reducing the denoising time of the WaveNet-based method by approximately 40–50%, depending on the speech segment detection method used.
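The following sketch illustrates the basic idea under simple assumptions: a cheap energy-based detector flags likely speech frames, and only those frames are passed to an expensive denoiser (stubbed here as a callable); the paper's actual framework compares several detectors rather than this toy threshold rule.

```python
import numpy as np

def frame_energies(signal, frame_len=512, hop=256):
    """Short-time log energies used as a very cheap speech/non-speech cue."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

def denoise_speech_segments_only(signal, denoiser, frame_len=512, hop=256, margin_db=10.0):
    """Run the (expensive) denoiser only on frames flagged as speech.

    `denoiser` is any callable mapping a 1-D array to a 1-D array of the
    same length; the energy threshold below is a simple assumption standing
    in for the segment detectors compared in the paper.
    """
    energies = frame_energies(signal, frame_len, hop)
    threshold = energies.min() + margin_db
    out = signal.copy()
    for i, e in enumerate(energies):
        if e > threshold:                          # likely speech: denoise it
            start, stop = i * hop, i * hop + frame_len
            out[start:stop] = denoiser(signal[start:stop])
    return out

# Toy usage: an "expensive" denoiser stub applied only to high-energy frames.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000) * 0.01
x[4000:8000] += np.sin(2 * np.pi * 220 * np.arange(4000) / 16000)
y = denoise_speech_segments_only(x, denoiser=lambda seg: seg * 0.9)
print(y.shape)
```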

17 pages, 410 KiB  
Article
Language Model Using Neural Turing Machine Based on Localized Content-Based Addressing
by Donghyun Lee, Jeong-Sik Park, Myoung-Wan Koo and Ji-Hwan Kim
Appl. Sci. 2020, 10(20), 7181; https://doi.org/10.3390/app10207181 - 15 Oct 2020
Cited by 2 | Viewed by 2220
Abstract
The performance of long short-term memory (LSTM) recurrent neural network (RNN)-based language models has improved on language modeling benchmarks. Although recurrent layers have been widely used, previous studies showed that an LSTM RNN-based language model (LM) cannot overcome the limitation of the context length. To train LMs on longer sequences, attention mechanism-based models have recently been used. In this paper, we propose an LM using a neural Turing machine (NTM) architecture based on localized content-based addressing (LCA). The NTM architecture is one of the attention-based models. However, the NTM encounters a problem with content-based addressing because all memory addresses need to be accessed to calculate cosine similarities. To address this problem, we propose an LCA method. The LCA method searches for the maximum of all cosine similarities generated from all memory addresses. Next, a specific memory area including the selected memory address is normalized with the softmax function. The LCA method is applied to the pre-trained NTM-based LM during the test stage. The proposed architecture is evaluated on the Penn Treebank and enwik8 LM tasks. The experimental results indicate that the proposed approach outperforms the previous NTM architecture.
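A small numerical sketch of localized content-based addressing as described above, assuming a toy memory and window size; it illustrates the idea rather than the authors' implementation.

```python
import numpy as np

def localized_content_addressing(memory, key, window=8):
    """Sketch of localized content-based addressing (LCA).

    Cosine similarities between the key and every memory slot are computed,
    the best-matching address is found, and the softmax is then taken only
    over a local window around that address instead of the whole memory.
    The window size is an illustrative assumption.
    """
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-12)
    center = int(np.argmax(sims))
    lo = max(0, center - window // 2)
    hi = min(len(sims), center + window // 2 + 1)
    weights = np.zeros_like(sims)
    local = np.exp(sims[lo:hi] - sims[lo:hi].max())   # softmax over the window only
    weights[lo:hi] = local / local.sum()
    return weights                                    # attention weights over memory

rng = np.random.default_rng(0)
memory = rng.standard_normal((128, 32))               # 128 addresses, 32-dim slots
key = rng.standard_normal(32)
w = localized_content_addressing(memory, key)
print(round(float(w.sum()), 3), int(np.count_nonzero(w)))  # 1.0, at most 9 nonzero weights
```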

12 pages, 3494 KiB  
Article
Comparison of Multivariate Analysis Methods as Applied to English Speech
by Yixin Zhang, Yoshitaka Nakajima, Kazuo Ueda, Takuya Kishida and Gerard B. Remijn
Appl. Sci. 2020, 10(20), 7076; https://doi.org/10.3390/app10207076 - 12 Oct 2020
Cited by 3 | Viewed by 1987
Abstract
A newly developed factor analysis, origin-shifted factor analysis, was compared with a normal factor analysis to analyze the spectral changes of English speech. Our first aim was to investigate whether these analyses would cause differences in the factor loadings and the extracted spectral-factor scores. The methods mainly differed in whether cepstral liftering and an origin shift were used. The results showed that three spectral factors were obtained in four main frequency bands, and that neither the cepstral liftering nor the origin shift distorted the essential characteristics of the factors. This confirms that the origin-shifted factor analysis is preferable for future speech analyses, since it would reduce the generation of noise in resynthesized speech. Our second aim was to further identify acoustic correlates of English phonemes. Our data show for the first time that the distribution of obstruents in English speech forms an L-shape related to two spectral factors in the three-dimensional configuration. One factor had center loadings around 4100 Hz, while the other was bimodal with peaks around 300 Hz and 2300 Hz. This new finding validates the use of multivariate analyses to connect English phonology and speech acoustics.

13 pages, 1551 KiB  
Article
Intelligibility of English Mosaic Speech: Comparison between Native and Non-Native Speakers of English
by Santi, Yoshitaka Nakajima, Kazuo Ueda and Gerard B. Remijn
Appl. Sci. 2020, 10(19), 6920; https://doi.org/10.3390/app10196920 - 02 Oct 2020
Cited by 3 | Viewed by 3025
Abstract
Mosaic speech is degraded speech that is segmented into time × frequency blocks. Earlier research with Japanese mosaic speech has shown that its intelligibility is almost perfect for mosaic block durations (MBDs) up to 40 ms. The purpose of the present study was to investigate the intelligibility of English mosaic speech, and whether its intelligibility would vary if it was compressed in time, preserved, or stretched in time. Furthermore, we investigated whether intelligibility differed between native and non-native speakers of English. English (n = 19), Indonesian (n = 19), and Chinese (n = 20) listeners participated in an experiment in which the mosaic speech stimuli were presented, and they had to type what they had heard. The results showed that compressing or stretching the English mosaic speech resulted in similar trends in intelligibility among the three language groups, with some exceptions. Generally, intelligibility for MBDs of 20 and 40 ms was higher after preserving/stretching, and decreased beyond MBDs of 80 ms after stretching. Compression also lowered intelligibility. This suggests that humans can extract new information from individual speech segments of about 40 ms, but that there is a limit to the amount of linguistic information that can be conveyed within a block of about 40 ms or below.
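As an illustration of how mosaic-speech-style stimuli can be constructed, the sketch below averages a power spectrogram over time × frequency blocks; the block sizes and the frame/hop mapping to MBD are assumptions, not the stimulus parameters used in the study.

```python
import numpy as np

def mosaicize(spectrogram, block_frames, block_bins):
    """Replace each time x frequency block of a power spectrogram by its
    mean power, a simplified version of how mosaic speech stimuli are
    constructed (block sizes are assumptions).

    spectrogram : array of shape (n_bins, n_frames)
    """
    out = spectrogram.copy()
    n_bins, n_frames = out.shape
    for b in range(0, n_bins, block_bins):
        for f in range(0, n_frames, block_frames):
            block = out[b:b + block_bins, f:f + block_frames]
            block[...] = block.mean()
    return out

# Toy usage: with 10 ms hops, a 40 ms mosaic block duration (MBD) would
# correspond to block_frames = 4 (an assumption).
rng = np.random.default_rng(0)
spec = rng.random((257, 200))
mosaic = mosaicize(spec, block_frames=4, block_bins=32)
print(mosaic.shape)
```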

21 pages, 6324 KiB  
Article
Regularized Within-Class Precision Matrix Based PLDA in Text-Dependent Speaker Verification
by Sung-Hyun Yoon, Jong-June Jeon and Ha-Jin Yu
Appl. Sci. 2020, 10(18), 6571; https://doi.org/10.3390/app10186571 - 20 Sep 2020
Cited by 3 | Viewed by 2938
Abstract
In the field of speaker verification, probabilistic linear discriminant analysis (PLDA) is the dominant method for back-end scoring. To estimate the PLDA model, the between-class covariance and within-class precision matrices must be estimated from samples. However, the empirical covariance/precision estimated from samples has estimation errors due to the limited number of samples available. In this paper, we propose a method to improve the conventional PLDA by estimating the PLDA model using a regularized within-class precision matrix. We use the graphical least absolute shrinkage and selection operator (GLASSO) for the regularization. The GLASSO regularization decreases the estimation errors in the empirical precision matrix by making the precision matrix sparse, which corresponds to reflecting the conditional independence structure. The experimental results on text-dependent speaker verification reveal that the proposed method reduces the relative equal error rate by up to 23% compared with the conventional PLDA.
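A hedged sketch of the regularization step: scikit-learn's GraphicalLasso is used to obtain a sparse within-class precision matrix from class-centered embeddings, which could then replace the empirical precision in PLDA; the penalty value and the random data are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def regularized_within_class_precision(embeddings, labels, alpha=0.05):
    """Estimate a sparse within-class precision matrix with the graphical lasso.

    embeddings : (n_samples, dim) speaker embeddings
    labels     : (n_samples,) speaker identities
    alpha      : GLASSO sparsity penalty (an assumption to be tuned)
    Class means are removed so that only within-class variability remains.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    residuals = embeddings.copy()
    for spk in np.unique(labels):
        idx = labels == spk
        residuals[idx] -= embeddings[idx].mean(axis=0)
    model = GraphicalLasso(alpha=alpha).fit(residuals)
    return model.precision_

# Toy usage: 20 speakers, 10 embeddings each.
rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 20))
lab = np.repeat(np.arange(20), 10)
prec = regularized_within_class_precision(emb, lab)
print(prec.shape, int((np.abs(prec) < 1e-8).sum()), "near-zero entries")
```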

18 pages, 2948 KiB  
Article
Tonal Contour Generation for Isarn Speech Synthesis Using Deep Learning and Sampling-Based F0 Representation
by Pongsathon Janyoi and Pusadee Seresangtakul
Appl. Sci. 2020, 10(18), 6381; https://doi.org/10.3390/app10186381 - 13 Sep 2020
Cited by 5 | Viewed by 2125
Abstract
The modeling of fundamental frequency (F0) in speech synthesis is a critical factor affecting the intelligibility and naturalness of synthesized speech. In this paper, we focus on improving the modeling of F0 for Isarn speech synthesis. We propose an F0 model based on a recurrent neural network (RNN). Sampled values of F0 at the syllable level of continuous Isarn speech are combined with their dynamic features to represent supra-segmental properties of the F0 contour. Different architectures of deep RNNs and different combinations of linguistic features are analyzed to obtain the conditions for the best performance. To assess the proposed method, we compared it with several RNN-based baselines. The results of objective and subjective tests indicate that the proposed model significantly outperformed the baseline RNN model that predicts F0 values at the frame level, and the baseline RNN model that represents the F0 contours of syllables using the discrete cosine transform.

15 pages, 3184 KiB  
Article
Speech Recognition for Task Domains with Sparse Matched Training Data
by Byung Ok Kang, Hyeong Bae Jeon and Jeon Gue Park
Appl. Sci. 2020, 10(18), 6155; https://doi.org/10.3390/app10186155 - 04 Sep 2020
Cited by 3 | Viewed by 1964
Abstract
We propose two approaches to handle speech recognition for task domains with sparse matched training data. One is an active learning method that selects training data for the target domain from another general domain that already has a significant amount of labeled speech data. This method uses attribute-disentangled latent variables. For the active learning process, we designed an integrated system consisting of a variational autoencoder, with an encoder that infers latent variables with disentangled attributes from the input speech, and a classifier that selects training data with attributes matching the target domain. The other approach combines data augmentation methods for generating matched target-domain speech data with transfer learning methods based on teacher/student learning. To evaluate the proposed methods, we experimented with various task domains with sparse matched training data. The experimental results show that the proposed method has qualitative characteristics suitable for the desired purpose; it outperforms random selection and is comparable to using an equal amount of additional target-domain data.

10 pages, 1013 KiB  
Article
Speech Enhancement for Hearing Aids with Deep Learning on Environmental Noises
by Gyuseok Park, Woohyeong Cho, Kyu-Sung Kim and Sangmin Lee
Appl. Sci. 2020, 10(17), 6077; https://doi.org/10.3390/app10176077 - 02 Sep 2020
Cited by 19 | Viewed by 4992
Abstract
Hearing aids are small electronic devices designed to improve hearing for persons with impaired hearing, using sophisticated audio signal processing algorithms and technologies. In general, the speech enhancement algorithms in hearing aids remove environmental noise and enhance speech while still giving consideration to hearing characteristics and the environmental surroundings. In this study, a speech enhancement algorithm was proposed to improve speech quality in a hearing aid environment by applying noise reduction algorithms with deep neural network learning based on noise classification. In order to evaluate the speech enhancement in an actual hearing aid environment, ten types of noise were self-recorded and classified using convolutional neural networks. In addition, noise reduction for speech enhancement in the hearing aid was applied by deep neural networks based on the noise classification. As a result, the speech quality obtained with the deep neural network-based enhancement and the associated environmental noise classification exhibited a significant improvement over that of the conventional hearing aid algorithm. The improved speech quality was also evaluated by objective measures, including the perceptual evaluation of speech quality score, the short-time objective intelligibility score, the overall quality composite measure, and the log-likelihood ratio score.

19 pages, 1460 KiB  
Article
Gated Recurrent Attention for Multi-Style Speech Synthesis
by Sung Jun Cheon, Joun Yeop Lee, Byoung Jin Choi, Hyeonseung Lee and Nam Soo Kim
Appl. Sci. 2020, 10(15), 5325; https://doi.org/10.3390/app10155325 - 31 Jul 2020
Cited by 3 | Viewed by 2865
Abstract
End-to-end neural network-based speech synthesis techniques have been developed to represent and synthesize speech in various prosodic styles. Although the end-to-end techniques enable the transfer of a style with a single vector of style representation, it has been reported that the speaker similarity observed for speech synthesized with an unseen speaker style is low. One of the reasons for this problem is that the attention mechanism in the end-to-end model is overfitted to the training data. To learn and synthesize voices of various styles, an attention mechanism that can preserve longer-term context and control the context is required. In this paper, we propose a novel attention model which employs gates to control the recurrences in the attention. To verify the proposed attention's style modeling capability, perceptual listening tests were conducted. The experiments show that the proposed attention outperforms location-sensitive attention in both similarity and naturalness.

14 pages, 466 KiB  
Article
Residual Echo Suppression Considering Harmonic Distortion and Temporal Correlation
by Hyungchan Song and Jong Won Shin
Appl. Sci. 2020, 10(15), 5291; https://doi.org/10.3390/app10155291 - 30 Jul 2020
Cited by 3 | Viewed by 2647
Abstract
In acoustic echo cancellation, a certain level of residual echo remains in the output of the linear echo canceller because of the nonlinearity of the power amplifier, loudspeaker, and acoustic transfer function, in addition to the estimation error of the linear echo canceller. The residual echo in the current frame is correlated not only with the linear echo estimates for the harmonically related frequency bins in the current frame, but also with the linear echo estimates, residual echo estimates, and microphone signals in adjacent frames. In this paper, we propose a residual echo suppression scheme considering harmonic distortion and temporal correlation in the short-time Fourier transform domain. To exploit residual echo estimates and microphone signals in past frames without the adverse effect of near-end speech and noise, we adopt a double-talk detector that is tuned to have a low false rejection rate for double-talk. Experimental results show that the proposed method outperformed the conventional approach in terms of echo return loss enhancement during single-talk periods and perceptual evaluation of speech quality scores during double-talk periods.

11 pages, 1859 KiB  
Article
Augmented Latent Features of Deep Neural Network-Based Automatic Speech Recognition for Motor-Driven Robots
by Moa Lee and Joon-Hyuk Chang
Appl. Sci. 2020, 10(13), 4602; https://doi.org/10.3390/app10134602 - 02 Jul 2020
Cited by 2 | Viewed by 1846
Abstract
Speech recognition for intelligent robots suffers from performance degradation due to ego-noise. Ego-noise is caused by the motors, fans, and mechanical parts inside intelligent robots, especially when the robot moves or shakes its body. To overcome the problems caused by ego-noise, we propose a robust speech recognition algorithm that uses the motor-state information of the robot as an auxiliary feature. For this, we use two deep neural networks (DNNs) in this paper. First, we design the latent features using a bottleneck layer, one of the internal layers having a smaller number of hidden units relative to the other layers, to represent whether the motor is operating or not. The latent features maximizing the representation of the motor-state information are generated by taking the motor data and acoustic features as the input of the first DNN. Second, once the motor-state-dependent latent features are designed by the first DNN, the second DNN, accounting for acoustic modeling, receives the latent features as input along with the acoustic features. We evaluated the proposed system on the LibriSpeech database. The proposed network enables efficient compression of the acoustic and motor-state information, and the resulting word error rates (WERs) are superior to those of a conventional speech recognition system.

14 pages, 2117 KiB  
Article
A Simple Distortion-Free Method to Handle Variable Length Sequences for Recurrent Neural Networks in Text Dependent Speaker Verification
by Sung-Hyun Yoon and Ha-Jin Yu
Appl. Sci. 2020, 10(12), 4092; https://doi.org/10.3390/app10124092 - 14 Jun 2020
Cited by 9 | Viewed by 3351
Abstract
Recurrent neural networks (RNNs) can model the time dependency of time-series data. They have also been widely used in text-dependent speaker verification to extract speaker-and-phrase-discriminant embeddings. As with other neural networks, RNNs are trained in mini-batch units. In order to feed input sequences into an RNN in mini-batch units, all the sequences in each mini-batch must have the same length. However, the sequences have variable lengths, and we have no knowledge of these lengths in advance. Truncation/padding is most commonly used to make all sequences the same length. However, truncation/padding causes information distortion because some information is lost and/or unnecessary information is added, which can degrade the performance of text-dependent speaker verification. In this paper, we propose a method to handle variable-length sequences for RNNs without introducing information distortion, by truncating the output sequence so that it has the same length as the corresponding original input sequence. The experimental results for the text-dependent speaker verification task in part 2 of RSR2015 show that our method reduces the relative equal error rate by approximately 1.3% to 27.1%, depending on the task, compared to the baselines, with only a small overhead in execution time.
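A minimal PyTorch sketch of the idea, assuming a unidirectional RNN: sequences are zero-padded for batching, but each output sequence is truncated back to its true length before pooling, so the padded frames never contribute to the utterance embedding. Sizes and pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

def rnn_embeddings_without_padding_distortion(features, lengths, rnn):
    """Run an RNN on zero-padded sequences, then keep only the outputs up to
    each sequence's true length before average pooling.

    features : (batch, max_len, feat_dim) zero-padded mini-batch
    lengths  : list of true lengths per sequence
    """
    outputs, _ = rnn(features)                          # (batch, max_len, hidden)
    embeddings = []
    for i, n in enumerate(lengths):
        embeddings.append(outputs[i, :n].mean(dim=0))   # truncate, then pool
    return torch.stack(embeddings)

# Toy usage with an LSTM; sizes are illustrative assumptions.
rnn = nn.LSTM(input_size=40, hidden_size=64, batch_first=True)
batch = torch.zeros(3, 120, 40)
lengths = [120, 95, 60]
for i, n in enumerate(lengths):
    batch[i, :n] = torch.randn(n, 40)
emb = rnn_embeddings_without_padding_distortion(batch, lengths, rnn)
print(emb.shape)                                        # torch.Size([3, 64])
```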

21 pages, 1249 KiB  
Article
Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)-Based Voice-Activity Detector
by Yoo Rhee Oh, Kiyoung Park and Jeon Gyu Park
Appl. Sci. 2020, 10(12), 4091; https://doi.org/10.3390/app10124091 - 14 Jun 2020
Cited by 7 | Viewed by 3007
Abstract
This paper aims to design an online, low-latency, and high-performance speech recognition system using a bidirectional long short-term memory (BLSTM) acoustic model. To achieve this, we adopt a server-client model and a context-sensitive-chunk-based approach. The speech recognition server manages a main thread and a decoder thread for each client, and one worker thread. The main thread communicates with the connected client, extracts speech features, and buffers the features. The decoder thread performs speech recognition, including the proposed multichannel parallel acoustic score computation of a BLSTM acoustic model, the proposed deep neural network-based voice activity detector, and Viterbi decoding. The proposed acoustic score computation method estimates the acoustic scores of a context-sensitive-chunk BLSTM acoustic model for the batched speech features from concurrent clients, using the worker thread. The proposed deep neural network-based voice activity detector detects short pauses in the utterance to reduce response latency while the user utters long sentences. In experiments on Korean speech recognition, the number of concurrent clients increased from 22 to 44 using the proposed acoustic score computation. When combined with the frame-skipping method, the number further increased up to 59 clients with a small accuracy degradation. Moreover, the average user-perceived latency was reduced from 11.71 s to 3.09–5.41 s by using the proposed deep neural network-based voice activity detector.

13 pages, 349 KiB  
Article
Semi-Supervised Speech Recognition Acoustic Model Training Using Policy Gradient
by Hoon Chung, Sung Joo Lee, Hyeong Bae Jeon and Jeon Gue Park
Appl. Sci. 2020, 10(10), 3542; https://doi.org/10.3390/app10103542 - 20 May 2020
Cited by 4 | Viewed by 2348
Abstract
In this paper, we propose policy gradient-based semi-supervised training of speech recognition acoustic models. In practice, self-training and teacher/student learning are among the most widely used semi-supervised training methods due to their scalability and effectiveness. These methods are based on generating pseudo labels for unlabeled samples using a pre-trained model and selecting reliable samples using a confidence measure. However, there are some considerations in this approach. The generated pseudo labels can be biased depending on which pre-trained model is used, and the training process can be complicated because the confidence measure is usually carried out in post-processing using external knowledge. Therefore, to address these issues, we propose a policy gradient method-based approach. Policy gradient is a reinforcement learning algorithm for finding an optimal behavior strategy for an agent to obtain optimal rewards. The policy gradient-based approach provides a framework for exploring unlabeled data as well as exploiting labeled data, and it also provides a way to incorporate external knowledge in the same training cycle. The proposed approach was evaluated on an in-house non-native Korean recognition domain. The experimental results show that the method is effective for semi-supervised acoustic model training.
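A compact, hypothetical sketch of a REINFORCE-style objective for pseudo-labeled frames: pseudo labels sampled from the model's posterior act as actions, and an external reliability score acts as the reward; the reward design, which is the paper's contribution, is only stubbed here with random values.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, sampled_labels, rewards):
    """REINFORCE-style loss for unlabeled frames (a hedged sketch).

    logits         : (n_frames, n_states) acoustic-model outputs
    sampled_labels : (n_frames,) states sampled from the model posterior,
                     acting as pseudo labels (actions)
    rewards        : (n_frames,) external reliability scores for the
                     pseudo labels (e.g. a confidence measure)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    picked = log_probs.gather(1, sampled_labels.unsqueeze(1)).squeeze(1)
    # Maximize expected reward  =>  minimize negative reward-weighted log-likelihood.
    return -(rewards * picked).mean()

# Toy usage.
torch.manual_seed(0)
logits = torch.randn(32, 100, requires_grad=True)
posterior = F.softmax(logits, dim=-1)
actions = torch.multinomial(posterior, 1).squeeze(1)
rewards = torch.rand(32)
loss = policy_gradient_loss(logits, actions, rewards)
loss.backward()
print(float(loss))
```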

15 pages, 3243 KiB  
Article
Multi-Task Learning U-Net for Single-Channel Speech Enhancement and Mask-Based Voice Activity Detection
by Geon Woo Lee and Hong Kook Kim
Appl. Sci. 2020, 10(9), 3230; https://doi.org/10.3390/app10093230 - 06 May 2020
Cited by 23 | Viewed by 4384
Abstract
In this paper, a multi-task learning U-shaped neural network (MTU-Net) is proposed and applied to single-channel speech enhancement (SE). The proposed MTU-Net-based SE method estimates an ideal binary mask (IBM) or an ideal ratio mask (IRM) by extending the decoding network of a conventional U-Net to simultaneously model the speech and noise spectra as the targets. The effectiveness of the proposed SE method was evaluated under both matched and mismatched noise conditions between training and testing by measuring the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). Consequently, the proposed SE method with the IRM achieved a substantial improvement, with average PESQ scores higher by 0.17, 0.52, and 0.40 than those of other state-of-the-art deep-learning-based methods, such as the deep recurrent neural network (DRNN), the SE generative adversarial network (SEGAN), and the conventional U-Net, respectively. In addition, the STOI scores of the proposed SE method are 0.07, 0.05, and 0.05 higher than those of the DRNN, SEGAN, and U-Net, respectively. Next, voice activity detection (VAD) is also proposed using the IRM estimated by the proposed MTU-Net-based SE method, which is fundamentally an unsupervised method without any model training. The performance of the proposed VAD method was then compared with that of supervised learning-based methods using a deep neural network (DNN), a boosted DNN, and a long short-term memory (LSTM) network. Consequently, the proposed VAD method shows slightly better performance than the three neural network-based methods under mismatched noise conditions.
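As a simple illustration of the masking quantities mentioned above, the sketch below computes an ideal ratio mask from (oracle) speech and noise spectra and derives frame-level VAD decisions by thresholding the frequency-averaged mask; the threshold is an assumption rather than the paper's decision rule.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power):
    """Ideal ratio mask (IRM) per time-frequency bin."""
    return np.sqrt(speech_power / (speech_power + noise_power + 1e-12))

def mask_based_vad(mask, on_threshold=0.5):
    """Frame-level voice activity decisions from an estimated mask.

    Averaging the mask over frequency gives a soft speech-presence score per
    frame; the threshold is an illustrative assumption.
    """
    frame_score = mask.mean(axis=0)          # mask: (n_bins, n_frames)
    return frame_score > on_threshold

# Toy usage with random spectra in which some frames contain no speech.
rng = np.random.default_rng(0)
speech = rng.random((257, 100)) * (rng.random(100) > 0.5)
noise = rng.random((257, 100)) * 0.3
irm = ideal_ratio_mask(speech, noise)
print(int(mask_based_vad(irm).sum()), "frames flagged as speech")
```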

19 pages, 1373 KiB  
Article
Estimating the Rank of a Nonnegative Matrix Factorization Model for Automatic Music Transcription Based on Stein’s Unbiased Risk Estimator
by Seokjin Lee
Appl. Sci. 2020, 10(8), 2911; https://doi.org/10.3390/app10082911 - 23 Apr 2020
Cited by 3 | Viewed by 2483
Abstract
In this paper, methods to estimate the number of basis vectors for the nonnegative matrix factorization (NMF) of automatic music transcription (AMT) systems are proposed. Previous studies on NMF-based AMT have demonstrated that the number of basis vectors affects the performance and that the number of note events can be a good choice for the rank of the NMF. However, many NMF-based AMT methods do not provide a way to estimate the appropriate number of basis vectors; instead, the number is assumed to be given in advance, even though it significantly affects the algorithm's performance. Recently, certain Bayesian estimation algorithms for the number of basis vectors have been proposed; however, they are not designed to be used as music transcription algorithms but are components of specific NMF methods, and thus cannot be used generally with NMF-based transcription algorithms. Our proposed estimation algorithms are based on eigenvalue decomposition and Stein's unbiased risk estimator (SURE). Because the SURE method requires the variance of the undesired components as a priori knowledge, the proposed algorithms estimate this value using random matrix theory and the first and second onset information in the input music signal. Experiments were then conducted on the AMT task using the MIDI-aligned piano sounds (MAPS) database, and the proposed algorithms were compared with variational NMF, gamma process NMF, and NMF with automatic relevance determination. Based on the experimental results, the conventional NMF-based transcription algorithm with the proposed rank estimation algorithms demonstrated F1 scores improved by 2–3% compared to these algorithms. While the performance advantages are not large, the results are meaningful because the proposed algorithms are lightweight, are easy to combine with any other NMF methods that require an a priori rank parameter, and do not have tuning parameters that considerably affect the performance.
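The sketch below is a simplified stand-in for the rank-estimation idea, not the authors' SURE-based procedure: it counts covariance eigenvalues of a spectrogram that exceed a noise-only level, using a Marchenko–Pastur-style threshold as an assumed proxy criterion.

```python
import numpy as np

def estimate_rank_by_eigenvalues(spectrogram, noise_var):
    """Count eigenvalues of the spectrogram covariance that rise above the
    level expected from noise alone (a rough proxy for the NMF rank).
    The Marchenko-Pastur-style threshold below is an assumption.
    """
    n_bins, n_frames = spectrogram.shape
    cov = spectrogram @ spectrogram.T / n_frames
    eigvals = np.linalg.eigvalsh(cov)[::-1]            # descending order
    threshold = noise_var * (1.0 + np.sqrt(n_bins / n_frames)) ** 2
    return int(np.sum(eigvals > threshold))

# Toy usage: 5 spectral templates active in a noisy magnitude spectrogram.
rng = np.random.default_rng(0)
templates = rng.random((257, 5))
activations = rng.random((5, 400))
noise_sigma = 0.05
spec = templates @ activations + noise_sigma * rng.standard_normal((257, 400))
print(estimate_rank_by_eigenvalues(spec, noise_var=noise_sigma ** 2))
```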

16 pages, 595 KiB  
Article
Voice Conversion Using a Perceptual Criterion
by Ki-Seung Lee
Appl. Sci. 2020, 10(8), 2884; https://doi.org/10.3390/app10082884 - 22 Apr 2020
Cited by 3 | Viewed by 2735
Abstract
In voice conversion (VC), it is highly desirable to obtain transformed speech signals that are perceptually close to a target speaker's voice. To this end, a perceptually meaningful criterion, in which the human auditory system is taken into consideration when measuring the distances between the converted and the target voices, was adopted in the proposed VC scheme. The conversion rules for the features associated with the spectral envelope and the pitch modification factor were jointly constructed so that the perceptual distance measure was minimized. This minimization problem was solved using a deep neural network (DNN) framework in which input features and target features were derived from source speech signals and a time-aligned version of the target speech signals, respectively. Validation tests were carried out on the CMU ARCTIC database to evaluate the effectiveness of the proposed method, especially in terms of perceptual quality. The experimental results showed that the proposed method yielded perceptually preferred results compared with independent conversion using the conventional mean-square error (MSE) criterion. The maximum improvement in perceptual evaluation of speech quality (PESQ) was 0.312 compared with the conventional VC method.

15 pages, 365 KiB  
Article
Improving Singing Voice Separation Using Curriculum Learning on Recurrent Neural Networks
by Seungtae Kang, Jeong-Sik Park and Gil-Jin Jang
Appl. Sci. 2020, 10(7), 2465; https://doi.org/10.3390/app10072465 - 03 Apr 2020
Cited by 3 | Viewed by 2060
Abstract
Single-channel singing voice separation has been considered a difficult task, as it requires predicting two different audio sources independently from mixed vocal and instrument sounds recorded by a single microphone. We propose a new singing voice separation approach based on the curriculum learning framework, in which learning is started with only easy examples and the task difficulty is then gradually increased. In this study, we regard data that provide obviously dominant characteristics of a single source as easy cases and the other data as difficult cases. To quantify the dominance property between the two sources, we define a dominance factor that determines a difficulty level according to the relative intensity between the vocal sound and the instrument sound. If given data are determined to provide obviously dominant characteristics of a single source according to this factor, they are regarded as an easy case; otherwise, they belong to a difficult case. Early stages of the learning focus on easy cases, thus allowing the overall characteristics of each source to be learned rapidly. Later stages handle difficult cases, allowing more careful and sophisticated learning. In experiments conducted on three song datasets, the proposed approach demonstrated superior performance compared to the conventional approaches.
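A toy sketch of the curriculum idea, with an assumed definition of the dominance factor as the vocal-to-accompaniment level difference: excerpts where one source clearly dominates are treated as easy and presented first. The exact definition in the paper may differ.

```python
import numpy as np

def dominance_factor(vocal, accompaniment, eps=1e-12):
    """Relative intensity between the vocal and accompaniment signals of a
    training excerpt, used as a proxy for example difficulty:
    a large absolute value means one source clearly dominates (easy case).
    """
    vocal_db = 10.0 * np.log10(np.mean(vocal ** 2) + eps)
    accomp_db = 10.0 * np.log10(np.mean(accompaniment ** 2) + eps)
    return vocal_db - accomp_db

def curriculum_order(excerpts):
    """Sort (vocal, accompaniment) excerpt pairs from easy to difficult."""
    return sorted(excerpts, key=lambda pair: -abs(dominance_factor(*pair)))

# Toy usage: three excerpts with different vocal/accompaniment balances.
rng = np.random.default_rng(0)
pairs = [(rng.standard_normal(8000) * g, rng.standard_normal(8000)) for g in (3.0, 1.0, 0.2)]
ordered = curriculum_order(pairs)
print([round(abs(dominance_factor(v, a)), 1) for v, a in ordered])
```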

8 pages, 850 KiB  
Article
An Unsupervised Deep Learning System for Acoustic Scene Analysis
by Mou Wang, Xiao-Lei Zhang and Susanto Rahardja
Appl. Sci. 2020, 10(6), 2076; https://doi.org/10.3390/app10062076 - 19 Mar 2020
Cited by 5 | Viewed by 2054
Abstract
Acoustic scene analysis has attracted a lot of attention recently. Existing methods are mostly supervised, which requires well-predefined acoustic scene categories and accurate labels. In practice, there exists a large amount of unlabeled audio data, but labeling large-scale data is not only costly but also time-consuming. Unsupervised acoustic scene analysis, on the other hand, does not require manual labeling but is known to have significantly lower performance and therefore has not been well explored. In this paper, a new unsupervised method based on deep auto-encoder networks and spectral clustering is proposed. It first extracts a bottleneck feature from the original acoustic feature of audio clips by an auto-encoder network, and then employs spectral clustering to further reduce the noise and unrelated information in the bottleneck feature. Finally, it conducts hierarchical clustering on the low-dimensional output of the spectral clustering. To fully utilize the spatial information of stereo audio, we further apply a binaural representation and conduct joint clustering on it. To the best of our knowledge, this is the first time that a binaural representation has been used in unsupervised learning. Experimental results show that the proposed method outperforms the state-of-the-art competing methods.
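A compressed sketch of the unsupervised pipeline described above, with random vectors standing in for per-clip acoustic features and all layer sizes, embedding dimensions, and cluster counts chosen arbitrarily: an auto-encoder bottleneck, a spectral embedding, and hierarchical clustering on the embedded features.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import AgglomerativeClustering

# 1) Bottleneck features from a small auto-encoder (sizes are assumptions).
class AutoEncoder(nn.Module):
    def __init__(self, dim_in=40, dim_bottleneck=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 32), nn.ReLU(),
                                     nn.Linear(32, dim_bottleneck))
        self.decoder = nn.Sequential(nn.Linear(dim_bottleneck, 32), nn.ReLU(),
                                     nn.Linear(32, dim_in))
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

torch.manual_seed(0)
features = torch.randn(300, 40)            # stand-in for per-clip acoustic features
ae = AutoEncoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):                        # short reconstruction training
    recon, _ = ae(features)
    loss = nn.functional.mse_loss(recon, features)
    opt.zero_grad()
    loss.backward()
    opt.step()
with torch.no_grad():
    bottleneck = ae(features)[1].numpy()

# 2) Spectral embedding of the bottleneck features (noise-reduction step).
embedded = SpectralEmbedding(n_components=4, random_state=0).fit_transform(bottleneck)

# 3) Hierarchical clustering on the low-dimensional embedding.
labels = AgglomerativeClustering(n_clusters=5).fit_predict(embedded)
print(np.bincount(labels))
```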

23 pages, 3611 KiB  
Article
An Optimal Feature Parameter Set Based on Gated Recurrent Unit Recurrent Neural Networks for Speech Segment Detection
by Özlem Batur Dinler and Nizamettin Aydın
Appl. Sci. 2020, 10(4), 1273; https://doi.org/10.3390/app10041273 - 13 Feb 2020
Cited by 24 | Viewed by 3529
Abstract
Speech segment detection based on gated recurrent unit (GRU) recurrent neural networks for the Kurdish language was investigated in the present study. The novelties of the current research are the utilization of a GRU in Kurdish speech segment detection, the creation of a unique database for the Kurdish language, and the optimization of processing parameters for Kurdish speech segmentation. This study is the first attempt to find the optimal feature parameters of the model and to form a large Kurdish-vocabulary dataset for speech segment detection based on consonant, vowel, and silence (C/V/S) discrimination. For this purpose, four window sizes and three window types with three hybrid feature vector techniques were used to describe the phoneme boundaries. Identification of the phoneme boundaries using a GRU recurrent neural network was performed with six different classification algorithms for the C/V/S discrimination. We demonstrate that the GRU model achieves outstanding speech segmentation performance for characterizing Kurdish acoustic signals. The experimental findings of the present study show the significance of speech segment detection by effectively utilizing hybrid features, window sizes, window types, and classification models for Kurdish speech.
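A minimal sketch of a GRU-based frame classifier for C/V/S discrimination; the feature dimension, bidirectionality, and layer sizes are assumptions, not the optimized parameters reported in the study.

```python
import torch
import torch.nn as nn

class CVSSegmenter(nn.Module):
    """Sketch of a GRU-based frame classifier for consonant/vowel/silence
    (C/V/S) discrimination over a sequence of acoustic feature vectors.
    """
    def __init__(self, feat_dim=39, hidden=64, n_classes=3):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, n_frames, feat_dim)
        h, _ = self.gru(x)
        return self.out(h)                     # per-frame C/V/S logits

model = CVSSegmenter()
frames = torch.randn(4, 200, 39)               # e.g. hybrid MFCC-style features
logits = model(frames)
print(logits.shape)                             # torch.Size([4, 200, 3])
```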