Search Results (9)

Search Parameters:
Keywords = single-channel speech separation

12 pages, 4323 KiB  
Article
Threshold-Based Combination of Ideal Binary Mask and Ideal Ratio Mask for Single-Channel Speech Separation
by Peng Chen, Binh Thien Nguyen, Kenta Iwai and Takanobu Nishiura
Information 2024, 15(10), 608; https://doi.org/10.3390/info15100608 - 4 Oct 2024
Cited by 1 | Viewed by 1239
Abstract
An effective approach to addressing the speech separation problem is to use a time–frequency (T-F) mask. The ideal binary mask (IBM) and the ideal ratio mask (IRM) have long been widely used to separate speech signals; however, the IBM is better at improving speech intelligibility, while the IRM is better at improving speech quality. To leverage their respective strengths and overcome their weaknesses, we propose an ideal threshold-based mask (ITM) that combines the two. By adjusting two thresholds, the IBM and IRM are combined to act jointly on speech separation. We report the impact of different threshold combinations on separation performance under ideal conditions and discuss a reasonable range for fine-tuning the thresholds. Using the masks as training targets, we conducted supervised speech separation experiments with a deep neural network (DNN) and a long short-term memory (LSTM) network to evaluate the effectiveness of the proposed method; the results were measured with three objective indicators: the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR), and the signal-to-artifact ratio (SAR). Experimental results show that the proposed mask combines the strengths of the IBM and IRM, implying that speech separation accuracy can be further improved by effectively leveraging the advantages of different masks.
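
As an illustration of the idea behind the ITM, the sketch below computes an IBM and an IRM from known speech and noise magnitude spectrograms and combines them with two thresholds: bins where the ratio mask is confidently low or high receive a hard binary decision, and the soft ratio value is kept in between. The combination rule and the threshold values are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def threshold_combined_mask(speech_mag, noise_mag, lower=0.3, upper=0.7, eps=1e-8):
    """Toy ideal-mask combination. speech_mag/noise_mag are magnitude
    spectrograms (freq x frames) of the target and the interference,
    available only under ideal (oracle) conditions."""
    irm = speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps)  # ideal ratio mask in [0, 1]
    ibm = (irm >= 0.5).astype(np.float32)                       # ideal binary mask (0 dB local SNR rule)
    # Hypothetical threshold-based combination: hard binary decisions at the
    # confident extremes, the soft ratio mask in the ambiguous middle band.
    itm = np.where(irm <= lower, 0.0, np.where(irm >= upper, 1.0, irm))
    return ibm, irm, itm
```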

20 pages, 8238 KiB  
Article
Spaceborne Algorithm for Recognizing Lightning Whistler Recorded by an Electric Field Detector Onboard the CSES Satellite
by Yalan Li, Jing Yuan, Jie Cao, Yaohui Liu, Jianping Huang, Bin Li, Qiao Wang, Zhourong Zhang, Zhixing Zhao, Ying Han, Haijun Liu, Jinsheng Han, Xuhui Shen and Yali Wang
Atmosphere 2023, 14(11), 1633; https://doi.org/10.3390/atmos14111633 - 30 Oct 2023
Cited by 2 | Viewed by 1504
Abstract
The electric field detector (EFD) of the CSES satellite has captured a vast number of lightning whistler events. To recognize them effectively in the massive amount of EFD data, recognition algorithms based on speech technology have attracted attention; however, this approach fails on lightning whistler events that are contaminated by other low-frequency electromagnetic disturbances. To overcome this limitation, we combine single-channel blind source separation with an audio recognition approach in a novel two-stage model. (1) Training stage: the EFD waveform data are preprocessed into audio fragments; for each fragment, mel-frequency cepstral coefficients (MFCCs) are extracted and fed into a long short-term memory (LSTM) network to train the lightning whistler recognition model. (2) Inference stage: each audio fragment is first processed by single-channel blind source separation to generate two sub-signals; MFCC features are extracted from each sub-signal and passed to the trained recognition model, and the two results are combined by decision fusion to obtain the final recognition result. Experiments on EFD data from the CSES satellite demonstrate the effectiveness of the algorithm: compared with classical methods, accuracy, recall, and F1-score increase by 17%, 62.2%, and 50%, respectively, while the time cost increases by only 0.41 s.
(This article belongs to the Section Upper Atmosphere)
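
For a concrete picture of the inference stage, here is a minimal sketch of the decision-fusion step: a small MFCC-sequence LSTM classifier stands in for the trained recognizer, and a whistler is flagged if either separated sub-signal is classified as containing one. The network sizes, the OR-style fusion rule, and the 0.5 threshold are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class WhistlerLSTM(nn.Module):
    """Minimal MFCC-sequence classifier standing in for the trained recognizer."""
    def __init__(self, n_mfcc=20, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                         # x: (batch, frames, n_mfcc)
        _, (h, _) = self.lstm(x)
        return torch.sigmoid(self.head(h[-1]))    # probability of "whistler present"

def fuse_decisions(model, mfcc_sub1, mfcc_sub2, threshold=0.5):
    """Decision fusion over the two separated sub-signals: flag a whistler if
    either sub-signal is classified as containing one. mfcc_sub1/mfcc_sub2 are
    (frames, n_mfcc) MFCC arrays from the two sub-signals."""
    with torch.no_grad():
        p1 = model(torch.as_tensor(mfcc_sub1[None], dtype=torch.float32)).item()
        p2 = model(torch.as_tensor(mfcc_sub2[None], dtype=torch.float32)).item()
    return (p1 >= threshold) or (p2 >= threshold)
```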

14 pages, 433 KiB  
Article
An Automatic Speaker Clustering Pipeline for the Air Traffic Communication Domain
by Driss Khalil, Amrutha Prasad, Petr Motlicek, Juan Zuluaga-Gomez, Iuliia Nigmatulina, Srikanth Madikeri and Christof Schuepbach
Aerospace 2023, 10(10), 876; https://doi.org/10.3390/aerospace10100876 - 10 Oct 2023
Cited by 3 | Viewed by 2220
Abstract
In air traffic management (ATM), voice communications are critical for ensuring the safe and efficient operation of aircraft. The pertinent voice communications, between the air traffic controller (ATCo) and the pilot, are usually transmitted on a single channel, which poses a challenge when developing automatic systems for ATM. Speaker clustering, i.e., identifying and grouping utterances of the same speaker among different speakers, is one of these challenges for speech processing algorithms. We propose a pipeline that deploys (i) speech activity detection (SAD) to identify speech segments, (ii) an automatic speech recognition (ASR) system to generate text for the audio segments, (iii) text-based speaker role classification to detect the role of the speaker (ATCo or pilot in our case), and (iv) unsupervised speaker clustering to group the speech utterances of each individual pilot. The speech segments obtained by SAD are input into the ASR engine to generate automatic English transcripts. The speaker role classification system takes a transcript as input and determines whether the speech came from the ATCo or the pilot. As the main goal of this work is to group the speakers in pilot communications, only the pilot data obtained from the classification system are used. We present a method for separating the pilots' speech into different clusters based on the speaker's voice using agglomerative hierarchical clustering (AHC). The performance of speaker role classification and speaker clustering is evaluated on two publicly available datasets: the ATCO2 corpus and the Linguistic Data Consortium Air Traffic Control Corpus (LDC-ATCC). Since the pilots' real identities are unknown, the ground truth is generated from logical hypotheses about how each dataset was created, timing information, and information extracted from the associated callsigns. For speaker clustering, the proposed algorithm achieves an accuracy of 70% on the LDC-ATCC dataset and 50% on the noisier ATCO2 dataset.
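
A minimal sketch of the final clustering step, assuming per-segment speaker embeddings (e.g., x-vector-like vectors) have already been extracted from the pilot utterances; scikit-learn's agglomerative clustering with a distance threshold stands in for the paper's AHC setup, and the threshold value is a placeholder.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_pilot_segments(embeddings, distance_threshold=1.0):
    """Group pilot utterances by speaker with agglomerative hierarchical
    clustering. `embeddings` is an (n_segments, dim) array of per-segment
    speaker embeddings; the number of clusters is not fixed in advance."""
    ahc = AgglomerativeClustering(n_clusters=None,
                                  distance_threshold=distance_threshold)
    return ahc.fit_predict(np.asarray(embeddings))   # one cluster label per segment
```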

14 pages, 6266 KiB  
Article
Supervised Single Channel Speech Enhancement Method Using UNET
by Md. Nahid Hossain, Samiul Basir, Md. Shakhawat Hosen, A.O.M. Asaduzzaman, Md. Mojahidul Islam, Mohammad Alamgir Hossain and Md Shohidul Islam
Electronics 2023, 12(14), 3052; https://doi.org/10.3390/electronics12143052 - 12 Jul 2023
Cited by 9 | Viewed by 3896
Abstract
This paper proposes a single-channel supervised speech enhancement (SE) method based on UNET, a convolutional neural network (CNN) architecture that extends the basic CNN with a few modifications. In the training phase, the short-time Fourier transform (STFT) is applied to the noisy time-domain signal to build a noisy time-frequency-domain signal, called the complex noisy matrix. The real and imaginary parts of this matrix are concatenated to form the noisy concatenated matrix, which is fed to UNET to extract the speech components and train the CNN model. In the testing phase, the same procedure is applied to the noisy time-domain signal to construct another noisy concatenated matrix, which is passed through the saved pre-trained model to produce an enhanced concatenated matrix. This enhanced concatenated matrix is split back into real and imaginary parts to form an enhanced complex matrix, from which magnitude and phase are extracted; the inverse STFT (ISTFT) then uses this magnitude and phase to generate the enhanced speech signal. The proposed method is evaluated using the IEEE databases and various types of stationary and non-stationary noise. Compared with five other STFT-based methods, sparse non-negative matrix factorization (SNMF), dual-tree complex wavelet transform (DTCWT)-SNMF, DTCWT-STFT-SNMF, STFT-convolutional denoising autoencoder (CDAE), and a causal multi-head attention mechanism (CMAM), the proposed algorithm generally improves speech quality and intelligibility at all considered signal-to-noise ratios (SNRs) and outperforms the competing algorithms in every evaluation metric.
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
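
The "concatenated matrix" bookkeeping described above can be sketched in a few lines; the UNET itself is omitted and the sampling rate and STFT parameters are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def to_concat_matrix(noisy, fs=16000, nperseg=512):
    """Noisy concatenated matrix: real and imaginary parts of the STFT
    stacked along the frequency axis."""
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    return np.concatenate([Z.real, Z.imag], axis=0)   # (2*freq_bins, frames)

def from_concat_matrix(concat, fs=16000, nperseg=512):
    """Rebuild a complex spectrogram from an (enhanced) concatenated matrix
    and invert it back to a time-domain waveform with the ISTFT."""
    half = concat.shape[0] // 2
    Z = concat[:half] + 1j * concat[half:]
    _, enhanced = istft(Z, fs=fs, nperseg=nperseg)
    return enhanced
```

In use, a trained network would map `to_concat_matrix(noisy)` to an enhanced concatenated matrix before `from_concat_matrix` is applied.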

21 pages, 39261 KiB  
Article
A Dual Stream Generative Adversarial Network with Phase Awareness for Speech Enhancement
by Xintao Liang, Yuhang Li, Xiaomin Li, Yue Zhang and Youdong Ding
Information 2023, 14(4), 221; https://doi.org/10.3390/info14040221 - 4 Apr 2023
Cited by 1 | Viewed by 2502
Abstract
Implementing single-channel speech enhancement under unknown noise conditions is a challenging problem. Most existing time-frequency-domain methods operate on the amplitude spectrogram and ignore the phase mismatch between noisy and clean speech, which largely limits enhancement performance. To address the phase mismatch and further improve performance, this paper proposes a dual-stream Generative Adversarial Network (GAN) with phase awareness, named DPGAN. Our generator uses a dual-stream structure to predict amplitude and phase separately and adds an information communication module between the two streams to fully exploit the phase information. To make the prediction more efficient, we build the generator with Transformers, which can more easily learn the structural properties of sound. Finally, we design a perceptually guided discriminator that quantitatively evaluates speech quality, optimising the generator for specific evaluation metrics. In experiments on the widely used Voicebank-DEMAND dataset, DPGAN achieves state-of-the-art performance on most metrics.
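
The dual-stream-with-communication idea can be pictured with a toy module like the one below: one branch refines the magnitude, the other the phase, and two linear maps exchange information between the streams. This stand-in uses plain feed-forward layers rather than the paper's Transformer-based generator or GAN training, and all layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Toy dual-stream block: one stream refines the magnitude, the other the
    phase, and a small communication step lets each stream see the other."""
    def __init__(self, freq_bins=257, hidden=256):
        super().__init__()
        self.mag_net = nn.Sequential(nn.Linear(freq_bins, hidden), nn.ReLU(),
                                     nn.Linear(hidden, freq_bins))
        self.phase_net = nn.Sequential(nn.Linear(freq_bins, hidden), nn.ReLU(),
                                       nn.Linear(hidden, freq_bins))
        # information communication: a learned map from each stream to the other
        self.mag_from_phase = nn.Linear(freq_bins, freq_bins)
        self.phase_from_mag = nn.Linear(freq_bins, freq_bins)

    def forward(self, mag, phase):                 # (batch, frames, freq_bins) each
        m = self.mag_net(mag)
        p = self.phase_net(phase)
        mag_out = m + self.mag_from_phase(p)       # exchange information between streams
        phase_out = p + self.phase_from_mag(m)
        return mag_out, phase_out
```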

17 pages, 2731 KiB  
Article
VAT-SNet: A Convolutional Music-Separation Network Based on Vocal and Accompaniment Time-Domain Features
by Xiaoman Qiao, Min Luo, Fengjing Shao, Yi Sui, Xiaowei Yin and Rencheng Sun
Electronics 2022, 11(24), 4078; https://doi.org/10.3390/electronics11244078 - 8 Dec 2022
Cited by 3 | Viewed by 2395
Abstract
Separating the vocal from the accompaniment in single-channel music is a foundational and critical problem in music information retrieval (MIR). Mainstream music-separation methods are usually based on the frequency-domain characteristics of music signals, so the phase information of the music is lost during time–frequency decomposition. In recent years, deep learning models operating on time-domain speech signals, such as Conv-TasNet, have shown great potential; however, there is no suitable time-domain model for separating the vocal and the accompaniment. Since the vocal and the accompaniment in music have higher synergy and similarity than the voices of two speakers, separating them with a speech-separation model is not ideal. We therefore propose VAT-SNet, which optimizes the network structure of Conv-TasNet: it uses sample-level convolution in the encoder and decoder to preserve deep acoustic features, and it takes the vocal and accompaniment embeddings generated by an auxiliary network as references to improve the purity of the separation. Results on public music datasets show that the vocal and accompaniment separated by VAT-SNet improve in GSNR, GSIR, and GSAR compared with Conv-TasNet and mainstream separation methods such as U-Net and SH-4stack.
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
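
The sample-level convolutional front end mentioned above can be illustrated with a toy encoder/decoder pair; the kernel size, stride, and channel count are placeholders, and the separation network and auxiliary embedding network of VAT-SNet are not reproduced here.

```python
import torch
import torch.nn as nn

class SampleLevelCodec(nn.Module):
    """Toy sample-level front end: a 1-D conv encoder with a very small kernel
    and stride operating directly on the waveform, and a transposed-conv decoder."""
    def __init__(self, channels=256, kernel=2, stride=1):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, wav, mask):
        # wav: (batch, 1, samples); mask: feature-domain separation mask with the
        # same shape as the encoder output (produced by a separator, omitted here).
        feats = torch.relu(self.encoder(wav))
        return self.decoder(feats * mask)          # reconstruct the masked source
```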

17 pages, 2495 KiB  
Article
Language Identification-Based Evaluation of Single Channel Speech Separation of Overlapped Speeches
by Zuhragvl Aysa, Mijit Ablimit, Hankiz Yilahun and Askar Hamdulla
Information 2022, 13(10), 492; https://doi.org/10.3390/info13100492 - 11 Oct 2022
Cited by 4 | Viewed by 2535
Abstract
In multi-lingual, multi-speaker environments (e.g., international conference scenarios), speech in different languages and background sounds can overlap, so source separation techniques are needed to isolate the target sounds. Downstream tasks, such as automatic speech recognition (ASR), speaker recognition, and voice activity detection (VAD), can be combined with speech separation to gain a better understanding of the audio. Since most evaluation methods for single-channel separation rely on either a single signal-level metric or subjective judgment, this paper used a downstream recognition task as the overall evaluation criterion, so that separation performance could be evaluated directly by the metrics of that task. We investigated a two-stage training scheme that combines speech separation and language identification. To analyze and optimize the separation of single-channel overlapping speech, the separated speech was fed to a language identification engine and its accuracy was evaluated. The speech separation model was a single-channel separation network trained on WSJ0-2mix. For the language identification system, we used an Oriental Language Dataset and a dataset synthesized by directly mixing speech groups in different proportions. The combined effect of the two models was evaluated for various overlapping speech scenarios. When the language identification model was based on the spectral features of single-speaker speech, the recognition results for Chinese, Japanese, Korean, Indonesian, and Vietnamese improved significantly over those obtained from the mixed audio spectrum.
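
The evaluation protocol, separating a mixture and scoring the result through a downstream language identification engine, can be sketched as follows; `separate` and `identify_language` are hypothetical callables wrapping the separation network and the LID system.

```python
def lid_accuracy_after_separation(mixtures, labels, separate, identify_language):
    """Evaluate separation through the downstream LID task: separate each
    mixture, run language ID on every separated stream, and count a mixture
    as correct if the predicted set of languages matches the reference labels."""
    correct = 0
    for mix, ref_langs in zip(mixtures, labels):
        streams = separate(mix)                                   # list of separated waveforms
        predicted = {identify_language(s) for s in streams}       # one language guess per stream
        correct += int(predicted == set(ref_langs))
    return correct / len(mixtures)
```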

19 pages, 3296 KiB  
Article
Speech Enhancement Based on Fusion of Both Magnitude/Phase-Aware Features and Targets
by Haitao Lang and Jie Yang
Electronics 2020, 9(7), 1125; https://doi.org/10.3390/electronics9071125 - 10 Jul 2020
Cited by 6 | Viewed by 3924
Abstract
Recently, supervised learning methods, especially deep neural network (DNN)-based methods, have shown promising performance in single-channel speech enhancement. Generally, these approaches extract acoustic features directly from the noisy speech to learn a magnitude-aware target. In this paper, we propose to extract acoustic features not only from the noisy speech but also from the pre-estimated speech, noise, and phase separately, and then fuse them into a new complementary feature to obtain a more discriminative acoustic representation. In addition to learning a magnitude-aware target, we also use the fused feature to learn a phase-aware target, further improving the accuracy of the recovered speech. We conduct extensive experiments, including performance comparisons with typical existing methods, generalization evaluation on unseen noise, an ablation study, and subjective tests with human listeners, to demonstrate the feasibility and effectiveness of the proposed method. Experimental results show that the proposed method improves the quality and intelligibility of the reconstructed speech.
(This article belongs to the Special Issue Theory and Applications in Digital Signal Processing)
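
As a rough illustration of magnitude-aware versus phase-aware training targets, the sketch below computes a spectral magnitude mask and a phase-sensitive mask from clean and noisy STFTs; these are common choices in the literature and are assumptions here, not necessarily the exact targets used in the paper.

```python
import numpy as np

def training_targets(clean_stft, noisy_stft, eps=1e-8):
    """Magnitude-aware target (spectral magnitude mask) and phase-aware target
    (phase-sensitive mask), both clipped to [0, 1] for supervised training."""
    mag_target = np.abs(clean_stft) / (np.abs(noisy_stft) + eps)    # magnitude-aware
    phase_diff = np.angle(clean_stft) - np.angle(noisy_stft)
    phase_target = mag_target * np.cos(phase_diff)                  # phase-aware (PSM)
    return np.clip(mag_target, 0, 1), np.clip(phase_target, 0, 1)
```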

24 pages, 1799 KiB  
Article
A Geometric Algebra Co-Processor for Color Edge Detection
by Biswajit Mishra, Peter Wilson and Reuben Wilcock
Electronics 2015, 4(1), 94-117; https://doi.org/10.3390/electronics4010094 - 26 Jan 2015
Cited by 17 | Viewed by 8560
Abstract
This paper describes an advance in color edge detection using a dedicated Geometric Algebra (GA) co-processor implemented on an Application-Specific Integrated Circuit (ASIC). GA provides a rich set of geometric operations, with the advantage that many signal and image processing operations become straightforward and the algorithms intuitive to design. Using GA, an image's three R, G, B color channels are represented as a single entity rather than as separate quantities. A novel custom ASIC is proposed and fabricated that directly targets GA operations and yields significant performance improvements for color edge detection. The hardware also shows that convolution with GA rotor masks belongs to a class of linear vector filters and can be applied to image or speech signals. The contribution of the approach is demonstrated by implementing three different edge detection schemes on the proposed hardware. The fabricated GA co-processor is more than 3.2× faster than GAIGEN and more than 2800× faster than GABLE, two existing software approaches, and approximately an order of magnitude faster than previously published hardware implementations.
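
A much-simplified software analogue of treating the R, G, B channels as one vector entity is shown below: edge strength is taken as the norm of the vector difference between neighbouring pixels. It does not reproduce the GA rotor-mask convolution or the ASIC implementation and is only meant to convey the vector-valued view of colour.

```python
import numpy as np

def color_edge_map(rgb):
    """Vector-valued colour edge sketch: each pixel is a 3-vector (R, G, B),
    and edge strength is the norm of the difference to its neighbours."""
    rgb = rgb.astype(np.float64)
    dy = np.linalg.norm(np.diff(rgb, axis=0), axis=-1)   # vertical colour change
    dx = np.linalg.norm(np.diff(rgb, axis=1), axis=-1)   # horizontal colour change
    edges = np.zeros(rgb.shape[:2])
    edges[:-1, :] += dy
    edges[:, :-1] += dx
    return edges
```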
