Search Results (9)

Search Parameters:
Keywords = ideal binary mask

12 pages, 351 KB  
Article
A Combined Method for Localizing Two Overlapping Acoustic Sources Based on Deep Learning
by Alexander Lyapin, Ghiath Shahoud and Evgeny Agafonov
Appl. Sci. 2025, 15(12), 6768; https://doi.org/10.3390/app15126768 - 16 Jun 2025
Viewed by 754
Abstract
Deep learning approaches for multi-source sound localization face significant challenges, particularly the need for extensive training datasets encompassing diverse spatial configurations to achieve robust generalization. This requirement leads to substantial computational demands, which are further exacerbated when localizing overlapping sources in complex acoustic environments with reverberation and noise. In this paper, a new methodology is proposed for simultaneous localization of two overlapping sound sources in the time–frequency domain in a closed, reverberant environment with a spatial resolution of 10° using a small-sized microphone array. The proposed methodology is based on the integration of the sound source separation method with a single-source sound localization model. A hybrid model was proposed to separate the sound source signals received by each microphone in the array. The model was built using a bidirectional long short-term memory (BLSTM) network and trained on a dataset using the ideal binary mask (IBM) as the training target. The modeling results show that the proposed localization methodology is efficient in determining the directions for two overlapping sources simultaneously, with an average localization accuracy of 86.1% for the test dataset containing short-term signals of 500 ms duration with different signal-to-signal ratio values. Full article
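The ideal binary mask (IBM) training target used above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the two magnitude spectrograms are stood in by random arrays, and the 0 dB local-SNR criterion is an assumption.

```python
import numpy as np

def ideal_binary_mask(target_mag, interferer_mag, lc_db=0.0):
    """IBM: 1 where the target dominates the interferer by at least
    lc_db (local criterion, in dB) in a time-frequency cell, else 0."""
    eps = 1e-12
    local_snr_db = 20.0 * np.log10((target_mag + eps) / (interferer_mag + eps))
    return (local_snr_db >= lc_db).astype(np.float32)

# Toy magnitude spectrograms (freq bins x frames) standing in for real STFTs.
rng = np.random.default_rng(0)
s1 = rng.random((257, 50))
s2 = rng.random((257, 50))

m1 = ideal_binary_mask(s1, s2)   # mask selecting source-1-dominant cells
m2 = ideal_binary_mask(s2, s1)   # complementary mask for source 2
```

With a 0 dB criterion the two masks partition the time-frequency plane, which is what lets a separation network recover each overlapping source from the cells it dominates.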
(This article belongs to the Section Acoustics and Vibrations)

24 pages, 4763 KB  
Article
Impact of Mask Type as Training Target for Speech Intelligibility and Quality in Cochlear-Implant Noise Reduction
by Fergal Henry, Martin Glavin, Edward Jones and Ashkan Parsi
Sensors 2024, 24(20), 6614; https://doi.org/10.3390/s24206614 - 14 Oct 2024
Cited by 1 | Viewed by 1675
Abstract
The selection of a target when training deep neural networks for speech enhancement is an important consideration. Different masks have been shown to exhibit different performance characteristics depending on the application and the conditions. This paper presents a comprehensive comparison of several different masks for noise reduction in cochlear implants. The study incorporated three well-known masks, namely the Ideal Binary Mask (IBM), Ideal Ratio Mask (IRM) and the Fast Fourier Transform Mask (FFTM), as well as two newly proposed masks, based on existing masks, called the Quantized Mask (QM) and the Phase-Sensitive plus Ideal Ratio Mask (PSM+). These five masks are used to train networks to estimate masks for the purpose of separating speech from noisy mixtures. A vocoder was used to simulate the behavior of a cochlear implant. Short-time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) scores indicate that the two new masks proposed in this study (QM and PSM+) perform best for normal speech intelligibility and quality in the presence of stationary and non-stationary noise over a range of signal-to-noise ratios (SNRs). The Normalized Covariance Measure (NCM) and similarity scores indicate that they also perform best in terms of speech intelligibility and similarity of the vocoded speech. The Quantized Mask performs better than the Ideal Binary Mask due to its better resolution, as it approximates the Wiener Gain Function. The PSM+ performs better than the three existing benchmark masks (IBM, IRM, and FFTM) as it incorporates both magnitude and phase information. Full article
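The IBM/IRM contrast drawn above can be sketched directly. This is a hedged illustration, not the paper's code: the QM and PSM+ definitions are not reproduced here, and the beta = 0.5 exponent for the IRM is a common convention, not necessarily the study's setting.

```python
import numpy as np

def ibm(speech_mag, noise_mag):
    """Ideal binary mask: hard 0/1 decision per time-frequency cell."""
    return (speech_mag >= noise_mag).astype(np.float32)

def irm(speech_mag, noise_mag, beta=0.5):
    """Ideal ratio mask: soft gain in [0, 1], Wiener-like for beta = 0.5."""
    p_s = speech_mag ** 2
    p_n = noise_mag ** 2
    return (p_s / (p_s + p_n + 1e-12)) ** beta

# Toy magnitude spectrograms standing in for real speech/noise STFTs.
rng = np.random.default_rng(1)
s = rng.random((129, 40))
n = rng.random((129, 40))

hard = ibm(s, n)   # binary gains: keep or discard each cell
soft = irm(s, n)   # graded gains: attenuate in proportion to local SNR
```

The graded gains of the IRM are what the abstract means by "better resolution": where speech only slightly dominates, the IRM attenuates rather than switching hard between 0 and 1.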
(This article belongs to the Section Intelligent Sensors)

12 pages, 4323 KB  
Article
Threshold-Based Combination of Ideal Binary Mask and Ideal Ratio Mask for Single-Channel Speech Separation
by Peng Chen, Binh Thien Nguyen, Kenta Iwai and Takanobu Nishiura
Information 2024, 15(10), 608; https://doi.org/10.3390/info15100608 - 4 Oct 2024
Cited by 2 | Viewed by 1540
Abstract
An effective approach to the speech separation problem is to use a time–frequency (T-F) mask. The ideal binary mask (IBM) and ideal ratio mask (IRM) have long been widely used to separate speech signals. However, the IBM is better at improving speech intelligibility, while the IRM is better at improving speech quality. To leverage their respective strengths and overcome their weaknesses, we propose an ideal threshold-based mask (ITM) that combines these two masks. By adjusting two thresholds, the two masks are combined to act jointly on speech separation. We list the impact of different threshold combinations on speech separation performance under ideal conditions and discuss a reasonable range for fine-tuning the thresholds. To evaluate the effectiveness of the proposed method, we conducted supervised speech separation experiments that use masks as the training target, applying a deep neural network (DNN) and a long short-term memory (LSTM) network; the results were measured by three objective metrics: the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). Experimental results show that the proposed mask combines the strengths of the IBM and IRM, and they imply that the accuracy of speech separation can be further improved by effectively leveraging the advantages of different masks. Full article
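One plausible reading of the two-threshold combination (the exact rule is defined in the paper, so this is an assumption): apply binary decisions where the local SNR is confidently high or low, and ratio-mask values in the ambiguous band between the two thresholds. The -5/+5 dB thresholds below are illustrative, not the paper's values.

```python
import numpy as np

def threshold_combined_mask(speech_mag, noise_mag, low_db=-5.0, high_db=5.0):
    """Hypothetical ITM-style mask: hard IBM decisions outside the
    [low_db, high_db] local-SNR band, soft IRM values inside it.
    Both the thresholds and the combination rule are assumptions."""
    eps = 1e-12
    snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    ratio = speech_mag ** 2 / (speech_mag ** 2 + noise_mag ** 2 + eps)
    return np.where(snr_db >= high_db, 1.0,
                    np.where(snr_db <= low_db, 0.0, ratio))

# Toy magnitudes standing in for real speech and noise spectrograms.
rng = np.random.default_rng(3)
s = rng.random((64, 30))
n = rng.random((64, 30))
mask = threshold_combined_mask(s, n)
```

Sweeping `low_db` and `high_db` reproduces the kind of threshold study described in the abstract: at one extreme the mask degenerates to a pure IRM, at the other to a pure IBM.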

24 pages, 6595 KB  
Article
An ISAR Image Component Recognition Method Based on Semantic Segmentation and Mask Matching
by Xinli Zhu, Yasheng Zhang, Wang Lu, Yuqiang Fang and Jun He
Sensors 2023, 23(18), 7955; https://doi.org/10.3390/s23187955 - 18 Sep 2023
Cited by 8 | Viewed by 2170
Abstract
The inverse synthetic aperture radar (ISAR) image is a kind of target feature data acquired by radar for moving targets; it reflects the shape, structure, and motion of the target and has attracted a great deal of attention from the radar automatic target recognition (RATR) community. Component-level recognition in ISAR images has not been addressed in prior radar satellite identification research, and semantic segmentation methods developed for optical images do not achieve satisfactory results on ISAR images. To address this problem, this paper proposes an ISAR image component recognition method based on semantic segmentation and mask matching. A reliable automatic labeling method is designed to produce a satellite target component-labeled ISAR image dataset accurately and efficiently. On this basis, a U-Net and a Siamese network perform binary semantic segmentation and binary mask matching, respectively, and the component label of the ISAR image is predicted from the mask matching results. Experiments on the satellite component-labeled ISAR image dataset confirm that the proposed method is feasible and effective and compares favorably with other classical semantic segmentation networks. Full article
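The mask-matching step can be illustrated with a much simpler stand-in than the paper's Siamese network: score a predicted binary mask against a bank of component templates and take the best match. The IoU criterion and the "panel"/"body" templates below are hypothetical, chosen only to show the labeling-by-matching idea.

```python
import numpy as np

def match_component(pred_mask, template_masks):
    """Label a predicted binary mask by its best intersection-over-union
    against a bank of component template masks (a simple stand-in for
    the paper's Siamese-network matcher)."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    scores = [iou(pred_mask, t) for t in template_masks]
    return int(np.argmax(scores)), max(scores)

# Hypothetical component templates: "panel" = left half, "body" = right half.
panel = np.zeros((8, 8), bool); panel[:, :4] = True
body = np.zeros((8, 8), bool); body[:, 4:] = True
pred = np.zeros((8, 8), bool); pred[:, :3] = True   # mostly panel-shaped

label, score = match_component(pred, [panel, body])
```

A learned matcher replaces the fixed IoU score with a similarity trained to tolerate the speckle and projection distortions typical of ISAR imagery.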
(This article belongs to the Section Radar Sensors)

25 pages, 5965 KB  
Article
Experimental Investigation of Acoustic Features to Optimize Intelligibility in Cochlear Implants
by Fergal Henry, Ashkan Parsi, Martin Glavin and Edward Jones
Sensors 2023, 23(17), 7553; https://doi.org/10.3390/s23177553 - 31 Aug 2023
Cited by 5 | Viewed by 2159
Abstract
Although cochlear implants work well for people with hearing impairment in quiet conditions, it is well-known that they are not as effective in noisy environments. Noise reduction algorithms based on machine learning allied with appropriate speech features can be used to address this problem. The purpose of this study is to investigate the importance of acoustic features in such algorithms. Acoustic features are extracted from speech and noise mixtures and used in conjunction with the ideal binary mask to train a deep neural network to estimate masks for speech synthesis to produce enhanced speech. The intelligibility of this speech is objectively measured using metrics such as Short-time Objective Intelligibility (STOI), Hit Rate minus False Alarm Rate (HIT-FA) and Normalized Covariance Measure (NCM) for both simulated normal-hearing and hearing-impaired scenarios. A wide range of existing features is experimentally evaluated, including features that have not been traditionally applied in this application. The results demonstrate that frequency domain features perform best. In particular, Gammatone features performed best for normal hearing over a range of signal-to-noise ratios and noise types (STOI = 0.7826). Mel spectrogram features exhibited the best overall performance for hearing impairment (NCM = 0.7314). There is a stronger correlation between STOI and NCM than HIT-FA and NCM, suggesting that the former is a better predictor of intelligibility for hearing-impaired listeners. The results of this study may be useful in the design of adaptive intelligibility enhancement systems for cochlear implants based on both the noise level and the nature of the noise (stationary or non-stationary). Full article
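The training pipeline described above (features in, IBM labels out) can be sketched as follows. This is a hedged simplification: the feature here is just the log-magnitude of a crude mixture, whereas the study evaluates richer front-ends (Gammatone, mel spectrogram, and others), and summing magnitudes only approximates a real mixture's STFT magnitude.

```python
import numpy as np

def make_training_pairs(speech_mag, noise_mag):
    """Pair per-frame features with per-frame IBM labels for supervised
    mask estimation. Features are simplified to log-magnitudes of an
    approximate mixture; the paper compares many richer feature sets."""
    mix_mag = speech_mag + noise_mag           # magnitude-sum approximation
    feats = np.log(mix_mag + 1e-12).T          # (frames, freq_bins)
    labels = (speech_mag >= noise_mag).astype(np.float32).T
    return feats, labels

# Toy speech/noise magnitudes standing in for real STFTs.
rng = np.random.default_rng(4)
sp = rng.random((257, 20))
no = rng.random((257, 20))
feats, labels = make_training_pairs(sp, no)
```

A DNN trained on such pairs estimates the mask for unseen mixtures; swapping the feature extractor while keeping the IBM labels fixed is exactly the comparison the study performs.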
(This article belongs to the Section Intelligent Sensors)

24 pages, 2175 KB  
Article
Ensemble System of Deep Neural Networks for Single-Channel Audio Separation
by Musab T. S. Al-Kaltakchi, Ahmad Saeed Mohammad and Wai Lok Woo
Information 2023, 14(7), 352; https://doi.org/10.3390/info14070352 - 21 Jun 2023
Cited by 4 | Viewed by 2767
Abstract
Speech separation is a well-known problem, especially when only one sound mixture is available. Estimating the Ideal Binary Mask (IBM) is one solution to this problem. Recent research has focused on the supervised classification approach, for which the choice of features extracted from the sources is critical. Speech separation has been accomplished using a variety of feature extraction models; the majority of them, however, concentrate on a single feature, and the complementary nature of different features has not been thoroughly investigated. In this paper, we propose a deep neural network (DNN) ensemble architecture to fully explore the complementary nature of the diverse features obtained from raw acoustic features. We examined the penultimate discriminative representations instead of employing the features acquired from the output layer. The learned representations were also fused to produce a new feature vector, which was then classified using the Extreme Learning Machine (ELM). In addition, a genetic algorithm (GA) was used to optimize the parameters globally. The experimental results showed that our proposed system fully exploits the various features and produces a high-quality IBM under different conditions. Full article
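The ELM classifier used for the fused representations is simple enough to sketch: a fixed random hidden layer followed by a closed-form least-squares output layer. This is a minimal illustration on a toy mask-like target, not the paper's ensemble; the network sizes and data are arbitrary.

```python
import numpy as np

class TinyELM:
    """Minimal Extreme Learning Machine: random hidden layer,
    closed-form least-squares output weights (no backprop)."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_in, n_hidden))
        self.b = rng.normal(size=n_hidden)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ Y   # least-squares output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy binary-mask-style target: predict whether feature 0 >= feature 1.
rng = np.random.default_rng(2)
X = rng.random((500, 4))
Y = (X[:, 0] >= X[:, 1]).astype(float).reshape(-1, 1)

elm = TinyELM(4, 64).fit(X, Y)
acc = np.mean((elm.predict(X) > 0.5) == (Y > 0.5))
```

Because only the output weights are learned, and in closed form, the ELM trains orders of magnitude faster than gradient-based classifiers, which is its main appeal as the final stage of a DNN ensemble.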
(This article belongs to the Topic Advances in Artificial Neural Networks)

17 pages, 3278 KB  
Article
Deep Learning Diagnostics of Gray Leaf Spot in Maize under Mixed Disease Field Conditions
by Hamish A. Craze, Nelishia Pillay, Fourie Joubert and Dave K. Berger
Plants 2022, 11(15), 1942; https://doi.org/10.3390/plants11151942 - 26 Jul 2022
Cited by 25 | Viewed by 4162
Abstract
Maize yields worldwide are limited by foliar diseases that could be fungal, oomycete, bacterial, or viral in origin. Correct disease identification is critical for farmers to apply the correct control measures, such as fungicide sprays. Deep learning has the potential for automated disease classification from images of leaf symptoms. We aimed to develop a classifier to identify gray leaf spot (GLS) disease of maize in field images where mixed diseases were present (18,656 images after augmentation). In this study, we compare deep learning models trained on mixed disease field images with and without background subtraction. Performance was compared with models trained on PlantVillage images with single diseases and uniform backgrounds. First, we developed a modified VGG16 network referred to as “GLS_net” to perform binary classification of GLS, which achieved a 73.4% accuracy. Second, we used MaskRCNN to dynamically segment leaves from backgrounds in combination with GLS_net to identify GLS, resulting in a 72.6% accuracy. Models trained on PlantVillage images were 94.1% accurate at GLS classification with the PlantVillage testing set but performed poorly with the field image dataset (55.1% accuracy). In contrast, the GLS_net model was 78% accurate on the PlantVillage testing set. We conclude that deep learning models trained with realistic mixed disease field data obtain superior degrees of generalizability and external validity when compared to models trained using idealized datasets. Full article
(This article belongs to the Special Issue Deep Learning in Plant Sciences)

15 pages, 3243 KB  
Article
Multi-Task Learning U-Net for Single-Channel Speech Enhancement and Mask-Based Voice Activity Detection
by Geon Woo Lee and Hong Kook Kim
Appl. Sci. 2020, 10(9), 3230; https://doi.org/10.3390/app10093230 - 6 May 2020
Cited by 28 | Viewed by 5845
Abstract
In this paper, a multi-task learning U-shaped neural network (MTU-Net) is proposed and applied to single-channel speech enhancement (SE). The proposed MTU-Net-based SE method estimates an ideal binary mask (IBM) or an ideal ratio mask (IRM) by extending the decoding network of a conventional U-Net to simultaneously model the speech and noise spectra as the target. The effectiveness of the proposed SE method was evaluated under both matched and mismatched noise conditions between training and testing by measuring the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). The proposed SE method with the IRM achieved a substantial improvement, with average PESQ scores 0.17, 0.52, and 0.40 higher than those of other state-of-the-art deep-learning-based methods, namely the deep recurrent neural network (DRNN), SE generative adversarial network (SEGAN), and conventional U-Net, respectively. In addition, the STOI scores of the proposed SE method are 0.07, 0.05, and 0.05 higher than those of the DRNN, SEGAN, and U-Net, respectively. Next, voice activity detection (VAD) is also proposed by using the IRM estimated by the proposed MTU-Net-based SE method; this VAD is fundamentally an unsupervised method requiring no model training. Its performance was compared with that of supervised learning-based methods using a deep neural network (DNN), a boosted DNN, and a long short-term memory (LSTM) network. The proposed VAD method shows slightly better performance than the three neural network-based methods under mismatched noise conditions. Full article
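The two uses of the estimated IRM described above (enhancement by masking, and training-free VAD) can be sketched together. This is a hedged illustration: element-wise masking of the noisy STFT is standard practice, but the per-frame mean-mask statistic and the 0.4 threshold are assumptions, not the paper's VAD rule.

```python
import numpy as np

def enhance_and_vad(noisy_stft, est_irm, vad_threshold=0.4):
    """Apply an estimated IRM to a noisy complex STFT and derive
    frame-level voice activity from the mask's per-frame mean.
    The 0.4 threshold is an assumption, not the paper's value."""
    enhanced_stft = est_irm * noisy_stft      # element-wise T-F masking
    frame_score = est_irm.mean(axis=0)        # mean mask value per frame
    vad = frame_score > vad_threshold         # True = speech-active frame
    return enhanced_stft, vad

# Toy complex STFT and mask standing in for MTU-Net outputs.
rng = np.random.default_rng(5)
noisy = rng.normal(size=(129, 60)) + 1j * rng.normal(size=(129, 60))
mask = rng.random((129, 60))
enh, vad = enhance_and_vad(noisy, mask)
```

The appeal of this VAD is exactly what the abstract states: it reuses the mask the enhancement network already produces, so no separate VAD model has to be trained.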
(This article belongs to the Special Issue Intelligent Speech and Acoustic Signal Processing)

13 pages, 1175 KB  
Article
Dual-Channel Cosine Function Based ITD Estimation for Robust Speech Separation
by Xuliang Li, Zhaogui Ding, Weifeng Li and Qingmin Liao
Sensors 2017, 17(6), 1447; https://doi.org/10.3390/s17061447 - 20 Jun 2017
Cited by 3 | Viewed by 5458
Abstract
In speech separation tasks, many separation methods require closely spaced microphones because they cannot cope with phase wrap-around. In this paper, we present a novel two-microphone speech separation scheme that does not have this restriction. The technique uses estimated interaural time difference (ITD) statistics and a binary time-frequency mask to separate the mixed speech sources. The novelties of the paper are: (1) the extended application of delay-and-sum beamforming (DSB) and a cosine function for ITD calculation; and (2) the clarification of the connection between the ideal binary mask and the DSB amplitude ratio. Our objective quality evaluation experiments demonstrate the effectiveness of the proposed method. Full article
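A plain cross-correlation peak search illustrates the ITD estimation the scheme builds on. This is a stand-in, not the paper's DSB/cosine-based formulation; the sampling rate, delay range, and sign convention (positive ITD = right channel lags left) are assumptions.

```python
import numpy as np

def estimate_itd(ch_left, ch_right, fs, max_delay_s=1e-3):
    """Estimate the interaural time difference by locating the peak of
    the cross-correlation between the two channels within a plausible
    delay range (a simple stand-in for DSB/cosine-based estimation)."""
    corr = np.correlate(ch_left, ch_right, mode="full")
    center = len(ch_right) - 1                 # index of zero lag
    max_lag = int(max_delay_s * fs)
    window = corr[center - max_lag: center + max_lag + 1]
    best_lag = np.argmax(window) - max_lag
    return -best_lag / fs                      # positive = right lags left

# A 440 Hz tone delayed by 5 samples on the right channel.
fs = 16000
t = np.arange(1024) / fs
left = np.sin(2 * np.pi * 440 * t)
right = np.roll(left, 5)                       # right lags left by 5 samples
itd = estimate_itd(left, right, fs)
```

Per-cell ITD estimates of this kind, accumulated into statistics, are what drive the binary time-frequency mask: cells whose ITD matches a source's direction are assigned to that source.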
