Advances in Audio Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: 20 July 2026 | Viewed by 8869

Special Issue Editors


Prof. Dr. Jesús B. Alonso-Hernández
Guest Editor
Instituto Universitario para el Desarrollo Tecnológico y la Innovación en Comunicaciones (IDeTIC), University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Interests: audio signal processing; voice biometrics; environmental sound analysis; machine learning for audio classification; biomedical audio applications

Dr. María Luisa Barragán-Pulido
Guest Editor Assistant
Instituto Universitario para el Desarrollo Tecnológico y la Innovación en Comunicaciones (IDeTIC), University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Interests: machine learning for audio classification; biomedical audio applications

Special Issue Information

Dear Colleagues,

This Special Issue aims to highlight the latest advances in audio signal processing and its application across a wide range of domains. We invite researchers to submit their most recent innovative research related to applications in areas such as audio quality enhancement, the use of artificial intelligence in audio analysis, and environmental acoustics.

Topics of interest include, but are not limited to, the following:

  • Advanced filtering algorithms and source separation techniques.
  • Speech recognition and intelligibility improvement in noisy environments.
  • Applications of deep learning in the classification and analysis of audio signals.
  • Audio processing for biomedical and assistive technologies.
  • Novel methodologies in multichannel and spatial audio analysis.
  • Advances in environmental acoustics and their impact on soundscape design.

This Special Issue seeks to foster collaboration among researchers, engineers, and industry professionals by providing a platform for sharing findings that drive future innovations in audio signal processing.

Prof. Dr. Jesús B. Alonso-Hernández
Guest Editor

Dr. María Luisa Barragán-Pulido
Guest Editor Assistant

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • audio signal processing
  • artificial intelligence in audio
  • speech enhancement and recognition
  • environmental acoustics
  • multichannel
  • spatial audio analysis

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)

Research

18 pages, 973 KB  
Article
How Far Can a U-Net Go? An Empirical Analysis of Music Source Separation Performance
by Daniel Kostrzewa, Mikolaj Kondziolka, Robert Brzeski, Jeremiah Abimbola and Pawel Benecki
Appl. Sci. 2026, 16(5), 2195; https://doi.org/10.3390/app16052195 - 25 Feb 2026
Viewed by 778
Abstract
Music source separation (MSS) focuses on decomposing a mixed audio signal into individual instrumental components and is increasingly relevant for music production, restoration, remixing, education, and music information retrieval. Deep learning methods, particularly U-Net architectures operating on time–frequency representations, have recently advanced the state of the art beyond traditional signal-processing techniques. This work presents an optimized multi-source U-Net model for separating selected musical instruments from stereo mixtures. The system uses magnitude spectrograms generated by the short-time Fourier transform and is trained and evaluated on the MUSDB18 dataset. We systematically examine architectural and training-related factors, including normalization strategies, dropout placement, optimizer selection, loss weighting, data augmentation, and spectrogram-domain modifications. Separation quality is measured using BSS Eval metrics, assessing artifacts, interference, and distortion. Experimental results show that the proposed configuration achieves competitive performance relative to established convolutional and U-Net-based open-source systems, especially in terms of vocal track separation, offering practical insights into designing efficient models for multi-instrument separation.
(This article belongs to the Special Issue Advances in Audio Signal Processing)
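
As an illustration of the input representation mentioned in the abstract above, the following is a minimal sketch of a magnitude spectrogram computed with a short-time Fourier transform. It is not the authors' implementation: the function name `stft_magnitude`, the Hann window, the frame length of 1024 samples, the hop of 256 samples, and the mono test tone are all assumptions chosen for the example (the paper itself works on stereo mixtures).

```python
import numpy as np

def stft_magnitude(signal, frame_len=1024, hop=256):
    """Magnitude spectrogram of a mono signal via a Hann-windowed STFT.

    frame_len and hop are illustrative defaults, not values from the paper.
    Returns an array of shape (frame_len // 2 + 1, n_frames).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One-sided FFT per frame; keep only the magnitude.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Hypothetical test input: one second of a 1 kHz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 1024  # bin index converted back to Hz
```

With these assumed settings the tone falls exactly on a frequency bin, so the strongest row of `spec` corresponds to 1000 Hz. A separation model of the kind described would take such magnitude arrays as input.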

14 pages, 8948 KB  
Article
Parallel Enhancement and Bandwidth Extension of Coded Speech
by Jongwook Chae, Eunkyun Lee, Sooyoung Park and Jong Won Shin
Appl. Sci. 2026, 16(3), 1439; https://doi.org/10.3390/app16031439 - 30 Jan 2026
Viewed by 691
Abstract
An important use case of speech bandwidth extension (BWE) is generating high-frequency components from band-limited speech processed by a speech codec. Recent works on BWE have demonstrated remarkable capabilities in generating high-quality, high-band components using deep learning techniques. Among them, Streaming SEANet (StrmSEANet) has also been shown to be effective for BWE with reduced delay and computational complexity, making it suitable for real-time speech processing. However, the effect of the coding artifact in the lower band of the input signal has not been sufficiently considered in many deep learning-based BWE methods. In this work, we propose Parallel Enhancement and Bandwidth Extension of coded speech (PEBE), where two lightweight networks, referred to as Compact Streaming SEANet (CompSEANet), for coded speech enhancement (CSE) and BWE are configured in parallel. The CSE and BWE models are separately trained with the task-specific training settings, thereby effectively improving the reconstruction quality of the band-limited speech signals degraded by coding artifacts. Experimental results demonstrate that the proposed PEBE significantly outperforms the baseline AP-BWE, StrmSEANet, and standalone CompSEANet in reconstructing wideband (WB) and fullband speech from Opus-coded narrowband and WB signals. The proposed method achieves the highest scores in the subjective MUSHRA test while providing the fastest inference among all compared methods, with real-time factors (RTF) of 33.95× and 18.38× measured on a Samsung SM-F711 mobile device under single-thread execution.
(This article belongs to the Special Issue Advances in Audio Signal Processing)

25 pages, 1229 KB  
Article
YOLO-Based Transfer Learning for Sound Event Detection Using Visual Object Detection Techniques
by Sergio Segovia González, Sara Barahona Quiros and Doroteo T. Toledano
Appl. Sci. 2026, 16(1), 205; https://doi.org/10.3390/app16010205 - 24 Dec 2025
Viewed by 1311
Abstract
Traditional Sound Event Detection (SED) approaches are based on either specialized models or these models in combination with general audio embedding extractors. In this article, we propose to reframe SED as an object detection task in the time–frequency plane and introduce a direct adaptation of modern YOLO detectors to audio. To our knowledge, this is among the first works to employ YOLOv8 and YOLOv11 not merely as feature extractors but as end-to-end models that localize and classify sound events on mel-spectrograms. Methodologically, our approach (i) generates mel-spectrograms on the fly from raw audio to streamline the pipeline and enable transfer learning from vision models; (ii) applies curriculum learning that exposes the detector to progressively more complex mixtures, improving robustness to overlaps; and (iii) augments training with synthetic audio constructed under DCASE 2023 guidelines to enrich rare classes and challenging scenarios. Comprehensive experiments compare our YOLO-based framework against strong CRNN and Conformer baselines. In our experiments on the DCASE-style setting, the method achieves competitive detection accuracy relative to CRNN and Conformer baselines, with gains in some overlapping/noisy conditions and shortcomings for several short-duration classes. These results suggest that adapting modern object detectors to audio can be effective in this setting, while broader generalization and encoder-augmented comparisons remain open.
(This article belongs to the Special Issue Advances in Audio Signal Processing)

16 pages, 1427 KB  
Article
Acoustic Vector Sensor–Based Speaker Diarization Using Sound Intensity Analysis for Two-Speaker Dialogues
by Grzegorz Szwoch, Józef Kotus and Szymon Zaporowski
Appl. Sci. 2025, 15(23), 12780; https://doi.org/10.3390/app152312780 - 3 Dec 2025
Viewed by 2572
Abstract
Speaker diarization is a key component of automatic speech recognition (ASR) systems, particularly in interview scenarios where speech segments must be assigned to individual speakers. This study presents a diarization algorithm based on sound intensity analysis using an Acoustic Vector Sensor (AVS). The algorithm determines the azimuth of each speaker, defines directional beams, and detects speaker activity by analyzing intensity distributions within each beam, enabling identification of both single and overlapping speech segments. A dedicated dataset of interview recordings involving five speakers was created for evaluation. Performance was assessed using the Diarization Error Rate (DER) metric and compared with the state-of-the-art Pyannote.audio system. The proposed AVS-based method achieved a lower DER value (0.112) than Pyannote (0.213) without overlapping speech, and a DER equal to 0.187 with overlapping speech included, demonstrating improved diarization accuracy and better handling of overlapping speech. The algorithm does not require training, operates independently of speaker-specific features, and can be adapted to various acoustic conditions. The results confirm that AVS-based diarization provides a robust and interpretable alternative to neural approaches, particularly suitable for structured two-speaker dialogues such as physician–patient or interviewer–interviewee scenarios.
(This article belongs to the Special Issue Advances in Audio Signal Processing)
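
For readers unfamiliar with the metric reported above, the sketch below computes a simplified, frame-level Diarization Error Rate as the sum of missed speech, false alarm, and speaker confusion, divided by the total reference speech. It is an assumption-laden illustration, not the scoring procedure used in the paper: the function name `frame_der` and the toy label sequences are hypothetical, each frame carries a single label (so overlapping speech is not modeled), and hypothesis speaker labels are assumed to be already mapped to the reference labels.

```python
import numpy as np

def frame_der(reference, hypothesis):
    """Simplified frame-level DER.

    Labels: 0 = silence, k > 0 = speaker k. Assumes one label per frame
    (no overlapping speech) and that hypothesis labels are already
    aligned with reference labels.
    """
    ref = np.asarray(reference)
    hyp = np.asarray(hypothesis)
    if ref.shape != hyp.shape:
        raise ValueError("reference and hypothesis must have the same length")

    ref_speech = ref > 0
    hyp_speech = hyp > 0

    missed = np.sum(ref_speech & ~hyp_speech)        # speech labeled as silence
    false_alarm = np.sum(~ref_speech & hyp_speech)   # silence labeled as speech
    confusion = np.sum(ref_speech & hyp_speech & (ref != hyp))  # wrong speaker
    total = np.sum(ref_speech)
    if total == 0:
        raise ValueError("reference contains no speech")

    return float(missed + false_alarm + confusion) / float(total)

# Toy example with ten frames and two speakers (made-up data).
ref = [0, 1, 1, 1, 2, 2, 2, 2, 0, 0]
hyp = [0, 1, 1, 2, 2, 2, 0, 2, 1, 0]
der = frame_der(ref, hyp)  # one miss, one false alarm, one confusion over 7 speech frames
```

In this toy case the three errors over seven reference speech frames give a DER of 3/7. Evaluation toolkits typically add details such as forgiveness collars and optimal speaker mapping, which this sketch leaves out.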

18 pages, 1138 KB  
Article
Speech-Based Depression Recognition in Hikikomori Patients Undergoing Cognitive Behavioral Therapy
by Samara Soares Leal, Stavros Ntalampiras, Maria Gloria Rossetti, Antonio Trabacca, Marcella Bellani and Roberto Sassi
Appl. Sci. 2025, 15(21), 11750; https://doi.org/10.3390/app152111750 - 4 Nov 2025
Cited by 1 | Viewed by 1165
Abstract
Major depressive disorder (MDD) affects approximately 4.4% of the global population. Its prevalence is increasing among adolescents and has led to the psychosocial condition known as hikikomori. MDD is typically assessed by self-report questionnaires, which, although informative, are subject to evaluator bias and subjectivity. To address these limitations, recent studies have explored machine learning (ML) for automated MDD detection. Among the input data used, speech signals stand out due to their low cost and minimal intrusiveness. However, many speech-based approaches lack integration with cognitive behavioral therapy (CBT) and adherence to evidence-based, patient-centered care, often aiming to replace rather than support clinical monitoring. In this context, we propose ML models to assess MDD in hikikomori patients using speech data from a real-world clinical trial. The trial is conducted in Italy, supervised by physicians, and comprises an eight-session CBT plan that is clinical evidence-based and follows patient-centered practices. Patients’ speech is recorded during therapy, and the Mel-Frequency Cepstral Coefficients (MFCCs) and wav2vec 2.0 embeddings are extracted to train the models. The results show that the Multi-Layer Perceptron (MLP) predicted depression outcomes with a Root Mean Squared Error (RMSE) of 0.064 using only MFCCs from the first session, suggesting that early-session speech may be valuable for outcome prediction. When considering the entire CBT treatment (i.e., all sessions), the MLP achieved an RMSE of 0.063 using MFCCs and a lower RMSE of 0.057 with wav2vec 2.0, indicating approximately a 9.5% performance improvement. To aid the interpretability of the treatment outcomes, a binary task was conducted, where Logistic Regression (LR) achieved 70% recall in predicting depression improvement among young adults using wav2vec 2.0. These findings position speech as a valuable predictive tool in clinical informatics, potentially supporting clinicians in anticipating treatment response.
(This article belongs to the Special Issue Advances in Audio Signal Processing)
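
The improvement of approximately 9.5% quoted in the abstract above is consistent with a relative reduction in RMSE when moving from MFCCs to wav2vec 2.0 over all sessions. Assuming that is the intended definition (the abstract does not state it explicitly), the arithmetic is:

```latex
\[
\frac{\mathrm{RMSE}_{\mathrm{MFCC}} - \mathrm{RMSE}_{\mathrm{wav2vec\,2.0}}}{\mathrm{RMSE}_{\mathrm{MFCC}}}
= \frac{0.063 - 0.057}{0.063}
= \frac{0.006}{0.063}
\approx 0.095 = 9.5\%
\]
```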

22 pages, 5191 KB  
Article
Neural Network Regression for Sound Source Localization Using Time Difference of Arrival Based on Parametric Homomorphic Deconvolution
by Keonwook Kim and Anthony Choi
Appl. Sci. 2025, 15(17), 9272; https://doi.org/10.3390/app15179272 - 23 Aug 2025
Cited by 1 | Viewed by 1680
Abstract
This paper proposes a novel sound source localization system that combines parametric homomorphic deconvolution with neural network regression to estimate the angle of arrival from a single-channel signal. The system uses an analog adder to sum signals from three spatially arranged microphones, reducing system hardware complexity and requiring the estimation of time delays from a single-channel signal. Time delay features are extracted through parametric homomorphic deconvolution methods (Yule–Walker, Prony, and Steiglitz–McBride) and input to multilayer perceptrons configured with various structures. Simulations confirm that Steiglitz–McBride provides the sharpest and most accurate predictions with reduced model order, while Yule–Walker shows slightly better performance than Prony at higher orders. A hybrid learning strategy that combines synthetic and real-world data improves generalization and robustness across all angles. Experimental validations in an anechoic chamber support the simulation results, showing high correlation and low deviation values, especially with the Steiglitz–McBride method. The proposed sound source localization system demonstrates a compact and scalable design suitable for real-time and resource-constrained applications and provides a promising platform for future extensions in complex environments and broader signal interpretation domains.
(This article belongs to the Special Issue Advances in Audio Signal Processing)