Recent Advances in Audio, Speech and Music Processing and Analysis, 2nd Edition

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: 15 March 2026 | Viewed by 8511

Special Issue Editors


Dr. Chrisoula Alexandraki
Guest Editor
Department of Music Technology & Acoustics, Hellenic Mediterranean University, 74133 Rethymnon, Greece
Interests: networked music performance; machine musicianship; music information retrieval; musical acoustics

Special Issue Information

Dear Colleagues,

Audio plays an important role in everyday life, as it is incorporated in applications ranging from broadcasting and telecommunications to the entertainment, multimedia, and gaming industries. Although less prominent than image processing technology, which has dominated the industry in recent years, audio processing remains the subject of vigorous academic research and technological development. Relevant research initiatives address speech recognition, audio compression, noise cancellation, speaker verification and identification, voice synthesis, and voice transcription systems, to name a few. With respect to music signals, research initiatives focus on music information retrieval for music streaming and recommendation; networked music making, teaching, and performing; autonomous and semi-autonomous computer musicians; and many more. This Special Issue provides an opportunity to disseminate state-of-the-art progress on emerging applications, algorithms, and systems related to audio, speech, and music processing and analysis.

Topics of interest include but are not limited to the following:

  • Audio and speech analysis and recognition;
  • Deep learning for robust speech recognition systems;
  • Active noise canceling systems;
  • Blind speech separation;
  • Robust speech recognition in environments with multiple simultaneous speakers;
  • Room acoustics modeling;
  • Dereverberation;
  • Environmental sound recognition;
  • Music information retrieval;
  • Networked music performance systems;
  • Technologies and applications of the Internet of Sounds;
  • Computer accompaniment and machine musicianship;
  • Digital music representations and collaborative music making;
  • Online music education technologies;
  • Computational approaches to musical acoustics;
  • Music generation using deep learning.

We look forward to your valuable contributions.

Dr. Athanasios Koutras
Dr. Chrisoula Alexandraki
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • sound analysis
  • sound processing
  • music information retrieval
  • audio analysis
  • audio recognition
  • music technology
  • computational music cognition

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.


Published Papers (5 papers)


Research

15 pages, 51754 KB  
Article
Underwater Acoustic Data Transmission in the Presence of Challenging Multipath Conditions and Shadow Zones: Sea Trial Analysis and Lessons Learned
by Jacopo Lazzarin, Antonio Montanari, Diego Spinosa, Davide Cosimo, Riccardo Costanzi, Filippo Campagnaro and Michele Zorzi
Electronics 2026, 15(2), 358; https://doi.org/10.3390/electronics15020358 - 13 Jan 2026
Viewed by 201
Abstract
In comparison to traditional wired and wireless communication scenarios, the underwater channel is peculiar, being significantly more difficult for communication and presenting a unique set of features and impairments, thus necessitating special care in selecting ad hoc encoding and modulation technologies to achieve successful transmissions. This process can be aided by simulations, which can be effectively carried out only using a good, detailed channel model validated through sea measurements. This study presents the results of a sea measurement campaign run in May 2024 off the Gulf of La Spezia, Italy, characterized by challenging shallow water conditions and the presence of shadow zones. The collected data is then used to model a simulated channel as faithful as possible to the one experienced during the sea trial. The obtained channel is then used to carry out a comparison of different forward error correction (FEC) codes, highlighting each scheme’s performance in our working context. Conclusive results show that a satisfactory simulated channel was obtained and that a different choice of FEC schemes could have improved the performance of the underwater acoustic communication. Full article
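
For readers curious about the experimental shape such a study takes, the sketch below runs a toy Monte Carlo comparison of coded versus uncoded transmission over a multipath channel. It is only an illustration: the paper's validated channel model and its actual FEC schemes are not reproduced here, and the three-tap delay line, BPSK signaling, and rate-1/3 repetition code with majority-vote decoding are all placeholder assumptions.

```python
# Toy Monte Carlo FEC comparison over a multipath channel (illustration only).
import numpy as np

rng = np.random.default_rng(0)
taps = np.array([1.0, 0.5, 0.3])          # hypothetical multipath tap gains

def transmit(bits, snr_db):
    symbols = 1.0 - 2.0 * bits            # BPSK mapping: 0 -> +1, 1 -> -1
    rx = np.convolve(symbols, taps)[:len(symbols)]  # tapped-delay-line channel
    noise_std = np.sqrt(10 ** (-snr_db / 10) / 2)
    return rx + rng.normal(0, noise_std, len(rx))

def ber_uncoded(n_bits, snr_db):
    bits = rng.integers(0, 2, n_bits)
    rx = transmit(bits, snr_db)
    return np.mean((rx < 0).astype(int) != bits)

def ber_repetition3(n_bits, snr_db):
    bits = rng.integers(0, 2, n_bits)
    coded = np.repeat(bits, 3)            # rate-1/3 repetition encoding
    rx = transmit(coded, snr_db)
    votes = (rx < 0).astype(int).reshape(-1, 3).sum(axis=1)
    return np.mean((votes >= 2).astype(int) != bits)  # majority-vote decoding

for snr in (0, 5, 10):
    print(f"SNR {snr:2d} dB: uncoded BER {ber_uncoded(100_000, snr):.4f}, "
          f"rep-3 BER {ber_repetition3(100_000, snr):.4f}")
```

In a real campaign, the taps would be derived from measured impulse responses and the repetition code replaced by the FEC schemes under test.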

15 pages, 1366 KB  
Article
Multi-Feature Fusion for Automatic Piano Transcription Based on Mel Cyclic and STFT Spectrograms
by Jinliang Dai, Qiuyue Zheng, Yang Wang, Qihuan Shan, Jie Wan and Weiwei Zhang
Electronics 2025, 14(23), 4720; https://doi.org/10.3390/electronics14234720 - 29 Nov 2025
Viewed by 437
Abstract
Automatic piano transcription (APT) is a challenging problem in music information retrieval. In recent years, most APT approaches have been based on neural networks and have demonstrated improved performance. However, most previous works utilize a short-time Fourier transform (STFT) spectrogram as input, which results in a noisy spectrogram due to the mixing of harmonics from concurrent notes. To address this issue, a novel APT network based on two spectrograms is proposed. Firstly, the Mel cyclic and Mel STFT spectrograms of the piano signal are computed to represent the mixed audio. Next, separate modules for onset, offset, and frame-level note detection are constructed to achieve distinct objectives. To capture the temporal dynamics of notes, an axial attention mechanism is incorporated into the frame-level note detection modules. Finally, a multi-feature fusion module is introduced to aggregate the different features and generate the piano note sequences. In this work, the two spectrograms provide complementary information, the axial attention mechanism enhances the temporal relevance of notes, and the multi-feature fusion module combines frame-level note, note onset, and note offset features to deduce the final piano notes. Experimental results demonstrate that the proposed approach achieves higher accuracies and lower error rates in automatic piano transcription compared with reference approaches. Full article
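
The sketch below shows how the kind of input representations named above can be computed with librosa. The "Mel cyclic" spectrogram is the paper's own feature and its exact construction is not given here, so the sketch covers only the standard STFT and Mel spectrograms such a front end would start from; the file name, sample rate, and n_mels value are illustrative assumptions.

```python
# Sketch: STFT and Mel spectrogram front end for transcription (assumptions noted above).
import librosa
import numpy as np

y, sr = librosa.load("piano.wav", sr=16000)   # hypothetical input file

# Linear-frequency STFT spectrogram (magnitude, log-compressed).
stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
log_stft = librosa.amplitude_to_db(stft, ref=np.max)

# Mel-scaled spectrogram, a common lower-dimensional complement to the STFT.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=229)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_stft.shape, log_mel.shape)  # (1025, frames), (229, frames)
```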

27 pages, 1533 KB  
Article
Sound Source Localization Using Hybrid Convolutional Recurrent Neural Networks in Undesirable Conditions
by Bastian Estay Zamorano, Ali Dehghan Firoozabadi, Alessio Brutti, Pablo Adasme, David Zabala-Blanco, Pablo Palacios Játiva and Cesar A. Azurdia-Meza
Electronics 2025, 14(14), 2778; https://doi.org/10.3390/electronics14142778 - 10 Jul 2025
Cited by 2 | Viewed by 1968
Abstract
Sound event localization and detection (SELD) is a fundamental task in spatial audio processing that involves identifying both the type and location of sound events in acoustic scenes. Current SELD models often struggle with low signal-to-noise ratios (SNRs) and high reverberation. This article addresses SELD by reformulating direction of arrival (DOA) estimation as a multi-class classification task, leveraging deep convolutional recurrent neural networks (CRNNs). We propose and evaluate two modified architectures: M-DOAnet, an optimized version of DOAnet for localization and tracking, and M-SELDnet, a modified version of SELDnet designed for joint SELD. Both models were rigorously evaluated on the STARSS23 dataset, which comprises real-world indoor scenes covering 13 sound event classes and totaling over 7 h of audio, using spectrograms and acoustic intensity maps derived from first-order Ambisonics (FOA) signals. M-DOAnet achieved exceptional localization (6.00° DOA error, 72.8% F1-score) and perfect tracking (100% MOTA with zero identity switches). It also demonstrated high computational efficiency, training in 4.5 h (164 s/epoch). In contrast, M-SELDnet delivered strong overall SELD performance (0.32 rad DOA error, 0.75 F1-score, 0.38 error rate, 0.20 SELD score), but with significantly higher resource demands, training in 45 h (1620 s/epoch). Our findings underscore a clear trade-off between model specialization and multifunctionality, providing practical insights for designing SELD systems in real-time and computationally constrained environments. Full article
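
As a rough illustration of one of the two input features mentioned in the abstract, the sketch below computes acoustic intensity features from first-order Ambisonics signals, following the standard active-intensity formulation used in SELD work. The (W, X, Y, Z) channel ordering and the energy normalization are assumptions, not details taken from the paper.

```python
# Sketch: active intensity vectors from FOA signals in the STFT domain.
import numpy as np
import librosa

def foa_intensity(foa, n_fft=1024, hop=512):
    """foa: array of shape (4, samples) with channels ordered W, X, Y, Z."""
    specs = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop)
                      for ch in foa])            # (4, freq, frames), complex
    w, xyz = specs[0], specs[1:]
    # Active intensity: real part of conj(W) times each directional channel.
    intensity = np.real(np.conj(w)[None] * xyz)  # (3, freq, frames)
    # Normalize by total energy so the features stay bounded.
    energy = np.abs(w) ** 2 + np.sum(np.abs(xyz) ** 2, axis=0) / 3 + 1e-8
    return intensity / energy[None]
```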

14 pages, 2069 KB  
Article
The Role of Facial Action Units in Investigating Facial Movements During Speech
by Aliya A. Newby, Ambika Bhatta, Charles Kirkland III, Nicole Arnold and Lara A. Thompson
Electronics 2025, 14(10), 2066; https://doi.org/10.3390/electronics14102066 - 20 May 2025
Viewed by 1566
Abstract
Investigating how facial movements can be used to characterize and quantify speech is important, in particular, to aid those suffering from motor control speech disorders. Here, we sought to investigate how facial action units (AUs), previously used to classify human expressions and emotions, could be used to quantify and understand unimpaired human speech. Fourteen (14) adult participants (30.1 ± 7.9 years old), fluent in English, with no speech impairments, were examined. Within each data collection session, 6 video trials per participant per phoneme were acquired (i.e., 102 trials total/phoneme). The participants were asked to vocalize the vowels /æ/, /ɛ/, /ɪ/, /ɒ/, and /ʊ/; the consonants /b/, /n/, /m/, /p/, /h/, /w/, and /d/; and the diphthongs /eI/, /ʌɪ/, /i/, /a:/, and /u:/. Using the Python package Py-Feat, our analysis revealed the AU contributions for each phoneme. An important implication of our methodological findings is that AUs can be used to quantify speech in populations with no speech disability, and this has the potential to be broadened toward providing feedback on, and characterization of, speech changes and improvements in impaired populations. This would be of interest to persons with speech disorders, speech-language pathologists, engineers, and physicians. Full article
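
For orientation, the sketch below shows what AU extraction with Py-Feat, the tool named in the abstract, can look like. The file name and the per-trial averaging step are illustrative assumptions rather than the authors' exact pipeline.

```python
# Sketch: facial action unit extraction from a speech video with Py-Feat.
from feat import Detector

detector = Detector()                             # default face/AU models
fex = detector.detect_video("phoneme_trial.mp4")  # hypothetical trial clip

# Per-frame AU activations; averaging gives a coarse per-trial AU profile.
au_profile = fex.aus.mean(axis=0)
print(au_profile.sort_values(ascending=False).head(5))
```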

21 pages, 4948 KB  
Article
Simultaneous Localization of Two Talkers Placed in an Area Surrounded by Asynchronous Six-Microphone Arrays
by Toru Takahashi, Taiki Kanbayashi and Masato Nakayama
Electronics 2025, 14(4), 711; https://doi.org/10.3390/electronics14040711 - 12 Feb 2025
Cited by 2 | Viewed by 3616
Abstract
If we can understand dialogue activities, it becomes possible to identify the role of each person in a discussion and to provide basic material for formulating facilitation strategies. Such understanding can be expected to benefit business negotiations, group work, active learning, etc. To develop a system that can monitor speech activity over a wide area, we propose a method for detecting multiple acoustic events and localizing sound sources using asynchronous distributed microphone arrays arranged in a regular hexagonal repeating structure. In contrast to conventional methods that triangulate sound source directions estimated by individual microphone arrays, the proposed method detects acoustic events and determines sound source positions from the local maxima of a spatial energy distribution estimated over the observation space. We evaluated the conventional and proposed methods in an experimental environment in which a dialogue between two people was simulated under 22,104 conditions by convolving source signals with measured impulse responses. We found that performance changes depending on the selection of the microphone arrays used for estimation, and that it is best to choose the five microphone arrays closest to the evaluated position. Full article
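
To make the idea of a spatial energy distribution concrete, the sketch below computes a delay-and-sum energy map over a 2-D grid of candidate positions, in which talkers would appear as local maxima. It is a simplification: the geometry, sample rate, and sound speed are placeholder values, and it assumes synchronized channels, whereas the paper's arrays are asynchronous.

```python
# Sketch: delay-and-sum spatial energy map over a 2-D grid (synchronized case).
import numpy as np

C = 343.0        # speed of sound (m/s), placeholder
FS = 16000       # sample rate (Hz), placeholder

def spatial_energy_map(signals, mic_pos, grid):
    """signals: (n_mics, samples); mic_pos, grid: (n, 2) coordinates in m."""
    energy = np.zeros(len(grid))
    for g, point in enumerate(grid):
        delays = np.linalg.norm(mic_pos - point, axis=1) / C  # propagation
        shifts = np.round((delays - delays.min()) * FS).astype(int)
        # Advance each channel by its relative delay, then sum coherently;
        # np.roll wraps around, which is acceptable for short sketch signals.
        aligned = [np.roll(sig, -s) for sig, s in zip(signals, shifts)]
        energy[g] = np.sum(np.sum(aligned, axis=0) ** 2)
    return energy  # talkers appear as local maxima over the grid
```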
