Search Results (232)

Search Parameters:
Keywords = Mel spectrograms

5 pages, 1305 KB  
Proceeding Paper
Audiovisual Fusion Technique for Detecting Sensitive Content in Videos
by Daniel Povedano Álvarez, Ana Lucila Sandoval Orozco and Luis Javier García Villalba
Eng. Proc. 2026, 123(1), 11; https://doi.org/10.3390/engproc2026123011 - 2 Feb 2026
Abstract
The detection of sensitive content in online videos is a key challenge for ensuring digital safety and effective content moderation. This work proposes Multimodal Audiovisual Attention (MAV-Att), a multimodal deep learning framework that jointly exploits audio and visual cues to improve detection accuracy. The model was evaluated on the LSPD dataset, comprising 52,427 video segments of 20 s each, with optimized keyframe extraction. MAV-Att consists of dual audio and image branches enhanced by attention mechanisms to capture both temporal and cross-modal dependencies. Trained using a joint optimisation loss, the system achieved F1-scores of 94.9% on segments and 94.5% on entire videos, surpassing previous state-of-the-art models by 6.75%.
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)
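For readers wanting a concrete starting point, the following PyTorch sketch shows the general pattern of a dual-branch audiovisual model with cross-modal attention. It is an illustration under assumed layer sizes and names, not the authors' MAV-Att implementation.

```python
# Illustrative sketch only: layer sizes and structure are assumptions,
# not the published MAV-Att architecture.
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, embed_dim=128, num_classes=2):
        super().__init__()
        # Audio branch: 2-D convolutions over a log-Mel spectrogram (1 x mels x frames).
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(32 * 8 * 8, embed_dim),
        )
        # Visual branch: convolutions over an RGB keyframe (3 x H x W).
        self.visual_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(32 * 8 * 8, embed_dim),
        )
        # Cross-modal attention: the audio embedding attends to the visual embedding.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, mel, frame):
        a = self.audio_branch(mel).unsqueeze(1)     # (B, 1, D)
        v = self.visual_branch(frame).unsqueeze(1)  # (B, 1, D)
        fused, _ = self.cross_attn(a, v, v)         # audio queries visual keys/values
        return self.classifier(torch.cat([fused.squeeze(1), a.squeeze(1)], dim=-1))

# Example forward pass on dummy tensors (batch of 4).
model = DualBranchFusion()
logits = model(torch.randn(4, 1, 64, 128), torch.randn(4, 3, 112, 112))
print(logits.shape)  # torch.Size([4, 2])
```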

20 pages, 1950 KB  
Article
Anomalous Sound Detection by Fusing Spectral Enhancement and Frequency-Gated Attention
by Zhongqin Bi, Jun Jiang, Weina Zhang and Meijing Shan
Mathematics 2026, 14(3), 530; https://doi.org/10.3390/math14030530 - 2 Feb 2026
Viewed by 34
Abstract
Unsupervised anomalous sound detection aims to learn acoustic features solely from the operational sounds of normal equipment and identify potential anomalies based on these features. Recent self-supervised classification frameworks based on machine ID metadata have achieved promising results, but they still face two challenges in industrial acoustic scenarios: Log-Mel spectrograms tend to weaken high-frequency details, leading to insufficient spectral characterization, and when normal sounds from different machine IDs are highly similar, classification constraints alone struggle to form clear intra-class structures and inter-class boundaries, resulting in false positives. To address these issues, this paper proposes FGASpecNet, an anomaly detection model integrating spectral enhancement and frequency-gated attention. For feature modeling, a spectral enhancement branch is designed to explicitly supplement spectral details, while a frequency-gated attention mechanism highlights key frequency bands and temporal intervals conditioned on temporal context. Regarding loss design, a joint training strategy combining classification loss and metric learning loss is adopted. Multi-center prototypes enhance intra-class compactness and inter-class separability, improving detection performance in scenarios with similar machine IDs. Experimental results on the DCASE 2020 Challenge Task 2 for anomalous sound detection demonstrate that FGASpecNet achieves 95.04% average AUC and 89.68% pAUC, validating the effectiveness of the proposed approach.
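As a rough illustration of the frequency-gating idea described above (not the published FGASpecNet module), the following PyTorch sketch re-weights Mel bands with a gate conditioned on the temporal average of a log-Mel spectrogram; all dimensions are assumptions.

```python
# Minimal sketch of a frequency-gating idea over a log-Mel spectrogram;
# assumption-based illustration, not the published FGASpecNet module.
import torch
import torch.nn as nn

class FrequencyGate(nn.Module):
    """Re-weights Mel frequency bands with a gate conditioned on temporal context."""
    def __init__(self, n_mels=128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(n_mels, n_mels // 4), nn.ReLU(),
            nn.Linear(n_mels // 4, n_mels), nn.Sigmoid(),
        )

    def forward(self, logmel):                 # logmel: (batch, n_mels, frames)
        context = logmel.mean(dim=2)           # average over time -> (batch, n_mels)
        weights = self.gate(context)           # per-band gate in [0, 1]
        return logmel * weights.unsqueeze(-1)  # emphasize informative bands

gate = FrequencyGate(n_mels=128)
x = torch.randn(8, 128, 313)                   # e.g. 10 s of audio at ~32 ms hop
print(gate(x).shape)                           # torch.Size([8, 128, 313])
```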

19 pages, 3617 KB  
Article
Deep Learning-Based Classification of Common Lung Sounds via Auto-Detected Respiratory Cycles
by Mustafa Alptekin Engin, Rukiye Uzun Arslan, İrem Senyer Yapici, Selim Aras and Ali Gangal
Bioengineering 2026, 13(2), 170; https://doi.org/10.3390/bioengineering13020170 - 30 Jan 2026
Viewed by 212
Abstract
Chronic respiratory diseases, the third leading cause of mortality on a global scale, can be diagnosed at an early stage through non-invasive auscultation. However, effective manual differentiation of lung sounds (LSs) requires not only sharp auditory skills but also significant clinical experience. With technological advancements, artificial intelligence (AI) has demonstrated the capability to distinguish LSs with accuracy comparable to or surpassing that of human experts. This study broadly compares the methods used in AI-based LS classification. Firstly, respiratory cycles (consisting of inhalation and exhalation parts whose lengths vary across individuals, obtained and labelled under expert guidance) were automatically detected using a series of signal processing procedures, yielding a database of common LSs. This database was then classified using various time-frequency representations, such as spectrograms, scalograms, Mel-spectrograms and gammatonegrams, for comparison. Proven convolutional neural network (CNN)-based pre-trained models were applied through transfer learning to extract the features employed in classification. The performances of a CNN, a CNN and Long Short-Term Memory (LSTM) hybrid architecture, and support vector machine methods were compared. When gammatonegrams, which capture the spectral structure of signals in the low-frequency range with high fidelity and are resistant to noise, are combined with a CNN architecture, the best classification accuracy of 97.3% ± 1.9% is obtained.
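To make the comparison of time-frequency representations concrete, here is a small librosa sketch of two of them (linear-frequency STFT spectrogram and Mel spectrogram). Scalograms and gammatonegrams require additional packages and are omitted; the file path and parameter values are placeholders, not the paper's settings.

```python
# Sketch of two of the time-frequency representations compared in the paper,
# using librosa. The file path and parameters are placeholders.
import librosa
import numpy as np

y, sr = librosa.load("lung_sound.wav", sr=4000)   # lung sounds occupy low frequencies

# Linear-frequency STFT spectrogram (in dB).
stft_db = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=512, hop_length=128)),
                                  ref=np.max)

# Mel spectrogram (in dB) with 64 Mel bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=128, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)

print(stft_db.shape, mel_db.shape)  # (257, frames) vs. (64, frames)
```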

41 pages, 2850 KB  
Article
Automated Classification of Humpback Whale Calls Using Deep Learning: A Comparative Study of Neural Architectures and Acoustic Feature Representations
by Jack C. Johnson and Yue Rong
Sensors 2026, 26(2), 715; https://doi.org/10.3390/s26020715 - 21 Jan 2026
Viewed by 185
Abstract
Passive acoustic monitoring (PAM) using hydrophones enables acoustic data to be collected in large and diverse quantities, necessitating a reliable automated classification system. This paper presents a data-processing pipeline and a set of neural networks designed for a humpback-whale-detection system. A collection of audio segments is compiled from publicly available audio repositories and extensively curated by hand, with thorough examination, editing and clipping to produce a dataset that minimizes bias and categorization errors. An array of standard data-augmentation techniques is applied to the collected audio, diversifying and expanding the original dataset. Multiple neural networks are designed and trained using the TensorFlow 2.20.0 and Keras 3.13.1 frameworks, resulting in a custom architecture developed through research and iterative improvement. The pre-trained model MobileNetV2 is also included for further analysis. Model performance demonstrates a strong dependence on both feature representation and network architecture. Mel spectrogram inputs consistently outperformed MFCC (Mel-Frequency Cepstral Coefficients) features across all model types. The highest performance was achieved by the pretrained MobileNetV2 using mel spectrograms without augmentation, reaching a test accuracy of 99.01% with balanced precision and recall of 99% and a Matthews correlation coefficient of 0.98. The custom CNN with mel spectrograms also achieved strong performance, with 98.92% accuracy and a false negative rate of only 0.75%. In contrast, models trained with MFCC representations exhibited consistently lower robustness and higher false negative rates. These results highlight the comparative strengths of the evaluated feature representations and network architectures for humpback whale detection.
(This article belongs to the Section Sensor Networks)
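A minimal Keras sketch of the transfer-learning pattern used here (a frozen MobileNetV2 backbone over mel-spectrogram "images" with a binary head) is shown below. The input size, head layout and hyperparameters are assumptions, not the paper's settings, and the training datasets are not shown.

```python
# Minimal transfer-learning sketch: MobileNetV2 on mel-spectrogram "images".
# Input size, head, and hyperparameters are assumptions, not the paper's settings.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3),   # mel spectrograms resized and tiled to 3 channels
    include_top=False,
    weights="imagenet",
    pooling="avg",
)
base.trainable = False           # freeze the pretrained backbone initially

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # whale call vs. background
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # datasets not shown here
```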

18 pages, 3170 KB  
Article
A Terrain Perception Method for Quadruped Robots Based on Acoustic Signal Fusion
by Meng Hong, Nian Wang, Xingyu Liu, Chao Huang, Ganchang Li, Zijian Li, Shuai Shu, Ruixuan Chen, Jincheng Sheng, Zhongren Wang, Sijia Guan and Min Guo
Sensors 2026, 26(2), 594; https://doi.org/10.3390/s26020594 - 15 Jan 2026
Viewed by 221
Abstract
In unstructured environments, terrain perception is essential for the stability and environmental awareness of quadruped robot locomotion. Existing approaches primarily rely on visual or proprioceptive signals, but their effectiveness is limited under conditions of visual occlusion or ambiguous terrain features. To address this, the study proposes a multimodal terrain perception method that integrates acoustic features with proprioceptive signals. The method collects environmental acoustic information through an externally mounted sound sensor and combines the sound signal with proprioceptive data from the IMU and joint encoders of the quadruped robot. It was deployed on the quadruped robot Lite2 platform developed by Deep Robotics, and experiments were conducted on four representative terrain types: concrete, gravel, sand, and carpet. Mel-spectrogram features are extracted from the acoustic signals and concatenated with the IMU and joint encoder data to form feature vectors, which are subsequently fed into a support vector machine for terrain classification. For each terrain type, 400 s of data were collected. Experimental results show that the terrain classification accuracy reaches 78.28% without acoustic signals and increases to 82.52% when acoustic features are incorporated. To further enhance classification performance, the study performs a combined exploration of the SVM hyperparameters C and γ as well as the time-window length win. The final results demonstrate that the classification accuracy can be improved to as high as 99.53% across all four terrains.
(This article belongs to the Special Issue Dynamics and Control System Design for Robotics)
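The fusion-plus-SVM idea can be sketched with scikit-learn as below: per-window mel statistics concatenated with proprioceptive features, followed by a grid search over C and gamma. The arrays are synthetic stand-ins, not the paper's robot data, and the feature definitions are assumptions.

```python
# Sketch of feature fusion plus SVM grid search over C and gamma.
# Synthetic arrays stand in for the robot recordings; not the paper's data.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_windows = 400
mel_feats = rng.normal(size=(n_windows, 64))    # e.g. per-window mean of 64 Mel bands
imu_feats = rng.normal(size=(n_windows, 12))    # IMU + joint-encoder statistics
X = np.hstack([mel_feats, imu_feats])           # fused feature vectors
y = rng.integers(0, 4, size=n_windows)          # concrete / gravel / sand / carpet

grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```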

29 pages, 808 KB  
Review
Spectrogram Features for Audio and Speech Analysis
by Ian McLoughlin, Lam Pham, Yan Song, Xiaoxiao Miao, Huy Phan, Pengfei Cai, Qing Gu, Jiang Nan, Haoyu Song and Donny Soh
Appl. Sci. 2026, 16(2), 572; https://doi.org/10.3390/app16020572 - 6 Jan 2026
Viewed by 677
Abstract
Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivation behind spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time–frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a range of machine learning techniques such as convolutional neural networks, which had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks.
(This article belongs to the Special Issue AI in Audio Analysis: Spectrogram-Based Recognition)
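The three characteristics the review highlights (resolution and span of the two axes, and the representation and scaling of each element) map directly onto a handful of parameters in common toolkits. The librosa sketch below shows how they determine the shape and scaling of the resulting matrix; the parameter values are arbitrary examples, and the bundled example clip is downloaded on first use.

```python
# Illustration of how resolution/span (n_fft, hop_length, n_mels, fmax) and
# element scaling (power vs. dB) shape the spectrogram matrix. Example values only.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))      # bundled example clip

for n_fft, hop, n_mels in [(512, 128, 40), (2048, 512, 128)]:
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_mels, fmax=8000)
    S_db = librosa.power_to_db(S, ref=np.max)    # log (dB) scaling of each element
    print(f"n_fft={n_fft:4d} hop={hop:3d} n_mels={n_mels:3d} -> matrix {S_db.shape}")
```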

23 pages, 1037 KB  
Article
Acoustic Side-Channel Vulnerabilities in Keyboard Input Explored Through Convolutional Neural Network Modeling: A Pilot Study
by Michał Rzemieniuk, Artur Niewiarowski and Wojciech Książek
Appl. Sci. 2026, 16(2), 563; https://doi.org/10.3390/app16020563 - 6 Jan 2026
Viewed by 350
Abstract
This paper presents the findings of a pilot study investigating the feasibility of recognizing keyboard keystroke sounds using Convolutional Neural Networks (CNNs) as a means of simulating an acoustic side-channel attack aimed at recovering typed text. A dedicated dataset of keyboard audio recordings was collected and preprocessed using signal-processing techniques, including Fourier-transform-based feature extraction and mel-spectrogram analysis. Data augmentation methods were applied to improve model robustness, and a CNN-based prediction architecture was developed and trained. A series of experiments was performed under multiple conditions, including controlled laboratory settings, scenarios with background noise interference, tests involving a different keyboard model, and evaluations following model quantization. The results indicate that CNN-based models can achieve high keystroke-prediction accuracy, demonstrating that this class of acoustic side-channel attacks is technically viable. Additionally, the study outlines potential mitigation strategies designed to reduce exposure to such threats. Overall, the findings highlight the need for increased awareness of acoustic side-channel vulnerabilities and underscore the importance of further research to more comprehensively understand, evaluate, and prevent attacks of this nature.
(This article belongs to the Special Issue Artificial Neural Network and Deep Learning in Cybersecurity)
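A typical preprocessing stage for this kind of study is sketched below: detect keystroke onsets from short-time energy, cut a fixed window around each one, and convert it to a mel spectrogram. The thresholds, window sizes and file path are assumptions, not the authors' pipeline.

```python
# Hedged sketch of keystroke segmentation plus per-keystroke mel spectrograms.
# Thresholds, window sizes, and the recording path are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("typing_session.wav", sr=44100)    # placeholder recording path

frame, hop = 1024, 256
energy = np.array([np.sum(y[i:i + frame] ** 2)
                   for i in range(0, len(y) - frame, hop)])
threshold = energy.mean() + 3 * energy.std()
candidates = np.where(energy > threshold)[0] * hop       # frames above the threshold

# Keep onsets at least 0.25 s apart so one keystroke is not counted repeatedly.
onsets, last = [], -int(0.25 * sr)
for c in candidates:
    if c - last >= int(0.25 * sr):
        onsets.append(c)
        last = c

spectrograms = []
for onset in onsets:
    clip = y[onset:onset + int(0.2 * sr)]                 # ~200 ms around each keystroke
    if len(clip) == int(0.2 * sr):
        mel = librosa.feature.melspectrogram(y=clip, sr=sr, n_mels=64, hop_length=256)
        spectrograms.append(librosa.power_to_db(mel, ref=np.max))

print(len(spectrograms), "keystroke spectrograms extracted")
```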

14 pages, 1392 KB  
Article
AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots
by Xiugong Qin, Fenghu Pan, Jing Gao, Shilong Huang, Yichen Sun and Xiao Zhong
Electronics 2026, 15(1), 239; https://doi.org/10.3390/electronics15010239 - 5 Jan 2026
Viewed by 329
Abstract
Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications.

22 pages, 1784 KB  
Article
Automated Severity and Breathiness Assessment of Disordered Speech Using a Speech Foundation Model
by Vahid Ashkanichenarlogh, Arman Hassanpour and Vijay Parsa
Information 2026, 17(1), 32; https://doi.org/10.3390/info17010032 - 3 Jan 2026
Viewed by 264
Abstract
In this study, we propose a novel automated model for speech quality estimation that objectively evaluates perceptual dysphonia severity and breathiness in audio samples, demonstrating strong correlation with expert ratings. The proposed model integrates Whisper encoder embeddings with Mel spectrograms augmented by second-order delta features combined with a sequential-attention fusion network feature mapping path. This hybrid approach enhances the model’s sensitivity to phonetic, high-level feature representation, and spectral variations, enabling more accurate predictions of perceptual speech quality. A sequential-attention fusion network feature mapping module captures long-range dependencies through the multi-head attention network, while LSTM layers refine the learned representations by modeling temporal dynamics. Comparative analysis against state-of-the-art methods for dysphonia assessment demonstrates our model’s stronger correlation with clinicians’ judgments across test samples. Our findings underscore the effectiveness of ASR-derived embeddings alongside the deep feature mapping structure in disordered speech quality assessment, offering a promising pathway for advancing automated evaluation systems.

16 pages, 17043 KB  
Article
Research on Sound Recognition of Long-Distance UAV Based on Harmonic Features
by Kuangang Fan, Wenjie Pan, Jilong Zhong, Zhiyu Zeng and Wenzheng Chen
Drones 2026, 10(1), 25; https://doi.org/10.3390/drones10010025 - 1 Jan 2026
Viewed by 371
Abstract
With the extensive application of unmanned aerial vehicles (UAVs) in both military and civilian domains, the significance of UAV identification technology has become increasingly prominent. Among various recognition methods, sound recognition has garnered considerable attention due to its advantages of low cost and easy deployment. However, most existing research primarily focuses on isolating UAV sounds from noise signals in complex environments, with limited studies on long-distance UAV sound recognition. Against this background, this paper proposes a frequency domain feature extraction method based on harmonic features. By analyzing the harmonic features of UAV sounds, we select stable parameters with strong robustness against interference as the main features to minimize information redundancy and feature fluctuation. The experimental results indicate that this method achieves a recognition accuracy of 78.03% for the DJI Phantom 4 Pro V2.0 UAV at a distance of 120 m. To validate the proposed method, comprehensive comparisons against traditional MFCC, Log-Mel Spectrogram, and modern Raw Waveform CNN (M5) baselines demonstrate the superior robustness of the proposed approach. While these comparative methods exhibited significant performance drops in challenging long-distance scenarios (e.g., accuracies falling below 24% for the DJI Mavic Pro), the proposed method maintained consistent identification capabilities, validating its effectiveness in low-signal environments.
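A generic NumPy sketch of harmonic-feature extraction follows: estimate a fundamental from the magnitude spectrum and read off the relative energy at its first few harmonics. The function and its parameters are illustrative, not the paper's exact feature definition; the test signal is synthetic.

```python
# Generic harmonic-feature sketch, not the paper's exact feature set.
import numpy as np

def harmonic_features(y, sr, n_harmonics=5, fmin=80.0, fmax=1000.0):
    spectrum = np.abs(np.fft.rfft(y * np.hanning(len(y))))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    f0 = freqs[band][np.argmax(spectrum[band])]       # crude fundamental estimate
    feats = []
    for k in range(1, n_harmonics + 1):
        idx = np.argmin(np.abs(freqs - k * f0))       # bin nearest the k-th harmonic
        feats.append(spectrum[idx] / (spectrum.max() + 1e-9))
    return f0, np.array(feats)

sr = 16000
t = np.arange(sr) / sr                                # 1 s synthetic rotor-like tone
y = sum(np.sin(2 * np.pi * 180 * k * t) / k for k in range(1, 6))
f0, feats = harmonic_features(y, sr)
print(round(f0, 1), np.round(feats, 3))
```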

19 pages, 1187 KB  
Article
Dual-Pipeline Machine Learning Framework for Automated Interpretation of Pilot Communications at Non-Towered Airports
by Abdullah All Tanvir, Chenyu Huang, Moe Alahmad, Chuyang Yang and Xin Zhong
Aerospace 2026, 13(1), 32; https://doi.org/10.3390/aerospace13010032 - 28 Dec 2025
Viewed by 334
Abstract
Accurate estimation of aircraft operations, such as takeoffs and landings, is critical for airport planning and resource allocation, yet it remains particularly challenging at non-towered airports, where no dedicated surveillance infrastructure exists. Existing solutions, including video analytics, acoustic sensors, and transponder-based systems, are often costly, incomplete, or unreliable in environments with mixed traffic and inconsistent radio usage, highlighting the need for a scalable, infrastructure-free alternative. To address this gap, this study proposes a novel dual-pipeline machine learning framework that classifies pilot radio communications using both textual and spectral features to infer operational intent. A total of 2489 annotated pilot transmissions collected from a U.S. non-towered airport were processed through automatic speech recognition (ASR) and Mel-spectrogram extraction. We benchmarked multiple traditional classifiers and deep learning models, including ensemble methods, long short-term memory (LSTM) networks, and convolutional neural networks (CNNs), across both feature pipelines. Results show that spectral features paired with deep architectures consistently achieved the highest performance, with F1-scores exceeding 91% despite substantial background noise, overlapping transmissions, and speaker variability. These findings indicate that operational intent can be inferred reliably from existing communication audio alone, offering a practical, low-cost path toward scalable aircraft operations monitoring and supporting emerging virtual tower and automated air traffic surveillance applications. Full article
(This article belongs to the Special Issue AI, Machine Learning and Automation for Air Traffic Control (ATC))
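The dual-pipeline idea can be sketched as two independent feature paths: a text branch over ASR transcripts and a spectral branch over mel spectrograms. The transcripts, labels, and helper below are invented placeholders, and the ASR step itself is outside the snippet.

```python
# Hedged sketch of a dual-pipeline setup: TF-IDF text branch plus a mel-spectrogram
# branch. All transcripts and labels are invented placeholders, not the paper's data.
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Text pipeline: classify intent from ASR transcripts.
transcripts = [
    "cessna three four lima departing runway one eight",
    "lining up runway one eight for departure",
    "entering left downwind for landing runway one eight",
    "short final runway one eight full stop",
]
labels = [0, 0, 1, 1]                       # 0 = takeoff intent, 1 = landing intent
vectorizer = TfidfVectorizer()
text_clf = LogisticRegression(max_iter=1000)
text_clf.fit(vectorizer.fit_transform(transcripts), labels)

# Spectral pipeline: per-band mel statistics for a separate classifier.
def mel_features(path, sr=16000, n_mels=64):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max).mean(axis=1)   # per-band averages

# feats = np.stack([mel_features(p) for p in audio_paths])     # paths not shown here
```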

19 pages, 3374 KB  
Article
SpaceNet: A Multimodal Fusion Architecture for Sound Source Localization in Disaster Response
by Long Nguyen-Vu and Jonghoon Lee
Sensors 2026, 26(1), 168; https://doi.org/10.3390/s26010168 - 26 Dec 2025
Viewed by 336
Abstract
Sound source localization (SSL) has evolved from traditional signal-processing methods to sophisticated deep-learning architectures. However, applying these to distributed microphone arrays in adverse environments is complicated by high reverberation and potential sensor asynchrony, which can corrupt crucial Time-Difference-of-Arrival (TDoA) information. We introduce SpaceNet, a multimodal deep-learning architecture designed to address such issues by explicitly fusing audio features with sensor geometry. SpaceNet features: (1) a dual-branch architecture with specialized spatial processing that decomposes microphone geometry into distances, azimuths, and elevations; and (2) a feature-normalization technique to ensure stable multimodal training. Evaluation on real-world datasets from disaster sites demonstrates that SpaceNet, when trained on ILD-only mel-spectra, achieves better accuracy compared to our baseline model (CHAWA) and identical models trained on full mel-spectrograms. This approach also reduces computational overhead by a factor of 24. Our findings suggest that for distributed arrays in adverse environments, time-invariant ILD cues are a more effective and efficient feature for localization than complex temporal features corrupted by reverberation and synchronization errors.
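For intuition about ILD-style features, the sketch below computes inter-channel level differences from per-channel mel spectra relative to a reference microphone. It is a generic example with a synthetic multichannel array, not SpaceNet's actual feature pipeline.

```python
# Generic inter-channel level-difference (ILD-style) features from per-channel
# mel spectra; synthetic data, not SpaceNet's implementation.
import numpy as np
import librosa

def ild_features(multichannel, sr, ref_channel=0, n_mels=64):
    """multichannel: array of shape (n_mics, n_samples)."""
    mel_db = []
    for ch in multichannel:
        mel = librosa.feature.melspectrogram(y=ch, sr=sr, n_mels=n_mels)
        mel_db.append(librosa.power_to_db(mel))       # common dB reference across mics
    mel_db = np.stack(mel_db)                         # (n_mics, n_mels, frames)
    # Level difference of every mic against the reference, averaged over time.
    ild = mel_db - mel_db[ref_channel:ref_channel + 1]
    return ild.mean(axis=2)                           # (n_mics, n_mels)

rng = np.random.default_rng(1)
mics = rng.normal(size=(4, 16000))                    # 4 mics, 1 s of noise at 16 kHz
print(ild_features(mics, sr=16000).shape)             # (4, 64)
```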

25 pages, 1229 KB  
Article
YOLO-Based Transfer Learning for Sound Event Detection Using Visual Object Detection Techniques
by Sergio Segovia González, Sara Barahona Quiros and Doroteo T. Toledano
Appl. Sci. 2026, 16(1), 205; https://doi.org/10.3390/app16010205 - 24 Dec 2025
Viewed by 488
Abstract
Traditional Sound Event Detection (SED) approaches are based on either specialized models or these models in combination with general audio embedding extractors. In this article, we propose to reframe SED as an object detection task in the time–frequency plane and introduce a direct adaptation of modern YOLO detectors to audio. To our knowledge, this is among the first works to employ YOLOv8 and YOLOv11 not merely as feature extractors but as end-to-end models that localize and classify sound events on mel-spectrograms. Methodologically, our approach (i) generates mel-spectrograms on the fly from raw audio to streamline the pipeline and enable transfer learning from vision models; (ii) applies curriculum learning that exposes the detector to progressively more complex mixtures, improving robustness to overlaps; and (iii) augments training with synthetic audio constructed under DCASE 2023 guidelines to enrich rare classes and challenging scenarios. Comprehensive experiments compare our YOLO-based framework against strong CRNN and Conformer baselines. In our experiments on the DCASE-style setting, the method achieves competitive detection accuracy relative to CRNN and Conformer baselines, with gains in some overlapping/noisy conditions and shortcomings for several short-duration classes. These results suggest that adapting modern object detectors to audio can be effective in this setting, while broader generalization and encoder-augmented comparisons remain open.
(This article belongs to the Special Issue Advances in Audio Signal Processing)
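Treating SED as object detection requires turning labelled events into boxes on the spectrogram image. The sketch below converts an onset/offset and rough frequency band into a YOLO-style normalized box; the conversion, axis conventions and values are illustrative assumptions, not the authors' data pipeline.

```python
# Illustrative conversion of a labelled sound event into a YOLO-style box on a
# mel-spectrogram image. Axis conventions and values are assumptions.
import numpy as np
import librosa

def event_to_yolo_box(onset, offset, f_low, f_high, clip_dur, sr, n_mels, fmax=None):
    fmax = fmax or sr / 2
    mel_freqs = librosa.mel_frequencies(n_mels=n_mels, fmax=fmax)
    # x axis = time (normalized to clip duration), y axis = normalized mel-band index.
    x_c = ((onset + offset) / 2) / clip_dur
    w = (offset - onset) / clip_dur
    lo = np.searchsorted(mel_freqs, f_low) / n_mels
    hi = np.searchsorted(mel_freqs, f_high) / n_mels
    y_c, h = (lo + hi) / 2, hi - lo
    return x_c, y_c, w, h        # a class id would be prepended in the YOLO label format

# A 1.2 s event between 300 Hz and 3 kHz inside a 10 s clip:
print(event_to_yolo_box(4.0, 5.2, 300, 3000, clip_dur=10.0, sr=16000, n_mels=128))
# Training with a detector such as ultralytics YOLO would then follow the usual
# image-detection workflow on the rendered spectrogram images (not shown here).
```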

28 pages, 3628 KB  
Article
ADFF-Net: An Attention-Based Dual-Stream Feature Fusion Network for Respiratory Sound Classification
by Bing Zhu, Lijun Chen, Xiaoling Li, Songnan Zhao, Shaode Yu and Qiurui Sun
Technologies 2026, 14(1), 12; https://doi.org/10.3390/technologies14010012 - 24 Dec 2025
Viewed by 506
Abstract
Deep learning-based respiratory sound classification (RSC) has emerged as a promising non-invasive approach to assist clinical diagnosis. However, existing methods often face challenges, such as sub-optimal feature representation and limited model expressiveness. To address these issues, we propose an Attention-based Dual-stream Feature Fusion Network (ADFF-Net). Built upon the pre-trained Audio Spectrogram Transformer, ADFF-Net takes Mel-filter bank and Mel-spectrogram features as dual-stream inputs, while an attention-based fusion module with a skip connection is introduced to preserve both the raw energy and the relevant tonal variations within the multi-scale time–frequency representation. Extensive experiments on the ICBHI2017 database with the official train–test split show that, despite a critically low sensitivity of 42.91%, ADFF-Net achieves state-of-the-art performance in terms of aggregated metrics in the four-class RSC task, with an overall accuracy of 64.95%, specificity of 81.39%, and harmonic score of 62.14%. The results confirm the effectiveness of the proposed attention-based dual-stream acoustic feature fusion module for the RSC task, while also highlighting substantial room for improving the detection of abnormal respiratory events. Furthermore, we outline several promising research directions, including addressing class imbalance, enriching signal diversity, advancing network design, and enhancing model interpretability.
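As a generic illustration of attention-based fusion of two acoustic feature streams with a skip connection (not the published ADFF-Net module), a minimal PyTorch sketch is given below; dimensions and names are assumptions.

```python
# Minimal sketch of attention-based fusion of two feature streams with a skip
# connection; generic illustration, not the published ADFF-Net module.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fbank_feat, mel_feat):
        # One stream queries the other; the skip connection preserves the raw stream.
        fused, _ = self.attn(query=fbank_feat, key=mel_feat, value=mel_feat)
        return self.norm(fused + fbank_feat)            # residual / skip connection

fusion = AttentionFusion(dim=256)
a = torch.randn(2, 100, 256)   # e.g. Mel-filter-bank stream: (batch, frames, dim)
b = torch.randn(2, 100, 256)   # e.g. Mel-spectrogram stream after a backbone
print(fusion(a, b).shape)      # torch.Size([2, 100, 256])
```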

16 pages, 6746 KB  
Article
Cross-Attentive CNNs for Joint Spectral and Pitch Feature Learning in Predominant Instrument Recognition from Polyphonic Music
by Lekshmi Chandrika Reghunath, Rajeev Rajan, Christian Napoli and Cristian Randieri
Technologies 2026, 14(1), 3; https://doi.org/10.3390/technologies14010003 - 19 Dec 2025
Viewed by 339
Abstract
Identifying instruments in polyphonic audio is challenging due to overlapping spectra and variations in timbre and playing styles. This task is central to music information retrieval, with applications in transcription, recommendation, and indexing. We propose a dual-branch Convolutional Neural Network (CNN) that processes Mel-spectrograms and binary pitch masks, fused through a cross-attention mechanism to emphasize pitch-salient regions. On the IRMAS dataset, the model achieves competitive performance with state-of-the-art methods, reaching a micro F1 of 0.64 and a macro F1 of 0.57 with only 0.878M parameters. Ablation studies and t-SNE analyses further highlight the benefits of cross-modal attention for robust predominant instrument recognition.