Search Results (187)

Search Parameters:
Keywords = Mel frequency cepstral coefficients (MFCC) features

41 pages, 2850 KB  
Article
Automated Classification of Humpback Whale Calls Using Deep Learning: A Comparative Study of Neural Architectures and Acoustic Feature Representations
by Jack C. Johnson and Yue Rong
Sensors 2026, 26(2), 715; https://doi.org/10.3390/s26020715 - 21 Jan 2026
Viewed by 114
Abstract
Passive acoustic monitoring (PAM) using hydrophones allows acoustic data to be collected in large and diverse quantities, creating the need for a reliable automated classification system. This paper presents a data-processing pipeline and a set of neural networks designed for a humpback-whale-detection system. A collection of audio segments is compiled from publicly available audio repositories and extensively curated by hand, with thorough examination, editing, and clipping to produce a dataset that minimizes bias and categorization errors. An array of standard data-augmentation techniques is applied to the collected audio, diversifying and expanding the original dataset. Multiple neural networks are designed and trained using the TensorFlow 2.20.0 and Keras 3.13.1 frameworks, resulting in a custom architecture developed through research and iterative refinement. The pre-trained model MobileNetV2 is also included for further analysis. Model performance demonstrates a strong dependence on both feature representation and network architecture. Mel spectrogram inputs consistently outperformed MFCC (Mel-Frequency Cepstral Coefficients) features across all model types. The highest performance was achieved by the pre-trained MobileNetV2 using mel spectrograms without augmentation, reaching a test accuracy of 99.01% with balanced precision and recall of 99% and a Matthews correlation coefficient of 0.98. The custom CNN with mel spectrograms also achieved strong performance, with 98.92% accuracy and a false negative rate of only 0.75%. In contrast, models trained with MFCC representations exhibited consistently lower robustness and higher false negative rates. These results highlight the comparative strengths of the evaluated feature representations and network architectures for humpback whale detection. Full article
(This article belongs to the Section Sensor Networks)
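A minimal sketch of the two input representations compared above, computed with librosa; the file name, sample rate, and frame parameters are assumptions rather than the authors' settings.

```python
import librosa
import numpy as np

# Hypothetical hydrophone clip; sample rate and frame sizes are illustrative.
y, sr = librosa.load("whale_call.wav", sr=16000, mono=True)

# Log-scaled mel spectrogram (the representation reported to perform best)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

# MFCCs derived from the same mel filter bank
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=1024, hop_length=256)

print(log_mel.shape, mfcc.shape)  # (n_mels, n_frames) and (n_mfcc, n_frames)
```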

29 pages, 808 KB  
Review
Spectrogram Features for Audio and Speech Analysis
by Ian McLoughlin, Lam Pham, Yan Song, Xiaoxiao Miao, Huy Phan, Pengfei Cai, Qing Gu, Jiang Nan, Haoyu Song and Donny Soh
Appl. Sci. 2026, 16(2), 572; https://doi.org/10.3390/app16020572 - 6 Jan 2026
Viewed by 502
Abstract
Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivation behind spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time–frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a range of machine learning techniques such as convolutional neural networks, which had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks. Full article
(This article belongs to the Special Issue AI in Audio Analysis: Spectrogram-Based Recognition)
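The three spectrogram characteristics discussed in this review, i.e. axis resolution, span, and element scaling, can be made concrete with a short sketch; the window sizes below are arbitrary illustrations, not values from the paper.

```python
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))  # bundled librosa example clip

# Larger n_fft buys frequency resolution at the cost of time resolution;
# amplitude_to_db chooses a logarithmic (dB) scaling for each element.
for n_fft in (256, 1024, 4096):
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 4))
    S_db = librosa.amplitude_to_db(S, ref=np.max)
    print(f"n_fft={n_fft}: {S.shape[0]} frequency bins x {S.shape[1]} time frames")
```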

11 pages, 1725 KB  
Article
Tool Wear Detection in Milling Using Convolutional Neural Networks and Audible Sound Signals
by Halil Ibrahim Turan and Ali Mamedov
Machines 2026, 14(1), 59; https://doi.org/10.3390/machines14010059 - 2 Jan 2026
Viewed by 357
Abstract
Timely tool wear detection has been an important target for the metal cutting industry for decades because of its significance for part quality and production cost control. With the shift toward intelligent and sustainable manufacturing, reliable tool-condition monitoring has become even more critical. One of the main challenges in sound-based tool wear monitoring is the presence of noise interference, instability and the highly volatile nature of machining acoustics, which complicates the extraction of meaningful features. In this study, a Convolutional Neural Network (CNN) model is proposed to classify tool wear conditions in milling operations using acoustic signals. Sound recordings were collected from tools at different wear stages under two cutting speeds, and Mel-Frequency Cepstral Coefficients (MFCCs) were extracted to obtain a compact representation of the short-term power spectrum. These MFCC matrices enabled the CNN to learn discriminative spectral patterns associated with wear. To evaluate model stability and reduce the effects of algorithmic randomness, training was repeated three times for each cutting speed. For the 520 rpm dataset, the model achieved an average validation accuracy of 96.85 ± 2.07%, while for the 635 rpm dataset it achieved 93.69 ± 2.07%. The results demonstrate the feasibility of using acoustic signals, despite inherent noise challenges, as a complementary approach for identifying suitable tool replacement intervals in milling. Full article
(This article belongs to the Special Issue Intelligent Tool Wear Monitoring)
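A minimal sketch of the kind of classifier this abstract describes, a small 2D CNN over fixed-size MFCC matrices; the input shape, layer sizes, and number of wear classes are assumptions rather than the authors' configuration.

```python
import tensorflow as tf

n_mfcc, n_frames, n_classes = 20, 128, 3  # assumed MFCC-matrix shape and wear states

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_mfcc, n_frames, 1)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```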

20 pages, 2548 KB  
Article
Fault Diagnosis of Motor Bearing Transmission System Based on Acoustic Characteristics
by Long Ma, Yan Zhang and Zhongqiu Wang
Sensors 2026, 26(1), 259; https://doi.org/10.3390/s26010259 - 31 Dec 2025
Viewed by 470
Abstract
Traditional vibration-based methods for bearing fault diagnosis, while prevalent, often require contact measurement, whereas acoustic signals are broadband relative to vibration signals. To overcome these limitations, this paper exploits the advantages of acoustic signals, namely non-contact sensing and rich broadband information, and proposes a fault diagnosis framework based on acoustic features and deep learning. The core of our method is a CNN–attention mechanism–LSTM model, specifically designed to process one-dimensional sequential features: the 1D-CNN extracts local features from Mel frequency cepstral coefficient (MFCC) features, the attention mechanism (selecting ECA as the optimal solution) selectively enhances features, and the LSTM captures temporal dependencies, collectively enabling effective classification of fault types. Furthermore, to enhance model efficiency, a ReliefF-based feature selection algorithm is employed to identify and retain only the most discriminative acoustic features. Experimental results demonstrate that the proposed method achieves an average diagnostic accuracy of 99.90% in distinguishing normal, inner-ring, outer-ring, and mixed-defect bearings. Notably, after applying the feature selection algorithm, the number of parameters and the estimated total model size are significantly reduced while the accuracy remains essentially unchanged. This work validates the effectiveness of non-contact acoustic solutions for bearing fault diagnosis and shows strong potential for industrial applications. Full article
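The CNN–attention–LSTM arrangement described above can be sketched roughly as follows; the ECA-style gate is a simplified stand-in, and all dimensions and layer widths are assumptions, not the authors' design.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_frames, n_mfcc, n_classes = 200, 13, 4  # normal, inner ring, outer ring, mixed

inp = layers.Input(shape=(n_frames, n_mfcc))
x = layers.Conv1D(32, 5, padding="same", activation="relu")(inp)  # local features

# ECA-style channel attention: pool over time, 1-D conv across channels, sigmoid gate
w = layers.GlobalAveragePooling1D()(x)        # (batch, 32)
w = layers.Reshape((32, 1))(w)
w = layers.Conv1D(1, 3, padding="same", activation="sigmoid")(w)
w = layers.Reshape((1, 32))(w)
x = layers.Multiply()([x, w])                 # re-weight channels over all frames

x = layers.LSTM(64)(x)                        # temporal dependencies
out = layers.Dense(n_classes, activation="softmax")(x)

model = tf.keras.Model(inp, out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```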

17 pages, 1042 KB  
Article
Cross-Cultural Identification of Acoustic Voice Features for Depression: A Cross-Sectional Study of Vietnamese and Japanese Datasets
by Phuc Truong Vinh Le, Mitsuteru Nakamura, Masakazu Higuchi, Lanh Thi My Vuu, Nhu Huynh and Shinichi Tokuno
Bioengineering 2026, 13(1), 33; https://doi.org/10.3390/bioengineering13010033 - 27 Dec 2025
Viewed by 462
Abstract
Acoustic voice analysis demonstrates potential as a non-invasive biomarker for depression, yet its generalizability across languages remains underexplored. This cross-sectional study aimed to identify a set of cross-culturally consistent acoustic features for depression screening using distinct Vietnamese and Japanese voice datasets. We analyzed anonymized recordings from 251 participants, comprising 123 Vietnamese individuals assessed via the self-report Beck Depression Inventory (BDI) and 128 Japanese individuals assessed via the clinician-rated Hamilton Depression Rating Scale (HAM-D). From 6373 features extracted with openSMILE, a multi-stage selection pipeline identified 12 cross-cultural features, primarily from the auditory spectrum (AudSpec), Mel-Frequency Cepstral Coefficients (MFCCs), and logarithmic Harmonics-to-Noise Ratio (logHNR) domains. The cross-cultural model achieved a combined Area Under the Curve (AUC) of 0.934, with performance disparities observed between the Japanese (AUC = 0.993) and Vietnamese (AUC = 0.913) cohorts. This disparity may be attributed to dataset heterogeneity, including mismatched diagnostic tools and differing sample compositions (clinical vs. mixed community). Furthermore, the limited number of high-risk cases (n = 33) warrants cautious interpretation regarding the reliability of reported AUC values for severe depression classification. These findings suggest the presence of a core acoustic signature related to physiological psychomotor changes that may transcend linguistic boundaries. This study advances the exploration of global vocal biomarkers but underscores the need for prospective, standardized multilingual trials to overcome the limitations of secondary data analysis. Full article
(This article belongs to the Special Issue Voice Analysis Techniques for Medical Diagnosis)
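The reported AUC figures come from a screening model built on a small set of selected features; a minimal sketch of that evaluation mechanic with scikit-learn, using synthetic placeholder data rather than the study's cohorts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(251, 12))    # 12 selected acoustic features per participant (synthetic)
y = rng.integers(0, 2, size=251)  # binary depression label (synthetic)

# Cross-validated probabilities, scored with the area under the ROC curve
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]
print("AUC:", roc_auc_score(y, proba))
```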

21 pages, 1302 KB  
Article
Heart Sound Classification with MFCCs and Wavelet Daubechies Analysis Using Machine Learning Algorithms
by Sebastian Guzman-Alfaro, Karen E. Villagrana-Bañuelos, Manuel A. Soto-Murillo, Jorge Isaac Galván-Tejada, Antonio Baltazar-Raigosa, Angel Garcia-Duran, José María Celaya-Padilla and Andrea Acuña-Correa
Diagnostics 2026, 16(1), 83; https://doi.org/10.3390/diagnostics16010083 - 26 Dec 2025
Viewed by 421
Abstract
Background/Objectives: Cardiovascular diseases are the leading cause of mortality worldwide according to the World Health Organization (WHO), highlighting the need for accessible tools for early detection. Automated classification systems based on signal processing and machine learning offer a non-invasive alternative to support clinical diagnosis. Methods: This study implements and evaluates machine learning models for distinguishing normal and abnormal heart sounds using a hybrid feature extraction approach. Recordings labeled as normal, murmur, and extrasystolic were obtained from the PASCAL dataset and subsequently binarized into two classes. Multiple numerical datasets were generated through statistical features derived from Mel-Frequency Cepstral Coefficients (MFCCs) and Daubechies wavelet analysis. Each dataset was standardized and used to train four classifiers: support vector machines, logistic regression, random forests, and decision trees. Results: Model performance was assessed using accuracy, precision, recall, specificity, F1-score, and area under the curve. All classifiers achieved notable results; however, the support vector machine model trained with 26 MFCCs and Daubechies-4 wavelet coefficients obtained the best performance. Conclusions: These findings demonstrate that the proposed hybrid MFCC–Wavelet framework provides competitive diagnostic accuracy and represents a lightweight, interpretable, and computationally efficient solution for computer-aided auscultation and early cardiovascular screening. Full article
(This article belongs to the Special Issue Artificial Intelligence and Computational Methods in Cardiology 2026)
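A minimal sketch of the hybrid feature extraction named in the abstract, MFCC statistics plus Daubechies-4 wavelet coefficients for one recording; the file name, sample rate, decomposition level, and summary statistics are assumptions.

```python
import librosa
import numpy as np
import pywt

y, sr = librosa.load("heart_sound.wav", sr=4000)       # hypothetical heart-sound recording

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26)      # 26 MFCCs, as in the best model
mfcc_stats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

coeffs = pywt.wavedec(y, "db4", level=4)                # Daubechies-4 decomposition
wavelet_stats = np.array([np.mean(np.abs(c)) for c in coeffs])

# One row of the numerical dataset later fed to SVM / logistic regression / trees
features = np.concatenate([mfcc_stats, wavelet_stats])
print(features.shape)
```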

18 pages, 4190 KB  
Article
Acoustic Characteristics of Vowel Production in Children with Cochlear Implants Using a Multi-View Fusion Model
by Qingqing Xie, Jing Wang, Ling Du, Lifang Zhang and Yanan Li
Algorithms 2026, 19(1), 9; https://doi.org/10.3390/a19010009 - 22 Dec 2025
Viewed by 312
Abstract
This study aims to examine the acoustic characteristics of Mandarin vowels produced by children with cochlear implants and to explore the differences in their speech production compared with those of children with normal hearing. We propose a multiview model-based method for vowel feature analysis. This approach involves extracting and fusing formant features, Mel-frequency cepstral coefficients (MFCCs), and linear predictive coding coefficients (LPCCs) to comprehensively represent vowel articulation. We conducted k-means clustering on individual features and applied multiview clustering to the fused features. The results showed that children with cochlear implants formed discernible vowel clusters in the formant space, though with lower compactness than those of normal-hearing children. Furthermore, the MFCC and LPCC features revealed significant inter-group differences. Most importantly, the multiview model, utilizing fused features, achieved superior clustering performance compared to any single feature. These findings demonstrate that effective fusion of frequency-domain features provides a more comprehensive representation of phonetic characteristics, offering potential value for clinical assessment and targeted speech intervention in children with hearing impairment. Full article
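As a rough sketch of the fusion-and-clustering idea, the per-vowel feature views can be concatenated and clustered with k-means; plain concatenation stands in for the paper's multiview fusion, and all shapes and data below are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
formants = rng.normal(size=(300, 3))   # F1-F3 per vowel token (synthetic)
mfccs    = rng.normal(size=(300, 13))  # frame-averaged MFCCs (synthetic)
lpccs    = rng.normal(size=(300, 12))  # frame-averaged LPCCs (synthetic)

fused = StandardScaler().fit_transform(np.hstack([formants, mfccs, lpccs]))
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(fused)
print("silhouette:", silhouette_score(fused, labels))   # cluster compactness
```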

30 pages, 4486 KB  
Article
Passive Localization in GPS-Denied Environments via Acoustic Side Channels: Harnessing Smartphone Microphones to Infer Wireless Signal Strength Using MFCC Features
by Khalid A. Darabkh, Oswa M. Amro and Feras B. Al-Qatanani
J. Sens. Actuator Netw. 2025, 14(6), 119; https://doi.org/10.3390/jsan14060119 - 16 Dec 2025
Viewed by 574
Abstract
Location provenance based on the Global Positioning System (GPS) and the Received Signal Strength Indicator (RSSI) often fails in obstructed, noisy, or densely populated urban environments. This study proposes a passive location provenance method that uses the location's acoustics and the device's acoustic side channel to address these limitations. With the smartphone's internal microphone, we can effectively capture the subtle vibrations produced by the capacitors within the voltage-regulating circuit during wireless transmissions. Subsequently, we extract key features from the resulting audio signals. Meanwhile, we record the RSSI values of the WiFi access points received by the smartphone in the exact location of the audio recordings. Our analysis reveals a strong correlation between acoustic features and RSSI values, indicating that passive acoustic emissions can effectively represent the strength of WiFi signals. Hence, the audio recordings can serve as proxies for Radio-Frequency (RF)-based location signals. We propose a location-provenance framework that utilizes sound features alone, particularly the Mel-Frequency Cepstral Coefficients (MFCCs), achieving coarse localization within approximately four kilometers. This method requires no specialized hardware, works in signal-degraded environments, and introduces a previously overlooked privacy concern: that internal device sounds can unintentionally leak spatial information. Our findings highlight a novel passive side-channel with implications for both privacy and security in mobile systems. Full article
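The claimed link between acoustic features and signal strength boils down to a correlation between two paired measurements; a minimal sketch with placeholder arrays in place of the recorded MFCC features and RSSI values.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
rssi = rng.uniform(-90, -30, size=200)                       # dBm readings (synthetic)
mfcc_feature = 0.5 * rssi + rng.normal(scale=5.0, size=200)  # e.g. mean of one MFCC band (synthetic)

r, p = pearsonr(mfcc_feature, rssi)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```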

27 pages, 3213 KB  
Article
Urban Sound Classification for IoT Devices in Smart City Infrastructures
by Simona Domazetovska Markovska, Viktor Gavriloski, Damjan Pecioski, Maja Anachkova, Dejan Shishkovski and Anastasija Angjusheva Ignjatovska
Urban Sci. 2025, 9(12), 517; https://doi.org/10.3390/urbansci9120517 - 5 Dec 2025
Cited by 1 | Viewed by 2133
Abstract
Urban noise is a major environmental concern that affects public health and quality of life, demanding new approaches beyond conventional noise level monitoring. This study investigates the development of an AI-driven Acoustic Event Detection and Classification (AED/C) system designed for urban sound recognition and its integration into smart city applications. Using the UrbanSound8K dataset, five acoustic parameters—Mel Frequency Cepstral Coefficients (MFCC), Mel Spectrogram (MS), Spectral Contrast (SC), Tonal Centroid (TC), and Chromagram (Ch)—were mathematically modeled and applied to feature extraction. Their combinations were tested with three classical machine learning algorithms, Support Vector Machines (SVM), Random Forest (RF), and Naive Bayes (NB), as well as a deep learning approach, Convolutional Neural Networks (CNN). A total of 52 models built with the three ML algorithms were analyzed, along with 4 CNN models. The MFCC-based CNN models showed the highest accuracy, achieving up to 92.68% on test data, an improvement of approximately 2% over prior CNN-based approaches reported in similar studies. Additionally, the number of trained models, 56 in total, exceeds those presented in comparable research, ensuring more robust performance validation and statistical reliability. Real-time validation confirmed applicability to IoT devices, and a low-cost wireless sensor unit (WSU) was developed with fog and cloud computing for scalable data processing. The constructed WSU demonstrates a cost reduction of at least four times compared to previously developed units, while maintaining good performance, enabling broader deployment potential in smart city applications. The findings demonstrate the potential of AI-based AED/C systems for continuous, source-specific noise classification, supporting sustainable urban planning and improved environmental management in smart cities. Full article
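The five acoustic parameters named above all have librosa counterparts; a minimal sketch of extracting them from a single clip (the file name is a placeholder and the frame parameters are librosa defaults, not the study's settings).

```python
import librosa
import numpy as np

y, sr = librosa.load("urbansound_clip.wav", sr=22050)   # hypothetical UrbanSound8K clip

mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)              # MFCC
mel      = librosa.feature.melspectrogram(y=y, sr=sr)               # Mel Spectrogram
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)            # Spectral Contrast
tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # Tonal Centroid
chroma   = librosa.feature.chroma_stft(y=y, sr=sr)                  # Chromagram

# Frame-averaged vector, a typical input for the SVM / RF / NB baselines
vector = np.concatenate([f.mean(axis=1) for f in (mfcc, mel, contrast, tonnetz, chroma)])
print(vector.shape)
```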

17 pages, 2207 KB  
Article
Water Content Detection of Red Sandstone Based on Shock Acoustic Sensing and Convolutional Neural Network
by Zhaokang Qiu, Yang Liu, Yi Zhang, Xueqi Zhao, Dongdong Chen and Shengwu Tu
Sensors 2025, 25(23), 7164; https://doi.org/10.3390/s25237164 - 24 Nov 2025
Viewed by 362
Abstract
To address the changes in the physical and mechanical properties of red sandstone when it comes into contact with water during construction projects, this paper proposes a knocking-based method for detecting the moisture content of red sandstone. Taking red sandstone as the research object, this study explores a moisture content detection approach that combines the knocking method with Convolutional Neural Network and Support Vector Machine algorithms (CNN-SVM). Specifically, the surface of red sandstone specimens is struck with a knocking hammer, and the acoustic signals generated during the knocking process are precisely captured with a microphone. Moisture content is then detected by extracting features from the knocking sound signals and classifying them with a Convolutional Neural Network model. The method is easy to operate; by combining modern signal processing techniques with the CNN-SVM model, it enables accurate identification and non-destructive testing of the moisture content in red sandstone even with small sample datasets. Mel Frequency Cepstral Coefficients (MFCCs) and Continuous Wavelet Transform (CWT) features were separately used to detect red sandstone specimens with different moisture contents. The results show that the classification accuracy of red sandstone moisture content using MFCCs as the feature reaches as high as 94.4%, significantly outperforming the classification method using CWT as the feature. This study validates the effectiveness and reliability of the proposed method, providing a novel and efficient approach for rapid and non-destructive detection of the moisture content in red sandstone. Full article
(This article belongs to the Section Physical Sensors)
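The two competing feature types, MFCCs and CWT coefficients, can be sketched with librosa and PyWavelets; the wavelet, scale range, and file name are assumptions, not the study's settings.

```python
import librosa
import numpy as np
import pywt

y, sr = librosa.load("knock.wav", sr=16000)             # hypothetical knock recording

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # MFCC feature matrix

scales = np.arange(1, 64)                               # illustrative scale range
cwt_coeffs, freqs = pywt.cwt(y, scales, "morl", sampling_period=1 / sr)

print(mfccs.shape, cwt_coeffs.shape)  # (13, n_frames) and (63, len(y))
```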

15 pages, 1109 KB  
Article
A Novel Unsupervised You Only Listen Once (YOLO) Machine Learning Platform for Automatic Detection and Characterization of Prominent Bowel Sounds Towards Precision Medicine
by Gayathri Yerrapragada, Jieun Lee, Mohammad Naveed Shariff, Poonguzhali Elangovan, Keerthy Gopalakrishnan, Avneet Kaur, Divyanshi Sood, Swetha Rapolu, Jay Gohri, Gianeshwaree Alias Rachna Panjwani, Rabiah Aslam Ansari, Jahnavi Mikkilineni, Naghmeh Asadimanesh, Thangeswaran Natarajan, Jayarajasekaran Janarthanan, Shiva Sankari Karuppiah, Vivek N. Iyer, Scott A. Helgeson, Venkata S. Akshintala and Shivaram P. Arunachalam
Bioengineering 2025, 12(11), 1271; https://doi.org/10.3390/bioengineering12111271 - 19 Nov 2025
Viewed by 2656
Abstract
Phonoenterography (PEG) offers a non-invasive and radiation-free technique to assess gastrointestinal activity through acoustic signal analysis. In this feasibility study, 110 high-resolution PEG recordings (44.1 kHz, 16-bit) were acquired from eight healthy individuals, yielding 6314 prominent bowel sound (PBS) segments through automated segmentation. Each event was characterized using a 279-feature acoustic profile comprising Mel-frequency cepstral coefficients (MFCCs), their first-order derivatives (Δ-MFCCs), and six global spectral parameters. After normalization and dimensionality reduction with PCA and UMAP (cosine distance, 35 neighbors, minimum distance = 0.01), five clustering strategies were evaluated. K-Means (k = 5) achieved the most favorable balance between cluster quality (silhouette = 0.60; Calinski–Harabasz = 19,165; Davies–Bouldin = 0.68) and interpretability, consistently identifying five acoustic patterns: single-burst, multiple-burst, harmonic, random-continuous, and multi-modal. Temporal modeling of clustered events further revealed distinct sequential dynamics, with single-burst events showing the longest dwell times, random-continuous events the shortest, and strong diagonal elements in the transition matrix confirming measurable state persistence. Frequent transitions between random-continuous and multi-modal states suggested dynamic exchanges between transient and overlapping motility patterns. Together, these findings demonstrate that unsupervised PEG-based analysis can capture both acoustic variability and temporal organization of bowel sounds. This annotation-free approach provides a scalable framework for real-time gastrointestinal monitoring and holds potential for clinical translation in conditions such as postoperative ileus, bowel obstruction, irritable bowel syndrome, and inflammatory bowel disease. Full article
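The unsupervised pipeline described above (normalization, PCA, a cosine-distance UMAP embedding, k-means, and three cluster-quality indices) maps fairly directly onto scikit-learn and umap-learn; the feature matrix below is a synthetic placeholder, not PEG data.

```python
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(3).normal(size=(6314, 279))   # placeholder event profiles

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=50).fit_transform(X_scaled)
X_umap = umap.UMAP(metric="cosine", n_neighbors=35,
                   min_dist=0.01, random_state=0).fit_transform(X_pca)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_umap)
print("silhouette:", silhouette_score(X_umap, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X_umap, labels))
print("Davies-Bouldin:", davies_bouldin_score(X_umap, labels))
```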

25 pages, 5621 KB  
Article
Balanced Neonatal Cry Classification: Integrating Preterm and Full-Term Data for RDS Screening
by Somaye Valizade Shayegh and Chakib Tadj
Information 2025, 16(11), 1008; https://doi.org/10.3390/info16111008 - 19 Nov 2025
Viewed by 449
Abstract
Respiratory distress syndrome (RDS) is one of the most serious neonatal conditions, frequently leading to respiratory failure and death in low-resource settings. Early detection is therefore critical, particularly where access to advanced diagnostic tools is limited. Recent advances in machine learning have enabled non-invasive neonatal cry diagnostic systems (NCDSs) for early screening. To the best of our knowledge, this is the first cry-based RDS detection study to include both preterm and full-term infants in a subject-balanced design, using 76 neonates (38 RDS, 38 healthy; 19 per subgroup) and 8534 expiratory cry segments (4267 per class). Cry waveforms were converted to mono, high-pass-filtered, and segmented to isolate expiratory units. Mel-Frequency Cepstral Coefficients (MFCCs) and Filterbank (FBANK) features were extracted and transformed into fixed-dimensional embeddings using a lightweight X-vector model with mean-SD or attention-based pooling, followed by a binary classifier. Model parameters were optimized via grid search. Performance was evaluated using accuracy, precision, recall, F1-score, and ROC–AUC under stratified 10-fold cross-validation. MFCC + mean–SD achieved 93.59 ± 0.48% accuracy, while MFCC + attention reached 93.53 ± 0.52% accuracy with slightly higher precision, reducing false RDS alarms and improving clinical reliability. To enhance interpretability, Integrated Gradients were applied to MFCC and FBANK features to reveal the spectral regions contributing most to the decision. Overall, the proposed NCDS reliably distinguishes RDS from healthy cries and generalizes across neonatal subgroups despite the greater variability in preterm vocalizations. Full article
(This article belongs to the Special Issue Biomedical Signal and Image Processing with Artificial Intelligence)
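The mean–SD pooling step, which turns a variable-length MFCC sequence into a fixed-dimensional embedding, is simple to state; the sketch below is a plain NumPy stand-in for the pooling layer of an X-vector-style network, not the authors' model.

```python
import numpy as np

def mean_sd_pool(frames: np.ndarray) -> np.ndarray:
    """frames: (n_frames, n_coeffs) -> fixed (2 * n_coeffs,) embedding."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

mfcc_seq = np.random.default_rng(4).normal(size=(137, 13))  # one cry segment (synthetic)
print(mean_sd_pool(mfcc_seq).shape)  # (26,) regardless of segment length
```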

14 pages, 1737 KB  
Article
Classification of Speech and Associated EEG Responses from Normal-Hearing and Cochlear Implant Talkers Using Support Vector Machines
by Shruthi Raghavendra, Sungmin Lee and Chin-Tuan Tan
Audiol. Res. 2025, 15(6), 158; https://doi.org/10.3390/audiolres15060158 - 18 Nov 2025
Viewed by 558
Abstract
Background/Objectives: Speech produced by individuals with hearing loss differs notably from that of normal-hearing (NH) individuals. Although cochlear implants (CIs) provide sufficient auditory input to support speech acquisition and control, there remains considerable variability in speech intelligibility among CI users. As a result, speech produced by CI talkers often exhibits distinct acoustic characteristics compared to that of NH individuals. Methods: Speech data were obtained from eight cochlear-implant (CI) and eight normal-hearing (NH) talkers, while electroencephalogram (EEG) responses were recorded from 11 NH listeners exposed to the same speech stimuli. This study evaluated the efficacy of Support Vector Machine (SVM) classifiers with four kernel functions (Linear, Polynomial, Gaussian, and Radial Basis Function) in classifying speech produced by NH and CI talkers, using 3-fold cross-validation with classification accuracy as the performance metric. Six acoustic features were extracted: Log Energy, Zero-Crossing Rate (ZCR), Pitch, Linear Predictive Coefficients (LPC), Mel-Frequency Cepstral Coefficients (MFCCs), and Perceptual Linear Predictive Cepstral Coefficients (PLP-CC). These same features were also extracted from the EEG recordings of NH listeners who were exposed to the speech stimuli. The EEG analysis leveraged the assumption of quasi-stationarity over short time windows. Results: Classification of speech signals using SVMs yielded the highest accuracies of 100% and 94% for the Energy and MFCC features, respectively, using Gaussian and RBF kernels. EEG responses to speech achieved classification accuracies exceeding 70% for ZCR and Pitch features using the same kernels. Other features such as LPC and PLP-CC yielded moderate to low classification performance. Conclusions: The results indicate that both speech-derived and EEG-derived features can effectively differentiate between CI and NH talkers. Among the tested kernels, Gaussian and RBF provided superior performance, particularly when using Energy and MFCC features. These findings support the application of SVMs for multimodal classification in hearing research, with potential applications in improving CI speech processing and auditory rehabilitation. Full article
(This article belongs to the Section Hearing)
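The kernel comparison at the heart of the evaluation can be sketched with scikit-learn; the feature vectors and labels below are synthetic placeholders, and note that scikit-learn's "rbf" kernel is the Gaussian kernel, so only three distinct kernels appear.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(160, 20))    # placeholder acoustic/EEG feature vectors
y = rng.integers(0, 2, size=160)  # 0 = NH talker, 1 = CI talker (synthetic)

for kernel in ("linear", "poly", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    acc = cross_val_score(clf, X, y, cv=3, scoring="accuracy")  # 3-fold CV
    print(f"{kernel}: {acc.mean():.3f} +/- {acc.std():.3f}")
```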

18 pages, 3175 KB  
Article
AudioFakeNet: A Model for Reliable Speaker Verification in Deepfake Audio
by Samia Dilbar, Muhammad Ali Qureshi, Serosh Karim Noon and Abdul Mannan
Algorithms 2025, 18(11), 716; https://doi.org/10.3390/a18110716 - 13 Nov 2025
Viewed by 1054
Abstract
Deepfake audio refers to the generation of voice recordings using deep neural networks that replicate a specific individual's voice, often for deceptive or fraudulent purposes. Although this has been an area of research for quite some time, deepfakes still pose substantial challenges for reliable true-speaker authentication. To address the issue, we propose AudioFakeNet, a hybrid deep learning architecture that uses Convolutional Neural Networks (CNNs) along with Long Short-Term Memory (LSTM) units and Multi-Head Attention (MHA) mechanisms for robust deepfake detection. The CNN extracts spatial and spectral features, the LSTM captures temporal dependencies, and the MHA sharpens the focus on informative audio segments. The model is trained using Mel-Frequency Cepstral Coefficients (MFCCs) from a publicly available dataset and validated on a self-collected dataset, ensuring reproducibility. Performance comparisons with state-of-the-art machine learning and deep learning models show that the proposed AudioFakeNet achieves higher accuracy, better generalization, and a lower Equal Error Rate (EER). Its modular design allows for broader adaptability in fake-audio detection tasks, offering significant potential across diverse speech synthesis applications. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
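A hedged sketch of the CNN, LSTM, and multi-head attention arrangement named above, operating on MFCC sequences; the layer widths, head count, and exact ordering are assumptions rather than the AudioFakeNet specification.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_frames, n_mfcc = 300, 40                       # assumed MFCC sequence shape

inp = layers.Input(shape=(n_frames, n_mfcc))
x = layers.Conv1D(64, 3, padding="same", activation="relu")(inp)  # spectral/local features
x = layers.MaxPooling1D(2)(x)
x = layers.LSTM(64, return_sequences=True)(x)                     # temporal dependencies
x = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)      # focus on informative segments
x = layers.GlobalAveragePooling1D()(x)
out = layers.Dense(1, activation="sigmoid")(x)                    # genuine vs. deepfake

model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```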

11 pages, 744 KB  
Proceeding Paper
A Deep Learning Framework for Early Detection of Potential Cardiac Anomalies via Murmur Pattern Analysis in Phonocardiograms
by Aymane Edder, Fatima-Ezzahraa Ben-Bouazza, Oumaima Manchadi, Youssef Ait Bigane, Djeneba Sangare and Bassma Jioudi
Eng. Proc. 2025, 112(1), 63; https://doi.org/10.3390/engproc2025112063 - 31 Oct 2025
Viewed by 592
Abstract
Heart murmurs, resulting from turbulent blood flow within the cardiac structure, represent some of the initial acoustic manifestations of potential underlying cardiovascular anomalies, such as arrhythmias. This research presents a deep learning framework aimed at the early detection of potential cardiac anomalies through the analysis of murmur patterns in phonocardiogram (PCG) signals. Our methodology employs a spectro-temporal feature fusion technique that integrates Mel spectrograms, Mel Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Power Spectral Density (PSD) representations. The features are derived from segmented 5-second PCG windows and fed into a two-dimensional convolutional neural network (CNN) for classification. To mitigate class imbalance and enhance generalization, we employ data augmentation techniques, including pitch shifting and noise injection. The model was trained and evaluated on a carefully selected subset of the CirCor DigiScope dataset. The experimental findings indicate robust performance, with a classification accuracy of 92.40% and a cross-entropy loss of 0.2242. The results indicate that murmur-informed analysis of PCG signals may function as an effective non-invasive method for the early screening of conditions that may include arrhythmias, particularly in clinical environments with limited resources. Full article
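The two augmentations mentioned above, pitch shifting and noise injection, can be sketched with librosa; the file name, shift amount, and noise level are assumptions.

```python
import librosa
import numpy as np

# Hypothetical 5-second PCG window
y, sr = librosa.load("pcg_window.wav", sr=4000, duration=5.0)

y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)       # pitch shifting

rng = np.random.default_rng(6)
noise = rng.normal(scale=0.005 * np.max(np.abs(y)), size=y.shape)  # low-level Gaussian noise
y_noisy = y + noise                                                # noise injection
```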
