Search Results (12)

Search Parameters:
Keywords = VGGish

25 pages, 3819 KB  
Article
Cross-Modal and Contrastive Optimization for Explainable Multimodal Recognition of Predatory and Parasitic Insects
by Mingyu Liu, Liuxin Wang, Ruihao Jia, Shiyu Ji, Yalin Wu, Yuxin Wu, Luozehan Xie and Min Dong
Insects 2025, 16(12), 1187; https://doi.org/10.3390/insects16121187 - 22 Nov 2025
Viewed by 895
Abstract
Natural enemies play a vital role in pest suppression and ecological balance within agricultural ecosystems. However, conventional vision-based recognition methods are highly susceptible to illumination variation, occlusion, and background noise in complex field environments, making it difficult to accurately distinguish morphologically similar species. To address these challenges, a multimodal natural enemy recognition and ecological interpretation framework, termed MAVC-XAI, is proposed to enhance recognition accuracy and ecological interpretability in real-world agricultural scenarios. The framework employs a dual-branch spatiotemporal feature extraction network for deep modeling of both visual and acoustic signals, introduces a cross-modal sampling attention mechanism for dynamic inter-modality alignment, and incorporates cross-species contrastive learning to optimize inter-class feature boundaries. Additionally, an explainable generation module is designed to provide ecological visualizations of the model’s decision-making process in both visual and acoustic domains. Experiments conducted on multimodal datasets collected across multiple agricultural regions confirm the effectiveness of the proposed approach. The MAVC-XAI framework achieves an accuracy of 0.938, a precision of 0.932, a recall of 0.927, an F1-score of 0.929, an mAP@50 of 0.872, and a Top-5 recognition rate of 97.8%, all significantly surpassing unimodal models such as ResNet, Swin-T, and VGGish, as well as multimodal baselines including MMBT and ViLT. Ablation experiments further validate the critical contributions of the cross-modal sampling attention and contrastive learning modules to performance enhancement. The proposed framework not only enables high-precision natural enemy identification under complex ecological conditions but also provides an interpretable and intelligent foundation for AI-driven ecological pest management and food security monitoring. Full article
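
A small, hypothetical sketch of the cross-species contrastive objective mentioned above is given below as a supervised contrastive loss over fused audio-visual embeddings; the temperature, normalization, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """embeddings: (N, D) fused audio-visual features; labels: (N,) species ids."""
    z = F.normalize(embeddings, dim=1)                    # unit-norm features
    sim = z @ z.t() / temperature                         # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_counts     # mean log-likelihood of positives
    return loss[pos_mask.any(1)].mean()                   # average over anchors that have positives

# Example: 8 fused embeddings, 128-d, from 3 hypothetical natural-enemy species.
emb = torch.randn(8, 128)
lab = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
print(supervised_contrastive_loss(emb, lab))
```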

13 pages, 2126 KB  
Article
Comparison of Deep Neural Networks for the Classification of Adventitious Lung Sounds
by Said Polanco-Martagón, Yahir Hernández-Mier, Marco Aurelio Nuño-Maganda, José Hugo Barrón-Zambrano, Andrea Magadán-Salazar and César Alejandro Medellín-Vergara
J. Clin. Med. 2025, 14(20), 7427; https://doi.org/10.3390/jcm14207427 - 21 Oct 2025
Viewed by 701
Abstract
Background: Automatic adventitious lung sound classification using deep learning is a promising strategy for objective respiratory disease screening. Evaluating model performance is challenging, particularly with imbalanced clinical datasets. This study compares CNN architectures and proposes a dual-stream classification approach. Methods: Using the public ICBHI 2017 dataset, we compared five pre-trained architectures: VGG16, VGG19, InceptionV3, MobileNetV2, and ResNet152V2. To mitigate class imbalance, we implemented pitch shifting, random shifting, and mixup data augmentation. We also developed and evaluated a novel VGGish-dual-stream network. The primary endpoint was the Average Score (AS), the arithmetic mean of Sensitivity and Specificity. Results: Among benchmarked models, ResNet152V2 achieved the highest AS (0.541), approaching the state-of-the-art range (0.56–0.58). This performance was characterised by a high Specificity (0.67) but low Sensitivity (0.41). Our proposed dual-stream network yielded a more balanced, albeit slightly lower, performance with an AS of 0.508. Conclusions: Standard CNN architectures like ResNet152V2 can achieve competitive classification performance but may exhibit a clinically significant bias towards high specificity at the expense of sensitivity. This trade-off poses a risk of missing pathological events (false negatives). To ensure clinical safety and utility, future work must prioritise strategies that explicitly improve model sensitivity. Full article
(This article belongs to the Section Respiratory Medicine)
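
As a hedged illustration of the mixup augmentation listed above, the sketch below blends two lung-sound spectrograms and their one-hot labels into a soft-labeled training sample; the Beta parameter, array shapes, and class names are assumptions, not values from the paper.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two (spectrogram, one-hot label) pairs into one soft-labeled sample."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient drawn from Beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2         # convex combination of the inputs
    y = lam * y1 + (1.0 - lam) * y2         # matching combination of the labels
    return x, y

# Example: mix a hypothetical 'crackle' and 'normal' mel spectrogram (128 bins x 256 frames).
spec_a, spec_b = np.random.rand(128, 256), np.random.rand(128, 256)
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_spec, mixed_label = mixup(spec_a, lab_a, spec_b, lab_b)
print(mixed_spec.shape, mixed_label)
```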

24 pages, 3485 KB  
Article
Impact Evaluation of Sound Dataset Augmentation and Synthetic Generation upon Classification Accuracy
by Eleni Tsalera, Andreas Papadakis, Gerasimos Pagiatakis and Maria Samarakou
J. Sens. Actuator Netw. 2025, 14(5), 91; https://doi.org/10.3390/jsan14050091 - 9 Sep 2025
Cited by 2 | Viewed by 3274
Abstract
We investigate the impact of dataset augmentation and synthetic generation techniques on the accuracy of supervised audio classification based on state-of-the-art neural networks used as classifiers. Dataset augmentation techniques are applied upon the raw sound and its transformed image format. Specifically, sound augmentation techniques are applied prior to spectral-based transformation and include time stretching, pitch shifting, noise addition, volume controlling, and time shifting. Image augmentation techniques are applied after the transformation of the sound into a scalogram, involving scaling, shearing, rotation, and translation. Synthetic sound generation is based on the AudioGen generative model, triggered through a series of customized prompts. Augmentation and synthetic generation are applied to three sound categories: (a) human sounds, (b) animal sounds, and (c) sounds of things, with each category containing ten sound classes with 20 samples retrieved from the ESC-50 dataset. Sound- and image-orientated neural network classifiers have been used to classify the augmented datasets and their synthetic additions. VGGish and YAMNet (sound classifiers) employ spectrograms, while ResNet50 and DarkNet53 (image classifiers) employ scalograms. The streamlined AI-based process of augmentation and synthetic generation, enhanced classifier fine-tuning and inference allowed for a consistent, multicriteria-comparison of the impact. Classification accuracy has increased for all augmentation and synthetic generation scenarios; however, the increase has not been uniform among the techniques, the sound types, and the percentage of the training set population increase. The average increase in classification accuracy ranged from 2.05% for ResNet50 to 9.05% for VGGish. Our findings reinforce the benefit of audio augmentation and synthetic generation, providing guidelines to avoid accuracy degradation due to overuse and distortion of key audio features. Full article
(This article belongs to the Special Issue AI-Assisted Machine-Environment Interaction)
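
The waveform-level augmentations named above (time stretching, pitch shifting, noise addition, volume control, time shifting) are applied before the spectral transform; a rough librosa-based sketch is shown below, with parameter ranges and a synthetic test tone as illustrative assumptions.

```python
import numpy as np
import librosa

def augment_waveform(y, sr, rng=None):
    rng = rng or np.random.default_rng()
    return {
        "stretch": librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1)),
        "pitch":   librosa.effects.pitch_shift(y, sr=sr, n_steps=int(rng.integers(-2, 3))),
        "noise":   y + 0.005 * rng.standard_normal(len(y)).astype(np.float32),
        "volume":  y * rng.uniform(0.7, 1.3),
        "shift":   np.roll(y, int(rng.integers(0, sr // 2))),
    }

# Example on a synthetic 1-second tone, then converted to mel spectrograms for the classifiers.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
augmented = augment_waveform(tone, sr)
mels = {name: librosa.feature.melspectrogram(y=wav, sr=sr) for name, wav in augmented.items()}
print({name: spec.shape for name, spec in mels.items()})
```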

14 pages, 2927 KB  
Article
Optimizing MFCC Parameters for Breathing Phase Detection
by Assel K. Zhantleuova, Yerbulat K. Makashev and Nurzhan T. Duzbayev
Sensors 2025, 25(16), 5002; https://doi.org/10.3390/s25165002 - 13 Aug 2025
Cited by 1 | Viewed by 1548
Abstract
Breathing phase detection is fundamental for various clinical and digital health applications, yet standard Mel Frequency Cepstral Coefficients (MFCCs) settings often limit classification performance. This study systematically optimized MFCC parameters, specifically the number of coefficients, frame length, and hop length, using a proprietary dataset of respiratory sounds (n = 1500 segments). Classification performance was evaluated using Support Vector Machines (SVMs) and benchmarked against deep learning models (VGGish, YAMNet, MobileNetV2). Optimal parameters (30 MFCC coefficients, 800 ms frame length, 10 ms hop length) substantially enhanced accuracy (87.16%) compared to default settings (80.96%) and performed equivalently or better than deep learning methods. A trade-off analysis indicated that a clinically practical frame length of 200–300 ms balanced accuracy (85.08%) and latency effectively. The study concludes that optimized MFCC parameters significantly improve respiratory phase classification, providing efficient and interpretable solutions suitable for real-time clinical monitoring. Future research should focus on validating these parameters in broader clinical contexts and exploring multimodal and federated learning strategies. Full article
(This article belongs to the Section Biomedical Sensors)
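
As a rough illustration of the optimized front end reported above (30 MFCC coefficients, an 800 ms frame, a 10 ms hop) feeding an SVM, the sketch below uses librosa and scikit-learn; the sampling rate, temporal pooling, and toy signals are assumptions rather than the authors' pipeline.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(y, sr):
    frame = int(0.800 * sr)   # 800 ms analysis window
    hop = int(0.010 * sr)     # 10 ms hop
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=30, n_fft=frame, hop_length=hop)
    return m.mean(axis=1)     # simple temporal pooling to one 30-d vector per segment

# Toy example: two synthetic segments per class standing in for recorded breathing phases.
sr = 8000
segments = [np.sin(2 * np.pi * f * np.arange(2 * sr) / sr) for f in (200, 210, 400, 410)]
X = np.stack([mfcc_features(s.astype(np.float32), sr) for s in segments])
y_labels = np.array([0, 0, 1, 1])            # 0 = inhale, 1 = exhale (illustrative labels)
clf = SVC(kernel="rbf").fit(X, y_labels)
print(clf.predict(X))
```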

27 pages, 6343 KB  
Article
Detection and Classification of Obstructive Sleep Apnea Using Audio Spectrogram Analysis
by Salvatore Serrano, Luca Patanè, Omar Serghini and Marco Scarpa
Electronics 2024, 13(13), 2567; https://doi.org/10.3390/electronics13132567 - 29 Jun 2024
Cited by 11 | Viewed by 7584
Abstract
Sleep disorders are steadily increasing in the population and can significantly affect daily life. Low-cost and noninvasive systems that can assist the diagnostic process will become increasingly widespread in the coming years. This work aims to investigate and compare the performance of machine learning-based classifiers for the identification of obstructive sleep apnea–hypopnea (OSAH) events, including apnea/non-apnea status classification, apnea–hypopnea index (AHI) prediction, and AHI severity classification. The dataset considered contains recordings from 192 patients. It is derived from a recently released dataset which contains, amongst others, audio signals recorded with an ambient microphone placed ∼1 m above the studied subjects and accurate apnea/hypopnea event annotations performed by specialized medical doctors. We employ mel spectrogram images extracted from the environmental audio signals as input of a machine-learning-based classifier for apnea/hypopnea event classification. The proposed approach involves a stacked model which utilizes a combination of a pretrained VGG-like audio classification (VGGish) network and a bidirectional long short-term memory (bi-LSTM) network. Performance analysis was conducted using a 5-fold cross-validation approach, leaving out patients used for training and validation of the models in the testing step. Comparative evaluations with recently presented methods from the literature demonstrate the advantages of the proposed approach. The proposed architecture can be considered a useful tool for supporting OSAHS diagnoses by means of low-cost devices such as smartphones. Full article
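
A minimal PyTorch sketch of the stacked design described above — a bidirectional LSTM classifying a sequence of 128-dimensional VGGish embeddings — is given below; the layer sizes, sequence length, and two-class head are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class VGGishBiLSTM(nn.Module):
    def __init__(self, embed_dim=128, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)     # apnea/hypopnea vs. normal (assumed head)

    def forward(self, x):                                 # x: (batch, time, 128) VGGish embeddings
        out, _ = self.lstm(x)
        return self.head(out[:, -1])                      # last time step -> class logits

# Example: a batch of 4 audio clips, each represented by 30 consecutive VGGish embeddings.
model = VGGishBiLSTM()
logits = model(torch.randn(4, 30, 128))
print(logits.shape)   # torch.Size([4, 2])
```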

23 pages, 21874 KB  
Article
Speech Emotion Recognition Using Deep Learning Transfer Models and Explainable Techniques
by Tae-Wan Kim and Keun-Chang Kwak
Appl. Sci. 2024, 14(4), 1553; https://doi.org/10.3390/app14041553 - 15 Feb 2024
Cited by 22 | Viewed by 8903
Abstract
This study aims to establish greater reliability than conventional speech emotion recognition (SER) studies. This is achieved through preprocessing techniques that reduce uncertainty elements, models that combine the structural features of each model, and the application of various explanatory techniques. The ability to interpret can be made more accurate by reducing uncertain learning data, applying data in different environments, and applying techniques that explain the reasoning behind the results. We designed a generalized model using three different datasets, and each speech was converted into a spectrogram image through STFT preprocessing. The spectrogram was divided along the time axis with overlap to match the input size of the model. Each divided section is expressed as a Gaussian distribution, and the quality of the data is investigated by the correlation coefficient between distributions. As a result, the scale of the data is reduced, and uncertainty is minimized. VGGish and YAMNet are the most representative pretrained deep learning networks frequently used in conjunction with speech processing. In dealing with speech signal processing, it is frequently advantageous to use these pretrained models synergistically rather than exclusively, resulting in the construction of ensemble deep networks. Finally, various explainable models (Grad CAM, LIME, occlusion sensitivity) are used in analyzing classified results. The model exhibits adaptability to voices in various environments, yielding a classification accuracy of 87%, surpassing that of individual models. Additionally, output results are confirmed by an explainable model to extract essential emotional areas, converted into audio files for auditory analysis using Grad CAM in the time domain. Through this study, we reduce the uncertainty of the activation areas generated by Grad CAM. We achieve this by applying the interpretable ability from previous studies, along with effective preprocessing and fusion models. We can analyze it from a more diverse perspective through other explainable techniques. Full article
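
The preprocessing step described above (splitting the STFT spectrogram into overlapping time windows, summarizing each with a Gaussian fit, and comparing windows via a correlation coefficient) can be sketched as follows; the window length, overlap, and stand-in signal are assumptions for illustration.

```python
import numpy as np
import librosa

def windowed_spectrogram_stats(y, sr, win_frames=64, overlap=0.5):
    S = np.abs(librosa.stft(y, n_fft=512, hop_length=256))         # magnitude STFT
    step = int(win_frames * (1 - overlap))
    windows, stats = [], []
    for start in range(0, S.shape[1] - win_frames + 1, step):
        w = S[:, start:start + win_frames]
        windows.append(w)
        stats.append((w.mean(), w.std()))                           # Gaussian summary of the section
    # Correlation of mean frequency profiles between consecutive overlapping windows.
    corrs = [np.corrcoef(a.mean(axis=1), b.mean(axis=1))[0, 1]
             for a, b in zip(windows[:-1], windows[1:])]
    return windows, stats, corrs

sr = 16000
y = np.random.randn(3 * sr).astype(np.float32)        # stand-in for a 3-second speech clip
_, gauss_stats, correlations = windowed_spectrogram_stats(y, sr)
print(len(gauss_stats), np.round(correlations, 2))
```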

11 pages, 1283 KB  
Article
Trunk Borer Identification Based on Convolutional Neural Networks
by Xing Zhang, Haiyan Zhang, Zhibo Chen and Juhu Li
Appl. Sci. 2023, 13(2), 863; https://doi.org/10.3390/app13020863 - 8 Jan 2023
Cited by 5 | Viewed by 2601
Abstract
The trunk borer is a great danger to forests because of its strong concealment, long lag and great destructiveness. In order to improve the early monitoring ability of trunk borers, the representative Agrilus planipennis Fairmaire was selected as the research object. The convolutional neural network named TrunkNet was designed to identify the activity sounds of Agrilus planipennis Fairmaire larvae. The activity sounds were recorded as vibration signals in audio form. The detector was used to collect the activity sounds of Agrilus planipennis Fairmaire larvae in the wood segments and some typical outdoor noise. The vibration signal pulse duration is short, random and high energy. TrunkNet was designed to train and identify vibration signals of Agrilus planipennis Fairmaire. Over the course of the experiment, the test accuracy of TrunkNet was 96.89%, while MobileNet_V2, ResNet18 and VGGish showed 84.27%, 79.37% and 70.85% accuracy, respectively. TrunkNet based on the convolutional neural network can provide technical support for the automatic monitoring and early warning of the stealthy tree trunk borers. The work of this study is limited to a single pest. The experiment will further focus on the applicability of the network to other pests in the future. Full article
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

16 pages, 10853 KB  
Article
Intelligent Fault Diagnosis of Industrial Bearings Using Transfer Learning and CNNs Pre-Trained for Audio Classification
by Luigi Gianpio Di Maggio
Sensors 2023, 23(1), 211; https://doi.org/10.3390/s23010211 - 25 Dec 2022
Cited by 29 | Viewed by 5027
Abstract
The training of Artificial Intelligence algorithms for machine diagnosis often requires a huge amount of data, which is scarcely available in industry. This work shows that convolutional networks pre-trained for audio classification already contain knowledge for classifying bearing vibrations, since both tasks share the need to extract features from spectrograms. Knowledge transfer is realized through transfer learning to identify localized defects in rolling element bearings. This technique provides a tool to transfer the knowledge embedded in neural networks pre-trained for fulfilling similar tasks to diagnostic scenarios, significantly limiting the amount of data needed for fine-tuning. The VGGish model was fine-tuned for the specific diagnostic task by handling vibration samples. Data were extracted from the test bench for medium-size bearings specially set up in the mechanical engineering laboratories of the Politecnico di Torino. The experiment involved three damage classes. Results show that the model pre-trained using sound spectrograms can be successfully employed for classifying the bearing state through vibration spectrograms. The effectiveness of the model is assessed through comparisons with the existing literature. Full article
(This article belongs to the Special Issue Artificial Intelligence Enhanced Health Monitoring and Diagnostics)
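
The transfer-learning recipe described above — freeze a pretrained audio backbone (VGGish in the paper) and train a new head on vibration spectrograms — can be sketched generically as below; the placeholder backbone, the three-class head, and the optimizer settings are assumptions, since the pretrained weights themselves are not reproduced here.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # placeholder standing in for a pretrained VGGish
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 128),
)
for p in backbone.parameters():                # freeze the "pretrained" weights
    p.requires_grad = False

head = nn.Linear(128, 3)                       # three bearing damage classes (assumed)
model = nn.Sequential(backbone, head)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative fine-tuning step on a batch of 8 single-channel vibration spectrograms.
x = torch.randn(8, 1, 96, 64)
y = torch.randint(0, 3, (8,))
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
print(float(loss))
```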

16 pages, 1428 KB  
Article
Convolutional Neural Networks for the Identification of African Lions from Individual Vocalizations
by Martino Trapanotto, Loris Nanni, Sheryl Brahnam and Xiang Guo
J. Imaging 2022, 8(4), 96; https://doi.org/10.3390/jimaging8040096 - 1 Apr 2022
Cited by 20 | Viewed by 5274
Abstract
The classification of vocal individuality for passive acoustic monitoring (PAM) and census of animals is becoming an increasingly popular area of research. Nearly all studies in this field of inquiry have relied on classic audio representations and classifiers, such as Support Vector Machines (SVMs) trained on spectrograms or Mel-Frequency Cepstral Coefficients (MFCCs). In contrast, most current bioacoustic species classification exploits the power of deep learners and more cutting-edge audio representations. A significant reason for avoiding deep learning in vocal identity classification is the tiny sample size in the collections of labeled individual vocalizations. As is well known, deep learners require large datasets to avoid overfitting. One way to handle small datasets with deep learning methods is to use transfer learning. In this work, we evaluate the performance of three pretrained CNNs (VGG16, ResNet50, and AlexNet) on a small, publicly available lion roar dataset containing approximately 150 samples taken from five male lions. Each of these networks is retrained on eight representations of the samples: MFCCs, spectrogram, and Mel spectrogram, along with several new ones, such as VGGish and Stockwell, and those based on the recently proposed LM spectrogram. The performance of these networks, both individually and in ensembles, is analyzed and corroborated using the Equal Error Rate and shown to surpass previous classification attempts on this dataset; the best single network achieved over 95% accuracy and the best ensembles over 98% accuracy. The contributions this study makes to the field of individual vocal classification include demonstrating that it is valuable and possible, with caution, to use transfer learning with single pretrained CNNs on the small datasets available for this problem domain. We also make a contribution to bioacoustics generally by offering a comparison of the performance of many state-of-the-art audio representations, including for the first time the LM spectrogram and Stockwell representations. All source code for this study is available on GitHub. Full article
(This article belongs to the Special Issue Computer Vision and Deep Learning: Trends and Applications)

22 pages, 8024 KB  
Article
Comparison of Pre-Trained CNNs for Audio Classification Using Transfer Learning
by Eleni Tsalera, Andreas Papadakis and Maria Samarakou
J. Sens. Actuator Netw. 2021, 10(4), 72; https://doi.org/10.3390/jsan10040072 - 10 Dec 2021
Cited by 122 | Viewed by 21495
Abstract
The paper investigates retraining options and the performance of pre-trained Convolutional Neural Networks (CNNs) for sound classification. CNNs were initially designed for image classification and recognition, and, at a second phase, they extended towards sound classification. Transfer learning is a promising paradigm, retraining already trained networks upon different datasets. We selected three ‘Image’- and two ‘Sound’-trained CNNs, namely, GoogLeNet, SqueezeNet, ShuffleNet, VGGish, and YAMNet, and applied transfer learning. We explored the influence of key retraining parameters, including the optimizer, the mini-batch size, the learning rate, and the number of epochs, on the classification accuracy and the processing time needed in terms of sound preprocessing for the preparation of the scalograms and spectrograms as well as CNN training. The UrbanSound8K, ESC-10, and Air Compressor open sound datasets were employed. Using a two-fold criterion based on classification accuracy and time needed, we selected the ‘champion’ transfer-learning parameter combinations, discussed the consistency of the classification results, and explored possible benefits from fusing the classification estimations. The Sound CNNs achieved better classification accuracy, reaching an average of 96.4% for UrbanSound8K, 91.25% for ESC-10, and 100% for the Air Compressor dataset. Full article
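
A hedged sketch of how the retraining-parameter exploration described above might be enumerated is shown below; the specific optimizers, batch sizes, learning rates, and epoch counts are illustrative and not the grid used in the paper.

```python
from itertools import product

optimizers   = ["sgdm", "adam", "rmsprop"]
batch_sizes  = [16, 32, 64]
learn_rates  = [1e-4, 3e-4, 1e-3]
epoch_counts = [10, 20]

# Enumerate every retraining configuration to be scored on accuracy and processing time.
configs = [
    {"optimizer": o, "batch_size": b, "lr": lr, "epochs": e}
    for o, b, lr, e in product(optimizers, batch_sizes, learn_rates, epoch_counts)
]
print(len(configs), "retraining configurations")   # 3 * 3 * 3 * 2 = 54
```

Each configuration would be passed to the fine-tuning routine and ranked with a two-fold criterion (classification accuracy and time needed), as in the comparison above.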

18 pages, 2869 KB  
Article
Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets
by Kyoung Ju Noh, Chi Yoon Jeong, Jiyoun Lim, Seungeun Chung, Gague Kim, Jeong Mook Lim and Hyuntae Jeong
Sensors 2021, 21(5), 1579; https://doi.org/10.3390/s21051579 - 24 Feb 2021
Cited by 16 | Viewed by 5249
Abstract
Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization. Full article
(This article belongs to the Special Issue Emotion Intelligence Based on Smart Sensing)

13 pages, 4138 KB  
Article
Audio-Based Aircraft Detection System for Safe RPAS BVLOS Operations
by Jorge Mariscal-Harana, Víctor Alarcón, Fidel González, Juan José Calvente, Francisco Javier Pérez-Grau, Antidio Viguria and Aníbal Ollero
Electronics 2020, 9(12), 2076; https://doi.org/10.3390/electronics9122076 - 5 Dec 2020
Cited by 8 | Viewed by 6266
Abstract
For the Remotely Piloted Aircraft Systems (RPAS) market to continue its current growth rate, cost-effective ‘Detect and Avoid’ systems that enable safe beyond visual line of sight (BVLOS) operations are critical. We propose an audio-based ‘Detect and Avoid’ system, composed of microphones and an embedded computer, which performs real-time inferences using a sound event detection (SED) deep learning model. Two state-of-the-art SED models, YAMNet and VGGish, are fine-tuned using our dataset of aircraft sounds and their performances are compared for a wide range of configurations. YAMNet, whose MobileNet architecture is designed for embedded applications, outperformed VGGish both in terms of aircraft detection and computational performance. YAMNet’s optimal configuration, with >70% true positive rate and precision, results from combining data augmentation and undersampling with the highest available inference frequency (i.e., 10 Hz). While our proposed ‘Detect and Avoid’ system already allows the detection of small aircraft from sound in real time, additional testing using multiple aircraft types is required. Finally, a larger training dataset, sensor fusion, or remote computations on cloud-based services could further improve system performance. Full article
(This article belongs to the Special Issue Deep Learning Technologies for Machine Vision and Audition)
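
As a hedged sketch of the real-time detection loop described above, the code below scores a rolling one-second buffer with the public YAMNet model from TF Hub at roughly 10 Hz; the buffer length, chunk size, and the placeholder class index are assumptions, and the paper fine-tunes its own models rather than using raw YAMNet scores.

```python
import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
SR = 16000                                    # YAMNet expects 16 kHz mono float32 audio
buffer = np.zeros(SR, dtype=np.float32)       # rolling 1-second analysis window

def on_new_audio(chunk):
    """Called every 100 ms (10 Hz) with 1600 new samples; returns a placeholder score."""
    global buffer
    buffer = np.concatenate([buffer[len(chunk):], chunk])
    scores, _, _ = yamnet(buffer)             # (frames, 521) class scores
    return float(scores.numpy().max(axis=0)[0])   # class 0 used as a placeholder, not "aircraft"

# Example: feed ten 100 ms chunks of low-level noise and print the last placeholder score.
for _ in range(10):
    s = on_new_audio(np.random.randn(1600).astype(np.float32) * 0.01)
print(round(s, 3))
```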
