Search Results (13)

Search Parameters:
Keywords = YAMNet

21 pages, 3332 KB  
Article
Intelligent Classification of Urban Noise Sources Using TinyML: Towards Efficient Noise Management in Smart Cities
by Maykol Sneyder Remolina Soto, Brian Amaya Guzmán, Pedro Antonio Aya-Parra, Oscar J. Perdomo, Mauricio Becerra-Fernandez and Jefferson Sarmiento-Rojas
Sensors 2025, 25(20), 6361; https://doi.org/10.3390/s25206361 - 14 Oct 2025
Viewed by 329
Abstract
Urban noise levels that exceed the World Health Organization (WHO) recommendations have become a growing concern due to their adverse effects on public health. In Bogotá, Colombia, studies by the District Department of Environment (SDA) indicate that 11.8% of the population is exposed to noise levels above the WHO limits. This research aims to identify and categorize environmental noise sources in real time using an embedded intelligent system. A total of 657 labeled audio clips were collected across eight classes and processed using a 60/20/20 train–validation–test split, ensuring that audio segments from the same continuous recording were not mixed across subsets. The system was implemented on a Raspberry Pi 2W equipped with a UMIK-1 microphone and powered by a 90 W solar panel with a 12 V battery, enabling autonomous operation. The TinyML-based model achieved precision and recall values between 0.92 and 1.00, demonstrating high performance under real urban conditions. Heavy vehicles and motorcycles accounted for the largest proportion of classified samples. Although airplane-related events were less frequent, they reached maximum sound levels of up to 88.4 dB(A), exceeding the applicable local limit of 70 dB(A) by approximately 18 dB(A) (an absolute exceedance rather than a percentage). In conclusion, the results demonstrate that on-device TinyML classification is a feasible and effective strategy for urban noise monitoring. Local inference reduces latency, bandwidth usage, and privacy risks by eliminating the need to transmit raw audio to external servers. This approach provides a scalable and sustainable foundation for noise management in smart cities and supports evidence-based public policies aimed at improving urban well-being. This work presents an introductory and exploratory study on the application of TinyML for acoustic environmental monitoring, aiming to evaluate its feasibility and potential for large-scale implementation. Full article
(This article belongs to the Section Environmental Sensing)
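
The abstract above describes on-device (TinyML) inference on a Raspberry Pi. As a minimal sketch of what such on-device classification typically looks like, the snippet below loads a TFLite audio classifier and scores one 16 kHz window; the model file name, input shape, and class labels are illustrative placeholders, not artifacts from the paper.

```python
import numpy as np
import tensorflow as tf  # on a Raspberry Pi, tflite_runtime.interpreter.Interpreter works the same way

SAMPLE_RATE = 16000
CLASSES = ["heavy_vehicle", "motorcycle", "car", "airplane",
           "siren", "horn", "voices", "background"]  # illustrative 8-class set

interpreter = tf.lite.Interpreter(model_path="urban_noise_classifier.tflite")  # placeholder model
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify(waveform: np.ndarray) -> str:
    """Classify one mono 16 kHz window; the waveform length must match the model input."""
    x = waveform.astype(np.float32).reshape(inp["shape"])
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    scores = interpreter.get_tensor(out["index"]).squeeze()
    return CLASSES[int(np.argmax(scores))]

# One second of silence as a stand-in for a microphone frame.
print(classify(np.zeros(SAMPLE_RATE, dtype=np.float32)))
```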

24 pages, 3485 KB  
Article
Impact Evaluation of Sound Dataset Augmentation and Synthetic Generation upon Classification Accuracy
by Eleni Tsalera, Andreas Papadakis, Gerasimos Pagiatakis and Maria Samarakou
J. Sens. Actuator Netw. 2025, 14(5), 91; https://doi.org/10.3390/jsan14050091 - 9 Sep 2025
Viewed by 795
Abstract
We investigate the impact of dataset augmentation and synthetic generation techniques on the accuracy of supervised audio classification based on state-of-the-art neural networks used as classifiers. Dataset augmentation techniques are applied upon the raw sound and its transformed image format. Specifically, sound augmentation techniques are applied prior to spectral-based transformation and include time stretching, pitch shifting, noise addition, volume controlling, and time shifting. Image augmentation techniques are applied after the transformation of the sound into a scalogram, involving scaling, shearing, rotation, and translation. Synthetic sound generation is based on the AudioGen generative model, triggered through a series of customized prompts. Augmentation and synthetic generation are applied to three sound categories: (a) human sounds, (b) animal sounds, and (c) sounds of things, with each category containing ten sound classes with 20 samples retrieved from the ESC-50 dataset. Sound- and image-orientated neural network classifiers have been used to classify the augmented datasets and their synthetic additions. VGGish and YAMNet (sound classifiers) employ spectrograms, while ResNet50 and DarkNet53 (image classifiers) employ scalograms. The streamlined AI-based process of augmentation and synthetic generation, classifier fine-tuning, and inference allowed for a consistent, multicriteria comparison of their impact. Classification accuracy has increased for all augmentation and synthetic generation scenarios; however, the increase has not been uniform among the techniques, the sound types, and the percentage increase in the training set population. The average increase in classification accuracy ranged from 2.05% for ResNet50 to 9.05% for VGGish. Our findings reinforce the benefit of audio augmentation and synthetic generation, providing guidelines to avoid accuracy degradation due to overuse and distortion of key audio features. Full article
(This article belongs to the Special Issue AI-Assisted Machine-Environment Interaction)
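
As a rough illustration of the waveform-level augmentations named in the abstract (time stretching, pitch shifting, noise addition, volume control, time shifting), the sketch below applies each one with librosa; the parameter values and example clip are assumptions, not the authors' settings.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> dict:
    """Return one augmented variant of a mono waveform per technique."""
    rng = np.random.default_rng(0)
    return {
        "time_stretch": librosa.effects.time_stretch(y, rate=1.1),        # 10% faster
        "pitch_shift": librosa.effects.pitch_shift(y, sr=sr, n_steps=2),  # +2 semitones
        "noise": y + 0.005 * rng.standard_normal(len(y)),                 # additive noise
        "volume": 0.7 * y,                                                # gain control
        "time_shift": np.roll(y, int(0.1 * sr)),                          # 100 ms circular shift
    }

y, sr = librosa.load(librosa.example("trumpet"), sr=22050, mono=True)  # placeholder clip
variants = augment(y, sr)
print({name: v.shape for name, v in variants.items()})
```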

18 pages, 1127 KB  
Article
Comparative Analysis of Machine Learning Techniques in Enhancing Acoustic Noise Loggers’ Leak Detection
by Samer El-Zahab, Eslam Mohammed Abdelkader, Ali Fares and Tarek Zayed
Water 2025, 17(16), 2427; https://doi.org/10.3390/w17162427 - 17 Aug 2025
Viewed by 3533
Abstract
Urban areas face a significant challenge with water pipeline leaks, resulting in resource wastage and economic consequences. The application of noise logger sensors, integrated with ensemble machine learning, emerges as a promising real-time monitoring solution, enhancing efficiency in Water Distribution Networks (WDNs) and mitigating environmental impacts. The paper investigates the integrated use of Noise Loggers with machine learning models, including Support Vector Machines (SVMs), Random Forest (RF), Naïve Bayes (NB), K-Nearest Neighbors (KNN), Decision Tree (DT), Logistic Regression (LogR), Multi-Layer Perceptron (MLP), and YAMNet, along with ensemble models, for effective leak detection. The study utilizes a dataset comprising 2110 sound signals collected from various locations in Hong Kong through wireless acoustic Noise Loggers. The RF model stands out with 93.68% accuracy, followed closely by KNN at 93.40%, and MLP with 92.15%, demonstrating machine learning’s potential in scrutinizing acoustic signals. The ensemble model, combining these diverse models, achieves an impressive 94.40% accuracy, surpassing individual models and YAMNet. The comparison of various machine learning models provides researchers with valuable insights into the use of machine learning for leak detection applications. Additionally, this paper introduces a novel method to develop a robust ensemble leak detection model by selecting the best-performing machine learning models. Full article
(This article belongs to the Special Issue Advances in Management and Optimization of Urban Water Networks)
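
A minimal sketch of the ensemble idea described above, using scikit-learn's soft-voting combination of the RF, KNN, and MLP families named in the abstract; the synthetic features stand in for the Hong Kong noise-logger signals, and the base-model hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic leak/no-leak features standing in for acoustic descriptors of the signals.
X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across the base models
)
ensemble.fit(X_tr, y_tr)
print(f"ensemble accuracy: {ensemble.score(X_te, y_te):.3f}")
```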

14 pages, 2927 KB  
Article
Optimizing MFCC Parameters for Breathing Phase Detection
by Assel K. Zhantleuova, Yerbulat K. Makashev and Nurzhan T. Duzbayev
Sensors 2025, 25(16), 5002; https://doi.org/10.3390/s25165002 - 13 Aug 2025
Viewed by 600
Abstract
Breathing phase detection is fundamental for various clinical and digital health applications, yet standard Mel Frequency Cepstral Coefficients (MFCCs) settings often limit classification performance. This study systematically optimized MFCC parameters, specifically the number of coefficients, frame length, and hop length, using a proprietary dataset of respiratory sounds (n = 1500 segments). Classification performance was evaluated using Support Vector Machines (SVMs) and benchmarked against deep learning models (VGGish, YAMNet, MobileNetV2). Optimal parameters (30 MFCC coefficients, 800 ms frame length, 10 ms hop length) substantially enhanced accuracy (87.16%) compared to default settings (80.96%) and performed equivalently or better than deep learning methods. A trade-off analysis indicated that a clinically practical frame length of 200–300 ms balanced accuracy (85.08%) and latency effectively. The study concludes that optimized MFCC parameters significantly improve respiratory phase classification, providing efficient and interpretable solutions suitable for real-time clinical monitoring. Future research should focus on validating these parameters in broader clinical contexts and exploring multimodal and federated learning strategies. Full article
(This article belongs to the Section Biomedical Sensors)
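
For readers who want to reproduce the feature setup, the sketch below extracts MFCCs with the parameter values reported as optimal in the abstract (30 coefficients, 800 ms frames, 10 ms hop); the input file and the 16 kHz sampling rate are assumptions for illustration.

```python
import librosa

SR = 16000
FRAME_LEN = int(0.800 * SR)   # 800 ms analysis window
HOP_LEN = int(0.010 * SR)     # 10 ms hop

y, sr = librosa.load("breath_segment.wav", sr=SR, mono=True)  # placeholder recording
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=30,
    n_fft=FRAME_LEN, win_length=FRAME_LEN, hop_length=HOP_LEN,
)
print(mfcc.shape)  # (30, n_frames): the feature matrix fed to the SVM classifier
```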

25 pages, 6169 KB  
Article
Elephant Sound Classification Using Deep Learning Optimization
by Hiruni Dewmini, Dulani Meedeniya and Charith Perera
Sensors 2025, 25(2), 352; https://doi.org/10.3390/s25020352 - 9 Jan 2025
Cited by 4 | Viewed by 3230
Abstract
Elephant sound identification is crucial in wildlife conservation and ecological research. The identification of elephant vocalizations provides insights into their behavior, social dynamics, and emotional expressions, supporting elephant conservation. This study addresses elephant sound classification utilizing raw audio processing. Our focus lies on exploring lightweight models suitable for deployment on resource-constrained edge devices, including MobileNet, YAMNet, and RawNet, alongside introducing a novel model termed ElephantCallerNet. Notably, our investigation reveals that the proposed ElephantCallerNet achieves an impressive accuracy of 89% in classifying raw audio directly without converting it to spectrograms. Leveraging Bayesian optimization techniques, we fine-tuned crucial parameters such as learning rate, dropout, and kernel size, thereby enhancing the model’s performance. Moreover, we scrutinized the efficacy of spectrogram-based training, a prevalent approach in animal sound classification. Comparative analysis shows that raw audio processing outperforms spectrogram-based methods. In contrast to other models in the literature that primarily focus on a single caller type or binary classification that identifies whether a sound is an elephant voice or not, our solution is designed to classify three distinct caller types, namely roar, rumble, and trumpet. Full article
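
Since the paper classifies raw audio directly, a hypothetical Keras sketch of a small 1D CNN over raw waveforms for the three caller types is shown below; this is not ElephantCallerNet, whose layer configuration and Bayesian-optimized hyperparameters are not given here.

```python
import tensorflow as tf

SAMPLE_RATE = 16000
CLIP_SECONDS = 4
NUM_CLASSES = 3  # roar, rumble, trumpet

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SAMPLE_RATE * CLIP_SECONDS, 1)),
    tf.keras.layers.Conv1D(16, kernel_size=64, strides=4, activation="relu"),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(32, kernel_size=32, strides=2, activation="relu"),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(64, kernel_size=16, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```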

18 pages, 3930 KB  
Article
Implementation of an Automatic Meeting Minute Generation System Using YAMNet with Speaker Identification and Keyword Prompts
by Ching-Ta Lu and Liang-Yu Wang
Appl. Sci. 2024, 14(13), 5718; https://doi.org/10.3390/app14135718 - 29 Jun 2024
Cited by 2 | Viewed by 2771
Abstract
Producing conference/meeting minutes requires a person to simultaneously identify a speaker and the speaking content during the course of the meeting. This recording process is a heavy task, so reducing the workload of taking meeting minutes is valuable for most people. In addition, providing conference/meeting highlights in real time is helpful to the meeting process. In this study, we aim to implement an automatic meeting minutes generation system (AMMGS) for recording conference/meeting minutes. A speech recognizer transforms speech signals to obtain the conference/meeting text. Accordingly, the proposed AMMGS can reduce the effort of recording the minutes, and all meeting members can concentrate on the meeting because taking minutes manually is unnecessary. The AMMGS includes speaker identification for Mandarin Chinese speakers, keyword spotting, and speech recognition. Transfer learning on YAMNet lets the network identify specified speakers, so the proposed AMMGS can automatically generate conference/meeting minutes with labeled speakers. Furthermore, the AMMGS applies the Jieba segmentation tool for keyword spotting: the system counts word frequencies, and keywords are determined from the most frequently occurring segmented words. These keywords help attendees stay aligned with the agenda. The experimental results reveal that the proposed AMMGS can accurately identify speakers and recognize speech. Accordingly, the AMMGS can generate conference/meeting minutes while keywords are spotted effectively. Full article
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)
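
A minimal sketch of the keyword-spotting step described above: segment recognized Mandarin text with Jieba and rank words by frequency. The sample transcript and the single-character filter are illustrative assumptions.

```python
from collections import Counter
import jieba

# Placeholder Mandarin transcript produced by the speech recognizer.
transcript = "今天的会议讨论产品上线时间，市场部门汇报上线前的宣传计划。"
words = [w for w in jieba.cut(transcript) if len(w) > 1]  # drop single characters and punctuation
keywords = Counter(words).most_common(5)                  # top-5 most frequent segmented words
print(keywords)
```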

23 pages, 21874 KB  
Article
Speech Emotion Recognition Using Deep Learning Transfer Models and Explainable Techniques
by Tae-Wan Kim and Keun-Chang Kwak
Appl. Sci. 2024, 14(4), 1553; https://doi.org/10.3390/app14041553 - 15 Feb 2024
Cited by 13 | Viewed by 6718
Abstract
This study aims to establish greater reliability compared to conventional speech emotion recognition (SER) studies. This is achieved through preprocessing techniques that reduce uncertainty elements, models that combine the structural features of each model, and the application of various explanatory techniques. The ability to interpret can be made more accurate by reducing uncertain learning data, applying data in different environments, and applying techniques that explain the reasoning behind the results. We designed a generalized model using three different datasets, and each speech was converted into a spectrogram image through STFT preprocessing. The spectrogram was divided along the time domain with overlapping segments to match the input size of the model. Each divided section is expressed as a Gaussian distribution, and the quality of the data is investigated by the correlation coefficient between distributions. As a result, the scale of the data is reduced, and uncertainty is minimized. VGGish and YAMNet are the most representative pretrained deep learning networks frequently used in conjunction with speech processing. In dealing with speech signal processing, it is frequently advantageous to use these pretrained models synergistically rather than exclusively, resulting in the construction of ensemble deep networks. Finally, various explainable models (Grad-CAM, LIME, occlusion sensitivity) are used to analyze the classification results. The model exhibits adaptability to voices in various environments, yielding a classification accuracy of 87%, surpassing that of individual models. Additionally, output results are confirmed by an explainable model to extract essential emotional areas, converted into audio files for auditory analysis using Grad-CAM in the time domain. Through this study, we address the uncertainty of the activation areas generated by Grad-CAM. We achieve this by applying the interpretable ability from previous studies, along with effective preprocessing and fusion models. The results can also be analyzed from a more diverse perspective through other explainable techniques. Full article
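
As an illustration of the preprocessing pipeline described above (STFT spectrogram, then overlapping time-domain sections matched to the model input), the sketch below slices a magnitude spectrogram with 50% overlap; the FFT size, hop, slice width, and example clip are assumptions rather than the authors' exact settings.

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.example("libri1"), sr=16000, mono=True)  # placeholder speech clip
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))             # magnitude spectrogram
spec_db = librosa.amplitude_to_db(spec, ref=np.max)

WIDTH, STEP = 224, 112   # slice width and 50% overlap, in spectrogram frames
slices = [spec_db[:, i:i + WIDTH]
          for i in range(0, spec_db.shape[1] - WIDTH + 1, STEP)]
print(len(slices), "overlapping slices for the classifier input")
```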

24 pages, 2641 KB  
Article
Analysis of Distance and Environmental Impact on UAV Acoustic Detection
by Diana Tejera-Berengue, Fangfang Zhu-Zhou, Manuel Utrilla-Manso, Roberto Gil-Pita and Manuel Rosa-Zurera
Electronics 2024, 13(3), 643; https://doi.org/10.3390/electronics13030643 - 4 Feb 2024
Cited by 12 | Viewed by 7246
Abstract
This article explores the challenge of acoustic drone detection in real-world scenarios, with an emphasis on the impact of distance, to see how sound propagation affects drone detection. Learning machines of varying complexity are used for detection, ranging from simpler methods such as linear discriminant, multilayer perceptron, support vector machines, and random forest to more complex approaches based on deep neural networks like YAMNet. Our evaluation meticulously assesses the performance of these methods using a carefully curated database of a wide variety of drones and interference sounds. This database, processed through array signal processing and influenced by ambient noise, provides a realistic basis for our analyses. For this purpose, two different training strategies are explored. In the first approach, the learning machines are trained with unattenuated signals, aiming to preserve the inherent information of the sound sources. Subsequently, testing is then carried out under attenuated conditions at various distances, with interfering sounds. In this scenario, effective detection is achieved up to 200 m, which is particularly notable with the linear discriminant method. The second strategy involves training and testing with attenuated signals to consider different distances from the source. This strategy significantly extends the effective detection ranges, reaching up to 300 m for most methods and up to 500 m for the YAMNet-based detector. Additionally, this approach raises the possibility of having specialized detectors for specific distance ranges, significantly expanding the range of effective drone detection. Our study highlights the potential of drone acoustic detection at different distances and encourages further exploration in this research area. Unique contributions include the discovery that training with attenuated signals with a worse signal-to-noise ratio allows the improvement of the general performance of learning machine-based detectors, increasing the effective detection range achieved, and the feasibility of real-time detection, even with very complex learning machines, opening avenues for practical applications in real-world surveillance scenarios. Full article
(This article belongs to the Section Circuit and Signal Processing)
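
A minimal sketch of the distance-attenuation idea used for the second training strategy, assuming simple spherical spreading (about -6 dB per doubling of distance from a 1 m reference); the paper's propagation model may additionally include atmospheric absorption and interfering noise.

```python
import numpy as np

def attenuate(y: np.ndarray, distance_m: float, ref_m: float = 1.0) -> np.ndarray:
    """Scale a waveform recorded at ref_m as if it were heard at distance_m."""
    gain_db = -20.0 * np.log10(distance_m / ref_m)   # geometric spreading only
    return y * (10.0 ** (gain_db / 20.0))

y = np.random.default_rng(0).standard_normal(16000).astype(np.float32)  # placeholder drone clip
for d in (50, 100, 200, 300, 500):
    rms = float(np.sqrt(np.mean(attenuate(y, d) ** 2)))
    print(f"{d} m -> RMS {rms:.5f}")
```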

15 pages, 2735 KB  
Article
Sound-Event Detection of Water-Usage Activities Using Transfer Learning
by Seung Ho Hyun
Sensors 2024, 24(1), 22; https://doi.org/10.3390/s24010022 - 19 Dec 2023
Cited by 9 | Viewed by 3465
Abstract
In this paper, a sound event detection method is proposed for estimating three types of bathroom activities—showering, flushing, and faucet usage—based on the sounds of water usage in the bathroom. The proposed approach has a two-stage structure. First, the general sound classification network YAMNet is utilized to determine the existence of a general water sound; if the input data contains water sounds, W-YAMNet, a modified network of YAMNet, is then triggered to identify the specific activity. W-YAMNet is designed to accommodate the acoustic characteristics of each bathroom. In training W-YAMNet, the transfer learning method is applied to utilize the advantages of YAMNet and to address its limitations. Various parameters, including the length of the audio clip, were experimentally analyzed to identify trends and suitable values. The proposed method is implemented in a Raspberry-Pi-based edge computer to ensure privacy protection. Applying this methodology to 10-min segments of continuous audio data yielded promising results. However, the accuracy could still be further enhanced, and the potential for utilizing the data obtained through this approach in assessing the health and safety of elderly individuals living alone remains a topic for future investigation. Full article
(This article belongs to the Special Issue AI for Smart Home Automation: 2nd Edition)
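
The two-stage structure described above can be sketched as follows: stock YAMNet from TF Hub acts as a gate for water-related sound, and positive clips would then be passed to the fine-tuned second-stage model. The water-class keyword filter, threshold, and placeholder clip are assumptions, and the second stage (W-YAMNet) is only indicated by a comment.

```python
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
with tf.io.gfile.GFile(yamnet.class_map_path().numpy()) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

WATER_KEYWORDS = ("water", "faucet", "flush", "bathtub")  # assumed filter over AudioSet names

def is_water_sound(waveform_16k: np.ndarray, threshold: float = 0.2) -> bool:
    """Stage 1: does stock YAMNet hear any water-related class in this clip?"""
    scores, _, _ = yamnet(waveform_16k.astype(np.float32))
    mean_scores = scores.numpy().mean(axis=0)
    idx = [i for i, name in enumerate(class_names)
           if any(k in name.lower() for k in WATER_KEYWORDS)]
    return float(mean_scores[idx].max()) >= threshold

clip = np.zeros(16000, dtype=np.float32)   # placeholder 1 s clip at 16 kHz
if is_water_sound(clip):
    pass  # Stage 2: hand the clip to the fine-tuned bathroom-activity classifier
```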

14 pages, 5185 KB  
Article
Speech Emotion Recognition Based on Two-Stream Deep Learning Model Using Korean Audio Information
by A-Hyeon Jo and Keun-Chang Kwak
Appl. Sci. 2023, 13(4), 2167; https://doi.org/10.3390/app13042167 - 8 Feb 2023
Cited by 24 | Viewed by 6120
Abstract
Identifying a person’s emotions is an important element in communication. In particular, voice is a means of communication for easily and naturally expressing emotions. Speech emotion recognition technology is a crucial component of human–computer interaction (HCI), in which accurately identifying emotions is key. Therefore, this study presents a two-stream emotion recognition model based on bidirectional long short-term memory (Bi-LSTM) and convolutional neural networks (CNNs) using a Korean speech emotion database, and the performance is comparatively analyzed. The data used in the experiment were obtained from the Korean speech emotion recognition database built by Chosun University. Two deep learning models, Bi-LSTM and YAMNet, which is a CNN-based transfer learning model, were connected in a two-stream architecture to design an emotion recognition model. Various speech feature extraction methods and deep learning models were compared in terms of performance. Consequently, the speech emotion recognition performance of Bi-LSTM and YAMNet was 90.38% and 94.91%, respectively. However, the performance of the two-stream model was 96%, an improvement of between 1.09% and 5.62% over the individual models. Full article
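
A hypothetical Keras sketch of the two-stream fusion idea: a Bi-LSTM over frame-level features and a dense head over a pretrained-CNN embedding (e.g., YAMNet's 1024-d output), concatenated before the emotion classifier. Layer sizes, the 40-d frame features, and the number of emotion classes are assumptions, not the authors' architecture.

```python
import tensorflow as tf

NUM_EMOTIONS = 4  # assumed number of emotion classes

seq_in = tf.keras.Input(shape=(None, 40), name="frame_features")     # stream 1: e.g. MFCC frames
emb_in = tf.keras.Input(shape=(1024,), name="pretrained_embedding")  # stream 2: e.g. YAMNet output

x1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(seq_in)
x2 = tf.keras.layers.Dense(128, activation="relu")(emb_in)
fused = tf.keras.layers.Concatenate()([x1, x2])
out = tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax")(fused)

model = tf.keras.Model([seq_in, emb_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```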

22 pages, 8024 KB  
Article
Comparison of Pre-Trained CNNs for Audio Classification Using Transfer Learning
by Eleni Tsalera, Andreas Papadakis and Maria Samarakou
J. Sens. Actuator Netw. 2021, 10(4), 72; https://doi.org/10.3390/jsan10040072 - 10 Dec 2021
Cited by 105 | Viewed by 19259
Abstract
The paper investigates retraining options and the performance of pre-trained Convolutional Neural Networks (CNNs) for sound classification. CNNs were initially designed for image classification and recognition, and, at a second phase, they extended towards sound classification. Transfer learning is a promising paradigm, retraining already trained networks upon different datasets. We selected three ‘Image’- and two ‘Sound’-trained CNNs, namely, GoogLeNet, SqueezeNet, ShuffleNet, VGGish, and YAMNet, and applied transfer learning. We explored the influence of key retraining parameters, including the optimizer, the mini-batch size, the learning rate, and the number of epochs, on the classification accuracy and the processing time needed in terms of sound preprocessing for the preparation of the scalograms and spectrograms as well as CNN training. The UrbanSound8K, ESC-10, and Air Compressor open sound datasets were employed. Using a two-fold criterion based on classification accuracy and time needed, we selected the ‘champion’ transfer-learning parameter combinations, discussed the consistency of the classification results, and explored possible benefits from fusing the classification estimations. The Sound CNNs achieved better classification accuracy, reaching an average of 96.4% for UrbanSound8K, 91.25% for ESC-10, and 100% for the Air Compressor dataset. Full article
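
As a sketch of the retraining setup the paper explores, the snippet below replaces the head of an image-pretrained CNN and iterates over optimizer, mini-batch size, learning rate, and epoch options; MobileNetV2 stands in for the CNNs studied in the paper, and the `train_ds`/`val_ds` dataset pipeline is an assumed placeholder (the fit call is left commented out).

```python
import itertools
import tensorflow as tf

def build_transfer_model(num_classes: int) -> tf.keras.Model:
    base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                              input_shape=(224, 224, 3))
    base.trainable = False  # retrain only the new classification head
    head = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    return tf.keras.Model(base.input, head)

optimizers = {
    "adam": tf.keras.optimizers.Adam,
    "sgdm": lambda lr: tf.keras.optimizers.SGD(lr, momentum=0.9),
}
for (opt_name, make_opt), batch, lr, epochs in itertools.product(
        optimizers.items(), [16, 32], [1e-3, 1e-4], [10, 20]):
    model = build_transfer_model(num_classes=10)  # fresh pretrained weights per configuration
    model.compile(optimizer=make_opt(lr), loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_ds.batch(batch), validation_data=val_ds.batch(batch), epochs=epochs)
    print(opt_name, batch, lr, epochs)  # one retraining configuration per combination
```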

16 pages, 27093 KB  
Article
Deep Transfer Learning for Machine Diagnosis: From Sound and Music Recognition to Bearing Fault Detection
by Eugenio Brusa, Cristiana Delprete and Luigi Gianpio Di Maggio
Appl. Sci. 2021, 11(24), 11663; https://doi.org/10.3390/app112411663 - 8 Dec 2021
Cited by 38 | Viewed by 6017
Abstract
Today’s deep learning strategies require ever-increasing computational effort and very large amounts of labelled data. Providing such expensive resources for machine diagnosis is highly challenging. Transfer learning recently emerged as a valuable approach to address these issues. Thus, the knowledge learned by deep architectures in different scenarios can be reused for the purpose of machine diagnosis, minimizing data collection efforts. Existing research provides evidence that networks pre-trained for image recognition can classify machine vibrations in the time-frequency domain by means of transfer learning. So far, however, there has been little discussion of the potential of networks pre-trained for sound recognition, which are inherently suited for time-frequency tasks. This work argues that deep architectures trained for music recognition and sound detection can perform machine diagnosis. The YAMNet convolutional network was designed to serve extremely efficient mobile applications for sound detection, and it was originally trained on millions of audio examples extracted from YouTube clips. This framework is employed to detect bearing faults on the CWRU dataset. It is shown that transferring knowledge from sound and music recognition to bearing fault detection is successful. The maximum accuracy is achieved using only a few hundred samples for fine-tuning the fault diagnosis model. Full article
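
A hedged sketch of the transfer idea: reuse sound-pretrained YAMNet as a fixed feature extractor for (resampled) vibration signals and train a small classifier on the embeddings. This illustrates the knowledge-transfer concept only; it is not the authors' fine-tuning pipeline, and the random signals stand in for CWRU recordings.

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.linear_model import LogisticRegression

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def embed(waveform_16k: np.ndarray) -> np.ndarray:
    """Mean YAMNet embedding (1024-d) over all patches of one signal."""
    _, embeddings, _ = yamnet(waveform_16k.astype(np.float32))
    return embeddings.numpy().mean(axis=0)

# Random signals standing in for healthy (0) and faulty (1) bearing recordings.
rng = np.random.default_rng(0)
signals = [rng.standard_normal(16000) for _ in range(20)]
labels = [0] * 10 + [1] * 10
X = np.stack([embed(s) for s in signals])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```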

13 pages, 4138 KB  
Article
Audio-Based Aircraft Detection System for Safe RPAS BVLOS Operations
by Jorge Mariscal-Harana, Víctor Alarcón, Fidel González, Juan José Calvente, Francisco Javier Pérez-Grau, Antidio Viguria and Aníbal Ollero
Electronics 2020, 9(12), 2076; https://doi.org/10.3390/electronics9122076 - 5 Dec 2020
Cited by 7 | Viewed by 5736
Abstract
For the Remotely Piloted Aircraft Systems (RPAS) market to continue its current growth rate, cost-effective ‘Detect and Avoid’ systems that enable safe beyond visual line of sight (BVLOS) operations are critical. We propose an audio-based ‘Detect and Avoid’ system, composed of microphones and an embedded computer, which performs real-time inferences using a sound event detection (SED) deep learning model. Two state-of-the-art SED models, YAMNet and VGGish, are fine-tuned using our dataset of aircraft sounds and their performances are compared for a wide range of configurations. YAMNet, whose MobileNet architecture is designed for embedded applications, outperformed VGGish both in terms of aircraft detection and computational performance. YAMNet’s optimal configuration, with >70% true positive rate and precision, results from combining data augmentation and undersampling with the highest available inference frequency (i.e., 10 Hz). While our proposed ‘Detect and Avoid’ system already allows the detection of small aircraft from sound in real time, additional testing using multiple aircraft types is required. Finally, a larger training dataset, sensor fusion, or remote computations on cloud-based services could further improve system performance. Full article
(This article belongs to the Special Issue Deep Learning Technologies for Machine Vision and Audition)
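
The real-time operation implied by the abstract (inference at 10 Hz on an embedded computer) can be sketched as a rolling-buffer loop like the one below; the `classify` and `read_mic` placeholders stand in for the fine-tuned model and the microphone interface.

```python
import collections
import time
import numpy as np

SR = 16000
WINDOW = SR          # ~1 s of audio per inference
HOP_SECONDS = 0.1    # 10 Hz inference rate

def classify(window: np.ndarray) -> float:
    return 0.0  # placeholder: aircraft probability from the fine-tuned model

def read_mic(n_samples: int) -> np.ndarray:
    return np.zeros(n_samples, dtype=np.float32)  # placeholder microphone read

ring = collections.deque(maxlen=WINDOW)
for _ in range(10):  # one second of simulated operation
    ring.extend(read_mic(int(SR * HOP_SECONDS)))
    if len(ring) == WINDOW:
        prob = classify(np.asarray(ring, dtype=np.float32))
        print("aircraft detected" if prob > 0.5 else "clear")
    time.sleep(HOP_SECONDS)
```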
