Article

An Ensemble of Convolutional Neural Networks for Sound Event Detection

1 Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea
2 Department of Information Technologies, Samarkand Branch of Tashkent University of Information Technologies Named After Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(9), 1502; https://doi.org/10.3390/math13091502
Submission received: 12 March 2025 / Revised: 24 April 2025 / Accepted: 30 April 2025 / Published: 1 May 2025
(This article belongs to the Special Issue Advanced Machine Vision with Mathematics)

Abstract

Sound event detection tasks are rapidly advancing in the field of pattern recognition, and deep learning methods are particularly well suited for such tasks. One of the important directions in this field is to detect the sounds of emotional events around residential buildings in smart cities and quickly assess the situation for security purposes. This research presents a comprehensive study of an ensemble convolutional recurrent neural network (CRNN) model designed for sound event detection (SED) in residential and public safety contexts. The work focuses on extracting meaningful features from audio signals using image-based representations, such as Discrete Cosine Transform (DCT) spectrograms, Cochleagrams, and Mel spectrograms, to enhance robustness against noise and improve feature extraction. In collaboration with police officers, a two-hour dataset consisting of 112 clips related to four classes of emotional sounds (harassment, quarrels, screams, and breaking sounds) was prepared. In addition to the crowdsourced dataset, publicly available datasets were used to broaden the study’s applicability. The combined dataset contains 5055 strongly labeled audio files of different lengths totaling 14.14 h, covering 13 separate sound categories. The proposed CRNN model integrates spatial and temporal feature extraction by processing these spectrograms through convolutional and bi-directional gated recurrent unit (GRU) layers. An ensemble approach combines predictions from three models, achieving F1 scores of 71.5% for segment-based metrics and 46% for event-based metrics. The results demonstrate the model’s effectiveness in detecting sound events under noisy conditions, even with a small, unbalanced dataset. This research highlights the potential of the model for real-time audio surveillance systems using mini-computers, offering cost-effective and accurate solutions for maintaining public order.

1. Introduction

1.1. Research Context and Motivation

It is evident that advancements in machine learning and deep learning technologies have facilitated the creation of highly accurate systems for image processing [1], audio data analysis, speech recognition [2], and various other related applications. Concurrently, these technologies have given rise to new tasks that can find application in the fields of security, early detection, and the prevention of emergency situations. One of the most urgent problems today is the development of low-cost hardware and software that can recognize emotional sounds in emergency situations in the environment, make decisions, and promptly provide information to the situation center, especially in smart cities, to ensure safety. Cities are the result of constant evolutionary processes and are ideal places to test and implement new digital technologies in order to improve the living conditions of the population. Among the many challenges that are typically prioritized in smart cities are those related to public safety. These involve the use of low-cost sensor-based surveillance capabilities that are strategically placed in specific areas (crowded or uncrowded) to create a level of virtual monitoring depending on the situation. Such remote surveillance systems can be implemented in the conventional way using video cameras. However, in order to cover all the designated surveillance zones, several video cameras are needed to continuously record the surrounding environment.
The challenges associated with this approach include the cost of hardware and infrastructure, the complexity of maintenance, high power consumption, a large number of internal customizations, substantial processing power requirements, and dependence on large storage devices. In addition, security personnel are needed to closely monitor the resulting video streams.
A more convenient and cheaper option is to use environmental sounds: instead of video, audio from the observed environment is analyzed. Rather than relying on people to constantly listen to the environment, it is easier to create a system that automatically detects potential or ongoing incidents posing a risk to public safety and makes timely decisions. Detecting emotional sounds in environmental events is more difficult than processing speech or music signals because sounds captured in the environment are subject to strong non-stationary noise and interference. This is why new methods are needed at each stage of sound processing and classification.

1.2. Research Aims and Contributions

The detection of sound events in the environment aims to digitally process a continuous acoustic signal and convert it into a specific description of the relevant sound events present in the listening scene. The field of research that studies this process is called digital auditory scene analysis [3].
Analysis systems can be separated into two categories depending on whether they extract temporal information from the sounds being analyzed. Performing detection refers to a system that outputs information on the temporal activity of target classes. Different time scales are utilized depending on the requirements of the application, and detection can be performed for one or multiple sounds. If temporal information is not output, the system performs classification, which can output only one of the possible classes for the analyzed object, or labeling, which can output multiple classes. In machine learning terminology, this is commonly referred to as multi-label classification (Figure 1).
Automatic sound event detection (SED) systems have been employed in a number of applications including the contextual indexing and searching of multimedia databases [4,5], unobtrusive monitoring in healthcare [6], surveillance [7], and military applications [8]. Some information regarding acoustic phenomena can be used in other research areas, for instance, in audio context recognition [9,10], automatic labeling [11], and audio segmentation tasks [12].
This research includes a comprehensive study of an ensemble CRNN model and the use of image-based methods to extract important features from audio signals in order to quickly assess sounds in situations related to maintaining public order in residential buildings and various public safety problems. Image-based data representation provides a deep and informative description of acoustic signals. This representation enables the effective extraction of features important for recognizing sound events while suppressing the influence of noise. Ensemble models combine predictions from multiple base models, resulting in reduced variance, lower bias, and improved generalization compared to stand-alone models. This aggregation of diverse methodologies mitigates individual model errors, enhances robustness against overfitting, and typically yields more accurate and stable predictions across varied datasets.
The main contribution of this research includes the development of a novel dataset containing audio recordings of various disruptive and hazardous events that may occur at the entrances of residential buildings, such as harassment, quarrels, screams induced by alcohol or other substances, and the breaking of windows and doors. Additionally, we propose an image-based feature extraction method utilizing Discrete Cosine Transform (DCT) spectrograms, which enhances robustness against external noise and interference. Furthermore, we explore the effectiveness of Cochleagram and Mel spectrogram representations for audio data visualization in combination with DCT spectrograms to improve feature extraction. Finally, we introduce a CRNN model capable of extracting reliable acoustic features from each frequency–time spectrogram in a time series and integrating them into a unified ensemble model for SED.

1.3. Structure of the Paper

The structure of this article is laid out as follows: Section 2 presents an overview of the currently popular approaches. In Section 3, we describe the materials and methods used for image-based sound event detection. Section 4 concentrates on the data preparation and training process using deep learning techniques. The paper concludes with Section 5, which summarizes the main results and essential aspects of our discussion.

2. Related Work

Over the years, numerous researchers have conducted research on the classification of sound events using various methods and approaches. These range from approaches based on parametric signal processing [11] to methods for developing automatic speech recognition systems [13], typically using Mel-frequency cepstral coefficients (MFCCs) and other similar spectral methods [14], as well as signal parameters in the frequency–time domain [15], as basic parameters. In the work presented by Guodong Guo and Stan Z. Li [16], perceptual features composed of total power, sub-band power, brightness, bandwidth, pitch, and MFCCs were used as signal parameters to recognize voice signals, and support vector machines (SVMs) were used as classifiers. With the combination of perceptual features and eight MFCCs, the classification results achieved an error rate of 11.0%.
In the work of Giambattista Parascandolo et al. [17], Bi-directional Long Short-Term Memory (BLSTM) recurrent neural network (RNN) architectures were used to solve the classification problem of polyphonic audio events consisting of 61 classes based on 40 MFCCs from a signal with a frame size of 50 ms and an overlap of 50%. As a result, the classification system achieved an F1 score of 65.5% in 1 s blocks and 64.7% on single frames. In another research work on sound classification, Victor Bisot et al. [18] used an image-based approach (spectrogram images) for acoustic scene classification. They achieved an accuracy of 92.6% using a Sinkhorn kernel classifier by extracting sub-band power distribution (SPD) and histogram of gradients (HOG) features from the spectrogram image. A similar approach in the work of Alain Rakotomamonjy and Gilles Gasso [19] is based on converting the audio signal into a frequency–time representation and then extracting relevant shape features and changes in the frequency–time structure. These features were based on a gradient histogram and then fed into SVMs for classification, achieving an average recognition accuracy of 96%. Emre Çakır et al. [20] introduced a hybrid approach, the CRNN, which integrates the advantages of CNNs and RNNs. This CRNN model was applied to polyphonic SED and demonstrated significant performance improvements over standalone CNNs, RNNs, and other conventional methods across four datasets of common sound events. In another approach to address the problem of audio event detection based on deep learning, Miquel Espi et al. [21] considered two approaches that emphasize the importance of extracting features from audio. The first approach combines results from several high-resolution spectrogram models at different spectral resolutions, and its superiority over a single-resolution model was demonstrated experimentally. In the second approach, a CNN was used to model the local characteristics of acoustic events, which gave superior results. Although the second approach has been shown to be effective, the first approach is promising and emphasizes the need to improve the combination schemes.
McLaughlin et al. [22] proposed a sound event classification system that compares the external features of an auditory image with external features based on spectrogram images using SVMs and deep learning classifiers. The performance of the system was evaluated on a standard robust classification task at different noise levels, several improvements to the system were introduced, and the results were compared against those of state-of-the-art classification methods. Parallel research has focused on leveraging image-based representations of audio signals. Sharan and Moir [23] showed that incorporating Cochleagram features into CNN architectures enhanced robustness in noisy conditions, reporting an improvement in classification accuracy of approximately 8% over conventional spectrogram methods, with some configurations achieving accuracies exceeding 90% under controlled conditions. Similarly, Dennis et al. [24] conducted an analysis of various spectrogram image methods and found that log-Mel spectrograms often outperformed other representations, yielding classification accuracies around 85% on standard SED tasks.
The applicability of these approaches in real-world scenarios has also been demonstrated in smart city contexts. Spadini et al. [25] reported detection rates reaching about 75% in urban surveillance applications, while Ciaburro and Iannace [26] highlighted that deep neural network (DNN)-based sound event detection methods could improve safety by enhancing detection accuracy by up to 15% compared to traditional algorithms. In efforts to optimize SED for resource-constrained environments, Ranmal et al. [27] utilized hardware-aware neural architecture search to achieve environment sound classification accuracies of 88% on benchmark datasets, thereby enabling efficient edge-device deployment.
Further improvements have been observed through advanced CNN architectures. Zhang et al. [28] demonstrated that robust CNN models could attain recognition accuracies close to 88%, and Kwak and Chung [29] enhanced performance by incorporating derivative features into their networks, achieving overall F1 scores near 78%. Finally, ensemble methods have also proven effective. Nanni et al. [30] reported that an ensemble of CNNs could boost classification performance by 5–7%, with some systems approaching overall accuracies of 90%. Additionally, Xiong et al. [31] extended these techniques to construction activity monitoring, achieving detection accuracies of around 82% using deep learning models. In [32], multi-label neural networks were proposed to detect temporally overlapping audio events in real environments, and the frame-wise spectral characteristics of the signal were taken as the input parameters of the network for multi-label classification. The model was tested on audio recordings from real environments and achieved an accuracy of 63.8%.
Toni Heittola et al. [33] addressed the issue of the context-dependent detection of sound events; the approach proposed by the authors consists of two stages: automatic context recognition and sound event detection. Contexts are modeled using a Gaussian mixture model, and sound events are modeled using three-dimensional hidden Markov models.
Recent advancements in sound event detection (SED) have leveraged deep learning techniques to address various application domains. Zheng et al. [34] proposed a convolutional recurrent neural network (CRNN) system for detecting gastrointestinal sounds using data collected from wearable auscultation devices, demonstrating the potential of CRNNs in medical audio analysis. Similarly, Lim et al. [35] explored SED in domestic environments, employing an ensemble of convolutional and recurrent neural networks to enhance detection accuracy in noisy settings. In the context of audio surveillance, Arslan and Canbolat [36] investigated the performance of deep neural networks, highlighting their effectiveness in identifying sound events under controlled conditions. Furthermore, Hu et al. [37] introduced a track-wise ensemble event-independent network for polyphonic sound event localization and detection, achieving robust performance in complex acoustic scenes. These studies collectively underscore the versatility of deep learning architectures in SED across medical, domestic, surveillance, and polyphonic environments.

3. Materials and Methods

3.1. Workflow

Our methodology was carefully designed to attain our objectives. The project was divided into six segments: data acquisition, pre-processing, data annotation, feature extraction, data augmentation, and acoustic modeling. For data acquisition, the required sounds were collected from crowdsourced and public open datasets. In the pre-processing phase, the audio signals were resampled to achieve a uniform sampling rate. Subsequently, spectral techniques were applied to mitigate the impact of various artifacts and noise, thereby ensuring data consistency and enhancing overall signal quality for further analysis.
In the next step, the data labeling process was carried out, and three different types of spectrogram images, DCT, Cochleagram, and Mel, were obtained during the feature extraction phase. Augmentation techniques were then applied to these spectrogram images, and in the final step an acoustic model based on an ensemble of CRNNs was developed. The process of the proposed methodology is shown in Figure 2.

3.2. Challenges in Sound Event Detection (SED)

The development of automated systems for SED is hampered by a number of issues. Detectable sounds depend on their source in the natural environment, and sounds have different acoustic characteristics (e.g., some are relatively short while others last longer and can be harmonic). External factors, such as the distance of the incident sound from the microphone, and the data collection and annotation procedures further complicate the development of SED systems. In addition, the natural environment is polyphonic, meaning that several sounds are active simultaneously during analysis; sound events demonstrate harmonic relationships with each other, and their fundamental frequencies form small-integer ratios, further complicating the separation of sounds.
There are no predefined rules for how sounds can co-occur; therefore, modeling them requires the use of data-derived statistics, such as direct counts and the degree of overlap between different classes in the data.
The number of possible classes of sound phenomena is probably very large, since any object or creature can produce sound as a natural phenomenon [38]. In practice, each SED application is customized for different classes of sounds and is used in a variety of environments. Given this variability, no datasets or models are universally relevant. This is why each application requires data collection and system architecture to satisfy its unique requirements.
This is in direct opposition to other areas of sound event processing, such as speech recognition, where language has a set of specified classes (phonemes), and language models help acoustic models perform recognition tasks.

3.3. Data Acquisition

Depending on the purpose of the SED, audio signals may vary depending on the acoustic environment (e.g., outside or inside a building, type of reflective surfaces), the relative location of the sound source and microphone, the capture device used, and the source of interfering noise.
The purpose of our study is to detect the sounds of various events that disturb the peace of the population in and around the entrances of multi-story residential buildings: glass breaking, various knocks (door slam, door knock), harassment, fights, shouting caused by alcohol or other substances, various sounds of construction tools, opening and closing of the front door, loud music, car horns, and the sounds of everyday events near the entrance of the building, including sounds of children playing, footsteps, and conversations of ordinary people.
Our data collection project on the sounds mentioned above is called UzSoundEvent. The data collection for this project was conducted jointly with researchers at the “Machine learning and Intelligence” laboratory from the Samarkand branch of the Tashkent University of Information Technologies, named after Muhammad Al-Khwarizmi, and police inspectors in the region. The dataset was collected by two means: crowdsourcing and open source sound event datasets.

3.3.1. Crowdsourcing

The crowdsourcing approach consists of three major stages—namely, choosing places, recording sounds, and audio checking—which will be thoroughly described in the following sections.
Choosing places. To collect sounds disturbing public order and various emotional events (e.g., harassment, fights, shouting, yelling caused by alcohol or other substances, breaking of windows and doors), we, together with police, selected entrances of residential buildings in neighborhoods with unfavorable social conditions and poor lighting.
Recording sounds. A cloud-based approach was chosen to record and collect the sounds. The recording setup included a Raspberry Pi 4 mini-computer with a Wi-Fi module and an omnidirectional electret condenser microphone. Audio was recorded in 16 kHz, 16-bit, mono format. The omnidirectional microphone had a sensitivity of −44 dB and a frequency range of 20 Hz–20 kHz; it can be connected via a breadboard or perfboard, or by soldering wires directly to its leads. All recorded audio was saved in the Google Drive cloud. Figure 3 below shows the functional diagram of the IoT module for recording and storing event sounds.
The sounds of these events were assumed to occur within about 20 m of the building entrance. Sound recording and information gathering were conducted throughout the year, from 19:00 to 07:00, at 26 entrances of 13 residential buildings.
Audio checking. To ensure the high quality of the data gathered, the recorded sounds were reviewed by project participants. In order to quickly extract the event sounds needed for our study from the large amount of audio data, audio data from moments of rule violations recorded by the police during the study period were checked, and event sounds were extracted. In total, there were 6 incidents of violence (harassment), 36 fights and quarrels, 33 episodes of screaming under the influence of alcohol or other intoxicants, and 37 cases of breaking windows and doors.
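A minimal recording sketch is given below, assuming the sounddevice and soundfile Python packages are installed on the Raspberry Pi; the chunk duration, file name, and upload step are illustrative placeholders rather than the project's actual scripts.

```python
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16_000   # 16 kHz, as used in this study
CHANNELS = 1           # mono
DURATION_S = 60        # illustrative one-minute recording chunk

def record_chunk(path="chunk.wav"):
    """Record a mono 16-bit chunk from the default (electret/USB) microphone."""
    audio = sd.rec(int(DURATION_S * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE,
                   channels=CHANNELS,
                   dtype="int16")
    sd.wait()  # block until the recording is finished
    sf.write(path, audio, SAMPLE_RATE, subtype="PCM_16")
    return path

if __name__ == "__main__":
    wav_path = record_chunk()
    # The saved file would then be uploaded to cloud storage
    # (e.g., through a Google Drive API client), as in Figure 3.
```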

3.3.2. Open Source Datasets

Currently, there are ready-made open source datasets such as UrbanSound8K [39], AudioSet Ontology [40], FSD50K [41], TUT Sound Events 2017 [42], and ESC-50 [43]. In our study, we used other relevant event sounds (different types of construction tool sounds, doors opening and closing, loud music sounds, car horns, footsteps, children crying, gunshots, sirens) from these datasets.

3.4. Data Annotation

The problem of classifying sound events is one of multi-label classification, and therefore, there are rules for labeling data [44]. To support multi-label classification in short time segments, the audio annotation should also contain information about audio events in short time segments. There are special requirements for this type of annotation, which is called strong labeling. That is, the annotation includes temporal information for each sound sample, namely its onset and offset times [45]. The strong-labeling process is performed as shown in Figure 4.
Manually annotating large amounts of audio data is a difficult and time-consuming task. Manually annotating onsets and offsets in polyphonic mixtures is also difficult, slowing down the annotation process when strong labels are needed. To facilitate the annotation process, many web-based audio annotation tools were analyzed for use [46,47,48,49,50].
Among the tools analyzed, we used online audio annotation tools such as that of [50] and SuperAnnotate [51].
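To make the strong-label format concrete, the sketch below converts annotations of the form (onset, offset, class) into a frame-level multi-label target matrix; the 16 ms hop and the reduced class list are assumptions chosen to match the feature settings described later, not the exact annotation pipeline used in the study.

```python
import numpy as np

CLASSES = ["harassment", "quarrel", "scream", "breaking"]  # illustrative subset
HOP_S = 0.016  # 32 ms frames with 50% overlap -> 16 ms hop (assumed)

def strong_labels_to_targets(annotations, clip_duration_s):
    """annotations: list of (onset_s, offset_s, class_name) tuples for one clip.
    Returns a (n_frames, n_classes) binary activity matrix."""
    n_frames = int(np.ceil(clip_duration_s / HOP_S))
    targets = np.zeros((n_frames, len(CLASSES)), dtype=np.float32)
    for onset, offset, label in annotations:
        c = CLASSES.index(label)
        start = int(np.floor(onset / HOP_S))
        stop = int(np.ceil(offset / HOP_S))
        targets[start:stop, c] = 1.0  # frames where the event is active
    return targets

# Example: a 10 s clip with two overlapping events.
targets = strong_labels_to_targets(
    [(1.2, 4.8, "quarrel"), (3.0, 3.6, "scream")], clip_duration_s=10.0)
print(targets.shape)  # (625, 4)
```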

3.5. Pre-Processing

Pre-processing is usually implemented before the feature extraction stage. The primary goal of this process is to improve the extraction of important features of the incoming signal to maximize the efficiency of audio analysis in the subsequent stages of the analysis system. This can be realized by reducing the noise effect or by enhancing the prominence of the target sounds within the signal.
The input signal in the time domain is of the form y(n) = x(n) + s(n), containing the events of interest x(n) and additive noise s(n). In this work, the input acoustic signal y(n) is resampled to a sampling rate of fs = 16,000 Hz with 16-bit resolution and a single channel. The signal was framed at 32 ms with 50% overlap, and a Hamming window was applied to smooth the signal at the frame boundaries. After that, a short-time Fourier transform (STFT) was applied.
The noise reduction process involves two steps: noise power spectral density (PSD) estimation [52] and spectral enhancement. PSD noise estimation is critical for accurate noise reduction. The Minimum Statistics (MS) method is used in cases of slowly time-varying noise. The noise spectrum computed with the decision-directed approach (DDA) was dynamically updated using a weighted combination of the previous noise estimate and the instantaneous spectral power minimum.
The recursive framework of the approach facilitates smoother estimation of both noise and signal components, thereby minimizing the introduction of artifacts. Finally, the time-domain signal x(n) is reconstructed by applying the inverse short-time Fourier Transform (ISTFT). An overall view of the noise reduction procedure is illustrated in Figure 5.
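The following condensed sketch, assuming the librosa and numpy packages, illustrates the pre-processing chain of resampling, Hamming-windowed framing, STFT, spectral gain, and ISTFT reconstruction; the simple noise estimate and Wiener-like gain below are stand-ins for the MS/decision-directed PSD estimation described above, not the authors' exact implementation.

```python
import numpy as np
import librosa

FS = 16_000
FRAME = int(0.032 * FS)  # 32 ms window (512 samples)
HOP = FRAME // 2         # 50% overlap

def denoise(y, noise_frames=10):
    """STFT with a Hamming window, a simplified spectral gain, and ISTFT."""
    Y = librosa.stft(y, n_fft=FRAME, hop_length=HOP, window="hamming")
    power = np.abs(Y) ** 2
    # Crude noise PSD estimate from the first few frames (stand-in for
    # Minimum Statistics / decision-directed estimation).
    noise_psd = power[:, :noise_frames].mean(axis=1, keepdims=True)
    # Wiener-like gain, floored to limit musical-noise artifacts.
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-10), 0.1)
    return librosa.istft(gain * Y, hop_length=HOP, window="hamming", length=len(y))

# "event.wav" is a placeholder file name; loading resamples to 16 kHz mono.
y, _ = librosa.load("event.wav", sr=FS, mono=True)
x = denoise(y)
```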

3.6. Feature Extraction

Audio feature extraction can be categorized into time-domain, frequency-domain, and time–frequency-domain feature representations, as well as statistical and psychoacoustic features. The feature extraction pipeline utilized in various acoustic feature analyses involves several key steps, including framing, windowing, calculating spectral coefficients, and subsequent feature analysis [52]. When analyzing audio signals, the large amount of data available motivates the use of the relative energy distribution over frequency, frequency-domain characteristics, or frequency–time-domain characteristics. The most common transform used for audio signals is the discrete Fourier transform (DFT) [46]. Other transforms for audio signals include the constant-Q transform (CQT) [53], MFCCs [54], and the discrete wavelet transform (DWT) [55,56].

Audio Representations as Images

In this research work, we generated spectrogram images from an audio signal using a Discrete Cosine Transform (DCT), MFCCs, and Cochleagram:
1. DCT-based spectrogram: The incoming audio signal has a duration of 2 s (sampling rate, 16 kHz); after a framing procedure (frame size = 32 ms, overlap = 50%), the DCT spectral coefficients are calculated for each frame. The formula for the DCT of a 1D signal x[n] is given by
X(k, r) = \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right), \quad k = 0, 1, \ldots, N-1
where N is the length of the window, x[n] is the time-domain signal, X(k, r) is the kth harmonic corresponding to the frequency f(k) = kFs/N for the rth frame, and Fs is the sampling frequency. The spectrogram values are obtained from the log of the magnitude of the DCT values as
S(k, r) = \log\left|X(k, r)\right|
To obtain the same time–frequency image resolution for all signals, each sound event signal was divided according to the framing rule [57] into 128 frames with 50% overlap between frames. A Hamming window was then applied to the frames, and a DCT was applied using 512 points so that the final image had dimensions of 256  ×  128.
After obtaining the spectral coefficients for each frame, the spectrogram image was constructed in grayscale, where each value is scaled in the range from 0 to 255. The sequence of spectrogram image formation is shown in Figure 6. Figure 7 shows the time, frequency, and frequency–time (spectrogram) regions of the loaded signal used to generate spectrogram images.
2. MFCC spectrograms [58]: These spectrograms are derived by computing coefficients that correspond to the constituent frequency components, utilizing the short-time Fourier transform (STFT). This extraction process involves applying each frame of the frequency-domain representation to a Mel filter bank. The intensity values of the Mel spectrogram image are calculated using the energy of the filter bank, similar to the MFCCs [12] but without using the DCT. The Mel filter bank output of the mth filter can be determined as
C(m, r) = \sum_{k=0}^{N/2 - 1} F(m, k)\, X(k, r), \quad m = 1, 2, \ldots, M
where C(m, r) is the filter bank energy of the mth filter in the rth frame, F(m, k) is the normalized response of the triangular filters uniformly spaced on the Mel scale, and M is the total number of Mel filters. The signal processing parameters for generating a Mel spectrogram are the same as those for generating a DCT spectrogram. In total, 128 Mel coefficient values were used from each frame, and an image of size 128 × 128 was generated. The log values are then calculated as in Equation (2).
3. Cochleagram: This representation models the frequency selectivity of the human cochlea [23,30]. To extract cochlear images from a signal, the incoming signal is first filtered using a gammatone filter bank. The gammatone filter bank is a comb of gammatone filters, each of which is associated with a certain characteristic frequency. The impulse response of a gammatone filter with center frequency fc is described by the expression
g(t) = t^{\,l-1} \exp\left(-2\pi b\, \mathrm{ERB}(f_c)\, t\right) \cos\left(2\pi f_c t\right), \quad t > 0,
where t is time, l is the filter order, b is a parameter that adjusts the filter bandwidth, fc is the filter center frequency, and ERB(fc) is the equivalent rectangular bandwidth of the auditory filter. In practice, the parameter values l = 4 and b = 1.019 [59] are often used. Typically, the central frequencies fc of a filter bank are distributed uniformly on the ERB (equivalent rectangular bandwidth) scale, which is similar to the scale of the critical bands of human hearing. Knowing the frequency f (in Hz), one can map it to the ERB scale using the following expression:
\mathrm{ERB}(f_c) = 24.673\,(0.004368 f_c + 1)
After the signal is filtered, a spectrogram-like image can be generated by summing the energy of the windowed signal in each frequency channel. The process of converting the image can be found in [60].
C(m, t) = \sum_{n=0}^{N-1} \hat{x}(m, n)\, w(n), \quad m = 1, \ldots, M
where \hat{x}(m, n) is the output of the mth gammatone filter, w(n) is the window function, and C(m, t) is the mth channel value corresponding to f_{c,m} for the tth frame.
With the Cochleagram representation, we set the number of gammatone filters to 256. The filtered signal was divided into 128 frames with 50% overlap between frames. The result was an image of size 256 × 128. The signal processing parameters for generating Cochleagram images were the same as for generating a DCT spectrogram.
Figure 8 shows examples of grayscale images of a DCT spectrogram, Cochleagram, and Mel spectrogram.
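As an illustration of the representations above, the sketch below generates 8-bit grayscale DCT and Mel spectrogram images with numpy, scipy, and librosa; the exact framing and scaling utilities of the original code are not public, so the helpers here are simplified assumptions (the Cochleagram would additionally require a gammatone filter bank).

```python
import numpy as np
import librosa
from scipy.fft import dct

FS = 16_000

def to_uint8(S):
    """Scale a log-magnitude matrix to an 8-bit grayscale image (0-255)."""
    S = (S - S.min()) / (S.max() - S.min() + 1e-10)
    return (255 * S).astype(np.uint8)

def dct_spectrogram(y):
    """Log-magnitude DCT of 32 ms Hamming-windowed frames with 50% overlap,
    giving an image of roughly 256 x 128 for a 2 s clip."""
    frames = librosa.util.frame(y, frame_length=512, hop_length=256)
    frames = frames * np.hamming(512)[:, None]
    coeffs = dct(frames, axis=0, norm="ortho")[:256, :]
    return to_uint8(np.log(np.abs(coeffs) + 1e-10))

def mel_spectrogram(y):
    """128-band log-Mel spectrogram image (about 128 x 128 for a 2 s clip)."""
    S = librosa.feature.melspectrogram(y=y, sr=FS, n_fft=512,
                                       hop_length=256, n_mels=128)
    return to_uint8(np.log(S + 1e-10))

# "event.wav" is a placeholder; clips are loaded as 2 s, 16 kHz, mono.
y, _ = librosa.load("event.wav", sr=FS, mono=True, duration=2.0)
img_dct, img_mel = dct_spectrogram(y), mel_spectrogram(y)
```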

4. Data Preparation and Training

4.1. Data Augmentation Techniques

Data augmentation is an increasingly popular method that also helps to reduce selection bias. This method involves altering the training data during training to introduce more variance into the sample. As a result of augmentation, the model works with a larger and more diverse training sample, which in turn can better describe the decision boundaries between classes. Perturbations can range from simple effects such as additive background noise, time dilation, and pitch changes to more complex domain-specific deformations (e.g., changes in speech path length or variable rate perturbations) [60,61,62,63]. The underlying concept is that training on these augmented examples enables the model to develop invariance to the specific deformations applied, thereby enhancing its robustness and generalization capabilities.
In our research, augmentation methods were applied only to the group of sounds collected through crowdsourcing, because the volume of sounds collected over a year is small compared to the other categories. We used Time Masking and Frequency Masking techniques to simulate temporary signal loss and frequency dropouts by randomly masking parts of the time and frequency axes of the spectrograms; to simulate speed changes, the spectrogram was deformed along the time axis using Time Warping, and additive random Gaussian noise was used to simulate background noise. In addition, we applied Time Shifting to shift the spectrogram along the time axis to simulate delayed or leading signals and Dynamic Range Compression to compress the amplitude range to simulate effects such as sound normalization.
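A compact sketch of the spectrogram-level augmentations listed above (time/frequency masking, time shifting, additive Gaussian noise), assuming numpy; mask widths, shift ranges, and noise levels are illustrative defaults rather than the values used in the study, and time warping and dynamic range compression are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_mask(spec, max_width=16):
    """Zero out a random block of time frames (simulated signal dropout)."""
    s = spec.copy()
    w = int(rng.integers(1, max_width))
    t0 = int(rng.integers(0, s.shape[1] - w))
    s[:, t0:t0 + w] = 0
    return s

def freq_mask(spec, max_width=12):
    """Zero out a random block of frequency bins."""
    s = spec.copy()
    w = int(rng.integers(1, max_width))
    f0 = int(rng.integers(0, s.shape[0] - w))
    s[f0:f0 + w, :] = 0
    return s

def time_shift(spec, max_shift=20):
    """Circularly shift along the time axis (delayed or leading signal)."""
    return np.roll(spec, int(rng.integers(-max_shift, max_shift)), axis=1)

def add_noise(spec, sigma=0.02):
    """Add Gaussian noise to simulate background noise."""
    return spec + rng.normal(0.0, sigma, spec.shape)

spec = np.random.rand(256, 128)  # stand-in for a DCT spectrogram image
augmented = add_noise(time_shift(freq_mask(time_mask(spec))))
```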

4.2. Dataset

This study employed both crowdsourcing data and publicly available open datasets to cover a wide range of sound events and situations. The datasets are summarized here.

4.2.1. Crowdsourced Dataset

The UzSoundEvent dataset was developed through a crowdsourcing campaign. As shown in Table 1, it comprises 112 audio clips lasting approximately 4 h in total. The dataset contains four separate types of sound events, with an average clip length of two minutes and a sample rate of 16 kHz. The recordings were made in a wide range of real-world contexts, including urban, rural, indoor, and outdoor settings. Pre-processing methods, including spectral denoising and cropping, were used to improve the audio quality. Annotation was performed using a multi-label approach, with an average of 2.5 labels per clip.

4.2.2. Open Datasets

In addition to the crowdsourced dataset, publicly available datasets were used to broaden this study’s applicability. The datasets, sound classifications, number of clips, average duration, and sample rates are all shown in Table 2.
The dataset statistics are reported in Table 3. Our dataset contains 5055 audio files of different lengths totaling 14.14 h, and the dataset contains strongly labeled data. The dataset consists of 13 separate sound categories with their corresponding durations and sources. The total duration of each category (in minutes) was divided into three subsets: 80% for training, 10% for validation, and 10% for testing. The number of clips and duration distributions are shown in Figure 9.

4.3. Development of Ensemble CRNN-Based Model for Sound Event Detection

Neural networks are well known for outperforming traditional machine learning approaches in a variety of pattern recognition tasks, making them the most common method, particularly for image and audio recognition. The following section focuses on the ensemble of CRNNs (with GRU layers) used in our research.

Convolutional Recurrent Neural Network

In the proposed ensemble CRNN-based model for SED, we first consider a time–frequency representation X ∈ R^{T×F}, where T and F, respectively, denote the number of time frames and frequency bins (e.g., DCT coefficients). A series of L_c ∈ N convolutional layers is then applied to X to extract local time–frequency features. Each layer l performs a two-dimensional convolution using a bank of filters W^{(l,m)} of size K_t^{(l)} × K_f^{(l)}, producing outputs X^{(l)} for m ∈ {1, …, C^{(l)}}. Specifically, the feature map of the mth filter at layer l is given by
X_{t,f,m}^{(l)} = \sigma\left( \sum_{i=0}^{K_t^{(l)}-1} \sum_{j=0}^{K_f^{(l)}-1} \sum_{c=1}^{C^{(l-1)}} W_{i,j,c}^{(l,m)}\, X_{t+i,\, f+j,\, c}^{(l-1)} + b^{(l,m)} \right),
where σ(⋅) is the nonlinear activation (e.g., ReLU), and X^{(l−1)} is the output of the previous layer (with X^{(0)} = X). A non-overlapping pooling operation is then performed over the frequency axis to reduce dimensionality, typically expressed as
P_{t,f,m}^{(l)} = \max_{0 \le j < P_f} X_{t,\, P_f f + j,\, m}^{(l)}
The final pooled features are stacked along the frequency axis to form a sequence \{Z_t\}_{t=1}^{T} of dimension D = (F / P_f) \times C^{(L_c)}, which serves as the input to L_r stacked recurrent (GRU) layers. In a GRU, the hidden state h_t at time t is updated using the following set of equations:
z_t = \sigma(W_z Z_t + U_z h_{t-1} + b_z),
r_t = \sigma(W_r Z_t + U_r h_{t-1} + b_r),
n_t = \tanh(W_n Z_t + U_n (r_t \odot h_{t-1}) + b_n),
h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1},
where z_t, r_t, and n_t are the update gate, reset gate, and candidate hidden state, respectively, ⊙ denotes elementwise multiplication, σ(·) is the elementwise sigmoid function, and tanh(·) is the hyperbolic tangent (Figure 10).
The last hidden state, h_t^{(L_r)}, is passed to a fully connected feedforward layer, which generates frame-level class probabilities y_t = σ(W_{ff} h_t^{(L_r)} + b_{ff}), where y_t ∈ R^C and σ(⋅) is the sigmoid activation applied to each of the C sound event classes.
These probabilities are then binarized to obtain ŷ_t ∈ {0, 1}^C via a thresholding function, where ŷ_{t,c} = 1 if y_{t,c} exceeds a fixed threshold and 0 otherwise, thereby yielding per-frame event activity decisions.
Finally, to exploit multiple representations such as DCT spectrograms, Mel spectrograms, and Cochleagrams, the above CRNN pipeline is replicated for each representation k ∈ {1, …, K}, generating predictions y_t^{(k)} that are then combined, commonly by averaging, to form an ensemble output, Equation (9):
y_t^{\mathrm{ens}} = \frac{1}{K} \sum_{k=1}^{K} y_t^{(k)}
The final ensemble prediction is subsequently thresholded to yield improved, robust sound event detection by leveraging complementary information from distinct time–frequency domains. Figure 11 shows an illustration of the system structure.
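A minimal Keras sketch of one CRNN branch and the ensemble averaging of Equation (9) is given below; the two convolutional blocks, two bi-directional GRU layers with 64 units, and frame-wise sigmoid output follow the description in Section 4.5, while the filter counts, pooling sizes, and threshold are assumptions rather than the exact published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_CLASSES = 13  # number of sound event categories in this study

def crnn_branch(input_shape, dropout=0.3):
    """One CRNN branch: Conv blocks -> frequency pooling -> BiGRU x2 -> frame-wise sigmoid."""
    inp = layers.Input(shape=input_shape)          # (freq, time, 1)
    x = layers.Permute((2, 1, 3))(inp)             # (time, freq, 1)
    for filters in (32, 64):                       # assumed filter counts
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(1, 4))(x)   # pool only over frequency
        x = layers.Dropout(dropout)(x)
    x = layers.Reshape((input_shape[1], -1))(x)    # stack freq bins x channels per frame
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True, dropout=dropout))(x)
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True, dropout=dropout))(x)
    out = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="sigmoid"))(x)
    return models.Model(inp, out)

# One branch per time-frequency representation.
dct_branch = crnn_branch((256, 128, 1), dropout=0.3)
coch_branch = crnn_branch((256, 128, 1), dropout=0.3)
mel_branch = crnn_branch((128, 128, 1), dropout=0.2)

def ensemble_predict(branches, inputs, threshold=0.5):
    """Average per-frame probabilities across branches (Eq. (9)) and binarize."""
    probs = tf.reduce_mean([m(x) for m, x in zip(branches, inputs)], axis=0)
    return tf.cast(probs >= threshold, tf.int32)
```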

4.4. Regularization

In order to address overfitting in our proposed model, we incorporate a variety of regularization techniques proven effective in earlier work. Following the original dropout concepts introduced in the seminal work by Srivastava et al. [64], we applied dropout not only in the convolution stages but also within the recurrent layers, consistent with the theoretically grounded approach for RNNs proposed by Y. Gal [65]. We also utilized batch normalization to stabilize and accelerate training by normalizing intermediate activations [66].
Beyond these layer-level methods, we find that ensembling multiple spectrogram-specific CRNN branches provides an additional model-averaging effect, yielding robustness against variance in the training data [67,68]. By combining these mechanisms—dropout, batch normalization, and ensemble model averaging—we obtain substantial improvements in mitigating overfitting and enhancing the overall performance of our multi-label sound event detection system.

4.5. Training

To train the proposed model on the annotated dataset, the data were separated into training, validation, and test sets in a ratio of 80/10/10 (Table 3). Because the dataset is imbalanced and has multiple labels, this study used the iterative stratification method [69] and the data augmentation methods outlined in Section 4.1 to achieve a balanced distribution across the training, validation, and test sets.
The proposed ensemble CRNN-based model architecture incorporates three types of input spectrograms (DCT, Cochleagram, and Mel), all processed through a CRNN. The architecture begins with two convolutional layers, each followed by max-pooling operations to reduce spatial dimensions while retaining essential features. After convolutional processing, the feature maps are reshaped into sequences suitable for two bi-directional GRU layers, each with 64 hidden units. Dropout regularization (0.3 for DCT and Cochleagram spectrograms, 0.2 for Mel spectrograms) is applied to both the convolutional and GRU layers to mitigate the overfitting problem. The last dense layer uses a sigmoid activation to predict multiple sound events in a multi-label setup using binary cross-entropy as the loss function. Training is optimized using the Adam optimizer with a learning rate of 1 × 10^{-4}, a batch size of 32, and 100 training epochs. Batch normalization is employed, and an ensemble method is applied to aggregate model outputs via voting. This architecture effectively combines convolutional and recurrent layers to extract spatial and temporal features for sound event detection. Complete information on the hyperparameters of the proposed model architecture is presented in Table 4. For feature extraction, the Python (version 3.8.0) library Librosa and our own Python function for creating DCT spectrogram images were used in this work. For the classifier implementations, the deep learning package Keras (version 3.8.0) was used with TensorFlow as the backend. The networks were trained on a single NVIDIA Tesla P100 GPU.
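A sketch of the training configuration described above, assuming scikit-multilearn's iterative stratification and the crnn_branch model from the previous sketch; the placeholder arrays stand in for the real spectrogram images and strong-label target matrices.

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split
from tensorflow.keras.optimizers import Adam

N_CLIPS, N_FRAMES, N_CLASSES = 500, 128, 13  # small placeholder; the real set has 5055 files

# Placeholders: spectrogram images and frame-level (strong-label) targets.
X = np.random.rand(N_CLIPS, 256, 128, 1).astype("float32")
Y_frames = np.random.randint(0, 2, size=(N_CLIPS, N_FRAMES, N_CLASSES)).astype("float32")
Y_clips = Y_frames.max(axis=1).astype(int)  # clip-level labels used for stratification

# 80/10/10 split with iterative stratification over the clip-level label matrix.
idx = np.arange(N_CLIPS).reshape(-1, 1)
idx_train, _, idx_rest, _ = iterative_train_test_split(idx, Y_clips, test_size=0.2)
idx_val, _, idx_test, _ = iterative_train_test_split(
    idx_rest, Y_clips[idx_rest.ravel()], test_size=0.5)
tr, va = idx_train.ravel(), idx_val.ravel()

model = dct_branch  # CRNN branch defined in the previous sketch
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["binary_accuracy"])
model.fit(X[tr], Y_frames[tr],
          validation_data=(X[va], Y_frames[va]),
          batch_size=32, epochs=100)
```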

4.6. Metrics

We used the two evaluation approaches proposed by Mesaros et al. [70], segment-based and event-based metrics, as the evaluation metrics in this research study. In segment-based evaluation, both the ground truth and the system's predictions are divided into fixed-length time segments (commonly one second), and the performance is assessed per segment. For each segment, true positives (TP), false positives (FP), and false negatives (FN) are computed, and the segment F1 score is defined as
F1_{\mathrm{segment}} = \frac{2\,TP}{2\,TP + FP + FN}
Complementary to the F1 score, the segment-based error rate (ER) is also calculated. Here, the error rate considers substitution errors (S), deletion errors (D), and insertion errors (I) with respect to the total number of events in the reference (N). It is given by
ER_{\mathrm{segment}} = \frac{S + D + I}{N}.
In contrast, the event-based evaluation treats events as temporal intervals with specific onsets and offsets. An event is regarded as correctly detected if its onset, and optionally its offset, falls within a predefined tolerance around the reference annotation. With TP, FP, and FN now representing correctly detected, spurious, and missed events, respectively, the event-based F1 score is computed as
F1_{\mathrm{event}} = \frac{2\,TP}{2\,TP + FP + FN}.
Similarly, the event-based error rate is defined as
ER_{\mathrm{event}} = \frac{S + D + I}{N},
where the definitions of substitutions, deletions, and insertions are analogous to those in the segment-based evaluation but applied to entire events rather than fixed segments. The calculation process of segment-based and event-based metrics is illustrated in Figure 12 and Figure 13 [70].
Together, these metrics of the F1 score and error rates in both segment-based and event-based contexts offer complementary insights: while the F1 score emphasizes the balance between precision and recall, the error rate provides a direct measure of the frequency and types of errors obtained by the system. This dual-metric approach facilitates a more nuanced evaluation of system performance in complex acoustic scenes.
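To make the segment-based evaluation concrete, the sketch below computes the segment F1 score and error rate from frame-level binary activity matrices, following the segment-based definitions above; the one-second segment length matches the setting used in Section 5, while the assumed frame rate and the per-segment counting of substitutions, deletions, and insertions follow one common convention rather than the authors' exact tooling.

```python
import numpy as np

FRAMES_PER_SEGMENT = 62  # roughly 1 s of 16 ms hops (assumed frame rate)

def segment_metrics(ref, est):
    """ref, est: binary arrays of shape (n_frames, n_classes).
    Returns the segment-based F1 score and error rate."""
    n_seg = ref.shape[0] // FRAMES_PER_SEGMENT
    # A class is active in a segment if it is active in any of its frames.
    R = ref[:n_seg * FRAMES_PER_SEGMENT].reshape(n_seg, FRAMES_PER_SEGMENT, -1).max(axis=1)
    E = est[:n_seg * FRAMES_PER_SEGMENT].reshape(n_seg, FRAMES_PER_SEGMENT, -1).max(axis=1)

    tp = np.sum((R == 1) & (E == 1))
    fp = np.sum((R == 0) & (E == 1))
    fn = np.sum((R == 1) & (E == 0))
    f1 = 2 * tp / (2 * tp + fp + fn)

    # Per-segment substitutions, deletions, and insertions for the error rate.
    fn_seg = np.sum((R == 1) & (E == 0), axis=1)
    fp_seg = np.sum((R == 0) & (E == 1), axis=1)
    S = np.minimum(fn_seg, fp_seg).sum()
    D = np.maximum(0, fn_seg - fp_seg).sum()
    I = np.maximum(0, fp_seg - fn_seg).sum()
    er = (S + D + I) / R.sum()
    return f1, er
```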

5. Results

Performance Evaluation

In this section, we evaluate the performance of the proposed model based on segment and event metrics for three different spectrogram images using CRNN model architectures and their combined ensemble model. The dataset selected for testing was composed of audio recordings from crowdsourced and public datasets. The total size of the test dataset was 1.37 h across all sound events. In the segment-based metrics, the segment length was set to 1 s, whereas in the event-based metrics, the tolerance for estimating the onset/offset time was T_collar = 200 ms and the percentage tolerance for the offset was Offset (length) = 50%. We conducted test experiments on the ensemble and, separately, on its three neural network models (DCT spectrogram + CRNN, Cochleagram + CRNN, Mel spectrogram + CRNN). Based on the final ensemble model, we obtained the following results: the segment F1 score was 71.5% (precision: 84.2%; recall: 62.13%) and the event F1 score was 46% (precision: 53.82%; recall: 40.13%). Table 5 details the F1 score and error rate values for the segment-based and event-based metrics of the model variants. Detailed information is shown in Table 5 and Figure 14, Figure 15 and Figure 16.
As can be seen in Figure 13, the “Sounds of breaking” class received the highest segment-based F1 score at 96%, followed by the “Gun Shot”, “Door Opening/Closing”, and “Construction Tools” sounds with F1 scores of 86.1%, 81.4%, and 81.2%, respectively. The lowest score was 41.8% for the “Harassment” cases. The main reasons for this are the small size of the dataset, the relatively distant location of the event from the recording device, and the similarity of the sounds of the “quarrel” and “screams” events. In addition, noise mixing resulted in false detections.
Figure 14 is a graphical representation of the model results presented in Table 5. Figure 16 shows the event F1 score, precision, and recall of the event sounds. In this case, we can see a relatively different graph in Figure 15. As we know, event-based metrics require the model to correctly identify and classify each overlapping event independently. This is a difficult task, especially if the events have similar acoustic properties or the duration of the event sound is short.
When evaluated using the event-based metric, the F1 scores of the sounds of the events “Gun shots”, “Construction tools”, and “Footsteps” are 37%, 42.3%, and 34%, respectively. We can see that these are low scores compared to those of the other sounds. One of the main reasons for this is the sensitivity of event metrics to rare or short-lived events. Short sounds (e.g., gunshots, screams) are particularly difficult to detect because they occupy fewer frames, making them more susceptible to complete omission, and small errors in temporal localization can lead to the complete exclusion of an event from the evaluation.
The analysis of the results shows that the variation in the detection accuracy of sound events is due to a combination of factors related to the characteristic acoustic features, temporal characteristics, recording quality, and the amount of training data. Events with unique and easily distinguishable spectral features, such as “sound of breaking” sounds, show high detection accuracy because their distinct acoustic features allow the model to efficiently extract and memorize characteristic features. At the same time, events with less distinct or similar features, such as “harassment” and “scream” incidents, make correct classification difficult due to acoustic features overlapping with other sounds, resulting in lower accuracy.
The temporal aspects also play a significant role in determining accuracy. Prolonged events provide more information to the model to extract stable features, which contributes to more accurate segmentation and classification under segmental estimation. However, short-lived and transient events, such as “gunshots” or “footsteps”, occupy fewer temporal frames, and even small errors in localization can result in missing events entirely. Strict temporal accuracy criteria in estimation, especially when using event-based metrics, further exacerbate this problem.
Furthermore, the quality of the audio recording has a considerable impact on the results. Events recorded at a high signal-to-noise ratio or close to the microphone are characterized by clear and distinct acoustic signals, which makes them easier to detect. On the contrary, sounds recorded at a distance or with high background noise lose their distinctive features, resulting in poorer recognition and higher error rates. Also, the lack of training data and class imbalance hinder effective model training, as the limited number of examples does not fully capture all variations in sound events. Thus, high accuracy in SED is achieved by having unique and distinct acoustic features, sufficient duration of the signals, favorable recording conditions, and an extensive training sample. Low accuracy, on the other hand, is due to the similarity of acoustic features, short duration of events, low signal-to-noise ratio, and limited data, resulting in difficulties in accurately localizing and classifying sounds in complex acoustic environments.
Our ensemble CRNN approach for SED excels across 13 classes, exploiting DCT spectrograms, Cochleagrams, and Mel spectrograms to achieve a segment-based F1 score of 71.5% with a small, unbalanced 14.14 h dataset. Outperforming the models of Çakır et al. (63.8%, 16 classes), Sang-Iek Kang et al. (53%, 12 classes), and Wei et al. (51.4%, 5 classes), it demonstrates robust handling of moderate class diversity. The multi-spectrogram strategy enhances noise robustness and generalization compared to single-spectrogram methods (e.g., Zheng et al., ~85%, 6 classes), while its ensemble design ensures high performance in urban safety contexts, surpassing niche methods (Wuyue Xiong et al., 82%, 5 classes) in applicability. Cost-effective for low-cost hardware, it balances scalability and simplicity, though its event-based F1 score suggests limitations in temporal precision across these classes (Table 6).

6. Conclusions

This research has developed a deep learning model for detecting sounds of emotional or disruptive events occurring around residential buildings. The audio data were converted into three types of images (DCT spectrograms, Cochleagrams, and Mel spectrograms) to ensure that the developed model performs effectively under different noise conditions. We used CRNN models that extract reliable acoustic features from each frequency–time spectrogram in the time series and combined them into a single ensemble model for SED. This model can detect sound events and perform temporal signal localization tasks. An important feature of this model is that it can identify and distinguish each individual sound event from mixed acoustic sources that include strong background noise, as encountered in real-world environments.
To validate the proposed model, real data were collected from residential building entrances using an audio recorder for a group of sounds (harassment, quarrel, screams, and sound of breaking), as well as extracted sounds of events that may occur around residential buildings from open datasets. Two metrics were used to evaluate the performance of the model. For the 13 sound event types, the ensemble architecture achieved an F1 score of 71.5% for the segment metric and an F1 score of 46% for the event metric.
The results show that the proposed model gives good results even on small datasets that are not class-balanced. This means that the methods used in each stage of model training, such as pre-processing, feature extraction, data annotation, and the acoustic model architecture, are properly chosen.
The experiments conducted confirm the universality of the proposed approach and its applicability under a wider range of conditions in detecting incidents occurring near residential buildings. Considering the above observations, it can be concluded that the developed model offers an effective solution to detect different types of situations disturbing the peace of citizens around residential buildings in real time, accurately and promptly. Moreover, features such as a reduced need for training data and easy data labeling can effectively reduce training and deployment costs.
The developed model can be installed on mini-computers (for example, the Raspberry Pi 5 or Jetson Nano) for audio surveillance. This is particularly useful for potentially dangerous events, which can be detected early in heavily cluttered spaces where visual cues are likely to be hidden. Further research could focus on collecting a larger and more diverse dataset covering the types of incidents that may occur near residential buildings. This would facilitate the development of more general models and improve performance in real-world applications. Additionally, audio data can be integrated with video streams to create multimodal surveillance systems capable of cross-validating sound events with visual cues, thereby improving detection accuracy and contextual understanding, as well as developing attention mechanisms that prioritize either audio or visual modalities based on environmental conditions.
Despite the promising results achieved in this study, several limitations should be acknowledged and addressed in future work. To address the shortcomings in event-based detection and elevate the F1 score from 46%, several research directions can be pursued.
First, enhancing model architectures with advanced temporal modeling techniques, such as transformer-based models or CRNNs with attention mechanisms, could improve the capture of long-term dependencies essential for precise event localization.
Second, integrating hybrid feature representations—such as combining spectrograms with temporal embeddings and derivative features—may enhance feature extraction and better represent event dynamics.
Third, tackling data-related challenges, such as class imbalance, through techniques like oversampling rare events or employing weighted loss functions, could improve model generalization across diverse sound events.
Additionally, our research focuses on detecting environmental sounds from hazardous events to ensure public safety. Building a dataset of hazardous sounds (harassment, screams, quarrels, etc.) is a very time-consuming process, making it difficult to develop high-accuracy recognition models. In future work, the training dataset will be expanded to increase the accuracy of the model. Finally, building on this study’s use of ensemble methods, which contributed to the current 46% event-based F1 score, further optimization could be achieved by diversifying the ensemble’s constituent models and refining post-processing approaches, such as temporal smoothing, to reduce prediction errors.

Author Contributions

Conceptualization, A.M. and I.K.; methodology, A.M.; software, A.M. and I.K.; validation, A.M. and I.K.; formal analysis, A.M. and D.N.; investigation, A.M. and I.K.; resources, D.N. and A.M.; data curation, A.M. and I.K.; writing—original draft preparation, A.M., D.N., and I.K.; writing—review and editing, J.C.; visualization, A.M.; supervision, J.C. and I.K.; project administration, J.C.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all participants involved in the study.

Data Availability Statement

The data are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mukhamadiyev, A.; Khujayarov, I.; Cho, J. Voice-Controlled Intelligent Personal Assistant for Call-Center Automation in the Uzbek Language. Electronics 2023, 12, 4850. [Google Scholar] [CrossRef]
  2. Musaev, M.; Khujayorov, I.; Ochilov, M. Image Approach to Speech Recognition on CNN. In Proceedings of the 2019 3rd International Symposium on Computer Science and Intelligent Control (ISCSIC 2019), Amsterdam, The Netherlands, 25–27 September 2019; Association for Computing Machinery: New York, NY, USA, 2019; Article 57, pp. 1–6. [Google Scholar] [CrossRef]
  3. Wang, D.; Brown, G.J. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications; Wiley-IEEE Press: New York, NY, USA, 2006. [Google Scholar]
  4. Heittola, T.; Mesaros, A.; Virtanen, T.; Gabbouj, M. Supervised model training for overlapping sound events based on unsupervised source separation. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8677–8681. [Google Scholar] [CrossRef]
  5. Xu, M.; Xu, C.; Duan, L.; Jin, J.S.; Luo, S. Audio keywords generation for sports video analysis. ACM Trans. Multimed. Comput. Commun. Appl. 2008, 4, 1–23. [Google Scholar] [CrossRef]
  6. Kim, S.-H.; Nam, H.; Choi, S.-M.; Park, Y.-H. Real-Time Sound Recognition System for Human Care Robot Considering Custom Sound Events. IEEE Access 2024, 12, 42279–42294. [Google Scholar] [CrossRef]
  7. Neri, M.; Battisti, F.; Neri, A.; Carli, M. Sound Event Detection for Human Safety and Security in Noisy Environments. IEEE Access 2022, 10, 134230–134240. [Google Scholar] [CrossRef]
  8. Gerosa, L.; Valenzise, G.; Tagliasacchi, M.; Antonacci, F.; Sarti, A. Scream and gunshot detection in noisy environments. In Proceedings of the EURASIP, Poznan, Poland, 3–7 September 2007. [Google Scholar]
  9. Chu, S.; Narayanan, S.; Kuo, C.-C.J. Environmental sound recognition with time-frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1142–1158. [Google Scholar] [CrossRef]
  10. Heittola, T.; Mesaros, A.; Eronen, A.; Virtanen, T. Audio context recognition using audio event histograms. In Proceedings of the 18th European Signal Processing Conference, Aalborg, Denmark, 23–27 August 2010; pp. 1272–1276. [Google Scholar]
  11. Shah, M.; Mears, B.; Chakrabarti, C.; Spanias, A. Lifelogging: Archival and retrieval of continuously recorded audio using wearable devices. In Proceedings of the 2012 IEEE International Conference on Emerging Signal Processing Applications (ESPA), Las Vegas, NV, USA, 12–14 January 2012; IEEE Computer Society: Washington, DC, USA, 2012; pp. 99–102. [Google Scholar]
  12. Wichern, G.; Xue, J.; Thornburg, H.; Mechtley, B.; Spanias, A. Segmentation, indexing, and retrieval for environmental and natural sounds. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 688–707. [Google Scholar] [CrossRef]
  13. Mukhamadiyev, A.; Khujayarov, I.; Djuraev, O.; Cho, J. Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language. Sensors 2022, 22, 3683. [Google Scholar] [CrossRef] [PubMed]
  14. Ochilov, M.M. Using the CTC-based Approach of the End-to-End Model in Speech Recognition. Int. J. Theor. Appl. Issues Digit. Technol. 2023, 3, 135–141. [Google Scholar]
  15. Adavanne, S.; Parascandolo, G.; Pertila, P.; Heittola, T.; Virtanen, T. Sound event detection in multichannel audio using spatial and harmonic features. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes Events, Budapest, Hungary, 3 September 2016; pp. 6–10. [Google Scholar]
  16. Guo, G.; Li, S. Content-based audio classification and retrieval by support vector machines. IEEE Trans. Neural Networks 2003, 14, 209–215. [Google Scholar] [CrossRef]
  17. Parascandolo, G.; Huttunen, H.; Virtanen, T. Recurrent neural networks for polyphonic sound event detection in real life recordings. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6440–6444. [Google Scholar] [CrossRef]
  18. Bisot, V.; Essid, S.; Richard, G. HOG and subband power distribution image features for acoustic scene classification. In Proceedings of the 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 719–723. [Google Scholar] [CrossRef]
  19. Rakotomamonjy, A.; Gasso, G. Histogram of Gradients of Time–Frequency Representations for Audio Scene Classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 142–153. [Google Scholar] [CrossRef]
  20. Çakır, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303. [Google Scholar] [CrossRef]
  21. Espi, M.; Fujimoto, M.; Kinoshita, K.; Nakatani, T. Exploiting spectro-temporal locality in deep learning based acoustic event detection. J. Audio Speech Music Proc. 2015, 2015, 26. [Google Scholar] [CrossRef]
  22. Auger, F.; Flandrin, P.; Lin, Y.; McLaughlin, S.; Meignen, S.; Oberlin, T.; Wu, H. Time-frequency reassignment and synchrosqueezing: An overview. IEEE Signal Process. Mag. 2013, 30, 32–41. [Google Scholar] [CrossRef]
  23. Sharan, R.V.; Moir, T.J. Cochleagram image feature for improved robustness in sound recognition. In Proceedings of the 2015 IEEE International Conference on Digital Signal Processing (DSP), Singapore, 21–24 July 2015; pp. 441–444. [Google Scholar] [CrossRef]
  24. Dennis, J.; Tran, H.D.; Chng, E.S. Analysis of spectrogram image methods for sound event classification. In Proceedings of the Interspeech, Singapore, 14–18 September 2014. [Google Scholar]
  25. Spadini, T.; de Oliveira Silva, D.L.; Suyama, R. Sound event recognition in a smart city surveillance context. arXiv 2019, arXiv:1910.12369. [Google Scholar]
  26. Ciaburro, G.; Iannace, G. Improving Smart Cities Safety Using Sound Events Detection Based on Deep Neural Network Algorithms. Informatics 2020, 7, 23. [Google Scholar] [CrossRef]
  27. Ranmal, D.; Ranasinghe, P.; Paranayapa, T.; Meedeniya, D.; Perera, C. ESC-NAS: Environment Sound Classification Using Hardware-Aware Neural Architecture Search for the Edge. Sensors 2024, 24, 3749. [Google Scholar] [CrossRef]
  28. Zhang, H.; McLoughlin, I.; Song, Y. Robust sound event recognition using convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 559–563. [Google Scholar] [CrossRef]
  29. Kwak, J.-Y.; Chung, Y.-J. Sound Event Detection Using Derivative Features in Deep Neural Networks. Appl. Sci. 2020, 10, 4911. [Google Scholar] [CrossRef]
  30. Nanni, L.; Maguolo, G.; Brahnam, S.; Paci, M. An Ensemble of Convolutional Neural Networks for Audio Classification. Appl. Sci. 2021, 11, 5796. [Google Scholar] [CrossRef]
  31. Xiong, W.; Xu, X.; Chen, L.; Yang, J. Sound-Based Construction Activity Monitoring with Deep Learning. Buildings 2022, 12, 1947. [Google Scholar] [CrossRef]
  32. Sharan, R.V.; Moir, T.J. Acoustic event recognition using cochleagram image and convolutional neural networks. Appl. Acoust. 2019, 148, 62–66. [Google Scholar] [CrossRef]
  33. Heittola, T.; Mesaros, A.; Eronen, A.; Virtanen, T. Context-dependent sound event detection. EURASIP J. Audio Speech Music. Process. 2013, 2013, 1. [Google Scholar] [CrossRef]
  34. Zheng, X.; Zhang, C.; Chen, P.; Zhao, K.; Jiang, H.; Jiang, Z.; Pan, H.; Wang, Z.; Jia, W. A CRNN System for Sound Event Detection Based on Gastrointestinal Sound Dataset Collected by Wearable Auscultation Devices. IEEE Access 2020, 8, 157892–157905. [Google Scholar] [CrossRef]
  35. Lim, W.; Suh, S.; Park, S.; Jeong, Y. Sound Event Detection in Domestic Environments Using Ensemble of Convolutional Recurrent Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop, June 2019. Available online: https://dcase.community/documents/challenge2019/technical_reports/DCASE2019_Lim_77.pdf (accessed on 10 April 2025).
  36. Arslan, Y.; Canbolat, H. Performance of Deep Neural Networks in Audio Surveillance. In Proceedings of the IEEE 2018 6th International Conference on Control Engineering & Information Technology (CEIT), Istanbul, Turkey, 25–27 October 2018; pp. 1–5. [Google Scholar] [CrossRef]
  37. Kang, J.; Lee, S.; Lee, Y. DCASE 2022 Challenge Task 3: Sound event detection with target sound augmentation. DCASE 2022 Community.
  38. Gygi, B.; Shafiro, V. Environmental sound research as it stands today. Proc. Meetings Acoust. 2007, 1, 050002. [Google Scholar]
  39. Salamon, J.; Jacoby, C.; Bello, J.P. A Dataset and Taxonomy for Urban Sound Research. In Proceedings of the 22nd ACM International Conference on Multimedia (MM ‘14), Orlando, FL, USA, 3–7 November 2014; pp. 1041–1044. [Google Scholar] [CrossRef]
  40. Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar]
  41. Fonseca, E.; Favory, X.; Pons, J.; Font, F.; Serra, X. FSD50K: An Open Dataset of Human-Labeled Sound Events. IEEE/ACM Trans. Audio Speech Lang. Process 2022, 30, 829–852. [Google Scholar] [CrossRef]
  42. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; pp. 1128–1132. [Google Scholar] [CrossRef]
  43. Piczak, K.J. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd ACM International Conference on Multimedia (MM ’15), Brisbane, Australia, 26–30 October 2015; Association for Computing Machinery: New York, NY, USA; pp. 1015–1018. [Google Scholar] [CrossRef]
  44. Fillon, T.; Simonnot, J.; Mifune, M.-F.; Khoury, S.; Pellerin, G.; Le Coz, M. Telemeta: An open-source web framework for ethnomusicological audio archives management and automatic analysis. In Proceedings of the 1st International Workshop on Digital Libraries for Musicology (DLfM 2014), London, UK, 12 September 2014; pp. 1–8.
  45. Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound Event Detection: A tutorial. IEEE Signal Process. Mag. 2021, 38, 67–83. [Google Scholar] [CrossRef]
  46. Kim, B.; Pardo, B. I-SED: An Interactive Sound Event Detector. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (IUI ‘17), Limassol, Cyprus, 13–16 March 2017; Association for Computing Machinery: New York, NY, USA; pp. 553–557. [Google Scholar] [CrossRef]
  47. Queensland University of Technology’s Ecoacoustics Research Group. Bioacoustics Workbench. 2017. Available online: https://github.com/QutBioacoustics/baw-client (accessed on 10 April 2025).
  48. Katspaugh. 2017. wavesurfer.js. Available online: https://wavesurfer-js.org/ (accessed on 10 April 2025).
  49. Cartwright, M.; Seals, A.; Salamon, J.; Williams, A.; Mikloska, S.; MacConnell, D.; Law, E.; Bello, J.; Nov, O. Seeing sound: Investigating the effects of visualizations and complexity on crowdsourced audio annotations. In Proceedings of the ACM on Human-Computer Interaction, Denver, CO, USA, 6–11 May 2017. [Google Scholar]
  50. Martin, R. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 2001, 9, 504–512. [Google Scholar] [CrossRef]
  51. The Audio Annotation Tool for Your AI. SuperAnnotate. Available online: https://www.superannotate.com/audio-annotation (accessed on 10 April 2025).
  52. Heittola, T.; Çakır, E.; Virtanen, T. The Machine Learning Approach for Analysis of Sound Scenes and Events. In Computational Analysis of Sound Scenes and Events; Virtanen, T., Plumbley, M., Ellis, D., Eds.; Springer: Cham, Switzerland, 2017. [Google Scholar] [CrossRef]
  53. Bo, H.; Li, H.; Ma, L.; Yu, B. A Constant Q Transform based approach for robust EEG spectral analysis. In Proceedings of the 2014 International Conference on Audio, Language and Image Processing, Shanghai, China, 7–9 July 2014; pp. 58–63. [Google Scholar] [CrossRef]
  54. Musaev, M.; Mussakhojayeva, S.; Khujayorov, I.; Khassanov, Y.; Ochilov, M.; Atakan Varol, H. USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments. In Speech and Computer, Proceedings of the Speech and Computer (SPECOM 2021), St. Petersburg, Russia, 27–30 September 2021; Karpov, A., Potapova, R., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12997. [Google Scholar] [CrossRef]
  55. Musaev, M.; Khujayorov, I.; Ochilov, M. Automatic Recognition of Uzbek Speech Based on Integrated Neural Networks. In Advances in Intelligent Systems and Computing, Proceeding of the 11th World Conference “Intelligent System for Industrial Automation” (WCIS-2020), Tashkent, Uzbekistan, 26–28 November 2020; Aliev, R.A., Yusupbekov, N.R., Kacprzyk, J., Pedrycz, W., Sadikoglu, F.M., Eds.; Springer: Cham, Switzerland, 2021; Volume 1323. [Google Scholar] [CrossRef]
  56. Tzanetakis, G.; Essl, G.; Cook, P.R. Audio analysis using the discrete wavelet transform. In Proceedings of the Acoustics and Music Theory Applications; Citeseer: Princeton, NJ, USA, 2001; Volume 66, pp. 318–323. [Google Scholar]
  57. Available online: https://brianmcfee.net/dstbook-site/content/ch09-stft/Framing.html (accessed on 10 April 2025).
  58. Rabiner, L.R.; Schafer, R.W. Theory and Applications of Digital Speech Processing; Prentice Hall Press: Hoboken, NJ, USA, 2010. [Google Scholar]
  59. Porkhun, M.I.; Vashkevich, M.I. Efficient implementation of gammatone filters based on unequal-band cosine-modulated filter bank. Comput. Sci. Autom. 2024, 23, 1398–1422. [Google Scholar] [CrossRef]
  60. Cui, X.; Goel, V.; Kingsbury, B. Data augmentation for deep neural network acoustic modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1469–1477. [Google Scholar]
  61. Jaitly, N.; Hinton, G.E. Vocal tract length perturbation (VTLP) improves speech recognition. In Proceedings of the ICML Workshop on Deep Learning for Audio, Speech and Language, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  62. McFee, B.; Humphrey, E.J.; Bello, J.P. A software framework for musical data augmentation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Málaga, Spain, 26–30 October 2015. [Google Scholar]
  63. Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio augmentation for speech recognition. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  64. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  65. Gal, Y.; Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  66. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167v3. [Google Scholar]
  67. Santos, C.F.G.D.; Papa, J.P. Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks. ACM Comput. Surv. 2022, 54, 1–25. [Google Scholar] [CrossRef]
  68. Salehin, I.; Kang, D.-K. A Review on Dropout Regularization Approaches for Deep Neural Networks within the Scholarly Domain. Electronics 2023, 12, 3106. [Google Scholar] [CrossRef]
  69. Szymański, P.; Kajdanowicz, T. A Network Perspective on Stratification of Multi-Label Data. arXiv 2017, arXiv:1704.08756v1. [Google Scholar]
  70. Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for Polyphonic Sound Event Detection. Appl. Sci. 2016, 6, 162. [Google Scholar] [CrossRef]
  71. Wei, W.; Zhu, H.; Emmanouil, B.; Wang, Y. A-CRNN: A Domain Adaptation Model for Sound Event Detection. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 276–280. [Google Scholar] [CrossRef]
Figure 1. Input and output features for three analysis systems: sound scene classification, audio tagging, and sound event detection.
Figure 2. The process of the proposed methodology.
Figure 3. Functional scheme of sound event recording and storing using Raspberry Pi 4 as base of IoT module.
Figure 4. Example of strong-label processing of audio data.
Figure 5. Overview of the pre-processing technique.
Figure 6. Processing steps to generate a DCT-based spectrogram from an audio signal.
Figure 7. Software interface for generating spectrogram images.
Figure 8. (a) DCT spectrogram, (b) Cochleagram, and (c) Mel spectrogram image of a sample sound signal (crying baby on the street). The frequency range in each case is 0–16,000 Hz.
Figure 9. Number of clips and duration distribution in the used dataset.
Figure 10. Basic structure of GRU.
Figure 11. Proposed ensemble CRNN-based model.
Figure 12. The process of calculating segment-based metrics.
Figure 13. The process of calculating event-based metrics.
Figure 14. Results of F1, precision, and recall evaluation for the ensemble and each of its models.
Figure 15. Results for segment F1 score, precision, and recall.
Figure 16. Results for event F1 score, precision, and recall.
Table 1. Information of crowdsourced dataset (UzSoundEvent).

| | Harassment | Quarrel | Screams | Sounds of Breaking |
|---|---|---|---|---|
| Number of clips | 6 | 36 | 33 | 37 |
| Duration (minutes) | 35 | 77 | 81 | 52 |
Table 2. Information about sounds collected from open datasets.

| Sound Class | Datasets | Number of Clips | Average Duration (Seconds) | Sample Rate |
|---|---|---|---|---|
| Construction Tools | UrbanSound8K | ~394 | ~4.0 | 44.1 kHz |
| Door Opening/Closing | AudioSet Ontology | ~500+ | ~10 | |
| Loud Music | UrbanSound8K | ~1000 | ~3.5 | |
| Car Horn | AudioSet Ontology | ~1000+ | Varies (5–30) | |
| Footsteps | FSD50K | ~500+ | Varies (5–30) | |
| Crying Baby | ESC-50 | ~100+ | ~5.0 | |
| Gun Shooting | UrbanSound8K | 374 | ~2.5 | |
| Street Sound | TUT Sound Events 2017 | ~200 | ~5.0 | 48 kHz |
| Siren | UrbanSound8K | ~929 | ~3.7 | 44.1 kHz |
Table 3. The dataset specifications.

| Sound Category | Total (Minutes) | Train (80%) | Validation (10%) | Test (10%) | Source |
|---|---|---|---|---|---|
| Construction Tools | 26.4 | 21.12 | 2.64 | 2.64 | Open source |
| Door Opening/Closing | 83.4 | 66.72 | 8.34 | 8.34 | Open source |
| Loud Music | 66 | 52.80 | 6.60 | 6.60 | Open source |
| Car Horn | 300 | 240 | 30 | 30 | Open source |
| Footsteps | 150 | 120 | 15 | 15 | Open source |
| Crying Baby | 10.2 | 8.16 | 1.02 | 1.02 | Open source |
| Gun Shooting | 15.6 | 12.48 | 1.56 | 1.56 | Open source |
| Street Sound | 16.8 | 13.44 | 1.68 | 1.68 | Open source |
| Siren | 60 | 48.00 | 6.00 | 6.00 | Open source |
| Harassment | 35 | 28.0 | 3.50 | 3.50 | Crowdsourcing |
| Quarrel | 77 | 61.60 | 7.70 | 7.70 | Crowdsourcing |
| Screams | 81 | 64.80 | 8.10 | 8.10 | Crowdsourcing |
| Sounds of Breaking | 52 | 41.60 | 5.20 | 5.20 | Crowdsourcing |
Table 4. Hyperparameters of the proposed model architecture.

| Layers | DCT and Cochleagram spectrogram | Mel spectrogram |
|---|---|---|
| Input | 256 × 128 | 128 × 128 |
| Conv1 | 32 filters, 5 × 5, stride = 1, pad = 2, ReLU | 32 filters, 5 × 5, stride = 1, pad = 2, ReLU |
| Pool1 | MaxPool 2 × 2 | MaxPool 2 × 2 |
| Conv2 | 64 filters, 3 × 3, stride = 1, pad = 1, ReLU | 64 filters, 3 × 3, stride = 1, pad = 1, ReLU |
| Pool2 | MaxPool 2 × 2 | MaxPool 2 × 2 |
| Reshape | (B, 64, 64, 32) → (B, 32, 64 × 64 = 4096) | (B, Time = 32, 64 × 32 = 2048) |
| Bi-GRU1 | 64 hidden units | 64 hidden units |
| Bi-GRU2 | 64 hidden units | 64 hidden units |
| Dropout (Conv + RNN) | 0.3 in Conv blocks and GRU layers | 0.2 in Conv blocks and GRU layers |
| Dense | Number of sound events, sigmoid | Number of sound events, sigmoid |
| Loss function | Binary cross-entropy (multi-label) | |
| Optimizer | Adam | |
| Learning rate | 1 × 10⁻⁴ | |
| Batch normalization | 32 | |
| Number of training epochs | 50 | |
| Ensemble method | Vote over model outputs | |
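To make the configuration in Table 4 easier to follow, the sketch below reconstructs the DCT/Cochleagram branch in PyTorch from the listed hyperparameters alone. It is an illustrative approximation rather than the authors' implementation: the input layout (frequency × time), the reshape convention, and the use of a single two-layer bidirectional GRU module in place of two separate Bi-GRU blocks are assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 13  # number of sound event classes in this study

class CRNNBranch(nn.Module):
    """Sketch of the DCT/Cochleagram branch following Table 4.
    Input: (batch, 1, 256, 128) spectrogram images (frequency x time assumed)."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                        # -> (B, 32, 128, 64)
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # -> (B, 64, 64, 32)
            nn.Dropout(0.3),
        )
        # Two stacked bi-directional GRUs, 64 hidden units each
        self.rnn = nn.GRU(input_size=64 * 64, hidden_size=64, num_layers=2,
                          batch_first=True, bidirectional=True, dropout=0.3)
        self.fc = nn.Linear(2 * 64, num_classes)    # frame-wise sigmoid outputs

    def forward(self, x):
        h = self.cnn(x)                             # (B, 64, 64, 32)
        h = h.permute(0, 3, 1, 2).flatten(2)        # (B, time = 32, 64 * 64 = 4096)
        h, _ = self.rnn(h)                          # (B, 32, 128)
        return torch.sigmoid(self.fc(h))            # (B, 32, num_classes)

model = CRNNBranch()
scores = model(torch.randn(4, 1, 256, 128))         # e.g., a batch of 4 clips
```

The Mel branch would differ only in its 128 × 128 input (giving a 32 × 32 feature map and hence a 2048-dimensional GRU input) and its 0.2 dropout rate.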
Table 5. The F1 score and error rate values.

| Methods | F1 Score (%), Segment-Based | F1 Score (%), Event-Based | Error Rate, Segment-Based | Error Rate, Event-Based |
|---|---|---|---|---|
| DCT spectrogram + CRNN | 64.7 | 40.8 | 0.74 | 0.84 |
| Cochleagram spectrogram + CRNN | 53.5 | 37.8 | 1.0 | 1.0 |
| Mel spectrogram + CRNN | 50.4 | 32.8 | 0.97 | 1.2 |
| Ensemble | 71.5 | 46.0 | 0.92 | 1.1 |
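The segment-based figures in Table 5 follow the intermediate-statistic definitions of Mesaros et al. [70], which are implemented in publicly available toolboxes such as sed_eval. The snippet below is a minimal, self-contained sketch of that computation from binary reference and prediction matrices, shown only to clarify how the F1 score and error rate are obtained; it is not the evaluation code used in this study.

```python
import numpy as np

def segment_based_metrics(ref, pred):
    """ref, pred: binary arrays of shape (segments, classes); returns F1 and
    error rate following the segment-based definitions of Mesaros et al. [70]."""
    tp = np.logical_and(ref == 1, pred == 1).sum()
    fp = np.logical_and(ref == 0, pred == 1).sum()
    fn = np.logical_and(ref == 1, pred == 0).sum()
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Error rate uses per-segment substitutions, deletions, and insertions.
    fn_seg = np.logical_and(ref == 1, pred == 0).sum(axis=1)
    fp_seg = np.logical_and(ref == 0, pred == 1).sum(axis=1)
    s = np.minimum(fn_seg, fp_seg).sum()
    d = np.maximum(0, fn_seg - fp_seg).sum()
    i = np.maximum(0, fp_seg - fn_seg).sum()
    n = ref.sum()
    er = (s + d + i) / n if n else 0.0
    return f1, er

# Toy example: 10 one-second segments, 13 classes
rng = np.random.default_rng(1)
ref = rng.integers(0, 2, (10, 13))
pred = rng.integers(0, 2, (10, 13))
print(segment_based_metrics(ref, pred))
```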
Table 6. Comparative experiments with other existing sound event detection methods.

| Authors | Model | Dataset | Classes | Key Features | F1 Score (Segment/Event) | Application |
|---|---|---|---|---|---|---|
| Yüksel Arslan et al. (2018) [36] | DNN | Custom | 2 | MFCC | F-score 75.4% | Urban safety |
| Xue Zheng et al. (2020) [34] | CRNN (LSTM) | GI Sound set | 6 | MFCC | 81.06%/- | Medical monitoring |
| Emre Çakır et al. (2017) [20] | Single CRNN (GRU) | TUT-SED Synthetic 2016 | 16 | MFCC | F1frame = 66.4%, F1sec = 68.7% | General polyphonic SED |
| Wuyue Xiong et al. (2022) [31] | CRNN (GRU) | Custom | 5 | Mel spectrogram | 82% (accuracy)/- | Construction monitoring |
| Sang-Ick Kang et al. (2022) [37] | CRNN ensemble | DCASE 2022 SELD Synthetic dataset | 12 | Mel spectrogram | 53%/- | Polyphonic SED + location |
| Wei Wei et al. (2020) [71] | Adapted CRNN | DCASE | 5 | MFCC | 51.4%/- | Domestic SED |
| Wootaek Lim et al. (2019) [35] | CRNN (Bi-GRU) | DCASE 2019 | 10 | Mel spectrogram | 66.17%/40.89% | Domestic SED |
| Ours (2025) | CRNN (Bi-GRU) ensemble | Crowdsourced and open source | 13 | DCT, Mel, Cochleagram spectrograms | 71.5%/46% | Urban safety |
