Sound Event Detection in Smart Cities: A Systematic Review of Methods, Datasets, and Applications

Ciaburro, Giuseppe; Puyana-Romero, Virginia

doi:10.3390/bdcc10030083

Open Access Systematic Review

by Giuseppe Ciaburro ¹,* and Virginia Puyana-Romero ²

¹ School of Engineering and Informatics, Department of Engineering, Pegaso University, 80143 Naples, Italy
² Departamento de Ingeniería en Sonido y Acústica, Universidad de Las Américas, Quito 17513, Ecuador
* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(3), 83; https://doi.org/10.3390/bdcc10030083

Submission received: 30 December 2025 / Revised: 23 February 2026 / Accepted: 6 March 2026 / Published: 8 March 2026

(This article belongs to the Special Issue Artificial Intelligence Techniques for Audio, Image, and Multisensory Signal Processing)

Abstract

Sound Event Detection (SED) is a growing research area with considerable potential for understanding and shaping the acoustic environment of smart cities. This paper summarizes recent advances in SED, focusing on models, datasets, and applications reported in scientific papers indexed in Scopus and Web of Science. It describes how SED is being used in smart cities, public safety, environmental monitoring, and home security, and it addresses the main challenges: dataset representativeness, model robustness in noisy or complex acoustic scenes, detection of rare events, and the ethics of automatic listening. It also outlines directions for future work, with a focus on self-supervised learning, multi-modal fusion, neuro-inspired approaches, and privacy-preserving analytics. Overall, the review presents SED as a key technology for making smart cities safe, secure, and sustainable, and for enabling artificial perception in urban environments.

Keywords:

sound event detection; smart cities; urban security; deep learning

1. Introduction

The Smart City concept has recently attracted considerable academic attention [1]. Originally conceived as a futuristic vision, the Smart City has gradually become a tangible reality in many of the world’s major cities [2]. Its basic idea is the extensive use of digital media and smart technologies to manage resources efficiently and improve quality of life while moving towards sustainability [3]. Ambient intelligence plays an important role here, as it can sense, process, and interpret data from the environment and respond to changing events and conditions [4]. Among the available sensing modalities, audio is one of the most promising and least intrusive. Sound carries rich information about the environment and can be leveraged in numerous ways. Nevertheless, the potential of audio technology in Smart City applications remains largely untapped for several reasons, including the effect of noisy environments, difficulties in generalizing models to dynamic environments, and the legal implications of using this technology [5,6]. In this context, Sound Event Detection (SED) can be an important enabling technology [6,7]. SED determines which sound events of interest occur in a continuous audio stream and when each one starts and ends. While traditional sound classification assigns a label to a recording, SED can identify and locate multiple sounds in time and sometimes in space [8,9,10].

The application of Sound Event Detection (SED) in smart cities could lead to a paradigm shift in the way city authorities, citizens, and smart systems interact with the urban space. The applications are diverse:

-: Public safety: The automatic detection of unusual events such as screams, glass breaking, gunshots, or car crashes can reduce the time taken for emergency response [11].
-: Environmental monitoring: Continuous monitoring of noise and sound events associated with human activity and weather patterns can help inform the design of more effective environmental policies [12].
-: Smart mobility: Analysis of sounds from traffic, car horns, braking, and electric motors can help with traffic management and the implementation of autonomous vehicles [13].
-: Accessibility: For the visually impaired, sound-based recognition of the urban environment can enhance navigation, increasing independence and safety [14].

These applications require a common solution that can provide reliable, real-time audio detection, flexible enough to adapt to the diverse acoustic environments of modern cities. However, the implementation of SED technology to benefit society on a large scale faces a number of challenges [15], which will be discussed in the following sections. Sound Event Detection (SED) is a technology that identifies what is happening in an audio signal, typically using a single microphone [16]. Sound Event Localization and Detection (SELD) is a more advanced technology that not only identifies what is happening in an audio signal but also locates the source of the sound, using multi-channel microphone arrays [17]. While SED informs us about the composition of the urban soundscape, SELD provides a more complete understanding of the urban soundscape by combining recognition and localization, which is particularly valuable in smart city applications [18]. Recent advances in edge computing have further brought both technologies into line with the requirements of real-time monitoring in complex urban environments.

Traditional SED approaches were based on signal processing and statistical models such as Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), combined with feature engineering that relied heavily on Mel-Frequency Cepstral Coefficients (MFCCs), spectrograms, and other time-frequency features [19]. Deep learning then reshaped the field with models such as Convolutional Neural Networks (CNNs), Convolutional Recurrent Neural Networks (CRNNs), and, more recently, Transformer models, which learn audio representations directly from raw data with impressive results [20]. At the same time, self-supervised learning has made it possible to exploit unlabeled data, mitigating the long-standing problem of scarce annotated data in real-world urban environments. Evaluation has also advanced, with improved metrics that now allow fair comparison in real-world, polyphonic settings [21].

A further major problem in SED is the availability of labels. Strong labels, which indicate not only the presence or absence of an event but also its exact timing in the audio, are very valuable for learning precise models, but they are expensive to create [22]. Therefore, weak labels, which only indicate the presence or absence of an event in an audio clip, are commonly used. Although weak labels are easier to obtain, they carry no timing information, so learning from them is less certain. To address this problem, mixture recipes and synthetic data are now being used [23]. Mixtures combine isolated sound events with field recordings, while synthetic data are obtained from simulations or sound libraries.
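The difference between strong and weak labels, and the idea of a mixture recipe, can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed systems: the function names, the tuple layout `(label, onset, offset)`, and the choice of RMS-based scaling to reach a target event-to-background SNR are assumptions made for illustration, and the event is assumed to be non-silent.

```python
import math

def strong_to_frames(events, classes, clip_dur, hop):
    """Rasterize strong labels (label, onset_s, offset_s) into a frame-by-class 0/1 grid."""
    n_frames = math.ceil(clip_dur / hop)
    grid = [[0] * len(classes) for _ in range(n_frames)]
    for label, onset, offset in events:
        k = classes.index(label)
        first = math.floor(onset / hop)
        last = min(n_frames, math.ceil(offset / hop))
        for t in range(first, last):
            grid[t][k] = 1  # overlapping events simply activate several columns
    return grid

def frames_to_weak(grid):
    """Collapse frame-level activity into clip-level presence tags (weak labels)."""
    n_classes = len(grid[0])
    return [int(any(row[k] for row in grid)) for k in range(n_classes)]

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def mix_at_snr(background, event, start, snr_db):
    """Add an isolated event to a background recording at a target SNR (in dB).

    Assumption: SNR is defined as event RMS over background RMS.
    """
    gain = rms(background) / rms(event) * 10 ** (snr_db / 20)
    mixed = list(background)
    for i, v in enumerate(event):
        if start + i < len(mixed):
            mixed[start + i] += gain * v
    return mixed
```

Because the onset and offset of the inserted event are known, a mixture produced this way comes with strong labels at no annotation cost, which is the main appeal of synthetic mixtures.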

Another challenge is domain adaptation. A model trained on data from one neighborhood or city may not perform well when it is used in a different urban environment, because the background noise and the types of sound events differ [24]. Adaptation methods can be used to reduce this ‘urban shift’ effect and make the model more robust and transferable for practical use cases.

Another significant and cross-cutting issue for acoustic analysis systems, as for any system dealing with private information, is privacy by design: privacy is considered from the very beginning of the design process and not as an afterthought [25]. One important approach is on-device processing. Processing the signals directly on the device, without sending large amounts of data to a remote server, reduces the exposure of raw audio and the associated risk of interception or attack.

Another important design consideration is the minimization of data retention: data used for any purpose, such as analysis or model training, should be stored only for a very limited period of time. Advanced mechanisms such as differential privacy and federated learning can also be used. Differential privacy protects the privacy of users by adding noise to the data or the model while largely maintaining accuracy [26]. Federated learning allows distributed training of models across multiple devices, sharing only aggregate updates and not the raw data [27]. Together, these mechanisms strike a balance between performance and privacy, making privacy an operational principle and not just a normative one.
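How these two mechanisms can be combined is sketched below in plain Python. This is an illustrative simplification, not a method described in the reviewed papers: each device's model update is clipped to a maximum L2 norm, the clipped updates are averaged, and Gaussian noise is added to the average. The names `clip_update` and `private_federated_average` are hypothetical, and `noise_std` is a free parameter here; it is not calibrated to any formal privacy budget.

```python
import math
import random

def clip_update(update, max_norm):
    """Scale an update vector down so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def private_federated_average(client_updates, max_norm, noise_std, rng=None):
    """Average clipped client updates and perturb the result with Gaussian noise.

    Only the updates leave the devices; raw audio is never shared.
    noise_std is an illustrative knob, not a calibrated privacy guarantee.
    """
    rng = rng or random.Random()
    clipped = [clip_update(u, max_norm) for u in client_updates]
    n_clients = len(clipped)
    dim = len(clipped[0])
    mean = [sum(u[j] for u in clipped) / n_clients for j in range(dim)]
    return [m + rng.gauss(0.0, noise_std) for m in mean]
```

Clipping bounds the influence of any single device on the aggregate, which is what makes the added noise meaningful; larger noise gives stronger protection at the cost of accuracy.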

In recent years, the scientific community has produced a substantial number of articles, models and datasets related to SED. However, there is a paucity of systematic and structured synthesis with a specific focus on the use of SED in smart cities. Existing reviews frequently address the subject in an overly generalized manner or focus exclusively on technical aspects, neglecting to contextualize the technologies within the urban environment [28]. Furthermore, the rapid evolution of models renders many traditional comparisons obsolete [29]. The annual introduction of new architectures in DCASE challenges, the increase in public datasets, and the adoption of techniques such as transfer learning or data augmentation necessitate continuous updating of knowledge [30].

To guarantee methodological rigor and transparency in the literature review, the present paper employs a systematic review approach based on the PRISMA framework. This approach makes study selection transparent, reduces the risk of bias as far as possible, and ensures that every step is traceable. Following the PRISMA protocol allows the proposed Sound Event Detection (SED) models to be assessed against common, objective criteria, which facilitates the evaluation of the strengths and weaknesses of each proposal. It also helps to identify gaps in the field more accurately, to detect new trends in model design more clearly, and to highlight key areas for future research. In addition, the review offers useful insights on the application of SED systems in the urban environment.

To complement the qualitative review, a comparative analysis of the key performance metrics reported in the literature was also conducted.

The main goal of the article is to provide a systematic, updated, and comparative overview of the Sound Event Detection literature, with special emphasis on its application in urban and smart city environments. Specifically, we aim to:

Identify and classify the main SED models developed and tested in urban scenarios;
Analyze the public datasets used to train and evaluate the models, with attention to their representativeness and realism;
Compare model performance in terms of standardized metrics (F1-score, ER, PSDS) on common benchmarks;
Examine concrete use cases where SED has been employed in smart cities;
Discuss open challenges and propose future directions for research and development.
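For readers unfamiliar with the metrics named in the third objective, the following Python sketch computes a segment-based F1-score and error rate (ER) in the form commonly used in polyphonic SED evaluation: within each segment, substitutions are min(FN, FP), deletions are max(0, FN − FP), and insertions are max(0, FP − FN), and ER is their total divided by the number of active reference events. The input layout (one 0/1 list per segment) and the function name are assumptions for illustration; PSDS is not covered here.

```python
def segment_based_scores(reference, estimate):
    """Return (F1, ER) for frame/segment-level multi-label activity.

    reference, estimate: lists of segments, each a 0/1 list over event classes.
    """
    tp = fp = fn = 0
    subs = dels = ins = 0
    n_ref = 0
    for ref_seg, est_seg in zip(reference, estimate):
        seg_tp = sum(1 for r, e in zip(ref_seg, est_seg) if r and e)
        seg_fp = sum(1 for r, e in zip(ref_seg, est_seg) if e and not r)
        seg_fn = sum(1 for r, e in zip(ref_seg, est_seg) if r and not e)
        tp += seg_tp
        fp += seg_fp
        fn += seg_fn
        subs += min(seg_fp, seg_fn)          # a miss paired with a false alarm
        dels += max(0, seg_fn - seg_fp)      # remaining misses
        ins += max(0, seg_fp - seg_fn)       # remaining false alarms
        n_ref += sum(ref_seg)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f1, er
```

Note that ER can exceed 1 when a system produces many insertions, so a lower ER and a higher F1 are both needed to describe performance.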

This literature review offers three main contributions to the scientific and technical community. First, it provides a critical synthesis of studies on Sound Event Detection in urban settings, systematically reviewing work published over the past decade with a focus on applications in smart cities. Unlike more general reviews on machine listening, this paper highlights the distinctive challenges of urban environments, including high polyphony, high variability in background noise, and limited availability of annotated datasets representative of local languages and contexts.

The second contribution consists of a comparative analysis of currently available models and datasets. We compare different approaches, from engineered feature-based methods to more recent deep learning architectures to self-supervised models, discussing their performance, structural limitations, and degree of representativeness with respect to real urban scenarios. At the same time, we compare the main public datasets, evaluating their coverage, quality, heterogeneity and compatibility with different applications. This comparative framework allows us to identify which tools are most appropriate for different urban monitoring contexts, and which issues remain unexplored.

Finally, the paper proposes a set of guidelines and future directions for research and implementation of SED systems in smart cities. It makes operational recommendations aimed at researchers, developers, and policy-makers, such as the adoption of shared metrics, the use of ethical and inclusive data collection strategies, and the integration of SED systems with sensor networks and IoT platforms. It also suggests guidelines for documentation and replicability, covering items such as dataset cards, environmental conditions, SNR, hardware characteristics, and computational requirements.

This review goes beyond compiling the latest developments in the area. It aims to provide a critical, impact-oriented perspective that can guide future research directions as well as technological and strategic decisions in urban areas in the near future. The article is organized as follows: Section 2 describes the systematic review approach, following the PRISMA protocol. Section 3 discusses the published SED models, grouped by architecture and by the urban contexts they were designed for. Section 4 discusses the primary datasets used for training and testing. Section 5 examines SED applications in smart cities. Section 6 compares the models based on published metrics. Section 7 discusses open challenges and possible directions for SED. Section 8 concludes with key findings and recommendations.

2. Methodological Framework for the Review

This review was conducted following the PRISMA guidelines (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), which are the global standard for conducting and reporting systematic reviews and meta-analyses [31,32,33]. The completed PRISMA checklist detailing each reporting item is provided in the Supplementary Materials to ensure transparency and reproducibility of the systematic review process. The rationale for using this framework is to maximize the transparency, reproducibility, and reliability of the review by following a well-defined and documented procedure. The following paragraphs describe the adopted methodology, from formulating the review question to selecting the relevant papers.

The systematic review methodology adopted in this review involves several steps. The first is the formulation of the review question, which was carried out using the PICOS framework, a framework widely adopted in systematic reviews [34,35]. In the context of this review, the elements of the review question are defined as follows:

-: Population: The population of the review question refers to the set of relevant papers related to Sound Event Detection.
-: Intervention: The intervention refers to the available technology and methodology.
-: Comparison: The comparison refers to the methodology that has been compared to the conventional methodology.

To structure the review, the following four review questions were formulated:

RQ1: What are the most commonly used publicly available data sets in the context of sound event detection, and what are the most commonly applied data transformation techniques?
RQ2: What are the signal representations and descriptor extraction techniques that have been adopted by the state of the art in the context of sound-based detection systems?
RQ3: What are the most popular machine learning and deep learning models that have been proposed and adopted by the state of the art in the context of sound event detection?
RQ4: What are the most popular environments that have been explored by the state of the art in the context of sound event detection?

Finally, the review examines the types of environments most frequently considered in recent studies, offering insights into practical applications and research trends. Together, these questions ensure a comprehensive analysis of current methodologies and datasets.

After developing the research questions, a systematic and replicable method for bibliographic searching was developed. The search was conducted from January to December 2025, using prominent electronic databases such as Scopus, Web of Science, PubMed, IEEE Xplore, and ScienceDirect, supplemented by Google Scholar as an additional tool. These databases were selected to offer an interdisciplinary search, encompassing engineering and technology perspectives as well as research with an experimental, application, or methodological focus. Search queries were developed through an iterative process, starting from primary keywords and then adding Boolean operators, truncation, and filtering. A variety of synonyms, term variations, and relevant acronyms were explored to reduce the risk of overlooking relevant research.
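The way synonyms, Boolean operators, and truncation combine into a single search string can be sketched as follows. The concept groups shown are purely illustrative and are not the search strings used in this review; the helper name `build_query` is hypothetical, and the exact syntax for phrases and wildcards differs between databases.

```python
def build_query(concept_groups):
    """OR the synonyms within each concept group, then AND the groups together."""
    clauses = []
    for terms in concept_groups:
        # Multi-word terms are quoted so that they are searched as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)

# Illustrative concept groups (assumed, not taken from the review protocol).
example_groups = [
    ["sound event detection", "acoustic event detection", "SED"],
    ["smart cit*", "urban"],
]
example_query = build_query(example_groups)
```

Keeping the concept groups as data makes each iteration of the query easy to log, which supports the replicability that the protocol aims for.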

In the following step, specific inclusion and exclusion criteria were established to facilitate systematic and objective study selection, in line with the research questions [36,37]. Articles were selected for inclusion if they were published in English, were accessible in full text, were peer-reviewed, and specifically focused on sound event detection in a smart city or urban setting. Both experimental and theoretical studies were included if they provided quantitative outcomes interpretable within the context of the review objectives. Duplicates, non-peer-reviewed works, reviews, editorials, purely descriptive or non-quantitative studies, studies with outcomes not relevant to the review, and studies with insufficiently detailed methodology were excluded. To further limit bias, studies with non-representative samples, non-replicable results, or non-transparent methodologies were also excluded. The identified studies were then categorized based on characteristics of the datasets, preprocessing and feature extraction techniques, models, and experimental designs to facilitate a structured synthesis.

The risk of bias due to missing results was assessed based on the presence of selective reporting, discrepancies between methods and results sections, and publication bias across studies, without the use of any quantitative methods for bias detection. The certainty of evidence was assessed qualitatively based on the design, methodological rigor, transparency of the dataset, consistency of the findings, and the replicability of the results, without any grading system being used.

The process of selecting studies involved three steps. First, all records retrieved from the searches were imported into citation management software to organize the records and eliminate duplicates. Second, two independent reviewers assessed the titles and abstracts to exclude obviously irrelevant records. Third, articles that passed this first screening were read in full to ensure that they met the predetermined inclusion and exclusion criteria. Discrepancies were resolved by discussion; if agreement could not be reached, a third reviewer made the final decision. No automated screening tools were employed; all studies were screened manually. The process was recorded using the PRISMA flow diagram [38] (Figure 1).
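The bookkeeping behind these steps can be sketched in a few lines of Python. This is an illustration only: the review relied on citation management software and manual screening, and the record fields (`doi`, `title`), the matching rule, and the function names are assumptions. The adjudication function also simplifies the procedure by omitting the discussion stage that precedes the third reviewer's decision.

```python
import re

def record_key(record):
    """Build a matching key: normalized DOI if present, else normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return "title:" + title

def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = {}
    for rec in records:
        seen.setdefault(record_key(rec), rec)
    return list(seen.values())

def screening_decision(reviewer_a, reviewer_b, adjudicator=None):
    """Combine two independent include/exclude votes (True means include).

    On disagreement, the third reviewer's vote is final.
    """
    if reviewer_a == reviewer_b:
        return reviewer_a
    if adjudicator is None:
        raise ValueError("disagreement requires a third reviewer's decision")
    return adjudicator
```

Counting the records before and after each function call yields the numbers that a PRISMA flow diagram reports at each stage.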

Figure 1 also reports the number of records initially identified, the number excluded at each step, and the number retained for the qualitative examination and, in some cases, for the quantitative examination as well.

Once the studies to be included had been determined, the next step was data extraction. A purpose-designed standardized form was used for each included study. The form recorded bibliographic details (authors, year of publication, source), methodological elements (research design, sample, instruments), key findings, and the limitations reported by the authors. Using the same standardized form for every study helped to limit transcription errors.

Data synthesis combined a qualitative examination of the findings with a descriptive comparison of the reported quantitative results. In the qualitative examination, findings were arranged by theme, highlighting trends, areas of agreement, and areas of disagreement between studies. The outcomes of interest were sound event detection performance, the dataset, the application scenario, and the evaluation methodology. All results compatible with these outcome domains were collected systematically, including all reported measures, settings, and evaluation methodologies, without restriction to specific time points. The effect measures were the descriptive performance measures reported by the studies (accuracy, precision, recall, F1-score, and error rate) together with qualitative comparisons; no pooled quantitative synthesis was carried out. The synthesis was based on tabulating the results of the individual studies, comparing each study’s characteristics, methodological approach, dataset, and application domain against the synthesis objectives. During data preparation, the performance metrics were standardized, and incomplete data were removed when necessary. The results are summarized in tables and comparative figures.
Heterogeneity was addressed through a narrative synthesis, which allowed qualitative comparison of the results, and was explored with respect to dataset and application context. No sensitivity analysis was performed, since no quantitative synthesis was carried out.

3. Sound Event Detection Models: Architectures and Urban Contexts

Identifying and classifying Sound Event Detection (SED) models is crucial to understanding the state of the art in acoustic analysis in urban environments. Contemporary urban environments are increasingly noisy and complex, with traffic, factories, public transportation, events, construction, weather, and human activity creating a dynamic and chaotic soundscape [39,40]. In this setting, automatic sound event recognition, including detection, classification, and, where possible, localization, is a strategic tool in various fields, such as environmental monitoring, urban security, traffic management, intelligent acoustic surveillance, and IoT technologies in smart city applications. Over the last decade, the growth of SED models has been rapid, driven by larger datasets, the progress of deep learning, and the introduction of new learning paradigms such as self-supervised learning and audio foundation models. However, this rapid growth has resulted in methodological fragmentation, with studies using different metrics, non-comparable datasets, different preprocessing techniques, and annotation schemes with varying degrees of detail [41]. This diversity makes it difficult to assess the state of the art, to weigh the strengths and weaknesses of SED models, and to determine which models perform best in urban environments [42], what their major limitations are, and which directions are most promising.

Articles were selected on the basis of their relevance to SED models and sound properties, not on city characteristics such as size, location, and type. We acknowledge that factors such as coastal or inland location, population density, and transportation may influence sound properties; these factors are considered in the discussion of how effective the models are across different urban environments. The aim of this section is to provide a comprehensive and systematic overview that identifies and classifies the main SED models developed and tested in urban scenarios. This classification is not limited to a simple list of techniques, but reconstructs the evolution of methodologies, analyzing the logic behind the design of the models, the conditions for validating the systems, the type of datasets used, and their degree of realism. Attention is also paid to the ability of the models to manage the main critical issues typical of the urban soundscape, including strong polyphony, environmental interference, variable atmospheric conditions, and the presence of unstructured stochastic noise.

Historically, the initial approaches to SED in urban contexts were based on manual engineering of audio features. Features such as MFCCs, zero-crossing rates, power spectra, and temporal or spectral statistics were long used to train conventional classifiers, namely Support Vector Machines, Random Forests, Gaussian Mixture Models, and Hidden Markov Models [43]. Although these methodologies represented a fundamental step forward, they soon showed their limitations when applied to highly complex soundscapes. Polyphony in particular was a limiting factor, as engineered features could not naturally decompose the acoustic scene into multiple overlapping sources. The mid-2010s saw the introduction of deep neural networks, which brought a fundamentally new tool to the field. Convolutional Neural Networks (CNNs), applied to log-Mel spectrogram representations, provide a stronger and more discriminative representation that captures complex patterns in both the time and frequency domains [44]. Following this line of thought, combinations of CNNs with Recurrent Neural Networks (RNNs), known as CRNNs, have enabled effective modeling of both the spectral and temporal aspects of sound events [45]. This has led to significant improvements in performance, particularly for prolonged events or those characterized by strong variability over time.
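As a brief illustration of what lies behind a log-Mel representation, the sketch below converts between hertz and the mel scale, places band edges at equal mel spacing, and applies logarithmic compression to band energies. It uses the common 2595·log10(1 + f/700) form of the mel scale; the number of bands, the frequency range, and the function names are assumptions for illustration, and a complete front-end would also need a short-time Fourier transform and triangular filters, which are omitted here.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Return n_bands + 2 edge frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

def log_compress(band_energies, eps=1e-10):
    """Natural-log compression of band energies; eps avoids log(0) on silence."""
    return [math.log(e + eps) for e in band_energies]
```

Equal spacing in mels gives narrow bands at low frequencies and wide bands at high frequencies, which roughly mirrors human pitch resolution and reduces the dimensionality of the input passed to a CNN.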

At the same time, the advent of attention, first as temporal and frequency mechanisms integrated into neural networks and then in Transformers, has further increased the ability of models to discriminate relevant events in the presence of urban noise [46]. Transformer-based models, thanks to their ability to capture long-range dependencies, have shown excellent results in SED, especially when combined with pre-training on large collections of unannotated audio, paving the way for self-supervised techniques [47]. In recent years, the self-supervised approach has had a decisive impact [48]. Models such as wav2vec 2.0 [49], HuBERT [50], BEATs [51], and the latest generation of multimodal audio encoders (e.g., those developed for multimodal foundation systems) allow rich and generalizable acoustic representations to be learned even in the absence of annotations, overcoming most of the limitations associated with the scarcity of labeled urban datasets. These encoders act as universal front-ends for numerous tasks, including SED, and achieve performance comparable, and in many cases superior, to that of traditional supervised models.
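The mechanism that lets Transformers relate distant time frames is scaled dot-product attention, which can be written compactly in plain Python. The sketch below is a didactic, single-head version over lists of vectors; it is not the implementation used by any of the cited models, and real systems add learned projections, multiple heads, and positional information.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    queries, keys: lists of d-dimensional vectors; values: one vector per key.
    Every query frame can draw on every key frame, however far apart in time.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

Because the weights are computed between all pairs of frames, the cost grows quadratically with sequence length, which is one reason these models demand more computation than CNNs or CRNNs.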

Another front concerns multimodal models, capable of combining acoustic signals with other sources of information such as images, meteorological data, or IoT sensors. In urban environments, this integration is particularly useful for distinguishing acoustic events that, on their own, are highly ambiguous (e.g., traffic vs. strong wind, children playing vs. distant alarms). Audio-video or audio-LiDAR models, although still not widely used, represent one of the most promising directions for advanced urban monitoring [52].

From a strict classification perspective, SED models used in urban scenarios can be grouped into several categories:

Models based on engineered features, which include traditional approaches and lightweight classifiers.
Supervised deep learning models, including CNNs, RNNs, CRNNs, and Transformers trained on annotated datasets.
Self-supervised models and foundation models, which leverage large amounts of unannotated audio.
Multimodal models, based on the integration of audio and other sensory signals.
Specialized models for related tasks, such as source separation, sound localization, or event saliency, which support or complement urban SED.

Each category has specific advantages and limitations. Feature-based models are lightweight and easily explainable, but not very effective in noisy or polyphonic contexts. CNNs and CRNNs offer good performance and a balance between complexity and accuracy, making them ideal for many real-world applications. Transformers and self-supervised models deliver high performance and great generalizability, but require greater computational resources. Finally, multimodal models open up new possibilities, but require advanced sensor infrastructure and complex datasets.

3.1. Models Based on Engineered Features

Models based on engineered features represent the first generation of approaches to Sound Event Detection (SED), and still play an important role today as a methodological reference and baseline in experimental benchmarks (Figure 2). Before the advent of deep learning techniques, SED relied heavily on the ability of experts to manually extract meaningful acoustic descriptors designed to capture the distinctive characteristics of sound events. These features, combined with traditional classifiers, formed the basis of most systems developed between the 2000s and the mid-2010s.

Among the most widely used descriptors are Mel-Frequency Cepstral Coefficients (MFCCs), originally introduced in the context of speech recognition and later adapted to the classification of environmental sounds. MFCCs are often accompanied by derived features, such as delta and delta-delta, and spectral descriptors such as spectral centroid, roll-off, flux, and bandwidth, which provide information on the energy distribution of the signal [53]. These descriptors have been used in numerous pioneering studies on urban sound classification, including the work of Cowling and Sitte [54], which used combinations of MFCC and temporal statistics in association with Support Vector Machines (SVMs) and Gaussian Mixture Models (GMMs).
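Two of the descriptors mentioned above are simple enough to sketch directly. The Python example below computes the spectral centroid of one magnitude spectrum and regression-based delta features over a sequence of coefficient frames; applying `delta` twice yields delta-delta features. The window width, the edge handling (repeating the first and last frames), and the function names are assumptions for illustration and are not drawn from the cited studies.

```python
def spectral_centroid(magnitudes, freqs):
    """Magnitude-weighted mean frequency of one spectrum (0.0 for silence)."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total

def delta(frames, width=2):
    """Regression-based delta over time.

    frames: list of per-frame coefficient lists (e.g., MFCC vectors).
    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n} n^2)
    """
    denom = 2 * sum(n * n for n in range(1, width + 1))
    n_frames = len(frames)
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(n_frames - 1, t + n)][k]  # repeat last frame at the edge
                prv = frames[max(0, t - n)][k]             # repeat first frame at the edge
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out
```

Delta features describe how each coefficient changes over neighboring frames, which gives frame-wise classifiers such as SVMs and GMMs some access to temporal dynamics.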

SVMs, in particular, proved to be among the most effective classifiers in this phase, thanks to their ability to handle non-linear feature spaces through the use of appropriate kernels. For example, Chu et al. [55] showed that SVMs and GMMs, combined with spectral descriptors and MFCCs, can reliably distinguish between different categories of urban sounds, but with significant limitations in the presence of strong sound overlaps. Similarly, Barchiesi et al. [56] provided a detailed review of pre-deep learning sound classification techniques, highlighting how feature selection and normalization had a decisive impact on performance. Another relevant strand is that of models based on Hidden Markov Models (HMMs), widely used to model sounds with a sequential temporal structure, such as sirens, footsteps, or intermittent mechanical noises.

Mesaros et al. [57], in one of the first systematic studies on SED, combined MFCCs with HMMs and obtained promising results in the recognition of continuous acoustic events, despite the difficulties in managing the polyphony typical of urban contexts. A major weakness of classifiers based on engineered features is their sensitivity to how the data were recorded: recording conditions, signal-to-noise ratio, and changes in background noise all affect performance. Several studies have shown that even minor changes in environmental settings have a major impact on the effectiveness of traditional classifiers [58]. This is a serious problem in a city, where noise fluctuates constantly.

Despite their limitations, engineered feature-based models are still important. They are used as baselines in global competitions such as the DCASE Challenge, making it simple to compare new methods with existing ones [59]. They also have advantages in terms of interpretability, lower computational complexity, and the ability to be deployed on embedded systems or IoT devices. In conclusion, engineered feature-based models have defined the approach to urban SED and are an important point of reference for monitoring the progress of new methods over time. Although they are challenged by polyphonic sounds and the noisy environment of urban areas, they are useful in scenarios where computational resources are limited [21].

3.2. Supervised Deep Learning Models

The advent of deep learning techniques represents a turning point in the progress of Sound Event Detection (SED) since it overcame many limitations of models using engineered features. In fact, since the mid-2010s, supervised neural models, and specifically CNNs, RNNs, CRNNs, and more recently Transformers, have been the state of the art on a wide range of tasks, achieving excellent results, especially on complex and realistic scenarios such as those encountered in urban environments (see Figure 3).

CNNs were the first neural models to be applied to SED tasks [60]. The basic idea is to transform the audio signal into a time-frequency representation, such as a Mel or Log Mel spectrogram [61], and then to apply a CNN that automatically extracts discriminative features from it. Cakir et al. [62] demonstrated the effectiveness of combining Log Mel spectrograms with CNNs, achieving significant performance improvements over traditional models. Piczak [63] further confirmed this evidence in a pioneering study on the classification of environmental sounds using CNNs with two convolutional layers, demonstrating the robustness of the approach even in the presence of urban noise [64]. CNNs are particularly effective in identifying impulsive events or those characterized by well-defined frequency structures, such as car horns, collisions, broken glass, or acoustic alarm signals [65].
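As an illustration of the input representation described above, the sketch below computes a Log Mel spectrogram with NumPy only. The frame length, hop size, number of Mel bands, and the particular Hz-to-Mel formula are assumptions chosen for the example; the cited works may use different settings.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(signal, sample_rate, n_fft=1024, hop=512, n_mels=40):
    """Return a (frames, n_mels) array in dB; the signal must hold at least n_fft samples."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # A small constant avoids log(0) in silent bands.
    return 10.0 * np.log10(mel_power + 1e-10)
```

The resulting two-dimensional array can be treated like a single-channel image, which is what allows convolutional layers to learn local time-frequency patterns from it.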

Alongside CNNs, RNNs have established themselves as essential tools for modeling the temporal dependence of sound events. The most commonly used variants in urban SED are Long Short-Term Memory (LSTM) [66] and Gated Recurrent Unit (GRU) [67], which are capable of capturing long-term correlations and handling the temporal variability of events. In one of the most influential works in the field, Adavanne et al. [68] demonstrated that the integration of convolution and recurrence significantly improves the recognition of overlapping events, especially in annotated real-life recordings. RNNs alone may be less effective than CNNs at capturing spectral patterns, but they make a fundamental contribution to modeling the sequential structure of the signal [69].

The combination of CNNs and RNNs has led to the development of Convolutional Recurrent Neural Networks (CRNNs), long considered the state of the art in supervised SED tasks. A seminal contribution in this direction is that of Cakir et al. [62], which shows how CRNNs can jointly model spectral structure and temporal dynamics, achieving superior performance to isolated CNNs or RNNs in the DCASE series benchmarks. These architectures are particularly well suited to urban scenarios, which are characterized by frequent overlaps between impulsive events (e.g., car horns) and continuous events (engines, traffic noise). In addition, CRNNs are capable of adequately handling the environmental variability associated with changing weather conditions, multiple simultaneous sources, and recordings made in open spaces or urban canyons [70,71].

In recent years, attention has gradually shifted toward models based on the attention mechanism and, subsequently, on Transformer architectures [72]. The attention mechanism was introduced into SED to automatically highlight relevant temporal and frequency regions, improving event discrimination in the presence of strong background noise. Kong et al. [73] proposed the series of “PANNs” (Pretrained Audio Neural Networks) models, based on CNN but enriched with attention modules, showing how temporal and spectral attention improves the temporal localization of events that are difficult to discern in urban landscapes.

The arrival of Transformers has further revolutionized the domain. Gong et al. [74] have shown that architectures based exclusively on attention, without recurrence, are capable of effectively modeling long-range dependencies, a crucial aspect for prolonged sound events or those characterized by complex temporal evolution. Positional encoding helps preserve the order of the signal sequence [75]. In urban sound environments, Transformers also show strong resistance to polyphony [76], because the model can attend to multiple frequency bins or time frames simultaneously. This ability makes it easier to distinguish between overlapping sources such as sirens and traffic, or voices and machinery sounds [77]. Currently, the dominant paradigm in sound event detection in urban environments is supervised deep learning [78]. CNNs are excellent for automatic feature learning [79], RNNs are adept at understanding signal evolution in the time domain [80], and CRNNs combine the strengths of both [81]. Transformers, on the other hand, are breaking new ground by capturing global dependencies and dealing with very complex acoustic environments [82].
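The two ingredients mentioned above, attention without recurrence and positional encoding, can be sketched in a few lines of NumPy. This is a single-head illustration with randomly initialized projection matrices, assuming an even embedding dimension; it is not the architecture of any cited model.

```python
import numpy as np

def sinusoidal_positions(n_frames, dim):
    """Fixed sinusoidal positional encoding, shape (n_frames, dim); dim must be even."""
    positions = np.arange(n_frames)[:, None]
    rates = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    encoding = np.zeros((n_frames, dim))
    encoding[:, 0::2] = np.sin(positions * rates)
    encoding[:, 1::2] = np.cos(positions * rates)
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (frames, dim) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    # Subtracting the row maximum keeps the softmax numerically stable.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

# Toy usage: 6 frames of 8-dimensional features (random placeholders).
rng = np.random.default_rng(0)
frames = rng.standard_normal((6, 8)) + sinusoidal_positions(6, 8)
w_q, w_k, w_v = (rng.standard_normal((8, 4)) for _ in range(3))
context, attention = self_attention(frames, w_q, w_k, w_v)
```

Each row of `attention` is a distribution over all frames, so every output frame can draw on any other frame regardless of distance, which is the property that makes long-range dependencies easy to model.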

3.3. Self-Supervised Models and Foundation Models

In the last few years, self-supervised learning (SSL) has dramatically changed the way we think about audio analysis, enabling us to find ways to make meaningful use of vast amounts of unlabeled audio data [83] (see Figure 4). This is especially relevant for Sound Event Detection (SED) in urban environments, where obtaining accurate labels can be costly, laborious, and highly subjective due to individual variations in perception [84].

SSL allows us to leverage large unlabeled sound archives, reducing dependence on manually annotated datasets and improving model generalization in complex acoustic conditions [85].

One of the first widely used SSL techniques in the audio domain is Contrastive Predictive Coding (CPC) [86]. CPC uses a contrastive strategy in which the model learns to predict future representations within a sequence, distinguishing them from negative examples. Although developed in the field of speech, CPC has been shown to extract representations that are also useful for classification and event detection tasks, providing a conceptual basis for many subsequent methods.
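The contrastive objective behind CPC can be sketched as an InfoNCE-style loss. In the minimal NumPy example below, `predictions` stands for the model's predicted future representations and `targets` for the actual encoded future frames; both names, and the use of the other rows in the batch as negative examples, are assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(predictions, targets):
    """InfoNCE-style loss for (batch, dim) arrays.

    Row i of `targets` is the positive example for row i of `predictions`;
    all other rows act as negatives.
    """
    logits = predictions @ targets.T
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The loss is low when each prediction scores its own target highest.
    return float(-np.mean(np.diag(log_probs)))

# Toy check: aligned predictions give a lower loss than misaligned ones.
targets = np.eye(4)
aligned = info_nce_loss(10.0 * targets, targets)
misaligned = info_nce_loss(10.0 * np.roll(targets, 1, axis=0), targets)
```

Minimizing this loss pushes the predicted representation toward the true future frame and away from the negatives, without requiring any event labels.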

Recent transformer architectures and associated increases in computational power have triggered a significant revolution in self-supervised learning manifested by the development of large-scale pre-training models. For instance, wav2vec 2.0 proposed by Baevski et al. [49] is one such model that combines temporal masking with contrastive learning to form robust representations of raw audio signals. While originally developed for automatic speech recognition, wav2vec 2.0 has recently been extended to various non-speech tasks such as classification of environmental sounds and detecting complex acoustic events. The latest results have indicated that fine-tuning wav2vec 2.0 on UrbanSound8K or ESC-50 produces state-of-the-art results even with a few labeled samples.

At the same time, another strand has developed based on discrete representations learned from the audio signal, as in the HuBERT model by Hsu et al. [50]. HuBERT uses an iterative clustering and mask prediction approach that allows high-level acoustic features to be learned without supervision. Although originally designed for speech tasks, HuBERT has been used in general-purpose applications such as environmental classification, sound tagging, and SED, showing high robustness to noise, adverse conditions, and variability in urban sources.

Another important contribution comes from models pre-trained on large-scale datasets such as PANNs (Pretrained Audio Neural Networks) proposed by Kong et al. [73]. Although PANNs is based on supervised learning, its use as a feature extractor in contexts with a low number of annotations has favored its adoption as a “foundation” model in the domain of machine listening. PANNs is trained on the AudioSet dataset [87], which contains over 2 million annotated clips: this massive pretraining allows the model to capture general acoustic patterns that can be reused in many urban SED applications, such as traffic detection, dangerous events, or alarm signals.

The recent trend toward multimodality has also introduced models that learn shared audio-text representations. Among these, a significant contribution comes from the AudioCLIP model [88], inspired by the CLIP architecture for vision. AudioCLIP allows sounds, images, and text to be mapped onto a common space and has shown excellent results in zero-shot classification, a particularly useful property for dynamic urban scenarios where new types of events can emerge without annotated data available. These foundation models facilitate generalization, make the system more robust to noise, and help it adapt to new situations. In addition, the adoption of self-supervised learning reduces the requirement for annotated data, which may carry cultural, linguistic, and geographical biases when applied in the city setting. Since the model learns from the statistical properties of the signal, it captures more general acoustic characteristics than a model trained on a limited labeled dataset.
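Zero-shot classification in a shared embedding space reduces to a nearest-neighbor search between an audio embedding and the embeddings of candidate label texts. The sketch below shows only that final step with NumPy; the embeddings are made-up toy vectors and no real AudioCLIP encoder or API is involved.

```python
import numpy as np

def zero_shot_classify(audio_embedding, text_embeddings, labels):
    """Return the label whose text embedding is most similar (cosine) to the audio."""
    audio = audio_embedding / np.linalg.norm(audio_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarities = texts @ audio
    return labels[int(np.argmax(similarities))], similarities

# Hypothetical embeddings; in practice they would come from pretrained encoders.
labels = ["siren", "dog bark", "jackhammer"]
text_embeddings = np.eye(3)
audio_embedding = np.array([0.1, 0.9, 0.2])
predicted, scores = zero_shot_classify(audio_embedding, text_embeddings, labels)
```

Because the label set is just a list of text prompts, a new event class can be added by embedding one more prompt, with no retraining of the audio model.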

Moving forward, the combination of self-supervised learning with multimodal pre-training and large models promises to be highly relevant for the field of SED, especially in the context of smart cities, where the acoustic environment is complex and needs intelligent, adaptive, and scalable solutions. The growing availability of unannotated audio, from urban surveillance systems to IoT sensors to microphones embedded in mobile devices, provides fertile ground for further advances in self-supervised learning.

3.4. Multimodal Models for Sound Event Detection in Urban Contexts

The most recent development in Sound Event Detection concerns the adoption of multimodal models, which are designed to integrate information coming simultaneously from audio, text, and video [89]. This allows the system to enrich its interpretation of acoustic events with semantic and visual context, thus overcoming the limitations of analysis based solely on audio signals [90]. Integrating data from multiple modalities makes it possible to disambiguate complex situations, handle events in difficult acoustic conditions, and markedly improve the accuracy and robustness of the system [91]. In urban contexts, which are characterized by superimposed sources, background noise, and highly dynamic scenes, these models represent a decisive step toward a more reliable and complete understanding of the sound environment (Figure 5).

A pioneering example is the study by Tian et al. [92], which proposes the task of Audio–Visual Event Localization in unconstrained videos. In this work, audio events are synchronized with visual information, and the model uses convolutional networks to extract audio and visual features, combined through a multimodal fusion module. The results on the AVE dataset show that the audio–visual combination significantly improves the temporal localization of events compared to unimodal systems, especially in noisy scenarios and with overlapping events.

Similar approaches have been explored by Bai et al. [93], who integrate audio information and spatiotemporal context for Multimodal Urban Sound Tagging. The model considers not only the sound sequence but also where and when the events occur. Using attention mechanisms, it relates information from the different modalities, which helps it classify urban acoustic events more accurately and avoid false alarms caused by similar background noises.

Another example is the work of Hou et al. [94], who develop an intelligent microsystem for SED based on edge computing and mesh networks. The system integrates audio signals collected from distributed sensors with network information and node topology, creating a multimodal framework capable of detecting sound events in real time in urban environments. The end-to-end architecture reduces latency and improves robustness, demonstrating the importance of multimodality for IoT applications in smart cities.

Berghi et al. [95] propose an audio–visual embedding fusion model for sound event localization that projects audio and video representations into a shared space, relying on a multimodal Transformer to correlate information over time. Strong performance of such a model on urban datasets would support the hypothesis that data fusion overcomes the limitations of unimodal models.

Luo et al. [96] also build on distributed sensor networks: they design a wireless sensor network for monitoring urban noise and detecting acoustic events. Integrating audio data with spatial and temporal information from the nodes yields detailed acoustic maps and more accurate event classification than single audio sensors allow.

Finally, research in the domain of Urban Informatics, exemplified by the work of Han et al. [97], fuses audio and video data to shed light on pedestrian behavior and the urban soundscape. The multimodal approach not only allows us to detect sound events but also helps us explore the interaction between sounds and the rhythm of the city, which opens the door to integrated smart city solutions. A smart city, as defined in this paper, is an ecosystem in which digital technologies, smart sensing, and data-driven solutions help make the city safer, more sustainable, and more livable. In this context, Sound Event Detection is a crucial task that enables the continuous analysis of the urban acoustic environment. The smart city parameters taken into account in this manuscript include the presence of distributed and heterogeneous sensor networks; real-time data processing; robustness to noise and complex soundscapes; compatibility with other urban data sources; and decision-making support in public safety, transportation, and environmental monitoring. This definition provides a clear context in which the SED applications discussed here are framed.

In all, multimodal models present some intrinsic advantages for urban SED, such as robustness to noise, better discrimination of similar events, and the possibility of generalization to new urban contexts. Therefore, the integration of audio with video, sensory data, or spatiotemporal information represents a very promising direction for building intelligent and scalable systems for the detection of sound events in future urban environments.

3.5. Task-Specific Models for Urban Sound Event Detection

In urban Sound Event Detection (SED) applications, more complex systems often integrate specialized models for related but distinct tasks, such as source separation, sound localization, and event saliency (Figure 6). These models improve the resilience, accuracy, and semantic robustness of sound event detection systems, particularly in the complex, noisy, and multi-source environments typically encountered in smart city contexts. Separating sound sources becomes critical when a specific event must be selected from a mixture of acoustic signals, such as sirens, speech, and engine noise. One of the most influential techniques is deep clustering, which was introduced by Hershey et al. [98]. This technique maps each time-frequency bin to an embedding in which points belonging to the same source are grouped together. At test time, a clustering algorithm (e.g., K-means) identifies the groups corresponding to the sources, which makes it possible to compute time-frequency masks and separate the signals. The technique has been extended and combined with mask-inference networks in hybrid architectures, showing that the combination improves separation quality and generalization under mixed conditions [99]. Source separation systems can support urban SED models by isolating sources of interest (e.g., sirens, alarms, vehicle noise), thereby enhancing classification and temporal detection.
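The test-time step of deep clustering described above (cluster the per-bin embeddings, then turn cluster assignments into masks) can be sketched as follows. The embeddings are assumed to have been produced already by a trained network; the simple K-means routine with farthest-first initialization is an illustrative choice that keeps the example deterministic.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain K-means on (n_points, dim) data; returns one cluster label per point."""
    # Farthest-first initialization (an assumption made for reproducibility).
    centroids = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(dists))])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

def masks_from_embeddings(embeddings, n_sources):
    """Map (time, freq, dim) embeddings to binary masks of shape (n_sources, time, freq)."""
    n_time, n_freq, dim = embeddings.shape
    labels = kmeans(embeddings.reshape(n_time * n_freq, dim), n_sources)
    labels = labels.reshape(n_time, n_freq)
    return np.stack([(labels == s).astype(float) for s in range(n_sources)])
```

Multiplying each mask element-wise with the mixture spectrogram and inverting the transform would yield one estimated signal per source.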

Sound source localization is useful in urban contexts because it provides spatial information about where an event occurs, helping to discriminate between similar events and filter out distant background noise. Sound event localization and detection (SELD) models have been proposed to simultaneously estimate the direction of arrival (DOA) of sources and identify the class of sound events. For example, Adavanne et al. [68] propose a Convolutional Recurrent Neural Network (CRNN) that produces two parallel output sets: (1) the probability of sound event activity (SED) and (2) the 3D position (x, y, z) of the source, estimated by regression. This architecture is very useful in urban scenarios where multiple events can overlap and come from different directions.

Another interesting work is that of Pujol et al. [100], who propose an end-to-end approach for source localization using raw multichannel data (acoustic pressure sampled by microphone arrays). The approach outperforms traditional techniques such as MUSIC or SRP-PHAT in environments with strong noise or reverberation, which indicates stability and accuracy even in harsh urban conditions. Furthermore, Vera Díaz et al. [101] propose a CNN that takes raw signals from a microphone array as input and directly outputs the location of the sound in 3D space. This technique can also be incorporated into a sound event detection system that not only detects sound events but also indicates their location in the urban soundscape. In addition, recent surveys highlight the application of deep learning techniques to sound localization and report that such techniques perform better than traditional ones even in harsh environments with strong noise or reverberation [102].
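For reference, the kind of traditional baseline mentioned above can be illustrated with GCC-PHAT, the pairwise time-delay estimator that SRP-PHAT builds on. The sketch below, assuming NumPy and a single microphone pair, estimates the delay between two channels; it is a generic textbook formulation, not the baseline implementation used in the cited papers.

```python
import numpy as np

def gcc_phat_delay(sig, ref, sample_rate, max_delay=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds, with GCC-PHAT."""
    n = len(sig) + len(ref)  # zero-padding avoids circular wrap-around
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross = cross / (np.abs(cross) + 1e-12)
    correlation = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_delay is not None:
        max_shift = max(1, min(int(sample_rate * max_delay), max_shift))
    # Reorder so that index `max_shift` corresponds to zero lag.
    correlation = np.concatenate(
        (correlation[-max_shift:], correlation[: max_shift + 1])
    )
    shift = int(np.argmax(np.abs(correlation))) - max_shift
    return shift / sample_rate
```

Given the microphone spacing and the speed of sound, the estimated delay can be converted into a direction of arrival; learned end-to-end models replace this hand-designed chain with a single network.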

Event saliency is also important for identifying particularly interesting or prominent sounds in an urban scene, such as alarms, screams, or other sudden sounds. Event saliency models identify these prominent sounds, which can then be passed as inputs to a sound event detection system. From a cognitive point of view, acoustic event saliency has also been studied in humans: Shuai et al. [103] reported that salient sound events elicit distinct patterns of modulation in the auditory steady-state response (ASSR) and evoked potentials, independent of top-down attention. This paradigm can inspire computational saliency models, in which a module simulates bottom-up attention to signal to the SED engine which events are potentially relevant for an intervention (e.g., urban emergencies). From an engineering perspective, saliency-driven models of acoustic scene analysis first identify the salient events in the audio signal and then analyze only those events for recognition. LISA, which stands for Latent Perceptual Indexing, is a good example: it employs attention mechanisms based on human perception to filter the important sounds and recognizes the scene from those sounds alone, reducing the computational burden of the analysis [104].

Incorporating these specialized modules into a city's SED system brings several concrete benefits. First, source separation modules isolate the sounds of interest, for example a siren amid general city noise, before classification. This significantly reduces confusion from overlapping sounds and leads to higher accuracy [105]. Second, a Sound Event Localization and Detection (SELD) module allows the system to determine where sounds occur, which supports faster responses (for example, by emergency services) and the generation of city noise maps [106]. Third, a saliency module prioritizes the sounds that matter, making the analysis more efficient by ignoring noise and less important sounds without compromising its quality [107]. Fourth, combining the three kinds of models makes the system more robust, so that it can handle harder scenarios involving reverberation, background noise, and moving sources with more reliable performance.

Despite the many benefits of specialized modules, incorporating them in urban SED systems faces several challenges. The first is training data: tasks such as source separation or spatial localization require multichannel recordings from microphone arrays, which are scarce for urban areas. The second is computation and latency: complex models such as SELD CRNNs or source separation networks require significant computational power to operate in real time, which may not be available on urban edge devices with limited processing resources. Generalization is a third problem: models may work well in a controlled environment but fail to generalize to urban settings with noise types or sources not seen in training. Reliable performance across environments may require domain adaptation or training on realistic polyphonic data. Finally, module fusion is also a problem: how to combine processing steps, fuse signals, or modulate source saliency depends on the specific task or application and may significantly affect overall performance. All these factors indicate that incorporating specialized modules in urban SED systems is not just a matter of technical capability; it requires designing solutions that take different contexts into account. Nevertheless, the outlook is promising: more realistic training sets, better edge computing capabilities, and richer multimodal models may make specialized source separation, localization, or saliency models an integral part of urban SED systems in the near future.

4. Datasets Used in Urban Sound Event Detection

The selection and analysis of the datasets used to train and test Sound Event Detection models are critical to the development of dependable models that can function in the real world. In the context of smart cities, the data must reflect the real complexity of the city soundscape, including polyphonic sound, changing backgrounds, and real events rather than simulated ones. In this section, the primary public datasets utilized in the SED literature are analyzed, their representativeness and realism evaluated, and their limitations and potential for future research discussed. Among the most widely used datasets in the urban SED field are UrbanSound8K, URBAN-SED, and datasets based on recordings from distributed sensors such as SONYC-UST.

UrbanSound8K is probably the most iconic dataset for urban audio [108]. It contains 8732 audio clips ≤ 4 s long, labeled across 10 classes (including siren, car horn, engine idling, children playing, jackhammer) and spread across 10 folds for cross-validation. The clips come from recordings uploaded to Freesound, which provides a “real-world” baseline of urban sounds captured in different contexts [109]. This dataset has been widely used to train urban acoustic scene classification models and as a baseline in SED challenges. However, the very short clip duration (maximum 4 s) and clip-level labeling (without precise temporal annotations for overlapping events) represent a limitation when applying models in real-world scenarios with polyphonic and overlapping events.
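The fold structure mentioned above lends itself to leave-one-fold-out evaluation. The following minimal sketch assumes the clip-to-fold assignment has already been loaded into a dictionary; the clip identifiers and the function name are hypothetical.

```python
def fold_splits(clip_folds):
    """Yield (held_out_fold, train_ids, test_ids) for predefined folds.

    clip_folds: dict mapping a clip identifier to its fold number.
    """
    for held_out in sorted(set(clip_folds.values())):
        train = [clip for clip, fold in clip_folds.items() if fold != held_out]
        test = [clip for clip, fold in clip_folds.items() if fold == held_out]
        yield held_out, train, test

# Hypothetical clip identifiers spread over three folds.
example = {"clip_a": 1, "clip_b": 1, "clip_c": 2, "clip_d": 3}
splits = list(fold_splits(example))
```

Keeping the predefined folds intact, rather than reshuffling clips, helps ensure that results reported by different studies remain comparable.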

URBAN-SED is a synthetically generated urban soundscape dataset: according to its description, it contains 10,000 soundscapes of 10 s each, with background noise (Brownian noise) and 1 to 9 foreground events per clip, all pre-annotated [110]. The source material for the events is taken from UrbanSound8K itself. A strength is that URBAN-SED provides predefined train/val/test splits and is intended for multi-class event detection. However, as also reported in the literature, the soundscapes and their annotations are computationally generated rather than collected from real recordings, which raises realism issues: the composition of the events may appear “artificial” and the acoustic character of the soundscape may differ significantly from that of a real urban environment [111].
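The general recipe behind such synthetic soundscapes (a noise background plus foreground events placed at random onsets, with annotations generated for free) can be sketched as follows. This is an illustrative NumPy example and not the tool actually used to build URBAN-SED; the gain value, function names, and placeholder event waveforms are assumptions.

```python
import numpy as np

def brownian_noise(n_samples, rng):
    """Brownian noise as the normalized running sum of white noise."""
    walk = np.cumsum(rng.standard_normal(n_samples))
    walk = walk - walk.mean()
    return walk / np.max(np.abs(walk))

def synthesize_soundscape(events, duration_s, sample_rate, rng, background_gain=0.1):
    """Mix (label, waveform) events into a background; return (mixture, annotations)."""
    n_samples = int(duration_s * sample_rate)
    mixture = background_gain * brownian_noise(n_samples, rng)
    annotations = []
    for label, waveform in events:
        waveform = waveform[:n_samples]
        onset = int(rng.integers(0, n_samples - len(waveform) + 1))
        mixture[onset : onset + len(waveform)] += waveform
        offset = onset + len(waveform)
        # Onset and offset are known exactly because the scene is generated.
        annotations.append((label, onset / sample_rate, offset / sample_rate))
    return mixture, sorted(annotations, key=lambda item: item[1])

# Toy usage with sine tones standing in for real event recordings.
rng = np.random.default_rng(0)
sr = 8000
tone = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
mixture, annotations = synthesize_soundscape(
    [("siren", tone), ("car_horn", tone[: sr // 2])], 10.0, sr, rng
)
```

The example also makes the realism issue visible: nothing in the mixing step models reverberation, microphone response, or the spatial relationship between sources.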

Another highly relevant dataset is SONYC-UST V2 [112], which comes from a network of real-world acoustic sensors deployed in New York City (Sounds of New York City, SONYC) and provides sound tag annotations for over 18,510 recordings. Its main advantage lies in its spatiotemporal contextualization, as each audio clip has associated metadata such as microphone position and timestamp. This allows studying how the urban acoustic landscape varies over time and space, and training models that leverage this information to improve generalization and robustness. However, because it focuses on urban sound tagging rather than strong temporal event labels (onset/offset), it may be less useful for SED models that require precise event start/end annotations.

Other useful resources include larger, more general datasets such as FSD50K, introduced by Fonseca et al. [113]. FSD50K contains over 51,000 audio clips annotated with 200 classes, taken from Freesound, and is designed for general sound event recognition tasks. Although not specific to urban environments, FSD50K offers a wide variety of sounds, including urban ones, and can serve as a basis for models that need to be resilient to various background sounds. However, its distribution may not faithfully reflect the prevalence and co-occurrence of urban events in a real-world context (e.g., class balance may not reflect the actual frequency in the city).

A more recent dataset designed for realistic scenarios is USM-SED [114], created to simulate urban situations by building 20,000 polyphonic soundscapes from sounds extracted from FSD50K, positioning them in stereo space, and varying their loudness levels. This approach makes it possible to test SED models under mixed and difficult conditions. Although it is more realistic than simpler synthetic data, it is still somewhat controlled and may not capture the full complexity of reverberation, real city layouts, real microphone placements, and source relationships in space.

Considering the datasets discussed earlier, two important factors are evident: representativeness, or how well the data covers the relevant sound classes and city scenarios, and realism, or how well the data reflects real cityscapes.

The representativeness of a dataset indicates how effectively it covers relevant sound classes and different urban acoustic conditions. A representative dataset includes typical city sounds, various background noise levels, overlapping events, and realistic spatial and temporal contexts, faithfully reflecting the complexity of the urban soundscape (Table 1):

UrbanSound8K covers a relatively small subset of sound classes (10), which includes many typical sounds (siren, car horn, jackhammer), but does not reflect the entirety of the urban acoustic variety (e.g., more complex traffic noises, natural sounds, or social interactions).
URBAN-SED includes the same classes as UrbanSound8K, but with multiple overlapping occurrences, which increases the data polyphony but does not introduce new classes.
SONYC-UST offers a greater variety of real-world urban environments thanks to field collection and spatiotemporal metadata, making it more representative of a true urban soundscape in medium- and large-sized cities.
FSD50K has a broad class ontology (200 classes) and therefore offers extensive coverage, but it was not specifically designed for urban environments.
USM-SED simulates urban polyphony but relies on pre-existing isolated sounds, which limits its representativeness of the complex interactions that occur in cities.

The realism of a dataset measures how faithfully the recordings reflect the acoustic conditions of a real urban environment. It includes variability in background noise, reverberation, moving sources, event density, and interactions between natural and artificial sounds. A realistic dataset allows for training effective and robust SED models in the field:

UrbanSound8K is real, in the sense that the clips come from “in-the-wild” recordings uploaded to Freesound; however, they are very short and isolated, and do not simulate overlapping sounds or a continuous urban soundscape.
URBAN-SED, being synthetic, cannot recreate the realistic reverberation, spatial layout, and acoustic variability of a real city; the annotation is perfect (generated), but imperfections, microphone distortions, and unexpected noises are missing. Studies themselves report that, while useful for training, its realism is limited.
SONYC-UST provides real-world recordings, with natural noises, reverberation, position variability, and moving sources, making it very realistic and useful for practical applications. However, its annotation is tag-based (not always with onset-offset), which may limit its use for highly accurate SED models.
FSD50K, while realistic in terms of labeled sounds, may not reflect typical urban co-occurrence in terms of event density, spatial location, and relationship to background noise.
USM-SED provides a good compromise for polyphony, but its synthetic nature means that some real-world dynamics (reflections, multiple sources distributed in real space, recorder variability) are not fully captured.

The selected dataset has a significant impact on model performance and overall effectiveness. That is, using datasets like UrbanSound8K, the model becomes good at identifying individual sounds but may not perform well in messy urban environments. In other words, the lack of polyphonic information during model training leads to overfitting on clean and ideal cases.

Using synthetic datasets like URBAN-SED is extremely useful because it provides complete control over the composition of the scenes. However, this may lead to a model that is biased toward the composition rules used to generate them and that fails to generalize to unseen soundscapes. Real datasets like SONYC-UST represent a major step forward because they expose the model to the inherent variability of urban environments. However, to make the best of such a dataset, architectures must be designed to leverage contextual information, such as temporal or spatial metadata, to better understand the surroundings of the sound. Synthetic datasets still have their uses, particularly when evaluating a model on highly polyphonic cases. Their major drawback is the domain shift between synthetic and real data. To bridge this shift, techniques like domain adaptation, data augmentation using real noises or reverberations, or pretraining on real recordings have been particularly useful. Class balance is another important factor. In real datasets like SONYC-UST, certain classes like gunshot or jackhammer may appear extremely rarely, whereas classes like air conditioner or motor may appear far more frequently. This may degrade the model’s performance on rare but safety-critical classes. Therefore, it is important to apply balancing techniques so that the less represented classes are properly covered.
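One common balancing technique is to weight each class inversely to its frequency in the training loss. The sketch below shows the computation in plain Python; the label names and counts are hypothetical and are not taken from SONYC-UST.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency.

    With this normalization, a perfectly balanced dataset gives every class weight 1.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {name: total / (n_classes * count) for name, count in counts.items()}

# Hypothetical label list: one frequent class and one rare class.
labels = ["air_conditioner"] * 90 + ["gunshot"] * 10
weights = inverse_frequency_weights(labels)
```

In this toy case an error on the rare class counts nine times as much as an error on the frequent one; oversampling rare clips or augmenting them are alternative strategies with a similar aim.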

To create Sound Event Detection models that better capture the messiness of city soundscapes, it is necessary to improve the quality and structure of the datasets used. First, large datasets are needed that consist of recordings taken directly from real-life situations in the city. This requires distributed sensor networks that can record in a variety of situations and provide precise annotations of the start and end times of the recorded events, as well as spatial and temporal details such as GPS, altitude, road type, and traffic levels. Another key requirement is access to polyphonic datasets, in which multiple sound events are present in the same recording. The use of multi-channel microphones or microphone arrays can help not only in detecting the sound but also in localizing it in space, thus enabling better urban noise monitoring systems. On the synthetic side, a qualitative leap is possible by using acoustic simulation engines that accurately simulate propagation, reverberation, surface reflection, and urban morphology. Blending this with the real world through soundscape synthesis techniques based on authentic recordings can help bridge the gap between simulation and reality, increasing the credibility of the synthetic data. Another strategic direction is the enrichment of the existing metadata. By adding information such as weather, time of day, traffic density, and urban morphology, the datasets can become more complete. Community engagement has been shown to be valuable, especially through citizen science initiatives and tools that allow citizens to contribute their own annotations or even mobile recordings. This can help scale the datasets rapidly while reflecting the acoustic diversity of cities. Finally, the pursuit of urban SED-specific challenges can help drive innovation. 
Competitions using both real and synthetic data, evaluated using metrics that consider the complexity of polyphony, localization, and real-time performance, can help drive research toward more reliable, generally applicable, and practically useful solutions.

As part of the investigation into public data for Sound Event Detection, a balance has been found between how representative each class is, how realistic each recording appears, and how usable each data source is for model development. UrbanSound8K has been a driving force for initial success, but it has been found wanting when it comes to polyphony and contextual richness. Synthetic data sources such as URBAN-SED and USM-SED are found to be rich in terms of scale and annotation, but lacking in terms of real-world authenticity. Real-world data sources such as SONYC-UST, collected from real-world urban sensors, are considered to be the way forward for developing SED models that can be deployed in real-world cities. The only problem with such models is that more accurate annotations are needed.

To develop SED models that can be deployed for monitoring, safety, and well-being in smart cities, it has been recommended that future research should focus on a balance between synthetic data sources and real-world data sources.

5. Real-World Applications of Sound Event Detection in Smart Cities

The application of Sound Event Detection (SED) in smart cities has emerged as an important approach for managing noise in cities, tracking the environment, and improving public safety. Recently, the availability of less expensive sensors, the development of deep learning techniques, and edge-cloud configurations have made it possible to establish broad networks that continuously monitor city soundscapes. Real-world applications published in journals and at conferences offer practical, successful examples that help to advance the area. One of the most important and successful applications is SONYC (Sounds of New York City) [115]: a comprehensive platform that measures and automatically analyzes urban noise. It combines low-cost acoustic sensor networks with complex classification models and analysis systems to inform public policy. SONYC illustrates the effectiveness of SED in a real-world setting, demonstrating that supervised learning models can successfully classify sources such as sirens, engines, compressors, and construction activities. The project also illustrates the need for effective systems that can handle highly variable and noisy environments, highlighting model robustness and the ability to identify overlapping sources. At the same time, other related works demonstrate that the sensors, although less expensive, are capable of providing reliable performance, making it possible to establish complex city networks [116]. Another important contribution to urban SED is the UrbanSound8K dataset, which is one of the most popular datasets used for training and testing urban-related models [108]. Its emergence has led to the development of deep learning models that are capable of distinguishing between various sources such as horns, jackhammers, sirens, and engines, opening up the possibility of implementing these models in urban settings. 
With the widespread use of this dataset, many researchers have developed new approaches for classifying urban sounds, improving the accuracy of automatic detection and system performance in real-world scenarios [117].

However, SED is not solely concerned with noise detection. Some cities have actually used acoustic sensing as a public service, particularly in automatic gunshot detection. These applications, mostly in the United States, involve microphone arrays running algorithms to identify sharp events. In an independent study in St. Louis, the performance of these systems in an urban environment was assessed, demonstrating both potential and challenges for prevention and rapid response [118]. The case study illustrates a real-world application of SED in a critical domain and emphasizes that high-quality event classification is paramount to prevent false alarms and keep the system trustworthy. Another aspect of this research explores the development of low-cost networks for permanent noise monitoring across cities. A systematic review revealed that the diffusion of these networks is closely tied to improvements in sensor miniaturization, data compression, and the capacity for real-time SED processing on the devices themselves [119]. These developments facilitate the widespread deployment of infrastructure that not only measures noise levels but also analyzes the semantic content of audio recordings, providing more detailed and useful data for urban planning policies. There is also a more general scientific interest in integrating SED with soundscape analysis. In a chapter on soundscape research in smart cities, authors explored how automation influences the evaluation of the acoustic landscape. They observe that the availability of data allows for the development of predictive models, warning systems, and urban diagnostics that extend beyond simple noise measurement [120]. The integration of SED, source separation algorithms, and models of human perception provides new avenues for the design of urban soundscapes—focusing not only on how loud the city is but also on how good it sounds.

Another more recent line of research concerns infrastructure-free systems based on deep neural networks optimized to operate on ultra-low-power microphones. This line of development aims to overcome the critical issues related to the installation of wired networks and the need for a continuous power supply, proposing autonomous solutions capable of performing SED locally on the device with a power consumption of approximately 100 mW. This opens up significant possibilities for distributed monitoring, particularly in areas where the urban infrastructure does not allow for permanent installations [121].

In addition to the applications already discussed relating to noise monitoring and soundscape management in smart cities, Sound Event Detection (SED) is assuming a central role in more sensitive areas, including urban security, home protection, and child protection. The automatic analysis of acoustic events allows the identification of signals associated with critical situations, such as episodes of violence, calls for help, glass breaking, children’s crying, falls, or home intrusions. These applications, often developed in environments characterized by high constraints in terms of privacy, latency, and robustness, represent one of the emerging directions in international SED research.

In the context of urban security, a consolidated line of research concerns the identification of anomalous or potentially dangerous events through distributed acoustic systems. Already in the early 2000s, researchers highlighted how acoustics was a particularly suitable means for the early detection of critical incidents, especially in areas not always covered by video surveillance. One of the first significant contributions in this field was that of Clavel et al. [122], who proposed an acoustic surveillance system capable of detecting screams and signs of aggression in public spaces using GMM and SVM models. The study, considered one of the pioneering works in audio surveillance, demonstrated that vocal signals related to stressful situations can be recognized with good reliability even in the presence of strong background noise.

The evolution of machine learning has led to the development of even more advanced systems, such as the one proposed by Ntalampiras et al. [123], which introduces novelty detection techniques to identify anomalous acoustic events without the need to train the system on all possible classes of interest. The strategy has proved to work particularly well for urban safety, where critical events rarely occur, vary significantly, and cannot always be clearly characterized. By relying on robust spectral features, the authors were able to differentiate normal, everyday activity from potential risk situations efficiently. Other studies have explored the potential of sound event detection to recognize specific events such as breaking glass, explosions, or collisions. A notable paper in this category is the work by Lojka et al. [124], where the authors analyze the acoustic signals of shattered glass as an indicator of potential intrusions or vandalism. The authors demonstrated that, using deep neural networks with real-world signals, it is possible to recognize high-frequency impulse signals with high accuracy even when urban reverberations are present. This is particularly important for smart cities. At the same time, SED technology is also increasingly used in home security, especially in smart home automation systems. As IoT networks become increasingly popular and smart home assistants become ubiquitous, many smart devices now come equipped with microphones and processing power. A significant application area for SED technology in smart home automation is fall detection within the home environment, especially for elderly or vulnerable users. For example, Bourenane et al. [125] propose an “audio–visual integrated fall detection system” and demonstrate the effectiveness of using acoustic signals to reduce false negatives and improve generalization across various home settings. 
Another emerging area for SED technology is the detection of sounds associated with household hazards such as smoke alarms, carbon monoxide detectors, broken items, and malfunctioning home appliances. For example, Gemmeke et al. [87] made a significant contribution by introducing the AudioSet dataset, which is now widely used in SED, including within smart home settings, and provides a diverse range of sound classes for training models capable of detecting many of the sounds associated with household hazards.

Meanwhile, recognizing sound events associated with child safety has become a burgeoning field of study, driven by pressing social needs and growing industrial interest. One notable trend is the development of infant cry recognition systems, applicable to both hospital and home environments. Among the widely cited works, Ntalampiras et al. [126] proposed a detection system that used audio features and Gaussian models, with its viability tested in a noisy environment. This work has had a profound impact on the creation of commercial smart baby monitors. Later studies have further developed this method by identifying different types of infant crying, showing that Sound Event Detection can distinguish between attention-seeking, pain, and distress cries. For example, Liang et al. [127] introduced a multimodal deep learning framework for infant cry classification, reporting promising results for medical and assistive purposes. More recently, McGinnis et al. [128] explored the detection of vocal cues of stress or danger in children using advanced feature extraction with SVM and CNN classifiers. When combined with smart homes or educational settings, these systems could help with the early detection of potentially dangerous situations such as accidents, falls, or bullying by peers.

Another field of application concerns the detection of domestic abuse through acoustic signals, a sensitive area of study but of great social relevance. For example, the work of Stowell et al. [129], although focused on the detection of environmental acoustic events, paved the way for SED methods robust to chaotic environments, which in more recent years have also been adapted to distinguish disturbing voice signals or emergency calls. This is an area that is changing rapidly, with a strong focus on ethics and privacy, but it is one where the application of sound event detection can provide practical support to the protection of vulnerable groups. The other area that is likely to see the application of SED is the integration with wearable technology, such as that aimed at children, seniors, and people with disabilities. Research such as that carried out by Martínez-Villaseñor et al. [130], which looked at the application of acoustic fall and impact recognition with compact devices with microphones and accelerometers, shows the potential for the application of SED with multimodal data to improve the reliability of the detection.

A broad look at the current body of literature shows that the use cases discussed have taken the application of SED from the research lab to a state where it can be applied to a variety of needs, such as, but not limited to, noise reduction, decision support, security management, and soundscape design. The application of SED can be seen as a key part of the management of smart cities, with its application integrated into the policies that seek to manage these cities.

6. Performance Comparison Using Standardized Benchmark Metrics

When evaluating models for Sound Event Detection in urban scenarios, researchers rely on commonly accepted metrics that allow performance to be assessed objectively. Three such metrics are the F1 score, which balances precision and recall; the Error Rate, which takes into consideration insertions, deletions, and substitutions; and the more recently introduced Polyphonic Sound Detection Score, specifically designed for challenging scenarios involving multiple events. The metrics are not only a way to evaluate a model quantitatively but also help in highlighting different aspects of the complexity of the problem. The F1 score is by far the most used measure in competitions and research. It is the harmonic mean of precision and recall and therefore balances these two aspects [131]. In SED, a high F1 score means that the model identifies a large proportion of the events (high recall) while maintaining a low number of false positives (high precision). However, F1 is a rigid measure that does not consider the temporal characteristics of the signal or the simultaneity of multiple events. The F1 score also depends on where the threshold is set during binary classification. The Error Rate (ER), used in DCASE challenges to evaluate SED at the clip or frame level, takes into account insertion, deletion, and substitution errors [132]. A model achieves a low ER only if it keeps all three error types low. The ER is also highly sensitive to how accurately event duration is estimated. Models tuned for high recall often have a poor ER because they produce too many insertion errors.
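To make the two definitions concrete, the following minimal sketch computes a micro-averaged F1 score and an Error Rate over fixed-length segments, following the usual segment-based convention in which a missed event and a false alarm in the same segment are paired as one substitution. The function name and the toy annotations are hypothetical; established evaluation toolkits should be preferred for reported results.

```python
def segment_metrics(reference, estimate):
    """Segment-based F1 and Error Rate for polyphonic SED.

    reference, estimate: lists of sets, one set of active class labels
    per fixed-length segment (both lists must have the same length).
    """
    tp = fp = fn = 0
    subs = dels = ins = n_ref = 0
    for ref, est in zip(reference, estimate):
        seg_fp = len(est - ref)   # classes predicted but not active
        seg_fn = len(ref - est)   # classes active but not predicted
        tp += len(ref & est)
        fp += seg_fp
        fn += seg_fn
        # A miss and a false alarm in the same segment count as one substitution.
        s = min(seg_fp, seg_fn)
        subs += s
        dels += seg_fn - s        # remaining misses are deletions
        ins += seg_fp - s         # remaining false alarms are insertions
        n_ref += len(ref)         # number of active reference events
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "er": er}


# Toy annotations for four one-second segments (illustrative only).
reference = [{"siren"}, {"siren", "engine"}, {"engine"}, set()]
estimate = [{"siren"}, {"siren"}, {"horn"}, {"horn"}]
result = segment_metrics(reference, estimate)
print(result)
```

In this toy case the second segment contributes a deletion, the third a substitution, and the fourth an insertion, which shows how the Error Rate can exceed what the F1 score alone suggests and can even rise above 1 when insertions dominate.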

The Polyphonic Sound Detection Score (PSDS) is currently the most modern and all-encompassing measure in SED [59]. It is used in realistic scenarios and takes into account temporal tolerance, class coverage, system reliability, and sensitivity to false alarms. The great strength of the PSDS is its ability to aggregate model performance across different operating point thresholds, simulating different application conditions, from highly sensitive systems to those focused on reducing false alarms. Furthermore, the PSDS allows fair comparisons even between datasets with different polyphony levels, making it the most suitable metric for urban assessment.
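The idea of aggregating performance over several operating points can be illustrated with a deliberately simplified sketch: sweep the decision threshold, record one (false positive rate, true positive rate) pair per threshold, and integrate the resulting curve. This is not the PSDS itself, which relies on event-level intersection criteria and additional penalty terms; the function names, scores, and labels below are hypothetical.

```python
def operating_points(scores, truth, thresholds):
    """Return sorted (false_positive_rate, true_positive_rate) pairs,
    one per decision threshold."""
    pos = sum(truth)
    neg = len(truth) - pos
    points = []
    for th in thresholds:
        decisions = [s >= th for s in scores]
        tp = sum(d and t for d, t in zip(decisions, truth))
        fp = sum(d and not t for d, t in zip(decisions, truth))
        points.append((fp / neg, tp / pos))
    return sorted(points)


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear (x, y) curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy frame-level scores and labels (illustrative only).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
truth = [True, True, False, True, False, False]
points = operating_points(scores, truth, thresholds=[0.0, 0.2, 0.5, 0.7, 1.0])
area = area_under_curve(points)
print(points, area)
```

A single number obtained this way summarizes behavior from very sensitive settings (low threshold) to conservative ones (high threshold), which is the property the paragraph above attributes to the PSDS.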

In Section 3, we classified SED models into five main categories: feature-engineered models, supervised deep learning models, self-supervised and foundation models, multimodal models, and specialized models for related tasks. In this section, we compare these categories using standardized metrics, with the aim of outlining the state of the art and typical performance reported in the literature on the main urban benchmarks. Table 2 summarizes the average performance obtained across the urban polyphonic datasets, providing a benchmark for comparing the evaluated models. The quantitative values reported in Table 1 were obtained using standardized assessment methods. These include evaluation metrics such as accuracy, precision, recall, and F1-score, combined with experimental protocols like data preprocessing, segmentation, and train-test splits, as well as computational procedures involving feature extraction and model-specific training and validation approaches.

Table 1 illustrates two aspects of the evaluation of sound event detection. PSDS-1 is heavily dependent on the accuracy of the timing. It focuses on the extent to which a system accurately identifies the beginning and end of a sound, and any inaccuracy in the timing results in a pretty severe penalty. Therefore, the PSDS-1 metric is best suited for applications that require a rapid and accurate localization of the moment. On the contrary, the PSDS-2 metric focuses on the accuracy of the classification and the avoidance of confusion between different classes. It is not so stringent on the accuracy of the moment a sound starts or ends. Instead, it focuses on the accuracy of the classification of the sound within a wide time frame. Therefore, the PSDS-2 metric is a good representation of the semantic accuracy of a model. By testing different models on different datasets, we can identify the areas where the SED technology has yet to improve and the ways in which the technology must improve in order to function in the real world.
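The difference between a timing-strict and a timing-tolerant evaluation can be sketched with a simple intersection test between one detection and one ground-truth event of the same class. The thresholds 0.7 and 0.1 are used here as illustrative stand-ins for a strict and a lenient scenario and should be treated as assumptions; the function names are hypothetical, and the full PSDS computation involves further steps (cross-trigger handling, aggregation over thresholds) that are omitted.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (onset, offset) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def is_true_positive(detection, ground_truth, dtc, gtc):
    """Accept a detection if a fraction >= dtc of it lies inside the ground
    truth AND a fraction >= gtc of the ground truth is covered by it."""
    inter = overlap(detection, ground_truth)
    detection_ok = inter / (detection[1] - detection[0]) >= dtc
    ground_truth_ok = inter / (ground_truth[1] - ground_truth[0]) >= gtc
    return detection_ok and ground_truth_ok


ground_truth = (2.0, 6.0)   # annotated event, in seconds
detection = (3.5, 7.5)      # correct class, but shifted 1.5 s late

# Assumed thresholds: 0.7 for a timing-strict scenario, 0.1 for a lenient one.
strict = is_true_positive(detection, ground_truth, dtc=0.7, gtc=0.7)
lenient = is_true_positive(detection, ground_truth, dtc=0.1, gtc=0.1)
print(strict, lenient)
```

The same late detection is rejected under the strict setting and accepted under the lenient one, which is why a system can rank differently under the two PSDS variants.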

6.1. Performance Metrics for Feature-Engineered Models

Early urban Sound Event Detection systems relied on handcrafted acoustic features, such as MFCCs, spectral flux, zero-crossing rate, and modulation spectra, to describe timbre, spectral variations, and dynamic patterns in urban soundscapes. Although such features offered compact and efficient representations, they were sensitive to noise and often performed poorly in more complex settings. Works such as Mesaros et al. [21] and Stowell et al. [129] showed that such models, often implemented via GMM, SVM, or HMM, achieved modest performance on early urban benchmarks. On datasets such as UrbanSound8K [108], models based on engineered features typically achieved moderate performance: the macro F1 value ranged between 60% and 75%, while the Error Rate often exceeded 0.7. PSDS values are not available, as this metric had not yet been introduced at the time of the first evaluations.

For example, in the work of Salamon & Bello [117], the use of MFCC + SVM achieved an accuracy of around 73% on UrbanSound8K, with a highly variable sensitivity between common (air conditioner) and rare (gunshot) classes. In urban polyphonic contexts, such as DCASE Task 4, the performance of these approaches is markedly worse: Mesaros et al. [133] report ER > 0.9 for HMMs in mixed urban scenes. The comparison shows that these models achieve only moderate overall performance in simple, monophonic scenarios. When moving to more realistic and complex contexts, characterized by the simultaneous presence of multiple sound sources, as in the case of the SONYC dataset, their effectiveness drops dramatically. The difficulties in generalization are particularly evident: as also indicated by the comparative analyses of Bilen et al. [19], their PSDS would be extremely low if evaluated with modern metrics. For these reasons, these models are clearly inferior to all the more advanced categories considered in the study.

6.2. Assessing Performance in Supervised Deep Learning Models

The next category is represented by supervised models based on deep neural networks, especially CNN, CRNN, BiGRU and attentional variants. These models marked a paradigm shift in SED for urban environments, becoming the standard in competitions such as DCASE from 2017 onwards. The work of Cakir et al. [62] introduced a hybrid CRNN (CNN + GRU) that set the baseline for DCASE Task 4. On the DESED Real + Synthetic dataset [134], a traditional CRNN shows typical performance with an event-based F1 around 40–45%, an Error Rate close to 0.65 and PSDS values around 0.32 for PSDS1 and 0.43 for PSDS2. In contrast, on UrbanSound8K, where the task is a simpler monophonic classification, the accuracy comfortably exceeds 85–90%. This comparison highlights how these models are effective in less complex acoustic scenarios, while they encounter greater difficulties when they have to detect overlapping or immersed sound events in more realistic urban contexts.

Table 3 presents representative performance levels achieved on standard urban benchmarks, offering a reference point for contextualizing and comparing model results.

CRNN models have been the de facto standard in Sound Event Detection for years, thanks to their ability to combine local extraction of spectral patterns via CNNs with temporal dependence modeling via RNNs. Seminal studies such as those by Mesaros et al. [21] and Adavanne et al. [135] have shown that a well-trained CRNN can achieve competitive F1-scores on polyphonic datasets and significantly improve the Error Rate (ER) compared to purely convolutional or recurrent models. In the following years, several studies have introduced attention mechanisms to further optimize the behavior of CRNNs. The main goal is to emphasize the most relevant temporal frames or frequency bands, improving the detection of short, overlapping or low signal-to-noise ratio events. The integration of attention has been explored, among others, by Kong et al. [136] with a temporal attention system for SED, by Jin et al. [137] with a temporal-frequency attention model and by Wang et al. [138] with adaptive self-attention modules applied to polyphonic tasks.

In these studies, CRNN models with attention generally show improvements in the F1-score compared to traditional CRNNs, accompanied by a reduction in the Error Rate, especially in contexts where acoustic events partially overlap. Furthermore, in works adopting the PSDS metrics introduced by Turpault et al. [139] in the context of the DESED and DCASE benchmarks, an increase in the PSDS is observed, particularly in the PSDS1 variant, which penalizes temporal errors more heavily. Indeed, attention contributes to more accurate timing and a better separation of concurrent events. Overall, peer-reviewed literature consistently shows that adding attention modules to CRNNs constitutes a valid and recognized methodological improvement, especially in complex and polyphonic SED scenarios.
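The effect of temporal attention can be illustrated with a minimal attention-pooling sketch, in which frame-level probabilities are combined into a clip-level score using softmax weights. This is a generic illustration rather than the specific mechanisms of the cited works; the function names, probabilities, and attention logits are hypothetical, and in a real model the logits would be produced by a learned layer.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_probs, attention_logits):
    """Clip-level score as an attention-weighted average of frame probabilities."""
    weights = softmax(attention_logits)
    return sum(w * p for w, p in zip(weights, frame_probs))


# Hypothetical frame-level probabilities for one class: a short event
# occupies frames 2 and 3, the other frames contain background only.
frame_probs = [0.05, 0.10, 0.90, 0.85, 0.10]

# Uniform attention reduces to plain mean pooling.
uniform = attention_pool(frame_probs, [0.0] * 5)

# Attention concentrated on the event frames (logits are assumed values).
focused = attention_pool(frame_probs, [-2.0, -2.0, 2.0, 2.0, -2.0])
print(uniform, focused)
```

With uniform weights the short event is diluted by the surrounding background frames, whereas focused weights preserve it, which is consistent with the reported benefit for short or low signal-to-noise events.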

6.3. Evaluating Self-Supervised and Foundation Models

In recent years, the development of self-supervised models and audio foundation models, trained on extensive unannotated data, has had a significant impact on the field of urban sound event detection. The extant literature demonstrates how these approaches have led to significant improvements in key evaluation metrics. This improvement is a result of the development of stronger and more flexible representations that are able to adequately represent the acoustic intricacies of city sounds. As the models improve, they surpass the conventional methods, opening doors for new possibilities in large-scale applications. Kong et al. [73] proposed PANNs, based on CNN14 and trained on AudioSet, and have shown significant performance improvements in the task of urban sound event detection. Their F1-scores on the DESED dataset are around 55–60%, and the corresponding values for PSDS1 and PSDS2 are 0.55–0.60 and 0.65–0.70, respectively. They outperform the CRNN baseline by at least 15%. Pre-trained Audio Neural Networks (PANNs) are CNNs pre-trained on large generic datasets and have shown robust performance in a variety of sound recognition tasks. They have a significant advantage in the task of sound event detection, reducing the need for large amounts of labeled data and increasing the F1-score and PSDS performance over the baseline.

Table 4 presents the estimated performance gaps in comparison to the CRNN baseline.

Thanks to the Transformer architecture’s inherent capacity to model long-range dependencies in the time and frequency domains, Transformer-based models extend beyond the capabilities of traditional CNNs, particularly in the detection of short or overlapping events. On standard benchmarks such as DESED and DCASE Task 4, they have recorded F1 scores above 62%, with PSDS1 up to 0.67 and PSDS2 above 0.75, outperforming both CRNNs and PANNs. The advantage is even more pronounced in real-world urban environments such as SONYC-UST [112], where Transformer-based models produce more accurate and stable event detections than CNN-based models and score higher on PSDS-based metrics such as PSDS2.

Chen et al. [140] proposed HTS-AT, a Transformer-based model pre-trained on large-scale datasets such as AudioSet. On benchmarks such as DESED [139] and DCASE 2022 [139], it is reported to outperform other supervised models, with Event-F1 scores between 65% and 70%, PSDS1 up to 0.78, and PSDS2 up to 0.80. Large-scale pre-training is credited with these gains, which carry over to real-world urban environments characterized by polyphony, varying levels of noise, and complex acoustic scenarios.

6.4. Multimodal Models for Audio–Text and Audio–Video Integration

In the study of multimodal models, Elizalde et al. [141] proposed CLAP, an audio-language model that follows the CLIP approach but pairs sound representations with textual prompts instead of images. The performance of such models, when fine-tuned on urban data, is remarkable, with event-based F1 around 68% to 72%, PSDS1 around 0.75, and PSDS2 around 0.82 to 0.85. Their ability to exploit additional semantic information is perhaps the most notable advantage, as it helps generalization even for infrequent classes that were never seen during training. This is a clear example of how the combination of audio and language models can enhance the robustness of sound event detection in the chaotic and constantly changing environments of cityscapes.
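The way a shared audio-text embedding space supports classes never seen during training can be sketched as a nearest-prompt search by cosine similarity. The sketch does not call any real CLAP implementation: the three-dimensional vectors below are made-up placeholders standing in for the outputs of pretrained audio and text encoders, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_label(audio_embedding, text_embeddings):
    """Return the text prompt whose embedding is closest to the audio embedding."""
    return max(
        text_embeddings,
        key=lambda prompt: cosine(audio_embedding, text_embeddings[prompt]),
    )


# Placeholder embeddings (assumed values, NOT produced by a real model).
text_embeddings = {
    "a siren wailing": [0.9, 0.1, 0.0],
    "a jackhammer drilling": [0.1, 0.9, 0.1],
    "a dog barking": [0.0, 0.2, 0.9],
}
audio_embedding = [0.8, 0.3, 0.1]

predicted = zero_shot_label(audio_embedding, text_embeddings)
print(predicted)
```

Adding a new class in this setting only requires writing a new text prompt, which is what makes the approach attractive for rare urban events with few labeled recordings.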

A summary of the models’ performance, based on a synthetic example inspired by SONYC and DCASE, is given in Table 5, representing the typical performance of these models on common sound classes found in urban environments.

The combination of both sounds and sight in audio–visual models represents an important step forward for Sound Event Detection because it offers a way to improve the detection of sound events in complex, real-world environments. As indicated by Afouras et al. [142] and Gao et al. [143], combining audio and video enhances the usage of event-related visual cues to improve the precision of event classification. In urban environments or polyphonic audio tasks, audio–visual models can reduce error rates by as much as 30% or increase F1 scores by as much as 8–15%, particularly for visually salient sounds like jackhammers, sirens, or road traffic. This enhances event detection beyond just audio cues that may be obscured by other sounds or noises in complex urban environments. Overall, audio–visual models represent an important approach for advanced Sound Event Detection.

In Nakamura et al. [144], the focus of the models is on communication modalities used in conjunction with speech to emphasize the importance of speech in conveying human intention or meaning. Despite important advances in spoken communication technologies, current spoken communication systems demonstrate significant limitations. For example, although automatic speech recognition systems have been significantly improved, they perform poorly in noisy or difficult listening environments. In addition, studies of human perception indicate that humans combine auditory cues with visual cues from facial movements to indicate meaning or intention, thus illustrating a strong multimodal integration process for speech. This has encouraged efforts to combine facial visual cues in speech-related technologies. In this regard, Nakamura et al. indicate important advances in audio–visual speech recognition systems, facial articulation from audio signals, and speech translation systems that combine audio and visual cues.

Table 6 compares the models in terms of their requirements, emphasizing computational power, data needs, and applicability to urban environments, and provides recommendations for selecting models for urban sound analysis tasks.

Zhang et al. [145] introduce a new multimodal framework that integrates audio, visual, and language information to estimate depression symptoms. The framework consists of three parallel streams for audio, video, and text, each of which generates high-level features unique to their respective modality. These features are then fused using a custom fusion module to produce a unified representation for prediction. To test the model, the authors developed an experimental framework consisting of two emotion-induction tasks: a reading task and an interview task. This design allowed the authors to collect a rich and varied dataset with abundant sensor data on vocal and facial expressions. The study examines the impact of different tasks on emotional expression and diagnostic performance. When the two datasets are combined, the model performs at its best, achieving an F1 measure of 0.78, precision of 0.76, and recall of 0.81.

Arjunan [146] offers insights into the use of multimodal learning that combines visual, auditory, and linguistic inputs to provide a more comprehensive understanding of the environment for an AI system. The combination of multiple inputs has already been shown to improve the performance of multimodal AI systems in various tasks such as emotion recognition, image description, autonomous driving, and medical diagnosis. The report identifies various practical applications of multimodal AI, including personalized customer service, improved safety for self-driving cars, and better healthcare solutions. It also discusses various challenges associated with data integration, privacy, bias, and explainability. However, Arjunan emphasizes the tremendous transformative power of multimodal AI and believes that future developments will significantly enhance the role and capabilities of AI in various domains.

6.5. Specialized Models for Related Tasks

Recent developments in Sound Event Detection (SED) show models going beyond the identification and classification of sounds [147]. They increasingly incorporate additional tasks that enhance accuracy and robustness. Some models add source separation, in which individual sound sources are isolated in noisy, overlapping environments to reduce interference and improve detection rates. Others add spatial localization or Direction of Arrival (DoA) estimation, which attaches location information to detected sounds and gives a more comprehensive view of the urban sound environment [148]. Dereverberation algorithms help mitigate the effects of reverberation in enclosed or densely populated urban areas, resulting in improved event detection. There are also models that integrate detection and localization in a single process, providing simultaneous class and location information to improve overall accuracy and generalization [149]. These multi-task models therefore represent an advanced frontier for SED, particularly in complex real-world urban environments.

In a significant study, Turpault et al. [134] demonstrated that incorporating a source separation step can substantially improve the accuracy of SED in complex urban environments. By separating sound sources before classifying them, the model reduces the effects of overlapping sound sources. They report a 10% absolute improvement in F1-score relative to a standard CRNN, an improvement in PSDS1 from 0.32 to 0.50, and a marked reduction in error rate for simultaneous events. This study demonstrates the effectiveness of combining source separation with sound detection to improve the accuracy and robustness of SED models in real-world urban environments.

Few-shot sound event detection (FS-SED) is a method for sound event detection that relies on only a few labeled examples per class, making it possible to learn new sounds without requiring large amounts of labeled training data. One of the most prominent approaches is Few-Shot Sound Event Detection by Wang et al. [150], which applies metric-based few-shot learning to audio signals, using prototypical networks and an automatically constructed set of “negative” examples.
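The metric-based idea behind prototypical networks can be sketched in a few lines. This is a minimal illustration, not the implementation of Wang et al.: the embeddings are random placeholders standing in for the output of a trained audio encoder, and the "negative" class is simulated rather than constructed automatically from recordings.

```python
import numpy as np

def prototypes(support, labels):
    """Average the support embeddings of each class into one prototype."""
    classes = np.unique(labels)
    protos = np.stack([support[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query, classes, protos):
    """Assign each query embedding to the class of its nearest prototype."""
    dists = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
# Hypothetical 8-dimensional embeddings: 5 shots of the target event (label 1)
# and 20 examples of the "negative" background class (label 0).
target = rng.normal(loc=2.0, scale=0.3, size=(5, 8))
negative = rng.normal(loc=0.0, scale=0.3, size=(20, 8))
support = np.vstack([target, negative])
labels = np.array([1] * 5 + [0] * 20)

classes, protos = prototypes(support, labels)
query = np.vstack([rng.normal(2.0, 0.3, (3, 8)), rng.normal(0.0, 0.3, (3, 8))])
pred = classify(query, classes, protos)
print(pred)
```

Because only class means are stored, a new sound class can be added by averaging a handful of labeled embeddings, without retraining the encoder.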

Within the urban landscape of a city, Sound Event Localization and Detection (SELD) represents an important advancement over the traditional single-channel approach through the utilization of spatial information for both improved detection and localization [151]. The research by Adavanne et al. [68] and Politis et al. [152] indicates robust support for the integration of event detection and direction of arrival (DoA) estimation. The multi-dimensional approach has been shown to outperform the traditional monophonic or single-channel approach. The utilization of information from multiple channels has been shown to provide significant improvements, where the F1-score is improved by up to 15%, the error rate is reduced by 20–30%, and the accuracy of the DoA is improved by 15–25%.
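To make the DoA component concrete, the sketch below estimates the time delay between two microphones with GCC-PHAT and converts it to an angle. GCC-PHAT is a classical signal-processing baseline swapped in here for illustration; it is not the learned approach of the cited SELD studies. The sampling rate, microphone spacing, and far-field geometry are all assumptions.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay (seconds) of `sig` relative to `ref` with GCC-PHAT."""
    n = sig.size + ref.size  # zero-pad so the correlation is linear, not circular
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # zero lag at center
    return (np.argmax(cc) - max_shift) / fs

def delay_to_angle(delay, mic_distance, speed_of_sound=343.0):
    """Far-field angle (degrees) from broadside for a two-microphone pair."""
    s = np.clip(speed_of_sound * delay / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

fs = 16000                                  # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)             # signal at microphone 1
sig = np.concatenate((np.zeros(5), ref))    # same signal, 5 samples later
delay = gcc_phat_delay(sig, ref, fs)
angle = delay_to_angle(delay, mic_distance=0.2)  # assumed 20 cm spacing
```

A single microphone pair resolves only one angle and leaves a front-back ambiguity, which is one reason SELD systems rely on multichannel arrays.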

Table 7 provides a comparative overview of performance across different SED model classes, summarizing benchmark trends and highlighting relative strengths, weaknesses, and typical outcomes observed in urban sound event detection.

The comparison in Table 7 covers the main Sound Event Detection (SED) model families and reflects general performance trends on standard urban benchmarks. Supervised models, such as CRNNs, achieve strong temporal accuracy and reliable detection of common sound events, whereas self-supervised and foundation models generalize better across diverse urban acoustic conditions and are more robust to rare and unseen sound events. The comparison also shows the trade-offs between models in terms of computation, data requirements, and usability, indicating that no single family dominates all others.

Comparing all the families reveals a general trend in what the models can and cannot do. Engineered feature-based models, once the norm, have fallen behind under complexity-sensitive metrics such as PSDS, which expose their difficulty in handling diverse acoustic environments. Supervised models remain a safe choice in many situations, especially when resources and data availability are limited, owing to their reliability and stability in controlled conditions. The advent of self-supervised learning and transformer-based techniques represents a significant qualitative leap forward. These techniques demonstrate markedly stronger generalization and high performance, even in complex real-world environments such as urban areas. Multimodal techniques go one step further by combining acoustic data with other modalities such as vision and text, often delivering the highest accuracy and robustness as a result of the complementarity of the modalities involved. Their drawback is the requirement for other data sources, such as video, which may not always be readily available. Models designed specifically for complex spatial configurations, urban hotspots, and rare event detection often significantly outperform other models in their specific domain of application. Their advantage stems from the use of structured problem knowledge to optimize performance where the problems are most challenging. Overall, the field of sound event detection in urban environments is experiencing significant growth, with new techniques for representation learning, unsupervised learning, and multimodality setting new performance benchmarks.

Table 8 summarizes the 66 studies included in this systematic review of sound event detection in smart city settings. For each paper, it identifies which datasets were used, the particular sound event detection tasks, how features were represented, the types of models employed, where the work was applied, and the highlights of the results. This side-by-side snapshot provides an overview that makes it easier to spot method trends, the most common datasets in use, the typical application areas, and the convergence or divergence in findings between studies. In other words, the table acts like a single, useful reference underpinning the qualitative and quantitative discussion that follows.

Taken together, the selected studies show a rapidly shifting landscape for sound event detection in smart city contexts. There is a clear move away from traditional machine learning toward deep learning, with more papers adopting hybrid and transformer-based architectures. There is also an increased emphasis on evaluating models across multiple datasets and on real-time, edge-oriented solutions that match realistic deployment requirements. Challenges remain: standardization of datasets, generalization of models across diverse urban environments, and reproducibility. These insights help to map out the research gaps and point to promising directions, as explored in the next section.

In the absence of specific and widely accepted risk-of-bias assessment tools for machine learning-based sound event detection studies, a per-study risk-of-bias evaluation was not conducted for this review. Instead, potential sources of bias were assessed qualitatively at the methodological and dataset level, considering the source of the datasets, the reliability of the annotation labels, the number of datasets used, the feature extraction methods applied, the evaluation metrics, and the accessibility of code and data. Across the studies examined, the main liabilities are the use of non-uniform evaluation metrics and the reliance on code and data that are not publicly available. Overall, although no quantitative risk-of-bias assessment is provided for individual studies, the review highlights a number of common methodological liabilities across them.

7. Emerging Challenges and Future Pathways for Sound Event Detection

Sound Event Detection (SED) is a field that has developed rapidly at the intersection of artificial intelligence and environmental acoustics. Although tremendous progress has been achieved during the last few years, driven by powerful deep learning models and the availability of annotated data, many open issues remain before SED can be seamlessly integrated with modern smart city infrastructures and consumer devices. These issues range from technical and methodological questions to ethical and operational concerns. At the same time, research trends indicate a significant shift, in which the SED ecosystem will evolve toward multimodal, brain-inspired, and self-supervised models that are able to adapt to changing operating conditions in real time.

Sound events are defined as unique and identifiable sounds that provide useful information about the environment. In the reviewed articles, sound events include traffic sounds, alarms, human sounds, and other environmental sounds. Detecting them requires dealing with overlapping sounds, background noise, and changes in the intensity and duration of sounds. From an acoustic standpoint, detection relies on feature extraction, segmentation, and classification techniques, and an understanding of these concepts is crucial for evaluating the techniques, the results obtained, and their applicability. The benefits of sound event detection are evident for public safety, the environment, and the smart city concept.

One of the challenges facing SED research is the availability of datasets that mimic real-world sound environments. Although there are well-established datasets such as UrbanSound8K, TUT Urban Acoustic Scenes, SONYC-UST, and the DCASE datasets, they capture only part of the complexity of real-world sound environments. Most of these datasets are clean, with the sounds of interest easily distinguishable from the background. In smart cities, the sound environment is far more complex, with the sounds of interest often faint, overlapping, and buried under high levels of background noise.

A key future direction, therefore, concerns the creation of “in the wild” datasets, i.e., collected directly in operational contexts lacking experimental control and characterized by high temporal, atmospheric, and social variability. Added to this is the need to generate continuously updated datasets, representative of the evolution of cities, transportation technologies, new human activities, and extreme weather events, which significantly alter the acoustic signature of places. Future research will likely need to address hybrid semi-automatic annotation mechanisms, in which pre-trained algorithms produce pre-annotations that are then refined by human annotators, thus drastically reducing labeling costs.
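The semi-automatic annotation loop described above can be sketched as follows. The example assumes that a pre-trained model already provides frame-wise event probabilities; the thresholds, the hop size, and the field names are hypothetical choices for illustration only.

```python
import numpy as np

def pre_annotate(probs, hop_s, on_thr=0.5, review_band=(0.5, 0.8)):
    """Turn frame-wise probabilities into event segments with a review flag."""
    active = probs >= on_thr
    # Pad with False so every active run has a rising and a falling edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    segments = []
    for start, end in zip(edges[::2], edges[1::2]):
        conf = float(probs[start:end].mean())
        segments.append({
            "onset_s": float(start * hop_s),
            "offset_s": float(end * hop_s),
            "confidence": conf,
            # Only uncertain segments are routed to a human annotator.
            "needs_review": review_band[0] <= conf < review_band[1],
        })
    return segments

# Hypothetical model output for nine 20 ms frames.
probs = np.array([0.1, 0.9, 0.95, 0.9, 0.2, 0.1, 0.6, 0.55, 0.1])
segments = pre_annotate(probs, hop_s=0.02)
for seg in segments:
    print(seg)
```

Routing only the low-confidence segments to annotators is what reduces labeling cost; the width of the review band sets the trade-off between cost and label quality.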

Another critical issue concerns the ability of SED models to generalize to contexts beyond their training settings. Many systems demonstrate very high performance on known datasets but suffer significant degradation when applied to new cities, new recording devices, inexpensive microphones, or very different weather conditions. Generalization remains a key challenge in academia and industry. A particularly interesting future perspective is represented by self-supervised learning and contrastive learning, techniques that allow models to extract robust acoustic representations without the need for extensive annotations. These approaches enable the development of “universal” systems, capable of learning from the enormous amounts of data passively collected in cities without relying entirely on manual labels. At the same time, the integration of domain adaptation strategies—both based on acoustic statistics and deep learning approaches—may allow models to transfer the expertise acquired in a specific domain to entirely new scenarios.
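The contrastive objective mentioned above can be illustrated with an InfoNCE-style loss, in which each clip embedding should be closer to the embedding of its own augmented view than to those of other clips. This is a generic sketch: the embeddings are random placeholders, the "augmentation" is simulated with small noise, and the temperature value is an assumption.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of `anchors` should match row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
clips = rng.standard_normal((16, 32))                   # placeholder clip embeddings
views = clips + 0.05 * rng.standard_normal((16, 32))    # simulated augmented views

loss_matched = info_nce(clips, views)
loss_shuffled = info_nce(clips, np.roll(views, 1, axis=0))  # wrong pairing
```

The loss is low when each clip is paired with its own view and high when the pairing is broken, which is the signal an encoder would be trained on without any manual labels.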

Many acoustic events of interest to urban or home security are typical examples of “rare events”: explosions, breaking glass, screams for help, gunshots, or emergency signals. These events are extremely rare and are often masked by background noise. Model training therefore requires specific strategies, such as targeted data augmentation techniques, few-shot learning methods, metaheuristic learning, and anomaly detection systems capable of detecting significant deviations from standard acoustic behavior. In the near future, a growing role is expected for generative models such as diffusion-based generators, capable of synthesizing realistic examples of rare events, or for long-term memory neural networks that can accumulate knowledge on weak but recurrent patterns. Signal detection techniques that model a signal’s energy patterns and rhythms, combined with deep learning, also have considerable potential in addressing this issue.

SED systems are increasingly being used in sensor networks spread across cities and in home appliances such as smart cameras, voice assistants, and IoT sensors. This creates a need to optimize models so that they can run at the edge, without depending on cloud servers all the time. Scalability, in terms of the number of nodes and the ability to maintain the system for a long time with low energy consumption, also needs to be addressed. Future research should focus on developing lightweight architectures, such as quantized models, highly compressed networks, or optimized convolutional neural networks, that offer a good trade-off between accuracy and computational complexity. At the same time, specialized hardware such as neuromorphic accelerators or neural network processors embedded in microcontrollers will enable machine listening to be performed continuously with extremely low energy consumption and low latency.
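As a rough illustration of the quantized models mentioned above, the following sketch applies symmetric per-tensor int8 quantization to a weight matrix. This is one common post-training scheme chosen for illustration; the reviewed studies may use other schemes, and the random matrix stands in for real network weights.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    scale = float(np.max(np.abs(w))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64)).astype(np.float32)  # placeholder weights
q, scale = quantize_int8(weights)
max_err = float(np.max(np.abs(weights - dequantize(q, scale))))
print(weights.nbytes, q.nbytes)
```

Storage drops by a factor of four relative to float32, and the per-weight error is bounded by half a quantization step, which is the accuracy versus footprint trade-off referred to above.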

Most researchers feel that the future of SED lies in multimodal fusion. Pure audio-based approaches are often not sufficient to detect an event definitively or locate it. Integration with video, environmental sensors, accelerometers, IoT sensor networks, or predictive models developed from past data can greatly improve accuracy. Future research directions are moving toward deep multimodal systems, in which cross-attention architectures and multi-stream neural networks integrate acoustic, visual, spatial, and semantic information. This will enable the creation of true urban perception models capable of understanding the operational context in a manner similar to what autonomous systems in the automotive sector currently achieve. Audio–visual event localization, three-dimensional reconstruction of the acoustic scene, and temporal analysis of event evolution represent areas of considerable interest.
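A single cross-attention step between an audio stream and a video stream can be sketched as follows. The example is a minimal single-head version with random, untrained projection matrices; the feature dimensions and frame counts are assumptions, and real systems would learn the projections and stack several such layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio, video, w_q, w_k, w_v):
    """Audio frames (queries) attend to video frames (keys and values)."""
    q = audio @ w_q                                   # (Ta, d)
    k = video @ w_k                                   # (Tv, d)
    v = video @ w_v                                   # (Tv, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (Ta, Tv), rows sum to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 64))    # 50 audio frames, 64-dim features
video = rng.standard_normal((10, 128))   # 10 video frames, 128-dim features
d = 32
w_q = rng.standard_normal((64, d)) / np.sqrt(64)
w_k = rng.standard_normal((128, d)) / np.sqrt(128)
w_v = rng.standard_normal((128, d)) / np.sqrt(128)

context, attn = cross_attention(audio, video, w_q, w_k, w_v)
fused = np.concatenate([audio, context], axis=1)  # audio enriched with visual context
```

Because attention aligns the two streams frame by frame, the modalities do not need to share a frame rate, which is convenient when audio and video sensors are sampled differently.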

The increasing use of SED systems in smart cities raises a number of questions concerning the transparency and comprehensibility of models. Because the decisions made by an acoustic detection system can have significant consequences, especially in public safety contexts, it is imperative that the algorithms employed can be explained. Future research directions include the adoption of audio-specific XAI (explainable AI) techniques, such as spectral saliency maps, interpretable models based on decision trees trained on learned representations, or methods for temporal localization of key attributes that contribute to a given classification. These tools will help technicians, administrators, and ordinary citizens understand why acoustic systems make certain decisions, building trust and acceptance in society.
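One simple way to obtain a spectral saliency map is occlusion: each time-frequency patch of the spectrogram is masked in turn and the resulting drop in the classifier score is recorded. The sketch below uses a toy scoring function as a stand-in for a trained SED model, and the synthetic spectrogram, patch size, and band indices are assumptions made for illustration.

```python
import numpy as np

def occlusion_saliency(spec, score_fn, patch=(4, 4)):
    """Score drop when each time-frequency patch is replaced by the global mean."""
    base = score_fn(spec)
    saliency = np.zeros_like(spec, dtype=float)
    fill = spec.mean()
    pf, pt = patch
    for f in range(0, spec.shape[0], pf):
        for t in range(0, spec.shape[1], pt):
            occluded = spec.copy()
            occluded[f:f + pf, t:t + pt] = fill
            saliency[f:f + pf, t:t + pt] = base - score_fn(occluded)
    return saliency

def toy_score(spec):
    """Toy 'classifier' that responds to energy in frequency bands 8..11."""
    return float(spec[8:12, :].mean())

rng = np.random.default_rng(0)
spec = rng.random((16, 20)) * 0.1     # low-level background (bands x frames)
spec[8:12, 4:12] += 1.0               # synthetic event in bands 8..11, frames 4..11

sal = occlusion_saliency(spec, toy_score)
peak_band, peak_frame = np.unravel_index(np.argmax(sal), sal.shape)
```

The highest saliency falls on the patch containing the synthetic event, which is the kind of evidence an operator could inspect to see which part of the signal triggered a detection.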

Finally, a major barrier to implementing large-scale sound event detection (SED) systems is ensuring privacy protection. Although SED systems are not dependent upon linguistic data, the general concept of a “listening device” in a city or home makes many people uncomfortable. The following are some ideas that should be pursued to advance privacy-preserving audio analysis:

- models that only use non-invertible features, meaning that the original data cannot be recovered;
- neuromorphic compression to only store features relevant to recognition;
- federated learning to train models without data movement;
- edge computing for inference that never sends audio data to a remote server.
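The federated learning item in the list above can be illustrated with its aggregation step. The sketch below uses weighted parameter averaging in the style of FedAvg, which is one common aggregation rule and not necessarily the one used in the reviewed studies; the three sensor nodes, their parameter vectors, and their dataset sizes are hypothetical.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Average client parameters, weighted by each client's number of examples."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)               # (num_clients, num_params)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Three hypothetical sensor nodes, each holding a locally trained parameter vector.
client_weights = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([5.0, 4.0])]
client_sizes = [100, 100, 200]                       # local training examples per node

global_weights = fed_avg(client_weights, client_sizes)
print(global_weights)
```

Only model parameters leave each node; the raw audio stays on the device, which is the property that makes this approach relevant to privacy.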

Using such approaches will be critical for SED systems to be socially acceptable.

A brief overview of the major SED techniques reveals their strengths and weaknesses. Classical approaches, based on hand-crafted features and traditional classifiers, have low computational cost but perform poorly on complex or overlapping sounds. Deep learning-based techniques, including CNNs and RNNs, achieve higher accuracy and can learn temporal and spectral patterns, but they require large amounts of labeled data and heavy computational resources. Hybrid approaches combine the strengths of feature-based techniques with those of deep learning.

The cities of the future require smart, continuous monitoring of urban areas to detect acoustic anomalies related to important events such as landslides, flash floods, infrastructure failures, fires, and road accidents. Current SED models have focused on human-related events, while natural events have remained under-explored. The next step is to extend SED to urban resilience and environmental change. The development of dedicated datasets and SED models to detect weak signals of geoacoustic phenomena or infrastructure degradation is an important area of future work. The integration of SED with predictive climate models and structural sensor networks is a promising direction for exploration.

In home and personal security, SED can be used in a very general fashion: monitoring infants and vulnerable individuals, detecting intrusions, and assisting senior care networks. However, the diversity of home environments and the sheer number of possible configurations make training a general-purpose model difficult. Future research should focus on developing models that adapt automatically to different environments, employing few-shot learning or self-training based on user behavior. A home SED system might learn to model the acoustics of its environment over time, continually improving detection accuracy and reducing false positives.
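The idea of a system that gradually learns the acoustics of its own environment can be sketched with a deliberately simple stand-in: a detector that tracks the running mean and variance of frame energy and flags frames that deviate strongly from them. This is not few-shot learning or self-training; it only illustrates per-home adaptation, and the smoothing factor, threshold multiplier, and warm-up length are assumptions.

```python
import numpy as np

class AdaptiveEnergyDetector:
    """Flags frames whose energy exceeds a slowly adapting background model."""

    def __init__(self, alpha=0.05, k=4.0, warmup=50):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean, self.var, self.seen = 0.0, 0.0, 0

    def _adapt(self, energy):
        # Exponentially weighted running mean and variance of the background.
        if self.seen == 0:
            self.mean = energy
        else:
            delta = energy - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta ** 2)
        self.seen += 1

    def step(self, energy):
        if self.seen < self.warmup:
            self._adapt(energy)      # learn the room before raising any alarm
            return False
        is_event = energy > self.mean + self.k * self.var ** 0.5
        if not is_event:
            self._adapt(energy)      # adapt only on background frames
        return bool(is_event)

rng = np.random.default_rng(0)
energies = rng.normal(1.0, 0.05, size=201)   # simulated background frame energies
energies[200] = 3.0                          # a loud, unusual sound
detector = AdaptiveEnergyDetector()
flags = [detector.step(float(e)) for e in energies]
```

Because the threshold follows the background of each home, the same code behaves differently in a quiet flat and in a noisy household, which is the kind of adaptation that helps reduce false positives.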

One promising direction is to follow the brain’s own approach to sound processing. The cochlea, hair cells, and auditory nerves are astonishingly efficient at extracting meaningful signals from noisy or complex inputs. Biologically inspired models, spiking neuromorphic networks, and hardware designed around neural dynamics could enable a new class of SED systems that consume very little power and have excellent temporal resolution. This research could close the performance gap between artificial systems and human hearing abilities.
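The basic unit of the spiking networks mentioned above can be illustrated with a leaky integrate-and-fire (LIF) neuron, a standard textbook model. The sketch below is not taken from any reviewed system; the time constant, threshold, and input levels are assumptions chosen so that a quiet input stays below threshold and a loud input produces spikes.

```python
import numpy as np

def lif_spikes(current, dt=1e-3, tau=20e-3, v_thresh=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron; returns a 0/1 spike train."""
    v = 0.0
    spikes = np.zeros(len(current), dtype=int)
    for i, inp in enumerate(current):
        v += dt / tau * (inp - v)    # leaky integration toward the input level
        if v >= v_thresh:
            spikes[i] = 1            # emit a spike and reset the membrane
            v = v_reset
    return spikes

quiet = np.full(200, 0.5)            # sub-threshold input: no spikes expected
loud = np.full(200, 2.0)             # supra-threshold input: regular spiking
spikes = lif_spikes(np.concatenate([quiet, loud]))
print(spikes[:200].sum(), spikes[200:].sum())
```

The neuron stays silent while the input is weak and fires only when it is strong, so computation and communication occur only when something happens. This event-driven behavior is the source of the low power consumption attributed to neuromorphic hardware.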

8. Conclusions

The evolution of Sound Event Detection (SED) in the last few years demonstrates a clear pathway by which analysis of the soundscape can play a major role in creating a smarter world, not only in smart cities but also in intelligent systems for security, urban management, and people’s well-being in general. Thanks to machine learning and deep learning techniques, SED has reached accuracy levels that were unthinkable even a decade ago, allowing for the automatic recognition of a wide spectrum of sounds, ranging from emergency sounds to rare events, as well as common sounds related to people’s homes and security. Despite these positive results, SED has a lot of room for growth, especially in terms of robustness, generalization, and integration in complex urban infrastructures. From the analysis of the previous sections, a number of benefits of SED for smart cities can already be seen, such as monitoring environmental sounds, identifying critical sound sources, supporting evidence-based urban policies, improving public safety through early warning of dangerous events, enhancing emergency services, and protecting vulnerable people at home. The case studies presented in this review, all documented in major scientific databases, highlight a pathway by which SED is becoming a reality, moving from laboratory experiments to a widely deployed technology. This shift is driven not just by better models but also by the growing footprint of IoT and edge computing technologies, which enable real-time processing of acoustic data while helping to protect user privacy.

However, many critical challenges remain. The representativeness of the data, the heterogeneity of the urban environment, the occurrence of rare and difficult-to-annotate scenes, and the variety of devices used to capture the data all limit how well models generalize to real-life situations that differ from those encountered during training. Furthermore, models need to be light, energy-efficient, and reliable enough to be used over long periods of time. Finally, there are social and ethical issues related to the use of SED in smart cities, such as user privacy, the social acceptance of machine listening technologies, and the need for transparency and explainability in decision-making processes.

The future course of research is clearly laid out: multimodal, adaptive, neuro-inspired, and self-supervised Sound Event Detection (SED) systems. The fusion of the acoustic signal with information from other sensors—video, atmospheric, structural, and topographic data—will increase the reliability and accuracy of detection. On the other hand, self-supervised learning and contrastive learning allow for the exploitation of enormous amounts of unlabeled data, reducing the need for expensive human annotation. Finally, hardware specifically designed for machine learning, such as neuromorphic platforms, will facilitate models inspired by the human auditory system, achieving high performance with low energy consumption. In conclusion, this research makes it clear that SED is on the cusp of becoming a basic component of the urban digital infrastructure, helping to ensure safety, sustainability, and quality of life in public and private spaces. SED is more than a detection tool; it is a new paradigm of artificial perception that could give cities an intelligent, pervasive “auditory dimension.” In the coming years, it will be crucial to have the collaboration of researchers, government agencies, IoT technology developers, and security experts to unlock the scientific potential and translate it into reliable, trustworthy solutions. Truly acoustically aware cities are just beginning, but the evidence so far is clear: SED will soon be identified as one of the essential cornerstones of smart city infrastructure.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bdcc10030083/s1, File S1: PRISMA checklist [153].

Author Contributions

Conceptualization, G.C.; methodology, G.C. and V.P.-R.; investigation, G.C.; formal analysis, G.C. and V.P.-R.; writing—original draft preparation, G.C. and V.P.-R.; software, G.C.; writing—review and editing, G.C. and V.P.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This study did not involve human participants, human data, or animals.

Informed Consent Statement

Not applicable. This study did not include human participants.

Data Availability Statement

Data available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Winkowska, J.; Szpilko, D.; Pejić, S. Smart city concept in the light of the literature review. Eng. Manag. Prod. Serv. 2019, 11, 70–86.
Zubizarreta, I.; Seravalli, A.; Arrizabalaga, S. Smart city concept: What it is and what it should be. J. Urban Plan. Dev. 2016, 142, 04015005.
Eremia, M.; Toma, L.; Sanduleac, M. The smart city concept in the 21st century. Procedia Eng. 2017, 181, 12–19.
Ciaburro, G. Security systems for smart cities based on acoustic sensors and machine learning applications. In Machine Intelligence and Data Analytics for Sustainable Future Smart Cities; Springer International Publishing: Cham, Switzerland, 2021; pp. 369–393.
Liu, Y.; Ma, X.; Shu, L.; Yang, Q.; Zhang, Y.; Huo, Z.; Zhou, Z. Internet of things for noise mapping in smart cities: State of the art and future directions. IEEE Netw. 2020, 34, 112–118.
Zappatore, M.; Longo, A.; Bochicchio, M.A. Crowd-sensing our smart cities: A platform for noise monitoring and acoustic urban planning. J. Commun. Softw. Syst. 2017, 13, 53–67.
Rawindaran, N. Legal Considerations and Ethical Challenges of Artificial Intelligence on Internet of Things and Smart Cities. In Data Protection in a Post-Pandemic Society: Laws, Regulations, Best Practices and Recent Solutions; Springer International Publishing: Cham, Switzerland, 2023; pp. 217–239.
Ciaburro, G.; Iannace, G. Improving smart cities safety using sound events detection based on deep neural network algorithms. Informatics 2020, 7, 23.
De Coensel, B.; Botteldooren, D. Smart sound monitoring for sound event detection and characterization. In 43rd International Congress on Noise Control Engineering (INTERNOISE 2014); Australian Acoustical Society: Toowong, Queensland, 2014.
Rashed, A.; Abdulazeem, Y.; Farrag, T.A.; Bamaqa, A.; Almaliki, M.; Badawy, M.; Elhosseini, M.A. Toward Inclusive Smart Cities: Sound-Based Vehicle Diagnostics, Emergency Signal Recognition, and Beyond. Machines 2025, 13, 258.
Castorena, C.; Cobos, M.; Lopez-Ballester, J.; Ferri, F.J. A safety-oriented framework for sound event detection in driving scenarios. Appl. Acoust. 2024, 215, 109719.
Alsina-Pagès, R.M.; Benocci, R.; Brambilla, G.; Zambon, G. Methods for noise event detection and assessment of the sonic environment by the harmonica index. Appl. Sci. 2021, 11, 8031.
Ciaburro, G. Sound event detection in underground parking garage using convolutional neural network. Big Data Cogn. Comput. 2020, 4, 20.
Nagatomo, K.; Yasuda, M.; Yatabe, K.; Saito, S.; Oikawa, Y. On-line sound event localization and detection for real-time recognition of surrounding environment. Appl. Acoust. 2022, 199, 108961.
Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound event detection: A tutorial. IEEE Signal Process. Mag. 2021, 38, 67–83.
Li, K.; Song, Y.; Dai, L.R.; McLoughlin, I.; Fang, X.; Liu, L. Ast-sed: An effective sound event detection method based on audio spectrogram transformer. In ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
Guirguis, K.; Schorn, C.; Guntoro, A.; Abdulatif, S.; Yang, B. SELD-TCN: Sound event localization & detection via temporal convolutional networks. In 2020 28th European Signal Processing Conference (EUSIPCO); IEEE: Piscataway, NJ, USA, 2021; pp. 16–20.
Cao, Y.; Iqbal, T.; Kong, Q.; An, F.; Wang, W.; Plumbley, M.D. An improved event-independent network for polyphonic sound event localization and detection. In ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2021; pp. 885–889.
Bilen, Ç.; Ferroni, G.; Tuveri, F.; Azcarreta, J.; Krstulović, S. A framework for the robust evaluation of sound event detection. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2020; pp. 61–65.
Mnasri, Z.; Rovetta, S.; Masulli, F. Anomalous sound event detection: A survey of machine learning based methods and applications. Multimed. Tools Appl. 2022, 81, 5537–5586.
Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for polyphonic sound event detection. Appl. Sci. 2016, 6, 162.
Kong, Q.; Xu, Y.; Sobieraj, I.; Wang, W.; Plumbley, M.D. Sound event detection and time–frequency segmentation from weakly labelled data. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 777–787.
Park, S.; Bellur, A.; Han, D.K.; Elhilali, M. Self-training for sound event detection in audio mixtures. In ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2021; pp. 341–345.
Wei, W.; Zhu, H.; Benetos, E.; Wang, Y. A-crnn: A domain adaptation model for sound event detection. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2021; pp. 276–280.
Ciaburro, G.; Kumcu, E.H. Using Soundscape to Design Branded Environments. In Building Strong Brands and Engaging Customers with Sound; IGI Global Scientific Publishing: New York, NY, USA, 2024; pp. 1–31.
Yang, Z.; Wei, Y.; Li, H.; Li, Q.; Jiang, L.; Sun, L.; Yu, X.; Hu, C.; Peng, H. Adaptive differentially private structural entropy minimization for unsupervised social event detection. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 2950–2960.
Maurya, M.K.; Kumar, M.; Kumar, M. Sound event detection using federated learning. In 2022 IEEE 9th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON); IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
Karbalaie, A.; Abtahi, F.; Sjöström, M. Event detection in surveillance videos: A review. Multimed. Tools Appl. 2022, 81, 35463–35501.
Mohmmad, S.; Sanampudi, S.K. Exploring current research trends in sound event detection: A systematic literature review. Multimed. Tools Appl. 2024, 83, 84699–84741.
Ciaburro, G.; Kumcu, E.H. The Role of Sound on the Future of E-Commerce Applications Using Metaverse Technologies. In Cutting-Edge Technologies for Business Sectors; IGI Global: Hershey, PA, USA, 2025; pp. 177–204.
Sarkis-Onofre, R.; Catalá-López, F.; Aromataris, E.; Lockwood, C. How to properly use the PRISMA Statement. Syst. Rev. 2021, 10, 117.
Asar, S.H.; Jalalpour, S.H.; Ayoubi, F.; Rahmani, M.R.; Rezaeian, M. PRISMA; preferred reporting items for systematic reviews and meta-analyses. J. Rafsanjan Univ. Med. Sci. 2016, 15, 68–80.
Takkouche, B.; Norman, G. PRISMA statement. Epidemiology 2011, 22, 128.
Amir-Behghadami, M.; Janati, A. Population, Intervention, Comparison, Outcomes and Study (PICOS) design as a framework to formulate eligibility criteria in systematic reviews. Emerg. Med. J. 2020, 37, 387.
Methley, A.M.; Campbell, S.; Chew-Graham, C.; McNally, R.; Cheraghi-Sohi, S. PICO, PICOS and SPIDER: A comparison study of specificity and sensitivity in three search tools for qualitative systematic reviews. BMC Health Serv. Res. 2014, 14, 579.
Meline, T. Selecting studies for systemic review: Inclusion and exclusion criteria. Contemp. Issues Commun. Sci. Disord. 2006, 33, 21–27.
Linares-Espinós, E.; Hernández, V.; Domínguez-Escrig, J.L.; Fernández-Pello, S.; Hevia, V.; Mayor, J.; Padilla-Fernández, B.; Ribal, M.J. Methodology of a systematic review. Actas Urológicas Españolas (Engl. Ed.) 2018, 42, 499–506.
Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; Antes, G.; Atkins, D.; Barbour, V.; Barrowman, N.; Berlin, J.A.; Tugwell, P.; et al. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Rev. Esp. De Nutr. Humana Y Diet. 2014, 18, 172–181.
Puyana-Romero, V.; Díaz-Márquez, A.M.; Garzón, C.; Ciaburro, G. The Domestic Acoustic Environment in Online Education—Part 1: Differences by Gender, Perceived Academic Quality, and Self-Rated Performance. Buildings 2024, 15, 84. [Google Scholar] [CrossRef]
Aletta, F.; Oberman, T.; Mitchell, A.; Tong, H.; Kang, J. Assessing the changing urban sound environment during the COVID-19 lockdown period using short-term acoustic measurements. Noise Mapp. 2020, 7, 123–134. [Google Scholar] [CrossRef]
Aletta, F.; Kang, J.; Axelsson, Ö. Soundscape descriptors and a conceptual framework for developing predictive soundscape models. Landsc. Urban Plan. 2016, 149, 65–74. [Google Scholar] [CrossRef]
Puyana-Romero, V.; Díaz-Márquez, A.M.; Garzón, C.; Ciaburro, G. The Domestic Acoustic Environment in Online Education—Part 2: Different Interference Perception of Sound Sources and While Conducting Academic Tasks. Buildings 2024, 15, 93. [Google Scholar] [CrossRef]
Pandey, D.; Niwaria, K.; Chourasia, B. Machine learning algorithms: A review. Mach. Learn 2019, 6, 916–922. [Google Scholar]
Puyana-Romero, V.; Tamayo-Guamán, L.M.; Núñez-Solano, D.; Hernández-Molina, R.; Ciaburro, G. Artificial Neural Network-Based Model to Characterize the Reverberation Time of a Neonatal Incubator. In Innovations in Machine and Deep Learning: Case Studies and Applications; Springer Nature: Cham, Switzerland, 2023; pp. 305–322. [Google Scholar]
Wang, X.; Jiang, W.; Luo, Z. Combination of convolutional and recurrent neural network for sentiment analysis of short texts. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2428–2437. [Google Scholar]
Gillioz, A.; Casas, J.; Mugellini, E.; Abou Khaled, O. Overview of the Transformer-based Models for NLP Tasks. In 2020 15th Conference on Computer Science and Information Systems (FedCSIS); IEEE: Piscataway, NJ, USA, 2020; pp. 179–183. [Google Scholar]
Rahali, A.; Akhloufi, M.A. End-to-end transformer-based models in textual-based NLP. AI 2023, 4, 54–110. [Google Scholar] [CrossRef]
Xiao, Y.; Zuo, X.; Lu, X.; Dong, J.S.; Cao, X.; Beschastnikh, I. Promises and perils of using Transformer-based models for SE research. Neural Netw. 2025, 184, 107067. [Google Scholar] [CrossRef]
Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
Wu, S.L.; Chang, X.; Wichern, G.; Jung, J.W.; Germain, F.; Le Roux, J.; Watanabe, S. BEATs-based audio captioning model with INSTRUCTOR embedding supervision and ChatGPT mix-up. In Proceedings of the Detection and Classification of Acoustic Scenes and Events, Challenge, Tampere, Finland, 21–22 September 2023; pp. 1–5. [Google Scholar]
Piadyk, Y.; Rulff, J.; Brewer, E.; Hosseini, M.; Ozbay, K.; Sankaradas, M.; Chakradhar, S.; Silva, C. Streetaware: A high-resolution synchronized multimodal urban scene dataset. Sensors 2023, 23, 3710. [Google Scholar] [CrossRef]
Ciaburro, G.; Puyana-Romero, V. Sustainable Membrane-Based Acoustic Metamaterials Using Cork and Honeycomb Structures: Experimental and Numerical Characterization. Buildings 2025, 15, 2763. [Google Scholar] [CrossRef]
Cowling, M.; Sitte, R. Comparison of techniques for environmental sound recognition. Pattern Recognit. Lett. 2003, 24, 2895–2907. [Google Scholar] [CrossRef]
Chu, S.; Narayanan, S.; Kuo, C.C.J. Environmental sound recognition with time–frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1142–1158. [Google Scholar] [CrossRef]
Barchiesi, D.; Giannoulis, D.; Stowell, D.; Plumbley, M.D. Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Process. Mag. 2015, 32, 16–34. [Google Scholar] [CrossRef]
Mesaros, A.; Heittola, T.; Eronen, A.; Virtanen, T. Acoustic event detection in real life recordings. In 2010 18th European Signal Processing Conference; IEEE: Piscataway, NJ, USA, 2010; pp. 1267–1271. [Google Scholar]
Temko, A.; Nadeu, C. Classification of acoustic events using SVM-based clustering schemes. Pattern Recognit. 2006, 39, 682–694. [Google Scholar] [CrossRef]
Mesaros, A.; Heittola, T.; Benetos, E.; Foster, P.; Lagrange, M.; Virtanen, T.; Plumbley, M.D. Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 26, 379–393. [Google Scholar] [CrossRef]
Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN variants for computer vision: History, architecture, application, challenges and future scope. Electronics 2021, 10, 2470. [Google Scholar] [CrossRef]
Cakır, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303. [Google Scholar] [CrossRef]
Piczak, K.J. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP); IEEE: Piscataway, NJ, USA, 2015; pp. 1–6. [Google Scholar]
Ciaburro, G. Deep learning methods for audio events detection. In Machine Learning for Intelligent Multimedia Analytics: Techniques and Applications; Springer: Singapore, 2021; pp. 147–166. [Google Scholar]
Ciaburro, G. Automated Home Security System Based on Sound Event Detection Using Deep Learning Methods. In Modern Advancements in Surveillance Systems and Technologies; IGI Global Scientific Publishing: New York, NY, USA, 2025; pp. 273–302. [Google Scholar]
Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational Intelligence; Springer: Berlin, Heidelberg, 2012; Volume 385. [Google Scholar]
Dey, R.; Salem, F.M. Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS); IEEE: Piscataway, NJ, USA, 2017; pp. 1597–1600. [Google Scholar]
Adavanne, S.; Politis, A.; Nikunen, J.; Virtanen, T. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process. 2018, 13, 34–48. [Google Scholar] [CrossRef]
Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
Ciaburro, G. Time series data analysis using deep learning methods for smart cities monitoring. In Big Data Intelligence for Smart Applications; Springer International Publishing: Cham, Switzerland, 2022; pp. 93–116. [Google Scholar]
Xu, F.; Chen, C.; Shang, Z.; Peng, Y.; Li, X. A CRNN-based method for Chinese ship license plate recognition. IET Image Process. 2024, 18, 298–311. [Google Scholar] [CrossRef]
Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
Gong, Y.; Lai, C.I.; Chung, Y.A.; Glass, J. Ssast: Self-supervised audio spectrogram transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 10699–10709. [Google Scholar]
Zhang, Y.; Li, B.; Fang, H.; Meng, Q. Spectrogram transformers for audio classification. In 2022 IEEE International Conference on Imaging Systems and Techniques (IST); IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
Mohmmad, S.; Sanampudi, S.K. A parametric survey on polyphonic sound event detection and localization. Multimed. Tools Appl. 2025, 84, 22083–22120. [Google Scholar] [CrossRef]
Li, Z.; Fan, R.; Ma, J.; Ai, J.; Dong, Y. Dynamic Temporal Denoise Neural Network with Multi-Head Attention for Fault Diagnosis Under Noise Background. Sensors 2024, 24, 6813. [Google Scholar] [CrossRef]
Chen, Q.; Wang, W.; Wu, F.; De, S.; Wang, R.; Zhang, B.; Huang, X. A survey on an emerging area: Deep learning for smart city data. IEEE Trans. Emerg. Top. Comput. Intell. 2019, 3, 392–410. [Google Scholar] [CrossRef]
Jogin, M.; Madhulika, M.S.; Divya, G.D.; Meghana, R.K.; Apoorva, S. Feature extraction using convolution neural networks (CNN) and deep learning. In 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT); IEEE: Piscataway, NJ, USA, 2018; pp. 2319–2323. [Google Scholar]
Yin, B.; Corradi, F.; Bohté, S.M. Effective and efficient computation with multiple-timescale spiking recurrent neural networks. In Proceedings of the International Conference on Neuromorphic Systems 2020, Oak Ridge, TN, USA, 28–30 July 2020; pp. 1–8. [Google Scholar]
Shaheen, K.; Hanif, M.A.; Hasan, O.; Shafique, M. Continual learning for real-world autonomous systems: Algorithms, challenges and frameworks. J. Intell. Robot. Syst. 2022, 105, 9. [Google Scholar] [CrossRef]
Wang, Y.; Shi, Y.; Zhang, F.; Wu, C.; Chan, J.; Yeh, C.F.; Xiao, A. Transformer in action: A comparative study of transformer-based acoustic models for large scale speech recognition applications. In ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2021; pp. 6778–6782. [Google Scholar]
Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2. [Google Scholar] [CrossRef]
Krishnan, R.; Rajpurkar, P.; Topol, E.J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. 2022, 6, 1346–1352. [Google Scholar] [CrossRef]
Rani, V.; Nabi, S.T.; Kumar, M.; Mittal, A.; Kumar, K. Self-supervised learning: A succinct review. Arch. Comput. Methods Eng. 2023, 30, 2761–2775. [Google Scholar] [CrossRef]
Henaff, O. Data-efficient image recognition with contrastive predictive coding. Proc. Mach. Learn. Res. 2020, 119, 4182–4192. [Google Scholar]
Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2017; pp. 776–780. [Google Scholar]
Guzhov, A.; Raue, F.; Hees, J.; Dengel, A. Audioclip: Extending clip to image, text and audio. In ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2022; pp. 976–980. [Google Scholar]
Chen, T.; Rao, R.R. Audio-visual integration in multimodal communication. Proc. IEEE 1998, 86, 837–852. [Google Scholar] [CrossRef]
Sukel, M.; Rudinac, S.; Worring, M. Multimodal classification of urban micro-events. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1455–1463. [Google Scholar]
Benoit, C.; Martin, J.C.; Pelachaud, C.; Schomaker, L.; Suhm, B. Audio-visual and multimodal speech systems. In Handbook of Standards and Resources for Spoken Language Systems-Supplement; Springer: Boston, MA, USA, 2000; Volume 500, pp. 1–95. [Google Scholar]
Tian, Y.; Shi, J.; Li, B.; Duan, Z.; Xu, C. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 247–263. [Google Scholar]
Bai, J.; Chen, J.; Wang, M. Multimodal urban sound tagging with spatiotemporal context. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 555–565. [Google Scholar] [CrossRef]
Hou, L.; Duan, W.; Xuan, G.; Xiao, S.; Li, Y.; Li, Y.; Zhao, J. Intelligent microsystem for sound event recognition in edge computing using end-to-end mesh networking. Sensors 2023, 23, 3630. [Google Scholar] [CrossRef]
Berghi, D.; Wu, P.; Zhao, J.; Wang, W.; Jackson, P.J. Fusion of audio and visual embeddings for sound event localization and detection. In ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2024; pp. 8816–8820. [Google Scholar]
Luo, L.; Qin, H.; Song, X.; Wang, M.; Qiu, H.; Zhou, Z. Wireless sensor networks for noise measurement and acoustic event recognitions in urban environments. Sensors 2020, 20, 2093. [Google Scholar] [CrossRef] [PubMed]
Han, C.; Seshadri, P.; Ding, Y.; Posner, N.; Koo, B.W.; Agrawal, A.; Lerch, A.; Guhathakurta, S. Understanding pedestrian movement using urban sensing technologies: The promise of audio-based sensors. Urban Inform. 2024, 3, 22. [Google Scholar] [CrossRef]
Hershey, J.R.; Chen, Z.; Le Roux, J.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2016; pp. 31–35. [Google Scholar]
Luo, Y.; Chen, Z.; Hershey, J.R.; Le Roux, J.; Mesgarani, N. Deep clustering and conventional networks for music separation: Stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2017; pp. 61–65. [Google Scholar]
Pujol, H.; Bavu, E.; Garcia, A. BeamLearning: An end-to-end deep learning approach for the angular localization of sound sources using raw multichannel acoustic pressure data. J. Acoust. Soc. Am. 2021, 149, 4248–4263. [Google Scholar] [CrossRef]
Vera-Diaz, J.M.; Pizarro, D.; Macias-Guarasa, J. Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates. Sensors 2018, 18, 3418. [Google Scholar] [CrossRef]
Grumiaux, P.A.; Kitić, S.; Girin, L.; Guérin, A. A survey of sound source localization with deep learning methods. J. Acoust. Soc. Am. 2022, 152, 107–151. [Google Scholar] [CrossRef]
Shuai, L.; Elhilali, M. Task-dependent neural representations of salient events in dynamic auditory scenes. Front. Neurosci. 2014, 8, 203. [Google Scholar] [CrossRef] [PubMed]
Kalinli, O.; Sundaram, S.; Narayanan, S. Saliency-driven unstructured acoustic scene classification using latent perceptual indexing. In 2009 IEEE International Workshop on Multimedia Signal Processing; IEEE: Piscataway, NJ, USA, 2009; pp. 1–6. [Google Scholar]
Jekateryńczuk, G.; Piotrowski, Z. A survey of sound source localization and detection methods and their applications. Sensors 2023, 24, 68. [Google Scholar] [CrossRef] [PubMed]
Pan, Y.F.; Hou, X.; Liu, C.L. A hybrid approach to detect and localize texts in natural scene images. IEEE Trans. Image Process. 2010, 20, 800–813. [Google Scholar] [CrossRef]
Purwins, H.; Li, B.; Virtanen, T.; Schlüter, J.; Chang, S.Y.; Sainath, T. Deep learning for audio signal processing. IEEE J. Sel. Top. Signal Process. 2019, 13, 206–219. [Google Scholar] [CrossRef]
Salamon, J.; Jacoby, C.; Bello, J.P. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 1041–1044. [Google Scholar]
Fuentes, M.; Plaja-Roglans, G.; Cortès-Sebastià, G.; Khandelwal, T.; Miron, M.; Serra, X.; Bello, J.P.; Salamon, J. Soundata: Reproducible use of audio datasets. J. Open Source Softw. 2024, 9, 6634. [Google Scholar] [CrossRef]
Salamon, J.; MacConnell, D.; Cartwright, M.; Li, P.; Bello, J.P. Scaper: A library for soundscape synthesis and augmentation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); IEEE: Piscataway, NJ, USA, 2017; pp. 344–348. [Google Scholar]
McFee, B.; Salamon, J.; Bello, J.P. Adaptive pooling operators for weakly labeled sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 2180–2193. [Google Scholar] [CrossRef]
Xiao, S.; Lei, L. Detection and System Implementation of Abnormal Audio Events in Urban Environments Based on Convolutional Neural Networks. In 2024 International Conference on Culture-Oriented Science & Technology (CoST); IEEE: Piscataway, NJ, USA, 2024; pp. 45–49. [Google Scholar]
Fonseca, E.; Favory, X.; Pons, J.; Font, F.; Serra, X. Fsd50k: An open dataset of human-labeled sound events. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 829–852. [Google Scholar] [CrossRef]
Abeßer, J. Classifying Sounds in Polyphonic Urban Sound Scenes. AES E-Library. 2022. Available online: https://aes.org/publications/elibrary-page/?id=21683 (accessed on 30 January 2026).
Bello, J.P.; Silva, C.; Nov, O.; Dubois, R.L.; Arora, A.; Salamon, J.; Mydlarz, C.; Doraiswamy, H. Sonyc: A system for monitoring, analyzing, and mitigating urban noise pollution. Commun. ACM 2019, 62, 68–77. [Google Scholar] [CrossRef]
Mydlarz, C.; Salamon, J.; Bello, J.P. The implementation of low-cost urban acoustic monitoring devices. Appl. Acoust. 2017, 117, 207–218. [Google Scholar] [CrossRef]
Salamon, J.; Bello, J.P. Unsupervised feature learning for urban sound classification. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2015; pp. 171–175. [Google Scholar]
Mares, D.; Blackburn, E. Acoustic gunshot detection systems: A quasi-experimental evaluation in St. Louis, MO. J. Exp. Criminol. 2021, 17, 193–215. [Google Scholar] [CrossRef]
Picaut, J.; Can, A.; Fortin, N.; Ardouin, J.; Lagrange, M. Low-cost sensors for urban noise monitoring networks—A literature review. Sensors 2020, 20, 2256. [Google Scholar] [CrossRef]
Bello, J.P.; Mydlarz, C.; Salamon, J. Sound analysis in smart cities. In Computational Analysis of Sound Scenes and Events; Springer International Publishing: Cham, Switzerland, 2017; pp. 373–397. [Google Scholar]
Yun, J.; Srivastava, S.; Roy, D.; Stohs, N.; Mydlarz, C.; Salman, M.; Steers, B.; Bello, J.P.; Arora, A. Infrastructure-free, Deep Learned Urban Noise Monitoring at~ 100 mW. In 2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS); IEEE: Piscataway, NJ, USA, 2022; pp. 56–67. [Google Scholar]
Clavel, C.; Ehrette, T.; Richard, G. Events detection for an audio-based surveillance system. In 2005 IEEE International Conference on Multimedia and Expo; IEEE: Piscataway, NJ, USA, 2005; pp. 1306–1309. [Google Scholar]
Ntalampiras, S.; Potamitis, I.; Fakotakis, N. Probabilistic novelty detection for acoustic surveillance under real-world conditions. IEEE Trans. Multimed. 2011, 13, 713–719. [Google Scholar] [CrossRef]
Lojka, M.; Pleva, M.; Kiktová, E.; Juhár, J.; Čižmár, A. Efficient acoustic detector of gunshots and glass breaking. Multimed. Tools Appl. 2016, 75, 10441–10469. [Google Scholar] [CrossRef][Green Version]
Bourenane, S.A.; Henni, S.A. Audio-visual multimodal fall detection to ensure the safety of elderly people in intelligent buildings: An innovative approach using LSTM, CNN, and a shallow neural network. Signal Image Video Process. 2025, 19, 1271. [Google Scholar] [CrossRef]
Ntalampiras, S. Audio pattern recognition of baby crying sound events. In AES; Audio Engineering Society: New York, NY, USA, 2015; Volume 63, pp. 358–369. [Google Scholar]
Liang, Y.C.; Wijaya, I.; Yang, M.T.; Cuevas Juarez, J.R.; Chang, H.T. Deep learning for infant cry recognition. Int. J. Environ. Res. Public Health 2022, 19, 6311. [Google Scholar] [CrossRef]
McGinnis, E.W.; Anderau, S.P.; Hruschak, J.; Gurchiek, R.D.; Lopez-Duran, N.L.; Fitzgerald, K.; Rosenblum, K.L.; Muzik, M.; McGinnis, R.S. Giving voice to vulnerable children: Machine learning analysis of speech detects anxiety and depression in early childhood. IEEE J. Biomed. Health Inform. 2019, 23, 2294–2301. [Google Scholar] [CrossRef]
Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M.; Plumbley, M.D. Detection and classification of acoustic scenes and events. IEEE Trans. Multimed. 2015, 17, 1733–1746. [Google Scholar] [CrossRef]
Martínez-Villaseñor, L.; Ponce, H.; Perez-Daniel, K. Deep learning for multimodal fall detection. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC); IEEE: Piscataway, NJ, USA, 2019; pp. 3422–3429. [Google Scholar]
Christen, P.; Hand, D.J.; Kirielle, N. A review of the F-measure: Its history, properties, criticism, and alternatives. ACM Comput. Surv. 2023, 56, 1–24. [Google Scholar] [CrossRef]
Och, F.J. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003; pp. 160–167. [Google Scholar]
Mesaros, A.; Serizel, R.; Heittola, T.; Virtanen, T.; Plumbley, M.D. A decade of DCASE: Achievements, practices, evaluations and future challenges. In ICASSP 2025–2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
Turpault, N.; Serizel, R.; Wisdom, S.; Erdogan, H.; Hershey, J.R.; Fonseca, E.; Seetharaman, P.; Salamon, J. Sound event detection and separation: A benchmark on desed synthetic soundscapes. In ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2021; pp. 840–844. [Google Scholar]
Adavanne, S.; Pertilä, P.; Virtanen, T. Sound event detection using spatial features and convolutional recurrent neural network. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2017; pp. 771–775. [Google Scholar]
Kong, Q.; Xu, Y.; Wang, W.; Plumbley, M.D. Sound event detection of weakly labelled data with cnn-transformer and automatic threshold optimization. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2450–2460. [Google Scholar] [CrossRef]
Jin, Y.; Wang, M.; Luo, L.; Zhao, D.; Liu, Z. Polyphonic sound event detection using temporal-frequency attention and feature space attention. Sensors 2022, 22, 6818. [Google Scholar] [CrossRef] [PubMed]
Wang, M.; Yao, Y.; Qiu, H.; Song, X. Adaptive memory-controlled self-attention for polyphonic sound event detection. Symmetry 2022, 14, 366. [Google Scholar] [CrossRef]
Turpault, N.; Serizel, R.; Shah, A.P.; Salamon, J. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Proceedings of the Detection and Classification of Acoustic Scenes and Events, New York, NY, USA, 25–26 October 2019. [Google Scholar]
Chen, K.; Du, X.; Zhu, B.; Ma, Z.; Berg-Kirkpatrick, T.; Dubnov, S. Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection. In ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2022; pp. 646–650. [Google Scholar]
Elizalde, B.; Deshmukh, S.; Al Ismail, M.; Wang, H. Clap learning audio concepts from natural language supervision. In ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
Afouras, T.; Owens, A.; Chung, J.S.; Zisserman, A. Self-supervised learning of audio-visual objects from video. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 208–224. [Google Scholar]
Gao, R.; Grauman, K. Visualvoice: Audio-visual speech separation with cross-modal consistency. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2021; pp. 15490–15500. [Google Scholar]
Nakamura, S. Statistical multimodal integration for audio-visual speech processing. IEEE Trans. Neural Netw. 2002, 13, 854–866. [Google Scholar] [CrossRef] [PubMed]
Zhang, Z.; Zhang, S.; Ni, D.; Wei, Z.; Yang, K.; Jin, S.; Huang, G.; Liang, Z.; Zhang, L.; Wang, J.; et al. Multimodal sensing for depression risk detection: Integrating audio, video, and text data. Sensors 2024, 24, 3714. [Google Scholar] [CrossRef]
Arjunan, G. AI beyond text: Integrating vision, audio, and language for multimodal learning. Int. J. Innov. Sci. Res. Technol. 2024, 9, 1911–1920. [Google Scholar]
Mandel, M.I.; Weiss, R.J.; Ellis, D.P. Model-based expectation-maximization source separation and localization. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 382–394. [Google Scholar] [CrossRef]
Ansari, S.; Alnajjar, K.A.; Khater, T.; Mahmoud, S.; Hussain, A. A robust hybrid neural network architecture for blind source separation of speech signals exploiting deep learning. IEEE Access 2023, 11, 100414–100437. [Google Scholar] [CrossRef]
Rouard, S.; Massa, F.; Défossez, A. Hybrid transformers for music source separation. In ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
Wang, Y.; Salamon, J.; Bryan, N.J.; Bello, J.P. Few-shot sound event detection. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2020; pp. 81–85. [Google Scholar]
Nguyen, T.N.T.; Jones, D.L.; Gan, W.S. A sequence matching network for polyphonic sound event localization and detection. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2020; pp. 71–75. [Google Scholar]
Politis, A.; Mesaros, A.; Adavanne, S.; Heittola, T.; Virtanen, T. Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 684–698. [Google Scholar] [CrossRef]
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]

Figure 1. PRISMA 2020 flow diagram for the SED review.

Figure 2. Illustration of the traditional Sound Event Detection pipeline based on engineered features. The pipeline begins with audio preprocessing, extracts hand-crafted features such as MFCCs, spectral flux, and chroma features, and ends with classification by traditional machine learning classifiers such as GMMs or SVMs.

Figure 3. Overview of the main supervised deep learning architectures used for Sound Event Detection, including CNN, CRNN, and Transformer-based models, highlighting their core components and typical processing pipelines.

Figure 4. A map of the main audio foundation model architectures, showing the key groups—convolution-based SSL, transformer-based SSL, and multimodal models—and notable examples from each category.

Figure 5. Taxonomy of multimodal models for urban Sound Event Detection, showing main model categories (Audio + Vision, Audio + Text, Audio + Context, Audio + Foundation Models) and representative examples.

Figure 6. Network-style diagram illustrating task-specific models for urban Sound Event Detection, showing the mapping between key urban tasks and the corresponding model types.

Table 1. Comparison of Public Datasets Used in Sound Event Detection.

Dataset	Year	Domain Focus	# Clips/Hours	# Classes	Annotation Type	Recording Environment	Polyphony	Label Granularity	Main Use in Literature
UrbanSound8K	2014	Urban sounds	8732 clips (~9 h)	10	Clip-level	Real-world urban	Low	Weak labels	Benchmark classification
URBAN-SED	2018	Urban events	10,000 clips (synthetic mixtures)	10	Strong (onset/offset)	Synthetic (simulated scenes)	High	Frame-level	Polyphonic SED
SONYC-UST	2019	Urban noise monitoring	~18,000 clips	23 (coarse/fine)	Weak + partial strong	Real NYC sensors	Medium–High	Clip + partial frame	Urban noise monitoring
FSD50K	2020	General environmental sounds	51,197 clips (~108 h)	200	Weak (multi-label)	Freesound (crowdsourced)	High	Clip-level	Large-scale SED pretraining
USM-SED	2021	Urban sound monitoring	~20 h	~8–10	Strong	Real-world	Medium	Frame-level	Overlapping urban SED
LibriSpeech	2015	Speech	~1000 h	Speech only	Strong (aligned text)	Clean speech	Low	Frame-level (speech)	Speech modeling, pretraining
AudioSet	2017	General audio events	~2M clips (~5800 h)	527	Weak (multi-label)	YouTube	High	Clip-level	Large-scale pretraining
ESC-50	2015	Environmental sounds	2000 clips (~2.8 h)	50	Clip-level	Controlled & field	Low	Weak labels	Benchmark classification
Freesound	Ongoing	General audio	Variable	Variable	Community tags (weak)	User-uploaded	Variable	Weak	Data source for custom datasets

Table 2. Summary of Average Performance on Urban Polyphonic Sound Datasets.

Model	F1-Score	Error Rate (ER)	PSDS-1	PSDS-2
Feature-engineered	30–55%	0.8–1.2	<0.20	<0.30
Deep Supervised (CNN/CRNN)	40–50%	0.65–0.75	0.30–0.45	0.45–0.55
Self-Supervised	60–70%	0.35–0.55	0.60–0.78	0.70–0.82
Multimodal	70–75%	0.30–0.45	0.75–0.85	0.82–0.90
Specialized	60–72%	0.35–0.50	0.60–0.80	0.70–0.85
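
As a reading aid for the F1-Score and Error Rate (ER) columns of Table 2, the following minimal sketch computes segment-based F1 and ER using the definitions commonly adopted in polyphonic SED evaluation: ER sums substitutions, deletions, and insertions per segment and divides by the number of active reference events. The function name `segment_metrics`, the one-set-of-labels-per-segment input format, and the toy labels are assumptions made for illustration only; they are not taken from the reviewed studies, and PSDS is not covered here.

```python
def segment_metrics(reference, estimate):
    """Segment-based micro-averaged F1 and Error Rate.

    reference, estimate: lists of sets, one set of active class labels
    per fixed-length segment (hypothetical input format).
    """
    tp = fp = fn = 0
    subs = dels = ins = n_ref = 0
    for ref, est in zip(reference, estimate):
        seg_tp = len(ref & est)   # classes active in both
        seg_fp = len(est - ref)   # predicted but not in the reference
        seg_fn = len(ref - est)   # in the reference but missed
        tp += seg_tp
        fp += seg_fp
        fn += seg_fn
        # Within a segment, one false positive paired with one false
        # negative counts as a substitution; the remainder are
        # deletions (missed events) or insertions (extra events).
        subs += min(seg_fp, seg_fn)
        dels += max(0, seg_fn - seg_fp)
        ins += max(0, seg_fp - seg_fn)
        n_ref += len(ref)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f1, er


# Toy example with four segments (labels are illustrative).
reference = [{"siren"}, {"siren", "car"}, {"car"}, set()]
estimate = [{"siren"}, {"siren"}, {"drill"}, {"car"}]

f1, er = segment_metrics(reference, estimate)
print(f"F1={f1:.2f}, ER={er:.2f}")  # F1=0.50, ER=0.75
```

In this toy case the second segment contributes a deletion, the third a substitution, and the fourth an insertion, which illustrates why ER can exceed 1.0 when a system produces many insertions, as in the feature-engineered row.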

Table 3. Representative Performance on Standard Urban Benchmarks.

Benchmark	Feature-Based	CNN/CRNN	Self-Supervised Transformer	Multimodal
DESED (Real + Synth)	F1: 20–35%; ER: >1.0; PSDS-1: <0.15	F1: 40–45%; ER: 0.65; PSDS-1: 0.32–0.45	F1: 60–70%; PSDS-1: 0.55–0.75	F1: 70–80%; PSDS-1: 0.75–0.85
SONYC-UST	F1: <40%	F1: 45–55%	F1: 60–65%	F1: 70–75%
UrbanSound8K	Acc ≈ 60–75%	Acc ≈ 85–90%	Acc ≈ 90–95%	—
DCASE Task 4	ER > 0.9	ER 0.65–0.75	ER 0.40–0.55	ER 0.30–0.45

Table 4. Comparison of Estimated Performance Gaps Against CRNN Baselines.

Model Category	ΔF1 vs. CRNN	ΔER vs. CRNN	ΔPSDS-1 vs. CRNN
Engineered Features	−10 to −25%	+0.20 to +0.45	−0.20 to −0.30
Supervised Transformers	+8 to +15%	−0.10 to −0.20	+0.15 to +0.25
PANNs (self-supervised CNN)	−10 to −25%	+0.20 to +0.45	−0.20 to −0.30
HTS-AT	+15 to +20%	−0.15 to −0.25	+0.20 to +0.30
Multimodal CLAP	+20 to +30%	−0.20 to −0.30	+0.30 to +0.40
Models with source separation	+25 to +35%	−0.25 to −0.35	+0.35 to +0.45

Table 5. Performance on Typical Urban Sound Classes (Synthetic Example Based on SONYC and DCASE References).

Urban Class	CRNN (F1%)	PANNs/AST (F1%)	Multimodal CLAP (F1%)	Observations
Sirens	55–60	70–80	80–90	Highly informative class, greatly improved with general embeddings
Drill/Jackhammer	45–55	60–70	70–80	Significant gains thanks to pre-trained models on AudioSet
Vehicles	50–65	65–75	75–85	Combined audio and semantics improve subtype distinction
Human voices	60–70	70–80	80–85	Transformers and multimodal more robust to urban noise
General background noise	40–50	55–65	65–75	Self-supervised models much more stable at scene-event transitions

Table 6. Model requirements and practical applicability.

Category	Label Requirement	Computational Cost	Urban Noise Robustness	Typical Application
Engineered features	Medium	Low	Low	Low-cost sensors, legacy systems
CNN/CRNN supervised	High	Medium	Moderate	Urban monitoring networks with available annotations
Transformers supervised	High	High	High	Advanced reporting, research projects
Self-supervised	Low	Medium/High	Very high	Variable real-world applications, unlabeled scenarios
Multimodal	Low	Very High	Maximum	Smart cities, acoustic video surveillance
Specialized (SELD, separation)	High	High	High	Complex and directional monitoring

Table 7. Performance Comparison Across SED Model Classes (Summary of Benchmark Trends).

Model Class	Typical Strengths	Typical Weaknesses	Typical Metrics & Observed Trends
Feature-engineered models (e.g., MFCC + HMM, NMF-based SED)	Low computational cost, interpretable, stable under controlled noise conditions	Poor performance under polyphony, limited generalization to new cities or devices	F1-score often <50%; ER high (0.5–0.8); PSDS rarely reported because of limited real-world applicability
Supervised deep learning (CNN, CRNN, Transformer-based)	Strong performance on large datasets, robust to moderate noise, can model polyphony effectively	Requires extensive labeled data; performance drops when changing sensor, city, or soundscape	F1-score commonly 65–85%; ER 0.25–0.45; PSDS 0.35–0.55 on DESED
Self-supervised & foundation models	Excellent transfer learning, less dependent on labeled data, high robustness to variability	Require large-scale pretraining; sometimes inefficient for low-resource deployment	F1-score 75–90% (fine-tuned); PSDS 0.55–0.75 on urban benchmarks; strong zero-shot performance emerging
Multimodal models (audio + video, audio + metadata)	Highest contextual awareness; can disambiguate overlapping sounds; strong performance under urban complexity	Require synchronized multimodal data; high computational cost and privacy concerns	F1-score 80–95%; PSDS up to 0.8 in controlled multimodal benchmarks
Task-specialized models (e.g., localization, multichannel SED)	Useful for smart-city deployments; high realism; effective at spatial disambiguation	Poor scalability when reduced to mono; device-dependent performance	ER decreased by up to 30% vs. mono; F1-score +10–20% in multichannel setups

Table 8. Summary of the 66 studies included in the systematic review, reporting application domains, datasets, model types, extracted features, and sound event detection tasks.

Ref.	Application	Dataset	Model	Features	SED Task
[13]	Traffic and mobility monitoring	Custom	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Sound event detection (monophonic)
[15]	Environmental and urban noise monitoring	URBAN-SED	Recurrent and hybrid deep models	Log-Mel Spectrograms	Polyphonic sound event detection
[21]	Environmental and urban noise monitoring	Multi-dataset	Traditional machine learning models	MFCC	Polyphonic sound event detection
[22]	Environmental and urban noise monitoring	Custom	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Multi-label sound event detection
[27]	Smart city sensing and infrastructure monitoring	UrbanSound8K	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Urban sound classification
[29]	Urban surveillance and real-time monitoring	Multi-dataset	Multi-models	Log-Mel Spectrograms	Polyphonic sound event detection
[49]	Smart city sensing and infrastructure monitoring	LibriSpeech	Transformer-based models	Raw audio	Acoustic event classification
[50]	Smart city sensing and infrastructure monitoring	LibriSpeech	Transformer-based models	Raw audio	Acoustic event classification
[54]	Urban surveillance and real-time monitoring	Custom	Traditional machine learning models	MFCC	Acoustic event classification
[55]	Environmental and urban noise monitoring	Custom	Traditional machine learning models	MFCC	Urban sound classification
[56]	Environmental and urban noise monitoring	Multi-dataset	Traditional machine learning models	MFCC	Acoustic event classification
[57]	Urban surveillance and real-time monitoring	Custom	Traditional machine learning models	MFCC	Polyphonic sound event detection
[58]	Crowd and human activity monitoring	CHIL	Traditional machine learning models	MFCC	Acoustic event classification
[62]	Environmental and urban noise monitoring	Multi-dataset	Recurrent and hybrid deep models	Log-Mel Spectrograms	Polyphonic sound event detection
[63]	Environmental and urban noise monitoring	UrbanSound8K	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Urban sound classification
[68]	Urban surveillance and real-time monitoring	Multi-dataset	Recurrent and hybrid deep models	Log-Mel Spectrograms	Polyphonic sound event detection
[73]	Environmental and urban noise monitoring	AudioSet	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Multi-label sound event detection
[74]	Smart city sensing and infrastructure monitoring	AudioSet	Transformer-based models	Log-Mel Spectrograms	Multi-label sound event detection
[75]	Environmental and urban noise monitoring	ESC-50	Transformer-based models	Log-Mel Spectrograms	Urban sound classification
[76]	Urban surveillance and real-time monitoring	Multi-dataset	Traditional machine learning models	MFCC	Polyphonic sound event detection
[87]	Smart city sensing and infrastructure monitoring	AudioSet	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Multi-label sound event detection
[88]	Environmental and urban noise monitoring	AudioSet	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Urban sound classification
[92]	Urban surveillance and real-time monitoring	AVE	Recurrent and hybrid deep models	Log-Mel Spectrograms	Audio–visual event localization
[93]	Smart city sensing and infrastructure monitoring	SONYC-UST v2	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Urban sound tagging
[94]	Smart city sensing and infrastructure monitoring	ESC-50	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Urban sound classification
[95]	Urban surveillance and real-time monitoring	STARSS23	Recurrent and hybrid deep models	Log-Mel Spectrograms	Polyphonic sound event detection
[96]	Smart city sensing and infrastructure monitoring	Custom	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Urban sound classification
[97]	Smart city sensing and infrastructure monitoring	ASPED	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Pedestrian detection and volume estimation
[98]	Urban surveillance and real-time monitoring	WSJ0	Recurrent and hybrid deep models	Log-Mel Spectrograms	Source separation and segmentation
[100]	Smart city sensing and infrastructure monitoring	Custom	Convolutional Neural Networks (CNN-based)	Raw audio	Sound event localization
[101]	Urban surveillance and real-time monitoring	Custom	Convolutional Neural Networks (CNN-based)	Raw audio	Sound event localization
[103]	Smart city sensing and infrastructure monitoring	Custom	Traditional machine learning models	Intensity, pitch, and timbre	Salience-driven sound event detection
[108]	Smart city sensing and infrastructure monitoring	UrbanSound8K	Traditional machine learning models	MFCC	Urban sound classification
[109]	Environmental and urban noise monitoring	Freesound	Environmental and urban noise monitoring	Log-Mel Spectrograms	Reproducible sound event detection
[110]	Smart city sensing and infrastructure monitoring	URBAN-SED	Recurrent and hybrid deep models	Log-Mel Spectrograms	Polyphonic sound event detection
[112]	Urban surveillance and real-time monitoring	SONYC-UST v2	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Abnormal sound event detection
[113]	Smart city sensing and infrastructure monitoring	FSD50K	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Multi-label sound event classification
[114]	Smart city sensing and infrastructure monitoring	USM-SED	Convolutional Neural Networks (CNN-based)	Log-Mel Spectrograms	Polyphonic sound event detection
[115]	Environmental and urban noise monitoring	SONYC	Recurrent and hybrid deep models	Log-Mel Spectrograms	Urban sound tagging
[117]	Environmental and urban noise monitoring	UrbanSound8K	Traditional machine learning models	Log-Mel Spectrograms	Urban sound classification
[122]	Urban surveillance and real-time monitoring	Custom	Traditional machine learning models	MFCC	Abnormal sound event detection
[123]	Urban surveillance and real-time monitoring	Custom	Traditional machine learning models	MFCC	Abnormal sound event detection
[124]	Urban surveillance and real-time monitoring	EAR-TUKE	Traditional machine learning models	MFCC	Dangerous sound event detection
[125]	Smart home and security	Multi-dataset	Recurrent and hybrid deep models	Log-Mel Spectrograms	Audio–visual fall detection
[126]	Smart home and security	Custom	Recurrent and hybrid deep models	Multi-feature fusion	Infant cry classification
[127]	Smart home and security	Custom	Recurrent and hybrid deep models	MFCC	Infant cry classification
[128]	Smart home and security	UVM KID Study	Traditional machine learning models	Multi-feature vector	Internalizing disorder detection
[129]	Smart city sensing and infrastructure monitoring	Multi-dataset	Traditional machine learning models	MFCC + Log-Mel Spectrograms	Polyphonic sound event detection
[130]	Smart home and security	UP-Fall Detection	Convolutional Neural Networks (CNN)	Raw sensor data	Multimodal fall detection and activity recognition
[134]	Smart home and security	DESED	Recurrent and hybrid deep models	Log-Mel Spectrograms	Polyphonic sound event detection
[135]	Environmental and urban noise monitoring	TUT-SED 2016–2019	Convolutional Recurrent Neural Networks (CRNN)	Log-Mel Spectrograms	Polyphonic sound event detection
[136]	Smart city sensing and infrastructure monitoring	Multi-dataset	Convolutional Neural Network + Transformer	Log-Mel Spectrograms	Weakly labeled sound event detection
[137]	Environmental and urban noise monitoring	Multi-dataset	Convolutional Recurrent Neural Networks	FLM Fusion	Polyphonic sound event detection
[138]	Smart home and security	Multi-dataset	Adaptive Memory-Controlled Self-Attention (AMCSA)	Log-Mel Spectrograms	Polyphonic sound event detection
[139]	Smart home and security	DESED	Recurrent and hybrid deep models	Log-Mel Spectrograms	Large-scale Sound Event Detection (SED)
[140]	Smart city sensing and infrastructure monitoring	AudioSet, ESC-50	Transformer-based models	Log-Mel Spectrograms	Sound event detection (monophonic)
[141]	Smart city sensing and infrastructure monitoring	Large-scale Multimodal	Contrastive Language-Audio Pretraining (CLAP)	Joint Multimodal Embeddings	Zero-Shot sound event classification
[142]	Smart home and security	LRS2, VoxCeleb	Self-supervised Attention Network	Audio–Visual Object Embeddings	Active speaker detection
[143]	Smart home and security	VoxCeleb2, LRS3-TED, and AVSpeech	Recurrent and hybrid deep models	Lip motion (ROIs) and Facial appearance attributes	Audio–visual speech separation
[144]	Human–Computer Interaction (HCI)	ATR Audio–Visual	Traditional machine learning models	MFCC	Audio–visual speech recognition (AVSR)
[145]	Smart home and security	Custom	Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN)	MFCCs/Spectrograms	Multimodal behavioral event classification
[146]	Smart city sensing and infrastructure monitoring	Literature Review/Meta-Analysis	Recurrent and hybrid deep models	Cross-modal Embeddings	Comprehensive multimodal event analysis
[147]	Smart home and security	Simulated Binaural Mixtures	Traditional machine learning models	Interaural Level Differences (ILD)	Blind source separation and localization
[149]	Music source separation	MUSDB18-HQ	Hybrid Transformer Demucs (HT Demucs)	Waveforms and Spectrograms	Blind source separation
[150]	Smart home and security	Spoken Wikipedia Corpora (SWC)	Metric-based Few-Shot Learning	Log-Mel Spectrograms	Few-shot sound event detection
[152]	Urban surveillance and real-time monitoring	TAU Spatial Sound Events 2019	Convolutional Recurrent Neural Networks (CRNN)	Log-Mel Spectrograms	Joint Sound Event Localization and Detection (SELD)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ciaburro, G.; Puyana-Romero, V. Sound Event Detection in Smart Cities: A Systematic Review of Methods, Datasets, and Applications. Big Data Cogn. Comput. 2026, 10, 83. https://doi.org/10.3390/bdcc10030083

