Article

Sound Source Localization Using Hybrid Convolutional Recurrent Neural Networks in Undesirable Conditions

by Bastian Estay Zamorano 1, Ali Dehghan Firoozabadi 1,*, Alessio Brutti 2, Pablo Adasme 3, David Zabala-Blanco 4, Pablo Palacios Játiva 5 and Cesar A. Azurdia-Meza 6
1 Department of Electricity, Universidad Tecnológica Metropolitana, Santiago 7800002, Chile
2 Center for Augmented Intelligence, Fondazione Bruno Kessler (FBK), 38123 Trento, Italy
3 Department of Electrical Engineering, Universidad de Santiago de Chile, Santiago 9170124, Chile
4 Department of Computing and Industries, Universidad Católica del Maule, Talca 3466706, Chile
5 Escuela de Informática y Telecomunicaciones, Universidad Diego Portales, Santiago 8370190, Chile
6 Department of Electrical Engineering, Universidad de Chile, Santiago 8370451, Chile
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2778; https://doi.org/10.3390/electronics14142778
Submission received: 11 June 2025 / Revised: 7 July 2025 / Accepted: 8 July 2025 / Published: 10 July 2025

Abstract

Sound event localization and detection (SELD) is a fundamental task in spatial audio processing that involves identifying both the type and location of sound events in acoustic scenes. Current SELD models often struggle with low signal-to-noise ratios (SNRs) and high reverberation. This article addresses SELD by reformulating direction of arrival (DOA) estimation as a multi-class classification task, leveraging deep convolutional recurrent neural networks (CRNNs). We propose and evaluate two modified architectures: M-DOAnet, an optimized version of DOAnet for localization and tracking, and M-SELDnet, a modified version of SELDnet, which has been designed for joint SELD. Both modified models were rigorously evaluated on the STARSS23 dataset, which comprises 13-class, real-world indoor scenes totaling over 7 h of audio, using spectrograms and acoustic intensity maps from first-order Ambisonics (FOA) signals. M-DOAnet achieved exceptional localization (6.00° DOA error, 72.8% F1-score) and perfect tracking (100% MOTA with zero identity switches). It also demonstrated high computational efficiency, training in 4.5 h (164 s/epoch). In contrast, M-SELDnet delivered strong overall SELD performance (0.32 rad DOA error, 0.75 F1-score, 0.38 error rate, 0.20 SELD score), but with significantly higher resource demands, training in 45 h (1620 s/epoch). Our findings underscore a clear trade-off between model specialization and multifunctionality, providing practical insights for designing SELD systems in real-time and computationally constrained environments.

1. Introduction

Sound source localization (SSL) represents a fundamental challenge in acoustic signal processing, defined as the process of estimating the position of one or multiple sound sources within a space, using a microphone array as a reference. At its core, this process involves determining the direction of arrival (DOA) of sound waves by analyzing multichannel signals to identify the azimuth (horizontal angle) and, in some cases, the elevation angle of the source as seen in Figure 1. Accurate source localization is crucial in various applications, including rescue robots capable of locating victims in hazardous environments [1], audio signal separation for improved recording quality [2], robust speech recognition systems effective in noisy environments [3], and advanced sound event localization and detection (SELD) [4] algorithms for surveillance and security systems [5,6].
Traditional SSL methods, often based on mathematical models, are effective in controlled acoustic environments, but they exhibit significant limitations in complex real-world scenarios, where they are burdened by factors such as reverberation, environmental noise, and the simultaneous presence of multiple sound sources [7]. These unpredictable conditions create intricate signal mixtures that traditional algorithms struggle to handle. To address these challenges, there has been growing interest in applying deep learning (DL) techniques, leveraging the remarkable ability of deep neural networks (DNNs) to extract features and perform well under adverse conditions. Recent research and comprehensive reviews have confirmed the feasibility and advantages of deep learning-based localization methods compared to traditional approaches.
The principal aim of this research is to evaluate the robustness, accuracy, and computational efficiency of two hybrid deep neural network models for two-dimensional (2D) SSL in complex acoustic scenarios, analyzing their applicability in real environments, and selecting the model that can be appropriate for this task. Our research demonstrates that properly designed deep learning architectures can maintain high localization accuracy even under challenging conditions like low signal-to-noise ratio (SNR) and high reverberation, outperforming traditional methods while requiring manageable computational resources. This makes them viable candidates for implementation in resource-constrained systems, potentially transforming applications in fields ranging from autonomous vehicles and smart environments to security systems and assistive technologies.
Deep learning-based SSL algorithms can be conceptualized as a linear processing pipeline in which multichannel acoustic signals (as seen in Figure 2), captured through an array of strategically placed microphones, serve as input. These signals are then processed through a feature extraction module that transforms the raw waveforms into informative representations suitable for ingestion by a neural network, which ultimately estimates the DOA of the sound sources [8].
The process begins with the simultaneous acquisition of acoustic signals by a set of microphones, typically arranged in a specific spatial geometry within the environment under study. Although the microphones are relatively close to each other, they are capable of capturing subtle inter-channel differences in time delay and amplitude. These variations arise due to the unique propagation paths followed by the sound waves as they pass from the sources to each microphone. Importantly, the recorded signal includes both direct-path sound and multiple reflected components, which are particularly significant in enclosed environments. In practice, these time-domain signals are often analyzed in the time-frequency domain using the short-time Fourier transform (STFT). In this domain, the convolution operation in the time domain simplifies to a multiplication between the STFT of the source signal and the corresponding acoustic transfer function (ATF), thereby facilitating more efficient processing and feature extraction.
In DL for SSL, the choice of input feature types is crucial. These commonly include low-level audio features like waveforms and spectrograms, as well as engineered features derived from signal processing expertise, such as binaural cues (e.g., interaural time differences (ITD) and interaural level differences (ILD)) and methods like MUSIC and GCC-PHAT. Combinations of these feature types are also frequently used. While the trend leans toward end-to-end learning, preprocessed features remain highly effective due to their computational practicality and established ability to capture spatial and spectral-temporal information. Concatenating multiple feature types is a common and beneficial practice, as it enhances performance by enabling neural networks to integrate diverse information sources. Complementing this, the strategy employed in the neural network’s output layer is fundamental, as it dictates the format of the localization estimation and how crucial aspects like multiple sound sources or DOA precision are handled. Different output methodologies aim to map features extracted by intermediate layers to a spatial representation indicating active sound source locations. In DL-based SSL, two principal strategies prevail: framing the problem as a classification [9] task over a discretized space or directly estimating continuous coordinates through regression [10].
Sound source localization (SSL) is a dynamic and continuously advancing research area with critical implications across security, robotics, healthcare, and the Internet of Things (IoT). SSL methodologies utilize microphone arrays to capture acoustic signals, enabling direction of arrival (DOA) estimation and source localization [11]. Technological advancements in sensing, machine learning, and embedded systems have expanded SSL capabilities, positioning it as integral to next-generation intelligent environments. In security and digital forensics, SSL proves highly effective for detecting, classifying, and localizing critical auditory events like gunfire. Modern systems increasingly incorporate convolutional neural networks (CNNs); Raponi et al. [12] demonstrated over 90% accuracy in firearm identification under noisy field conditions, while Damarla [13] enabled real-time threat assessment via mobile microphone arrays. In robotics, SSL enhances situational awareness through DL-odometry fusion ($R^2 \approx 0.97$) [14], innovative array geometries (e.g., rectangular pyramids [15]), and cost-effective solutions like acoustic beacons (69 mm position error, 0.02 rad orientation error) [16]. In healthcare, particularly ICUs, SSL identifies and monitors noise sources using spatial filtering and acoustic clustering [17], validated at John Radcliffe Hospital. For IoT, SSL enables energy-efficient, real-time localization on low-power devices: multi-stream CNNs achieve 91.41% accuracy with 7.43° DOA error in noise [18], processing samples in 7.811 ms on a Raspberry Pi 4B. Additional studies validate SSL integration in edge devices for assisted living [19], with advancements improving energy efficiency and real-time capabilities for IoT ecosystems [20,21].
The methodology implemented in this research is divided into five primary stages: signal modeling and preprocessing, feature extraction, model adaptation using classification, deep learning training, and performance evaluation. Initially, multichannel acoustic signals are recorded and modeled with noise elements. These signals then go through preprocessing steps such as time-frequency transformation and frame-wise normalization. Subsequently, two CRNN-based architectures, SELDnet and DOAnet, are modified from regression to classification frameworks (M-SELDnet and M-DOAnet) by discretizing the azimuthal space and adjusting their output layers accordingly. To improve training effectiveness and generalizability, specific configurations are applied to each model, including learning rate scheduling, dropout regularization, and early stopping. Lastly, the model performance is assessed using both traditional and SELD-specific metrics, which include localization error, F1-score, error rate, and multi-object tracking accuracy.
Key contributions of this research encompass:
1.
Converting regression-oriented DOA estimation models into resilient classification frameworks, thereby enhancing both training stability and clarity.
2.
Incorporating spatial and spectral features, including acoustic intensity maps, to increase directional sensitivity.
3.
Embedding attention mechanisms within CRNNs to improve the model’s focus on pertinent temporal segments.
4.
Conducting a thorough evaluation using the STARSS23 dataset in realistic acoustic settings, which allows for equitable benchmarking across various performance metrics.
5.
Analyzing the comparative strengths of specialized (M-DOAnet) and multifunctional (M-SELDnet) models, offering insights into their respective advantages in SSL and SELD tasks.
The rest of the article is organized as follows: Section 2 reviews the literature, Section 3 describes the methodology and setup of the models, Section 4 offers insights into the dataset, results, and analysis, and Section 5 wraps up the conclusions of the research and suggests avenues for future research.

2. State of the Art

2.1. Conventional SSL Methods

Prior to the advent of DL techniques, various signal processing methods were developed for SSL. These traditional approaches remain relevant for two primary reasons: they often serve as the foundation for DL-based methods, and many DL models utilize them for input feature extraction.

2.1.1. Generalized Cross-Correlation with Phase Transform (GCC-PHAT)

Given the spatial arrangement of a microphone array, it is possible to estimate the DOA by determining the time difference of arrival (TDOA) of sound sources between microphones. One of the most widely used methods with a pair of microphones is the generalized cross-correlation (GCC) with phase transform (PHAT) [22]. This method is defined as the inverse Fourier transform of a phase-weighted version of the cross-power spectrum (CPS) between the signals from two microphones.
GCC-PHAT has been extended to configurations with more than two microphones, demonstrating improved localization performance by incorporating information from multiple pairs. Another approach to estimating the DOA involves generating an acoustic power map P ( x ) , where local maxima correspond to source directions. The steered response power (SRP) map and its PHAT variant, which are more robust to reverberation, have been widely adopted [23].
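To make the GCC-PHAT computation concrete, the following Python sketch estimates the TDOA between two microphone channels by applying phase-transform weighting to their cross-power spectrum. The function name, FFT length, and the synthetic test signal are illustrative choices for this sketch, not part of the systems discussed above.

```python
import numpy as np

def gcc_phat(x1, x2, fs, n_fft=4096):
    """Estimate the TDOA (seconds) of channel x1 relative to channel x2."""
    # Cross-power spectrum between the two channels
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cps = X1 * np.conj(X2)
    # PHAT weighting: discard magnitude, keep only phase information
    cps /= np.abs(cps) + 1e-12
    cc = np.fft.irfft(cps, n=n_fft)
    # Re-center the circular correlation so that lag 0 sits in the middle
    max_shift = n_fft // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift
    return lag / fs

# Toy example: channel 1 lags channel 2 by 5 samples
fs = 24000
x2 = np.random.randn(2048)
x1 = np.roll(x2, 5)
print(gcc_phat(x1, x2, fs) * fs)  # approximately 5.0
```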

2.1.2. Subspace Methods

Subspace methods represent another classical family of localization algorithms that rely on the computation of the time-averaged CPS matrix R ( f ) and its eigenvalue decomposition (EVD). Assuming that the target source signals and noise are uncorrelated, the multiple signal classification (MUSIC) [24] algorithm uses the EVD to identify the signal and noise subspaces. The signal subspace bases correspond to the columns of the mixing matrix A ( f ) , which represent the multichannel acoustic transfer functions (ATF) of the sources. These bases are then used to scan a given direction for the presence of a source through spatial filtering or beamforming. The Estimation of Signal Parameters via the rotational invariance technique (ESPRIT) [25] leverages the subspace structure to directly estimate the DOA, reducing computational complexity, albeit typically with lower precision than MUSIC. Both methods assume narrowband signals, although broadband extensions have been proposed. Subspace methods are robust to noise and provide accurate estimates, but are generally sensitive to reverberation.

2.2. Neural Network-Based Methods in Sound Source Localization

The field of SSL has seen significant integration and advancement with the development of DNNs. Traditionally, the DNN architectures employed in SSL have often been adapted from successful approaches in other areas of signal processing or deep learning, and subsequently fine-tuned to accommodate the specific characteristics of acoustic signals.

2.2.1. Convolutional Recurrent Neural Networks (CRNNs)

CRNNs are hybrid architectures that combine one or more convolutional layers with one or more recurrent layers. These networks have gained prominence in SSL since 2018 by leveraging the complementary strengths of their components: convolutional layers are effective at extracting spatial features critical for SSL, while recurrent layers are designed to integrate temporal information across frames, capturing sequential dynamics in the acoustic scene. Common architectural design choices for CRNNs in this domain often involve sequences of convolutional layers with varying filter sizes (e.g., 3 × 3), followed by pooling strategies (like max-pooling) to progressively reduce dimensionality while preserving essential features.
A series of works by Adavanne et al. demonstrated the effectiveness of CRNNs for the joint task of SELD in a multitask learning framework, using first-order ambisonic input features. This multi-task approach, where the network learns both sound event detection and localization simultaneously, often benefits from CRNNs as they can learn shared representations beneficial for both sub-tasks. The model introduced by Adavanne et al. [26] (see Figure 3) consisted of a sequence of convolutional layers (followed by max pooling) and bidirectional gated recurrent unit (BGRU) layers. A distinctive feature of their architecture was including an intermediate feedforward layer designed to estimate a spatial pseudo-spectrum (SPS), inspired by the MUSIC algorithm, which conceptually helps to guide the network toward learning useful spatial representations prior to the final azimuth and elevation classification stages.
The popularity of CRNNs in SELD became especially evident in the 2020 DCASE Challenge, where numerous systems were based on this architecture. Singla et al. [27], building upon Adavanne et al. [28], adopted a joint output strategy that merged the SED output with a preceding layer for DOA estimation, rather than using entirely separate output branches. Song [29] followed a similar architectural foundation but implemented separate networks to handle the sequential estimation of the number of sources (Nos) and their respective DOAs. Tian [30] proposed multiple specialized CRNNs: one for NoS detection (up to two sources), another for single-source DOA estimation, and a third for handling cases with two simultaneously active sources. Cao et al. [31] introduced a comprehensive CRNN architecture capable of detecting and localizing up to two instances of the same sound event.
Beyond the DCASE Challenge, other studies have also investigated CRNN variants for SSL. Ronchini et al. [32] explored integrating 1D convolutional filters to better exploit information along feature dimensions, while Sampathkumar and Kowerko [33] extended the reference CRNN architecture (Adavanne et al.) by incorporating additional input features, including log-mel spectrograms, GCC-PHAT, and intensity vectors. Comminiello et al. [34] adapted a CRNN (based on Adavanne et al.) to process quaternion-based FOA input features, reporting improved performance. Perotin et al. applied CRNNs with bidirectional LSTM layers to FOA pseudo-intensity vectors for localizing one (Perotin et al.) [35] or two (Perotin et al.) [36] speakers, demonstrating robust performance in reverberant environments. This line of work was further extended by Grumiaux et al. [37], who achieved substantial improvements by redesigning the network to include deeper convolutional layers and additional max pooling, enabling the simultaneous localization of up to three speakers.
These advances confirm the versatility and power of CRNNs in tackling complex SSL and SELD tasks, particularly in dynamic and acoustically challenging conditions. Their layered structure facilitates both local feature extraction and global temporal reasoning, key aspects in modern audio scene analysis.
Figure 3. CRNN representation [26,28] for SELD: processes FOA spectrograms using convolutional layers (with pooling) and BGRU layers, generating an intermediate spatial pseudo-spectrum (SPS) that is used to infer the final DoA.

2.2.2. Attention-Based Neural Networks

These mechanisms represent a significant advancement in deep learning, designed to allow neural networks to focus their processing on the most relevant parts of an input sequence at any given moment. Initially introduced by Bahdanau et al. [38] to enhance RNN-based sequence-to-sequence models in machine translation, the core idea of attention is to assign varying weights to elements of the input sequence when generating the output. This enables the model to learn both intra-sequence relationships (self-attention) and the relevance of each input element to specific outputs (decoder attention). This concept was instrumental in developing the Transformer architecture (Vaswani et al.) [39], which revolutionized machine translation by fully replacing RNNs with attention mechanisms, allowing greater efficiency in computation.
The applicability of attention-based models has since expanded to numerous machine learning tasks, including SSL. In this context, attention allows models to weigh the importance of different time frames or spectral features when performing localization or event detection. This ability is particularly beneficial in SSL because it can help:
  • Handle polyphony: By enabling the model to focus on individual sound event streams or specific time-frequency bins where a particular source is active, effectively disentangling overlapping sounds.
  • Improve temporal modeling: By allowing the model to weigh information from distant time frames directly, addressing the vanishing gradient problems often associated with traditional RNNs.
  • Reduce redundancy: By selectively attending to the most informative features and suppressing less relevant ones, thereby making the learning process more efficient.
Phan et al. [40,41] applied an attention-based system in the DCASE 2020 challenge, combining convolutional layers and a BGRU with a self-attention layer to jointly estimate source activity and dynamic attention allocation for multiple overlapping sound events. Schymura et al. [42] integrated an attention mechanism after the recurrent layers of a CRNN architecture (see Figure 4), enabling the estimation of source activity along with azimuth and elevation, and demonstrating improved temporal modeling compared to the DCASE reference system.
The influence of the Transformer architecture has also been observed in SSL research. Xinghao et al. [43] replaced traditional convolutional layers with adaptive layers and attention blocks to enhance feature extraction. Compared with RNNs, Transformers offer a key advantage in parallelization: their self-attention mechanisms process all input elements simultaneously, in contrast to the sequential nature of RNNs. This hints at significant computational efficiency benefits for real-time applications. Park et al. proposed a Transformer-based model for SSL using multi-head self-attention (MHSA), employing transfer learning with a pretrained model and averaging predictions over 3-second segments for DOA estimation. Building further on attention integration, Sudarsanam et al. [44] augmented the CRNN architecture of Adavanne et al. by incorporating MHSA blocks and fully connected layers. They systematically optimized the number of attention heads and blocks, and analyzed the effects of positional embeddings, normalization strategies, and residual connections.
Figure 4. Self-attention architecture [45]: processes multichannel spectrograms using convolutional layers and a Transformer encoder to estimate the activity and location of sound sources.

3. Methodology

This research focuses on 2D SSL in challenging acoustic environments, reinterpreting DOA estimation as a multi-class classification problem. These environments present significant hurdles for accurate localization due to factors like reverberation, which smears spatial cues and makes direct path identification difficult; noise, which can mask signals and degrade feature quality; and multi-source scenarios (polyphony), where overlapping sounds severely complicate both detection and localization.
CRNNs are considered to extract spatio-temporal features from multichannel audio signals. This methodology includes a comparative analysis of two modified primary architectures, specifically, M-SELDnet and M-DOAnet, within a unified framework for angular discretization. This classification-based approach offers significant benefits over traditional regression methods. It implicitly handles the inherent uncertainty in DOA estimation, especially in complex scenes, and avoids the struggles of regression with angular wrap-around or sharp boundaries in discretized space. The block diagram of the proposed methodology, outlining the overall signal processing and deep learning pipeline from raw audio input to final DOA predictions, is shown in Figure 5.

3.1. Input Signal Representations

3.1.1. Spectrogram Representations for Speech Signal

In the practice of SSL using DL, a common technique involves using multichannel spectrograms derived from the STFT. These spectrograms are typically structured as 3D tensors, representing time, frequency, and microphone channel.
Several studies have opted to feed neural models with spectral vectors from individual time frames, without explicitly modeling temporal correlations between adjacent frames. This results in frame-wise independent localization estimates. In such cases, the input is an M×K matrix, where M is the number of microphones and K is the number of STFT frequency bins. For instance, Hirvonen [46] combined the logarithmic spectra of eight channels per frame into a 2D matrix for CNN input. Conversely, Chakrabarty and Habets [47,48,49], as well as Mack et al., explored using the multichannel phase spectrogram as a primary feature, often facilitated by synthetic data generation. Bohlender et al. also demonstrated the utility of phase maps. It is important to note that while these approaches can be effective for static source localization, our chosen architecture specifically models temporal correlations between frames. This capability is key for handling the dynamic nature of real-world acoustic scenes and improving tracking of moving or evolving sound events.
Beyond the basic STFT with linearly spaced frequency bins, Mel-scale and Bark-scale spectrograms employ nonlinear subband decompositions that better approximate human auditory perception. These scales offer greater frequency resolution in lower bands, where human hearing is more discriminative, and coarser resolution in higher bands. Mel-scale spectrograms have been a preferred choice in DNN-based SSL architectures due to their better alignment with perceptual cues.
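As a concrete illustration of the multichannel spectrogram features discussed above, the sketch below stacks per-channel log-Mel spectrograms into a single 3D tensor using librosa. The sampling rate, FFT size, hop length, and number of Mel bands are illustrative values, not the exact configuration used in this work.

```python
import numpy as np
import librosa

def multichannel_logmel(audio, sr=24000, n_fft=1024, hop=480, n_mels=64):
    """audio: (channels, samples) array -> (channels, n_mels, frames) tensor."""
    feats = []
    for ch in audio:
        mel = librosa.feature.melspectrogram(y=ch, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        feats.append(librosa.power_to_db(mel))   # log compression
    return np.stack(feats, axis=0)

# One second of a four-channel (FOA) clip
foa = np.random.randn(4, 24000)
print(multichannel_logmel(foa).shape)  # (4, 64, 51)
```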

3.1.2. Ambisonic Signal Representation

In the SSL application, the Ambisonics format is widely used for signal representation. This format is based on decomposing the sound field into Spherical Harmonics (SH) function coefficients. Ambisonics is popular due to its inherent ability to capture spatial properties of the sound field and its independence from the specific microphone array geometry, which offers flexibility in both acquisition and processing. A key advantage of Ambisonics is that its channels are phase-aligned, as the $Y_{\ell,m}(\Omega)$ do not depend on time or frequency.
SH decomposition is applied to the acoustic pressure on a sphere concentric with the array. For a fixed far-field source, the coefficient of order $\ell$ and degree $m \in [-\ell, \ell]$ in the STFT domain is defined as:
$B_{\ell,m}(f,n) = \displaystyle\int_{\Omega \in \mathbb{S}^2} X(f,n,\Omega)\, Y_{\ell,m}^{*}(\Omega)\, d\Omega$
where $X(f,n,\Omega)$ is the acoustic pressure and $Y_{\ell,m}^{*}(\Omega)$ is the complex conjugate of the SH function in the direction $\Omega$. In practice, this integral is estimated via numerical quadrature due to the limited number of microphones, under the assumption that the pressure field is order-limited, with the maximum order $L$ determined by the array configuration. First-order Ambisonics (FOA, $L = 1$) yields four coefficients (channels) per time-frequency (TF) bin, while Higher-Order Ambisonics (HOA, $L > 1$) results in more channels.
A plane wave with amplitude $S(f,n)$ arriving from direction $\Omega$ has a simple SH representation:
$B_{\ell,m}(f,n) = S(f,n)\, Y_{\ell,m}(\Omega)$
Unlike other spatial encoding formats, Ambisonic channels are phase-aligned, as $Y_{\ell,m}(\Omega)$ does not depend on time or frequency. For $J$ sources and reverberation, the multichannel ambisonic spectrogram is modeled as:
$\mathbf{B}(f,n) = \displaystyle\sum_{j=1}^{J} \sum_{r=0}^{\infty} A_{jr}(f,n)\, S_j(f,n)\, \mathbf{Y}(\Omega_{jr}) + \mathbf{N}(f,n)$
where $A_{jr}(f,n)$ are complex amplitudes representing attenuation and phase for the $r$th reflection of source $S_j$ (with $r = 0$ for the direct path), $\mathbf{Y}$ is the vector of appropriate SH basis functions, and $\mathbf{N}(f,n)$ is additive noise. This formulation highlights the spatial expressiveness and mathematical elegance of the Ambisonics representation, making it particularly suitable for spatial audio tasks, including SSL, under various acoustic conditions.
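As a minimal numerical illustration of the plane-wave model above, the sketch below encodes a single source STFT into the four FOA channels. It assumes the ACN channel ordering (W, Y, Z, X) with SN3D-normalized real spherical harmonics; the article does not state which Ambisonics convention its data follow, so this convention is an assumption of the example.

```python
import numpy as np

def foa_plane_wave(s_ft, azimuth_deg, elevation_deg):
    """Encode a plane wave S(f,n) into four FOA channels (assumed ACN/SN3D).

    s_ft: complex STFT of the source, shape (freqs, frames).
    Returns an array of shape (4, freqs, frames): channels W, Y, Z, X.
    """
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    # First-order real spherical harmonics evaluated at the source direction
    sh = np.array([
        1.0,                        # W: Y_{0,0}
        np.sin(az) * np.cos(el),    # Y: Y_{1,-1}
        np.sin(el),                 # Z: Y_{1,0}
        np.cos(az) * np.cos(el),    # X: Y_{1,1}
    ])
    return sh[:, None, None] * s_ft[None, :, :]

# Source at 30 degrees azimuth and 10 degrees elevation
stft_src = np.random.randn(513, 100) + 1j * np.random.randn(513, 100)
print(foa_plane_wave(stft_src, 30, 10).shape)  # (4, 513, 100)
```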

3.1.3. Sound Intensity

In acoustics, the sound intensity is defined as the product of sound pressure (a scalar) and the particle velocity of the medium (a vector). In the TF domain, it is represented as a complex vector.
A widely adopted and practical method for estimating particle velocity in SSL relies on FOA signals. This approach, known as the complex pseudointensity vector, estimates sound intensity from the pressure and pressure gradient signals encoded within the FOA components. It is defined as:
$\mathbf{I}(f,n) = B_{0,0}(f,n)\, \mathbf{B}_{\ell=1}^{*}(f,n)$
where $B_{0,0}(f,n)$ is the zeroth-order SH coefficient (omnidirectional pressure component), and $\mathbf{B}_{\ell=1}^{*}(f,n)$ is the conjugate vector of first-order SH coefficients. In free-field conditions with a single source active in the TF bin $(f,n)$, the real part $\mathrm{Re}(\mathbf{I}(f,n))$ corresponds to the Cartesian coordinates of a vector aligned with the DOA of the source.
In this specific study, sound intensity features, derived from FOA signals, serve as crucial direct inputs to both our M-SELDnet and M-DOAnet models.
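The sketch below shows one way to turn the pseudo-intensity vector defined above into frame-wise azimuth and elevation estimates, again assuming ACN-ordered FOA channels (W, Y, Z, X). It is a minimal illustration of the intensity feature itself, not of the models' internal processing.

```python
import numpy as np

def pseudo_intensity_doa(foa_stft):
    """foa_stft: (4, freqs, frames) complex FOA STFT in assumed ACN order.

    Returns per-frame azimuth and elevation estimates in degrees.
    """
    w = foa_stft[0]                                       # zeroth-order channel
    first_order = foa_stft[1:]                            # Y, Z, X channels
    intensity = np.real(np.conj(w)[None] * first_order)   # real pseudo-intensity
    iy = intensity[0].sum(axis=0)                         # sum over frequency
    iz = intensity[1].sum(axis=0)
    ix = intensity[2].sum(axis=0)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation
```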

3.2. Model Architecture and Classification Adaptation

The models in this research leverage the power of CRNNs, which are uniquely suited for SELD and SSL tasks. The fundamental strength of CRNNs for these applications lies in combining the capacity of CNNs for local feature extraction in the TF domain with the ability of RNNs to capture temporal dependencies and dynamics across frames. This hybrid structure is essential for tasks where the identity, location, and timing of a sound are all crucial pieces of information.
Our methodology involves reinterpreting DOA estimation as a multi-class classification problem, shifting from traditional regression. This classification-based approach offers significant benefits over continuous regression methods, especially for DOA. Firstly, it inherently handles the uncertainty common in DOA estimation, particularly in high-noise or multi-source conditions, where a single continuous value might be ambiguous. Classification, by outputting probabilities over discrete angular sectors, provides a more robust and interpretable prediction. Secondly, it avoids the complexities of angular wrap-around (e.g., 359° vs. 0° being numerically far apart in regression but spatially close) and sharp boundaries in discretized space, which can cause issues for continuous regression loss functions.
The MHSA mechanism, integral to Transformer architectures, is applied within our methodology to enhance the models’ ability to focus on the most relevant features. It is defined as:
$\mathrm{MHSA}(X) := \mathrm{concat}_{h \in [N_h]}\left[\mathrm{SelfAttention}_h(X)\right] W_{\mathrm{out}} + b_{\mathrm{out}}$
Here, $\mathrm{SelfAttention}_h(X)$ represents the output of a single attention head, $N_h$ is the number of attention heads, $W_{\mathrm{out}}$ is the output weight matrix, and $b_{\mathrm{out}}$ is the bias term. This parallel processing of attention allows our models to capture diverse relationships within the input sequence from different representation subspaces, significantly enhancing feature learning and overall performance in complex SSL tasks.
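As an illustration of this mechanism, the sketch below wraps a single MHSA block around a sequence of recurrent-layer outputs using the Keras MultiHeadAttention layer. The sequence length, feature dimension, number of heads, and the residual/normalization wrapper are illustrative assumptions, not the exact configuration of the models evaluated here.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed dimensions: 60 frames, 256 features per frame, 8 attention heads
frames, feat_dim, n_heads = 60, 256, 8

rnn_out = layers.Input(shape=(frames, feat_dim))
mhsa = layers.MultiHeadAttention(num_heads=n_heads, key_dim=feat_dim // n_heads)
# Self-attention: queries, keys, and values all come from the same sequence
context = mhsa(query=rnn_out, value=rnn_out, key=rnn_out)
# Residual connection and layer normalization, a common wrapper around MHSA
out = layers.LayerNormalization()(layers.Add()([rnn_out, context]))

mhsa_block = tf.keras.Model(rnn_out, out)
mhsa_block.summary()
```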

3.3. Modified Architectures

Both SELDnet and DOAnet were converted from regression to classification paradigms, referred to in this research as M-SELDnet and M-DOAnet, by replacing their original continuous-DOA regression outputs with classification over discrete angular sectors. This transformation significantly improves training stability, as classification loss functions are generally more robust in high-noise scenarios. It also simplifies integration into real-time systems by providing discrete outputs and offers consistent evaluation, especially in multi-source conditions where regression ambiguity increases, making the problem more manageable. We settled on a 10-degree (36-class) discretization to achieve a practical trade-off between classification complexity, localization accuracy, and computational efficiency. Finer resolutions (e.g., 5°) would offer more detail but incur higher model size, longer training time, and potential confusion between close angles, whereas coarser options (e.g., 30°) would reduce complexity but compromise spatial precision.
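A minimal sketch of this 10-degree discretization is given below; the helper names are illustrative. It maps a continuous azimuth to one of 36 sector indices and converts a predicted class back to its sector center.

```python
import numpy as np

N_CLASSES = 36                 # 10-degree sectors over the full azimuth circle
SECTOR = 360.0 / N_CLASSES

def azimuth_to_class(azimuth_deg):
    """Map a continuous azimuth (degrees) to one of the 36 sector indices."""
    return int(np.floor((azimuth_deg % 360.0) / SECTOR))

def class_to_azimuth(class_idx):
    """Return the center azimuth of a sector, to convert predictions back."""
    return class_idx * SECTOR + SECTOR / 2.0

# 359 degrees and 1 degree fall into adjacent sectors, avoiding the
# wrap-around ambiguity of continuous regression
print(azimuth_to_class(359.0), azimuth_to_class(1.0))  # 35 0
```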

3.3.1. SELDnet

SELDnet is a CRNN model combining convolutional, recurrent, and attention-based components. Its updated architecture, shown in Figure 6, was specifically adapted for our classification-based DOA estimation to reduce computational complexity and place a special focus on localization performance. It includes:
  • CNN blocks (×3): These use filters of [64, 128, 128] with 3 × 3 kernels, followed by Batch Normalization and max-pooling layers. These blocks are crucial for extracting robust spatial features from the input spectrograms.
  • RNN: Consists of 2 BGRU layers, each with 128 units per direction. The recurrent layers process the features extracted by the CNNs over time, enabling the model to learn temporal dependencies.
  • Attention: A temporal weighting mechanism applied to the RNN outputs, allowing the network to focus on the most relevant time frames for localization.
  • Outputs:
    -
    SED head: Uses sigmoid activation. However, for the specific objectives of this study, this head was deactivated, meaning we did not train or use M-SELDnet for sound event detection, focusing solely on its localization capabilities.
    $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
    -
    DOA head: Uses softmax activation across 36 distinct angular classes, optimized with categorical cross-entropy loss. This is the primary output for our classification-based DOA estimation.
    $\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$
  • Regularization: Includes 0.3 dropout after the RNN layers and Batch Normalization after each CNN block to prevent overfitting.
Given that the SED head was deactivated, the combined loss function for M-SELDnet in this study is solely driven by the DOA estimation:
$\mathcal{L}_{\text{M-SELDnet}} = \mathcal{L}_{\text{DOA}} = -\displaystyle\sum_{k=1}^{36} y_k \log(p_k)$
This refined M-SELDnet architecture, by exclusively focusing on DOA classification and integrating attention mechanisms within its CRNN framework, provides a robust and specialized solution for pinpointing sound sources. The design choices, from the specific CNN filter configurations to the BGRU layers and the categorical cross-entropy loss, are meticulously selected to maximize the accuracy and stability of angular predictions, making it a powerful component of our overall methodology for sound localization.
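To make this description concrete, the following Keras sketch assembles a CRNN with the stated filter counts, two BGRU layers, dropout, a simple temporal attention step, and a 36-class softmax DOA head (the SED head is omitted, as in this study). The input shape, pooling sizes, and the exact form of the attention layer are assumptions for illustration, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed input: 60 frames x 64 frequency bins x 7 feature channels
# (e.g., 4 FOA spectrogram channels plus 3 intensity-map channels)
inp = layers.Input(shape=(60, 64, 7))
x = inp
for n_filters in (64, 128, 128):                      # three CNN blocks
    x = layers.Conv2D(n_filters, (3, 3), padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((1, 4))(x)                # pool along frequency only

x = layers.Reshape((60, -1))(x)                       # keep the frame axis
for _ in range(2):                                    # two BGRU layers
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
x = layers.Dropout(0.3)(x)

# Temporal attention: learn one weight per frame and re-weight the sequence
scores = layers.Dense(1, activation='tanh')(x)
weights = layers.Softmax(axis=1)(scores)
x = layers.Lambda(lambda t: t[0] * t[1])([x, weights])

# DOA head: frame-wise softmax over 36 azimuth sectors
doa = layers.Dense(36, activation='softmax', name='doa')(x)

m_seldnet = tf.keras.Model(inp, doa)
m_seldnet.compile(optimizer='adam', loss='categorical_crossentropy')
m_seldnet.summary()
```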

3.3.2. DOAnet

DOAnet is specifically optimized for multi-source DOA estimation and tracking. Its design emphasizes depth and time modeling, aiming to capture complex temporal patterns and distinguish multiple simultaneous sources. A small angular separation between closely spaced sources (as low as 10°) would require a more specific setup to sub-select samples in order to address these isolated cases through a more detailed assessment. The architecture of the network is illustrated in Figure 7:
  • Dilated CNN (×9): Employs 3 × 1 kernels with dilation rates following a Fibonacci sequence. This use of dilated convolutions is particularly beneficial as it allows the network to expand its receptive field without losing resolution or requiring additional pooling layers. This design enables the model to capture multi-scale temporal patterns efficiently, which is crucial for tracking dynamic and multiple sound sources. Batch Normalization is applied after each convolutional block.
  • RNN: Features 1 BGRU with 2 layers (128 units each) using tanh activation internally. This recurrent component excels at modeling the long-term temporal dependencies required for robust tracking.
  • Output: For this study, M-DOAnet’s output is configured for multi-label classification of DOA, allowing it to predict multiple active sources simultaneously.
    -
    Multi-label: Uses sigmoid activation combined with binary cross-entropy loss. This setup is ideal for scenarios where multiple sound events can occur at different angles within the same time frame.
    $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
    -
    Single-label: uses tanh activation, given by:
    $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
  • Tracking losses: Incorporates Differentiable CLEAR-MOT–inspired dMOTp and dMOTa as auxiliary training losses. These metrics are specifically designed to improve the model’s ability to consistently track sources over time, minimizing ID switches and localization errors for moving or overlapping events.
This meticulously designed DOAnet architecture, with its deep dilated CNNs, powerful recurrent layers, and specialized tracking losses, is crucial for effectively addressing the complexities of multi-source sound localization and tracking. By embracing a multi-label classification paradigm for DOA and optimizing directly for tracking performance, DOAnet provides a sophisticated solution that can discern and follow multiple sound events concurrently, making it a cornerstone for robust and dynamic audio scene analysis in our research.
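The sketch below illustrates the dilated-convolution front end and recurrent back end described above. The Fibonacci dilation schedule, filter counts, input shape, and the per-frame multi-label sigmoid head are plausible assumptions consistent with the description rather than the exact published configuration, and the auxiliary dMOT tracking losses are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

frames, feat_dim = 60, 256                      # assumed input dimensions
fibonacci = [1, 1, 2, 3, 5, 8, 13, 21, 34]      # dilation rate per block

inp = layers.Input(shape=(frames, feat_dim))
x = inp
for rate in fibonacci:                          # nine dilated convolution blocks
    x = layers.Conv1D(128, kernel_size=3, dilation_rate=rate,
                      padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)

# Two bidirectional GRU layers for long-term temporal modeling
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)

# Multi-label DOA head: one sigmoid per 10-degree sector, per frame
doa = layers.Dense(36, activation='sigmoid', name='doa')(x)

m_doanet = tf.keras.Model(inp, doa)
m_doanet.compile(optimizer='adam', loss='binary_crossentropy')
m_doanet.summary()
```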

3.4. Training Configuration

Both models are trained using the ADAM optimizer with default parameters, as listed in Table 1. ADAM is a popular choice for deep learning optimization due to its adaptive learning rates for different parameters and its generally strong performance across a wide range of tasks. It efficiently combines the benefits of AdaGrad and RMSProp, accelerating convergence by adjusting the learning rate for each weight based on the past gradients.
The specific training configurations for each model incorporate adaptive learning rate schedules and early stopping, which are critical for robust model training. These techniques prevent models from getting stuck in suboptimal local minima or overshooting the optimal solution, especially in the later stages of training.
  • M-SELDnet:
    -
    Learning rate (LR): initialized at 0.001. The LR is halved if no validation improvement is observed for five consecutive epochs. This strategy helps the model fine-tune its weights more precisely as it approaches convergence.
    -
    Early stopping: Training halts after 10 stagnant epochs (epochs without validation improvement). This is a vital regularization technique that prevents overfitting by stopping training when the model’s performance on unseen data no longer improves, thereby saving computational resources and improving generalization.
  • M-DOAnet:
    -
    Learning rate (LR): initialized at 0.01. The LR is reduced by a factor of 0.1 after three plateaus (epochs where validation performance stagnates). This aggressive reduction helps M-DOAnet navigate complex loss landscapes.
    -
    Early stopping: Training ceases after three reductions in the learning rate.
These distinct training configurations reflect the different complexities and optimization landscapes of the two architectures. It is important to note the practical implications of training time differences between M-SELDnet and M-DOAnet, especially concerning deployment on resource-constrained devices or in real-time applications. While not explicitly detailed here, these configurations often influence overall training duration and computational cost.
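The adaptive learning-rate schedules and early stopping described above map naturally onto standard Keras callbacks, as sketched below. Monitoring `val_loss` and approximating M-DOAnet's "stop after three LR reductions" rule with a patience of nine stagnant epochs are assumptions made for this sketch.

```python
import tensorflow as tf

# M-SELDnet: LR halved after 5 stagnant epochs, training stops after 10
seldnet_callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                         patience=5),
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                     restore_best_weights=True),
]

# M-DOAnet: LR scaled by 0.1 after 3 stagnant epochs; three such reductions
# before stopping are approximated here with a patience of 9 epochs
doanet_callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                                         patience=3),
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=9,
                                     restore_best_weights=True),
]

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=seldnet_callbacks)
```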

3.5. Evaluation Metrics

The evaluation of M-SELDnet and M-DOAnet is comprehensive, encompassing metrics related to both SED and DOA localization. These metrics are categorized into two types: event-based and frame-based. Frame-based metrics provide a fine-grained, instantaneous view of performance, evaluating the model’s output at every time step. In contrast, event-based metrics offer a more holistic view of how well actual sound events are detected and localized throughout their entire duration, which is often more aligned with real-world application needs.
  • F1-Score: This is a crucial metric for SED, specifically measuring the balance between precision and recall in identifying sound events. It is particularly useful in scenarios with imbalanced classes, providing a single score that represents the harmonic mean of precision and recall.
    $F1\text{-}score = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
    where precision and recall are computed from TP (true positives, correctly detected events), FP (false positives, incorrect detections), and FN (false negatives, missed events).
  • Error Rate (ER): This is a key metric for SED, quantifying the proportion of errors made by the system relative to the total number of ground truth events. It provides a comprehensive measure of event detection accuracy by considering substitutions, deletions, and insertions.
    $ER = \dfrac{S + D + I}{N}$
    where $S$ is the number of substitutions (a correct event detected as an incorrect one), $D$ the number of deletions (missed events), $I$ the number of insertions (false alarms), and $N$ the number of ground-truth events.
  • Localization Recall (LR): Evaluates the proportion of sound sources that were correctly localized within a specified angular threshold (set to $\theta = 20°$ in this study). It indicates the model’s ability to successfully identify the presence and approximate direction of active sound sources.
    $LR = \dfrac{\mathrm{Correct\ DOA}}{\mathrm{Total\ DOA}}$
  • Localization Error (LE): This measures the mean angular error between the detected and true sound sources, but only for those cases that were deemed correct according to the Localization Recall. It provides an insight into the precision of the localization when a source is identified successfully.
    $LE = \dfrac{1}{N_{\mathrm{correct}}} \displaystyle\sum_{i=1}^{N_{\mathrm{correct}}} \angle(y_i, \hat{y}_i)$
    where $\angle(y_i, \hat{y}_i)$ is the angle between the true DOA vector ($y_i$) and the estimated one ($\hat{y}_i$).
  • SELD score: This is a unified metric particularly utilized in DCASE challenges to offer a comprehensive assessment of a model’s effectiveness in both detection and localization. It combines the Error Rate, F1-score, Localization Error, and Localization Recall into a single value, providing a holistic view of the system’s performance. A reduced SELD score signifies improved overall model effectiveness (a minimal computation of this score is sketched after this list).
    $\mathrm{SELD}_{\mathrm{score}} = \dfrac{1}{2}\left[\dfrac{ER + (1 - F1)}{2} + \dfrac{LE/180° + (1 - LR)}{2}\right]$
  • dMOTp and dMOTa: These differentiable tracking metrics, presented by Xu et al. [50], are essential for M-DOAnet. They differ from the static DOA metrics by evaluating the consistency and continuity of source tracks over time, not just instantaneous accuracy. They are vital for assessing how well M-DOAnet manages multi-source scenarios, including handling ID switches and maintaining accurate localization for dynamic or overlapping sound events.
    $\mathrm{dMOTA} = 1 - \dfrac{\sum_t \left(FP_t + FN_t + \gamma\, IDS_t\right)}{\sum_t M_t}, \qquad \mathrm{dMOTP} = 1 - \dfrac{\lVert D \odot B_{TP} \rVert_1}{\lVert B_{TP} \rVert_0}$
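The sketch below combines the four component metrics into the single SELD score exactly as in the equation above. The example values passed to the function are hypothetical and are not results reported in this article.

```python
import numpy as np

def seld_score(er, f1, le_deg, lr):
    """Combine ER, F1, localization error (degrees), and localization recall."""
    detection = (er + (1.0 - f1)) / 2.0
    localization = (le_deg / 180.0 + (1.0 - lr)) / 2.0
    return (detection + localization) / 2.0

# Hypothetical example values, for illustration only
print(round(seld_score(er=0.40, f1=0.70, le_deg=18.0, lr=0.75), 3))
```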

4. Results and Discussion

M-DOAnet and M-SELDnet represent two key paradigms in SELD. While M-DOAnet specializes in the precise estimation of DOA, focusing on positional accuracy, M-SELDnet adopts a broader approach by simultaneously integrating spatial localization with sound event detection and classification. This section presents a detailed comparative analysis of these two modified architectures using the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) [51] dataset from the DCASE 2024 challenge. This dataset, which features multichannel recordings in real-world environments along with precise spatio-temporal annotations, serves as a demanding testbed for real-world scenarios involving reverberation, noise, and overlapping events. This section examines architectural designs, tracks the evolution of performance metrics, compares computational efficiency and practical applicability, and identifies current limitations. It also outlines promising research directions, such as self-supervised learning and attention mechanisms, aiming to provide a clear and comprehensive understanding of existing approaches and future challenges in SELD.

4.1. Dataset Description

Task 3 of the DCASE 2024 Challenge represents a significant advancement in SELD research by introducing Sound Distance Estimation (SDE), expanding the task into Sound Event Localization and Detection with Distance Estimation (SELDDE). The STARSS23 dataset serves as the cornerstone resource for this challenge, featuring multichannel recordings captured in diverse real-world indoor environments with comprehensive spatiotemporal annotations essential for developing robust SELD models.
All audio files are recorded at 24 kHz sampling rate with 16-bit resolution, with clip durations ranging from 30 s to 9 min. The dataset comprises approximately 7.5 h of development data (168 clips: 98 from TAU and 70 from Sony) and 3.5 h of evaluation data (79 clips), distributed under an open license with a total size of 16.3 GB. Building upon its predecessor STARSS22, STARSS23 expands the number of rooms, adds more annotated recordings, and incorporates the key innovation of distance labels—enabling evaluation under real acoustic conditions beyond synthetic datasets used in previous challenges. Spatio-temporal annotations are provided in CSV files with 100 ms temporal resolution, including frame number, class index (0–12), source index, azimuth (°), elevation (°), and distance (cm). This supports the novel objective of distance-aware SELD modeling, with multiple events potentially annotated within the same frame. The core challenge introduced in Task 3 is accurate sound distance estimation, which requires sensitivity to intensity cues and reverberation characteristics like the direct-to-reverberant ratio.
Regarding generalization ability, while STARSS23 provides an excellent real-world spatial audio benchmark, its geographic and device-specific constraints may limit generalization scope. To address this, we utilized FSD50K [52] and ESC50 [53] datasets in preliminary stages for architecture validation, but focused exclusively on STARSS23 for core experiments due to its direct SELDDE relevance and rich annotations aligned with DCASE Task 3 objectives. For model inputs, we employed time-frequency features (STFT, Mel spectrograms) and spatial features (GCC-PHAT, multichannel magnitude/phase maps), all normalized before training. Audio was segmented into 40 ms frames with 50% overlap for M-SELDnet, while M-DOAnet used 0.5 s segments with 0.25 s hop during inference to balance temporal and spatial resolution. The complexity of real-world data motivated data augmentation to improve model generalization.
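For reference, the sketch below loads one STARSS23 metadata file with pandas and selects the events active in a given 100 ms frame. The file path is illustrative, the CSVs are assumed to be headerless, and the column names are assigned here only for readability, following the field order described above.

```python
import pandas as pd

cols = ['frame', 'class_idx', 'source_idx', 'azimuth', 'elevation', 'distance_cm']
labels = pd.read_csv('metadata_dev/fold4_room23_mix001.csv',  # illustrative path
                     header=None, names=cols)

# All events active in one 100 ms frame (several rows per frame under polyphony)
frame_events = labels[labels['frame'] == 120]
print(frame_events[['class_idx', 'azimuth', 'elevation', 'distance_cm']])
```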

4.1.1. Content Characteristics and Target Classes

The STARSS23 dataset consists of real sound scenes that capture the complexity and variability of natural acoustic environments. These recordings involve interactions among people and objects, with loosely scripted scenarios that promote variability in event density, spatial distribution, and polyphony. A total of 13 target sound event classes are annotated, including female and male speech, clapping, telephone ringing, laughter, domestic sounds, footsteps, door opening/closing, music, musical instruments, water tap sounds, bells, and knocks. The classes were selected to reflect realistic audio contexts and were manually annotated by human experts.

4.1.2. Data Collection Method and Recording Setup

Data collection was conducted collaboratively in Tampere, Finland, by the TAU Audio Research Group, and in Tokyo, Japan, by Sony. Recordings were performed in 16 unique indoor rooms (12 in Tampere, 4 in Tokyo), each associated with specific sessions involving unique actors, sound-producing props, and acoustic conditions. Sound events were partially scripted but retained a naturalistic character.
Spatial audio was captured using two formats: FOA and a tetrahedral microphone array (MIC), each with four channels. The spatial annotations of active sound events were derived using a high-precision optical tracking system. Events outside the defined target classes were treated as interference.

4.2. Signal Modeling and Feature Extraction

An $N_m$-channel acoustic signal is captured by an FOA array. Each channel is modeled as:
$x_i(t) = \displaystyle\sum_{j=1}^{N_s} a_{i,j}(t) * s_j(t) + n_i(t)$
where $s_j$ is the $j$-th sound source, $a_{i,j}$ represents the room impulse response (RIR), and $n_i$ is the additive noise. After applying an STFT with a 64 ms window, the signal is transformed into the frequency domain:
$X_i(f,n) = \displaystyle\sum_{j=1}^{N_s} A_{i,j}(f)\, S_j(f,n) + N_i(f,n)$
where f and n denote the frequency and frame indices, respectively.
The specific features extracted for each model are:
  • M-SELDnet: spectrograms and acoustic intensity maps.
  • M-DOAnet: spectrograms and intensity vectors.
All features are normalized per frame and frequency band.
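One plausible implementation of this normalization step is sketched below. The text does not specify whether the statistics are computed per clip or over the whole training set, so standardizing each frame over its frequency bins and then each band over time, per clip, is an assumption of this sketch.

```python
import numpy as np

def normalize_features(features, eps=1e-8):
    """Standardize a (channels, freqs, frames) feature tensor per clip."""
    # Frame-wise: zero mean and unit variance over the frequency axis
    features = (features - features.mean(axis=1, keepdims=True)) / (
        features.std(axis=1, keepdims=True) + eps)
    # Band-wise: zero mean and unit variance over the time axis
    features = (features - features.mean(axis=2, keepdims=True)) / (
        features.std(axis=2, keepdims=True) + eps)
    return features
```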

4.3. Metric Analysis for M-DOAnet

The training process of M-DOAnet demonstrated a clear improvement in predictive performance, as evidenced by the consistent decrease in both loss functions and localization error, as shown in Figure 8.
  • Training Loss: The training loss decreased significantly from an initial value of 55.4 at epoch 0 to 8.66 at epoch 99, representing an 84% reduction. This indicates a substantial improvement in the model’s fit to the training data.
  • Validation Loss: The validation loss also showed a marked reduction, from 42.92 at epoch 0 to 7.58 at epoch 99, reflecting an 82% improvement. The similar trend between training and validation loss suggests that the model generalizes well to unseen data. From epoch 60 onward, both training and validation losses stabilized within a range of 6 to 9, indicating model convergence. However, the lack of further reduction beyond this point may suggest limitations in capturing finer-grained patterns.
  • Localization Error (LE): This key accuracy metric began at a high value of 32.43 degrees in epoch 0, reflecting the model’s initial lack of spatial awareness. However, LE steadily and significantly decreased throughout training, reaching minimum values of approximately 5.30–5.37 degrees in the final epochs, thereby demonstrating M-DOAnet’s effective improvement in spatial estimation.
In addition to LE, several other metrics were recorded to evaluate the performance of M-DOAnet in tracking multiple sound sources:
  • Multiple Object Tracking Accuracy (MOTA): This metric remained consistently at 100% throughout all training epochs. A MOTA of 100% suggests the model exhibits exceptional ability to maintain the identity of sound sources over time, without ID assignment errors or trajectory fragmentation. However, this result may be influenced by specific characteristics of the dataset, such as a potentially clear separation between sources.
  • Number of Identity Switches (IDS): The IDS value remained at 0.0 throughout the entire training process, reinforcing the model’s consistency in assigning unique identifiers without erroneous identity switching.
  • Localization F1-score (LF): The LF score remained stable at 72.79 across all epochs. This metric evaluates precision and recall in the spatial detection and classification of sound events. The constancy of this value, despite improvements in LE, suggests that the model’s ability to correctly identify the spatial occurrence of sounds remained largely unchanged.
The rapid loss reduction observed during the first 30 epochs indicates high initial efficiency in learning prominent representations. The subsequent slowdown and stabilization in learning may reflect architectural limitations or constraints inherent to the dataset. The combined trend of decreasing LE alongside stable MOTA and LF suggests that M-DOAnet primarily excels in DOA estimation.

Computational Efficiency During Training

An analysis of M-DOAnet’s computational efficiency during training is presented in Table 2 and reveals the following parameters:
  • Epoch Duration: The average time to complete one epoch was 164 s (approximately 2.7 min).
  • Total Cost (100 Epochs): The estimated total training time for 100 epochs was approximately 4.5 h.

4.4. Metric Analysis for M-SELDnet

M-SELDnet is designed to simultaneously perform SSL and acoustic event detection with classification. Accordingly, it is evaluated using task-specific metrics for each component.
  • DOA Error: This metric, which quantifies angular discrepancy, started at relatively high values. Over the course of training, the DOA error decreased steadily, reaching values between 0.32 and 0.34 radians in the final epochs. This consistent reduction indicates improved spatial localization accuracy by M-SELDnet.
  • F1-score Overall: As a key metric for event detection and classification, the F1-score overall began at 0.44. It showed a continuous increase, reaching 0.70 by epoch 50 and peaking at 0.75 in the later stages. This significant rise reflects M-SELDnet’s optimization in accurately detecting and classifying acoustic events.
  • Overall Error Rate (Overall ER): This metric, which captures detection errors, initially stood at 0.75. With continued training, it progressively decreased, reaching 0.38 in the final epochs.
  • SELD Error Metric: This composite metric integrates both localization and detection errors (based on F1 and ER). It declined from an initial value of 0.53 in early epochs to 0.20 in the final ones, demonstrating overall model improvement in balancing both tasks.
The evolution of the most important metrics, such as the F1-score and the error metric, is shown in Figure 9.
The concurrent improvement in directional accuracy (reflected by the reduction in DOA error) and in event detection/classification performance (evidenced by the increase in F1-score and decrease in error rate) demonstrates M-SELDnet’s effective integration of its two core functionalities. It is noteworthy that initial metric values indicated substantial room for optimization in the detection task. This highlights the importance of continuous tuning of the architecture and hyperparameters to avoid overfitting, particularly when improvements become marginal in later training stages.

Computational Efficiency During Training

An evaluation of M-SELDnet’s computational efficiency during training is presented in Table 3 and yielded the following parameters:
  • Epoch Duration: The average training epoch duration was 1620 s (approximately 27 min).
  • Total Cost (100 Epochs): The estimated total training time for 100 epochs was approximately 45 h.

4.5. Analysis of Results

A fundamental distinction between M-DOAnet and M-SELDnet lies in their design and the range of tasks they are intended to address. M-DOAnet focuses exclusively on sound source localization, enabling it to specialize and optimize its architecture for minimizing LE. This task-specific specialization provides a clear advantage in applications where positional accuracy is critical, leading to faster convergence and more straightforward model tuning for the localization task.
In contrast, M-SELDnet is designed to simultaneously perform spatial localization and acoustic event detection/classification. This multifunctional nature requires the model to learn representations that capture both spatial and event-related features. The evolution of M-SELDnet’s performance metrics reflects this balance: spatial accuracy (via the reduction in DOA error) and event-related capabilities (via increased F1-score and reduced error rate) improve in tandem. This integrated capacity is valuable for comprehensive acoustic scene understanding.
However, M-SELDnet’s dual-purpose design introduces greater complexity in both model architecture and training process. The optimization of multiple objectives may lead to trade-offs, making it more challenging to identify a hyperparameter configuration that consistently benefits both tasks. This contrast is also reflected in metric interpretation, as shown in Figure 10. The stability of MOTA and LF in M-DOAnet underscores its consistency in tracking localized sources, whereas the coordinated improvement in DOA error and overall F1-score in M-SELDnet highlights its growing capacity to both locate and classify sound events.

4.5.1. Comparative Computational Efficiency M-DOAnet and M-SELDnet

Computational efficiency is a critical factor in assessing the practical viability of M-DOAnet and M-SELDnet. M-DOAnet, with its relatively lightweight architecture, exhibits significantly lower training times. The average duration per training epoch is approximately 2.7 min (164 s), resulting in an estimated total cost of 4.5 h for 100 epochs. This efficiency makes it well suited for applications with time or resource constraints. In contrast, M-SELDnet requires considerably more computational resources. The average epoch duration is around 27 min (1620 s), amounting to an estimated total training time of 45 h for 100 epochs. This higher computational cost stems from the complexity of its multifunctional architecture, which is necessary to simultaneously process localization, detection, and classification information.
As shown in Figure 11, this efficiency gap has direct implications for applicability. M-DOAnet is preferable in systems requiring fast responses or operating on limited hardware. M-SELDnet, despite its higher computational demand, provides a more comprehensive output by delivering both spatial and semantic information. This added value may justify the increased resource consumption in applications where such integrative insights are essential. Ultimately, the choice between the two models depends on the specific requirements of the application and the available computational resources.
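A brief worked check of these figures, using the per-epoch times reported above:

epoch_s = {"M-DOAnet": 164, "M-SELDnet": 1620}   # seconds per epoch, from the text
for model, seconds in epoch_s.items():
    print(f"{model}: {seconds} s/epoch -> ~{seconds * 100 / 3600:.2f} h for 100 epochs")
print(f"relative cost: ~{epoch_s['M-SELDnet'] / epoch_s['M-DOAnet']:.1f}x")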

4.5.2. Practical Implications in Real-World Scenarios

Computational efficiency emerges as a key differentiator between M-DOAnet and M-SELDnet. M-DOAnet’s lightweight design makes it advantageous for scenarios with resource limitations or demands for rapid processing, whereas M-SELDnet offers a more comprehensive understanding of the acoustic environment by integrating event detection and classification with localization, albeit at a significantly higher computational cost. The choice between these architectures depends heavily on balancing specific application requirements with the available computational resources and infrastructure. Generalization to complex real-world environments presents distinct challenges for each model, as shown in Figure 11. While M-DOAnet demonstrates high performance and precision in controlled conditions, making it suitable for tasks like directional surveillance, its singular focus on localization may limit effectiveness in dynamic, noisy, or obstructed scenarios, potentially reducing accuracy and limiting interpretability of the acoustic context. Conversely, M-SELDnet provides a richer, multidimensional analysis valuable for applications such as biodiversity monitoring, yet faces implementation challenges in real-world settings due to its higher computational cost and tendency to overfit. These drawbacks require mitigation strategies to ensure robustness and generalization. Ultimately, the selection depends on a balanced evaluation of environmental conditions and system objectives, favoring M-DOAnet for efficient operation in controllable settings, and M-SELDnet for greater versatility where a deeper acoustic understanding justifies the computational investment.

4.6. Limitations and Future Research Directions

This comparative analysis offers valuable insights while acknowledging inherent limitations. The exclusive indoor evaluation using STARSS23 restricts understanding of model performance in outdoor or adverse weather conditions, potentially limiting generalization. Furthermore, potential angular distribution biases in the dataset may constrain directional robustness assessments. For models like M-DOAnet, saturated tracking metrics such as MOTA and LF suggest reduced sensitivity for capturing subtle spatial tracking improvements. These limitations crucially contextualize our findings and inform future research directions. Addressing these challenges could involve incorporating self-supervised learning techniques to develop more robust spatial representations from multichannel data without explicit labels, thereby improving resilience to noise and reverberation. Simultaneously, exploring attention mechanisms may enhance handling of overlapping events through adaptive weighting of relevant signal components. Finally, comprehensive evaluation in diverse real-world scenarios—potentially enabled by new datasets or domain adaptation techniques—remains essential for validating and improving model robustness across varied acoustic conditions. Pursuing these pathways promises more accurate, adaptable models with stronger practical applicability.
Our models demonstrate compelling trade-offs when contextualized against a state-of-the-art IoT-oriented SSL system [19]: while the referenced multi-stream CNN achieves impressive efficiency (7.811 ms latency on a Raspberry Pi 4B) with competitive accuracy (91.41%, 7.43° DOA error), our specialized M-DOAnet offers superior localization precision, reducing angular error by 19.2% to 6.00° on comparable hardware. This accuracy–efficiency trade-off extends to functionality scope: M-DOAnet provides advanced tracking capabilities absent in the IoT baseline, which are critical for dynamic surveillance but resource-intensive. Similarly, while the IoT solution prioritizes energy efficiency, it exhibits 8.6% lower detection accuracy than our tracking-optimized model. These contrasts reveal an application-dependent optimization frontier: for ultra-constrained edge devices where latency and energy dominate, streamlined architectures remain preferable; in critical applications such as security robotics or industrial monitoring that require high-precision tracking, M-DOAnet’s advantages justify its computational overhead, especially on mid-tier hardware like NVIDIA Jetson platforms. This motivates future work on hybrid architectures that adaptively balance these competing objectives.

5. Conclusions

This study investigated SSL in challenging acoustic environments using CRNNs. A key contribution was converting regression-oriented DOA estimation models into resilient classification frameworks, enhancing training stability and clarity by discretizing the angular space. Two models, M-SELDnet and M-DOAnet, were proposed and evaluated, incorporating spatial and spectral features like acoustic intensity maps to increase directional sensitivity. Both models demonstrated accurate azimuth estimation despite varying noise and reverberation. M-SELDnet proved robust in joint sound event detection and localization, leveraging its attention mechanism and dual output to improve focus on pertinent temporal segments. However, it demanded substantial training time (approximately 45 h per 100 epochs). In contrast, M-DOAnet featured a lighter architecture with significantly lower computational requirements (around 4.5 h per 100 epochs), making it suitable for resource-limited environments, though with less robustness in dynamic tracking.
A thorough evaluation, using the STARSS23 dataset in realistic acoustic settings, allowed for equitable benchmarking across various performance metrics. This comprehensive framework combined traditional metrics like hit rate with specialized metrics such as ER, F1-score, and differentiable tracking metrics (dMOTp, dMOTa). The discretization of the azimuthal space into 36 or 72 sectors offered an effective balance between classification precision and real-time applicability. This holistic comparison, including detailed training behavior analysis, reinforces the potential of both architectures for integration into embedded or real-time applications, offering insights into their respective advantages in SSL and SELD tasks.
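As a brief illustration of the discretization described above, the sketch below maps a continuous azimuth onto a 36- or 72-sector grid (10° or 5° resolution) and back to the sector center; the function names are illustrative and do not come from our implementation.

def azimuth_to_class(azimuth_deg: float, n_sectors: int = 36) -> int:
    # Map an azimuth in [0, 360) degrees to a sector index.
    width = 360.0 / n_sectors
    return int((azimuth_deg % 360.0) // width)

def class_to_azimuth(idx: int, n_sectors: int = 36) -> float:
    # Return the center angle of a sector, in degrees.
    width = 360.0 / n_sectors
    return idx * width + width / 2.0

print(azimuth_to_class(123.0, n_sectors=36))   # -> 12
print(class_to_azimuth(12, n_sectors=36))      # -> 125.0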

Author Contributions

Conceptualization, B.E.Z., A.D.F., A.B., and P.A.; methodology, B.E.Z., A.D.F., and A.B.; software, B.E.Z., A.D.F., A.B., P.A., D.Z.-B., and P.P.J.; validation, B.E.Z., A.D.F., and P.A.; formal analysis, B.E.Z., D.Z.-B., P.P.J., and C.A.A.-M.; investigation, B.E.Z., A.D.F., and P.A.; resources, B.E.Z., A.D.F., and A.B.; data curation, B.E.Z., A.D.F., and A.B.; writing original draft preparation, B.E.Z., A.D.F., and P.A.; writing review and editing, B.E.Z., A.D.F., A.B., P.A., D.Z.-B., P.P.J., and C.A.A.-M.; visualization, B.E.Z., A.D.F., P.A., D.Z.-B., P.P.J., and C.A.A.-M.; supervision, A.D.F.; project administration, A.D.F.; funding acquisition, A.D.F. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support from the following projects: ANID/FONDECYT Iniciación No. 11230129, cost center No: 02030402-999, Department of Electricity, Projects: DICYT Regular No. 062313AS, ANID/FONDECYT Iniciación No. 11240799, ANID Vinculación Internacional FOVI240009, Universidad de Las Américas under Project 563.B.XVI.25, SENESCYT “Convocatoria abierta 2014-primera fase, Acta CIBAE-023-2014”; Pontificia Universidad Católica del Ecuador under Project PEP QINV0485-IINV528020300.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors acknowledge Carolina Andrea Estay Cabrera and Tamara Javiera Vallejos Ramirez, both graduates in English Linguistics and Literature from the University of Chile, for the review, correction, and support in writing the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADAM: Adaptive Moment Estimation
ATF: Acoustic Transfer Function
BGRU: Bidirectional Gated Recurrent Unit
CPS: Cross-Power Spectrum
CRNN: Convolutional Recurrent Neural Network
CNN: Convolutional Neural Network
DCASE: Detection and Classification of Acoustic Scenes and Events
DNN: Deep Neural Network
DOA: Direction of Arrival
dMOTa: Differentiable Multiple Object Tracking Accuracy
dMOTp: Differentiable Multiple Object Tracking Precision
ER: Error Rate
ESPRIT: Estimation of Signal Parameters via Rotational Invariance Technique
EVD: Eigenvalue Decomposition
FOA: First-Order Ambisonics
GCC-PHAT: Generalized Cross-Correlation with Phase Transform
HOA: Higher-Order Ambisonics
ICU: Intensive Care Unit
IDS: Identity Switches
ILD: Interaural Level Difference
IoT: Internet of Things
ITD: Interaural Time Difference
LE: Localization Error
LF: Localization F1-score
LR: Localization Recall
MHSA: Multi-Head Self-Attention
MOTA: Multiple Object Tracking Accuracy
MUSIC: Multiple Signal Classification
NoS: Number of Sources
SELD: Sound Event Localization and Detection
SELDDE: Sound Event Localization and Detection with Distance Estimation
SH: Spherical Harmonics
SNR: Signal-to-Noise Ratio
SPS: Spatial Pseudo-Spectrum
SRP: Steered Response Power
SSL: Sound Source Localization
STFT: Short-Time Fourier Transform

References

  1. Sun, H.; Yang, P.; Zu, L.; Xu, Q. A Far Field Sound Source Localization System for Rescue Robot. In Proceedings of the 2011 International Conference on Control, Automation and Systems Engineering (CASE), Singapore, 30–31 July 2011; pp. 1–4. [Google Scholar] [CrossRef]
  2. Wang, H.; Chu, P. Voice source localization for automatic camera pointing system in videoconferencing. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 21–24 April 1997; Volume 1, pp. 187–190. [Google Scholar] [CrossRef]
  3. Swietojanski, P.; Ghoshal, A.; Renals, S. Convolutional Neural Networks for Distant Speech Recognition. IEEE Signal Process. Lett. 2014, 21, 1120–1124. [Google Scholar] [CrossRef]
  4. Mu, D.; Zhang, Z.; Yue, H.; Wang, Z.; Tang, J.; Yin, J. Seld-mamba: Selective state-space model for sound event localization and detection with source distance estimation. arXiv 2024, arXiv:2408.05057. [Google Scholar]
  5. Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Audio Surveillance of Roads: A System for Detecting Anomalous Sounds. IEEE Trans. Intell. Transp. Syst. 2016, 17, 279–288. [Google Scholar] [CrossRef]
  6. Yasuda, M.; Saito, S.; Nakayama, A.; Harada, N. 6DoF SELD: Sound Event Localization and Detection Using Microphones and Motion Tracking Sensors on Self-Motioning Human. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1411–1415. [Google Scholar] [CrossRef]
  7. Firoozabadi, A.D.; Irarrazaval, P.; Adasme, P.; Zabala-Blanco, D.; Palacios-Játiva, P.; Durney, H. Three-dimensional sound source localization by distributed microphone arrays. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 196–200. [Google Scholar] [CrossRef]
  8. Dehghan Firoozabadi, A.; Abutalebi, H.R. A Novel Nested Circular Microphone Array and Subband Processing-Based System for Counting and DOA Estimation of Multiple Simultaneous Speakers. Circuits Syst. Signal Process. 2016, 35, 573–601. [Google Scholar] [CrossRef]
  9. Aparicio, D.D.G.; Politis, A.; Sudarsanam, P.A.; Shimada, K.; Krause, D.; Uchida, K.; Koyama, Y.; Takahashi, N.; Takahashi, S.; Shibuya, T.; et al. Baseline models and evaluation of sound event localization and detection with distance estimation in DCASE 2024 Challenge. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE, Tokyo, Japan, 23–25 October 2024; pp. 41–45. [Google Scholar]
  10. Zhang, X.; Chen, Y.; Yao, R.; Zi, Y.; Xiong, S. Location-Oriented Sound Event Localization and Detection with Spatial Mapping and Regression Localization. arXiv 2025, arXiv:2504.08365. [Google Scholar]
  11. Krause, D.A.; Politis, A.; Mesaros, A. Sound Event Detection and Localization with Distance Estimation. In Proceedings of the 2024 32nd European Signal Processing Conference (EUSIPCO), Lyon, France, 26–30 August 2024; pp. 286–290. [Google Scholar] [CrossRef]
  12. Raponi, S.; Oligeri, G.; Ali, I.M. Sound of guns: Digital forensics of gun audio samples meets artificial intelligence. Multimed. Tools Appl. 2022, 81, 30387–30412. [Google Scholar] [CrossRef]
  13. Damarla, T. Detection of gunshots using microphone array mounted on a moving platform. In Proceedings of the 2015 IEEE Sensors Conference, Busan, Republic of Korea, 1–4 November 2015. [Google Scholar]
  14. Boztas, G. Sound source localization for auditory perception of a humanoid robot using deep neural networks. Neural Comput. Appl. 2023, 35, 6801–6811. [Google Scholar] [CrossRef]
  15. Chen, G.; Xu, Y. A sound source localization device based on rectangular pyramid structure for mobile robot. J. Sens. 2019, 2019, 4639850. [Google Scholar] [CrossRef]
  16. Ogiso, S.; Kawagishi, T.; Mizutani, K.; Wakatsuki, N.; Zempo, K. Self-localization method for mobile robot using acoustic beacons. ROBOMECH J. 2015, 2, 12. [Google Scholar] [CrossRef]
  17. Müller-Trapet, M.; Cheer, J.; Fazi, F.M.; Darbyshire, J.; Young, J.D. Acoustic source localization with microphone arrays for remote noise monitoring in an intensive care unit. Appl. Acoust. 2018, 139, 93–100. [Google Scholar] [CrossRef]
  18. Ko, J.; Kim, H.; Kim, J. Real-time sound source localization for low-power IoT devices based on multi-stream CNN. Sensors 2022, 22, 4650. [Google Scholar] [CrossRef] [PubMed]
  19. Fabregat, G.; Belloch, J.A.; Badía, J.M.; Cobos, M. Design and implementation of acoustic source localization on a low-cost IoT edge platform. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 3547–3551. [Google Scholar] [CrossRef]
  20. Belloch, J.A.; Badía, J.M.; Igual, F.D.; Cobos, M. Practical considerations for acoustic source localization in the IoT era: Platforms, energy efficiency, and performance. IEEE Internet Things J. 2019, 6, 5068–5079. [Google Scholar] [CrossRef]
  21. Aguirre, D.; Firoozabadi, A.D.; Seguel, F.; Soto, I. Proposed energy based method for light receiver localization in underground mining. In Proceedings of the 2016 IEEE International Conference on Automatica (ICA-ACCA), Curico, Chile, 19–21 October 2016; pp. 1–6. [Google Scholar] [CrossRef]
  22. Luu, G.; Ravier, P.; Buttelli, O. The generalized correlation methods for estimation of time delay with application to electromyography. In Proceedings of the 1st International Symposium on Engineering Physics and Mechanics (ISEPM), 25–26 October 2011; pp. 1–6. [Google Scholar]
  23. Firoozabadi, A.D.; Abutalebi, H.R. SRP-ML: A Robust SRP-based speech source localization method for Noisy environments. In Proceedings of the 18th Iranian Conference on Electrical Engineering (ICEE), Isfahan, Iran, 11–13 May 2010; pp. 11–13. [Google Scholar]
  24. Schmidt, R. Multiple Emitter Location and Signal Parameter Estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280. [Google Scholar] [CrossRef]
  25. Khan, Z.I.; Awang, R.A.; Sulaiman, A.A.; Jusoh, M.H.; Baba, N.H.; Kamal, M.M.; Khan, N.I. Performance analysis for Estimation of signal Parameters via Rotational Invariance Technique (ESPRIT) in estimating Direction of Arrival for linear array antenna. In Proceedings of the 2008 IEEE International RF and Microwave Conference, Kuala Lumpur, Malaysia, 2–4 December 2008; pp. 530–533. [Google Scholar]
  26. Adavanne, S.; Politis, A.; Virtanen, T. Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. arXiv 2017, arXiv:1710.10059. [Google Scholar]
  27. Singla, R.; Tiwari, S.; Sharma, R. A Sequential System for Sound Event Detection and Localization Using CRNN; Technical Report; DCASE: Tokyo, Japan, 2020. [Google Scholar]
  28. Sharath, A.; Politis, A.; Nikunen, J.; Virtanen, T. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process. 2019, 13, 34–48. [Google Scholar]
  29. Song, J.-M. Localization and Detection for Moving Sound Sources Using Consecutive Ensembles of 2D-CRNN; Technical Report; DCASE: Tokyo, Japan, 2020. [Google Scholar]
  30. Tian, C. Multiple CRNN for SELD; Technical Report; DCASE: Tokyo, Japan, 2020. [Google Scholar]
  31. Cao, Y.; Iqbal, T.; Kong, Q.; Zhong, Y.; Wang, W.; Plumbley, M.D. Event-Independent Network for Polyphonic Sound Event Localization and Detection. arXiv 2020, arXiv:2010.00140. [Google Scholar]
  32. Ronchini, F.; Arteaga, D.; Pérez-López, A. Sound event localization and detection based on CRNN using rectangular filters and channel rotation data augmentation. arXiv 2020, arXiv:2010.06422. [Google Scholar]
  33. Sampathkumar, A.; Kowerko, D. Sound Event Detection and Localization Using CRNN Models; Technical Report; DCASE: Tokyo, Japan, 2020. [Google Scholar]
  34. Comminiello, D.; Lella, M.; Scardapane, S.; Uncini, A. Quaternion convolutional neural networks for detection and localization of 3D sound events. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12—17 May 2019; pp. 8533–8537. [Google Scholar] [CrossRef]
  35. Perotin, L.; Serizel, R.; Vincent, E.; Guérin, A. CRNN-based joint azimuth and elevation localization with the ambisonics intensity vector. In Proceedings of the 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, 17–20 September 2018; pp. 241–245. [Google Scholar]
  36. Perotin, L.; Serizel, R.; Vincent, E.; Guérin, A. CRNN-based multiple DOA estimation using acoustic intensity features for ambisonics recordings. IEEE J. Sel. Top. Signal Process. 2019, 13, 22–33. [Google Scholar] [CrossRef]
  37. Grumiaux, P.A.; Kitic, S.; Girin, L.; Guerin, A. Improved feature extraction for CRNN-based multiple sound source localization. arXiv 2021, arXiv:2105.01897. [Google Scholar]
  38. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  40. Phan, H.; Pham, L.; Koch, P.; Duong, N.Q.; McLoughlin, I.; Mertins, A. Audio Event Detection and Localization with Multitask Regression Network; Technical Report; DCASE: Tokyo, Japan, 2020. [Google Scholar]
  41. Phan, H.; Pham, L.; Koch, P.; Duong, N.Q.; McLoughlin, I.; Mertins, A. On multitask loss function for audio event detection and localization. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE Workshop), Virtual, 2–4 November 2020. Technical Report. [Google Scholar]
  42. Schymura, C.; Ochiai, T.; Delcroix, M.; Kinoshita, K.; Nakatani, T.; Araki, S.; Kolossa, D. Exploiting attention-based sequence-to-sequence architectures for sound event localization. arXiv 2021, arXiv:2103.00417. [Google Scholar]
  43. Sun, X.; Hu, Y.; Zhu, X.; He, L. Sound event localization and detection based on adaptive hybrid convolution and multi-scale feature extractor. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE Workshop), Online, 15–19 November 2021. Technical Report. [Google Scholar]
  44. Sudarsanam, P.; Politis, A.; Drossos, K. Assessment of self attention on learned features for sound event localization and detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE Workshop), Online, 15–19 November 2021. Technical Report. [Google Scholar]
  45. Schymura, C.; Bönninghoff, B.; Ochiai, T.; Delcroix, M.; Kinoshita, K.; Nakatani, T.; Araki, S.; Kolossa, D. PILOT: Introducing transformers for probabilistic sound event localization. arXiv 2021, arXiv:2106.03903. [Google Scholar]
  46. Hirvonen, T. Classification of spatial audio location and content using convolutional neural networks. Audio Eng. Soc. Conv. 2015, 138, 9294. [Google Scholar]
  47. Chakrabarty, S.; Habets, E.A.P. Broadband DOA estimation using convolutional neural networks trained with noise signals. In Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 15–18 October 2017; pp. 136–140. [Google Scholar]
  48. Chakrabarty, S.; Habets, E.A.P. Multi-speaker localization using convolutional neural network trained with noise. arXiv 2017, arXiv:1712.04276. [Google Scholar]
  49. Chakrabarty, S.; Habets, E.A.P. Multi-speaker DOA estimation using deep convolutional networks trained with noise signals. IEEE J. Sel. Top. Signal Process. 2019, 13, 8–21. [Google Scholar] [CrossRef]
  50. Xu, Y.; Osep, A.; Ban, Y.; Horaud, R.; Leal-Taixé, L.; Alameda-Pineda, X. How to train your deep multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6787–6796. [Google Scholar]
  51. Politis, A. STARSS23: Sony-TAu Realistic Spatial Soundscapes 2023. Zenodo 2023. Available online: https://zenodo.org/records/7880637 (accessed on 20 May 2025).
  52. Fonseca, E.; Favory, X.; Pons, J.; Font, F.; Serra, X. Fsd50k: An open dataset of human-labeled sound events. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 829–852. [Google Scholar] [CrossRef]
  53. Trowitzsch, I.; Taghia, J.; Kashef, Y.; Obermayer, K. The NIGENS general sound events database. arXiv 2019, arXiv:1902.08314. [Google Scholar]
Figure 1. Polar coordinate reference in SSL application.
Figure 2. Generalized block diagram for an SSL system.
Figure 5. The block diagram of the proposed methodology based on the NN for SSL in adverse environments.
Figure 6. Diagram of proposed M-SELDnet architecture.
Figure 7. Diagram of proposed M-DOAnet architecture.
Figure 8. Train loss and validation loss graph for M-DOAnet.
Figure 9. Evolution of (a) F1 Overall and (b) Error Metric in M-SELDnet.
Figure 10. Comparison of metrics between M-SELDnet and M-DOAnet.
Figure 11. Time per epoch for the M-SELDnet and M-DOAnet networks.
Table 1. ADAM optimizer moment estimate values and epsilon.
β1 | β2 | ε
0.9 | 0.999 | 10⁻⁸
Table 2. M-DOAnet: Epoch-specific metrics to summarize performance.
Epoch | LE (degrees) | MOTA (%) | LF (%)
0 | 32.43 | 100.00 | 72.79
10 | 20.04 | 100.00 | 72.79
20 | 13.65 | 100.00 | 72.79
50 | 7.84 | 100.00 | 72.79
100 | 6.00 | 100.00 | 72.79
Table 3. M-SELDnet: Epoch-specific metrics for summarizing performance.
Epoch | F1 | ER | LE (rad) | Error Metric
1 | 0.00 | 1.00 | 1.00 | 0.76
10 | 0.44 | 0.75 | 0.56 | 0.53
20 | 0.58 | 0.54 | 0.42 | 0.35
50 | 0.70 | 0.43 | 0.34 | 0.24
100 | 0.75 | 0.38 | 0.32 | 0.20
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
