1. Introduction
Over the last decade, considerable effort has gone into sound source localization (SSL) technology. SSL is the problem of determining the spatial location of a sound source using only the signals captured by a microphone array. Traditional approaches rely on signal processing algorithms, including estimation of the time difference of arrival (TDOA) between microphone pairs using techniques such as generalized cross-correlation with phase transform (GCC-PHAT) [1, 2, 3], beamforming methods such as steered response power with phase transform (SRP-PHAT) [4, 5], and spectral estimation approaches such as the multiple signal classification (MUSIC) algorithm [6, 7]. However, all of these classical algorithms share the same weakness: their performance degrades dramatically when reverberation and noise are present simultaneously [6].
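To make the classical TDOA step concrete, the following is a minimal sketch of GCC-PHAT for a single microphone pair. It assumes NumPy is available; the function name `gcc_phat`, its parameters, and the small regularization constant are illustrative choices, not taken from the cited works.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) with GCC-PHAT."""
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)           # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:                 # optionally limit the search range
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that negative lags come first and zero lag sits at index max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

A positive return value means `sig` lags `ref`. For a noise signal delayed by five samples at `fs=16000`, the estimate should be close to 5/16000 s in the absence of reverberation; the sharp peak that makes this possible is what reverberation and noise tend to smear.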
Researchers have therefore turned to machine learning as a potential solution [8]. In recent years, numerous deep learning-based SSL algorithms have been developed. In [1, 9, 10, 11, 12, 13, 14], inspired by traditional methods, the TDOA is used as the input feature to a neural network, leveraging the network’s nonlinear fitting capability to map this feature to the sound source location. These studies show that neural networks can learn this mapping effectively, improving the accuracy of azimuth estimation. However, these approaches are limited to azimuth estimation and cannot estimate the source distance. In [15, 16], sound intensity and TDOA are combined as input features to the neural network. These studies show that sound intensity is strongly correlated with source distance, which allows azimuth and distance to be estimated simultaneously. Despite these advances, these methods rely on two-dimensional microphone arrays, which suffer from the mirror effect and do not enable full three-dimensional localization.
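The mirror effect can be illustrated with a small geometric sketch. The array layout, source positions, and speed of sound below are hypothetical values chosen for illustration, not the configurations used in the cited studies. For microphones lying in the plane z = 0, a source and its reflection across that plane are equidistant from every microphone, so their TDOAs are indistinguishable.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def tdoas(source, mics, c=SPEED_OF_SOUND):
    """TDOA (s) of each microphone relative to the first microphone."""
    dists = [math.dist(source, m) for m in mics]
    return [(d - dists[0]) / c for d in dists]

# Hypothetical planar array: four microphones in the z = 0 plane
PLANAR_MICS = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.0)]
# Same array with one extra microphone lifted out of the plane
SPATIAL_MICS = PLANAR_MICS + [(0.0, 0.0, 0.1)]

source_above = (1.0, 2.0, 0.5)
source_below = (1.0, 2.0, -0.5)   # mirror image across the array plane

above = tdoas(source_above, PLANAR_MICS)
below = tdoas(source_below, PLANAR_MICS)       # matches `above`: ambiguous

above_3d = tdoas(source_above, SPATIAL_MICS)
below_3d = tdoas(source_below, SPATIAL_MICS)   # last entry differs: ambiguity resolved
```

Adding a single out-of-plane microphone breaks the symmetry, which is why a non-planar array is needed for full three-dimensional localization.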
The methods discussed so far rely on manually extracted features as neural network inputs, which presents an inherent challenge. This approach restricts the input to features that have already been rigorously studied in sound source localization and overlooks potentially valuable characteristics that remain unexplored. Consequently, important information may be lost during feature extraction, which prevents full use of the available data and limits positioning accuracy. Drawing inspiration from the human auditory system’s natural ability to localize sound sources, an alternative strategy has emerged [17, 18]. In this approach, neural network models mimic the functional mechanisms of the human ear, eliminating the need for manual feature extraction. These models determine the spatial orientation of sound sources directly from raw audio signals, forming an end-to-end sound source localization framework. This methodology not only aligns with the biological capabilities of human hearing but also reflects a leading trend in the evolution of artificial neural networks toward more holistic and autonomous learning systems.
End-to-end systems have achieved substantial progress across diverse audio applications [17, 19, 20, 21, 22]. For example, the end-to-end SSL model in [17] directly infers sound source azimuth from raw binaural waveforms, validating that end-to-end architectures markedly improve reverberation suppression and noise reduction; its integrated frequency analysis module further delivers state-of-the-art azimuth estimation accuracy, setting a new task benchmark.
Parallel progress has been made in audio synthesis, with end-to-end models that directly map text to raw waveforms [23, 24]. These works validate end-to-end learning’s ability to align discrete text and continuous audio signals, underscoring this paradigm’s transformative potential for audio processing.
Although existing deep learning-based sound source localization methods have made substantial progress, they still face the following key challenges in practical applications:
1. Insufficient robustness against interference in complex acoustic environments: background noise, reverberation, and non-stationary interference can severely disrupt the spatial characteristics of the sound field, making it difficult for conventional neural networks to stably extract reliable azimuth information.
2. Inadequate fusion of spatial and spectral features: many methods process spectral and spatial features separately, neglecting the high-order interactions between them, which limits localization accuracy under low signal-to-noise ratio (SNR) conditions.
3. Weak adaptability of network structures to critical frequency bands: different frequency bands contribute differently to source localization, and existing methods lack effective channel attention mechanisms to dynamically enhance informative frequency bands while suppressing interfering ones.
This paper introduces a novel deep convolutional neural network (CNN) designed to directly infer the 3-D spatial coordinates of sound sources in a Cartesian framework from raw waveforms, eliminating manual feature extraction. Leveraging a stereo microphone array, the proposed model resolves spatial ambiguity. The architecture contains three core components: a frequency analysis module, a residual connection module, and a squeeze-and-excitation (SE) block. Inspired by [17], the frequency analysis module is integrated to enhance performance. Unlike conventional methods that simply stack convolutional layers, we introduce residual connections to capture both global waveform characteristics and fine-grained local features, thus improving localization accuracy. Furthermore, the SE block acts as a channel attention mechanism for dynamic feature weighting, which optimizes feature utilization and further boosts overall performance.
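The channel attention and residual ideas described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the network proposed in this paper: the feature shape `(channels, time)`, the two-layer bottleneck, the weight matrices `w1` and `w2`, and the plain identity skip are all hypothetical.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation over a feature map of shape (channels, time).

    w1: (channels // r, channels) bottleneck weights (hypothetical)
    w2: (channels, channels // r) expansion weights (hypothetical)
    """
    squeezed = x.mean(axis=1)                      # squeeze: global average per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)        # bottleneck layer with ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # excitation: sigmoid gate in (0, 1)
    return x * scale[:, None]                      # reweight each channel

def residual_se(x, w1, w2):
    """Identity skip around the SE-weighted features: x + SE(x)."""
    return x + se_block(x, w1, w2)

# Example with 8 channels, 16 time steps, and reduction ratio r = 4
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16))
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
out = residual_se(features, w1, w2)                # same shape as `features`
```

In a trained network the gate values would be learned so that informative channels are amplified relative to interfering ones; here the weights are random and only the data flow is shown.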
4. Conclusions
To improve the accuracy of sound source localization, this paper addresses the challenges posed by low SNR and long reverberation time. Traditional signal processing algorithms often struggle in environments with high reverberation and low SNR. To overcome these limitations, this study builds on the proven effectiveness of deep learning across various domains and proposes an end-to-end sound source localization model. By integrating a frequency analysis module, a channel attention module, and a residual module, the model improves robustness against reverberation and noise. The conclusions are as follows:
(1) The ablation study shows that the SE module contributes more to overall performance than the residual connection (RC) module, demonstrating SE’s strong robustness against reverberation, noise, and other interference.
(2) The results from both simulations and real-world experiments demonstrate that the combined use of RC and SE can further enhance the algorithm’s robustness against reverberation, noise, and other interference.
(3) The experiments in real-world physical environments indicate that the end-to-end sound source localization algorithm is significantly more robust against reverberation and noise than traditional algorithms, with an average performance improvement of approximately 60%.
Despite significant advancements in sound source localization, several areas remain open for further exploration. Future research will focus on the following directions to enhance the robustness and generalizability of the proposed method:
(1) Establishment of Diverse Acoustic Environments: Future work will aim to create acoustic environments that closely resemble real-world conditions, including those similar to the training data distribution and non-stationary noise.
(2) Multi-sound Source Localization Challenge: Future research will explore the integration of video signals as complementary inputs, fusing auditory and visual modalities to enable robust multi-source localization.
(3) Moving Sound Source Localization and Tracking Problem: Future research will explore source separation approaches to achieve simultaneous localization and tracking of multiple moving sound sources.