1. Introduction
As a critical modality in multimodal perception, audio signals carry rich semantic information about the environment, extending perceptual range and enhancing situational awareness [1]. Sound Event Localization and Detection (SELD) is a cutting-edge task in computational auditory perception. It involves jointly identifying the type of sound event and estimating its spatial location, combining Sound Event Detection (SED) with Direction of Arrival (DoA) estimation. This integration is essential for developing human-like auditory systems. SELD has wide applications in speech recognition and localization [2,3,4], intelligent surveillance [5,6], and robotic navigation [7,8], making it both a highly valuable research area and a technology with significant real-world impact.
Due to the complexity of SELD, models must effectively capture and integrate audio features across temporal, spectral, and channel dimensions [9]. To address these challenges, various deep learning-based approaches for SELD have been proposed in recent years. Examples include neural networks with dynamic convolutional kernels to enhance adaptability to local features [10]; bidirectional gated recurrent units (BiGRUs) for capturing temporal dependencies [11,12]; and Transformer architectures for modeling global contextual relationships [13,14]. In addition, multi-scale fusion mechanisms [15] and time–frequency attention techniques [9] have been widely adopted to improve the perception of features at different granularities. These methods have led to significant progress on public benchmarks, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge and the Learning 3D Audio Sources (L3DAS) project.
In recent years, State Space Models (SSMs) [16] have shown strong potential in natural language processing [17] and speech processing [18,19], owing to their linear-time complexity and efficient modeling capabilities. Among these, the selective State Space Model known as Mamba [20] introduces an input-dependent dynamic selection mechanism that significantly enhances the modeling of long-range sequential features, making it a key focus in current sequence modeling research. Mu et al. [21] applied the Mamba architecture to the SELD task, demonstrating its ability to capture broader contextual information while maintaining computational efficiency, thereby providing initial evidence of SSMs’ potential in audio modeling. Building on this foundation, the Vision Mamba (VMamba) model [22] extends Mamba’s capability to handle two-dimensional spatial data. It features a 2D-Selective-Scan (SS2D) module, which incorporates a cross-scan mechanism to enable directional sensitivity in spatial dimensions (as shown in Figure 1). This design preserves the long-range modeling strength of SSMs while bridging the structural gap between one-dimensional scanning and two-dimensional vision-based architectures, resulting in strong spatial modeling performance and computational efficiency. These architectural innovations introduce a promising new modeling approach for SELD: the SSM framework effectively captures temporal continuity and contextual dependencies in long audio sequences, while the Vision State Space (VSS) structure enables deep spatial feature extraction, potentially addressing the limitations of spatial modeling in complex acoustic environments.
Despite its effectiveness in modeling long-range dependencies, the original Mamba architecture is mainly designed for image tasks and struggles to capture frequency-domain features and multi-scale structures in audio. This limitation, coupled with the inherent time–frequency coupling in SELD, makes single-scale or unidirectional scanning insufficient for modeling local details and multi-scale interactions.
To this end, we propose an attention-based Feature Fusion State Space model (FFMamba), designed to enhance both local spatial detail modeling and long-sequence dependency modeling. First, we develop a Multi-Scale Fusion Visual State Space (MSFVSS) module that integrates feature representations from multiple receptive fields. Unlike conventional state space modules that primarily model sequential dependencies, MSFVSS fuses features across different scales, thereby strengthening local spatial perception while preserving spatial resolution and channel dimensions. This improves the network’s sensitivity to fine-grained local acoustic patterns, such as transient sound events.
Second, we introduce the Wavelet Transform-Enhanced Downsampling (WTED) module, which combines discrete wavelet decomposition with convolutional downsampling. This mechanism preserves frequency-domain features that are often lost in convolution or pooling operations. By integrating spatial and spectral information during downsampling, WTED enhances the model’s ability to capture time–frequency interactions, which are crucial for SELD.
In summary, FFMamba leverages the MSFVSS module to strengthen local feature representation and the WTED module to fuse spatial and spectral features, thereby improving the model’s robustness and accuracy in sound event detection and localization.
The main contributions of this work are as follows:
In this paper, we propose a Feature Fusion State Space model with an attention mechanism (FFMamba). To the best of our knowledge, this is the first work to apply the VSS architecture to the SELD task, replacing the commonly used Transformer-based structures.
We introduce a Multi-Scale Fused Visual State Space (MSFVSS) module. In this design, the Multi-Scale Spatial Fusion (MSF) component replaces the depthwise convolution layer in the original VSS to enhance its capability in modeling local spatial details.
This paper proposes a Wavelet Transform-Enhanced Downsampling (WTED) module, which combines convolutional downsampling with multi-scale frequency features from wavelet transform and enhances them through channel weighting.
The proposed method is evaluated on two benchmark datasets and compared with existing mainstream approaches, demonstrating its effectiveness and superiority in practical application scenarios.
3. Proposed Method
3.1. Preprocessing
In this study, first-order Ambisonics (FOA)-format audio is used across all datasets to extract both time–frequency and spatial features for joint SED and DoA estimation. The input audio is sampled at 24 kHz.
In this paper, log-Mel spectrogram features are extracted as the input for the SED branch. To this end, the Short-Time Fourier Transform (STFT), which can simultaneously capture both amplitude and phase information of the audio signal, is applied for multichannel audio time–frequency analysis. The specific parameter settings are as follows: a Hann window is used with a window length of L = 512, a hop size of H = 300, and an FFT size of N = 512. The STFT result $X(t, k)$ is then computed as follows:

$$X(t, k) = \sum_{n=0}^{L-1} x_t(n)\, w(n)\, e^{-j 2\pi k n / N},$$

where $k$ denotes the frequency index, $n$ represents the time index within a frame, $w(n)$ is the window function, and $x_t(n)$ indicates the sampled point of the original waveform in the $t$-th frame.
The power spectral density $P(t, k)$ reflects the energy of the audio signal and is defined as:

$$P(t, k) = \left| X(t, k) \right|^2.$$
Finally, a log-Mel filter bank is applied to generate the log-Mel spectrogram, with the number of filters set to 128. The computation is given by:

$$\mathrm{LogMel}(t, b) = \log\left( \sum_{k} H_b(k)\, P(t, k) + \varepsilon \right),$$

where $b$ denotes the filter index, $b = 1, \ldots, 128$, $H_b(k)$ represents the triangular filter, and $\varepsilon$ is a small positive constant.
Compared with linear filter banks, log-Mel filter banks are more consistent with the characteristics of human auditory perception, providing higher resolution at low frequencies and lower resolution at high frequencies. In addition, the logarithmic transformation compresses the dynamic range and enhances the robustness of the model to energy differences.
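As an illustration, the log-Mel pipeline above can be sketched with librosa using the stated parameters (24 kHz sampling, Hann window of length 512, hop size 300, 512-point FFT, 128 Mel filters). The function and variable names below are illustrative; this is a minimal sketch, not the authors' code.

```python
import numpy as np
import librosa

def extract_log_mel(audio, sr=24000, n_fft=512, win_length=512,
                    hop_length=300, n_mels=128, eps=1e-8):
    """Log-Mel spectrogram for one channel of FOA audio (sketch)."""
    # STFT with a Hann window: L = 512, H = 300, N = 512
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length,
                        win_length=win_length, window="hann")
    power = np.abs(stft) ** 2                       # power spectral density
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_fb @ power                       # (n_mels, T)
    return np.log(mel_spec + eps)                   # log compression

# Example: 4-channel FOA clip -> (4, 128, T) log-Mel tensor
foa = np.random.randn(4, 24000 * 5)                 # 5 s of dummy FOA audio
log_mel = np.stack([extract_log_mel(ch) for ch in foa])
print(log_mel.shape)
```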
For FOA audio, the intensity vector (IV) feature is extracted as the input for the DoA branch. The intensity vector is computed from the phase and magnitude of the four-channel spectrograms, which characterizes the directional flow of energy in the sound field and serves as an important spatial cue for modeling directional features. First, an STFT is applied to each channel signal, and the frequency-domain representation is given by:

$$X_i(t, k), \quad i \in \{W, X, Y, Z\},$$

where $W$ represents the omnidirectional component, and $X$, $Y$, and $Z$ represent the components along different directions in the Cartesian coordinate system.
Then, the raw directional components along X, Y, and Z are computed with respect to W within each time–frequency unit as follows:

$$I_a(t, k) = \mathrm{Re}\{ X_W^{*}(t, k)\, X_a(t, k) \}, \quad a \in \{X, Y, Z\},$$

where $*$ denotes the complex conjugate, and $\mathrm{Re}\{\cdot\}$ denotes taking the real part.
To improve numerical stability and obtain directional information, normalization is applied. The energy intensity is defined as:

$$E(t, k) = \sqrt{I_X^2(t, k) + I_Y^2(t, k) + I_Z^2(t, k)}.$$

The intensity vector is then given by:

$$\mathrm{IV}(t, k) = \frac{\big[ I_X(t, k),\; I_Y(t, k),\; I_Z(t, k) \big]}{E(t, k) + \varepsilon},$$

where $I_a(t, k)$ denotes the raw directional component, $E(t, k)$ denotes the energy intensity, and $\varepsilon$ denotes a small positive constant.
The intensity vector is mapped to the Mel frequency scale to ensure alignment with the log-Mel features along the time–frequency axes. The computation is given by:

$$\mathrm{IV}_{\mathrm{mel}}(t, b) = \sum_{k} H_b(k)\, \mathrm{IV}(t, k),$$

where $H_b(k)$ represents the Mel filter bank, and $b = 1, \ldots, 128$.
Finally, the input feature F is given by:

$$F = \mathrm{Concat}\big( \mathrm{LogMel},\; \mathrm{IV}_{\mathrm{mel}} \big).$$

The adopted feature F consists of a 4-channel log-Mel spectrogram and a 3-channel IV, enabling joint modeling of time–frequency and spatial characteristics.
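A corresponding sketch of the intensity-vector branch is given below. The channel ordering (W, X, Y, Z), the norm-based normalization, and the reuse of the Mel filter bank for the mapping step follow the formulation above but remain assumptions about implementation details; all names are illustrative.

```python
import numpy as np
import librosa

def foa_intensity_vector(foa, sr=24000, n_fft=512, hop_length=300,
                         n_mels=128, eps=1e-8):
    """Mel-scale intensity vector from 4-channel FOA audio (sketch)."""
    # Per-channel STFT; channel order assumed to be (W, X, Y, Z)
    specs = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop_length,
                                   window="hann") for ch in foa])
    W, X, Y, Z = specs
    # Raw directional components: Re{W* . A}, A in {X, Y, Z}
    I = np.stack([np.real(np.conj(W) * X),
                  np.real(np.conj(W) * Y),
                  np.real(np.conj(W) * Z)])          # (3, F, T)
    # Normalize by the intensity magnitude for numerical stability (assumed)
    norm = np.sqrt(np.sum(I ** 2, axis=0, keepdims=True)) + eps
    I = I / norm
    # Map to the Mel scale so the IV aligns with the log-Mel features
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.einsum("mf,cft->cmt", mel_fb, I)       # (3, n_mels, T)

foa = np.random.randn(4, 24000 * 5)
iv_mel = foa_intensity_vector(foa)
print(iv_mel.shape)                                   # (3, 128, T)
```

Stacking the 4-channel log-Mel spectrogram with this 3-channel Mel-scale IV then yields the 7-channel input feature F described above.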
3.2. Network Architecture
The overall architecture of the proposed SELD model is illustrated in Figure 3 and follows a hierarchical encoder–decoder structure. The encoder, referred to as FFMamba, consists of one convolutional block, four layers of MSFVSS modules, and three WTED-based downsampling layers. The decoder comprises two BiGRU layers followed by two fully connected layers.
The encoder processes audio features extracted from FOA recordings and encodes them into intermediate representations. The input to the model has a shape of C × T × F, where C denotes the total number of feature channels, consisting of the 4-channel log-Mel spectrogram and the 3-channel IV, T denotes the number of temporal frames, and F is the number of frequency bins. The input features F are first passed through a convolutional block for preliminary processing, extracting low-level local time–frequency features and expanding the channel dimension. The convolutional block can be expressed using the following formulation:

$$Z = \mathrm{GELU}\big( \mathrm{BN}( \mathrm{Conv}(F) ) \big),$$

where Z denotes the output of the convolution block, BN denotes batch normalization, GELU is the activation function, and Conv denotes a 3 × 3 convolution.
The extracted features are fed into four MSFVSS modules with a 1:2:2:1 configuration, which preserve spatial dimensions and channel size while enriching high-level spatial representations. Each MSFVSS module is followed by a WTED module that halves the spatial resolution and doubles the channel number, thereby compressing the feature maps and enhancing representational capacity. After four stages, a compact high-level semantic representation is obtained.
In the decoder, a two-layer BiGRU is employed as the core structure to capture long-term dependencies in temporal sequences. To reduce temporal dimensionality and mitigate the influence of redundant features, the input is first processed through average pooling before being fed into the two-layer BiGRU. This allows for temporal context modeling and extraction of high-level features with global time dependencies. To support the multi-task learning framework, the decoder adopts a dual-branch structure. Each branch consists of two fully connected layers that perform nonlinear mapping of the shared deep features, enhancing the model’s ability to differentiate task-specific information. At the output stage, the SED branch employs a Sigmoid activation function to ensure that the existence probability of each event class lies within the [0, 1] range, reflecting the likelihood of different sound events occurring in various directions. Meanwhile, the DoA branch uses a Tanh activation function to normalize the three-dimensional coordinate outputs within the [−1, 1] range, satisfying the output constraints of the 3D sound source localization task and representing the unit vector direction of the sound sources.
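For concreteness, the sketch below mirrors the decoder path described above: average pooling, a two-layer BiGRU, and dual fully connected branches with Sigmoid and Tanh outputs. The hidden size, class count, pooling axis, and module names are illustrative assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class SELDDecoder(nn.Module):
    """BiGRU decoder with dual SED/DoA heads (illustrative sketch)."""
    def __init__(self, in_channels=256, hidden=256, n_classes=12):
        super().__init__()
        self.gru = nn.GRU(in_channels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # SED branch: per-class activity probabilities in [0, 1]
        self.sed_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes), nn.Sigmoid())
        # DoA branch: (x, y, z) unit-vector components per class in [-1, 1]
        self.doa_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * n_classes), nn.Tanh())

    def forward(self, x):
        # x: encoder output of shape (B, C, T, F')
        x = x.mean(dim=-1)            # average pooling over the frequency axis
        x = x.transpose(1, 2)         # (B, T, C) for the BiGRU
        x, _ = self.gru(x)
        return self.sed_head(x), self.doa_head(x)

feat = torch.randn(2, 256, 50, 16)    # dummy encoder output
sed, doa = SELDDecoder()(feat)
print(sed.shape, doa.shape)           # (2, 50, 12), (2, 50, 36)
```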
3.3. MSFVSS Module
To enhance the local spatial modeling capability of the VSS module while capturing multi-scale audio features, we propose the MSFVSS module, as illustrated in Figure 4a. As shown in Figure 1, SS2D scans the input features along four different directions, encoding positional information and generating four distinct feature sequences. The S6 module processes each of the four feature sequences independently [22] (as shown in Figure 2), modeling long-range spatial dependencies within each sequence. The outputs are then fused to form a 2D feature map as the final output. Excluding residual connections and the Multilayer Perceptron (MLP), the MSFVSS module can be expressed using the following formulation:

$$Z = \mathrm{Linear}\Big( \mathrm{LN}\big( \mathrm{SS2D}( \mathrm{SiLU}( \mathrm{Linear}( \mathrm{LN}(X) ) ) ) \big) \Big),$$

where Linear denotes a linear layer, LN refers to layer normalization, SiLU is the activation function, and SS2D represents the 2D-Selective-Scan module.
We propose the MSF module, as illustrated in Figure 4b. By replacing the depthwise convolution in the VSS module, the MSF module effectively mitigates the insufficient utilization of spatial distribution information during local modeling. This enhancement significantly improves the model’s sensitivity to local spatial variations.
First, the input features are divided into G groups along the channel dimension, each group containing C/G channels with diverse semantic information. To reduce the number of parameters and improve efficiency, we set G = 32. Each group of features is processed through three parallel branches: two using 1D adaptive average pooling, and one using a 3 × 3 convolution. The 1D adaptive average pooling branches perform channel attention encoding along the temporal and frequency dimensions, respectively, capturing global information in both domains. The outputs of the two pooling branches are concatenated and fused using a 1 × 1 convolution. The result is then split into two 1D vectors, activated by a sigmoid function, and multiplied element-wise with the input feature map to produce the first attention-enhanced feature map $F_1$. The 3 × 3 convolution branch enhances local features, yielding the second attention-aware feature map $F_2$.
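A minimal PyTorch sketch of this first stage is given below. It follows the description above (grouped channels, two 1D adaptive pooling branches fused by a 1 × 1 convolution, and a parallel 3 × 3 branch), but the exact layer ordering and the class name MSFAttentionBranches are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MSFAttentionBranches(nn.Module):
    """First stage of the MSF block: grouped dual-pooling attention plus a
    3x3 local branch (illustrative sketch, not the authors' code)."""
    def __init__(self, channels, groups=32):
        super().__init__()
        self.groups = groups
        c = channels // groups
        self.pool_t = nn.AdaptiveAvgPool2d((None, 1))  # pool over frequency, keep time
        self.pool_f = nn.AdaptiveAvgPool2d((1, None))  # pool over time, keep frequency
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, t, f = x.shape
        g = x.reshape(b * self.groups, c // self.groups, t, f)
        # Channel attention encoded along the temporal and frequency axes
        a_t = self.pool_t(g)                           # (B*G, c/G, T, 1)
        a_f = self.pool_f(g).permute(0, 1, 3, 2)       # (B*G, c/G, F, 1)
        a = self.conv1x1(torch.cat([a_t, a_f], dim=2))
        a_t, a_f = torch.split(a, [t, f], dim=2)
        f1 = g * a_t.sigmoid() * a_f.permute(0, 1, 3, 2).sigmoid()
        f2 = self.conv3x3(g)                           # local enhancement branch
        return f1.reshape(b, c, t, f), f2.reshape(b, c, t, f)

x = torch.randn(2, 64, 50, 16)
f1, f2 = MSFAttentionBranches(64, groups=32)(x)
print(f1.shape, f2.shape)
```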
Next, the two branches $F_1$ and $F_2$, which encode different types of spatial information, are aggregated across spatial dimensions to enhance the directional representation of the features. Two-dimensional adaptive average pooling is applied separately to $F_1$ and $F_2$ to generate global context vectors along the channel dimension. These vectors are then transformed and activated using a softmax function to produce inter-channel attention weights $w_1$ and $w_2$. The outputs of each branch are multiplied by their corresponding weights through matrix multiplication, resulting in two spatial attention-enhanced feature maps. The two maps are then element-wise summed and passed through a nonlinear activation function to produce the final reweighted feature map $Z$. The computation process can be formally described as follows:

$$w_1 = \mathrm{Softmax}\big( \mathrm{Pool}( \mathrm{GN}(F_1) ) \big), \qquad w_2 = \mathrm{Softmax}\big( \mathrm{Pool}(F_2) \big),$$
$$Z = \sigma\big( w_1 \otimes \mathrm{GN}(F_1) + w_2 \otimes F_2 \big),$$

where GN denotes group normalization, Pool denotes two-dimensional adaptive average pooling, $\sigma$ is the nonlinear activation function, and $\otimes$ represents matrix multiplication.
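The sketch below implements one plausible reading of this cross-branch reweighting. In particular, how the fused spatial map is applied back to the branch features is an assumption (Figure 4b would fix this detail), and the class name and group count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossBranchReweight(nn.Module):
    """Second stage of MSF: fuse the two attention branches with
    channel-context weights (illustrative sketch following the text)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)

    def forward(self, f1, f2):
        b, c, t, f = f1.shape
        f1 = self.gn(f1)
        # Global context vectors along the channel dimension, softmax-normalized
        w1 = F.softmax(F.adaptive_avg_pool2d(f1, 1).flatten(1), dim=1)   # (B, C)
        w2 = F.softmax(F.adaptive_avg_pool2d(f2, 1).flatten(1), dim=1)   # (B, C)
        # Matrix-multiply the weights with the flattened branch outputs
        m1 = torch.bmm(w1.unsqueeze(1), f1.flatten(2)).view(b, 1, t, f)
        m2 = torch.bmm(w2.unsqueeze(1), f2.flatten(2)).view(b, 1, t, f)
        # Element-wise sum and nonlinear activation give a fused spatial map,
        # which is applied back to the summed branch features (assumed)
        z = torch.sigmoid(m1 + m2)
        return (f1 + f2) * z

f1, f2 = torch.randn(2, 64, 50, 16), torch.randn(2, 64, 50, 16)
print(CrossBranchReweight(64)(f1, f2).shape)    # (2, 64, 50, 16)
```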
Finally, a channel attention mechanism is applied to perform weighted fusion on the feature map Z. The feature map is processed by two convolutional branches with different receptive fields to extract multi-scale feature representations, denoted as $Z_1$ and $Z_2$. The outputs $Z_1$ and $Z_2$ are then summed element-wise to obtain the combined multi-scale feature representation. Adaptive weights $\alpha_1$ and $\alpha_2$ for the two branches are subsequently computed. Finally, a weighted fusion of the two branches is performed using $\alpha_1$ and $\alpha_2$, enabling adaptive selection of information across different scales. The overall process is outlined as follows:

$$Z_1 = \mathrm{DWConv}_1(Z), \qquad Z_2 = \mathrm{DWConv}_2(Z),$$
$$[\alpha_1, \alpha_2] = \mathrm{Softmax}\big( \mathrm{FC}( \mathrm{Conv}( \mathrm{GAP}(Z_1 + Z_2) ) ) \big),$$
$$Y = \alpha_1 \otimes Z_1 + \alpha_2 \otimes Z_2,$$

where DWConv1 denotes a 3 × 3 depthwise convolution, DWConv2 represents a 5 × 5 depthwise convolution, Conv refers to a 1 × 1 convolution, FC indicates a fully connected layer, GAP denotes global average pooling, and ⊗ denotes element-wise multiplication.
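The following sketch illustrates this selective, scale-aware fusion in PyTorch. The squeeze ratio, the use of global average pooling, and the softmax normalization of the weights are assumptions filling in details not stated above; the class name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveScaleFusion(nn.Module):
    """Final stage of MSF: selective-kernel style fusion of a 3x3 and a 5x5
    depthwise branch (illustrative sketch; reduction ratio is assumed)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),   # 1x1 compression
            nn.ReLU(inplace=True))
        self.fc = nn.Linear(channels // reduction, 2 * channels)

    def forward(self, z):
        b, c, _, _ = z.shape
        z1, z2 = self.dw3(z), self.dw5(z)                    # multi-scale branches
        s = self.squeeze(F.adaptive_avg_pool2d(z1 + z2, 1))  # global context
        w = self.fc(s.flatten(1)).view(b, 2, c)
        w1, w2 = torch.softmax(w, dim=1).unbind(dim=1)       # adaptive weights
        return z1 * w1.view(b, c, 1, 1) + z2 * w2.view(b, c, 1, 1)

z = torch.randn(2, 64, 50, 16)
print(SelectiveScaleFusion(64)(z).shape)                     # (2, 64, 50, 16)
```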
3.4. WTED Module
To improve feature representation and preserve critical information during the downsampling stage, we propose a Wavelet Transform-Enhanced Downsampling (WTED) module, as illustrated in Figure 5. This module integrates a convolutional downsampling branch and a wavelet transform branch, combined with an adaptive channel-weighted fusion mechanism. It effectively merges features across different spatial scales and frequency bands, providing richer and more robust representations for subsequent layers.
The WTED module consists of two main branches: a convolutional branch and a wavelet transform branch. The convolutional branch utilizes a 2D convolution (k = 2, s = 2), followed by batch normalization, to perform spatial downsampling and extract preliminary features. The wavelet transform branch applies a Discrete Wavelet Transform (DWT) [31] to decompose the input into a low-frequency component (LL) and high-frequency components, with the high-frequency details represented as horizontal (HL), vertical (LH), and diagonal (HH). The decomposition is computed as follows:

$$\{ X_{LL},\; X_{LH},\; X_{HL},\; X_{HH} \} = \mathrm{DWT}(X),$$

where $X$ is the input of the WTED module.
In this study, the db4 wavelet is chosen for its well-balanced trade-off between computational efficiency and the ability to capture local features. Compared to the Haar wavelet, db4 provides greater smoothness and a shorter filter length, allowing for more effective preservation of signal edges and fine-grained details.
To enhance spatial consistency in the wavelet branch, all frequency subbands are rescaled to the target downsampling size via linear interpolation. The low-frequency component and directional high-frequency components are then concatenated along the channel dimension. A 1 × 1 convolution block is applied to compress the features and introduce nonlinearity. The overall computation can be formulated as follows:

$$F_{\mathrm{dwt}} = \mathrm{Conv}_{1\times 1}\big( \mathrm{Concat}( \mathrm{Resize}(X_{LL}),\, \mathrm{Resize}(X_{LH}),\, \mathrm{Resize}(X_{HL}),\, \mathrm{Resize}(X_{HH}) ) \big),$$

where Resize denotes linear interpolation to the target downsampling size and Conv1×1 denotes the 1 × 1 convolution block.
To fuse the features obtained from the two downsampling branches, the WTED module applies a channel-weighted summation. The computation is defined as follows:

$$F_{\mathrm{out}} = \alpha \otimes F_{\mathrm{conv}} + \beta \otimes F_{\mathrm{dwt}},$$

where $F_{\mathrm{conv}}$ denotes the feature map of the convolutional branch, $F_{\mathrm{dwt}}$ denotes the feature map of the DWT branch, and $\alpha$ and $\beta$ denote the corresponding channel-wise weighting coefficients, satisfying $\alpha + \beta = 1$.
This fusion strategy introduces learnable channel-wise weighting parameters for each branch, enabling the model to adaptively adjust the contribution of convolutional and wavelet features based on the feature distribution of different acoustic scenes or event types. Such dynamic balancing enhances the network’s ability to model both spatial structures and multi-band frequency characteristics.
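A compact sketch of the WTED module under the description above is shown below. The use of the pytorch_wavelets package for the db4 DWT, the softmax parameterization that enforces the α + β = 1 constraint, and the BatchNorm/GELU placement in the 1 × 1 block are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from pytorch_wavelets import DWTForward   # assumed tooling for the db4 DWT

class WTED(nn.Module):
    """Wavelet Transform-Enhanced Downsampling (illustrative sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Convolutional downsampling branch: k = 2, s = 2, then BN
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.BatchNorm2d(out_ch))
        # Wavelet branch: single-level db4 decomposition
        self.dwt = DWTForward(J=1, wave="db4", mode="zero")
        self.compress = nn.Sequential(
            nn.Conv2d(4 * in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch), nn.GELU())
        # Learnable channel-wise fusion weights, constrained so alpha + beta = 1
        self.logits = nn.Parameter(torch.zeros(2, out_ch))

    def forward(self, x):
        f_conv = self.conv_branch(x)
        yl, yh = self.dwt(x)                          # LL and (LH, HL, HH)
        subbands = [yl] + list(yh[0].unbind(dim=2))
        size = f_conv.shape[-2:]
        subbands = [F.interpolate(s, size=size, mode="bilinear",
                                  align_corners=False) for s in subbands]
        f_dwt = self.compress(torch.cat(subbands, dim=1))
        alpha, beta = torch.softmax(self.logits, dim=0).unbind(0)
        return alpha.view(1, -1, 1, 1) * f_conv + beta.view(1, -1, 1, 1) * f_dwt

x = torch.randn(2, 64, 50, 16)
print(WTED(64, 128)(x).shape)                         # (2, 128, 25, 8)
```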
4. Experiments
4.1. Dataset
In this study, the DCASE 2021 Task 3 and DCASE 2022 Task 3 datasets were selected for model evaluation. Both datasets are officially provided by DCASE and represent the most widely recognized public benchmarks for sound event localization and detection (SELD), making them authoritative and comparable. Compared with other databases, these datasets align more closely with the research objectives in terms of task definition, format, and sound source types, allowing an effective assessment of model performance across different scenarios.
The DCASE 2021 Task 3 dataset is based on the TAU Spatial Room Impulse Response (TAU SRIR) database and the NIGENS sound event library. Multi-channel audio is generated through acoustic simulation and convolution. Specifically, clean event segments are selected from the NIGENS library and assigned class labels. These segments are then spatialized using room impulse responses from TAU SRIR, producing multi-channel audio with precise directional annotations (azimuth and elevation). Each scene contains 1 to 3 simultaneous events, with varying room reverberation and background noise to enhance diversity and realism. The dataset is provided in two formats: first-order Ambisonics (FOA) and microphone array (MIC), sampled at 24 kHz, with signal-to-noise ratios (SNR) ranging from 6 dB to 30 dB, covering 12 common indoor sound event classes (e.g., phone ringing, alarm, door knock, laughter). The dataset contains 600 recordings: 400 for training, 100 for validation, and 100 for testing, ensuring no overlap between training and test sets.
The DCASE 2022 Task 3 dataset (STARSS22) extends the 2021 dataset by including real recording scenarios. Audio was captured in office and meeting environments using high-resolution spherical microphone arrays. Accurate temporal and spatial annotations were obtained through manual labeling combined with optical tracking systems. To increase the number of training samples, the dataset also provides synthetic audio, generated by convolving publicly available event segments with real room impulse responses. The final dataset includes 121 real recordings (67 training, 54 testing; total duration 4 h 52 min) and 1200 one-minute synthetic recordings for training and evaluation. All recordings are available in FOA and MIC formats, sampled at 24 kHz, covering 13 sound event classes.
In addition, ablation experiments were conducted on the DCASE 2021 Task 3 dataset to validate the effectiveness of each model component. By comparing results on both synthetic and real recordings, the experiments comprehensively assess the model’s adaptability and robustness across different acoustic environments and event types.
4.2. Evaluation Indicators
We adopt the same evaluation metrics as those used in DCASE 2021 Task 3: the event-based error rate (ER20°) and F-score (F20°) for Sound Event Detection (SED), and the localization error (LE) and localization recall (LR) for Direction-of-Arrival (DoA) estimation. The threshold of 20° indicates that an SED prediction is considered correct only if the corresponding DoA estimation error is less than 20°. The SELD score is computed as follows:

$$\mathrm{SELD} = \frac{\mathrm{ER}_{20^{\circ}} + \left( 1 - \mathrm{F}_{20^{\circ}} \right) + \frac{\mathrm{LE}}{180^{\circ}} + \left( 1 - \mathrm{LR} \right)}{4}.$$

Among these evaluation metrics, ER20° quantifies the proportion of missed detections, false alarms, and insertions in event detection, providing a measure of the system’s reliability in determining whether an acoustic event occurs. F20° combines precision and recall to evaluate the overall performance of event classification, reflecting the model’s accuracy in identifying sound event categories. LE represents the average angular deviation between the predicted and reference directions of arrival, given that the event class is correctly detected, indicating the precision of spatial localization. LR denotes the percentage of active sources that are correctly localized within a predefined angular threshold (e.g., 20°), highlighting the model’s ability to both detect and approximately localize active sound sources.

In this evaluation scheme, lower ER20° and higher F20° values indicate better SED performance, while smaller LE and higher LR values reflect superior DoA estimation accuracy.
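For reference, the aggregate score above can be computed with a few lines of Python; the metric values in the example are placeholders, not results from the paper.

```python
def seld_score(er_20, f_20, le_deg, lr):
    """Aggregate SELD score from the four DCASE metrics (lower is better)."""
    return (er_20 + (1.0 - f_20) + le_deg / 180.0 + (1.0 - lr)) / 4.0

# Example with illustrative metric values (not taken from the paper)
print(round(seld_score(er_20=0.40, f_20=0.65, le_deg=14.0, lr=0.70), 3))
```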
4.3. Experimental Configuration
All experiments were conducted on a server equipped with an NVIDIA GeForce RTX 3090 GPU. The software environment included Python 3.8 and the PyTorch deep learning framework (v2.2.1 with CUDA v11.8). Training was performed for 32 epochs using the Adam optimizer [32]. The initial learning rate was set to 3 × 10⁻⁴ and linearly decayed to 1 × 10⁻⁴ over the course of training. The batch size was set to 60. Given the limited size of the dataset, all experiments employed data augmentation techniques, including Channel Swapping (CS) [33], Random Cropping (RC) [34,35], and Frequency Shifting (FS) [36].
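The optimizer settings above can be reproduced roughly as follows; stepping a LinearLR schedule once per epoch is an assumption about how the linear decay was implemented, and `model` is a stand-in for FFMamba.

```python
import torch

# Adam with a learning rate decayed linearly from 3e-4 to 1e-4 over 32 epochs
model = torch.nn.Linear(10, 10)            # stand-in module for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=1e-4 / 3e-4, total_iters=32)

for epoch in range(32):
    # ... one epoch of training with the augmented batches ...
    scheduler.step()
```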
4.4. Ablation Experiments
To verify the effectiveness of the proposed modules in long-sequence modeling, multi-scale feature fusion, and spatial feature representation, we conducted a series of ablation studies on the DCASE 2021 Task 3 dataset. The experimental results are summarized in Table 1. The baseline model (Base) was constructed by removing both the MSF and WTED modules, replacing them with a conventional depthwise convolution and a standard convolutional downsampling structure, respectively. The results show that the baseline model performed worse across all evaluation metrics. Specifically, the ER20° and LE increased by 0.025 and 0.3°, respectively, while the overall SELD score degraded by 0.018. In addition, F20° and LR decreased by 2.0% and 2.3%, respectively. These findings suggest that conventional convolutional structures are limited in their ability to model spatiotemporal features in complex acoustic environments.
After introducing the MSF module to form the MSFVSS module, the model exhibited notable improvements in both SED and DoA estimation tasks. In particular, the F20° and LR metrics improved significantly by 1.4% and 2.5%, respectively. Meanwhile, the ER20° and LE metrics decreased by 0.016 and 0.02°, respectively. These results confirm the effectiveness of the MSF module’s multi-scale feature fusion and spatial attention mechanisms in enhancing local spatial awareness.
With the additional integration of the WTED module, the overall model performance was further improved. Although the LE metric slightly increased (from 13.2° to 13.4°), the F20°, LR, and SELD score still outperformed those of the baseline model. The SELD score decreased from 0.255 to 0.245, indicating an overall performance improvement. This observation suggests that the WTED module enhances the preservation of critical spatial-frequency information by introducing wavelet-based frequency detail modeling.
4.5. Comparative Experiments of Different Models
First, we compared the proposed FFMamba model with several recent methods on the DCASE 2021 Task 3 dataset, as summarized in Table 2. All comparison models are based on Transformer and CNN architectures. The results show that the proposed model achieves superior performance. In addition, the Base model exhibits a higher LE compared to GLFER-Net and AD-YOLO, with differences of 2.0° and 0.2°, respectively. However, the Base model still outperforms the others in overall performance, indicating that incorporating the VSS module into the SELD task can effectively enhance both event classification and localization capabilities. The proposed FFMamba model demonstrates stronger capability in capturing and integrating long-range audio dependencies and complex multi-scale features. It achieves more comprehensive spatial feature extraction, leading to improved localization and classification of diverse sound events.
On the more challenging DCASE 2022 Task 3 dataset (Table 3), our model also demonstrates strong generalization and robustness. The FFMamba model achieved the best overall SELD score (0.35), substantially outperforming the official baseline (0.55) as well as recent models such as GLFER-Net (0.46) and AAC-enhanced EINV2 (0.391). In terms of SED, our model improved the F20° to 54.3%, far surpassing the baseline’s 21.0%, highlighting its superior event recognition capability in complex real-world environments. In terms of DoA, our model achieved the highest LR of 68.3%. Despite a slightly higher LE compared with AAC-enhanced EINV2, the superior performance on key classification metrics such as ER and F20° demonstrates that our model maintains a better balance between detection accuracy and reliability, which is critical for practical applications.
4.6. Visual Analysis
A visual analysis was conducted using the “fold6_room1_mix002” audio clip from the DCASE 2021 Task 3 test set, as illustrated in Figure 6. The first and second rows of the figure show the visualizations of azimuth and elevation angles for the DoA estimation task, respectively, while the third row presents the results for the SED task. In the azimuth subplots (Figure 6a,b), the predictions generated by the FFMamba model closely follow the reference trajectories over most time intervals, demonstrating strong horizontal localization accuracy. For sound sources with significant motion trajectories (e.g., the bell sound marked in red), minor tracking deviations are observed near abrupt angular transitions, as highlighted by the orange bounding box in the reference image. In the elevation subplots (Figure 6c,d), the reference trajectories often appear as step functions or constant values, indicating minimal elevation variation among most sound sources. The predicted elevation angles show smooth and continuous trends in some events, suggesting that the FFMamba model possesses temporal modeling capabilities and can effectively capture gradual changes in elevation. Although slight deviations are observed in the initial elevation values for certain events, the overall trend aligns well with the ground truth. In the SED subplots (Figure 6e,f), the model accurately identifies the activity of multiple sound sources at most time points, achieving high consistency with the reference annotations and demonstrating the model’s strong performance in sound event recognition.
5. Conclusions
This paper proposes FFMamba, a multi-scale feature fusion network for SELD. The architecture integrates Visual State Space (VSS) modeling with the strengths of multi-scale convolution, enabling effective capture of long-range temporal dynamics and preservation of local time–frequency features. By introducing two key modules—Multi-Scale Fusion Visual State Space (MSFVSS) and Wavelet Transform-Enhanced Downsampling (WTED)—the model significantly enhances its ability to capture and preserve time–frequency characteristics in audio signals. The MSF module improves local spatial feature representation by fusing multi-scale spatial features with an attention mechanism. The WTED module combines convolutional modeling for local spatial features with wavelet decomposition for multi-band frequency information, thereby improving the retention of critical spatial–spectral features. Extensive experiments on the DCASE 2021 and DCASE 2022 Task 3 datasets—including comparative and ablation studies—demonstrate the robustness and effectiveness of the proposed approach in both SED and DoA subtasks, outperforming mainstream models across multiple key metrics. Visualization results further confirm FFMamba’s accuracy in spatial angle estimation and event detection, highlighting its strong temporal modeling and localization capabilities. Overall, the proposed FFMamba model offers an effective pathway for integrating multi-scale perception with state space modeling in SELD tasks, contributing significantly to sound understanding in complex acoustic environments.
Despite its strong performance, the FFMamba model still has limitations, particularly in terms of computational overhead and the completeness of spatial modeling. Future work may focus on model lightweighting, multi-modal fusion, and enhanced elevation modeling, which are promising directions for further exploration.