DOM-MUSE: A Deformable Omnidirectional State Space Architecture for Efficient Speech Enhancement

Li, Tsung-Jung; Su, Bo-Yu; Lin, Jung-Shan; Hung, Jeih-Weih

doi:10.3390/electronics15102159


Department of Electrical Engineering, National Chi Nan University, Nantou 545301, Taiwan

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(10), 2159; https://doi.org/10.3390/electronics15102159

Submission received: 22 April 2026 / Revised: 7 May 2026 / Accepted: 14 May 2026 / Published: 18 May 2026

(This article belongs to the Special Issue Recent Advances in Audio, Speech and Music Processing and Analysis, 2nd Edition)


Abstract

Transformer-based speech enhancement (SE) architectures suffer from high computational complexity, while existing lightweight state space model (SSM) approaches are constrained to fixed one-dimensional scanning that cannot fully exploit the two-dimensional time–frequency structure of speech spectrograms. To address these limitations, we propose DOM-MUSE, a lightweight U-Net-style SE framework built upon the Mamba-2 SSM with four targeted innovations. First, a Deformable Feature Extractor (DFE) predicts per-location spatial offsets that warp the feature sampling grid to align with speech formant trajectories and harmonic structures, providing geometrically coherent inputs to the state space model. Second, a DOM Mamba Block with Cross-Dimensional Gated Fusion (CDGF) deploys two parallel Mamba-2 instances scanning the time and frequency axes independently, and uses Taylor Channel Attention (TCA) to derive semantic gates that modulate each SSM output before fusion. Third, a Phase-Guided Feature Conditioner (PGFC) computes local phase-gradient gates that suppress noise-dominated activations prior to the SSM stage, making the feature extraction pathway implicitly phase-aware. Fourth, an Attention-Based Skip Connection (ABSC) replaces conventional concatenation skip connections with a learned channel gate, adaptively controlling the information flow from the encoder to the decoder. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DOM-MUSE outperforms the reproduced MUSE baseline on all five evaluation metrics (PESQ +0.077, CSIG +0.058, CBAK +0.026, COVL +0.070, and STOI +0.002) while reducing the parameter count by 24% (0.51 M to 0.39 M).
Notably, DOM-MUSE also surpasses MUSE++ on perceptual quality metrics (PESQ +0.061, COVL +0.032) despite MUSE++ employing dynamic SNR augmentation and an augmented multi-objective loss that DOM-MUSE deliberately omits, demonstrating that the proposed architectural innovations yield genuine improvements independent of training strategy. When DOM-MUSE is additionally trained under the same augmented protocol as MUSE++, it achieves PESQ of 3.46 and COVL of 4.22, further confirming the complementary nature of architectural and training improvements.

Keywords:

speech enhancement; Mamba state space model; deformable convolution; time–frequency modeling; phase-guided conditioning; lightweight neural network

1. Introduction

Speech enhancement (SE) aims to recover clean speech from noise-corrupted recordings and serves as a foundational front-end component for a wide range of applications, including mobile telephony, hearing aids, smart assistants, remote conferencing, and automatic speech recognition (ASR) [1,2,3]. Despite decades of research, SE remains challenging in real-world conditions where noise is non-stationary, the signal-to-noise ratio (SNR) is low, and computational resources are constrained.

Classical SE methods, such as spectral subtraction [4], Wiener filtering, and the minimum mean-square error (MMSE) log-spectral amplitude estimator [5], primarily operate in the frequency domain by estimating a noise-suppressing gain applied to the magnitude spectrum. Although computationally efficient, these approaches largely ignore phase information [6] and degrade rapidly under non-stationary or highly dynamic noise conditions.

The advent of deep learning transformed SE by enabling complex, data-driven mappings between noisy and clean speech. Convolutional neural networks (CNNs), recurrent networks, and later U-Net encoder–decoder architectures [7] demonstrated strong generalization to diverse acoustic environments. Further gains were achieved by adopting Transformer-based attention mechanisms to model long-range spectro-temporal dependencies, leading to a series of high-performing yet parameter-heavy models [8,9]. Alongside these performance advances, growing awareness of phase information as a critical component for perceptual quality [6] prompted the development of phase-aware architectures, such as PHASEN [10] and magnitude-phase dual-path networks [11,12], which jointly predict magnitude masks and phase corrections.

The scalability gap between Transformer-based models and the requirements of resource-constrained deployment sparked renewed interest in lightweight SE architectures. Methods such as TSTNN [13], DPT-FSNet [9], MetricGAN-OKDv2 [14], and MANNER-S [15] pushed the efficiency frontier through bottleneck compression, grouped convolutions, and efficient attention approximations. In parallel, SSMs emerged as a compelling alternative to self-attention: Mamba [16] and its successor Mamba-2 [17] achieve linear complexity with respect to sequence length while maintaining competitive modeling capacity, making them attractive for long-sequence audio processing.

Building on this foundation, MUSE [18] introduced a lightweight U-Net-based SE framework centered on the Multi-path Enhanced Taylor (MET) Transformer block. By combining Deformable Embedding, Channel-and-Spatial Attention (CSA), and a Taylor-series approximation to softmax attention, MUSE achieves competitive SE performance with only 0.51 M parameters. Its successor, MUSE++ [19], further reduced model complexity by replacing the MET Transformer with a 1D Mamba-2 block—cutting the parameter count to 0.17 M—and supplemented the architecture with dynamic SNR-based data augmentation and a multi-objective loss function to compensate for the reduced representational capacity.

Despite these advances, both MUSE and MUSE++ share several fundamental limitations that are rooted in concrete acoustic and signal-processing properties. First, standard rectangular convolution grids impose an implicit assumption that speech features are locally stationary and geometrically uniform in the time–frequency plane. This assumption is violated by the curved trajectories of formants and the harmonic structures of voiced speech, which follow the glottal pulse rate and vocal tract resonances rather than fixed horizontal lines. Second, MUSE++ models sequential dependencies along a single (typically temporal) axis using a unidirectional 1D Mamba, thereby ignoring cross-frequency correlations that arise from the harmonic energy distribution across frequency bins for a given fundamental frequency. Third, local phase gradients $\partial\theta / \partial t$ provide a physically grounded cue for separating speech from noise: harmonics produce smooth, predictable phase progressions, while noise produces erratic, high-variance gradients. Neither model exploits this property to condition the feature extraction process itself; the phase is treated only as a post hoc output target. Fourth, the standard concatenation-based skip connections in both models treat all encoder features indiscriminately, potentially propagating redundant or noise-corrupted information to the decoder.

To address these limitations simultaneously, we propose DOM-MUSE (Deformable Omnidirectional Mamba-based MUSE), a lightweight SE framework that retains the efficient U-Net skeleton of MUSE while introducing four targeted innovations, each derived from the above acoustic and signal-processing observations. The main contributions of this work are as follows:

Deformable Feature Extractor (DFE). We introduce a learnable deformable convolution module that predicts per-location 2D spatial offsets and modulation masks, warping the feature sampling grid to align with speech formant trajectories and harmonic structures. This enables the subsequent SSM to process geometrically coherent feature sequences rather than arbitrary rectangular patches.
DOM Mamba Block with Cross-Dimensional Gated Fusion (CDGF). We design a parallel dual-branch Mamba block that scans the time and frequency axes independently with two dedicated Mamba-2 instances. A TCA module computes a channel covariance projection that generates axis-specific semantic gates, allowing high-level channel semantics to modulate state transitions along both the temporal and spectral dimensions simultaneously—rather than fusing them by naive concatenation that treats all channel activations equally.
Phase-Guided Feature Conditioner (PGFC). We propose a lightweight conditioning module that uses local phase-gradient coherence as a discriminative signal to generate per-channel attention gates, suppressing noise-dominated activations prior to the SSM stage. This exploits the statistical distinction between smooth harmonic phase progressions and erratic noise-phase patterns, making the feature extraction pathway implicitly phase-aware without incurring significant computational overhead.
Attention-Based Skip Connection (ABSC). We replace the conventional concatenation skip connection with a content-adaptive channel gate that computes a bottleneck attention weight from both encoder and decoder features. This selectively incorporates encoder information based on its relevance to the current decoding context, avoiding the channel-dimension doubling and information conflict associated with naive concatenation.

Experiments on the VoiceBank-DEMAND benchmark [20,21] demonstrate that DOM-MUSE outperforms the reproduced MUSE across all five evaluation metrics (PESQ +0.077, CSIG +0.058, CBAK +0.026, COVL +0.070, and STOI +0.002) while reducing the parameter count by 24% (0.51 M to 0.39 M). Compared with MUSE++, DOM-MUSE achieves higher scores on the perceptual quality metrics PESQ (+0.061) and COVL (+0.032), suggesting that the proposed architectural innovations translate into meaningful enhancement quality improvements beyond what a plain 1D Mamba can deliver.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the MUSE and MUSE++ backbone architectures. Section 4 presents the proposed DOM-MUSE framework and its individual components. Section 5 describes the experimental setup and evaluation metrics. Section 6 reports the quantitative results with analysis. Section 7 concludes this paper and outlines future directions.

2. Related Work

2.1. U-Net-Based Speech Enhancement

The U-Net encoder–decoder architecture was adapted for SE by SEGAN [7], which first demonstrated end-to-end waveform enhancement via skip-connected convolutional layers. MP-SENet [12,22] further advanced the U-Net design by introducing parallel magnitude and phase decoder branches, showing that jointly estimating both spectral components substantially improves perceptual quality over magnitude-only approaches. MUSE [18] built upon the MP-SENet architecture by incorporating Deformable Embedding and a Multi-path Enhanced Taylor Transformer block, achieving competitive performance at only 0.51 M parameters. A common limitation of existing U-Net-based SE models is their reliance on fixed-weight, concatenation-based skip connections, which treat all encoder features identically regardless of their relevance at each decoder stage. Recent work on attention-based residual connections [23] has demonstrated that replacing fixed additive residuals with content-adaptive gating improves representational quality; DOM-MUSE applies this principle at two levels—within the DOM Block (Scalar AttnRes) and across the encoder–decoder interface (ABSC).

2.2. Transformer-Based Speech Enhancement

Transformer-based attention mechanisms have been widely applied in SE to model long-range spectro-temporal dependencies. DB-AIAT [8] proposed dual-branch intra- and inter-chunk attention to balance local and global context, while DPT-FSNet [9] adopted dual-path Transformers with full-band and sub-band processing to capture both fine and coarse spectral patterns. These models achieve strong enhancement performance but at the cost of quadratic attention complexity and large parameter counts, limiting their applicability in resource-constrained scenarios. To reduce attention complexity, the Taylor Transformer [18] approximates softmax self-attention using a first-order Taylor expansion, yielding linear complexity in the sequence length. DOM-MUSE retains the Taylor Channel Attention concept from MUSE as the semantic gating mechanism in CDGF, while delegating the primary sequence modeling role to the more efficient Mamba-2 SSM.

2.3. State Space Models for Speech Enhancement

Mamba [16] introduced a selective SSM with input-dependent state transitions and a hardware-aware parallel scan algorithm, enabling linear-time sequence modeling with strong representational capacity. Mamba-2 [17] extended this framework by establishing a formal duality between SSMs and structured attention. Early studies on applying Mamba to SE [24,25] demonstrated that Mamba-based blocks can serve as effective drop-in replacements for Transformer blocks, with lower computational overhead and competitive performance. MUSE++ [19] was among the first to systematically replace the Transformer block in a lightweight SE framework with a 1D Mamba-2 module, achieving a dramatic reduction in parameter count (0.51 M to 0.17 M). However, the 1D unidirectional formulation of MUSE++ scans only a single flattened axis, limiting its ability to capture the inherently two-dimensional time–frequency structure of speech spectrograms. DOM-MUSE addresses this by deploying two parallel Mamba-2 instances that explicitly scan the time and frequency axes independently, combined with a Cross-Dimensional Gated Fusion mechanism that coordinates their outputs through channel-level semantics.

2.4. Phase-Aware Speech Enhancement

The importance of phase estimation in SE was established by Paliwal et al. [6], who showed that accurate phase reconstruction is critical for perceptual quality, especially at low SNRs. PHASEN [10] was among the first deep SE models to explicitly model phase using a two-stream architecture. More recent works, including MP-SENet [12] and its successor [22], demonstrated that parallel magnitude-phase decoding with explicit phase supervision outperforms implicit approaches. A key distinction of DOM-MUSE from prior phase-aware methods is that phase information is used not only to supervise the output, but also to actively condition the feature extraction process via the PGFC, making the entire sequence modeling pathway implicitly phase-aware.

2.5. Deformable Convolution in Audio Processing

Deformable convolution [26] was originally proposed for object detection in computer vision, where it learns spatial offsets to warp the convolution sampling grid toward the target object’s geometry. MUSE [18] introduced Deformable Embedding as a patch feature extractor, using learned offsets to adapt the receptive field to voiceprint structures. In contrast, IMSE [27] argued that the dynamic offset computation in MUSE introduces additional computational burden, and proposed replacing it with a static Inception Depthwise Convolution (IDConv) that captures the anisotropic patterns of spectrograms without deformable operations. DOM-MUSE takes a different perspective: rather than eliminating deformable convolution, we place it as a dedicated preprocessing step (DFE) before the SSM stage, ensuring that the state space model operates on geometrically coherent sequences that follow the natural curvature of formant trajectories. This targeted placement allows the deformable sampling to directly benefit the Mamba scanning, rather than being embedded in the patch embedding as in MUSE.

3. Backbone: MUSE and MUSE++

DOM-MUSE is built directly upon the MUSE++ framework [19], which itself extends the original MUSE architecture [18]. This section briefly reviews the key design choices of both models, as they form the starting point for our proposed improvements.

3.1. MUSE

MUSE (Multi-path Enhanced Taylor Transformer-based U-Net for Speech Enhancement) is a lightweight SE framework with only 0.51 M parameters. Given a time-domain input $x$, the Short-Time Fourier Transform (STFT) produces a magnitude spectrogram $M_{x} \in \mathbb{R}^{T \times F}$ and a phase spectrogram $\theta_{x} \in \mathbb{R}^{T \times F}$. The magnitude is compressed using a power-law transform $M_{x}^{(c)} = M_{x}^{\gamma}$ (with $\gamma = 0.3$), and the compressed magnitude and raw phase are concatenated to form a two-channel input tensor that enters a U-Net encoder–decoder.
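The input construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `build_input`, the random complex spectrogram, and the sizes $T = 100$, $F = 201$ are hypothetical and do not come from the MUSE implementation.

```python
import numpy as np

def build_input(spec: np.ndarray, gamma: float = 0.3) -> np.ndarray:
    """Stack the power-law-compressed magnitude and the raw phase into a (2, T, F) tensor."""
    magnitude = np.abs(spec)            # M_x, shape (T, F)
    phase = np.angle(spec)              # theta_x, shape (T, F)
    compressed = magnitude ** gamma     # M_x^(c) = M_x^gamma
    return np.stack([compressed, phase], axis=0)

# Hypothetical complex spectrogram standing in for the STFT of a waveform.
rng = np.random.default_rng(0)
spec = rng.standard_normal((100, 201)) + 1j * rng.standard_normal((100, 201))
features = build_input(spec)
print(features.shape)  # (2, 100, 201)
```

With $\gamma = 0.3$ the compression reduces the dynamic range of the magnitude, so that low-energy components are not dwarfed by high-energy ones in the network input.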

The core processing unit of MUSE is the Multi-path Enhanced Taylor (MET) Transformer block, which integrates three components: (1) Deformable Embedding (DE) that adapts the receptive field to voiceprint geometry, (2) a Taylor-MSA branch that approximates softmax attention in linear time via a first-order Taylor expansion, and (3) a Channel-and-Spatial Attention (CSA) branch that compensates for information loss in the Taylor approximation. The outputs of the three branches are fused element-wise and passed through a feed-forward network (FFN). At the output, enhanced magnitude and phase features are decoded by separate branches and recombined via the inverse STFT (iSTFT).

3.2. MUSE++

MUSE++ replaces the MET Transformer block with a 1D Mamba-2 module, reducing the parameter count from 0.51 M to 0.17 M. The 1D Mamba-2 block processes a flattened time–frequency sequence using a selective state space model with linear complexity, defined by the recurrence:

$$h_{t} = \gamma_{t} h_{t-1} + v_{t} k_{t}^{\top}, \tag{1}$$

$$y_{t} = h_{t} q_{t}, \tag{2}$$

where $\gamma_{t}$ is a learned forgetting factor and $\{v_{t}, k_{t}, q_{t}\}$ are learned projections of the input at step $t$. To compensate for the reduced representational capacity of the plain 1D Mamba-2 backbone, MUSE++ introduces dynamic SNR-based data augmentation and an augmented multi-objective loss that adds STFT consistency, time-domain, and multi-resolution STFT terms to the original MUSE loss. Specifically, during training, MUSE++ mixes clean utterances with noise at SNR values sampled uniformly from $[-5, 20]$ dB, a much broader range than the fixed $\{0, 5, 10, 15\}$ dB of the standard VoiceBank-DEMAND protocol, to improve robustness across diverse conditions.

It is important to note that in MUSE++, the 1D Mamba-2 block placed at each encoder and decoder level performs the same mathematical operations, namely the selective state space scan defined above. The blocks differ only in the tensor dimensions they process: shallower encoder levels handle features with higher spatial resolution and fewer channels (e.g., $C = 16$ at encoder level one), while deeper levels process lower-resolution features with more channels (e.g., $C = 48$ at the bottleneck), following the standard U-Net channel schedule. There is no structural or operational distinction between encoder-side and decoder-side Mamba blocks. The same principle carries over to DOM-MUSE: all DOM Blocks share identical internal structure across all encoder, bottleneck, and decoder levels, differing only in input channel width.

3.3. Limitations of the MUSE++ Backbone

Although MUSE++ achieves a remarkable reduction in the parameter count, the direct substitution of the MET Transformer with 1D Mamba, without the augmented training strategy, yields performance only comparable to that of MUSE, indicating that the 1D Mamba backbone by itself has limited representational capacity. Moreover, the 1D unidirectional scanning ignores cross-frequency dependencies, phase information is used only as an output target, and the concatenation-based skip connections treat all encoder features uniformly. These structural weaknesses motivate the four innovations introduced in DOM-MUSE, detailed in the following section.

4. Proposed Method: DOM-MUSE

We propose DOM-MUSE, a lightweight speech enhancement framework that extends the MUSE++ backbone with four targeted innovations: a Deformable Feature Extractor (DFE), a Cross-Dimensional Gated Fusion (CDGF) mechanism, a Phase-Guided Feature Conditioner (PGFC), and an Attention-Based Skip Connection (ABSC). Rather than replacing the entire backbone, each innovation addresses a specific structural weakness while preserving the efficiency of the original design.

4.1. Overall Architecture

DOM-MUSE retains the U-Net-style encoder–decoder skeleton of MUSE and MUSE++, with two structural changes applied consistently: the 1D Mamba blocks are replaced by the proposed DOM Block, and the concatenation-based skip connections are replaced by ABSC. Figure 1 and Figure 2 place the two architectures side by side.

The input waveform $x$ is transformed by STFT into magnitude $M_{x}$ and phase $\theta_{x}$ spectrograms. Following MUSE [18], the magnitude is power-law compressed ($\gamma = 0.3$) and concatenated with the raw phase to form a two-channel feature map, which passes through a dilated Dense-Net codec (DDN) inherited from MP-SENet [12]. The features traverse two encoder stages, each consisting of DOM Blocks and a pixel-unshuffle downsampling layer, and a bottleneck DOM Block, before being decoded symmetrically with pixel-shuffle upsampling and ABSC skip connections. After decoding, two independent refinement DOM Blocks separately process the magnitude and phase streams. The clean phase is recovered via Cartesian coordinates (real and imaginary parts) as $\hat{\theta} = \operatorname{atan2}(\hat{x}_{i}, \hat{x}_{r})$; the enhanced waveform is reconstructed by iSTFT.
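The phase recovery step can be illustrated with NumPy's two-argument arctangent, which resolves the quadrant from the signs of both components. The helper name `recover_phase` and the four test points are illustrative assumptions, not part of the DOM-MUSE code.

```python
import numpy as np

def recover_phase(x_r, x_i):
    """Return theta = atan2(x_i, x_r), the angle of the point (x_r, x_i) in (-pi, pi]."""
    return np.arctan2(x_i, x_r)

# Four unit vectors on the axes: angles 0, pi/2, pi, and -pi/2.
x_r = np.array([1.0, 0.0, -1.0, 0.0])
x_i = np.array([0.0, 1.0, 0.0, -1.0])
phase = recover_phase(x_r, x_i)
```

A plain `arctan(x_i / x_r)` would map opposite quadrants to the same angle and would fail when the real part is zero, which is why the two-argument form is used.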

4.2. DOM Block

The DOM Block is the fundamental processing unit of DOM-MUSE. Despite its structural resemblance to a Transformer block, it contains no spatial self-attention; the attention role is fulfilled by the Mamba-based CDGF mechanism and TCA, both operating in the channel domain. The block belongs to the SSM–CNN hybrid family. As shown in Figure 3, each DOM Block applies batch normalization (BN) to the input $X$ and splits processing into a local branch and a global branch. Their outputs are merged by a $1 \times 1$ convolution and combined with $X$ via a Scalar Attention Residual ($\mathrm{AttnRes}_{1}$), followed by a GS-FFN whose output is combined via a second Scalar Attention Residual ($\mathrm{AttnRes}_{2}$):

$$Y' = \mathrm{AttnRes}_{1}\big(X, \mathrm{Conv}_{1 \times 1}([F_{\mathrm{local}}; F_{\mathrm{global}}])\big), \tag{3}$$

$$Y = \mathrm{AttnRes}_{2}\big(Y', F_{\mathrm{FFN}}(\mathrm{BN}(Y'))\big). \tag{4}$$

Scalar Attention Residual (Scalar AttnRes). To prevent indiscriminate propagation of residual features, both residual connections use a Scalar AttnRes [23] rather than standard addition. Given a main feature $X_{\mathrm{main}}$ and a branch feature $X_{\mathrm{branch}}$, both are globally average-pooled and projected to an 8-dimensional subspace via bias-free linear layers, yielding $q, k \in \mathbb{R}^{B \times 8}$. A data-dependent scalar gate is computed as:

$$\alpha = \sigma\!\left(\frac{\sum_{i=1}^{8} q_{i} \cdot k_{i}}{\sqrt{8}}\right), \quad \mathrm{AttnRes}(X_{\mathrm{main}}, X_{\mathrm{branch}}) = X_{\mathrm{main}} + \alpha \cdot X_{\mathrm{branch}}. \tag{5}$$
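Equation (5) can be sketched as follows. The text does not state which feature produces $q$ and which produces $k$; this sketch assumes $q$ comes from the main feature and $k$ from the branch feature, and the weight names `w_q` and `w_k` are hypothetical.

```python
import numpy as np

def scalar_attn_res(x_main, x_branch, w_q, w_k):
    """Return X_main + alpha * X_branch and the per-sample scalar gate alpha.

    x_main, x_branch: (B, C, T, F); w_q, w_k: (C, 8) bias-free projections.
    Assumption: q is derived from x_main and k from x_branch.
    """
    q = x_main.mean(axis=(2, 3)) @ w_q              # global average pool, then project: (B, 8)
    k = x_branch.mean(axis=(2, 3)) @ w_k            # (B, 8)
    logits = (q * k).sum(axis=1) / np.sqrt(q.shape[1])
    alpha = 1.0 / (1.0 + np.exp(-logits))           # sigmoid, shape (B,)
    return x_main + alpha[:, None, None, None] * x_branch, alpha

rng = np.random.default_rng(0)
x_main = rng.standard_normal((2, 16, 10, 12))
x_branch = rng.standard_normal((2, 16, 10, 12))
out, alpha = scalar_attn_res(x_main, x_branch,
                             0.1 * rng.standard_normal((16, 8)),
                             0.1 * rng.standard_normal((16, 8)))
print(out.shape, alpha.shape)  # (2, 16, 10, 12) (2,)
```

Because the gate is a single scalar per sample, it adds only two small projection matrices to each residual connection.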

Local branch. A $3 \times 3$ depthwise convolution (DWConv: a convolution in which each channel is processed by its own independent filter) paired with a $1 \times 1$ pointwise convolution serves as a lightweight local feature extractor, capturing fine-grained spectro-temporal textures while preserving the harmonic structure critical for artifact suppression. The extracted features are activated by the SiLU (Sigmoid Linear Unit, $\mathrm{SiLU}(x) = x \cdot \sigma(x)$) and GELU (Gaussian Error Linear Unit) to selectively amplify perceptually relevant components.

Global branch. The global branch passes the normalized input through three sequential operations: DFE, PGFC, and the DOM Mamba Block. The ordering is theoretically motivated: geometric alignment must precede phase-guided gating because phase coherence is best assessed on formant-aligned features rather than on distorted rectangular-grid samples; similarly, noise suppression via PGFC must precede SSM scanning to prevent noise-corrupted activations from propagating through the recurrent state.

4.2.1. Deformable Feature Extractor (DFE)

Standard convolutions sample features on a fixed rectangular grid that cannot conform to the curved harmonic structures in speech spectrograms. The DFE addresses this by learning $K^{2}$ independent per-location 2D spatial offsets ($K = 3$, i.e., nine sampling points per position) that warp the sampling grid toward the underlying speech geometry, following the modulated deformable convolution framework of [26]. As shown in Figure 4, a two-stage offset network (DWConv $3 \times 3$ → Conv $1 \times 1$) predicts $2K^{2}$ offset components (one pair $(\Delta u, \Delta v)$ for each of the $K^{2}$ sampling points) and $K^{2}$ sigmoid modulation masks $m$, for a total of $3K^{2} = 27$ output channels. The offsets are initialized to zero so that training begins from a standard convolution and progressively adapts to data-driven deformations. A modulated deformable convolution then samples features at all $K^{2}$ warped locations via bilinear interpolation.
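The sampling rule of a modulated deformable convolution can be illustrated with a deliberately simple single-channel NumPy sketch. It is not the DFE implementation: the function names, the zero padding outside the map, and the loop-based evaluation are assumptions chosen for readability, and the offsets and masks are passed in directly rather than predicted by an offset network.

```python
import numpy as np

def bilinear(x, r, c):
    """Bilinearly sample the 2D array x at fractional (r, c); outside the array counts as zero."""
    height, width = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr, wr in ((r0, 1.0 - (r - r0)), (r0 + 1, r - r0)):
        for cc, wc in ((c0, 1.0 - (c - c0)), (c0 + 1, c - c0)):
            if 0 <= rr < height and 0 <= cc < width:
                value += wr * wc * x[rr, cc]
    return value

def modulated_deform_conv(x, weight, offsets, masks):
    """x: (H, W); weight: (3, 3); offsets: (H, W, 9, 2); masks: (H, W, 9)."""
    height, width = x.shape
    grid = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]   # regular 3x3 grid
    out = np.zeros((height, width))
    for i in range(height):
        for j in range(width):
            for n, (di, dj) in enumerate(grid):
                du, dv = offsets[i, j, n]                          # learned shift of point n
                sample = bilinear(x, i + di + du, j + dj + dv)
                out[i, j] += weight[di + 1, dj + 1] * masks[i, j, n] * sample
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 7))
weight = rng.standard_normal((3, 3))
# Zero offsets and unit masks reduce the operator to an ordinary zero-padded 3x3 convolution.
plain = modulated_deform_conv(x, weight, np.zeros((6, 7, 9, 2)), np.ones((6, 7, 9)))
```

With zero offsets the nine samples fall exactly on the regular grid, which is why zero-initialized offsets start training from the sampling pattern of a standard convolution.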

The choice of $K = 3$ follows the standard setting established in the original deformable convolution literature [26] and is motivated by three considerations. First, $K = 3$ yields a $3 \times 3$ sampling grid that is center-symmetric, providing balanced coverage around each feature location and preserving the spatial locality essential for formant trajectory alignment. Second, $K = 2$ provides only four sampling points, which is insufficient to capture the curvature of formant trajectories; the asymmetric $2 \times 2$ layout also introduces an implicit spatial bias that can destabilize the offset prediction. Third, increasing to $K = 5$ would expand the offset network output from $3K^{2} = 27$ to 75 channels, substantially increasing the parameter count and FLOPs in a module whose purpose is lightweight preprocessing, which contradicts the efficiency objective of DOM-MUSE. A systematic sensitivity analysis of $K$ across different values would require training separate model instances and is left as future work; however, the above considerations provide a well-grounded rationale for $K = 3$ as the appropriate operating point.
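The channel arithmetic behind this argument is easy to check. The helper below is a hypothetical illustration that only counts the offset network outputs ($2K^{2}$ offset components plus $K^{2}$ masks); it makes no claim about the resulting parameter count or FLOPs.

```python
def offset_net_channels(kernel_size: int) -> dict:
    """Count the offset-network output channels for a K x K deformable kernel."""
    points = kernel_size * kernel_size
    return {"offsets": 2 * points, "masks": points, "total": 3 * points}

for kernel_size in (2, 3, 5):
    print(kernel_size, offset_net_channels(kernel_size))
# 2 {'offsets': 8, 'masks': 4, 'total': 12}
# 3 {'offsets': 18, 'masks': 9, 'total': 27}
# 5 {'offsets': 50, 'masks': 25, 'total': 75}
```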

4.2.2. Phase-Guided Feature Conditioner (PGFC)

Phase coherence is a reliable indicator of voiced speech that can be characterized mathematically by the local phase gradient $\partial\theta / \partial t$. For voiced speech, this gradient follows a smooth, predictable pattern determined by the fundamental frequency $f_{0}$; for noise, it produces erratic gradients with high variance. The PGFC exploits this statistical distinction by computing local phase-gradient features and deriving a per-channel gate that selectively suppresses noise-dominated activations before the SSM stage. This is one of the first lightweight U-Net-style SE modules to use local phase gradients to gate SSM inputs before sequence modeling, rather than treating the phase only as an output target [6,10,12]. Unlike prior phase-aware methods that operate only at the output decoder stage, PGFC operates directly on the hybrid complex-aware feature maps produced by the DFE, allowing noise suppression to occur seamlessly in the latent feature space.

A $3 \times 3$ DWConv approximates local differential operators on the deformed features. These responses are then passed through a GELU activation, a $1 \times 1$ pointwise convolution, and a sigmoid to produce a per-channel gate $\alpha \in (0, 1)^{C}$:

$$\alpha = \sigma\big(\mathrm{Conv}_{1 \times 1}(\mathrm{GELU}(\mathrm{DWConv}_{3 \times 3}(X)))\big), \quad \tilde{X} = X \odot \alpha. \tag{6}$$

This pre-filters the SSM input with only two lightweight convolution layers, as illustrated in Figure 5.
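A minimal NumPy sketch of Equation (6) is given below. It follows the equation literally, so the gate holds one value per channel at every time–frequency position; whether the actual module pools the gate to a single value per channel is not stated here and is left open. The helper names, the bias-free layers, and the tanh approximation of GELU are assumptions.

```python
import numpy as np

def dwconv3x3(x, w):
    """Depthwise 3x3 convolution with zero padding. x: (C, T, F); w: (C, 3, 3)."""
    _, t, f = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for a in range(3):
        for b in range(3):
            out += w[:, a, b][:, None, None] * xp[:, a:a + t, b:b + f]
    return out

def gelu(x):
    """tanh approximation of GELU (an approximation, not the exact erf form)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pgfc(x, w_dw, w_pw):
    """Return X * alpha and alpha, with alpha = sigmoid(Conv1x1(GELU(DWConv3x3(X))))."""
    hidden = gelu(dwconv3x3(x, w_dw))
    alpha = sigmoid(np.einsum("oc,ctf->otf", w_pw, hidden))   # 1x1 conv mixes channels
    return x * alpha, alpha

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 10, 12))
gated, alpha = pgfc(x, rng.standard_normal((8, 3, 3)), rng.standard_normal((8, 8)))
print(gated.shape)  # (8, 10, 12)
```

Since every gate value lies strictly between 0 and 1, the module can only attenuate activations, never amplify them.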

4.2.3. DOM Mamba Block and Cross-Dimensional Gated Fusion (CDGF)

The DOM Mamba Block (Figure 6) processes the time and frequency axes in parallel with two dedicated Mamba-2 instances, fusing their outputs via semantic gates from a TCA module—a mechanism we term Cross-Dimensional Gated Fusion (CDGF).

The theoretical motivation for CDGF over simpler fusion strategies is as follows. Naive concatenation doubles the channel dimension ($O(C^{2})$ cost for subsequent mixing) and treats all channel activations equally, implicitly assuming that temporal and frequency Mamba outputs are always equally informative across all regions. This assumption does not hold for speech, where voiced segments benefit more from temporal continuity while fricatives and consonants are better characterized by spectral patterns. Additive fusion is equivalent to assuming that the temporal and frequency dimensions are perfectly aligned and equally weighted in the latent space for all time–frequency positions, which is physically unreasonable. By contrast, CDGF uses TCA to compute a channel covariance matrix between the temporal and frequency representations, capturing the cross-channel energy correlations that arise from formant structures spanning multiple frequency bands. This covariance is then used to derive axis-specific gates $G_{T}$ and $G_{F}$, allowing the model to dynamically emphasize whichever dimensional scan is more semantically relevant at each channel and time–frequency location. The ablation study (Table 4, A2 vs. full model) confirms that CDGF provides the largest single-component gain (+0.041 PESQ, +0.039 CBAK), consistent with its role as the primary cross-dimensional integration mechanism.

The input is projected to $d = 32$ channels by a $1 \times 1$ convolution. TCA [18] applies L2 normalization to the query and key vectors to ensure convergence of the Taylor expansion and introduces a learnable temperature parameter $\tau$ to dynamically scale the channel covariance projection, which is empirically vital for stabilizing the cross-dimensional gating signals. Two independent $1 \times 1$ convolutions followed by sigmoid activations produce the gates $G_{T}$ and $G_{F}$, which modulate the outputs $H_{T}$ and $H_{F}$ of the time-axis and frequency-axis Mamba-2 scans before fusion:

$$Y_{\mathrm{DOM}} = \mathrm{Conv}_{1 \times 1}\big([H_{T} \odot G_{T}; H_{F} \odot G_{F}]\big). \tag{7}$$
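The data flow of Equation (7) can be sketched as follows. Two parts are stand-ins and should be read as assumptions: `scan_along` replaces each Mamba-2 instance with a simple causal exponential moving sum, and `tca_feat` is a placeholder for the TCA output from which the gates are derived. Only the axis-wise scanning and the gated concatenation mirror the description above.

```python
import numpy as np

def scan_along(x, axis, decay=0.9):
    """Stand-in for a Mamba-2 scan: causal exponential moving sum along one axis."""
    x = np.moveaxis(x, axis, 0)
    out = np.empty_like(x)
    state = np.zeros_like(x[0])
    for i in range(x.shape[0]):
        state = decay * state + x[i]
        out[i] = state
    return np.moveaxis(out, 0, axis)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cdgf(x, tca_feat, w_gt, w_gf, w_out):
    """x, tca_feat: (d, T, F); w_gt, w_gf: (d, d); w_out: (d, 2d)."""
    h_t = scan_along(x, axis=1)   # time scan: every frequency bin is a sequence over T
    h_f = scan_along(x, axis=2)   # frequency scan: every frame is a sequence over F
    g_t = sigmoid(np.einsum("oc,ctf->otf", w_gt, tca_feat))    # time gate G_T
    g_f = sigmoid(np.einsum("oc,ctf->otf", w_gf, tca_feat))    # frequency gate G_F
    fused = np.concatenate([h_t * g_t, h_f * g_f], axis=0)     # (2d, T, F)
    return np.einsum("oc,ctf->otf", w_out, fused)              # final 1x1 convolution

rng = np.random.default_rng(0)
d, T, F = 4, 5, 6
x = rng.standard_normal((d, T, F))
y_dom = cdgf(x, rng.standard_normal((d, T, F)),
             rng.standard_normal((d, d)), rng.standard_normal((d, d)),
             rng.standard_normal((d, 2 * d)))
print(y_dom.shape)  # (4, 5, 6)
```

The final $1 \times 1$ convolution maps the concatenated $2d$ channels back to $d$, so the block output keeps the same width as its input.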

To provide qualitative evidence of CDGF’s behavior, Figure 7 visualizes the Time Gate $G_{T}$ and Frequency Gate $G_{F}$ (channel-averaged) for two representative utterances at contrasting SNR levels (12.5 dB and 2.5 dB).

Two consistent observations hold across both conditions. First, $G_{T}$ exhibits structured horizontal suppression bands (dark purple, ≈0) that are frequency-selective and spatially sharp, aligned with high-frequency noise-dominated regions and inter-harmonic gaps in the voiced segments. This reflects the temporal Mamba’s role in tracking harmonic continuity along the time axis: it selectively suppresses channels with low temporal coherence while remaining open at harmonically stable frequencies. Second, $G_{F}$ produces comparatively smoother and more spatially diffuse suppression, with the most prominent dark region concentrated in a contiguous block around 4000–6000 Hz during the early noise-only frames. Unlike $G_{T}$’s sharp band structure, $G_{F}$’s broader patterns are consistent with the frequency Mamba’s role in capturing spectral energy distributions across the frequency axis rather than tracking time-continuous harmonics.

The visual contrast between the two gates (sharp, channel-selective banding for $G_{T}$ versus smooth, broadband attenuation for $G_{F}$) confirms that CDGF learns complementary axis-specific gating behavior rather than redundant suppression. Comparing the two SNR conditions, both gates exhibit stronger overall suppression at 2.5 dB (right column), consistent with the heavier noise contamination, validating that the gating is input-adaptive.

Gated Star FFN (GS-FFN). Inspired by the element-wise product design of StarNet [28], a pointwise convolution expands the channel dimension by $3.0 \times$, and the result is split into a content stream $U$ and a gate stream $V$, each with a $1.5 \times$ ratio. A $3 \times 3$ DWConv is applied to the content stream $U$ to introduce local spatial inductive bias:

$$F_{FFN}(X) = \mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}_{3 \times 3}(U) \odot \mathrm{GELU}(V)), \quad [U; V] = \mathrm{Conv}_{1 \times 1}(X). \tag{8}$$

Compared with a standard two-layer FFN at the same expansion ratio, GS-FFN reduces parameters by approximately one-third.

4.3. Attention-Based Skip Connection (ABSC)

Standard U-Net skip connections concatenate encoder and decoder features channel-wise, doubling the channel dimension and indiscriminately passing all encoder information to the decoder. The proposed ABSC replaces this with a content-adaptive channel gate (Figure 8).

The decoder features $D$ and encoder features $E$ are first independently projected via depthwise $1 \times 1$ convolutions, then globally average-pooled, concatenated, and passed through a bottleneck (Conv $1 \times 1$ → SiLU → Conv $1 \times 1$ → sigmoid) to produce $\alpha \in (0, 1)^{C}$:

$$\mathrm{ABSC}(D, E) = D + \alpha \odot E. \tag{9}$$
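The gating in Equation (9) can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the depthwise $1 \times 1$ projections are omitted, each channel is stored as a flat list of time–frequency values, and the names `absc`, `w1`, `b1`, `w2`, and `b2` are hypothetical rather than taken from the authors' code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    return x * sigmoid(x)


def linear(x, weight, bias):
    """A 1x1 convolution on pooled (1x1 spatial) features is a linear layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def absc(dec, enc, w1, b1, w2, b2):
    """Return (D + alpha * E, alpha) with one gate value per channel.

    dec, enc: lists of C channels, each a flat list of T*F values.
    w1/b1 and w2/b2: bottleneck weights (2C -> hidden -> C).
    """
    pooled = [sum(ch) / len(ch) for ch in dec] + [sum(ch) / len(ch) for ch in enc]
    hidden = [silu(h) for h in linear(pooled, w1, b1)]
    alpha = [sigmoid(a) for a in linear(hidden, w2, b2)]
    fused = [
        [d + a * e for d, e in zip(d_ch, e_ch)]
        for d_ch, e_ch, a in zip(dec, enc, alpha)
    ]
    return fused, alpha


# Toy example: C = 2 channels, 2 values per channel, hidden width 1.
dec = [[1.0, 1.0], [2.0, 2.0]]
enc = [[2.0, 4.0], [0.0, 2.0]]
w1, b1 = [[0.0, 0.0, 0.0, 0.0]], [0.0]
w2, b2 = [[0.0], [0.0]], [0.0, 0.0]
fused, alpha = absc(dec, enc, w1, b1, w2, b2)
```

The output has the same number of channels as the decoder input, in contrast to concatenation, which would double it.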

This preserves the decoder channel dimension and allows the network to learn at each level how much encoder information is beneficial. ABSC can be viewed as the macro-level counterpart to the Scalar AttnRes within the DOM Block: both implement content-adaptive residual gating, but at different structural scales and gate granularities (per channel vector vs. scalar).

4.4. Training Objective

DOM-MUSE is trained with the original MUSE loss [18]:

$$L = \gamma_{1} L_{metric} + \gamma_{2} L_{mag} + \gamma_{3} L_{phase} + \gamma_{4} L_{com}, \tag{10}$$

with $\gamma_{1} = 0.05$, $\gamma_{2} = 0.9$, $\gamma_{3} = 0.3$, and $\gamma_{4} = 0.1$ [12,22]. The augmented loss of MUSE++ is not used as the primary training configuration, as the PGFC and CDGF already embed sufficient phase-aware inductive biases. Section 6.7 presents additional results under the augmented training protocol.
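As a small illustration of Equation (10), the weighted sum can be written as follows. The individual loss terms are assumed to be already computed scalars; the dictionary keys and the function name are illustrative choices, not identifiers from the MUSE code base.

```python
# Weights gamma_1..gamma_4 as stated in the text.
GAMMAS = {"metric": 0.05, "mag": 0.9, "phase": 0.3, "com": 0.1}


def total_loss(l_metric, l_mag, l_phase, l_com, gammas=GAMMAS):
    """Weighted sum of the four loss terms in Eq. (10)."""
    return (
        gammas["metric"] * l_metric
        + gammas["mag"] * l_mag
        + gammas["phase"] * l_phase
        + gammas["com"] * l_com
    )
```

With these weights the magnitude term dominates: equal unit losses contribute 0.9 from magnitude against 0.05 from the metric term.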

4.5. Comparison with MUSE and MUSE++

DOM-MUSE shares the same U-Net backbone lineage as MUSE and MUSE++, but differs substantially in how each stage processes features. MUSE relies on the MET Transformer block—a combination of Taylor-approximated attention and deformable patch embedding—achieving strong performance at the cost of a relatively heavy attention mechanism. MUSE++ simplifies this by replacing the MET Transformer entirely with a 1D Mamba-2 block, cutting the parameter count substantially; however, the unidirectional 1D scan sacrifices cross-frequency modeling, and the reduced representational capacity must be compensated by augmented training strategies rather than architectural improvements.

DOM-MUSE redesigns the internal structure of the processing block to address three concrete weaknesses shared by both predecessors: the DFE replaces fixed rectangular grids with learnable warp locations; CDGF processes time and frequency in parallel with semantic gating rather than simple concatenation; and PGFC actively conditions features on local phase coherence rather than treating the phase only as an output supervision target. Scalar AttnRes and ABSC together form a two-level content-adaptive gating system at both the block level and the encoder–decoder interface.

5. Experimental Setup

To evaluate the effectiveness of the proposed DOM-MUSE framework, we conduct experiments on the VoiceBank-DEMAND corpus [20,21]. This widely used benchmark pairs clean speech from the VoiceBank dataset with diverse noise recordings from the DEMAND database. The training set contains 11,572 utterances from 28 distinct speakers, while the test set comprises 824 utterances from two unseen speakers. During training, clean speech is artificially corrupted by ten categories of DEMAND noise at four SNR levels: 0, 5, 10, and 15 dB. The test set employs five types of DEMAND noise at SNRs of 2.5, 7.5, 12.5, and 17.5 dB. Approximately 200 utterances are reserved for validation.

Time–frequency representation: All models use the STFT as the time–frequency front-end. This choice is motivated by three concrete requirements of our architecture. First, the PGFC relies on explicit phase separation: STFT natively produces a complex spectrum $S = M e^{j \theta}$ from which phase $\theta$ can be directly extracted and used to compute local phase gradients. Discrete Wavelet Transform (DWT) coefficients are real-valued and do not provide direct access to an instantaneous phase, making pixel-level phase-gradient gating impractical without additional computation. Second, the iSTFT provides a lossless reconstruction path that is a mathematical requirement of our magnitude-phase dual-decoder design; DWT-based reconstruction is more complex and is not universally invertible for arbitrary filter banks. Third, using STFT ensures a fully fair architectural comparison with all competing methods in Table 6, including MUSE, MUSE++, IMSE, and DPT-FSNet, which uniformly adopt STFT as the standard front-end in this benchmark.
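To make the phase-extraction argument concrete, the sketch below splits complex STFT values into magnitude and phase and computes a wrapped phase difference along time for one frequency bin. The specific gradient definition (first difference between adjacent frames, wrapped to the principal interval) is an assumption for illustration; the exact formulation used by PGFC may differ.

```python
import cmath
import math


def wrap(phi):
    """Wrap an angle to the principal interval around zero."""
    return math.atan2(math.sin(phi), math.cos(phi))


def magnitude_phase(s):
    """Split a complex STFT value S into (M, theta) with S = M * exp(j*theta)."""
    return cmath.polar(s)


def time_phase_gradient(frames):
    """Wrapped phase difference between adjacent frames of one frequency bin."""
    phases = [cmath.phase(s) for s in frames]
    return [wrap(b - a) for a, b in zip(phases, phases[1:])]


# A steady sinusoid advances its phase by a constant amount per frame,
# so its wrapped gradient is flat even when the raw phase wraps around.
steady = [cmath.rect(1.0, 0.5 * n) for n in range(8)]
gradients = time_phase_gradient(steady)
```

A harmonic component yields a nearly constant gradient, whereas a noise-dominated bin yields an erratic one, which is the contrast the phase-gradient gate is described as exploiting.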

Data preprocessing and training protocol: All waveforms are uniformly segmented into chunks of 30,700 samples. The STFT is computed with an FFT (Fast Fourier Transform) size of 510, a window length of 510, a hop size of 100, and a sampling rate of 16 kHz. Models are trained for up to 100 epochs using the AdamW optimizer, with an initial learning rate of $5 \times 10^{-4}$, an exponential decay factor of 0.99, a weight decay of $1 \times 10^{-4}$, and a batch size of two. Early stopping is applied if the validation loss does not improve for 10 consecutive epochs.
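The settings above determine the size of each spectrogram chunk and the learning-rate trajectory. The following sketch works through that arithmetic under two assumptions that the text does not state explicitly: the STFT uses centered (reflect-padded) framing, and the decay factor of 0.99 is applied once per epoch.

```python
def stft_shape(num_samples, n_fft=510, hop=100, center=True):
    """Return (frequency_bins, frames) for a one-sided STFT.

    center=True assumes n_fft // 2 samples of padding on each side.
    """
    freq_bins = n_fft // 2 + 1
    if center:
        frames = 1 + num_samples // hop
    else:
        frames = 1 + (num_samples - n_fft) // hop
    return freq_bins, frames


def learning_rate(epoch, base_lr=5e-4, decay=0.99):
    """Learning rate after `epoch` exponential decay steps."""
    return base_lr * decay ** epoch


bins, frames = stft_shape(30700)
chunk_seconds = 30700 / 16000
```

Under these assumptions a 30,700-sample chunk (about 1.92 s at 16 kHz) maps to 256 frequency bins by 308 frames, and the FFT size of 510 is what yields the convenient bin count of 256.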

The initial learning rate of $5 \times 10^{-4}$ is adopted directly from the MUSE and MUSE++ training protocols [18,19], ensuring a controlled comparison within the model family and avoiding confounding variables. This value is also consistent with widely used AdamW settings for SSM-based audio models [24], where the learning rate must be kept moderate to avoid gradient instability caused by the recurrent state update dynamics. The batch size of two is dictated by GPU memory constraints (NVIDIA RTX 3060, 12 GB VRAM): variable-length speech utterances require zero-padding or chunking, and a larger batch size causes out-of-memory (OOM) errors. All three models in this comparison (MUSE, MUSE++, DOM-MUSE) are trained with an identical batch size to ensure fair comparison.

Model architecture configuration: DOM-MUSE adopts a two-level U-Net encoder–decoder backbone, with the base channel dimension initialized at 16 and increased linearly at each downsampling stage, yielding channel widths of $[16, 32]$. All DOM Mamba Blocks share a fixed internal projection dimension of 32, and each embedded Mamba-2 module is configured with a state size of 16, head dimension of eight, and expansion factor of two. The front-end Dense Encoder and the back-end Mask and Phase Decoders follow the MP-SENet design [12], employing dilated convolutions with dilation rates $\{1, 2, 4, 8\}$ and dense skip connections. The magnitude mask is estimated via a learnable sigmoid activation.

Loss function configuration: DOM-MUSE is trained using the original MUSE loss in Equation (10), with weights set to $\gamma_{1} = 0.05$, $\gamma_{2} = 0.9$, $\gamma_{3} = 0.3$, and $\gamma_{4} = 0.1$, following the configuration in [12,18].

Evaluation metrics: To benchmark enhancement performance, we employ five widely adopted objective metrics, each assessing a different aspect of the enhanced speech:

1. Perceptual Evaluation of Speech Quality (PESQ) [29]: Ranges from $-0.5$ to 4.5; higher is better. As an ITU-T P.862 standard, PESQ is validated against subjective mean opinion scores (MOS) and is widely adopted as the primary proxy for perceptual quality in the SE literature [18,19,27].
2. Short-Time Objective Intelligibility (STOI) [30]: Scores from 0 to 1; higher values indicate greater intelligibility.
3. Composite Overall Quality (COVL) [31]: Ranges from 0 to 5; an objective MOS predictor for overall perceived speech quality.
4. Composite Signal Distortion (CSIG) [31]: MOS scale (0–5) for signal distortion; higher values indicate less distortion.
5. Composite Background Noise Intrusiveness (CBAK) [31]: MOS scale (0–5) for background noise; higher values indicate more effective suppression.

These standardized metrics enable objective and comprehensive comparison with both the MUSE family of models and other state-of-the-art enhancement systems. No formal subjective listening evaluation (MOS test) is conducted in this work, which is consistent with the established practice in the lightweight SE literature: none of the competing methods in Table 6—including MUSE, MUSE++, IMSE, and DPT-FSNet—report formal subjective assessments. PESQ, as a validated ITU-T proxy for MOS, is considered the primary quality indicator.

6. Experimental Results and Discussions

6.1. Overall Performance Comparison

Table 1 reports the SE performance of MUSE (both the originally reported scores and our reproduction), MUSE++, and the proposed DOM-MUSE on the VoiceBank-DEMAND test set. All models were trained and evaluated under the same protocol described in Section 5.

Statistical significance: To confirm that the reported differences are not attributable to random variation, we performed the Wilcoxon signed-rank test (two-sided) on per-utterance scores across the full test set ($N = 824$ utterances), comparing DOM-MUSE against MUSE++. Table 2 summarizes the results for all five reported metrics.

All five differences are highly statistically significant ($p < 0.001$), confirming that none of the observed gains or trade-offs are attributable to random variation. Notably, the CBAK and STOI differences, where MUSE++ leads, are also statistically significant, which strengthens rather than weakens our interpretation: the CBAK gap is a real, systematic difference rather than a measurement artifact, and we attribute it to MUSE++’s augmented training strategy. This attribution is supported in Section 6.7, where DOM-MUSE trained with augmented strategies achieves CBAK 3.9351, substantially exceeding MUSE++’s 3.8584.
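For readers unfamiliar with the test, the paired procedure can be sketched as below. This is a hand-written normal-approximation version (average ranks for ties, zero differences dropped, no continuity correction) intended only to show the mechanics; it is not the authors' evaluation script, and for reported results an established statistics library should be preferred.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums of the paired differences x[i] - y[i].
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p
```

In the setting described above, `x` and `y` would hold the 824 per-utterance scores of the two systems for one metric, and the test would be repeated once per metric.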

It is important to note that MUSE++ is trained with dynamic SNR augmentation and the augmented multi-objective loss introduced in [19], while DOM-MUSE uses neither of these training enhancements. The fact that DOM-MUSE still outperforms MUSE++ on the most perceptually important metrics (PESQ and COVL) despite this asymmetry demonstrates that the architectural innovations alone—not training strategy—drive the quality improvement.

DOM-MUSE vs. MUSE (reproduced): DOM-MUSE outperforms the reproduced MUSE on all five evaluation metrics. PESQ improves from 3.3475 to 3.4241 (+0.077), CSIG from 4.6163 to 4.6738 (+0.058), CBAK from 3.7965 to 3.8224 (+0.026), COVL from 4.0827 to 4.1531 (+0.070), and STOI from 0.9506 to 0.9522 (+0.002). These gains are achieved while simultaneously reducing the parameter count by 24% (from 0.51 M to 0.39 M).

DOM-MUSE vs. MUSE++: Despite the training asymmetry, DOM-MUSE outperforms MUSE++ on PESQ (+0.061) and COVL (+0.032), with CSIG also marginally higher (+0.012). MUSE++ leads on CBAK (+0.036) and STOI (+0.002), metrics that reward aggressive noise suppression—a behavior directly encouraged by its augmented training strategy.

Discussion on CBAK: DOM-MUSE scores slightly lower than MUSE++ on CBAK (3.8224 vs. 3.8584), a metric that rewards aggressive noise suppression. This gap is directly attributable to MUSE++’s dynamic SNR augmentation and augmented loss, which explicitly push the model toward stronger noise attenuation at the cost of potential over-suppression of low-energy speech components. In real-world deployment—especially for hearing aids and high-quality voice communication—over-suppression manifests as a “machine-like” or “clipped” perceptual quality that reduces listener preference even when the background noise is numerically lower. PESQ and COVL capture precisely this perceptual dimension, and DOM-MUSE’s higher scores on both metrics suggest it strikes a more natural balance between noise removal and speech preservation. As confirmed in Section 6.7, when DOM-MUSE is trained with the same augmented protocol, CBAK also improves substantially (to 3.9351), confirming that the CBAK gap is attributable to the training strategy difference rather than an architectural limitation.

6.2. Parameter and Computational Efficiency

Table 3 reports parameter counts alongside computational efficiency metrics measured on the VoiceBank-DEMAND test set (824 utterances) using a single NVIDIA RTX 3060 GPU, including the real-time factor (RTF), total inference time for the full test set (IFT, in seconds), throughput (THP, utterances per second), and peak VRAM (Video Random Access Memory) usage.

The natural architectural comparison baseline for DOM-MUSE is MUSE, not MUSE++, since DOM-MUSE and MUSE share the same U-Net design philosophy, while MUSE++ represents a more extreme architectural simplification. Relative to MUSE, DOM-MUSE achieves simultaneous improvements on all efficiency dimensions: the parameter count is reduced by 24% (0.51 M → 0.39 M), the inference time is reduced by 37% (218 s → 137 s), throughput increased by 61% (9.64 → 15.50 utt/s), and peak VRAM reduced by 41% (7584 MB → 4506 MB)—while outperforming MUSE on all perceptual quality metrics.
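The relative changes quoted above follow directly from the paired measurements. The short check below recomputes them from the figures given in the text; the function and variable names are illustrative.

```python
def percent_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (MUSE, DOM-MUSE) pairs as quoted in the paragraph above.
reported = {
    "parameters (M)": (0.51, 0.39),
    "inference time (s)": (218.0, 137.0),
    "throughput (utt/s)": (9.64, 15.50),
    "peak VRAM (MB)": (7584.0, 4506.0),
}

for name, (muse, dom_muse) in reported.items():
    print(f"{name}: {percent_change(muse, dom_muse):+.1f}%")
```

Negative values indicate a reduction (parameters, inference time, VRAM), while the positive value for throughput indicates an increase.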

Relative to MUSE++, DOM-MUSE requires more computation, which is commensurate with its richer architectural design. However, an important reference point is the bare MUSE++ backbone without its augmented training strategies: our experiments show that a plain 1D Mamba-2 substitution (without SNR augmentation or augmented loss) achieves PESQ of only 3.2860, significantly below DOM-MUSE’s 3.4241. This demonstrates that DOM-MUSE’s moderate computational overhead delivers a substantially larger architectural improvement over the plain Mamba baseline, beyond what MUSE++’s efficiency numbers alone might suggest.

All three models achieve an RTF well below 1.0, confirming real-time feasibility across the board. DOM-MUSE’s RTF of 0.0645 means that processing a 10 ms speech frame requires, on average, about 0.65 ms of computation, well within the requirements of real-time telephony and hearing aid applications (RTF < 1.0).
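The per-frame figure follows from the definition of the real-time factor as processing time divided by audio duration. A minimal sketch, with illustrative function names:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent computing / duration of the audio processed."""
    return processing_seconds / audio_seconds


def amortized_compute_ms(rtf, audio_ms):
    """Average compute time attributable to `audio_ms` of audio."""
    return rtf * audio_ms


frame_cost_ms = amortized_compute_ms(0.0645, 10.0)
```

Since the measurements are taken on whole utterances with a non-causal model, this is an amortized cost per 10 ms of audio rather than a streaming latency.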

6.3. Ablation Study

To quantify the individual contribution of each proposed component, we evaluate three variants of DOM-MUSE, denoted A1–A3, each omitting a different subset of the innovations. All variants share the same two-level asymmetric U-Net skeleton and are trained under the same protocol. Table 4 summarizes the results.

Role of CDGF and TCA (A3 vs. A2): Adding CDGF and TCA yields the largest single-step improvement across all perceptual metrics: PESQ increases by 0.041, CSIG by 0.020, CBAK by 0.039, and COVL by 0.039, confirming that Cross-Dimensional Gated Fusion is the most critical component. These gains are achieved under identical training conditions, demonstrating that the improvement is attributable to the CDGF architecture rather than any training strategy difference.

Role of Scalar AttnRes (A2 vs. A1): Scalar AttnRes produces clear improvements in CBAK (+0.026) and STOI (+0.002), consistent with content-adaptive residual gating effectively filtering noise-corrupted branch outputs.

Role of DFE, PGFC, and GS-FFN (DOM-MUSE vs. A3): Adding DFE, PGFC, and GS-FFN improves PESQ by 0.022, CSIG by 0.018, COVL by 0.019, and STOI by 0.001, with no increase in parameter count. CBAK decreases slightly ($-0.005$), reflecting PGFC’s deliberate design priority of favoring perceptual naturalness over maximum noise suppression.

Summary: Each component contributes in a complementary way, and the full performance of DOM-MUSE emerges from their joint interaction. All ablation comparisons are conducted under identical training protocols, ensuring that the observed differences reflect architectural contributions exclusively.

6.4. Per SNR Performance Analysis

To further examine the behavior of DOM-MUSE across different noise conditions, Table 5 reports per SNR results for all three models on the VoiceBank-DEMAND test set, which spans input SNRs of 2.5, 7.5, 12.5, and 17.5 dB.

DOM-MUSE leads on PESQ and COVL at every SNR level: Across all four test conditions, DOM-MUSE achieves the highest PESQ and COVL scores, indicating that its perceptual quality advantage is consistent across the full range of noise conditions present in the benchmark. Crucially, this advantage is maintained without dynamic SNR augmentation or the augmented multi-objective loss that MUSE++ employs.

The PESQ gain is largest at low SNR: The improvement of DOM-MUSE over MUSE++ on PESQ narrows progressively as the SNR increases: +0.087 at 2.5 dB, +0.068 at 7.5 dB, +0.044 at 12.5 dB, and +0.043 at 17.5 dB. This SNR-dependent pattern provides direct empirical support for PGFC’s phase-gradient gating: at low SNR, smooth harmonic phase progressions are strongly contrasted against erratic noise-phase gradients, making PGFC’s signal maximally discriminative. As SNR increases, the noisy phase itself approaches the clean phase, reducing the marginal benefit of phase-guided conditioning.

MUSE++ leads on CBAK at every SNR level: MUSE++’s advantage on CBAK is consistent across all conditions and reflects the direct effect of its augmented training strategy rather than an architectural advantage.

6.5. Qualitative Spectrogram Analysis

Figure 9 and Figure 10 show representative spectrogram comparisons on two test utterances selected from the VoiceBank-DEMAND test set to illustrate contrasting SNR conditions (SNR 12.5 dB and SNR 2.5 dB respectively). These examples are provided for qualitative illustration only and are not intended as statistical evidence of generalization. The generality of DOM-MUSE’s improvements across all noise conditions is substantiated quantitatively by Table 5, which reports per SNR performance spanning all four test SNR levels (2.5, 7.5, 12.5, 17.5 dB), and by Table 1, which aggregates results across all five noise categories in the VoiceBank-DEMAND test set. DOM-MUSE consistently leads on PESQ and COVL at every SNR level without exception, providing more rigorous evidence of generality than additional spectrogram examples would offer.

DOM-MUSE (e) produces spectrogram patterns closely aligned with the clean reference (a) in the voiced regions, with harmonic contours that remain smooth and continuous. This qualitative alignment is consistent with the higher PESQ and COVL scores of DOM-MUSE, which capture precisely these perceptual dimensions of harmonic fidelity and overall speech naturalness.

6.6. Comparison with State-of-the-Art Methods

To place DOM-MUSE in the broader context of the SE literature, Table 6 compares its performance against two classical non-neural baselines and several representative lightweight neural SE systems on the VoiceBank-DEMAND test set. Scores for competing methods are taken from their respective original papers or from the summary reported in [18].

Performance of classical baselines: Both Wiener filtering and LogMMSE provide limited enhancement relative to modern neural models, confirming the well-established advantage of data-driven architectures over classical spectral-domain estimators.

DOM-MUSE vs. state-of-the-art neural models: DOM-MUSE achieves the best PESQ (3.42) and COVL (4.15) among all models in the comparison, including the larger DB-AIAT (2.81 M) and the competitive DPT-FSNet (0.88 M); its CBAK (3.82) is also among the highest, although MUSE++ remains ahead on this metric (3.8584 vs. 3.8224, Section 6.1). The gains over DPT-FSNet are consistent across all perceptual quality metrics (+0.09 PESQ, +0.09 CSIG, +0.10 CBAK, +0.15 COVL), indicating that the deformable scanning, cross-dimensional gating, and phase-guided conditioning provide complementary benefits that neither attention-based Transformers nor fixed-grid convolutional networks can easily replicate. On STOI, DOM-MUSE (0.95) matches the majority of competing models but falls behind DPT-FSNet (0.96), which is consistent with the conservative gating behavior of PGFC and CDGF discussed above.

DOM-MUSE vs. IMSE: IMSE [27] is the most directly comparable recent method, as it also builds upon MUSE at a similar parameter scale (0.43 M vs. our 0.39 M). IMSE proposes replacing the MET Transformer with Amplitude-Aware Linear Attention (MALA) and substituting Deformable Embedding with a static Inception Depthwise Convolution (IDConv), arguing that dynamic deformable offsets introduce unnecessary computational burden. DOM-MUSE takes a different design philosophy: rather than eliminating deformable convolution, we deploy it as a dedicated preprocessing step (DFE) that warps the feature grid to align with speech formant trajectories before the SSM stage, allowing deformable sampling to directly benefit the Mamba scanning rather than being absorbed into patch embedding. DOM-MUSE achieves PESQ 3.42 and COVL 4.15, both marginally higher than IMSE (PESQ 3.40, COVL 4.14), while using slightly fewer parameters (0.39 M vs. 0.43 M), demonstrating that the Mamba-based parallel T-F scanning approach with phase-guided conditioning offers a competitive and complementary alternative to the attention-replacement strategy of IMSE.

Parameter efficiency: DOM-MUSE achieves the highest overall perceptual quality scores while using only 0.39 M parameters, fewer than every competing neural model except MUSE++. In particular, it surpasses DB-AIAT with more than $7 \times$ fewer parameters and outperforms DPT-FSNet and MetricGAN-OKDv2 with less than half their parameter counts.

Comparison within the MUSE family: Within the MUSE family, DOM-MUSE surpasses the reproduced MUSE on all five metrics and surpasses MUSE++ on PESQ, CSIG, and COVL, while MUSE++ retains a lead on CBAK and STOI under the primary training configuration (Section 6.1). These results indicate that DOM-MUSE occupies a favorable position: more expressive and higher-performing on perceptual quality than MUSE++, more compact than the original MUSE, and competitive with the most recent MUSE-derived lightweight model IMSE, while offering a different and complementary architectural perspective through Mamba-based parallel T-F scanning with phase-guided conditioning.

6.7. DOM-MUSE Under Augmented Training Conditions

To address the question of whether DOM-MUSE’s architectural improvements persist under identical training conditions as MUSE++, we additionally trained DOM-MUSE with the same dynamic SNR augmentation and augmented multi-objective loss employed by MUSE++. Table 7 presents the results.

Under identical augmented training conditions, DOM-MUSE (aug.) achieves PESQ 3.4609, CSIG 4.7366, CBAK 3.9351, COVL 4.2165, and STOI 0.9586—surpassing MUSE++ on all five metrics. These results confirm two key points. First, DOM-MUSE’s architectural innovations provide genuine representational improvements that are independent of and complementary to training strategy: the same architecture benefits from augmented training while maintaining its structural advantages over MUSE++. Second, the CBAK gap observed in Table 1 between DOM-MUSE and MUSE++ is a consequence of the training strategy difference rather than an architectural limitation—once trained with augmented strategies, DOM-MUSE’s CBAK (3.9351) substantially exceeds MUSE++’s (3.8584). The parameter count of DOM-MUSE (aug.) is 0.39 M, essentially identical to the primary DOM-MUSE model (0.39 M), confirming that no architectural expansion was required.

7. Conclusions

This paper presented DOM-MUSE, a lightweight speech enhancement framework that addresses four structural limitations shared by the MUSE and MUSE++ architectures within a unified U-Net backbone. The Deformable Feature Extractor (DFE) replaces fixed rectangular sampling grids with $K^{2}$ learnable warp locations that align with speech formant trajectories, providing geometrically coherent inputs to the subsequent state space model. The DOM Mamba Block with Cross-Dimensional Gated Fusion (CDGF) extends unidirectional 1D Mamba scanning to parallel time–frequency axes modulated by channel-level semantic gates derived from Taylor Channel Attention, enabling the model to selectively emphasize spectrally meaningful state transitions; this is a theoretically principled alternative to naive concatenation that captures the cross-channel energy correlations inherent in formant structures. The Phase-Guided Feature Conditioner (PGFC) exploits the statistical distinction between smooth harmonic phase progressions and erratic noise-phase gradients to suppress noise-dominated activations before the SSM stage. The Attention-Based Skip Connection (ABSC) replaces channel-doubling concatenation with a content-adaptive per-channel gate, preserving the decoder channel dimension while selectively incorporating encoder information at each level.

Experiments on the VoiceBank-DEMAND benchmark demonstrate that DOM-MUSE outperforms the reproduced MUSE baseline on all five evaluation metrics, namely PESQ (+0.077), CSIG (+0.058), CBAK (+0.026), COVL (+0.070), and STOI (+0.002), while reducing the parameter count by 24% (0.51 M → 0.39 M). Compared with MUSE++, which employs dynamic SNR augmentation and an augmented multi-objective loss, DOM-MUSE achieves higher PESQ (+0.061) and COVL (+0.032) without any of these training enhancements, with statistical significance confirmed by the Wilcoxon signed-rank test (two-sided $p = 4.32 \times 10^{-23}$ for PESQ; $p < 0.001$ for all five metrics). When trained under the same augmented protocol as MUSE++, DOM-MUSE further improves to PESQ 3.4609 and COVL 4.2165, surpassing MUSE++ on all five metrics. Ablation experiments confirm that each proposed component contributes positively and complementarily, with CDGF accounting for the largest individual gain. Computational efficiency measurements show that DOM-MUSE reduces inference time by 37% and peak VRAM by 41% relative to MUSE, while maintaining an RTF of 0.0645, well within the requirements of real-time deployment.

Several directions remain open for future investigation. First, all experiments in this work are conducted on the VoiceBank-DEMAND benchmark, which uses relatively controlled additive noise conditions. Cross-dataset evaluation on more challenging benchmarks—such as the DNS Challenge [34] and reverberant environments (e.g., Valentini-Reverb)—would provide stronger evidence of generalization to out-of-domain acoustic conditions; this is explicitly identified as the primary limitation of the current work and a priority for future evaluation. Second, the current architecture is non-causal; extending DOM-MUSE to a causal variant through masked convolutions and unidirectional Mamba scanning would enable low-latency streaming deployment on mobile and embedded devices. Third, exploring adaptive bottleneck dimensionality in Scalar AttnRes and ABSC may further improve the quality–efficiency trade-off. Finally, integrating DOM-MUSE as a front-end for downstream tasks, such as ASR or speaker verification, is a natural next step toward real-world deployment.

Author Contributions

Conceptualization, T.-J.L. and J.-W.H.; methodology, T.-J.L., B.-Y.S. and J.-W.H.; software, T.-J.L. and B.-Y.S.; validation, T.-J.L., B.-Y.S. and J.-W.H.; formal analysis, T.-J.L., J.-W.H. and B.-Y.S.; investigation, J.-W.H.; resources, J.-W.H.; data curation, J.-W.H. and T.-J.L.; writing—original draft preparation, T.-J.L., B.-Y.S. and J.-W.H.; writing—review and editing, J.-W.H.; visualization, J.-W.H. and B.-Y.S.; supervision, J.-S.L. and J.-W.H.; project administration, J.-S.L. and J.-W.H.; funding acquisition, J.-S.L. and J.-W.H. All authors have read and agreed to the published version of this manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Leglaive, S.; Fraticelli, M.; ElGhazaly, H.; Borne, L.; Sadeghi, M.; Wisdom, S.; Pariente, M.; Hershey, J.R.; Pressnitzer, D.; Barker, J.P. Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge. Comput. Speech Lang. 2025, 89, 101685.
2. Zheng, C.; Zhang, H.; Liu, W.; Luo, X.; Li, A.; Li, X.; Moore, B.C.J. Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods. Trends Hear. 2023, 27, 23312165231209913.
3. Natarajan, S.; Rahman Al-Haddad, S.A.; Ahmad, F.A.; Kamil, R.; Hassan, M.K.; Azrad, S.; Macleans, J.F.; Abdulhussain, S.H.; Mahmmod, B.M.; Saparkhojayev, N.; et al. Deep neural networks for speech enhancement and speech recognition: A systematic review. Ain Shams Eng. J. 2025, 16, 103405.
4. Boll, S.F. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
5. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445.
6. Paliwal, K.K.; Wojcicki, K.; Rao, B.P. The importance of phase in speech enhancement. Speech Commun. 2010, 53, 465–494.
7. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 3642–3646.
8. Yu, G.; Li, A.; Zheng, C.; Guo, Y.; Wang, Y.; Wang, H. Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7847–7851.
9. Dang, F.; Chen, H.; Zhang, P. DPT-FSNet: Dual-Path Transformer Based Full-Band and Sub-Band Fusion Network for Speech Enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6857–6861.
10. Yin, D.; Luo, C.; Xiong, Z.; Zeng, W. PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9458–9465.
11. Mattursun, A.; Wang, L.; Yu, Y.; Ma, C. Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, 30 June–4 July 2025.
12. Lu, Y.X.; Ai, Y.; Ling, Z.H. MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023.
13. Wang, K.; He, B.; Zhu, W.P. TSTNN: Two-stage Transformer Based Neural Network for Speech Enhancement in the Time Domain. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2021; pp. 7098–7102.
14. Shin, W.; Lee, B.H.; Kim, J.S.; Park, H.J.; Han, S.W. MetricGAN-OKD: Multi-Metric Optimization of MetricGAN via Online Knowledge Distillation for Speech Enhancement. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 31521–31538.
15. Shin, W.; Park, H.J.; Kim, J.S.; Lee, B.H.; Han, S.W. Multi-View Attention Transfer for Efficient Speech Enhancement. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 1196–1200.
16. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the Conference on Language Modeling (COLM), Philadelphia, PA, USA, 7–9 October 2024.
17. Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, Vienna, Austria, 21–27 July 2024.
18. Lin, Z.; Chen, X.; Wang, J. MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement. In Proceedings of the Interspeech 2024, Kos, Greece, 1–5 September 2024.
19. Li, T.J.; Hung, J.W. Enhancing the MUSE Speech Enhancement Framework with Mamba-Based Architecture and Extended Loss Functions. Mathematics 2025, 13, 3481.
20. Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152.
21. Thiemann, J.; Ito, N.; Vincent, E. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings. In Proceedings of the 21st International Congress on Acoustics, Montreal, QC, Canada, 2–7 June 2013; pp. 1–6.
22. Lu, Y.X.; Ai, Y.; Ling, Z.H. Explicit estimation of magnitude and phase spectra in parallel for high-quality speech enhancement. Neural Netw. 2025, 189, 107562.
23. Kimi Team. Attention Residuals. arXiv 2026, arXiv:2603.15031.
24. Zhang, X.; Zhang, Q.; Liu, H.; Xiao, T.; Qian, X.; Ahmed, B.; Ambikairajah, E.; Li, H.; Epps, J. Mamba in Speech: Towards an Alternative to Self-Attention. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 1933–1948.
25. Kim, S.H.; Kim, T.G.; Chun, C.J. Mamba-based Hybrid Model for Speech Enhancement. In Proceedings of the Interspeech 2025, Rotterdam, The Netherlands, 17–21 August 2025.
26. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
27. Tang, X.; Qin, B.; Li, Y. IMSE: Efficient U-Net-based speech enhancement using Inception depthwise convolution and amplitude-aware linear attention. arXiv 2025, arXiv:2511.14515.
Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5694–5703. [Google Scholar] [CrossRef]
ITU-T. Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs; Technical Report P.862; International Telecommunication Union: Geneva, Switzerland, 2001. [Google Scholar]
Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
Hu, Y.; Loizou, P.C. Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238. [Google Scholar] [CrossRef]
Scalart, P.; Vieira Filho, J.V. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, USA, 7–10 May 1996; Volume 2, pp. 629–632. [Google Scholar]
Logmmse: A Python Implementation of the LogMMSE Speech Enhancement/Noise Reduction Algorithm. Version 1.5.3. Available online: https://pypi.org/project/logmmse/ (accessed on 2 January 2026).
Reddy, C.K.A.; Gopal, V.; Cutler, R.; Beyrami, E.; Cheng, R.; Dubey, H.; Matusevych, S.; Aichner, R.; Aazami, A.; Braun, S.; et al. The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2492–2496. [Google Scholar] [CrossRef]

Figure 1. Architecture of the MUSE++ backbone [19]. Each encoder and decoder level contains a 1D Mamba (Mamba-2) block with downsampling or upsampling, and skip connections use standard feature concatenation. The dynamic SNR augmentation and augmented multi-objective loss of MUSE++ are not depicted as DOM-MUSE does not employ them.

Figure 2. Overall architecture of the proposed DOM-MUSE. The encoder adopts DOM Blocks across two stages with channel widths $[16, 32]$, followed by a bottleneck DOM Block at $\mathrm{ch} = 48$. The decoder incorporates encoder features via ABSC. After decoding, two independent refinement DOM Blocks separately process the magnitude and phase streams before the output heads.

Figure 3. Internal structure of the DOM Block. After batch normalization (BN), the input splits into a local branch (DWConv $3 \times 3$ + SiLU + Conv $1 \times 1$) and a global branch (DFE → PGFC → DOM Mamba Block). The merged output is combined with the input via $\mathrm{AttnRes}_{1}$ (scalar dynamic gate $α_{1}$), then processed by GS-FFN with a second $\mathrm{AttnRes}_{2}$ gate $α_{2}$.

Figure 4. The Deformable Feature Extractor (DFE). The offset network predicts $2 K^{2}$ spatial offsets and $K^{2}$ sigmoid modulation masks ($K = 3$, total 27 channels), all initialized to zero. The Modulated Deformable Convolution samples features at the $K^{2}$ warped locations using bilinear interpolation, aligning the receptive field with speech formant trajectories.

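The channel bookkeeping and the bilinear sampling step named in the Figure 4 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names `dfe_channel_split` and `bilinear_sample` are hypothetical, the feature map is a plain 2-D list, and samples that fall outside the map are assumed to contribute zero. The sketch also assumes that "initialized to zero" refers to the values before the sigmoid, in which case every modulation mask starts at 0.5.

```python
import math

def dfe_channel_split(k):
    """Return (offset_channels, mask_channels, total) for a k x k kernel."""
    offsets = 2 * k * k   # one (dy, dx) pair per kernel tap
    masks = k * k         # one modulation scalar per kernel tap
    return offsets, masks, offsets + masks

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bilinear_sample(feat, y, x):
    """Sample the 2-D list `feat` at fractional (y, x).

    Neighbours outside the map are treated as zero (an assumption).
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    wy, wx = y - y0, x - x0
    total = 0.0
    for yy, weight_y in ((y0, 1.0 - wy), (y0 + 1, wy)):
        for xx, weight_x in ((x0, 1.0 - wx), (x0 + 1, wx)):
            if 0 <= yy < h and 0 <= xx < w:
                total += weight_y * weight_x * feat[yy][xx]
    return total

# K = 3 gives 18 offset channels + 9 mask channels = 27 in total.
print(dfe_channel_split(3))

# A zero offset reproduces the regular grid sample; a fractional offset
# blends the four surrounding values.
feat = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(feat, 1, 1))      # regular grid position
print(bilinear_sample(feat, 0.5, 0.5))  # warped position between four bins
```

With zero offsets the sampling grid coincides with that of an ordinary $3 \times 3$ convolution, so under the assumptions above the module starts from standard convolution behaviour (scaled by the initial mask value) and learns to deviate from it.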

Figure 5. The Phase-Guided Feature Conditioner (PGFC). A DWConv $3 \times 3$ approximates local phase gradients; GELU → Conv $1 \times 1$ → sigmoid produces a per-channel gate $α$ that suppresses incoherent noise activations before the SSM stage.

Figure 6. The DOM Mamba Block and CDGF. The input is projected to 32 channels and fed to a temporal Mamba, a frequency Mamba, and TCA in parallel. TCA generates sigmoid gates $G_{T}$ and $G_{F}$ that element-wise modulate the respective Mamba outputs before concatenation and final projection.

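The data flow described in the Figure 6 caption (two axis-wise scans, two sigmoid gates, concatenation, projection) can be illustrated with a short NumPy sketch. Several parts are stand-ins and should not be read as the published design: `placeholder_scan` replaces each Mamba-2 instance with a causal running mean, the gate computation replaces TCA with global average pooling followed by a linear map, the gates are assumed to be per-channel, and all names and weight shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def placeholder_scan(x, axis):
    """Stand-in for a Mamba-2 scan: causal running mean along `axis`."""
    steps = np.arange(1, x.shape[axis] + 1)
    shape = [1] * x.ndim
    shape[axis] = -1
    return np.cumsum(x, axis=axis) / steps.reshape(shape)

def cdgf(x, w_gate_t, w_gate_f, w_proj):
    """Gated fusion of a time scan and a frequency scan.

    x: (C, T, F) feature map; returns a (C_out, T, F) feature map.
    """
    y_t = placeholder_scan(x, axis=1)      # sequence runs along time
    y_f = placeholder_scan(x, axis=2)      # sequence runs along frequency
    pooled = x.mean(axis=(1, 2))           # (C,) descriptor, TCA stand-in
    g_t = sigmoid(w_gate_t @ pooled)[:, None, None]   # gate for the time branch
    g_f = sigmoid(w_gate_f @ pooled)[:, None, None]   # gate for the frequency branch
    fused = np.concatenate([g_t * y_t, g_f * y_f], axis=0)   # (2C, T, F)
    return np.einsum("oc,ctf->otf", w_proj, fused)           # 1x1 projection

rng = np.random.default_rng(0)
C, T, F = 32, 10, 8
x = rng.standard_normal((C, T, F))
out = cdgf(
    x,
    rng.standard_normal((C, C)),
    rng.standard_normal((C, C)),
    rng.standard_normal((C, 2 * C)),
)
print(out.shape)
```

Because each branch has its own gate, the fusion can down-weight the time scan and the frequency scan independently for every channel, which is the behaviour the caption attributes to $G_{T}$ and $G_{F}$.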

Figure 7. Gate activation maps of the bottleneck DOM Mamba Block (channel-averaged). Bright yellow (≈1): gate open; dark purple (≈0): gate suppressed. Each panel shows (top) the noisy input spectrogram, (middle) Time Gate $G_{T}$, and (bottom) Frequency Gate $G_{F}$. (Top figure): p232_281 (SNR 12.5 dB, cafe noise). (Bottom figure): p232_283 (SNR 2.5 dB, cafe noise). $G_{T}$ displays sharp, structured horizontal suppression bands that are frequency-selective and aligned with inter-harmonic gaps. $G_{F}$ produces smoother, spatially diffuse suppression concentrated in the 4000–6000 Hz band during noise-only frames. The visual distinction between the two gates validates the complementary roles of temporal and frequency scanning in CDGF.

Figure 8. Attention-Based Skip Connection (ABSC). The encoder feature $E$ and decoder feature $D$ are each projected by a depthwise $1 \times 1$ convolution before global average pooling and concatenation. A bottleneck network produces a per-channel gate $α \in (0, 1)^{C}$; the output is the gated residual $D + α ⊙ E$.

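The gated residual $D + α ⊙ E$ from the Figure 8 caption can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: the function name `absc`, the weight shapes, the hidden width of 4, the ReLU inside the bottleneck, and the omission of biases are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def absc(enc, dec, scale_e, scale_d, w1, w2):
    """Gated skip connection: returns (dec + alpha * enc, alpha).

    enc, dec: (C, T, F) encoder and decoder features.
    scale_e, scale_d: (C,) depthwise 1x1 weights, one scale per channel.
    w1: (H, 2C) and w2: (C, H) bottleneck weights.
    """
    # A depthwise 1x1 convolution reduces to a per-channel scale.
    e_proj = scale_e[:, None, None] * enc
    d_proj = scale_d[:, None, None] * dec
    # Global average pooling over time and frequency, then concatenation.
    desc = np.concatenate([e_proj.mean(axis=(1, 2)), d_proj.mean(axis=(1, 2))])
    # Bottleneck: reduce, ReLU (assumed), expand, sigmoid.
    hidden = np.maximum(w1 @ desc, 0.0)
    alpha = sigmoid(w2 @ hidden)            # (C,), each entry in (0, 1)
    return dec + alpha[:, None, None] * enc, alpha

rng = np.random.default_rng(0)
C, T, F, H = 16, 10, 8, 4
enc = rng.standard_normal((C, T, F))
dec = rng.standard_normal((C, T, F))
out, alpha = absc(
    enc, dec, np.ones(C), np.ones(C),
    rng.standard_normal((H, 2 * C)), rng.standard_normal((C, H)),
)
print(out.shape, alpha.shape)
```

Unlike a concatenation skip, which would hand the decoder $2C$ channels, the gated residual keeps the channel count at $C$, so no widened convolution is needed after the merge.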

Figure 9. Spectrogram comparison on a test utterance (p232_283, SNR 2.5 dB, cafe noise) from the VoiceBank-DEMAND test set. (a) Clean speech reference; (b) noisy input; (c) enhanced by MUSE; (d) enhanced by MUSE++; (e) enhanced by DOM-MUSE (ours). This figure is provided for qualitative illustration; quantitative results are in Table 1 and Table 5.

Figure 10. Spectrogram comparison on a second test utterance (p232_281, SNR 12.5 dB, cafe noise) from the VoiceBank-DEMAND test set. (a) Clean speech reference; (b) noisy input; (c) enhanced by MUSE; (d) enhanced by MUSE++; (e) enhanced by DOM-MUSE (ours).

Table 1. SE performance on the VoiceBank-DEMAND test set for the MUSE model family. MUSE (reported) denotes the scores from the original paper [18]; MUSE (reproduced) denotes our reproduction using the official repository. MUSE++ is trained with dynamic SNR augmentation and the augmented multi-objective loss [19]; DOM-MUSE uses neither. Bold indicates the best score in each column.

Model	PESQ	CSIG	CBAK	COVL	STOI	# Para. (M) ↓
MUSE (reported)	3.37	4.63	3.80	4.10	0.95	0.51
MUSE (reproduced)	3.3475	4.6163	3.7965	4.0827	0.9506	0.51
MUSE++	3.3636	4.6619	3.8584	4.1209	0.9538	0.17
DOM-MUSE (proposed)	3.4241	4.6738	3.8224	4.1531	0.9522	0.39
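
The gains quoted in the abstract can be recomputed from the MUSE (reproduced) and DOM-MUSE rows of Table 1. The short check below uses exact decimal arithmetic so that the rounding to three places is not disturbed by floating-point error; the variable names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

metrics = ["PESQ", "CSIG", "CBAK", "COVL", "STOI"]
muse_reproduced = ["3.3475", "4.6163", "3.7965", "4.0827", "0.9506"]
dom_muse = ["3.4241", "4.6738", "3.8224", "4.1531", "0.9522"]

# Difference per metric, rounded to three decimal places.
deltas = {
    name: (Decimal(new) - Decimal(old)).quantize(
        Decimal("0.001"), rounding=ROUND_HALF_UP
    )
    for name, old, new in zip(metrics, muse_reproduced, dom_muse)
}

# Relative parameter reduction from 0.51 M to 0.39 M, in percent.
reduction_pct = (
    (Decimal("0.51") - Decimal("0.39")) / Decimal("0.51") * 100
).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

for name, delta in deltas.items():
    print(f"{name}: +{delta}")
print(f"parameter reduction: {reduction_pct}%")
```

The differences come out as +0.077, +0.058, +0.026, +0.070, and +0.002, matching the abstract. The unrounded parameter reduction is about 23.5%, which the abstract reports as 24%.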

Table 2. Wilcoxon signed-rank test (two-sided, $N = 824$) comparing DOM-MUSE vs. MUSE++ on per-utterance scores. $Δ$ = DOM-MUSE − MUSE++. All differences are statistically highly significant ($p < 0.001$).

Metric	$Δ$ (Mean)	Test Statistic	p-Value (Two-Sided)
PESQ	$+0.0605$	102,315.5	$4.32 \times 10^{-23}$
CSIG	$+0.0120$	52,732.5	$6.58 \times 10^{-5}$
CBAK	$-0.0360$	109,089.0	$1.18 \times 10^{-18}$
COVL	$+0.0322$	122,743.0	$9.60 \times 10^{-11}$
STOI	$-0.0017$	129,798.0	$1.03 \times 10^{-7}$
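
For readers who want to run a comparable paired test on their own per-utterance scores, a self-contained sketch of the two-sided Wilcoxon signed-rank test is given below. The article does not say which implementation produced Table 2, so the conventions here are assumptions: zero differences are discarded, tied absolute differences share their average rank, the statistic is the smaller of the two signed rank sums, and the p-value uses the normal approximation without continuity correction. The function name and the toy scores are hypothetical.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (statistic, p_value) with statistic = min(W+, W-).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of equal absolute differences starting at position i.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0                # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    stat = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (stat - mean) / math.sqrt(var)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2.0)))
    return stat, p_value

# Toy paired scores (not taken from the paper).
model_a = [2.0, 4.0, 6.0, 8.0, 10.0]
model_b = [1.0, 1.0, 1.0, 1.0, 1.0]
stat, p_value = wilcoxon_signed_rank(model_a, model_b)
print(stat, p_value)
```

With $N = 824$ pairs the rank sums total $824 \cdot 825 / 2 = 339{,}900$, so a statistic defined as the smaller rank sum cannot exceed 169,950. All five test statistics in Table 2 fall below that bound, which is consistent with this convention, although the table alone does not confirm it.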

Table 3. Parameter count and computational efficiency on the VoiceBank-DEMAND test set. RTF and IFT: lower is better. THP: higher is better. VRAM: lower is better.

Model	Para. (M) ↓	RTF ↓	IFT (s) ↓	THP (utt/s) ↑	VRAM (MB) ↓
MUSE [18]	0.51	0.1038	218	9.64	7584
MUSE++ [19]	0.17	0.0116	27	85.98	1198
DOM-MUSE (ours)	0.39	0.0645	137	15.50	4506

Table 4. Ablation study on the VoiceBank-DEMAND test set. All variants share the same U-Net skeleton as DOM-MUSE. Bold indicates the best score in each column.

Variant	PESQ	CSIG	CBAK	COVL	STOI	Para.
DOM-MUSE	3.4241	4.6738	3.8224	4.1531	0.9522	0.39 M
A3 (w/o DFE, PGFC)	3.4024	4.6555	3.8269	4.1343	0.9513	0.39 M
A2 (w/o CDGF, DFE, PGFC)	3.3609	4.6354	3.7878	4.0950	0.9509	0.31 M
A1 (w/o CDGF, DFE, PGFC, GS-FFN)	3.3785	4.6368	3.7623	4.1006	0.9486	0.30 M

Table 5. Per-SNR performance on the VoiceBank-DEMAND test set. MUSE++ uses dynamic SNR augmentation and augmented multi-objective loss. Bold indicates the best score at each SNR for each metric.

SNR	Method	PESQ	CSIG	CBAK	COVL	STOI
2.5 dB	MUSE	2.7786	4.1865	3.3754	3.5340	0.9190
	MUSE++	2.8122	4.2659	3.4221	3.5924	0.9237
	DOM-MUSE	2.8989	4.2934	3.4104	3.6493	0.9217
7.5 dB	MUSE	3.2512	4.5741	3.7166	3.9935	0.9498
	MUSE++	3.2848	4.6320	3.7749	4.0440	0.9545
	DOM-MUSE	3.3526	4.6443	3.7506	4.0836	0.9523
12.5 dB	MUSE	3.5440	4.7885	3.9359	4.2739	0.9629
	MUSE++	3.5472	4.8180	3.9963	4.3007	0.9652
	DOM-MUSE	3.5910	4.8205	3.9536	4.3171	0.9640
17.5 dB	MUSE	3.8226	4.9202	4.1644	4.5355	0.9709
	MUSE++	3.8163	4.9355	4.2455	4.5522	0.9722
	DOM-MUSE	3.8597	4.9411	4.1798	4.5681	0.9709

Table 6. SE performance (rounded to two decimal places) of DOM-MUSE and representative SE methods on the VoiceBank-DEMAND test set. Wiener filtering [32] scores are reported in [7]; LogMMSE [5] is implemented using [33]; IMSE scores are from [27]. Remaining neural method scores are compiled from [18]. MUSE++ uses dynamic SNR augmentation and augmented multi-objective loss [19]; DOM-MUSE uses neither. Bold indicates the best result in each column; “—” indicates not reported.

Method	# Para. (M) ↓	PESQ	CSIG	CBAK	COVL	STOI
Noisy	—	1.97	3.35	2.44	2.63	0.91
Wiener [32]	—	2.22	3.23	2.68	2.63	—
LogMMSE [5]	—	2.34	3.67	3.12	3.04	0.91
TSTNN [13]	0.92	2.96	4.33	3.53	3.67	0.95
DB-AIAT [8]	2.81	3.31	4.61	3.75	3.96	—
DPT-FSNet [9]	0.88	3.33	4.58	3.72	4.00	0.96
MetricGAN-OKDv2 [14]	0.82	3.12	4.27	3.16	3.71	0.95
MANNER-S [15]	0.90	3.06	4.42	3.58	3.77	0.95
MUSE [18]	0.51	3.35	4.62	3.80	4.08	0.95
IMSE [27]	0.43	3.40	4.67	—	4.14	—
MUSE++ [19]	0.17	3.36	4.66	3.86	4.12	0.95
DOM-MUSE (ours)	0.39	3.42	4.67	3.82	4.15	0.95

Table 7. Comparison of DOM-MUSE and MUSE++ under identical augmented training conditions (dynamic SNR augmentation + augmented multi-objective loss). Bold indicates the best score in each column.

Model	PESQ	CSIG	CBAK	COVL	STOI	# Para. (M)
MUSE++ (aug.) [19]	3.3636	4.6619	3.8584	4.1209	0.9538	0.17
DOM-MUSE (aug.)	3.4609	4.7366	3.9351	4.2165	0.9586	0.39

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

