1. Introduction
Speech enhancement (SE) aims to recover clean speech from noise-corrupted recordings and serves as a foundational front-end component for a wide range of applications, including mobile telephony, hearing aids, smart assistants, remote conferencing, and automatic speech recognition (ASR) [1, 2, 3]. Despite decades of research, SE remains challenging in real-world conditions where noise is non-stationary, the signal-to-noise ratio (SNR) is low, and computational resources are constrained.
Classical SE methods, such as spectral subtraction [4], Wiener filtering, and the minimum mean-square error (MMSE) log-spectral amplitude estimator [5], primarily operate in the frequency domain by estimating a noise-suppressing gain applied to the magnitude spectrum. Although computationally efficient, these approaches largely ignore phase information [6] and degrade rapidly under non-stationary or highly dynamic noise conditions.
The advent of deep learning transformed SE by enabling complex, data-driven mappings between noisy and clean speech. Convolutional neural networks (CNNs), recurrent networks, and later U-Net encoder–decoder architectures [7] demonstrated strong generalization to diverse acoustic environments. Further gains were achieved by adopting Transformer-based attention mechanisms to model long-range spectro-temporal dependencies, leading to a series of high-performing yet parameter-heavy models [8, 9]. Alongside these performance advances, growing awareness of phase information as a critical component for perceptual quality [6] prompted the development of phase-aware architectures, such as PHASEN [10] and magnitude-phase dual-path networks [11, 12], which jointly predict magnitude masks and phase corrections.
The scalability gap between Transformer-based models and the requirements of resource-constrained deployment sparked renewed interest in lightweight SE architectures. Methods such as TSTNN [13], DPT-FSNet [9], MetricGAN-OKDv2 [14], and MANNER-S [15] pushed the efficiency frontier through bottleneck compression, grouped convolutions, and efficient attention approximations. In parallel, state space models (SSMs) emerged as a compelling alternative to self-attention: Mamba [16] and its successor Mamba-2 [17] achieve linear complexity with respect to sequence length while maintaining competitive modeling capacity, making them attractive for long-sequence audio processing.
Building on this foundation, MUSE [18] introduced a lightweight U-Net-based SE framework centered on the Multi-path Enhanced Taylor (MET) Transformer block. By combining Deformable Embedding, Channel-and-Spatial Attention (CSA), and a Taylor-series approximation to softmax attention, MUSE achieves competitive SE performance with only 0.51 M parameters. Its successor, MUSE++ [19], further reduced model complexity by replacing the MET Transformer with a 1D Mamba-2 block, which cut the parameter count to 0.17 M, and supplemented the architecture with dynamic SNR-based data augmentation and a multi-objective loss function to compensate for the reduced representational capacity.
Despite these advances, both MUSE and MUSE++ share several fundamental limitations that are rooted in concrete acoustic and signal-processing properties. First, standard rectangular convolution grids impose an implicit assumption that speech features are locally stationary and geometrically uniform in the time–frequency plane. This assumption is violated by the curved trajectories of formants and the harmonic structures of voiced speech, which follow the glottal pulse rate and vocal tract resonances rather than fixed horizontal lines. Second, MUSE++ models sequential dependencies along a single (typically temporal) axis using a unidirectional 1D Mamba, thereby ignoring cross-frequency correlations that arise from the harmonic energy distribution across frequency bins for a given fundamental frequency. Third, local phase gradients provide a physically grounded cue for separating speech from noise: harmonics produce smooth, predictable phase progressions, while noise produces erratic, high-variance gradients. Neither model exploits this property to condition the feature extraction process itself; the phase is treated only as a post hoc output target. Fourth, the standard concatenation-based skip connections in both models treat all encoder features indiscriminately, potentially propagating redundant or noise-corrupted information to the decoder.
To address these limitations simultaneously, we propose DOM-MUSE (Deformable Omnidirectional Mamba-based MUSE), a lightweight SE framework that retains the efficient U-Net skeleton of MUSE while introducing four targeted innovations, each derived from the above acoustic and signal-processing observations. The main contributions of this work are as follows:
Deformable Feature Extractor (DFE). We introduce a learnable deformable convolution module that predicts per-location 2D spatial offsets and modulation masks, warping the feature sampling grid to align with speech formant trajectories and harmonic structures. This enables the subsequent SSM to process geometrically coherent feature sequences rather than arbitrary rectangular patches.
DOM Mamba Block with Cross-Dimensional Gated Fusion (CDGF). We design a parallel dual-branch Mamba block that scans the time and frequency axes independently with two dedicated Mamba-2 instances. A TCA module computes a channel covariance projection that generates axis-specific semantic gates, allowing high-level channel semantics to modulate state transitions along both the temporal and spectral dimensions simultaneously, instead of fusing the two branches by naive concatenation, which treats all channel activations equally.
Phase-Guided Feature Conditioner (PGFC). We propose a lightweight conditioning module that uses local phase-gradient coherence as a discriminative signal to generate per-channel attention gates, suppressing noise-dominated activations prior to the SSM stage. This exploits the statistical distinction between smooth harmonic phase progressions and erratic noise-phase patterns, making the feature extraction pathway implicitly phase-aware without incurring significant computational overhead.
Attention-Based Skip Connection (ABSC). We replace the conventional concatenation skip connection with a content-adaptive channel gate that computes a bottleneck attention weight from both encoder and decoder features. This selectively incorporates encoder information based on its relevance to the current decoding context, avoiding the channel-dimension doubling and information conflict associated with naive concatenation.
Experiments on the VoiceBank-DEMAND benchmark [20, 21] demonstrate that DOM-MUSE outperforms the reproduced MUSE on all five evaluation metrics, namely PESQ (+0.077), CSIG (+0.058), CBAK (+0.026), COVL (+0.070), and STOI (+0.002), while reducing the parameter count by 24% (0.51 M to 0.39 M). Compared with MUSE++, DOM-MUSE achieves higher PESQ (+0.061), COVL (+0.032), and CSIG (+0.012), whereas MUSE++ retains a small lead on CBAK and STOI; this suggests that the proposed architectural innovations improve perceptual quality beyond what a plain 1D Mamba can deliver.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the MUSE and MUSE++ backbone architectures. Section 4 presents the proposed DOM-MUSE framework and its individual components. Section 5 describes the experimental setup and evaluation metrics. Section 6 reports the quantitative results with analysis. Section 7 concludes this paper and outlines future directions.
3. Backbone: MUSE and MUSE++
DOM-MUSE is built directly upon the MUSE++ framework [19], which itself extends the original MUSE architecture [18]. This section briefly reviews the key design choices of both models, as they form the starting point for our proposed improvements.
3.1. MUSE
MUSE (Multi-path Enhanced Taylor Transformer-based U-Net for Speech Enhancement) is a lightweight SE framework with only 0.51 M parameters. Given a time-domain input x, the Short-Time Fourier Transform (STFT) produces a magnitude spectrogram and phase spectrogram . The magnitude is compressed using a power-law transform (with ), and the compressed magnitude and raw phase are concatenated to form a two-channel input tensor that enters a U-Net encoder–decoder.
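The input construction described above can be sketched for a single STFT frame. This is a minimal illustration rather than the MUSE implementation: the function name is hypothetical, and the exponent value `0.3` is a placeholder chosen only for the example, since the compression exponent is not stated in this text.

```python
import cmath

def compress_and_stack(spectrum, exponent=0.3):
    """Turn one frame of complex STFT bins into [compressed magnitude, phase].

    exponent=0.3 is a placeholder for illustration only.
    """
    magnitude = [abs(z) ** exponent for z in spectrum]  # power-law compression
    phase = [cmath.phase(z) for z in spectrum]          # raw phase in [-pi, pi]
    return [magnitude, phase]                           # two input channels

frame = [1 + 0j, 0 + 2j, -4 + 0j]
compressed_mag, raw_phase = compress_and_stack(frame)
```

Compression with an exponent below one narrows the dynamic range of the magnitude, so low-energy bins are not swamped by high-energy ones when both channels enter the encoder.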
The core processing unit of MUSE is the Multi-path Enhanced Taylor (MET) Transformer block, which integrates three components: (1) Deformable Embedding (DE) that adapts the receptive field to voiceprint geometry, (2) a Taylor-MSA branch that approximates softmax attention in linear time via a first-order Taylor expansion, and (3) a Channel-and-Spatial Attention (CSA) branch that compensates for information loss in the Taylor approximation. The outputs of the three branches are fused element-wise and passed through a feed-forward network (FFN). At the output, enhanced magnitude and phase features are decoded by separate branches and recombined via the inverse STFT (iSTFT).
3.2. MUSE++
MUSE++ replaces the MET Transformer block with a 1D Mamba-2 module, reducing the parameter count dramatically from 0.51 M to 0.17 M. The 1D Mamba-2 block processes a flattened time–frequency sequence using a selective state space model with linear complexity, defined by the recurrence:
where
is a learned forgetting factor and
are learned projections of the input at step t. To compensate for the reduced model capacity, MUSE++ introduces dynamic SNR-based data augmentation and an augmented multi-objective loss that adds STFT consistency, time-domain, and multi-resolution STFT terms to the original MUSE loss. For the augmentation, MUSE++ mixes clean utterances with noise during training at SNR values sampled uniformly from
dB—a much broader range than the fixed
dB in the standard VoiceBank-DEMAND protocol—to improve robustness across diverse conditions.
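To make the selective state space scan concrete, the sketch below runs a recurrence of the form commonly used for Mamba-2-style models, h_t = a_t · h_{t-1} + B_t · x_t with output y_t = ⟨C_t, h_t⟩. This form is an assumption made for illustration and is not copied from the paper's equation; the function name is hypothetical, and the per-step quantities a_t, B_t, and C_t are passed in directly instead of being computed by learned projections.

```python
def selective_scan(x, a, B, C):
    """Scalar-decay state space recurrence over a 1D sequence.

    Assumed form (illustrative, not the paper's exact equation):
        h_t = a_t * h_{t-1} + B_t * x_t
        y_t = <C_t, h_t>
    x: input samples; a: per-step forgetting factors in (0, 1);
    B, C: per-step vectors whose length is the state size.
    """
    state = [0.0] * len(B[0])
    outputs = []
    for x_t, a_t, B_t, C_t in zip(x, a, B, C):
        # Decay the previous state, then write the current input into it.
        state = [a_t * h + b * x_t for h, b in zip(state, B_t)]
        # Read the output as a projection of the updated state.
        outputs.append(sum(c * h for c, h in zip(C_t, state)))
    return outputs

x = [1.0, 0.0, 0.0, 2.0]
a = [0.5] * 4
B = [[1.0, 1.0]] * 4
C = [[1.0, 0.0]] * 4
y = selective_scan(x, a, B, C)
```

The loop visits each step once and carries a fixed-size state, which is where the linear complexity in sequence length comes from. Because the state only looks backward, a single scan of this kind is unidirectional.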
Note that in MUSE++, the 1D Mamba-2 block placed at each encoder and decoder level performs the same mathematical operation, namely the selective state space scan defined above. The blocks differ only in the tensor dimensions they process: shallower encoder levels handle features with higher spatial resolution and fewer channels (e.g., at encoder level one), while deeper levels process lower-resolution features with more channels (e.g., at the bottleneck), following the standard U-Net channel schedule. There is no structural or operational distinction between encoder-side and decoder-side Mamba blocks. The same principle carries over to DOM-MUSE: all DOM Blocks share identical internal structure across all encoder, bottleneck, and decoder levels, differing only in input channel width.
3.3. Limitations of the MUSE++ Backbone
Although MUSE++ achieves a remarkable reduction in the parameter count, directly substituting the MET Transformer with a 1D Mamba block, without the augmented training strategy, yields performance only comparable to MUSE, indicating that the 1D Mamba backbone by itself has limited representational capacity. Moreover, the 1D unidirectional scanning ignores cross-frequency dependencies, phase information is used only as an output target, and the concatenation-based skip connections treat all encoder features uniformly. These structural weaknesses motivate the four innovations introduced in DOM-MUSE, detailed in the following section.
4. Proposed Method: DOM-MUSE
We propose DOM-MUSE, a lightweight speech enhancement framework that extends the MUSE++ backbone with four targeted innovations: a Deformable Feature Extractor (DFE), a Cross-Dimensional Gated Fusion (CDGF) mechanism, a Phase-Guided Feature Conditioner (PGFC), and an Attention-Based Skip Connection (ABSC). Rather than replacing the entire backbone, each innovation addresses a specific structural weakness while preserving the efficiency of the original design.
4.1. Overall Architecture
DOM-MUSE retains the U-Net-style encoder–decoder skeleton of MUSE and MUSE++, with two structural changes applied consistently: the 1D Mamba blocks are replaced by the proposed DOM Block, and the concatenation-based skip connections are replaced by ABSC.
Figure 1 and Figure 2 place the two architectures side by side.
The input waveform x is transformed by STFT into magnitude
and phase
spectrograms. Following MUSE [18], the magnitude is power-law compressed (
) and concatenated with the raw phase to form a two-channel feature map, which passes through a dilated Dense-Net codec (DDN) inherited from MP-SENet [12]. The features traverse two encoder stages (each consisting of DOM Blocks and a pixel-unshuffle downsampling layer) and a bottleneck DOM Block, before being decoded symmetrically with pixel-shuffle upsampling and ABSC skip connections. After decoding, two independent refinement DOM Blocks separately process the magnitude and phase streams. The clean phase is recovered via Cartesian coordinates (real and imaginary parts) as
; the enhanced waveform is reconstructed by iSTFT.
4.2. DOM Block
The DOM Block is the fundamental processing unit of DOM-MUSE. Despite its structural resemblance to a Transformer block, it contains no spatial self-attention; the attention role is fulfilled by the Mamba-based CDGF mechanism and TCA, both operating in the channel domain. The block belongs to the SSM–CNN hybrid family. As shown in Figure 3, each DOM Block applies batch normalization (BN) to the input
and splits processing into a local branch and a global branch. Their outputs are merged by a
convolution and combined with
via a Scalar Attention Residual (
), followed by GS-FFN combined via a second Scalar Attention Residual (
):
Scalar Attention Residual (Scalar AttnRes). To prevent indiscriminate propagation of residual features, both residual connections use a Scalar AttnRes [23] rather than standard addition. Given a main feature
and a branch feature
, both are globally average-pooled and projected to an 8-dimensional subspace via bias-free linear layers, yielding
. A data-dependent scalar gate is computed as:
Local branch. A depthwise convolution (DWConv: a convolution in which each channel is processed by its own independent filter) paired with a pointwise convolution serves as a lightweight local feature extractor, capturing fine-grained spectro-temporal textures while preserving the harmonic structure critical for artifact suppression. The extracted features are passed through SiLU (Sigmoid Linear Unit, SiLU(x) = x · σ(x)) and GELU (Gaussian Error Linear Unit) activations to selectively amplify perceptually relevant components.
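The depthwise-plus-pointwise pairing can be illustrated with a small pure-Python sketch. It is deliberately 1D and bias-free to stay short, whereas the local branch operates on 2D feature maps; all function names here are hypothetical, and the kernels are hand-picked values, not learned weights.

```python
def depthwise_conv1d(x, kernels):
    """x: [channels][length]; one odd-length filter per channel, zero padding."""
    out = []
    for channel, kernel in zip(x, kernels):
        half = len(kernel) // 2
        padded = [0.0] * half + channel + [0.0] * half
        out.append([sum(k * padded[i + j] for j, k in enumerate(kernel))
                    for i in range(len(channel))])
    return out

def pointwise_conv1d(x, weights):
    """weights: [out_channels][in_channels]; mixes channels at each position."""
    length = len(x[0])
    return [[sum(w * x[c][i] for c, w in enumerate(row)) for i in range(length)]
            for row in weights]

def separable_params(c_in, c_out, k):
    return c_in * k + c_in * c_out      # depthwise filters + pointwise mixing

def standard_params(c_in, c_out, k):
    return c_in * c_out * k             # one dense filter per (in, out) pair

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
local = depthwise_conv1d(x, [[0.0, 1.0, 0.0], [1.0, 0.0, -1.0]])
mixed = pointwise_conv1d(local, [[1.0, 1.0]])
```

For 16 input and output channels with a length-3 kernel, `separable_params` gives 304 weights against 768 for `standard_params`, which is the kind of saving that makes this pairing attractive in a lightweight block.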
Global branch. The global branch passes the normalized input through three sequential operations: DFE, PGFC, and the DOM Mamba Block. The ordering is theoretically motivated: geometric alignment must precede phase-guided gating because phase coherence is best assessed on formant-aligned features rather than on distorted rectangular-grid samples; similarly, noise suppression via PGFC must precede SSM scanning to prevent noise-corrupted activations from propagating through the recurrent state.
4.2.1. Deformable Feature Extractor (DFE)
Standard convolutions sample features on a fixed rectangular grid that cannot conform to the curved harmonic structures in speech spectrograms. The DFE addresses this by learning K independent per-location 2D spatial offsets (K = 9, i.e., nine sampling points per position) that warp the sampling grid toward the underlying speech geometry, following the modulated deformable convolution framework of [26]. As shown in Figure 4, a two-stage offset network (DWConv
→ Conv
) predicts the 2D spatial offsets (2K values per position) and K sigmoid modulation masks m, for a total of 3K = 27 output channels. The offsets are initialized to zero so that training begins from a standard convolution and progressively adapts to data-driven deformations. A modulated deformable convolution then samples features at all K warped locations via bilinear interpolation.
The choice of K = 9 follows the standard setting established in the original deformable convolution literature [26] and is motivated by three considerations. First, K = 9 yields a 3 × 3 sampling grid that is center-symmetric, providing balanced coverage around each feature location and preserving the spatial locality essential for formant trajectory alignment. Second, K = 4 provides only four sampling points, which is insufficient to capture the curvature of formant trajectories; the asymmetric 2 × 2 layout also introduces an implicit spatial bias that can destabilize the offset prediction. Third, increasing to K = 25 would expand the offset network output from 27 to 75 channels, substantially increasing the parameter count and FLOPs in a module whose purpose is lightweight preprocessing, which contradicts the efficiency objective of DOM-MUSE. A systematic sensitivity analysis of K across different values would require training separate model instances and is left as future work; however, the above considerations provide a well-grounded rationale for K = 9 as the appropriate operating point.
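The sampling step of a modulated deformable convolution can be sketched for one output position with nine sampling points. This is a single-channel illustration under simplifying assumptions: the function names are hypothetical, the offsets and masks are supplied by hand instead of being predicted by an offset network, and samples that fall outside the map are treated as zero.

```python
import math

def bilinear(feature, y, x):
    """Sample a 2D map at fractional (y, x); out-of-range neighbors count as zero."""
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < len(feature) and 0 <= xx < len(feature[0]):
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def deform_sample(feature, center, weights, offsets, masks):
    """One output value of a 3x3 modulated deformable convolution (K = 9)."""
    cy, cx = center
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    total = 0.0
    for (dy, dx), w, (oy, ox), m in zip(grid, weights, offsets, masks):
        # Each grid point is shifted by its own offset and scaled by its mask.
        total += w * m * bilinear(feature, cy + dy + oy, cx + dx + ox)
    return total

def offset_channels(k):
    return 2 * k + k                    # 2D offsets plus one mask per point

feature = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
ones = [1.0] * 9
zero_offsets = [(0.0, 0.0)] * 9
plain = deform_sample(feature, (1, 1), ones, zero_offsets, ones)

shifted_offsets = list(zero_offsets)
shifted_offsets[4] = (0.0, 0.5)         # move only the center sample half a bin
shifted = deform_sample(feature, (1, 1), ones, shifted_offsets, ones)
```

With all offsets at zero and all masks at one, the result equals an ordinary 3 × 3 convolution, which mirrors the zero-initialization described above. `offset_channels(9)` and `offset_channels(25)` return 27 and 75, matching the channel counts discussed in the text.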
4.2.2. Phase-Guided Feature Conditioner (PGFC)
Phase coherence is a reliable indicator of voiced speech that can be characterized mathematically by the local phase gradient
. For voiced speech, this gradient follows a smooth, predictable pattern determined by the fundamental frequency
; for noise, it produces erratic gradients with high variance. The PGFC exploits this statistical distinction by computing local phase-gradient features and deriving a per-channel gate that selectively suppresses noise-dominated activations before the SSM stage. This is one of the first lightweight U-Net-style SE modules to use local phase gradients to gate SSM inputs before sequence modeling, rather than treating the phase only as an output target [6, 10, 12]. Unlike prior phase-aware methods that operate only at the output decoder stage, PGFC operates directly on the hybrid complex-aware feature maps produced by the DFE, allowing noise suppression to occur seamlessly in the latent feature space.
A
DWConv approximates local differential operators on the deformed features. These responses are then passed through a GELU activation, a
pointwise convolution, and a sigmoid to produce a per channel gate
:
This pre-filters the SSM input with only two lightweight convolution layers, as illustrated in Figure 5.
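The statistical cue behind PGFC can be illustrated on raw phase tracks. The sketch below is not the PGFC module, which applies learned convolutions to feature maps; it only compares the variance of wrapped frame-to-frame phase increments for a steadily advancing phase (a stand-in for a harmonic bin) and for a uniformly random phase (a stand-in for a noise bin). All names and the synthetic signals are assumptions made for the example.

```python
import math
import random

def wrap(angle):
    """Wrap an angle to the principal interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def phase_gradient_variance(phases):
    """Variance of wrapped frame-to-frame phase increments for one bin."""
    grads = [wrap(b - a) for a, b in zip(phases, phases[1:])]
    mean = sum(grads) / len(grads)
    return sum((g - mean) ** 2 for g in grads) / len(grads)

rng = random.Random(0)
harmonic = [wrap(0.7 * t) for t in range(50)]                 # steady advance
noise = [rng.uniform(-math.pi, math.pi) for _ in range(50)]   # random phase

harmonic_var = phase_gradient_variance(harmonic)
noise_var = phase_gradient_variance(noise)
```

The harmonic track yields a variance that is zero up to rounding error, while the random track yields a large one. A gate derived from such a measure would therefore tend to stay open on harmonic regions and close on noise-dominated ones.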
4.2.3. DOM Mamba Block and Cross-Dimensional Gated Fusion (CDGF)
The DOM Mamba Block (Figure 6) processes the time and frequency axes in parallel with two dedicated Mamba-2 instances, fusing their outputs via semantic gates from a TCA module, a mechanism we term Cross-Dimensional Gated Fusion (CDGF).
The theoretical motivation for CDGF over simpler fusion strategies is as follows. Naive concatenation doubles the channel dimension ( cost for subsequent mixing) and treats all channel activations equally, implicitly assuming that temporal and frequency Mamba outputs are always equally informative across all regions—an assumption that does not hold for speech, where voiced segments benefit more from temporal continuity while fricatives and consonants are better characterized by spectral patterns. Additive fusion is equivalent to assuming that the temporal and frequency dimensions are perfectly aligned and equally weighted in the latent space for all time–frequency positions, which is physically unreasonable. By contrast, CDGF uses TCA to compute a channel covariance matrix between the temporal and frequency representations, capturing the cross-channel energy correlations that arise from formant structures spanning multiple frequency bands. This covariance is then used to derive axis-specific gates and , allowing the model to dynamically emphasize whichever dimensional scan is more semantically relevant at each channel and time–frequency location. The ablation study (Table 4, A2 vs. full model) confirms that CDGF provides the largest single-component gain (+0.041 PESQ, +0.039 CBAK), consistent with its role as the primary cross-dimensional integration mechanism.
The input is projected to
by a
convolution. TCA [18] applies L2 normalization to query and key vectors to ensure convergence of the Taylor expansion and introduces a learnable temperature parameter
to dynamically scale the channel covariance projection—empirically vital for stabilizing the cross-dimensional gating signals. Two independent
convolutions followed by sigmoid activations produce gates
and
:
To provide qualitative evidence of CDGF’s behavior, Figure 7 visualizes the Time Gate
and Frequency Gate
(channel-averaged) for two representative utterances at contrasting SNR levels (12.5 dB and 2.5 dB).
Two consistent observations hold across both conditions. First, the Time Gate exhibits structured horizontal suppression bands (dark purple, ≈0) that are frequency-selective and spatially sharp, aligned with high-frequency noise-dominated regions and inter-harmonic gaps in the voiced segments. This reflects the temporal Mamba’s role in tracking harmonic continuity along the time axis: it selectively suppresses channels with low temporal coherence while remaining open at harmonically stable frequencies. Second, the Frequency Gate produces comparatively smoother and more spatially diffuse suppression, with the most prominent dark region concentrated in a contiguous block around 4000–6000 Hz during the early noise-only frames. Unlike the Time Gate’s sharp band structure, the Frequency Gate’s broader patterns are consistent with the frequency Mamba’s role in capturing spectral energy distributions across the frequency axis rather than tracking time-continuous harmonics.
The visual contrast between the two gates (sharp, channel-selective banding for the Time Gate versus smooth, broadband attenuation for the Frequency Gate) indicates that CDGF learns complementary axis-specific gating behavior rather than redundant suppression. Comparing the two SNR conditions, both gates exhibit stronger overall suppression at 2.5 dB (right column), consistent with the heavier noise contamination and with the gating being input-adaptive.
Gated Star FFN (GS-FFN). Inspired by the element-wise product design of StarNet [28], a pointwise convolution expands the channel by
and splits into a content stream
and a gate stream
, each with a
ratio. A
DWConv is applied to the
content stream to introduce local spatial inductive bias:
Compared with a standard two-layer FFN at the same expansion ratio, GS-FFN reduces parameters by approximately one-third.
4.3. Attention-Based Skip Connection (ABSC)
Standard U-Net skip connections concatenate encoder and decoder features channel-wise, doubling the channel dimension and indiscriminately passing all encoder information to the decoder. The proposed ABSC replaces this with a content-adaptive channel gate (Figure 8).
The features
and
are first independently projected via depthwise
convolutions, then globally average-pooled, concatenated, and passed through a bottleneck (Conv
→ SiLU → Conv
→ sigmoid) to produce
:
This preserves the decoder channel dimension and allows the network to learn at each level how much encoder information is beneficial. ABSC can be viewed as the macro-level counterpart to the Scalar AttnRes within the DOM Block: both implement content-adaptive residual gating, but at different structural scales and gate granularities (per channel vector vs. scalar).
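The gating pipeline described above (pool, concatenate, bottleneck, sigmoid) can be sketched on tiny hand-written tensors. Several parts are assumptions and not taken from the paper: the depthwise projections are omitted for brevity, plain matrix products stand in for the bottleneck convolutions on pooled features, and the fusion rule (decoder plus gated encoder) is one plausible way to keep the decoder channel count unchanged. All names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def silu(v):
    return v * sigmoid(v)

def global_avg_pool(feature):
    """feature: [channels][positions] -> one mean per channel."""
    return [sum(ch) / len(ch) for ch in feature]

def linear(weights, vector):
    """weights: [out][in]; stands in for a convolution on pooled features."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def absc(encoder, decoder, w_down, w_up):
    pooled = global_avg_pool(encoder) + global_avg_pool(decoder)  # concat: 2C
    hidden = [silu(v) for v in linear(w_down, pooled)]            # bottleneck
    gate = [sigmoid(v) for v in linear(w_up, hidden)]             # C gates
    # Assumed fusion rule: decoder plus gated encoder (keeps C channels).
    fused = [[d + g * e for d, e in zip(dec_ch, enc_ch)]
             for dec_ch, enc_ch, g in zip(decoder, encoder, gate)]
    return fused, gate

encoder = [[1.0, 3.0], [2.0, 2.0]]
decoder = [[0.0, 1.0], [1.0, 0.0]]
w_down = [[0.0, 0.0, 0.0, 0.0]]   # 2C = 4 inputs -> 1 hidden unit
w_up = [[0.0], [0.0]]             # 1 hidden unit -> C = 2 gates
fused, gate = absc(encoder, decoder, w_down, w_up)
```

With all-zero weights every gate sits at 0.5, so half of each encoder channel is added to the decoder. Trained weights would move each gate toward 0 or 1 depending on the pooled content of both streams.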
4.4. Training Objective
DOM-MUSE is trained with the original MUSE loss [18]:
with
,
,
, and
[12, 22]. The augmented loss of MUSE++ is not used as the primary training configuration, as the PGFC and CDGF already embed sufficient phase-aware inductive biases. Section 6.7 presents additional results under the augmented training protocol.
4.5. Comparison with MUSE and MUSE++
DOM-MUSE shares the same U-Net backbone lineage as MUSE and MUSE++, but differs substantially in how each stage processes features. MUSE relies on the MET Transformer block—a combination of Taylor-approximated attention and deformable patch embedding—achieving strong performance at the cost of a relatively heavy attention mechanism. MUSE++ simplifies this by replacing the MET Transformer entirely with a 1D Mamba-2 block, cutting the parameter count substantially; however, the unidirectional 1D scan sacrifices cross-frequency modeling, and the reduced representational capacity must be compensated by augmented training strategies rather than architectural improvements.
DOM-MUSE redesigns the internal structure of the processing block to address three concrete weaknesses shared by both predecessors: the DFE replaces fixed rectangular grids with learnable warp locations; CDGF processes time and frequency in parallel with semantic gating rather than simple concatenation; and PGFC actively conditions features on local phase coherence rather than treating the phase only as an output supervision target. Scalar AttnRes and ABSC together form a two-level content-adaptive gating system at both the block level and the encoder–decoder interface.
5. Experimental Setup
To evaluate the effectiveness of the proposed DOM-MUSE framework, we conduct experiments on the VoiceBank-DEMAND corpus [20, 21]. This widely used benchmark pairs clean speech from the VoiceBank dataset with diverse noise recordings from the DEMAND database. The training set contains 11,572 utterances from 28 distinct speakers, while the test set comprises 824 utterances from two unseen speakers. During training, clean speech is artificially corrupted by ten categories of DEMAND noise at four SNR levels: 0, 5, 10, and 15 dB. The test set employs five types of DEMAND noise at SNRs of 2.5, 7.5, 12.5, and 17.5 dB. Approximately 200 utterances are reserved for validation.
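Corrupting clean speech at a target SNR amounts to scaling the noise so that the clean-to-noise power ratio matches the requested value. The sketch below shows that scaling on toy signals; it is an illustration of the standard formula, not the corpus preparation script, and the function names are hypothetical.

```python
import math

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add."""
    scale = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [scale * n for n in noise]
    noisy = [s + n for s, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise

clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy, scaled = mix_at_snr(clean, noise, snr_db=5.0)
realized_snr = 10 * math.log10(power(clean) / power(scaled))
```

Calling `mix_at_snr` with each of 0, 5, 10, and 15 dB would reproduce the four training conditions listed above for a given clean and noise pair.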
Time–frequency representation: All models use the STFT as the time–frequency front-end. This choice is motivated by three concrete requirements of our architecture. First, the PGFC relies on explicit phase separation: STFT natively produces a complex spectrum from which phase can be directly extracted and used to compute local phase gradients. Discrete Wavelet Transform (DWT) coefficients are real-valued and do not provide direct access to an instantaneous phase, making pixel-level phase-gradient gating impractical without additional computation. Second, the iSTFT provides a lossless reconstruction path that is a mathematical requirement of our magnitude-phase dual-decoder design; DWT-based reconstruction is more complex and is not universally invertible for arbitrary filter banks. Third, using STFT ensures a fully fair architectural comparison with all competing methods in Table 6, including MUSE, MUSE++, IMSE, and DPT-FSNet, which uniformly adopt STFT as the standard front-end in this benchmark.
Data preprocessing and training protocol: All waveforms are uniformly segmented into chunks of 30,700 samples. The STFT is computed with an FFT (Fast Fourier Transform) size of 510, window length of 510, and hop size of 100, at a sampling rate of 16 kHz. Models are trained for up to 100 epochs using the AdamW optimizer, with an initial learning rate of , an exponential decay factor of 0.99, a weight decay of , and a batch size of two. Early stopping is applied if the validation loss does not improve for 10 consecutive epochs.
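The spectrogram dimensions implied by these settings follow from simple arithmetic. The sketch below assumes a one-sided spectrum and, for the default case, centered framing in which the signal is padded by half the FFT size on each side; whether the authors use centered framing is not stated here, so the frame count is an assumption. The function name is hypothetical.

```python
def stft_shape(num_samples, n_fft, hop, center=True):
    """Return (frequency bins, frames) for a one-sided STFT."""
    freq_bins = n_fft // 2 + 1
    if center:
        frames = 1 + num_samples // hop            # signal padded by n_fft // 2
    else:
        frames = 1 + (num_samples - n_fft) // hop  # only fully covered frames
    return freq_bins, frames

bins, frames = stft_shape(30700, n_fft=510, hop=100)
chunk_seconds = 30700 / 16000
```

An FFT size of 510 gives 256 frequency bins, a power of two that divides cleanly under repeated downsampling. Each 30,700-sample chunk covers about 1.92 s of audio and yields 308 frames under centered framing, or 302 without padding.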
The initial learning rate of
is adopted directly from the MUSE and MUSE++ training protocols [18, 19], ensuring a controlled comparison within the model family and avoiding confounding variables. This value is also consistent with widely used AdamW settings for SSM-based audio models [24], where the learning rate must be kept moderate to avoid gradient instability caused by the recurrent state update dynamics. The batch size of two is dictated by GPU memory constraints (NVIDIA RTX 3060, 12 GB VRAM): variable-length speech utterances require zero-padding or chunking, and a larger batch size causes out-of-memory (OOM) errors. All three models in this comparison (MUSE, MUSE++, DOM-MUSE) are trained with an identical batch size to ensure fair comparison.
Model architecture configuration: DOM-MUSE adopts a two-level U-Net encoder–decoder backbone, with the base channel dimension initialized at 16 and increased linearly at each downsampling stage, yielding channel widths of
. All DOM Mamba Blocks share a fixed internal projection dimension of 32, and each embedded Mamba-2 module is configured with a state size of 16, head dimension of eight, and expansion factor of two. The front-end Dense Encoder and the back-end Mask and Phase Decoders follow the MP-SENet design [12], employing dilated convolutions with dilation rates
and dense skip connections. The magnitude mask is estimated via a learnable sigmoid activation.
Loss function configuration: DOM-MUSE is trained using the original MUSE loss in Equation (10), with weights set to
,
,
, and
, following the configuration in [12, 18].
Evaluation metrics: To benchmark enhancement performance, we employ five widely adopted objective metrics, each assessing a different aspect of the enhanced speech:
1. Perceptual Evaluation of Speech Quality (PESQ) [29]: Ranges from −0.5 to 4.5; higher is better. As an ITU-T P.862 standard, PESQ is validated against subjective mean opinion scores (MOS) and is widely adopted as the primary proxy for perceptual quality in the SE literature [18, 19, 27].
2. Short-Time Objective Intelligibility (STOI) [30]: Scores from 0 to 1; higher values indicate greater intelligibility.
3. Composite Overall Quality (COVL) [31]: Ranges from 0 to 5; an objective MOS predictor for overall perceived speech quality.
4. Composite Signal Distortion (CSIG) [31]: MOS scale (0–5) for signal distortion; higher values indicate less distortion.
5. Composite Background Noise Intrusiveness (CBAK) [31]: MOS scale (0–5) for background noise; higher values indicate more effective suppression.
These standardized metrics enable objective and comprehensive comparison with both the MUSE family of models and other state-of-the-art enhancement systems. No formal subjective listening evaluation (MOS test) is conducted in this work, which is consistent with the established practice in the lightweight SE literature: none of the competing methods in Table 6—including MUSE, MUSE++, IMSE, and DPT-FSNet—report formal subjective assessments. PESQ, as a validated ITU-T proxy for MOS, is considered the primary quality indicator.
6. Experimental Results and Discussions
6.1. Overall Performance Comparison
Table 1 reports the SE performance of MUSE (both the originally reported scores and our reproduction), MUSE++, and the proposed DOM-MUSE on the VoiceBank-DEMAND test set. All models were trained and evaluated under the same protocol described in Section 5.
Statistical significance: To confirm that the reported differences are not attributable to random variation, we performed the Wilcoxon signed-rank test (two-sided) on per-utterance scores across the full test set (824 utterances), comparing DOM-MUSE against MUSE++. Table 2 summarizes the results for all five reported metrics.
All five differences are statistically highly significant (
), confirming that none of the observed gains or trade-offs are attributable to random variation. Notably, the CBAK and STOI differences, where MUSE++ leads, are also statistically significant, which strengthens rather than weakens our interpretation: the CBAK gap is a real, systematic effect, which we attribute to MUSE++’s augmented training strategy, rather than a measurement artifact. This is further supported in Section 6.7, where DOM-MUSE trained with augmented strategies achieves CBAK 3.9351, substantially exceeding MUSE++’s 3.8584.
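The paired test used here can be sketched in pure Python with the large-sample normal approximation. This is an illustrative implementation under stated simplifications (zero differences dropped, tied magnitudes given average ranks, no tie or continuity correction) and it runs on synthetic scores, not on the paper's per-utterance data. In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be used; the function and variable names below are hypothetical.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p) where W = min(W+, W-).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1        # shared rank for tied magnitudes
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail probability
    return w, p

scores_b = [float(i) for i in range(20)]                        # synthetic
scores_a = [v + 0.01 * (k + 1) for k, v in enumerate(scores_b)]
w_stat, p_value = wilcoxon_signed_rank(scores_a, scores_b)
```

Because the test ranks paired differences instead of comparing means, a small but consistent per-utterance gain can be highly significant even when the average difference looks modest.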
Note that MUSE++ is trained with dynamic SNR augmentation and the augmented multi-objective loss introduced in [19], while DOM-MUSE uses neither of these training enhancements. That DOM-MUSE still outperforms MUSE++ on the most perceptually important metrics (PESQ and COVL) despite this asymmetry indicates that the architectural innovations, rather than the training strategy, drive the quality improvement.
DOM-MUSE vs. MUSE (reproduced): DOM-MUSE outperforms the reproduced MUSE on all five evaluation metrics. PESQ improves from 3.3475 to 3.4241 (+0.077), CSIG from 4.6163 to 4.6738 (+0.058), CBAK from 3.7965 to 3.8224 (+0.026), COVL from 4.0827 to 4.1531 (+0.070), and STOI from 0.9506 to 0.9522 (+0.002). These gains are achieved while simultaneously reducing the parameter count by 24% (from 0.51 M to 0.39 M).
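As a quick arithmetic check, the gains quoted above can be recomputed from the scores given in the text. The dictionary names are illustrative; the numbers are those reported in this paragraph.

```python
# Scores quoted in the text for the reproduced MUSE and for DOM-MUSE.
muse = {"PESQ": 3.3475, "CSIG": 4.6163, "CBAK": 3.7965,
        "COVL": 4.0827, "STOI": 0.9506}
dom_muse = {"PESQ": 3.4241, "CSIG": 4.6738, "CBAK": 3.8224,
            "COVL": 4.1531, "STOI": 0.9522}

# Absolute gain of DOM-MUSE over MUSE on each metric.
deltas = {metric: dom_muse[metric] - muse[metric] for metric in muse}

# Relative parameter reduction, using the counts in millions from the text.
param_reduction_pct = (0.51 - 0.39) / 0.51 * 100

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.4f}")
print(f"parameter reduction: {param_reduction_pct:.1f}%")
```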
DOM-MUSE vs. MUSE++: Despite the training asymmetry, DOM-MUSE outperforms MUSE++ on PESQ (+0.061) and COVL (+0.032), with CSIG also marginally higher (+0.012). MUSE++ leads on CBAK (+0.036) and STOI (+0.002), metrics that reward aggressive noise suppression—a behavior directly encouraged by its augmented training strategy.
Discussion on CBAK: DOM-MUSE scores slightly lower than MUSE++ on CBAK (3.8224 vs. 3.8584), a metric that rewards aggressive noise suppression. This gap is directly attributable to MUSE++’s dynamic SNR augmentation and augmented loss, which explicitly push the model toward stronger noise attenuation at the cost of potential over-suppression of low-energy speech components. In real-world deployment—especially for hearing aids and high-quality voice communication—over-suppression manifests as a “machine-like” or “clipped” perceptual quality that reduces listener preference even when the background noise is numerically lower. PESQ and COVL capture precisely this perceptual dimension, and DOM-MUSE’s higher scores on both metrics suggest it strikes a more natural balance between noise removal and speech preservation. As confirmed in
Section 6.7, when DOM-MUSE is trained with the same augmented protocol, its CBAK rises to 3.9351, which indicates that the gap stems from the difference in training strategy rather than from an architectural limitation.
6.2. Parameter and Computational Efficiency
Table 3 reports parameter counts alongside computational efficiency metrics measured on the VoiceBank-DEMAND test set (824 utterances) using a single NVIDIA RTX 3060 GPU, including the real-time factor (RTF), total inference time for the full test set (IFT, in seconds), throughput (THP, utterances per second), and peak VRAM (Video Random Access Memory) usage.
The natural architectural comparison baseline for DOM-MUSE is MUSE, not MUSE++, since DOM-MUSE and MUSE share the same U-Net design philosophy, while MUSE++ represents a more extreme architectural simplification. Relative to MUSE, DOM-MUSE improves on all efficiency dimensions simultaneously: the parameter count is reduced by 24% (0.51 M → 0.39 M), the inference time is reduced by 37% (218 s → 137 s), throughput is increased by 61% (9.64 → 15.50 utt/s), and peak VRAM is reduced by 41% (7584 MB → 4506 MB), all while outperforming MUSE on all perceptual quality metrics.
Relative to MUSE++, DOM-MUSE requires more computation, which is commensurate with its richer architectural design. However, an important reference point is the bare MUSE++ backbone without its augmented training strategies: our experiments show that a plain 1D Mamba-2 substitution (without SNR augmentation or augmented loss) achieves a PESQ of only 3.2860, well below DOM-MUSE’s 3.4241. This suggests that DOM-MUSE’s moderate additional computation buys a substantial architectural gain over the plain Mamba baseline (+0.138 PESQ), which MUSE++’s efficiency numbers alone do not reveal.
All three models achieve an RTF well below 1.0, confirming real-time feasibility across the board. DOM-MUSE’s RTF of 0.0645 means that, on average, 10 ms of speech requires about 0.65 ms of computation on the test GPU, well within the requirements of real-time telephony and hearing aid applications (RTF < 1.0).
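The efficiency figures in this subsection follow from simple ratios, sketched below. The helper names `real_time_factor` and `relative_change_pct` are hypothetical, and the RTF is taken in its usual sense of processing time divided by audio duration; the input numbers are those quoted in the text.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the processed audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def relative_change_pct(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100


# Average compute budget implied by the reported RTF of 0.0645.
rtf = 0.0645
compute_ms_per_10ms = rtf * 10.0  # milliseconds of compute per 10 ms of audio

# MUSE -> DOM-MUSE changes, using the values quoted in the text.
changes = {
    "inference time": relative_change_pct(218, 137),
    "throughput": relative_change_pct(9.64, 15.50),
    "peak VRAM": relative_change_pct(7584, 4506),
}

print(f"compute per 10 ms of audio: {compute_ms_per_10ms:.3f} ms")
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Note that an RTF measured over whole utterances is an average; it does not by itself bound the latency of a streaming system.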
6.3. Ablation Study
To quantify the individual contribution of each proposed component, we evaluate three variants of DOM-MUSE, denoted A1–A3, each omitting a different subset of the innovations. All variants share the same two-level asymmetric U-Net skeleton and are trained under the same protocol.
Table 4 summarizes the results.
Role of CDGF and TCA (A3 vs. A2): Adding CDGF and TCA yields the largest single-step improvement across all perceptual metrics: PESQ increases by 0.041, CSIG by 0.020, CBAK by 0.039, and COVL by 0.039, confirming that Cross-Dimensional Gated Fusion is the most critical component. These gains are achieved under identical training conditions, demonstrating that the improvement is attributable to the CDGF architecture rather than any training strategy difference.
Role of Scalar AttnRes (A2 vs. A1): Scalar AttnRes produces clear improvements in CBAK (+0.026) and STOI (+0.002), consistent with content-adaptive residual gating effectively filtering noise-corrupted branch outputs.
Role of DFE, PGFC, and GS-FFN (DOM-MUSE vs. A3): Adding DFE, PGFC, and GS-FFN improves PESQ by 0.022, CSIG by 0.018, COVL by 0.019, and STOI by 0.001, with no increase in parameter count. CBAK decreases slightly (see Table 4), reflecting PGFC’s deliberate design priority of favoring perceptual naturalness over maximum noise suppression.
Summary: Each component contributes in a complementary way, and the full performance of DOM-MUSE emerges from their joint interaction. All ablation comparisons are conducted under identical training protocols, ensuring that the observed differences reflect architectural contributions exclusively.
6.4. Per-SNR Performance Analysis
To further examine the behavior of DOM-MUSE across different noise conditions,
Table 5 reports per-SNR results for all three models on the VoiceBank-DEMAND test set, which spans input SNRs of 2.5, 7.5, 12.5, and 17.5 dB.
DOM-MUSE leads on PESQ and COVL at every SNR level: Across all four test conditions, DOM-MUSE achieves the highest PESQ and COVL scores, indicating that its perceptual quality advantage is consistent across the full range of noise conditions present in the benchmark. Crucially, this advantage is maintained without dynamic SNR augmentation or the augmented multi-objective loss that MUSE++ employs.
The PESQ gain is largest at low SNR: The improvement of DOM-MUSE over MUSE++ on PESQ narrows as the SNR increases: +0.087 at 2.5 dB, +0.068 at 7.5 dB, +0.044 at 12.5 dB, and +0.043 at 17.5 dB, leveling off between the two highest SNRs. This SNR-dependent pattern is consistent with the intended behavior of PGFC’s phase-gradient gating: at low SNR, smooth harmonic phase progressions are strongly contrasted against erratic noise-phase gradients, making PGFC’s signal maximally discriminative. As SNR increases, the noisy phase itself approaches the clean phase, reducing the marginal benefit of phase-guided conditioning.
MUSE++ leads on CBAK at every SNR level: MUSE++’s advantage on CBAK is consistent across all conditions and reflects the direct effect of its augmented training strategy rather than an architectural advantage.
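A per-SNR breakdown of this kind can be obtained by grouping per-utterance scores by input SNR and averaging within each group. The sketch below is illustrative only: `mean_by_snr` is a hypothetical helper and the example records are placeholders, while `reported_gap` holds the PESQ gaps (DOM-MUSE minus MUSE++) quoted in the text.

```python
from collections import defaultdict


def mean_by_snr(records):
    """Average scores per SNR level.

    records: iterable of (snr_db, score) pairs.
    Returns {snr_db: mean score}, ordered by ascending SNR.
    """
    buckets = defaultdict(list)
    for snr_db, score in records:
        buckets[snr_db].append(score)
    return {snr: sum(v) / len(v) for snr, v in sorted(buckets.items())}


# Placeholder per-utterance scores (NOT the paper's data).
example = [(2.5, 3.0), (7.5, 4.0), (2.5, 3.5)]
print(mean_by_snr(example))

# PESQ gaps quoted in the text, keyed by input SNR in dB.
reported_gap = {2.5: 0.087, 7.5: 0.068, 12.5: 0.044, 17.5: 0.043}
ordered = [reported_gap[snr] for snr in sorted(reported_gap)]

# True when the gap never grows as the SNR increases.
gap_narrows = all(a >= b for a, b in zip(ordered, ordered[1:]))
print("gap narrows with SNR:", gap_narrows)
```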
6.5. Qualitative Spectrogram Analysis
Figure 9 and
Figure 10 show representative spectrogram comparisons on two test utterances selected from the VoiceBank-DEMAND test set to illustrate contrasting SNR conditions (12.5 dB and 2.5 dB, respectively). These examples are provided for qualitative illustration only and are not intended as statistical evidence of generalization. The generality of DOM-MUSE’s improvements across noise conditions is substantiated quantitatively by
Table 5, which reports per-SNR performance at all four test SNR levels (2.5, 7.5, 12.5, and 17.5 dB), and by
Table 1, which aggregates results across all five noise categories in the VoiceBank-DEMAND test set. DOM-MUSE leads on PESQ and COVL at every SNR level, which is stronger evidence of generality than additional spectrogram examples would offer.
DOM-MUSE (e) produces spectrogram patterns closely aligned with the clean reference (a) in the voiced regions, with harmonic contours that remain smooth and continuous. This qualitative alignment is consistent with the higher PESQ and COVL scores of DOM-MUSE, which capture precisely these perceptual dimensions of harmonic fidelity and overall speech naturalness.
6.6. Comparison with State-of-the-Art Methods
To place DOM-MUSE in the broader context of the SE literature,
Table 6 compares its performance against two classical non-neural baselines and several representative lightweight neural SE systems on the VoiceBank-DEMAND test set. Scores for competing methods are taken from their respective original papers or from the summary reported in [
18].
Performance of classical baselines: Both Wiener filtering and LogMMSE provide limited enhancement relative to modern neural models, confirming the well-established advantage of data-driven architectures over classical spectral-domain estimators.
DOM-MUSE vs. state-of-the-art neural models: DOM-MUSE achieves the best PESQ (3.42) and COVL (4.15) among all models in the comparison, including the larger DB-AIAT (2.81 M) and the competitive DPT-FSNet (0.88 M). Its CBAK (3.82) is also among the highest, although MUSE++ scores higher (3.86; see Section 6.1). The gains over DPT-FSNet are consistent across all perceptual quality metrics (+0.09 PESQ, +0.09 CSIG, +0.10 CBAK, +0.15 COVL), indicating that the deformable scanning, cross-dimensional gating, and phase-guided conditioning provide complementary benefits that neither attention-based Transformers nor fixed-grid convolutional networks can easily replicate. On STOI, DOM-MUSE (0.95) matches the majority of competing models but falls behind DPT-FSNet (0.96), which is consistent with the conservative gating behavior of PGFC and CDGF discussed above.
DOM-MUSE vs. IMSE: IMSE [
27] is the most directly comparable recent method, as it also builds upon MUSE at a similar parameter scale (0.43 M vs. our 0.39 M). IMSE proposes replacing the MET Transformer with Amplitude-Aware Linear Attention (MALA) and substituting Deformable Embedding with a static Inception Depthwise Convolution (IDConv), arguing that dynamic deformable offsets introduce unnecessary computational burden. DOM-MUSE takes a different design philosophy: rather than eliminating deformable convolution, we deploy it as a dedicated preprocessing step (DFE) that warps the feature grid to align with speech formant trajectories before the SSM stage, allowing deformable sampling to directly benefit the Mamba scanning rather than being absorbed into patch embedding. DOM-MUSE achieves PESQ 3.42 and COVL 4.15, both marginally higher than IMSE (PESQ 3.40, COVL 4.14), while using slightly fewer parameters (0.39 M vs. 0.43 M), demonstrating that the Mamba-based parallel T-F scanning approach with phase-guided conditioning offers a competitive and complementary alternative to the attention-replacement strategy of IMSE.
Parameter efficiency: DOM-MUSE achieves the highest overall perceptual quality scores while using only 0.39 M parameters, fewer than every competing neural model except MUSE++. In particular, it surpasses DB-AIAT with roughly seven times fewer parameters (0.39 M vs. 2.81 M) and outperforms DPT-FSNet and MetricGAN-OKDv2 with less than half their parameter counts.
Comparison within the MUSE family: Within the MUSE family, DOM-MUSE surpasses the reproduced MUSE on all five metrics and surpasses MUSE++ on PESQ, CSIG, and COVL, while MUSE++ retains a small lead on CBAK and STOI under the default training protocol (Section 6.1). These results indicate that DOM-MUSE occupies a favorable position: more expressive and higher in perceptual quality than MUSE++, more compact than the original MUSE, and competitive with the most recent MUSE-derived lightweight model IMSE, while offering a different and complementary architectural perspective through Mamba-based parallel T-F scanning with phase-guided conditioning.
6.7. DOM-MUSE Under Augmented Training Conditions
To determine whether DOM-MUSE’s architectural improvements persist under the same training conditions as MUSE++, we additionally trained DOM-MUSE with the dynamic SNR augmentation and augmented multi-objective loss employed by MUSE++.
Table 7 presents the results.
Under identical augmented training conditions, DOM-MUSE (aug.) achieves PESQ 3.4609, CSIG 4.7366, CBAK 3.9351, COVL 4.2165, and STOI 0.9586—surpassing MUSE++ on all five metrics. These results confirm two key points. First, DOM-MUSE’s architectural innovations provide genuine representational improvements that are independent of and complementary to training strategy: the same architecture benefits from augmented training while maintaining its structural advantages over MUSE++. Second, the CBAK gap observed in
Table 1 between DOM-MUSE and MUSE++ is a consequence of the training strategy difference rather than an architectural limitation—once trained with augmented strategies, DOM-MUSE’s CBAK (3.9351) substantially exceeds MUSE++’s (3.8584). The parameter count of DOM-MUSE (aug.) is 0.39 M, essentially identical to the primary DOM-MUSE model (0.39 M), confirming that no architectural expansion was required.
7. Conclusions
This paper presented DOM-MUSE, a lightweight speech enhancement framework that addresses four structural limitations shared by the MUSE and MUSE++ architectures within a unified U-Net backbone. The Deformable Feature Extractor (DFE) replaces fixed rectangular sampling grids with learnable warp locations that align with speech formant trajectories, providing geometrically coherent inputs to the subsequent state space model. The DOM Mamba Block with Cross-Dimensional Gated Fusion (CDGF) extends unidirectional 1D Mamba scanning to parallel time–frequency axes modulated by channel-level semantic gates derived from Taylor Channel Attention, enabling the model to selectively emphasize spectrally meaningful state transitions. This offers a theoretically principled alternative to naive concatenation that captures the cross-channel energy correlations inherent in formant structures. The Phase-Guided Feature Conditioner (PGFC) exploits the statistical distinction between smooth harmonic phase progressions and erratic noise-phase gradients to suppress noise-dominated activations before the SSM stage. The Attention-Based Skip Connection (ABSC) replaces channel-doubling concatenation with a content-adaptive per-channel gate, preserving the decoder channel dimension while selectively incorporating encoder information at each level.
Experiments on the VoiceBank-DEMAND benchmark demonstrate that DOM-MUSE outperforms the reproduced MUSE baseline on all five evaluation metrics, namely PESQ (+0.077), CSIG (+0.058), CBAK (+0.026), COVL (+0.070), and STOI (+0.002), while reducing the parameter count by 24% (0.51 M → 0.39 M). Compared with MUSE++, which employs dynamic SNR augmentation and an augmented multi-objective loss, DOM-MUSE achieves higher PESQ (+0.061) and COVL (+0.032) without any of these training enhancements, with the statistical significance of the differences on all five metrics confirmed by a two-sided Wilcoxon signed-rank test (Table 2). When trained under the same augmented protocol as MUSE++, DOM-MUSE further improves to PESQ 3.4609 and COVL 4.2165, surpassing MUSE++ on all five metrics. Ablation experiments confirm that the proposed components contribute in a complementary way, with CDGF accounting for the largest individual gain. Computational efficiency measurements show that DOM-MUSE reduces inference time by 37% and peak VRAM by 41% relative to MUSE, while maintaining an RTF of 0.0645, well within the requirements of real-time deployment.
Several directions remain open for future investigation. First, all experiments in this work are conducted on the VoiceBank-DEMAND benchmark, which uses relatively controlled additive noise conditions. Cross-dataset evaluation on more challenging benchmarks—such as the DNS Challenge [
34] and reverberant environments (e.g., Valentini-Reverb)—would provide stronger evidence of generalization to out-of-domain acoustic conditions; this is explicitly identified as the primary limitation of the current work and a priority for future evaluation. Second, the current architecture is non-causal; extending DOM-MUSE to a causal variant through masked convolutions and unidirectional Mamba scanning would enable low-latency streaming deployment on mobile and embedded devices. Third, exploring adaptive bottleneck dimensionality in Scalar AttnRes and ABSC may further improve the quality–efficiency trade-off. Finally, integrating DOM-MUSE as a front-end for downstream tasks, such as ASR or speaker verification, is a natural next step toward real-world deployment.