Article

Effective SAR Image Despeckling Using Noise-Guided Transformer and Multi-Scale Feature Fusion

1 College of Mechanical Engineering, Guizhou University, Guiyang 550025, China
2 Guizhou Provincial Key Laboratory of Mountainous Intelligent Agricultural Machinery, Guizhou University, Guiyang 550025, China
3 School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3863; https://doi.org/10.3390/rs17233863
Submission received: 2 November 2025 / Revised: 25 November 2025 / Accepted: 27 November 2025 / Published: 28 November 2025
(This article belongs to the Special Issue SAR Images Processing and Analysis (3rd Edition))

Highlights

What are the main findings?
  • A novel SAR image despeckling method is proposed, which incorporates a dual-branch network for noise estimation and coarse despeckling, along with a noise-guided Transformer for refinement.
  • The model uses multi-scale fusion with grouped pooling attention (GPA) and context-aware fusion (CAF), along with deformable convolutions and masked self-attention for region-specific improvements.
What is the implication of the main finding?
  • Separating noise estimation and despeckling improves both noise suppression and the preservation of fine details, especially in areas with varying noise.
  • Experiments on synthetic and real SAR images show our method outperforms existing approaches, providing a strong solution for SAR applications in noisy conditions.

Abstract

Speckle noise is a significant challenge in synthetic aperture radar (SAR) images, severely degrading the visual quality and compromising subsequent image interpretation tasks. While existing despeckling methods can reduce noise, they often fail to strike an appropriate balance between noise suppression and the preservation of fine image details. To address this issue, this paper proposes a novel SAR image despeckling method that leverages both structural image priors and noise distribution characteristics in an end-to-end framework. Our approach consists of two key components: a dual-branch subnet for coarse despeckling and noise estimation, and a noise-guided Transformer-based subnet for final image refinement. The dual-branch subnet decouples the tasks of noise estimation and despeckling, improving both noise suppression accuracy and structural detail preservation. Furthermore, a combination of grouped pooling attention (GPA) and context-aware fusion (CAF) modules enables effective multi-scale feature fusion by jointly capturing local details and global contextual information. The noise estimation branch generates adaptive priors that guide the Transformer refinement, which incorporates deformable convolutions and a masked self-attention mechanism to selectively focus on relevant image regions. Extensive experiments conducted on both synthetic and real SAR datasets demonstrate that the proposed method consistently outperforms current state-of-the-art methods, achieving superior speckle suppression while preserving fine details more effectively.

1. Introduction

Synthetic aperture radar (SAR) has become an indispensable remote sensing modality due to its ability to acquire high-resolution images regardless of illumination and weather conditions. It is widely employed in various fields, including environmental monitoring, disaster response, land-cover mapping, and security applications [1]. However, despite its advantages, SAR images are often affected by speckle noise, a form of multiplicative noise that is inherent to the coherent nature of radar signals. This speckle noise degrades radiometric accuracy and obscures fine structural details, thereby compromising the effectiveness of downstream tasks such as target detection [2,3,4], classification [5], and change detection [6]. Consequently, effectively removing speckle noise while preserving critical structural and textural details remains a significant challenge in SAR image processing [7].
Traditional SAR image despeckling techniques can be broadly classified into spatial domain and transform domain approaches [8]. Spatial domain methods operate directly on image pixels or local/non-local patches, employing filters or patch-based aggregation strategies to reduce speckle noise while preserving essential structural features such as edges, textures, and fine details [9,10,11]. These methods typically rely on local statistics and neighborhood relationships for noise suppression [12]. However, they often face challenges in preserving critical image features, especially when the noise distribution is heterogeneous or the image contains complex textures. In contrast, transform domain methods involve transforming the image into a different basis (e.g., wavelet transforms), applying shrinkage or thresholding in that domain, and then inverting the transform to obtain a despeckled image [13]. While conceptually straightforward and computationally efficient, these methods often depend on hand-crafted priors and finely tuned parameters. In many practical scenarios, such tuning is highly sensitive to scene and noise characteristics, leading to over-smoothing of edges and textures or insufficient noise suppression [14].
Recent advancements in deep learning have introduced data-driven solutions to the SAR image despeckling problem [15]. By leveraging the powerful representational capacity of convolutional neural networks (CNNs) and related architectures, these methods have achieved significant improvements over traditional approaches in both noise reduction and detail preservation [16]. Unlike conventional techniques that rely on manually designed filters or handcrafted priors, deep learning-based methods can learn complex, non-linear mappings directly from data [17]. In particular, supervised learning frameworks have proven highly effective, training neural networks on large datasets of paired noisy and clean SAR images [18]. These models learn pixel-wise mappings from the noisy domain to the clean target, enabling precise noise removal while preserving structural integrity. However, CNN-based approaches typically rely on fixed local receptive fields, which limits their ability to capture long-range dependencies and global contextual information [19,20,21,22]. Additionally, these models are often designed to apply noise suppression uniformly across the entire image, without accounting for the spatially varying noise distribution in different regions [23,24,25]. In contrast, the self-attention mechanism in Transformer-based models effectively captures long-range dependencies, leading to improved despeckling performance. For instance, Pan et al. [21] and Guo et al. [22] proposed combining Transformer models with diffusion models to achieve superior despeckling results. Shen et al. [15] introduced a Transformer-based model that incorporates a dynamic gated attention module and a frequency-domain multi-expert enhancement module. Liu et al. [26] and Wang et al. [27] proposed despeckling methods that integrate CNNs with Transformers to enhance performance. While the self-attention mechanism effectively captures long-range dependencies, it often overlooks critical interactions between different regions of the image, leading to suboptimal performance in handling local structures and fine-grained details [28]. Additionally, these models struggle to integrate cross-window interactions and to adapt the receptive field dynamically.
To address these limitations, we propose a novel SAR image despeckling framework that integrates both structural image priors and noise distribution characteristics in an end-to-end model. This method consists of two main components: a dual-branch subnet for coarse despeckling and noise estimation, and a noise-guided Transformer-based subnet for final refinement. The dual-branch architecture allows the model to explicitly decouple the speckle noise from the underlying image content. The first branch focuses on coarse despeckling, while the second branch estimates the noise distribution across the image. Furthermore, by incorporating a noise-guided Transformer with deformable convolutions and masked self-attention, our approach provides adaptive refinement. Although hybrid CNN-Transformer models have been explored for SAR image despeckling, most of these approaches focus primarily on global feature extraction or enhancing feature representation through Transformer mechanisms. In contrast, our method utilizes the noise estimation branch to generate adaptive priors, enabling the Transformer-based subnet to focus more effectively on regions with varying noise intensities.
This noise-aware strategy results in superior speckle suppression and better preservation of image quality, offering significant improvements over existing methods that lack dynamic adaptation. The main contributions of this work are summarized as follows:
(1)
We propose a novel dual-branch architecture that decouples the tasks of coarse despeckling and noise estimation, enabling each branch to specialize in its respective function. This decoupling improves noise suppression accuracy while effectively preserving structural image details. The proposed grouped pooling attention (GPA) and context-aware fusion (CAF) modules leverage multi-scale contextual integration, which enables the model to combine local details with global contextual information effectively. Additionally, the introduction of a bidirectional feature interaction mechanism between the two branches further enhances both noise estimation accuracy and despeckling performance.
(2)
We introduce a noise-guided Transformer subnet that leverages the adaptive, learned coarse despeckling map and noise map from the dual-branch subnet as prior knowledge. By incorporating deformable convolutions and a learnable attention mask, the Transformer subnet can capture complex long-range dependencies and selectively focus on relevant regions of the image, enhancing despeckling in areas with varying noise levels.
(3)
Extensive experiments conducted on both synthetic and real-world SAR datasets demonstrate the superiority of the proposed method over existing state-of-the-art approaches. The results show significant improvements in both quantitative metrics and visual quality, underscoring the robustness and generalization ability of the method across different noise levels and types of SAR images.
The remainder of this paper is organized as follows: Section 2 reviews the related work. Section 3 details the proposed method. Section 4 presents the experimental results and analysis, and Section 5 discusses the results and presents an ablation study evaluating the contributions of the different components of the proposed method. Finally, Section 6 concludes the paper and discusses potential future research directions.

2. Related Works

2.1. Traditional Despeckling Methods

Traditional methods for SAR image despeckling primarily rely on statistical modeling and image processing techniques [29]. These methods aim to reduce speckle noise while preserving essential image features such as edges and textures. One of the earliest and most widely used approaches is spatial domain filtering, which operates directly on image pixels. Notable examples include the Lee filter [30], Kuan filter [31], and Frost filter [32]. These filters rely on local statistics within a sliding window to smooth homogeneous regions while retaining edge information. Another class of despeckling methods is based on transformations, where the image is mapped to a different domain [33]. Wavelet transform, in particular, is a widely used technique for SAR image despeckling. By providing a multi-scale representation, wavelets enable effective decomposition of the image into various frequency bands. Despeckling is achieved by thresholding the wavelet coefficients, suppressing high-frequency noise while preserving the critical low-frequency components [34]. Although computationally efficient and relatively simple to implement, these methods often struggle to maintain edge sharpness and preserve fine details. Furthermore, their performance heavily depends on the selection of parameters such as filter size or threshold values, which may not generalize well across different SAR images or varying noise conditions [35].
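To make the transform-domain pipeline concrete, the following is a minimal homomorphic wavelet-thresholding sketch (illustrative only, not a specific published filter): since speckle is multiplicative, the image is mapped to the log domain, detail coefficients are soft-thresholded, and the transform is inverted. The wavelet choice, decomposition depth, and threshold value are assumed parameters.

```python
# A minimal homomorphic wavelet-thresholding sketch; wavelet, depth, and
# threshold are illustrative assumptions, not values from the paper.
import numpy as np
import pywt

def wavelet_despeckle(img, wavelet="db4", levels=3, thr=0.1):
    log_img = np.log1p(img.astype(np.float64))        # multiplicative -> additive
    coeffs = pywt.wavedec2(log_img, wavelet, level=levels)
    approx, details = coeffs[0], coeffs[1:]
    # shrink high-frequency detail bands; keep the low-frequency approximation
    details = [tuple(pywt.threshold(d, thr, mode="soft") for d in band)
               for band in details]
    rec = pywt.waverec2([approx] + details, wavelet)
    return np.expm1(rec[:img.shape[0], :img.shape[1]])  # back to intensity domain
```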
More advanced spatial domain methods, such as non-local means (NLM) filtering, improve despeckling by considering the similarity between distant image patches [36]. The key idea behind NLM is to compare patches of the image and weigh them according to their similarity, aggregating similar patches to form a denoised estimate for each target patch [37]. Several extensions of NLM have been proposed to improve the efficiency and effectiveness of non-local despeckling methods [38]. SAR-BM3D improves upon this concept by grouping similar image blocks into 3D arrays and applying collaborative filtering [13]. Probabilistic patch-based (PPB) methods further enhance despeckling by introducing probabilistic models that capture the underlying statistical relationships between image patches [12]. Non-local low-rank (NLR) methods represent a more recent extension in non-local techniques. These methods assume that the image can be approximated as a low-rank matrix in a suitable transform or patch-based representation. A common method is to decompose the SAR image into non-local patches and solve a low-rank matrix approximation problem using techniques such as singular value thresholding [11,39]. Compared to NLM and BM3D methods, low-rank methods typically provide better performance in terms of noise suppression and detail preservation. However, their computational cost is relatively high, and their effectiveness diminishes in scenarios with weak self-similarity or high noise levels.
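The patch-similarity weighting at the heart of NLM can be sketched in a few lines. The version below is a deliberately naive implementation on the log image with illustrative parameters; practical SAR variants such as PPB replace the Euclidean patch distance with similarity measures matched to speckle statistics.

```python
# A naive non-local means sketch illustrating patch-similarity weighting.
# Patch size, search radius, and smoothing parameter h are illustrative.
import numpy as np

def nlm_despeckle(img, patch=3, search=10, h=0.15):
    log_img = np.log1p(img.astype(np.float64))        # work in the log domain
    pad = patch // 2
    padded = np.pad(log_img, pad, mode="reflect")
    H, W = log_img.shape
    out = np.zeros_like(log_img)
    for i in range(H):
        for j in range(W):
            ref = padded[i:i + patch, j:j + patch]    # reference patch at (i, j)
            i0, i1 = max(0, i - search), min(H, i + search + 1)
            j0, j1 = max(0, j - search), min(W, j + search + 1)
            acc = w_sum = 0.0
            for m in range(i0, i1):                   # restricted search window
                for n in range(j0, j1):
                    cand = padded[m:m + patch, n:n + patch]
                    w = np.exp(-np.sum((ref - cand) ** 2) / (h * h))
                    acc, w_sum = acc + w * log_img[m, n], w_sum + w
            out[i, j] = acc / w_sum                   # similarity-weighted average
    return np.expm1(out)
```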

2.2. Deep Learning-Based Despeckling Methods

Unlike traditional methods, deep learning methods do not rely on hand-crafted features or explicit image models. Instead, they exploit large datasets to automatically learn the complex relationships between noisy and clean images [40]. With the increasing power of deep learning models, these techniques have shown considerable promise in overcoming the limitations of conventional despeckling methods. CNNs have been widely applied to SAR image despeckling due to their robust feature extraction and representation capabilities [16]. Networks such as ID-CNN [41] and SAR-DRN [42] have improved despeckling performance by incorporating dilated convolutions and multi-scale feature extraction techniques. These models excel in capturing both local and global contextual information, which enables effective noise removal while preserving fine image details. Building on the success of CNN-based models, generative adversarial networks (GANs) have also been introduced for SAR image despeckling [43]. Through adversarial training, the generator learns to produce outputs that closely resemble clean SAR images, as evaluated by a discriminator. This process encourages the network to generate sharper and more realistic results by explicitly modeling fine textures and structures, often outperforming standard CNN-based methods in terms of perceptual quality. Moreover, Transformer-based models have emerged as powerful tools for capturing long-range dependencies in SAR images [21,27,28]. Additionally, multi-task learning frameworks have been introduced to jointly optimize despeckling performance alongside high-level tasks [44]. Despite their impressive results, supervised deep learning methods exhibit certain limitations. One major drawback is their dependency on large datasets of paired noisy and clean images for training. In addition, these models typically assume a fixed noise level during the training process, limiting their adaptability to real-world scenarios where noise conditions can dynamically vary. Although training on multiple noise levels is a potential solution, it often leads to trade-offs between generalization and optimal performance.
In contrast, unsupervised and self-supervised learning methods have emerged as promising alternatives for SAR image despeckling, as they do not require ground truth clean images [45,46]. These methods leverage inherent image statistics or other forms of self-supervision to learn effective noise removal strategies without relying on paired data [47]. By learning directly from the image or unannotated data, these methods eliminate the dependency on large-scale labeled datasets. Speckle2Void, for example, uses a blind-spot network combined with Bayesian posterior reconstruction techniques to implement self-supervised learning on a single image [48]. SDUDNet adopts a speckle learning strategy and introduces an unpaired despeckling framework based on generative adversarial strategies, enabling unsupervised SAR image despeckling without paired training data [49]. Despite these advancements, unsupervised and self-supervised learning methods still face several challenges. They often require carefully designed loss functions and regularization strategies to ensure that the network converges to a meaningful solution [50]. Additionally, training stability remains a concern in some of these methods, particularly with GAN-based architectures, where issues such as mode collapse or instability in adversarial training can lead to suboptimal performance.

3. Proposed Method

The overall framework of the proposed SAR image despeckling method is illustrated in Figure 1. The architecture is designed to integrate both structural image priors and noise distribution characteristics within an end-to-end framework. It consists of two primary components: a dual-branch subnet for coarse despeckling and noise estimation, and a noise-guided Transformer-based subnet for final refinement. The objective is to explicitly decouple speckle noise from the underlying image content and guide the reconstruction process using noise-related cues. Each of these components is described in detail in the following sections.

3.1. Dual-Branch Subnet for Coarse Despeckling and Noise Estimation

Given a noisy input SAR image $Y \in \mathbb{R}^{H \times W}$, where H and W denote the height and width of the image, the network first processes the input through two parallel convolutional pathways. These pathways are designed to separately estimate the underlying noise distribution and produce a preliminary despeckled result. This dual-branch strategy enables the model to decouple the noise component from the structural content, thereby improving both despeckling accuracy and detail preservation. In the proposed architecture, both the coarse despeckling branch and the noise estimation branch share an identical structural design, enabling parallel yet complementary feature extraction. Each branch begins with a 3 × 3 convolutional layer that performs shallow feature extraction from the input SAR image. The output features are then processed by a GPA block, which enhances feature discrimination by incorporating both spatial and channel-wise attention mechanisms. Following this, the features are fed into a CAF block, which is specifically designed to integrate local detail and global contextual information. This module employs multi-scale dilated convolutions to aggregate features across various receptive fields and applies adaptive weighting to emphasize semantically relevant spatial regions. The combination of the convolutional layer, GPA block, and CAF block forms a modular unit, which is repeated N times (with N = 5 in our implementation) to construct a deep hierarchical feature extraction pipeline. To facilitate effective gradient propagation and avoid vanishing gradient issues during training, residual connections are introduced between consecutive modules. Additionally, a bidirectional feature interaction mechanism is implemented between the coarse despeckling and noise estimation branches through the GPA block. This mechanism enables cross-branch information exchange, thereby enhancing both noise estimation accuracy and despeckling performance. Finally, the fused features from both branches are processed through a convolutional layer to produce the final output for each branch.
The detailed structure of the GPA block is illustrated in Figure 2a. Let $F_g \in \mathbb{R}^{C \times H \times W}$ denote the input feature map, where C, H, and W represent the number of channels, height, and width of the feature map, respectively. The first step involves partitioning the input feature map into r groups, resulting in the grouped feature maps $F^i_{g/r} \in \mathbb{R}^{(C/r) \times H \times W}$. This grouping reduces the dimensionality of the feature map, enabling each group to focus on specific spatial or channel-wise patterns. This approach enhances the computational efficiency of the attention mechanism while maintaining its ability to model fine-grained local dependencies. Each subgroup $F^i_{g/r}$ is then processed through spatial pooling and channel pooling to generate $W_g$ and $W_c$. The spatial branch computes position-wise importance using global context information, while the channel branch captures inter-channel relationships. Mathematically, this process can be expressed as:
$$W_g = \mathrm{Conv}\big(\mathrm{LReLU}\big(\mathrm{Conv}\big(\mathrm{GAP}(F^i_{g/r})\big)\big)\big)$$
$$W_c = \mathrm{Conv}\big(\mathrm{LReLU}\big(\mathrm{Conv}\big(\mathrm{cat}\big[\mathrm{CAP}(F^i_{g/r}), \mathrm{CMP}(F^i_{g/r})\big]\big)\big)\big)$$
where LReLU refers to the Leaky ReLU activation function, GAP denotes global average pooling, CAP represents channel average pooling, and CMP denotes channel max pooling. The $\mathrm{cat}[\cdot]$ operation refers to the concatenation of two feature maps.
Following the generation of the spatial and channel pooling maps $W_g$ and $W_c$, we perform a fusion operation to combine these maps. Specifically, the spatial and channel weights are added together, enabling the network to jointly refine both the spatial and channel-wise feature distributions. The fused weight map $W_f$ is then calculated using the following formula:
$$W_f = \sigma\big(\mathrm{Conv}\big(\mathrm{LReLU}\big(\mathrm{Conv}\big(\mathrm{cat}\big[(W_g + W_c), F^i_{g/r}\big]\big)\big)\big)\big)$$
where $\sigma$ represents the sigmoid activation function, which normalizes the values of $W_f$ to the range [0, 1], providing the final learned attention map.
The final weighted result is obtained by multiplying the feature map by its corresponding weight map: $F_w = \mathrm{cat}[W_f \odot F^i_{g/r}]$, where $\odot$ represents element-wise multiplication. Finally, all processed subgroups are concatenated along the channel dimension to reconstruct the full feature map with enhanced representation. Although a similar attention mechanism has been employed in [51], our method adopts a grouped attention strategy, which allows the model to capture a broader range of diverse local patterns.
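A minimal PyTorch sketch of the GPA block, as we read the description above, is given below. The text fixes the topology (grouping, the GAP and CAP/CMP branches, the fusion, and the sigmoid), but not the internal channel widths or kernel sizes, so those choices (e.g., the channel-pooling branch producing a one-channel spatial map) are our assumptions.

```python
# A minimal GPA-block sketch; internal widths and kernel sizes are assumptions.
import torch
import torch.nn as nn

class GPABlock(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        assert channels % r == 0
        self.r, c = r, channels // r
        # spatial branch: GAP -> Conv -> LReLU -> Conv, giving per-channel weights W_g
        self.spatial = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1),
            nn.LeakyReLU(0.2, inplace=True), nn.Conv2d(c, c, 1))
        # channel branch: cat[CAP, CMP] -> Conv -> LReLU -> Conv, giving W_c
        self.channel = nn.Sequential(
            nn.Conv2d(2, c, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True), nn.Conv2d(c, 1, 3, padding=1))
        # fusion: cat[(W_g + W_c), F] -> Conv -> LReLU -> Conv -> sigmoid, giving W_f
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True), nn.Conv2d(c, c, 3, padding=1),
            nn.Sigmoid())

    def forward(self, x):
        outs = []
        for g in torch.chunk(x, self.r, dim=1):            # split into r subgroups
            w_g = self.spatial(g)                          # (B, C/r, 1, 1)
            cap = g.mean(dim=1, keepdim=True)              # channel average pooling
            cmp_ = g.max(dim=1, keepdim=True).values       # channel max pooling
            w_c = self.channel(torch.cat([cap, cmp_], 1))  # (B, 1, H, W)
            w_f = self.fuse(torch.cat([w_g + w_c, g], 1))  # broadcast sum, then fuse
            outs.append(w_f * g)                           # element-wise reweighting
        return torch.cat(outs, dim=1)                      # reassemble all subgroups
```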
The CAF block is designed to facilitate the adaptive fusion of multi-scale contextual features, enabling the model to capture both local and global dependencies. The detailed structure of the CAF module is depicted in Figure 2b. The input feature map $F_w$ is first passed through a fundamental convolutional block for initial processing and then fed into three parallel dilated convolutional branches, each with dilation rates $d \in \{1, 3, 5\}$, to capture information at different receptive field sizes. The different dilation rates enable the network to simultaneously capture fine-grained details as well as broader contextual information. After the three branches extract multi-scale features, the resulting feature maps are fused through element-wise addition. Additionally, the fused multi-scale feature map is further refined through a single-path convolutional pathway. The final output is produced by passing the processed features through one last convolutional layer.
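A matching sketch of the CAF block follows; the text specifies the dilation rates, the element-wise fusion, and the trailing refinement path, while kernel sizes and channel widths are our assumptions.

```python
# A minimal CAF-block sketch; only the dilation rates d in {1, 3, 5} and the
# overall topology come from the paper.
import torch.nn as nn

class CAFBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.LeakyReLU(0.2, inplace=True))
        # parallel dilated branches; padding = dilation keeps the spatial size
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 3, 5))
        self.refine = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                    nn.LeakyReLU(0.2, inplace=True))
        self.tail = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        x = self.head(x)                                    # fundamental conv block
        fused = sum(branch(x) for branch in self.branches)  # element-wise addition
        return self.tail(self.refine(fused))
```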
After passing through N modular units, the coarse despeckling branch and the noise estimation branch engage in feature interaction via a GPA block. The coarse despeckling branch focuses on refining the image and suppressing noise, while the noise estimation branch isolates and predicts noise patterns. The interaction between these branches, facilitated by the GPA block, enables the model to better refine its understanding of the noise characteristics, leading to improved despeckling performance. The fused features from both branches are processed through a final convolutional layer to generate the output. This dual-branch architecture introduces an explicit mechanism for modeling the statistical and spatial characteristics of speckle noise. It offers two main advantages. First, by decoupling the estimation of noise from the reconstruction of the clean image, each branch can specialize in its respective task, avoiding interference between noise suppression and structural preservation. Second, the estimated noise map serves as a learnable and adaptive prior, providing rich and spatially varying cues for the downstream Transformer-based refinement module. These noise-aware features enable the subnetwork to perform conditioned despeckling, adapting its filtering behavior to local noise intensity across different regions of the image.
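Putting the pieces together, one plausible composition of the two branches is sketched below, reusing the GPABlock and CAFBlock sketches above: N repeated (Conv, GPA, CAF) units with residual connections per branch, a cross-branch GPA interaction, and one output convolution per branch. The exact wiring of the interaction and the channel widths are assumptions.

```python
# A plausible dual-branch composition under stated assumptions; reuses the
# GPABlock and CAFBlock sketches defined above.
import torch
import torch.nn as nn

class BranchUnit(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  GPABlock(channels, r), CAFBlock(channels))

    def forward(self, x):
        return x + self.body(x)                              # residual connection

class DualBranchSubnet(nn.Module):
    def __init__(self, channels=64, n_units=5, r=4):
        super().__init__()
        self.stem_d = nn.Conv2d(1, channels, 3, padding=1)   # despeckling branch
        self.stem_n = nn.Conv2d(1, channels, 3, padding=1)   # noise-estimation branch
        self.units_d = nn.Sequential(*(BranchUnit(channels, r) for _ in range(n_units)))
        self.units_n = nn.Sequential(*(BranchUnit(channels, r) for _ in range(n_units)))
        self.interact = GPABlock(2 * channels, r)            # cross-branch exchange
        self.head_d = nn.Conv2d(2 * channels, 1, 3, padding=1)
        self.head_n = nn.Conv2d(2 * channels, 1, 3, padding=1)

    def forward(self, y):                                    # y: (B, 1, H, W)
        f_d = self.units_d(self.stem_d(y))
        f_n = self.units_n(self.stem_n(y))
        fused = self.interact(torch.cat([f_d, f_n], dim=1))
        return self.head_d(fused), self.head_n(fused)        # coarse X~ and noise N^
```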

3.2. Noise-Guided Despeckling Subnet

Unlike CNN-based models, which primarily rely on local receptive fields to capture spatial patterns, Transformer architectures have the capability to model long-range dependencies. Transformers typically use a multi-head self-attention mechanism in conjunction with a multi-layer perceptron (MLP) to capture global feature dependencies and perform nonlinear transformations. The self-attention mechanism in Transformers dynamically adjusts the importance of various regions in the image by assigning different weights to each region, allowing the model to focus on the most relevant contextual information. However, despite their advantages, conventional Transformer architectures have notable limitations when using linear projections or standard convolutions to generate the query (Q), key (K), and value (V) matrices. Such formulations often overlook cross-window interactions and fail to adapt the receptive field dynamically, which can lead to insufficient fine-grained spatial modeling and inadequate capture of local contextual information. To address these challenges, we propose an efficient Transformer architecture that integrates deformable convolution with a masked self-attention mechanism, thereby enhancing spatial adaptivity and enabling selective contextual interaction.
Unlike the pixel-level self-attention used in [52,53], we employ a patch-level self-attention mechanism. As illustrated in Figure 3, the proposed Transformer network consists of M residual Transformer blocks (RTBs), each containing four cascaded deformable masked Transformer layers (DMTLs) and three convolutional layers. Each DMTL consists of a layer normalization step, followed by a deformable masked self-attention (DMSA) block, and then another layer normalization followed by an MLP layer. For each DMSA, given an input feature map $T \in \mathbb{R}^{C \times H \times W}$, the Q, K, and V matrices are generated through learnable linear projections. Specifically, the projections are defined by:
$$Q = W_d^Q W_p^Q T, \quad K = W_d^K W_p^K T, \quad V = W_d^V W_p^V T$$
where $W_p^{(\cdot)}$ represents a 1 × 1 convolution used to aggregate pixel-level contextual information across channels, and $W_d^{(\cdot)}$ refers to a deformable convolution operation. Unlike conventional convolutions, which operate with fixed receptive fields, deformable convolutions enable more flexible sampling positions within the feature map. This spatial adaptability allows the network to dynamically adjust its sampling locations according to the local content, thereby enhancing its ability to capture complex and non-local spatial dependencies. The attention matrix is thus computed by the self-attention mechanism as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{d}\right)V$$
where d is a learnable scaling parameter. The softmax operation normalizes the attention scores. To further enhance the model’s ability to focus on the most relevant regions of the input feature map and suppress irrelevant or noisy areas, we introduce a masked self-attention mechanism. This method enables selective attention by introducing a learnable mask that adjusts the attention distribution across different regions of the image. The attention mechanism with the masked self-attention is defined as follows:
$$\mathrm{Attention}(Q, K, V, M) = \mathrm{softmax}\left(\frac{QK^T}{d} + M\right)V$$
where M is the learnable mask. The addition of the mask M modifies the attention scores by effectively suppressing unwanted interactions. Specifically, M is a learnable matrix that is optimized during training to specify which parts of the image should be attended to and which should be ignored. This ability to learn a dynamic mask allows the model to adaptively filter out irrelevant features and focus its attention on the most significant areas of the image.
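The sketch below illustrates one DMSA step under our reading of the text: each Q/K/V projection is a 1 × 1 convolution followed by a deformable 3 × 3 convolution (using torchvision's DeformConv2d, with a small convolution predicting the sampling offsets), non-overlapping p × p patches serve as tokens, and a learnable mask M of fixed token count is added to the attention logits. The patch size, single-head attention, and the mask parameterization are assumptions.

```python
# A single-head DMSA sketch under stated assumptions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformProj(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.point = nn.Conv2d(channels, channels, 1)                 # W_p
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)    # offset predictor
        self.deform = DeformConv2d(channels, channels, 3, padding=1)  # W_d

    def forward(self, x):
        x = self.point(x)
        return self.deform(x, self.offset(x))

class DMSA(nn.Module):
    def __init__(self, channels, tokens, patch=8):
        super().__init__()
        self.patch = patch
        self.q, self.k, self.v = DeformProj(channels), DeformProj(channels), DeformProj(channels)
        self.scale = nn.Parameter(torch.tensor((channels * patch * patch) ** -0.5))
        self.mask = nn.Parameter(torch.zeros(tokens, tokens))         # learnable M

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch
        def to_tokens(t):  # (B, C, H, W) -> (B, n_tokens, C*p*p)
            t = t.unfold(2, p, p).unfold(3, p, p)                     # (B,C,H/p,W/p,p,p)
            return t.reshape(b, c, -1, p * p).permute(0, 2, 1, 3).reshape(b, -1, c * p * p)
        q, k, v = to_tokens(self.q(x)), to_tokens(self.k(x)), to_tokens(self.v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale + self.mask, dim=-1)
        out = attn @ v                                                # (B, n_tokens, C*p*p)
        out = out.reshape(b, h // p, w // p, c, p, p).permute(0, 3, 1, 4, 2, 5)
        return out.reshape(b, c, h, w)

# e.g., DMSA(channels=64, tokens=(64 // 8) * (64 // 8)) for 64 x 64 feature maps
```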

3.3. Loss Function and Implementation Details

The loss function plays a critical role in guiding the model’s learning process by penalizing discrepancies between predicted and true values, ensuring the optimization of essential tasks such as image reconstruction, noise estimation, and structural preservation. Specifically, we adopt a combination of L1 and L2 loss functions to balance pixel-wise accuracy, robustness to outliers, and the preservation of fine textures. The L2 loss penalizes larger deviations more heavily, making it suitable for tasks requiring high-fidelity reconstruction. In contrast, the L1 loss introduces lower sensitivity to outliers, making it better suited for supervising noise estimation. To guide the network in learning feature representations that are both noise-aware and structurally consistent, we design a total loss function L as a weighted combination of four distinct loss terms:
$$L = \lambda_1 L_S + \lambda_2 L_N + \lambda_3 L_{D2} + \lambda_4 L_{D1} = \lambda_1 \|Y - \hat{Y}\|_2 + \lambda_2 \|N - \hat{N}\|_1 + \lambda_3 \|X - \tilde{X}\|_2 + \lambda_4 \|X - \hat{X}\|_2$$
where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the L1 and L2 losses, respectively. For the detailed dependencies among the variables and loss terms, refer to the relationships shown in Figure 1. The weight parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are set in this paper to 0.5, 0.5, 2, and 5, respectively. The flow of the proposed model is shown in Algorithm 1.
Algorithm 1 Noise-Aware Transformer-based Network for SAR Despeckling
Input: Speckled image Y, clean image X, speckle noise N
Initialization: All network parameters $\theta_i$
1: for number of training iterations do
2:   Estimate $\tilde{X}$ using the coarse despeckling branch
3:   Estimate $\hat{N}$ using the noise estimation branch
4:   Concatenate priors: $Y_G = \mathrm{cat}(\tilde{X}, \hat{N}, Y)$
5:   Estimate $\hat{X}$ using the Transformer block
6:   Calculate loss: $L = \lambda_1 \|Y - \hat{Y}\|_2 + \lambda_2 \|N - \hat{N}\|_1 + \lambda_3 \|X - \tilde{X}\|_2 + \lambda_4 \|X - \hat{X}\|_2$
7:   Update the networks by $\mathrm{Adam}(\theta_i, L)$
8: end for
9: return Despeckled image $\hat{X}$
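A direct transcription of the loss in step 6 might look as follows; since the excerpt does not define $\hat{Y}$ explicitly, we form it as $\hat{X} \odot \hat{N}$ under the multiplicative noise model, which is an assumption based on Figure 1, and MSE stands in for the squared L2 terms.

```python
# Total loss sketch; loss weights are from the paper, Y_hat = X_hat * N_hat is
# an assumed re-speckled reconstruction based on Figure 1.
import torch.nn.functional as F

def total_loss(Y, X, N, X_tilde, X_hat, N_hat, lambdas=(0.5, 0.5, 2.0, 5.0)):
    l1, l2, l3, l4 = lambdas
    Y_hat = X_hat * N_hat                   # assumed self-consistency term
    return (l1 * F.mse_loss(Y_hat, Y)       # L_S : ||Y - Y^||_2
            + l2 * F.l1_loss(N_hat, N)      # L_N : ||N - N^||_1
            + l3 * F.mse_loss(X_tilde, X)   # L_D2: coarse output vs. clean image
            + l4 * F.mse_loss(X_hat, X))    # L_D1: refined output vs. clean image
```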
The proposed model is implemented using the PyTorch (version 2.7.0) framework and trained on two NVIDIA RTX 4090 GPUs, each with 24 GB of memory. The training is conducted over 80 epochs using the Adam optimizer with a learning rate of $1 \times 10^{-4}$. The training dataset is sourced from the UC Merced Land Use dataset, which includes 2100 remote sensing images with a resolution of 256 × 256 pixels, spanning 21 distinct categories (http://vision.ucmerced.edu/datasets/landuse.html, accessed on 28 July 2025). We converted all RGB images to grayscale to match the SAR domain. The synthetic SAR images were generated by adding multiplicative speckle noise, with the noise level controlled by adjusting the equivalent number of looks (ENL). The test dataset consists of synthetic SAR images with varying noise levels and real-world SAR images, providing a comprehensive evaluation of the model's performance under different noise conditions.
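For reference, fully developed L-look intensity speckle is commonly simulated as unit-mean Gamma-distributed multiplicative noise (so that ENL ≈ L), which is what the sketch below does; the paper's exact generation procedure (e.g., intensity versus amplitude format) is not specified, so this is an assumption.

```python
# Common L-look multiplicative speckle simulation: Gamma(L, 1/L), unit mean.
import numpy as np

def add_speckle(clean, looks=1, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * noise, noise             # speckled image Y and noise map N
```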

4. Experimental Results

4.1. Comparison Methods and Objective Evaluation Metrics

To evaluate the effectiveness of the proposed method, we conduct a comprehensive set of experiments comparing it with several state-of-the-art (SOTA) SAR image despeckling techniques. The selected baselines include both classical model-based techniques and deep learning methods: SAR-BM3D [13], FANS [54], MulogB [55], SAR-NNFN [11], PDSNet [17], HTNet [18], MSANN [19], MFAENet [20], SemDNet [44], and LGDBNet [26]. In addition, several unsupervised methods were included for comparison, such as SAR2SAR [45], MERLIN [46], Speckle2Void [48], and SDUDNet [49]. These methods represent a wide range of despeckling paradigms, encompassing non-local filtering as well as both supervised and unsupervised learning frameworks.
To quantitatively assess the performance of the proposed SAR image despeckling method, several objective metrics are employed, including the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), equivalent number of looks (ENL), target-to-clutter ratio (TCR), natural image quality evaluator (NIQE) [56], and blind/referenceless image spatial quality evaluator (BRISQUE) [57]. Each of these metrics provides valuable insights into different aspects of image quality, such as noise reduction, structural preservation, and perceptual fidelity. The mathematical formulations for these metrics are as follows (minimal reference implementations of the formula-based metrics are sketched after this list):
(1)
PSNR measures the ratio between the maximum possible power of a signal and the power of the distortion (noise). A higher PSNR generally indicates better despeckling performance and less image distortion:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{x_{max}^2}{\mathrm{MSE}}\right)$$
where $x_{max}$ is the maximum possible pixel value and MSE is the mean squared error:
$$\mathrm{MSE} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\big(y(i,j) - x(i,j)\big)^2$$
where y and x denote the despeckled image and the clean reference image, respectively, and M and N denote the height and width of the image in pixels.
(2)
SSIM is a perceptual metric that compares the luminance, contrast, and structural similarity between the reference and the processed image. SSIM values range from 0 to 1, with values closer to 1 indicating higher structural similarity. The SSIM index is defined as:
$$\mathrm{SSIM} = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where $\mu_x$ and $\mu_y$ are the average intensities of x and y, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the variances of x and y; $\sigma_{xy}$ is the covariance between x and y; and $C_1$ and $C_2$ are small constants used to stabilize the division with weak denominators.
(3)
ENL evaluates the noise suppression capability and is typically computed within selected homogeneous regions of interest (RoI) in the image. The ENL is given by:
$$\mathrm{ENL} = \frac{\mu(\hat{X}_{RoI})^2}{\sigma(\hat{X}_{RoI})^2}$$
where μ and σ are the mean and standard deviation of pixel intensities in the selected homogeneous region. Higher ENL values imply better speckle suppression.
(4)
TCR is primarily employed to quantify the relative strength of the target signal compared to the surrounding clutter. A higher TCR indicates that the target signal is significantly stronger than the clutter, leading to better detection and recognition of the target. The TCR index is defined as:
$$\mathrm{TCR} = 20 \log_{10}\left(\frac{\max(\hat{X}_{RoI})}{\mu(\hat{X}_{RoI})}\right)$$
(5)
NIQE is a no-reference metric that estimates perceptual image quality based on statistical features derived from natural scenes. It does not require a reference image and is computed by comparing the distribution of features from the test image with those learned from a corpus of high-quality natural images. Lower NIQE scores indicate better perceptual quality.
(6)
BRISQUE is another no-reference quality metric that evaluates the spatial naturalness of an image using features derived from local image statistics. It relies on a machine learning model trained on human-rated image datasets. As with NIQE, lower BRISQUE scores correspond to higher perceptual quality.
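As referenced above, the reference-based and RoI-based metrics admit direct implementations; the sketch below follows the formulas given (NIQE and BRISQUE rely on pretrained natural-scene statistics models and are omitted).

```python
# Direct implementations of PSNR/MSE, ENL, and TCR from the formulas above.
import numpy as np

def psnr(clean, despeckled, x_max=255.0):
    mse = np.mean((clean.astype(np.float64) - despeckled) ** 2)
    return 10.0 * np.log10(x_max ** 2 / mse)

def enl(roi):
    """Equivalent number of looks over a homogeneous RoI (higher is smoother)."""
    return roi.mean() ** 2 / roi.var()

def tcr(roi):
    """Target-to-clutter ratio in dB over an RoI containing a target."""
    return 20.0 * np.log10(roi.max() / roi.mean())
```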

4.2. Experiments on Synthetic SAR Images

Table 1 presents the average PSNR and SSIM values for each comparison method on both the SAR-9 (https://github.com/BFY-official/SAR-9-dataset, accessed on 28 July 2025) and SSAR (https://github.com/BFY-official/SSAR-dataset, accessed on 28 July 2025) datasets, evaluated across various noise levels (denoted as $L = 1, 2, 4$). As shown in Table 1, the proposed method achieves the highest PSNR and SSIM scores at nearly all noise levels. Notably, under the most challenging condition ($L = 1$), where speckle noise is most severe, our method attains the highest PSNR and SSIM values, demonstrating remarkable robustness to high-intensity noise. This advantage is most apparent in high-noise scenarios, where traditional methods such as FANS, MulogB, and SAR-NNFN exhibit significant performance degradation. Methods such as PDSNet and HTNet also display performance degradation as L varies, revealing their instability in adapting to different noise levels. In contrast, the proposed method performs consistently across noise intensities, underscoring its robustness in handling a wide range of noise conditions and its effectiveness in preserving both image structure and pixel-level accuracy.
In addition to the objective metrics, a visual comparison of the despeckled images further emphasizes the effectiveness of the proposed method. Figure 4 shows results of various despeckling methods applied to a test image with noise level L = 1 . Traditional methods like SAR-BM3D and SAR-NNFN preserve edge sharpness but struggle with high-texture regions, leaving residual noise and artifacts. Deep learning-based methods, such as PDSNet and HTNet, demonstrate notable improvements, maintaining structural integrity and achieving better noise reduction, but they tend to over-smooth complex areas, losing fine details. Methods like SDUDNet, MSANN, and LGDBNet excel at preserving textures, but sometimes introduce artifacts or instability. In contrast, the proposed method, using a noise-aware Transformer architecture, effectively balances despeckling performance and structural fidelity. As shown in Figure 4, it reduces speckle noise while preserving edges and intricate textures. An additional example in Figure 5 demonstrates its performance at a lower noise level ( L = 4 ), where it maintains clear edges and accurate textures. Figure 6 presents the ratio images of the speckled to despeckled images, where SAR-BM3D, PDSNet, and HTNet exhibit noticeable contours indicating imperfect noise separation. In contrast, the ratio images from SDUDNet, MFAENet, and LGDBNet show better structural preservation but still display subtle leakage. Remarkably, SemDNet and the proposed method exhibit almost no structural leakage, highlighting their superior ability to preserve textures and details while effectively removing speckle noise.

4.3. Experiments on Real SAR Images

To assess the performance of the proposed method in real-world scenarios, we selected several real SAR images for testing. Figure 7 displays four representative images from different datasets: two images acquired by TerraSAR-X with a noise level of L = 1 (Figure 7a,b), an image from the Lynx airborne radar at Sandia National Laboratories with L = 3 (Figure 7c), and an image from Sentinel-1 with L = 4 (Figure 7d). Due to space limitations, additional experimental results can be found in the Supplementary Materials.
Figure 8 presents the despeckling results for R1. It is evident that methods such as SAR-BM3D, PDSNet, HTNet, and LGDBNet exhibit weak noise suppression capabilities, leaving significant speckle artifacts. Although SAR-NNFN, SAR2SAR, and SDUDNet effectively suppress speckle noise, they tend to over-smooth the image, resulting in a loss of fine structural details. In contrast, the proposed method achieves an optimal balance, effectively preserving both edge and spatial details while providing superior speckle suppression. A similar trend can be observed in Figure 9, where methods such as SAR-BM3D, SAR-NNFN, MSANN, Speckle2Void, and SDUDNet exhibit clear signs of over-smoothing, particularly in building regions. This leads to blurred structural edges and a loss of important textural details. Methods such as PDSNet, HTNet, and LGDBNet also exhibit significant speckle artifacts, which indicate their inability to effectively suppress speckle noise. While MFAENet offers improved despeckling results, it compromises the preservation of fine structural information, as indicated by the red boxes in the figure. In comparison, the proposed method excels by effectively removing speckle noise while preserving critical structural features, such as building contours, linear structures, and intricate details. Figure 10 presents ratio images for R2. The results demonstrate that methods such as SAR-NNFN, PDSNet, LGDBNet, and SAR2SAR suffer from noticeable texture leakage, suggesting difficulties in preserving fine textures. HTNet, MSANN, and SDUDNet also show some degree of texture leakage, indicating room for improvement in structural preservation. In contrast, both MFAENet and the proposed method exhibit minimal texture leakage, demonstrating their ability to retain fine details while effectively suppressing speckle noise. Figure 11 and Figure 12 further reinforce these observations, with the proposed method continuing to outperform other methods in preserving both structural integrity and texture details, while effectively minimizing speckle noise. Finally, Figure 13 and Figure 14 provide additional examples, further illustrating the superior performance of the proposed method across varying noise levels.
The quantitative evaluation results for real SAR images further validate the findings from the visual inspection. As shown in Table 2, the proposed method achieves the best NIQE and BRISQUE scores, suggesting that it excels in maintaining perceptual quality by effectively reducing speckle noise while preserving fine structural details. Additionally, the proposed method outperforms the other methods in terms of ENL and achieves the highest TCR, further demonstrating its superior noise suppression and better preservation of information. The last column of Table 2 presents the runtime of each method. Although our method is not the most efficient, it achieves a relatively fast runtime, and the improvement in despeckling performance justifies the additional computational cost, indicating a well-balanced trade-off between performance and efficiency.

5. Discussion

The experimental results clearly demonstrate that the proposed method outperforms existing methods in terms of despeckling performance across a wide range of noise conditions, as evidenced by both synthetic and real-world SAR images. This superior performance can be attributed to the synergistic integration of three key components: the noise estimation branch, the coarse despeckling branch, and the DMSA Transformer block. These components work together to address the inherent limitations of individual methods, thereby enhancing the overall performance of the model. To further validate the contribution of each component, we conducted an ablation study on both synthetic and real SAR images. In this study, we systematically disabled each module and evaluated the resulting performance. Note that when the DMSA module is disabled, we replaced it with a basic Vision Transformer to isolate its specific impact. The results of this study are summarized in Table 3. As shown in Table 3, when none of the modules are used, the performance is the weakest, highlighting the importance of each component in achieving optimal results. This outcome underscores the necessity of combining all modules to fully exploit their complementary strengths. When only the noise estimation branch was utilized, the performance showed an improvement over the baseline but remained suboptimal. This finding demonstrates the value of explicit noise modeling, which provides valuable prior information to better guide the despeckling process. Incorporating the coarse despeckling branch alongside the noise estimation branch led to further improvement in performance, underscoring the importance of progressively refining the despeckling process. Finally, when all modules were enabled, the proposed method achieved the best performance across all tested noise conditions. This result highlights the synergistic effect of combining the noise estimation branch, the coarse despeckling branch, and the DMSA Transformer block. The successful integration of these modules not only enhances the denoising results but also makes the method highly robust and adaptable to a wide range of noise conditions.
To better evaluate the effectiveness of each component in the proposed method, we conducted an ablation study on the GPA and CAF modules. As shown in Table 4, the introduction of the GPA module led to improvements across all evaluation metrics, demonstrating its ability to enhance the model's capacity to capture both spatial and channel dependencies. Similarly, the inclusion of the CAF module further boosted performance, as evidenced by the higher evaluation metrics. Therefore, the integration of GPA and CAF plays a pivotal role in improving the overall effectiveness of the proposed method. We also conducted an ablation study on the number of groups r in the GPA module, evaluating several values of r (i.e., r = 2, 4, 6, 8). The purpose of grouping is to reduce the dimensionality of the feature map, allowing each group to focus on specific spatial or channel patterns, thereby capturing diverse semantic information while still maintaining the ability to model fine-grained local dependencies. As shown in Table 5, our experiments reveal that setting r = 4 achieves the optimal balance across various performance metrics.
Although our method performs well across various noise levels, its effectiveness may decrease in extremely high-noise scenarios, where the noise can overwhelm the signal and challenge the model’s ability to preserve fine details. Figure 15 demonstrates the denoising results at different noise levels for the same scene. As shown in the figure, in regions with high noise and complexity, such as densely structured or heterogeneous terrain areas, our method may introduce artifacts or residual speckle. This is a common limitation shared by many denoising methods, highlighting the difficulty in balancing noise suppression and detail preservation under such challenging conditions.

6. Conclusions

In this paper, we propose a novel SAR image despeckling method that combines a dual-branch convolutional subnet with a noise-guided Transformer refinement subnet. The proposed framework effectively decouples speckle noise from the underlying image structure, allowing for enhanced despeckling performance and better preservation of fine details. Through the dual-branch subnet, which separately handles coarse despeckling and noise estimation, we were able to leverage complementary information from both the noise and image content, resulting in improved feature disentanglement and more accurate noise suppression. The noise-aware features provided to the Transformer-based subnet enabled adaptive despeckling, adjusting the filtering process to varying noise intensities across different regions. Additionally, the incorporation of spatial-channel attention and multi-scale contextual aggregation strengthened the model's ability to capture both local and global dependencies, significantly enhancing the overall performance. Future work could explore further optimization of the model for real-time processing and extend the approach to multi-modal or multi-source data.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs17233863/s1.

Author Contributions

Conceptualization, L.Z. (Linna Zhang) and Y.W.; formal analysis, L.Z. (Le Zheng) and Y.W.; methodology, L.Z. (Linna Zhang) and F.Z.; software, L.Z. (Linna Zhang), Y.W. and F.B.; investigation, F.Z. and Y.C.; writing—original draft preparation, L.Z. (Linna Zhang); writing—review and editing, L.Z. (Linna Zhang), F.B. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62463002 and 62473033, and in part by the Guizhou Provincial Key Laboratory of Mountainous Intelligent Agricultural Machinery (Qiankehe Platform ZSYS[2025]013).

Data Availability Statement

The dataset is available on request from the authors.

Acknowledgments

The authors would like to thank anonymous reviewers for their valuable comments and suggestions, which led to substantial improvements to this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Baraha, S.; Sahoo, A.K. Synthetic aperture radar image and its despeckling using variational methods: A Review of recent trends. Signal Process. 2023, 212, 109156. [Google Scholar] [CrossRef]
  2. Zhang, T.; Zhang, X.; Gao, G. Divergence to Concentration and Population to Individual: A Progressive Approaching Ship Detection Paradigm for Synthetic Aperture Radar Remote Sensing Imagery. IEEE Trans. Aerosp. Electron. Syst. 2025; early access. [Google Scholar] [CrossRef]
  3. Ke, H.; Ke, X.; Zhang, Z.; Chen, X.; Xu, X.; Zhang, T. SLA-Net: A Novel Sea–Land Aware Network for Accurate SAR Ship Detection Guided by Hierarchical Attention Mechanism. Remote Sens. 2025, 17, 3576. [Google Scholar] [CrossRef]
  4. Xue, W.; Ai, J.; Zhu, Y.; Chen, J.; Zhuang, S. AIS-FCANet: Long-Term AIS Data Assisted Frequency-Spatial Contextual Awareness Network for Salient Ship Detection in SAR Imagery. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 15166–15171. [Google Scholar] [CrossRef]
  5. Ai, J.; Mao, Y.; Luo, Q.; Jia, L.; Xing, M. SAR Target Classification Using the Multikernel-Size Feature Fusion-Based Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5214313. [Google Scholar] [CrossRef]
  6. Chen, Y.; Shen, Y.; Duan, C.; Wang, Z.; Mo, Z.; Liang, Y.; Zhang, Q. Robust and Efficient SAR Ship Detection: An Integrated Despecking and Detection Framework. Remote Sens. 2025, 17, 580. [Google Scholar] [CrossRef]
  7. Cardona-Mesa, A.A.; Vásquez-Salazar, R.D.; Travieso-González, C.M.; Gómez, L. Comparative Analysis of Despeckling Filters Based on Generative Artificial Intelligence Trained with Actual Synthetic Aperture Radar Imagery. Remote Sens. 2025, 17, 828. [Google Scholar] [CrossRef]
  8. Fracastoro, G.; Magli, E.; Poggi, G.; Scarpa, G.; Valsesia, D.; Verdoliva, L. Deep learning methods for synthetic aperture radar image despeckling: An overview of trends and perspectives. IEEE Geosci. Remote Sens. Mag. 2021, 9, 29–51. [Google Scholar] [CrossRef]
  9. An, X.; Zeng, H.; Li, Z.; Yang, W.; Xiong, W.; Wang, Y.; Liu, Y. SAR Images Despeckling Using Subaperture Decomposition and Non-Local Low-Rank Tensor Approximation. Remote Sens. 2025, 17, 2716. [Google Scholar] [CrossRef]
  10. Fang, J.; Mao, T.; Bo, F.; Hao, B.; Zhang, N.; Hu, S.; Lu, W.; Wang, X. A SAR image-despeckling method based on HOSVD using tensor patches. Remote Sens. 2023, 15, 3118. [Google Scholar] [CrossRef]
  11. Bo, F.; Ma, X.; Cen, Y.; Hu, S. SAR Image Speckle Reduction Based on Nuclear Norm Minus Frobenius Norm Regularization. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5227915. [Google Scholar] [CrossRef]
  12. Deledalle, C.A.; Denis, L.; Tupin, F. Iterative weighted maximum likelihood denoising with probabilistic patch-based weights. IEEE Trans. Image Process. 2009, 18, 2661–2672. [Google Scholar] [CrossRef]
  13. Parrilli, S.; Poderico, M.; Angelino, C.V.; Verdoliva, L. A nonlocal SAR image denoising algorithm based on LLMMSE wavelet shrinkage. IEEE Trans. Geosci. Remote Sens. 2011, 50, 606–616. [Google Scholar] [CrossRef]
  14. Singh, P.; Diwakar, M.; Shankar, A.; Shree, R.; Kumar, M. A Review on SAR Image and its Despeckling. Arch. Comput. Methods Eng. 2021, 28, 4633–4653. [Google Scholar] [CrossRef]
  15. Shen, Y.; Chen, Y.; Wang, Y.; Ma, L.; Zhang, X. DATNet: Dynamic Adaptive Transformer Network for SAR Image Denoising. Remote Sens. 2025, 17, 3031. [Google Scholar] [CrossRef]
  16. Chierchia, G.; Cozzolino, D.; Poggi, G.; Verdoliva, L. SAR image despeckling through convolutional neural networks. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 5438–5441. [Google Scholar]
  17. Lin, C.; Qiu, C.; Jiang, H.; Zou, L. A Deep Neural Network Based on Prior-Driven and Structural Preserving for SAR Image Despeckling. J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 6372–6392. [Google Scholar] [CrossRef]
  18. Cheng, L.; Guo, Z.; Li, Y.; Xing, Y. Two-Stream Multiplicative Heavy-Tail Noise Despeckling Network With Truncation Loss. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5213817. [Google Scholar] [CrossRef]
  19. Guo, Y.; Lu, Y.; Liu, R.W.; Zhu, F. Blind Image Despeckling Using a Multiscale Attention-Guided Neural Network. IEEE Trans. Artif. Intell. 2024, 5, 205–216. [Google Scholar] [CrossRef]
  20. Liu, S.; Zhang, L.; Tian, S.; Hu, Q.; Li, B.; Zhang, Y. MFAENet: A Multi-Scale Feature Adaptive Enhancement Network for SAR Image Despeckling. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 10420–10433. [Google Scholar] [CrossRef]
  21. Pan, Y.; Zhong, L.; Chen, J.; Li, H.; Zhang, X.; Pan, B. SAR image despeckling based on denoising diffusion probabilistic model and swin transformer. Remote Sens. 2024, 16, 3222. [Google Scholar] [CrossRef]
  22. Guo, Z.; Hu, W.; Zheng, S.; Zhang, B.; Zhou, M.; Peng, J.; Yao, Z.; Feng, M. Efficient Conditional Diffusion Model for SAR Despeckling. Remote Sens. 2025, 17, 2970. [Google Scholar] [CrossRef]
  23. Liu, S.; Lei, Y.; Zhang, L.; Li, B.; Hu, W.; Zhang, Y.D. MRDDANet: A Multiscale Residual Dense Dual Attention Network for SAR Image Denoising. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5214213. [Google Scholar] [CrossRef]
  24. Thakur, R.K.; Maji, S.K. AGSDNet: Attention and Gradient-Based SAR Denoising Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4506805. [Google Scholar] [CrossRef]
  25. Wang, X.; Wu, Y.; Shi, C.; Yuan, Y.; Zhang, X. ANED-Net: Adaptive Noise Estimation and Despeckling Network for SAR Image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4036–4051. [Google Scholar] [CrossRef]
  26. Liu, S.; Tian, S.; Zhao, Y.; Hu, Q.; Li, B.; Zhang, Y.D. LG-DBNet: Local and Global Dual-Branch Network for SAR Image Denoising. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205515. [Google Scholar] [CrossRef]
  27. Wang, C.; Zheng, R.; Zhu, J.; Xu, W.; Li, X. A Practical SAR Despeckling Method Combining Swin Transformer and Residual CNN. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4001205. [Google Scholar] [CrossRef]
  28. Xiao, S.; Zhang, S.; Huang, L.; Wang, W.Q. Trans-NLM Network for SAR Image Despeckling. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5211912. [Google Scholar] [CrossRef]
  29. Bo, F.; Lu, W.; Wang, G.; Zhou, M.; Wang, Q.; Fang, J. A blind SAR image despeckling method based on improved weighted nuclear norm minimization. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4515305. [Google Scholar] [CrossRef]
  30. Lee, J.S. Digital image enhancement and noise filtering by use of local statistics. IEEE Trans. Pattern Anal. Mach. Intell. 1980, PAMI-2, 165–168. [Google Scholar] [CrossRef]
  31. Kuan, D.; Sawchuk, A.; Strand, T.; Chavel, P. Adaptive restoration of images with speckle. IEEE Trans. Acoust. Speech Signal Process. 1987, 35, 373–383. [Google Scholar] [CrossRef]
  32. Frost, V.S.; Stiles, J.A.; Shanmugan, K.S.; Holtzman, J.C. A Model for Radar Images and Its Application to Adaptive Digital Filtering of Multiplicative Noise. IEEE Trans. Pattern Anal. Mach. Intell. 1982, PAMI-4, 157–166. [Google Scholar] [CrossRef]
  33. Xu, L.; Liu, P.; Jin, Y.Q. A New Nonlocal Iterative Trilateral Filter for SAR Images Despeckling. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5213319. [Google Scholar] [CrossRef]
  34. Aranda-Bojorges, G.; Ponomaryov, V.; Reyes-Reyes, R.; Sadovnychiy, S.; Cruz-Ramos, C. Clustering-Based 3-D-MAP Despeckling of SAR Images Using Sparse Wavelet Representation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4018005. [Google Scholar] [CrossRef]
35. Penna, P.A.; Mascarenhas, N.D. SAR Speckle Nonlocal Filtering With Statistical Modeling of Haar Wavelet Coefficients and Stochastic Distances. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7194–7208. [Google Scholar] [CrossRef]
  36. Buades, A.; Coll, B.; Morel, J.M. A non-local algorithm for image denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 2, pp. 60–65. [Google Scholar]
37. Wang, G.; Bo, F.; Chen, X.; Lu, W.; Hu, S.; Fang, J. A Collaborative Despeckling Method for SAR Images Based on Texture Classification. Remote Sens. 2022, 14, 1465. [Google Scholar] [CrossRef]
  38. Zhang, J.; Chen, J.; Yu, H.; Yang, D.; Xu, X.; Xing, M. Learning an SAR Image Despeckling Model Via Weighted Sparse Representation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7148–7158. [Google Scholar] [CrossRef]
  39. Liang, Y.; Yang, X.; Tan, W.; Wang, Z.; Huang, P.; Yang, J. Ratio-Based Multitemporal SAR Image Despeckling With Low-Rank Approximation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4000105. [Google Scholar] [CrossRef]
  40. Guan, J.; Liu, R.; Tian, X.; Tang, X.; Li, S. Robust SAR Image Despeckling by Deep Learning From Near-Real Datasets. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2963–2979. [Google Scholar] [CrossRef]
  41. Wang, P.; Zhang, H.; Patel, V.M. SAR Image Despeckling Using a Convolutional Neural Network. IEEE Signal Process. Lett. 2017, 24, 1763–1767. [Google Scholar] [CrossRef]
42. Zhang, Q.; Yuan, Q.; Li, J.; Yang, Z.; Ma, X. Learning a Dilated Residual Network for SAR Image Despeckling. Remote Sens. 2018, 10, 196. [Google Scholar] [CrossRef]
  43. Bai, Y.; Xiao, Y.; Hou, X.; Li, Y.; Shang, C.; Shen, Q. SAR Image Despeckling with Residual-in-Residual Dense Generative Adversarial Network. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  44. Bo, F.; Jin, Y.; Ma, X.; Cen, Y.; Hu, S.; Li, Y. SemDNet: Semantic-Guided Despeckling Network for SAR Images. Expert Syst. Appl. 2025, 296, 129200. [Google Scholar] [CrossRef]
  45. Dalsasso, E.; Denis, L.; Tupin, F. SAR2SAR: A Semi-Supervised Despeckling Algorithm for SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4321–4329. [Google Scholar] [CrossRef]
  46. Dalsasso, E.; Denis, L.; Tupin, F. As If by Magic: Self-Supervised Training of Deep Despeckling Networks With MERLIN. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4704713. [Google Scholar] [CrossRef]
  47. Lin, H.; Zhuang, Y.; Huang, Y.; Ding, X. Unpaired Speckle Extraction for SAR Despeckling. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5201014. [Google Scholar] [CrossRef]
48. Molini, A.B.; Valsesia, D.; Fracastoro, G.; Magli, E. Speckle2Void: Deep Self-Supervised SAR Despeckling With Blind-Spot Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5204017. [Google Scholar] [CrossRef]
  49. Bo, F.; Ma, X.; Hu, S.; An, G.; Li, Y.; Cen, Y. Speckle-Driven Unsupervised Despeckling for SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 13023–13034. [Google Scholar] [CrossRef]
  50. Deng, J.W.; Li, M.D.; Chen, S.W. Sublook2Sublook: A Self-Supervised Speckle Filtering Framework for Single SAR Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5211613. [Google Scholar] [CrossRef]
  51. Zhang, T.; Zhang, X.; Shi, J.; Wei, S. HyperLi-Net: A hyper-light deep learning network for high-accurate and high-speed ship detection from synthetic aperture radar imagery. ISPRS J. Photogramm. Remote Sens. 2020, 167, 123–153. [Google Scholar] [CrossRef]
  52. Zhang, T.; Zhang, X.; Liu, C.; Shi, J.; Wei, S.; Ahmad, I.; Zhan, X.; Zhou, Y.; Pan, D.; Li, J.; et al. Balance learning for ship detection from synthetic aperture radar remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 182, 190–207. [Google Scholar] [CrossRef]
  53. Zhang, T.; Zhang, X.; Ke, X.; Liu, C.; Xu, X.; Zhan, X.; Wang, C.; Ahmad, I.; Zhou, Y.; Pan, D.; et al. HOG-ShipCLSNet: A novel deep learning network with hog feature fusion for SAR ship classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5210322. [Google Scholar] [CrossRef]
  54. Cozzolino, D.; Parrilli, S.; Scarpa, G.; Poggi, G.; Verdoliva, L. Fast Adaptive Nonlocal SAR Despeckling. IEEE Geosci. Remote Sens. Lett. 2014, 11, 524–528. [Google Scholar] [CrossRef]
  55. Deledalle, C.A.; Denis, L.; Tabti, S.; Tupin, F. MuLoG, or How to Apply Gaussian Denoisers to Multi-Channel SAR Speckle Reduction? IEEE Trans. Image Process. 2017, 26, 4389–4403. [Google Scholar] [CrossRef] [PubMed]
56. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “Completely Blind” Image Quality Analyzer. IEEE Signal Process. Lett. 2013, 20, 209–212. [Google Scholar] [CrossRef]
  57. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall framework of our proposed method.
Figure 2. Architecture of the GPA and CAF.
Figure 3. Architecture of the RTB, DMTL, and DMSA.
Figure 4. Despeckling results of each method for a synthetic SAR image. (a) Clean image. (b) Noisy image (L = 1). (c) SAR-BM3D. (d) SAR-NNFN. (e) SDUDNet. (f) SemDNet. (g) PDSNet. (h) HTNet. (i) MSANN. (j) MFAENet. (k) LGDBNet. (l) Proposed. The red boxes highlight the structural details.
Figure 5. Despeckling results of each method for a synthetic SAR image. (a) Clean image. (b) Noisy image (L = 4). (c) SAR-BM3D. (d) SAR-NNFN. (e) SDUDNet. (f) SemDNet. (g) PDSNet. (h) HTNet. (i) MSANN. (j) MFAENet. (k) LGDBNet. (l) Proposed. The red boxes highlight the structural details.
Figure 6. Ratio images obtained by different methods. (a) SAR-BM3D. (b) SAR-NNFN. (c) SDUDNet. (d) SemDNet. (e) PDSNet. (f) HTNet. (g) MSANN. (h) MFAENet. (i) LGDBNet. (j) Proposed.
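The ratio image is the pixel-wise quotient of the speckled input and the despeckled output; for an ideal despeckler it contains only speckle, with a mean close to 1 and no visible scene structure. A minimal NumPy sketch of how such a ratio image can be computed (variable names are illustrative, not the paper's code):

```python
import numpy as np

def ratio_image(noisy, despeckled, eps=1e-8):
    """Pixel-wise ratio of the speckled input to the despeckled output.

    For an ideal despeckler the ratio is pure speckle: its mean is
    close to 1 and no scene structure remains visible.
    """
    return noisy.astype(np.float64) / (despeckled.astype(np.float64) + eps)

# A ratio whose mean drifts from 1 indicates radiometric bias;
# visible edges in the ratio indicate over-smoothing of structure.
```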
Figure 7. Real SAR images. (a) R1 (L = 1). (b) R2 (L = 1). (c) R3 (L = 3). (d) R4 (L = 4). The blue and green boxes mark the RoIs selected to calculate TCR and ENL, respectively.
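ENL and TCR are computed over the RoIs marked in Figure 7. A hedged NumPy sketch under the usual definitions, ENL as mean²/variance over a homogeneous region and TCR as the mean target-to-clutter intensity ratio in dB (the paper's exact formulas may differ; some works use peak target intensity):

```python
import numpy as np

def enl(homogeneous_region):
    """Equivalent number of looks: mean^2 / variance.
    Higher ENL means stronger smoothing in homogeneous areas."""
    roi = homogeneous_region.astype(np.float64)
    return roi.mean() ** 2 / roi.var()

def tcr_db(target_roi, clutter_roi):
    """Target-to-clutter ratio in dB (mean-intensity definition).
    Higher TCR after despeckling means point targets are preserved
    while the surrounding clutter is suppressed."""
    t = target_roi.astype(np.float64).mean()
    c = clutter_roi.astype(np.float64).mean()
    return 10.0 * np.log10(t / c)
```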
Figure 8. Despeckling results of each method for R1. (a) Noisy image (L = 1). (b) SAR-BM3D. (c) SAR-NNFN. (d) PDSNet. (e) HTNet. (f) MSANN. (g) MFAENet. (h) LGDBNet. (i) SAR2SAR. (j) Speckle2void. (k) SDUDNet. (l) Proposed. The red boxes highlight the structural details.
Figure 9. Despeckling results of each method for R2. (a) Noisy image (L = 1). (b) SAR-BM3D. (c) SAR-NNFN. (d) PDSNet. (e) HTNet. (f) MSANN. (g) MFAENet. (h) LGDBNet. (i) SAR2SAR. (j) Speckle2void. (k) SDUDNet. (l) Proposed. The red boxes highlight the structural details.
Figure 10. Ratio images for R2 obtained by different despeckling methods. (a) SAR-BM3D. (b) SAR-NNFN. (c) PDSNet. (d) HTNet. (e) MSANN. (f) MFAENet. (g) LGDBNet. (h) SAR2SAR. (i) SDUDNet. (j) Proposed.
Figure 11. Despeckling results of each method for R3. (a) Noisy image (L = 3). (b) SAR-BM3D. (c) SAR-NNFN. (d) PDSNet. (e) HTNet. (f) MSANN. (g) MFAENet. (h) LGDBNet. (i) SAR2SAR. (j) Speckle2void. (k) SDUDNet. (l) Proposed. The red/yellow boxes highlight the structural details.
Figure 12. Ratio images for R3 obtained by different despeckling methods. (a) SAR-BM3D. (b) SAR-NNFN. (c) PDSNet. (d) HTNet. (e) MSANN. (f) MFAENet. (g) LGDBNet. (h) SAR2SAR. (i) SDUDNet. (j) Proposed.
Figure 13. Despeckling results of each method for R4. (a) Noisy image (L = 4). (b) SAR-BM3D. (c) SAR-NNFN. (d) PDSNet. (e) HTNet. (f) MSANN. (g) MFAENet. (h) LGDBNet. (i) SAR2SAR. (j) Speckle2void. (k) SDUDNet. (l) Proposed. The red boxes highlight the structural details.
Figure 14. Ratio images for R4 obtained by different despeckling methods. (a) SAR-BM3D. (b) SAR-NNFN. (c) PDSNet. (d) HTNet. (e) MSANN. (f) MFAENet. (g) LGDBNet. (h) SAR2SAR. (i) SDUDNet. (j) Proposed.
Figure 15. Despeckling results for different noise levels in the same scene. (a) Clean image. (b) Speckled (L = 1). (c) Speckled (L = 2). (d) Speckled (L = 4). (e) Despeckled (L = 1). (f) Despeckled (L = 2). (g) Despeckled (L = 4). The red boxes highlight the structural details.
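The speckled inputs in Figure 15 follow the standard multiplicative model: for an L-look intensity image, fully developed speckle is Gamma-distributed with unit mean and variance 1/L. A minimal NumPy sketch of this simulation (assuming intensity-format images; this is not the paper's data pipeline):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_speckle(clean, looks):
    """Multiply a clean intensity image by fully developed L-look
    speckle: Gamma(shape=L, scale=1/L) has mean 1 and variance 1/L,
    so larger L gives weaker noise (compare L = 1, 2, 4 in Figure 15)."""
    speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * speckle
```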
Table 1. The average objective indices of each method on synthetic SAR images. Bold indicates the best values and underline indicates the second-best values.

SAR-9:

| Method | PSNR (L = 1) | SSIM (L = 1) | PSNR (L = 2) | SSIM (L = 2) | PSNR (L = 4) | SSIM (L = 4) |
|---|---|---|---|---|---|---|
| SAR-BM3D [13] | 25.21 | 0.7334 | 27.03 | 0.7991 | 28.75 | 0.8540 |
| FANS [54] | 25.07 | 0.7066 | 26.98 | 0.7865 | 28.83 | 0.8487 |
| MulogB [55] | 25.06 | 0.7162 | 26.82 | 0.7846 | 28.68 | 0.8460 |
| SAR-NNFN [11] | 25.07 | 0.7047 | 27.16 | 0.7927 | 29.03 | 0.8548 |
| SDUDNet [49] | 25.61 | 0.7305 | 27.51 | 0.8024 | 29.19 | 0.8562 |
| SemDNet [44] | 26.10 | 0.7612 | 27.62 | 0.8072 | 29.30 | 0.8596 |
| PDSNet [17] | 25.88 | 0.7334 | 27.32 | 0.7900 | 28.74 | 0.8378 |
| HTNet [18] | 25.66 | 0.7202 | 27.10 | 0.7767 | 28.52 | 0.8269 |
| MSANN [19] | 25.81 | 0.7395 | 27.24 | 0.7984 | 28.58 | 0.8421 |
| MFAENet [20] | 26.05 | 0.7528 | 27.68 | 0.8104 | 29.29 | 0.8593 |
| LGDBNet [26] | 24.08 | 0.6529 | 26.97 | 0.7824 | 28.87 | 0.8458 |
| Proposed | 26.15 | 0.7638 | 27.67 | 0.8166 | 29.32 | 0.8637 |

SSAR:

| Method | PSNR (L = 1) | SSIM (L = 1) | PSNR (L = 2) | SSIM (L = 2) | PSNR (L = 4) | SSIM (L = 4) |
|---|---|---|---|---|---|---|
| SAR-BM3D [13] | 27.09 | 0.7739 | 28.92 | 0.8240 | 30.64 | 0.8670 |
| FANS [54] | 27.05 | 0.7571 | 28.91 | 0.8136 | 30.66 | 0.8603 |
| MulogB [55] | 27.09 | 0.7676 | 28.87 | 0.8170 | 30.63 | 0.8619 |
| SAR-NNFN [11] | 27.10 | 0.7505 | 29.12 | 0.8167 | 30.98 | 0.8658 |
| SDUDNet [49] | 28.08 | 0.7934 | 29.37 | 0.8263 | 30.85 | 0.8662 |
| SemDNet [44] | 28.03 | 0.7938 | 29.58 | 0.8335 | 30.98 | 0.8736 |
| PDSNet [17] | 27.73 | 0.7688 | 29.27 | 0.8171 | 30.73 | 0.8566 |
| HTNet [18] | 27.54 | 0.7609 | 29.02 | 0.8064 | 30.42 | 0.8464 |
| MSANN [19] | 27.49 | 0.7689 | 28.93 | 0.8188 | 30.18 | 0.8541 |
| MFAENet [20] | 28.01 | 0.7875 | 29.61 | 0.8332 | 31.11 | 0.8713 |
| LGDBNet [26] | 24.95 | 0.6237 | 28.34 | 0.7754 | 30.67 | 0.8565 |
| Proposed | 28.09 | 0.7936 | 29.60 | 0.8365 | 31.13 | 0.8750 |
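PSNR and SSIM in Table 1 are full-reference metrics computed against the clean image. A minimal sketch using scikit-image (assuming 8-bit image arrays; the authors' evaluation code is not shown here):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(clean, despeckled):
    """Full-reference scores against the clean image, as in Table 1."""
    psnr = peak_signal_noise_ratio(clean, despeckled, data_range=255)
    ssim = structural_similarity(clean, despeckled, data_range=255)
    return psnr, ssim

# Table 1 reports these two scores averaged over each test set
# (SAR-9 and SSAR) at every noise level L.
```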
Table 2. The average objective indices of each method on real SAR images. Bold indicates the best values and underline indicates the second-best values.

| Category | Method | NIQE ↓ | BRISQUE ↓ | ENL ↑ | TCR ↑ | Time (s) ↓ |
|---|---|---|---|---|---|---|
| Non-Learning | SAR-BM3D [13] | 5.62 | 41.08 | 25.87 | 3.20 | 42.5 |
| Non-Learning | FANS [54] | 5.76 | 37.84 | 44.62 | 2.58 | 1.62 |
| Non-Learning | MulogB [55] | 6.00 | 39.88 | 85.43 | 1.80 | 10.71 |
| Non-Learning | SAR-NNFN [11] | 7.09 | 40.40 | 136.25 | 3.39 | 51.12 |
| Supervised | SemDNet [44] | 5.22 | 27.34 | 67.38 | 5.44 | 0.09 |
| Supervised | PDSNet [17] | 5.04 | 32.26 | 72.51 | 5.35 | 1.22 |
| Supervised | HTNet [18] | 5.01 | 33.88 | 78.39 | 4.94 | 1.38 |
| Supervised | MSANN [19] | 4.91 | 26.88 | 56.74 | 5.08 | 0.86 |
| Supervised | MFAENet [20] | 5.16 | 28.17 | 72.58 | 5.87 | 0.06 |
| Supervised | LGDBNet [26] | 4.75 | 28.79 | 56.43 | 4.21 | 0.25 |
| Unsupervised | SAR2SAR [45] | 5.03 | 25.80 | 158.39 | 2.20 | 1.13 |
| Unsupervised | Speckle2Void [48] | 5.01 | 34.33 | 64.27 | 4.35 | 1.09 |
| Unsupervised | MERLIN [46] | 7.26 | 28.00 | 70.58 | 4.97 | 0.84 |
| Unsupervised | SDUDNet [49] | 4.71 | 25.47 | 83.24 | 5.62 | 0.04 |
| — | Proposed | 4.32 | 24.46 | 133.71 | 6.86 | 0.61 |
Table 3. Results of ablation studies on the Noise Branch, Coarse Branch, and DMSA. Bold indicates the best values.

| Noise Branch | Coarse Branch | DMSA | PSNR (SAR-9) | SSIM (SAR-9) | NIQE (Real SAR) | BRISQUE (Real SAR) |
|---|---|---|---|---|---|---|
| × | × | × | 26.05 | 0.7552 | 6.25 | 33.16 |
| ✓ | × | × | 26.10 | 0.7594 | 5.04 | 28.52 |
| ✓ | ✓ | × | 26.12 | 0.7611 | 4.85 | 26.09 |
| ✓ | ✓ | ✓ | 26.15 | 0.7638 | 4.32 | 24.46 |
Table 4. Results of ablation studies on GPA and CAF. Bold indicates the best values.

| GPA | CAF | PSNR (SAR-9) | SSIM (SAR-9) | NIQE (Real SAR) | BRISQUE (Real SAR) |
|---|---|---|---|---|---|
| × | × | 26.07 | 0.7523 | 7.36 | 30.75 |
| ✓ | × | 26.09 | 0.7571 | 6.65 | 26.19 |
| × | ✓ | 26.13 | 0.7604 | 5.62 | 25.73 |
| ✓ | ✓ | 26.15 | 0.7638 | 4.32 | 24.46 |
Table 5. Ablation study of the number of groups r in the GPA module. Bold indicates the best values.

| r | PSNR (SAR-9) | SSIM (SAR-9) | NIQE (Real SAR) | BRISQUE (Real SAR) |
|---|---|---|---|---|
| 2 | 26.11 | 0.7573 | 5.07 | 24.13 |
| 4 | 26.15 | 0.7638 | 4.32 | 24.46 |
| 6 | 26.17 | 0.7629 | 4.93 | 25.41 |
| 8 | 26.12 | 0.7588 | 5.25 | 25.83 |
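Table 5 varies the number of channel groups r in GPA. Purely as an illustration of what r controls, a hypothetical NumPy sketch of channel-grouped pooling attention is given below; this is not the paper's GPA implementation, whose actual design is described in the main text:

```python
import numpy as np

def grouped_pooling_attention(x, r):
    """Hypothetical sketch: split C channels into r groups, pool each
    group globally, and reweight the groups by a softmax over the
    pooled responses. Larger r gives finer-grained channel reweighting."""
    n, c, h, w = x.shape
    assert c % r == 0, "channel count must be divisible by the group number r"
    groups = x.reshape(n, r, c // r, h, w)
    pooled = groups.mean(axis=(2, 3, 4))            # (n, r): one response per group
    pooled -= pooled.max(axis=1, keepdims=True)     # numerical stability for softmax
    weights = np.exp(pooled) / np.exp(pooled).sum(axis=1, keepdims=True)
    return (groups * weights[:, :, None, None, None]).reshape(n, c, h, w)
```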