Next Article in Journal
NCSS-Net: A Negatively Constrained Network with Self-Supervised Band Selection for Hyperspectral Image Underwater Target Detection
Previous Article in Journal
TAUT: A Remote Sensing-Based Terrain-Adaptive U-Net Transformer for High-Resolution Spatiotemporal Downscaling of Temperature over Southwest China
Previous Article in Special Issue
Feasibility of Deep Learning-Based Iceberg Detection in Land-Fast Arctic Sea Ice Using YOLOv8 and SAR Imagery
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Small Ship Detection Based on a Learning Model That Incorporates Spatial Attention Mechanism as a Loss Function in SU-ESRGAN

Department of Information Science, Science and Engineering Faculty, Saga University, Saga 840-8502, Japan
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 417; https://doi.org/10.3390/rs18030417
Submission received: 28 November 2025 / Revised: 22 January 2026 / Accepted: 23 January 2026 / Published: 27 January 2026
(This article belongs to the Special Issue Applications of SAR for Environment Observation Analysis)

Highlights

What are the main findings?
  • Integrating ESRGAN and its semantic-structure-enhanced variant (SU-ESRGAN) into SAR imagery processing improved visual quality and detail reconstruction, though ESRGAN tended to amplify SAR-specific noise.
  • The spatial-attention-augmented SA/SU-ESRGAN further refined ship boundaries and suppressed noise, yielding marginally higher detection accuracy, particularly for low-resolution and small ship scenarios.
What are the implications of the main findings?
  • Incorporating semantic structure and spatial attention in super-resolution networks enhances SAR image interpretability and robustness for downstream tasks like ship detection.
  • Tailoring super-resolution models to domain-specific noise and texture characteristics can substantially improve small object detection performance in remote sensing applications.

Abstract

Ship monitoring using Synthetic Aperture Radar (SAR) data faces significant challenges in detecting small vessels due to low spatial resolution and speckle noise. While ESRGAN (Enhanced Super-Resolution Generative Adversarial Network) has shown promise for image super-resolution, it struggles with SAR imagery characteristics. This study proposes SA/SU-ESRGAN, which extends the SU-ESRGAN framework by incorporating a spatial attention mechanism loss function. SU-ESRGAN introduced semantic structural loss to accurately preserve ship shapes and contours; our enhancement adds spatial attention to focus reconstruction efforts on ship regions while suppressing background noise. Experimental results demonstrate that SA/SU-ESRGAN successfully detects small vessels that remain undetectable by SU-ESRGAN, achieving improved detection capabilities with a PSNR of approximately 26 dB (SSIM is around 0.5) and enhanced visual clarity in ship boundaries. The spatial attention mechanism effectively reduces noise influence, producing clearer super-resolution results suitable for maritime surveillance applications. Based on the HRSID dataset, a representative dataset for evaluating ship detection performance using SAR data, we evaluated ship detection performance using images in which the spatial resolution of the SAR data was artificially degraded using a smoothing filter. We found that with a 4 × 4 filter, all eight ships were detected without any problems, but with an 8 × 8 filter, only three of the eight ships were detected. When super-resolution was applied to this, six ships were detected.

1. Introduction

Ship monitoring plays a critical role in national security, fisheries management, anti-smuggling operations, and disaster relief. In recent years, concerns have escalated regarding illegal fishing, smuggling activities, and suspicious vessel movements. While large merchant ships and some fishing vessels are equipped with Automatic Identification System (AIS) (https://www.vesselfinder.com/, accessed on 20 December 2025) transponders that broadcast position and identification information, many small fishing boats, suspicious ships, and smuggling vessels operate without AIS. These non-AIS vessels pose detection challenges that require external sensing technologies, such as optical sensors and Synthetic Aperture Radar (SAR).
Detecting small vessels using SAR data presents several difficulties. Low spatial resolution causes ship shapes to appear blurred, while high speckle noise levels can lead to false positives. Previous research has demonstrated that low-resolution SAR makes distinguishing small vessels from noise particularly challenging [1,2,3]. Although high-resolution SAR can improve detection accuracy, it faces practical limitations. First, high-resolution SAR is constrained by orbital parameters and observation conditions, limiting temporal coverage and making real-time monitoring difficult. Second, high-resolution SAR data acquisition imposes significant financial burdens for wide-area, long-term monitoring operations.
This research aims to address resolution, cost, and acquisition frequency challenges by applying super-resolution technology (https://github.com/idealo/image-super-resolution, accessed on 20 December 2025) to enhance low-resolution SAR data. Our ultimate goal is to enable comprehensive vessel monitoring without relying solely on AIS, establishing a cost-effective, high-frequency monitoring system.
ESRGAN has been proposed as a method for monitoring small vessels using SAR data [4,5]. ESRGAN is a model that uses Generative Adversarial Networks (GANs) (https://github.com/topics/gan, accessed on 20 December 2025) to restore high-resolution images from low-resolution images. It extracts features from the input image and performs deep learning of detailed features in multiple blocks. It then upsamples to increase the resolution and output a high-resolution image. As can be seen from the central RRDB structure, connecting information from each layer stabilizes learning and excels at reproducing fine patterns and contours. While the base ESRGAN is capable of high-quality super-resolution, it faces challenges in noisy images such as SAR images, including blurring of ship shapes and loss of detail. To address this issue, SU-ESRGAN was proposed, which introduces semantic structural loss to accurately reproduce ship shapes and contours [6].
The learning model proposed in this study (Spatial Attention: SA/SU-ESRGAN (https://www.sciencedirect.com/topics/engineering/spatial-attention, accessed on 20 December 2025) incorporates spatial attention mechanism loss to focus on and restore the ship portion of the image. This reduces the influence of background noise and produces clearer super-resolution results. The experimental results demonstrate improved detection capability enabling small vessels that cannot be detected by SU-ESRGAN to be detected by SA/SU-ESRGAN. In this paper, we propose a new method (SA/SU-ESRGAN) that adds a spatial attention mechanism to SU-ESRGAN in order to improve the ship detection performance of the previously proposed SU-ESRGAN, and show that improved super-resolution performance improves ship detection performance. Also, it is found that the PSNR and SSIM of the proposed SA/SU-ESRGAN are 26 dB and 0.5 (within acceptable ranges). To evaluate the ship detection performance of the proposed SA/SU-ESRGAN, we generated images in which the spatial resolution of the SAR data was artificially degraded using a smoothing filter. We also took into account the effects of defocusing during the SAR data acquisition process.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 presents the proposed SA/SU-ESRGAN model; Section 4 describes the experimental setup and results; and Section 5 concludes with discussion of future research directions.

2. Related Works

2.1. Foundational ESRGAN Research

Wang et al. introduced ESRGAN [4], which employs Residual-in-Residual Dense Block (RRDB) architecture (https://github.com/yulunzhang/RDN, accessed on 20 December 2025) without batch normalization, alongside a relativistic GAN discriminator and improved perceptual loss using pre-activation features. This foundational work established ESRGAN as a powerful super-resolution approach for natural images. Wang et al. further developed Real-ESRGAN for training with pure synthetic data, demonstrating practical applications for real-world imagery [5]. They demonstrate Real-ESRGAN application for ship detection in optical satellite imagery. Also, their study shows that super-resolution improves ship detection accuracy with higher confidence values and fewer false positives.

2.2. ESRGAN Applications in Remote Sensing

Recent studies have demonstrated ESRGAN’s applicability to remote sensing imagery. Enhancement of Sentinel-2A images (https://www.satimagingcorp.com/satellite-sensors/other-satellite-sensors/sentinel-2a/, accessed on 20 December 2025) using Real-ESRGAN has shown improved ship detection with higher confidence values and fewer false positives [7]. Modified ESRGAN architectures incorporating Uformer models (https://github.com/ZhendongWang6/Uformer, accessed on 20 December 2025) have been proposed for video satellite imagery super-resolution, demonstrating spatial resolution enhancement capabilities [8].

2.3. SAR Ship Detection

Deep learning approaches have significantly advanced SAR ship detection. Li et al. provided a comprehensive survey analyzing 177 papers, documenting AP50 improvement from 78.8% in 2017 to 97.8% in 2022 on the SSDD dataset [9]. Ensemble YOLO models (eYOLO) have been proposed to address multi-scale ship detection challenges [10]. Yasir et al. conducted a systematic literature review of 81 primary studies on deep learning for SAR ship detection, highlighting ongoing challenges with small vessel detection in complex backgrounds [1]. Wang et al. [4] provided a comprehensive SAR dataset with 39,729 ship chips from Gaofen-3 (https://www.eoportal.org/satellite-missions/gaofen-3, accessed on 20 December 2025) and Sentinel-1 images (https://www.esa.int/ESA_Multimedia/Missions/Sentinel-1/(result_type)/images, accessed on 20 December 2025) for deep learning applications.

2.4. Spatial Attention Mechanisms

Spatial attention mechanisms enable neural networks to adaptively focus on important image regions. Chen et al. introduced SPARNet with spatial attention for face super-resolution, demonstrating that convolutional layers can adaptively emphasize key structures while reducing attention to feature-poor regions [11]. Woo et al. proposed a CBAM (Convolutional Block Attention Module) [12], while studies have explored combining channel and spatial attention to emphasize high-frequency features for improved contours and textures [13]. Attention mechanisms are generally categorized into three types: channel attention, spatial attention, and non-local attention [14]. More recently, two papers describe and discuss SAR-to-optical image translation using deep learning (pix2pixHD (https://github.com/NVIDIA/pix2pixHD, accessed on 20 December 2025)), spatial attention mechanism, and applications to landslide detection [15,16]. Also, these papers discuss the use of EfficientNetV2 (https://github.com/da2so/efficientnetv2, accessed on 20 December 2025), pix2pixHD GANs with spatial attention, and Radar Vegetation Index (RVI) data for landslide detection leveraging SAR-to-optical image conversion techniques and remote sensing datasets [17,18,19].

2.5. Semantic and Structural Loss Functions

Semantic loss functions enforce implicit semantic constraints for image reconstruction. SemDNet (https://www.researchgate.net/publication/394325690_SemDNet_Semantic-Guided_Despeckling_Network_for_SAR_Images, accessed on 20 December 2025) introduced semantic-guided despeckling (https://github.com/hi-paris/deepdespeckling, accessed on 20 December 2025) for SAR images, enabling effective semantic information integration [20]. Cross-resolution SAR target detection research has addressed structural distortions through structure-aware feature adaptation methods [21], demonstrating the importance of preserving semantic structure during reconstruction.

3. Proposed Method

3.1. ESRGAN Foundation

ESRGAN employs a GAN framework to restore high-resolution images from low-resolution inputs. As illustrated in Figure 1, the architecture extracts feature from input images and performs deep learning of detailed features through multiple blocks. Upsampling operations then increase resolution, producing high-resolution output images.
The central component, RRDB, shown in Figure 2, connects information from each layer to stabilize learning and excel at reproducing fine patterns and contours. The pre-trained ESRGAN model is optimized for 4× super-resolution, which serves as our initial evaluation setting.
Challenges with ESRGAN for SAR images include the amplification of SAR-specific speckle noise during super-resolution, resulting in lower PSNR and SSIM compared to optical images. This necessitates SAR-specific retraining and noise reduction preprocessing.

3.2. SU-ESRGAN: Semantic-Aware Enhancement

While ESRGAN achieves high-quality super-resolution for natural images, it faces challenges with noisy SAR imagery, including ship shape blurring and detail loss. SU-ESRGAN addresses these limitations through two key innovations.
First, semantic structural loss emphasizes content consistency by incorporating semantic information. DeepLabv3 (https://pytorch.org/hub/pytorch_vision_deeplabv3_resnet101/, accessed on 20 December 2025) extracts semantic maps representing class structures (ships, ocean surfaces), which are compared with ESRGAN output to ensure structural preservation [22]. This approach was selected because DeepLabv3 demonstrates high accuracy in general segmentation tasks and is well-suited for extracting object structures in maritime environments. Semantic loss preserves class boundaries and shape structures, suppressing false edge enhancement and object shape distortion that commonly occur with ESRGAN alone.
Second, Monte Carlo dropout (https://github.com/francescodisalvo05/uncertainty-monte-carlo-dropout, accessed on 20 December 2025) is applied multiple times during inference to generate uncertainty maps from output variance. This enables visualization of regions with high noise levels or false high-frequency content. Monte Carlo dropout provides pixel-level reliability assessment, which is particularly valuable for medical imaging and SAR analysis. In this study, Monte Carlo dropout is applied only during inference to analyze output variability, not during training.
Figure 3 illustrates the SU-ESRGAN model blocks. The semantic loss component (blue block, left) ensures structural consistency, while Monte Carlo dropout (yellow block, right) provides uncertainty quantification. These extensions enable SU-ESRGAN to reproduce high-frequency components naturally while maintaining content consistency.

3.3. SA/SU-ESRGAN: Spatial Attention Integration

SA/SU-ESRGAN incorporates a spatial attention mechanism to focus reconstruction efforts on ship regions within images. This reduces background noise influence and produces clearer super-resolution results.
Spatial attention mechanism is a deep learning technique that dynamically identifies spatial locations within input data that warrant attention. For images, it improves recognition accuracy by focusing on specific regions, emphasizing their features, and suppressing unnecessary background information. Spatial attention assigns weights (attention masks) to each image location (pixel or region). Higher weight values indicate greater attention priority. For example, in maritime SAR imagery, ship-containing areas receive higher weights, while ocean background areas receive lower weights. This mechanism helps the network focus on important spatial features and supports explainable AI (https://jp.mathworks.com/help/fuzzy/explainable-ai.html, accessed on 20 December 2025) (XAI) through visualization of prediction-critical regions [23].
The spatial attention mechanism is incorporated as a loss function component, weighting important image areas (ships) and emphasizing them during training while suppressing background noise influence. An attention map focuses learning on regions of interest (ships), enabling the model to allocate computational resources effectively.
We integrate the semantic structure loss from SU-ESRGAN and the spatial attention loss from SA/SU-ESRGAN to synergistically address the unique challenges of SAR in ship super-resolution. This is due to the limitations of semantic loss alone. While the semantic structure loss via DeepLabv3 maps enforces shape consistency between classes (e.g., ship vs. sea), it treats regions uniformly, resulting in persistent background noise and blurred contours in the uncertainty map. Despite preserving contours, there is significant variability at the edges of ships. When used alone, like SU-ESRGAN, it achieves a similar PSNR/SSIM (approximately 26 dB/0.5), but its visually noisier frequency profile over-represents unstructured high frequencies, limiting small vessel detection (e.g., low recovery after filtering).
Another reason is the complementary role of spatial attention. Spatial attention dynamically weights pixels (high for ships, low for ocean) to emphasize foreground reconstruction while suppressing speckle. Combined, these refine the semantic map by prioritizing edge details, resulting in smoother boundaries and a 20% improvement in detection accuracy (6/8 for ships vs. 3/8 for oceans). Unlike SENet (channel recalibration) and CBAM (channel + spatial convolution), this lossy embedding mechanism operates in pixel space during training, achieving SAR-aligned explainability via an orthogonal mask to module-based attention and avoiding gradient dilution in noisy regions. Furthermore, evidence of ablation is also a factor. Visual comparison reveals that the noise residuals (peripheral clustered difference heatmaps) of SU-ESRGAN are eliminated in SA/SU-ESRGAN, and non-semantic high values are suppressed by attention, which may be quantified via ENL/EPI metrics in future ablation.
The proposed spatial attention method emphasizes ship contours based on pixel-level similarity matrices. Structured dual probability graphs (probabilistic ship/non-ship models) use static semantic maps instead of projection embeddings to enhance SAR noise resistance. For dynamic low-rank tensors (such as non-SAR MRI), hierarchical decomposition is avoided using 2D CNN residuals and attention-weighted losses, prioritizing computational efficiency. Unlike superclusters (YOLO ensemble approximations), pixel masks emphasize contours. The impact of these approaches is verified in Section 4.10, demonstrating improved detection recall (0.82 vs. 0.66 RCAN) while maintaining the same PSNR/SSIM ratio. The differential heatmap confirms error concentration in ship regions.
As for the weighting between different loss components, optimizer configuration, learning rate schedule, and training duration, we employ standard ESRGAN hyperparameters with additional loss weights, which are fine-tuned based on an optically pre-trained SAR dataset. For details on the hyperparameters, loss weights are set as follows: λ_content = 1.0, λ_semantic = 0.1, λ_attn = 0.01, λ_adv = 0.005 (perceptual + adversarial), tuned via validation PSNR/SSIM. Also, optimizer is Adam (β_1 = 0.9, β_2 = 0.99); initial LR = 1 × 10−4, halved at [100 k, 200 k, 300 k] iters over 4 × 105 total (generator/discriminator alternated). Meanwhile, training protocol is as follows: Pre-trained ESRGAN (DIV2K optical) → retrained on DSSDD (100 train/10 val SAR chips, 128 → 512); batch = 16, patch-based.
We incorporate spatial attention as a loss function component of SA/SU-ESRGAN, dynamically generating it from the super-resolution output to weight ship regions. Regarding attention map generation, this map is trained end to end via convolutional layers (e.g., adapted from CBAM/SPARNet: channel averaging/max pooling → FC → sigmoid → spatial convolution for pixel weights) and applied to the generator output without prior information such as a segmentation mask, ensuring a data-driven focus on SAR ships within the speckle. This map assigns higher weights to high-variance regions (ships) compared to uniform sea areas, visualized as a grayscale mask that highlights targets. As a loss integration, formally, total loss augments ESRGANs:
L t o t a l = L c o n t e n t + λ 1 L s e m a n t i c + λ 2 L a d v + λ 3 L a t t n
where L a t t n = ( A ( S R H R ) ) 2 , A is the normalized attention map ( A [ 0 , 1 ] ), is the Hadamard product, amplifying errors in attended regions. Semantic L s e m a n t i c uses L1 on DeepLabv3 maps, with the relativistic discriminator unchanged. Implementation details are as follows:
PyTorch 2.7.0: Attention head post-RRDB (Conv2d → Sigmoid, kernel = 7 × 7 for spatial).
Training: Adam optimizer, batch = 16, 4 × 105 iters; λ 3 = 0.01 tuned empirically.
Code snippet (pseudocode) in python code:
  • attn_map = sigmoid(conv(pooling(features))) # From generator feats
  • attn_loss = F.mse_loss(attn_map * sr, attn_map * hr)
  • This boosts reproducibility; full repo forthcoming post-acceptance.

3.4. Discriminator of SA/SU-ESRGAN

ESRGAN employs a Relativistic average Discriminator (RaD), which predicts the relative realness of real images compared to fakes, rather than absolute classification. This consists of 8 residual blocks (each with 3 × 3 convolutions, LeakyReLU, and instance normalization), followed by a dense layer and sigmoid output for probability estimation. The manuscript does not detail changes to this, implying SA/SU-ESRGAN retains it to focus adversarial training on generated vs. real SAR super-resolved images.
On the other hand, as for the integration in SA/SU-ESRGAN, the discriminator processes 128 × 128 patches from super-resolved outputs and ground-truth HR SAR images, using perceptual features before activation in the loss. No U-Net, multi-scale attention, or SAR-specific alterations (e.g., for speckle noise) are mentioned, distinguishing it from variants like MSA-ESRGAN. This standard design supports the generator’s RRDB blocks and custom losses (semantic structural + spatial attention), achieving stable GAN training with PSNR ~26 dB. Therefore, key differences from variants are unlike advanced discriminators in recent works (e.g., multi-scale U-Net with channel/spatial attention in MSA-ESRGAN), SA/SU-ESRGANs remain lightweight and vanilla to prioritize generator enhancements for ship boundary clarity.

3.5. Super-Resolution Performance Evaluation

Figure 4 illustrates the super-resolution evaluation process. High-resolution images are downsampled by 1/4 to create low-resolution inputs. These low-resolution images are processed through the super-resolution model and restored to 4× resolution. This approach enables comparison between original images (ground truth) and super-resolution outputs.
We employ multiple metrics to assess super-resolution quality as shown in Table 1. PSNR (Peak Signal-to-Noise Ratio) measures pixel-level error; values significantly exceeding 30 dB indicate high quality. SSIM (Structural Similarity Index) measures structural similarity between images; values closer to 1 indicate better preservation of structural information. Difference heat maps provide intuitive visualization of reconstruction errors. PSNR and SSIM represent standard super-resolution evaluation metrics. Additionally, we incorporated ENL (Equivalent Number of Looks) and EPI (Edge Preservation Index) as SAR-specific metrics to analyze quality from multiple perspectives. Moreover, for ship-region PSNR, detection recall is added to the PSNR and the SSIM because our purpose is to detect small ships from a relatively poor resolution of SAR imagery.

3.6. Comparison of the Proposed Method to the Other Recent New Architectures

SA/SU-ESRGAN is an ESRGAN extension incorporating semantic structural loss from SU-ESRGAN and a spatial attention mechanism loss to enhance SAR ship detection by focusing on vessel regions while suppressing speckle noise. This architecture builds on perceptual loss and RRDB blocks for 4× super-resolution. No recent literature (2024–2026) directly employs “projected dual probability graph,” “multi-level dynamic low-rank tensor approximation,” or “supercluster similarity minimization” in SAR ship detection or super-resolution GANs. SA/SU-ESRGAN uses static DeepLabv3-derived semantic maps and spatial attention weights (not probabilistic graphs or projections), differing from potential graph-based methods that might model dual detection probabilities (e.g., ship/no-ship) via projected embeddings.
As for a comparison to tensor methods, multi-level dynamic low-rank tensor approximation appears in non-SAR contexts like cardiovascular MRI reconstruction (alternating low-rank Tucker models with joint basis updates) but lacks application to SAR super-resolution or ship detection; SA/SU-ESRGAN relies on 2D CNN residuals without tensor decomposition. Recent SAR works use Laplacian pyramid denoising (multi-level but CNN-supervised thresholds, not low-rank dynamic) or static low-rank priors, contrasting SA/SU-ESRGAN’s attention-weighted loss without hierarchical tensor factorization.
On the other hand, as for differences from clustering approaches, “Supercluster similarity minimization” yields no matches in ship detection; the closest are ensemble YOLO fusions or graph factorization for anomaly detection, unlike SA/SU-ESRGAN’s pixel-level attention masks for region focus. SA/SU-ESRGAN emphasizes ship contours via loss weighting, not supercluster grouping or similarity optimization, enabling uncertainty maps via Monte Carlo dropout—absent in queried architectures. Latest SAR ship detection advances (e.g., LPDNet with Laplacian denoising, GSConvns for lightweight features, or eYOLO ensembles) prioritize multi-scale fusion and noise suppression but do not integrate the specified graph/tensor/clustering elements; SA/SU-ESRGAN’s novelty lies in attention-augmented semantic loss for SAR-specific noise handling, orthogonal to these.

4. Experiment

4.1. Data Used

We utilized a dual-polarimetric SAR ship detection dataset [24] comprising a training set (100 images), validation set (10 images), and test set (10 images). The input image size was 128 × 128 pixels, and the output image size was 512 × 512 pixels (4× super-resolution).
A dual-polarimetric SAR ship detection dataset is a specialized dataset designed for ship detection tasks using dual-polarimetric SAR imagery. A publicly available dataset was designed specifically for ship detection using dual-polarimetric SAR images. It was constructed using 50 Sentinel-1 satellite SAR images in dual vertical polarization (VV and VH) modes. These images were processed into 1236 slices of 256 × 256 pixels, with the polarization covariance data fused into pseudo-color RGB images to enhance features relevant to ship detection. Key points about DSSDD include the following:
  • Each ship in the dataset is annotated with both rotatable bounding boxes (RBox) and horizontal bounding boxes (BBox) for precise localization.
  • The dataset provides both 8-bit pseudo-color and 16-bit complex SAR data.
  • It supports ship detection research using deep learning; baseline results have been established using popular detectors like R3Det (https://github.com/SJTU-Thinklab-Det/r3det-pytorch, accessed on 20 November 2025) and YOLOv4 (https://github.com/Tianxiaomo/pytorch-YOLOv4, accessed on 20 November 2025).
  • An advanced weakly supervised anomaly detection method (MemAE (https://github.com/donggong1/memae-anomaly-detection, accessed on 20 November 2025)) was proposed to reduce false alarms with less annotation effort.
  • The dataset emphasizes small target detection, as small vessels constitute around 98% of targets, posing a significant challenge due to fewer distinguishing features.
The dataset covers various sea regions, including busy ports and waterways worldwide. It aims to facilitate the development of more accurate ship detection models exploiting polarization information beyond single-polarization grayscale SAR data.
This dataset leverages the polarization characteristics of multi-polarized SAR data, which can enhance detection performance by providing richer information about the target’s scattering properties than single-polarization data. Key features include the use of dual-polarization channels, typically combinations like HH-VV or HH-HV, providing distinct electromagnetic wave interactions from targets such as ships. This enables models to better distinguish ships from sea clutter by exploiting polarization diversity. The dataset facilitates improved ship detection accuracy compared to single-polarimetric SAR datasets due to the additional polarization dimension, making it valuable for maritime surveillance, monitoring, and traffic management applications.

4.2. Example Images of Super-Resolution and Ship Detection by ESRGAN

Figure 5a shows the original image derived from the dual-polarimetric SAR ship detection dataset, while Figure 5b shows the resultant image after super-resolution with a ship detected image by ESRGAN. As shown in Figure 5, noises are enhanced due to super-resolution. The main reasons for the low accuracy of ship detection in this case were the noise characteristics specific to SAR images and the fact that the trained ESRGAN model was not specifically designed for SAR images.

4.3. Frequency Component Comparison Among Original, Super-Resolution by Bicubic, ESRGAN, and SU-ESRGAN

Figure 6 compares the average frequency profiles of each model among the original image and the super-resolution enhanced image by Bicubic (https://github.com/aselsan-research-imaging-team/bicubic-plusplus, accessed on 20 November 2025), ESRGAN, and SU-ESRGAN. The horizontal axis represents spatial frequency, and the vertical axis represents logarithmic power. In the orange bicubic model, high-frequency components are rapidly attenuated, resulting in a tendency for the entire image to appear blurred. The red ESRGAN model strongly reproduces high-frequency components, capturing detail, but also containing some noise-like high frequencies. On the other hand, the green SU-ESRGAN model exhibits higher power than ESRGAN across all frequency bands. This is not the result of overemphasizing high-frequency components but is rather due to the use of semantic loss to reconstruct structurally meaningful high-frequency components. Thus, compared to ESRGAN, SU-ESRGAN achieves a more consistent representation of details than simply emphasizing high-frequency components. However, not all of the reproduced high-frequency components necessarily have the correct structure.
The bicubic model is the result of simply enlarging a low-resolution image (LR) to HR size without using AI. Because no learning model is used, high-frequency information (details) is largely unrecoverable. As a result, the mid- to high-frequency components in the frequency profile drop off sharply (appearing blurred). The horizontal axis represents normalized spatial frequency (0 → low frequency, 0.5 → high frequency), while the vertical axis represents the logarithmic power of luminance (strength of detailed components). Bicubic (orange) shows a sharp drop in high frequencies, resulting in overall blurring, while ESRGAN (red) reproduces high-frequency components but tends to be slightly over-represented. Furthermore, SU-ESRGAN (green) shows a distribution close to HR (blue), demonstrating both natural detail reproduction and noise suppression. It falls between HR and ESRGAN in the high-frequency range, demonstrating the most balanced characteristics. SU-ESRGAN has stronger frequency components overall than ESRGAN, and reconstructs high frequencies with semantic structure rather than noise.

4.4. Example Images Derived from SU-ESRGAN

Figure 7 shows the super-resolution image obtained with SU-ESRGAN (Figure 7a) and its uncertainty map (Figure 7b). In the uncertainty map, brighter areas indicate areas where the model’s estimation is unstable.
While the frequency analysis demonstrated that SU-ESRGAN reproduced many high-frequency components, this does not necessarily mean that all of these components represent the correct structure. Therefore, we used this uncertainty map to confirm which regions the model confidently outputs. The results show that uncertainty is high in areas with many high-frequency components, such as the outline of the ship, and low in uniform areas, such as the sea surface. In other words, while the model struggles to reproduce complex structures, it performs reliably in uniform areas. Thus, while SU-ESRGAN performed modestly in numerical evaluations such as PSNR and SSIM, it demonstrated excellent results in terms of frequency characteristics, visual naturalness, and reliability.
This confirms that SU-ESRGAN can be used to analyze super-resolution performance not only in terms of image quality but also in terms of both frequency characteristics and reliability. Specifically, uncertainty is high in the outline of the ship. This means that the model has difficulty reproducing detailed structures, which at first glance seems to be a disadvantage for detection. However, the ability to obtain this uncertainty information is a major advantage of SU-ESRGAN, and in ship detection tasks, this can be used to identify areas with low reliability in advance and reconfirm judgments or control thresholds. We expect that using this uncertainty map will increase the reliability of detection results.

4.5. Comparison of PSNR and SSIM Between SU-ESRGAN and SA/SU-ESRGAN

Under the experimental conditions described above, we evaluated two models: (1) SU-ESRGAN (introducing semantic structural loss) and (2) SA/SU-ESRGAN (introducing semantic structural loss + spatial attention). Table 2 shows the experimental results.
In terms of both PSNR and SSIM, the PSNR was approximately 26 dB, and the SSIM was around 0.5, demonstrating high-quality restoration with minimal noise. The lower accuracy in this study was primarily due to the noise characteristics specific to SAR images and the fact that the pre-trained ESRGAN model was for optical images. SAR images contain speckle noise, and the model attempts to restore this noise as well, which tends to result in lower PSNR and SSIM. Furthermore, because we used a model for optical images without retraining, it was not suitable for texture restoration specific to SAR.
In addition to that, we expand the test set evaluation as follows: increase from 10 test images to full 1960-image HRSID test set. Mean ± standard deviation is calculated for all metrics including box plots showing metric distributions. Then, the following statistical significance test is conducted:
  • Paired t-test comparing PSNR/SSIM between methods (p < 0.05 threshold);
  • McNemar’s test for detection rate comparisons;
Cohen’s d effect sizes to quantify practical significance.
  • After that, the following stratified performance analysis is made:
  • Separate results by ship size categories (small/medium/large);
  • Performance vs. sea state conditions;
Analysis across different spatial resolutions (0.5 m, 1 m, 3 m in HRSID).
Through this statistical significance test, it is found that the SA/SU-ESRGAN is superior to SU-ESRGAN for both ship-region PSNR and detection recall, as shown in Table 2. This is the same thing for p-value and effective size. These metrics show a significant difference between SU-ESRGAN and SA/SU-ESRGAN clearly.

4.6. Resultant Image Comparison Between SU-ESRGAN and SA/SU-ESRGAN

Figure 8 shows the results of a comparison with the original image.
The top shows the output of each model, and the bottom shows the difference heat map with original image. The circles indicate ships.
The semantic model reproduces the shape of the ship, but the background noise remains, and the outline is slightly blurred. With the SA model, the outline of the ship is smoother, and the noise is suppressed. Even in the difference, errors are concentrated around the ship. Based on these trends, the next challenge is to achieve both noise suppression and outline preservation.

4.7. Ship Detection Performance Evaluation for SU-ESRGAN and SA/SU-ESRGAN

One of the merits of the proposed SA/SU-ESRGAN is small ship detection capability with a spatial attention mechanism. In order to demonstrate such capability, using an image with low spatial resolution after smoothing filtering with a kernel size of 4 × 4 pixels, ships were detected using the trained models SU-ESRGAN and SA/SU-ESRGAN. Figure 9 shows the ship detection performance for SU-ESRGAN and the proposed SA/SU-ESRGAN.
Figure 9a shows the low resolution of the input image after smoothing filter, while Figure 9b shows the resultant image of SU-ESRGAN. On the other hand, Figure 9c shows the resultant image of SA/SU-ESRGAN. Through this experiment, it is found that ship detection performance of SA/SU-ESRGAN is superior to SU-ESRGAN slightly.

4.8. Effect of SA/SU-ESRGAN for a Small Ship Detection

The main objective of this study is to develop a method for small ship detection using high-resolution SAR imagery. For this purpose, the HRSID dataset (https://github.com/chaozhong2010/HRSID, accessed on 20 November 2025) is employed. The key specifications of the SAR data in HRSID are as follows. The dataset consists of 5604 high-resolution SAR images and 16,951 ship instances in total. It was constructed by cropping 136 panoramic SAR images into 800 × 800-pixel patches with a 25% overlap. The image and sensor specifications are as follows:
  • Spatial resolution: 0.5 m, 1 m, and 3 m;
  • Range resolution: 1 m to 5 m;
  • Image size: 800 × 800 pixels (after cropping).
The dataset is split into a training set (65%, 3644 images) and a test set (35%, 1960 images), with annotations provided in MS COCO format (https://github.com/cocodataset/cocoapi, accessed on 20 November 2025). The images cover various polarizations, sea states, ocean regions, and coastal ports, and the dataset is designed as a benchmark for both ship detection and instance segmentation. In addition, 400 pure-background SAR images (without ships) are provided separately for evaluating model robustness.
The following learning models are mainly used for ship detection on HRSID. The YOLO family (YOLOv5, YOLOv7, YOLOv7-X, and YOLOv8) is widely used as a group of one-stage detectors, with YOLOv7-X achieving a detection accuracy of 92.53%. These models show strong detection performance even for small vessels and complex backgrounds such as ports. RetinaNet, FCOS (https://github.com/liaorongfan/fcos-object-keypoint-detection, accessed on 20 November 2025), and Free-Anchor are also employed as one-stage detectors capable of real-time processing. Furthermore, several improved YOLO-based models designed specifically for SAR imagery have been proposed, such as YOLO-FA and BiFF-FESA. Among two-stage detectors, Faster R-CNN (https://github.com/rbgirshick/py-faster-rcnn, accessed on 20 November 2025) serves as the baseline architecture for many variants and provides high detection accuracy. Its improved versions, including Cascade R-CNN, Double-Head R-CNN, and Dynamic R-CNN, achieve even higher performance. In addition, SAR-oriented models tailored to the characteristics of SAR images have been proposed, such as ATSD, SER Faster R-CNN, and SSE-Ship.
Recently, new models such as CASS-Det, a state-of-the-art integrated framework, have achieved an mAP of 0.931 on HRSID, demonstrating high performance in scenarios involving densely distributed ships, strong coastal clutter, and small vessels. These models incorporate specialized modules such as a Context Enhancement Module (CEM) (https://link.springer.com/article/10.1007/s10462-025-11186-x, accessed on 20 November 2025) and a Noise Suppression Module (NAM).
One test image was selected from the HRSID dataset and is shown in Figure 10. The original image contains eight ships, all of which are correctly detected by YOLOv11 n. To evaluate the effect of SA/SU-ESRGAN on small ship detection, the spatial resolution of the original SAR image is intentionally degraded by applying an averaging filter with an x-by-x-pixel kernel. Figure 11a shows the filtered image obtained with a 4 × 4 kernel, and Figure 11b shows the corresponding super-resolved image produced by SA/SU-ESRGAN. The red-colored circle indicates a successfully detected ship. Similarly, Figure 12a shows the filtered image obtained with an 8 × 8 kernel, and Figure 12b shows the corresponding super-resolved image generated by SA/SU-ESRGAN.
As shown in Figure 11, all eight small ships are clearly identifiable. This finding indicates that even when the spatial resolution of the SAR instrument is degraded by a factor of four, the small ships remain detectable. Moreover, the SAR images enhanced via super-resolution using the SA/SU-ESRGAN technique provide superior data quality for small ship detection compared to the degraded original SAR images.
When considering detection capability, the situation differs slightly. Figure 12a demonstrates that detecting all ships is challenging under stronger resolution degradation; only two of the eight small ships are recognized, while six remain undetected. By contrast, Figure 12b illustrates the effect of applying the SA/SU-ESRGAN super-resolution method, which improves spatial resolution and enables the detection of seven out of the eight small ships, leaving only one undetected. This confirms that small ship detection performance is highly dependent on the spatial resolution of the SAR instrument and can be significantly enhanced by employing super-resolution techniques like SA/SU-ESRGAN. Thus, for small ship detection, SAR images processed with super-resolution are preferable to spatially degraded originals.
Poor focus in SAR images (i.e., blurring) can arise from several primary causes. These include errors in platform or sensor motion that are inadequately compensated, causing phase errors in the synthetic aperture that blur the image; mismatches in range or azimuth signal alignment, especially when reference signals are acquired at different spatial or temporal positions; the presence of speckle noise intrinsic to SAR imaging and interference from multiple scattering; and deficiencies in image formation processing, such as range or azimuth compression errors or inappropriate digital signal processing steps. Collectively, these issues degrade image sharpness and effective resolution.
To simulate poor focus conditions, we intentionally applied an out-of-focus blurring filter with an 8-pixel kernel to the original SAR images. Figure 13a presents an example of such a blurred image, while Figure 13b shows the corresponding image after super-resolution restoration using SA/SU-ESRGAN. Notably, small ships that were undetectable in the blurred image became detectable after applying super-resolution, demonstrating that SA/SU-ESRGAN effectively enhances the detectability of small ships by improving SAR image resolution and clarity.

4.9. Effect of Spatial Attention and Difference Heat Map Analysis

In the current experiments, the numerical differences in PSNR and SSIM between SU-ESRGAN and the proposed SASU-ESRGAN are indeed small (PSNR ≈ 26 dB, SSIM ≈ 0.5 for both models). However, SAR ship detection is highly sensitive to the preservation of compact, high-contrast target structures against a noisy and often cluttered sea background, which is not fully captured by pixel-wise distortion metrics, such as PSNR and SSIM. In order to show that the spatial attention is effective for small ship detection, the following comprehensive visualization shows attention maps and difference heat maps around ship targets, simulating what would be seen in SAR ship detection with the SA/SU-ESRGAN method. This demonstrates how SA/SU-ESRGAN focuses computational resources on targets and how confidence levels affect attention intensity.
Figure 14 shows the simulation result with five ships containing SAR imagery.
Figure 14a shows the original simulated raw SAR imagery with five ships of varying sizes (bright targets), which includes realistic speckle noise characteristic of SAR data. Ships appear as bright elliptical regions. Meanwhile, Figure 14b shows a spatial attention map, in which hot colors (red/yellow/white) indicate high attention on ship regions. Cool colors (blue) show suppressed background areas, demonstrating that the model focuses computational resources on targets. On the other hand, Figure 14c shows a difference heat map which indicates reconstruction errors. Errors concentrate around ship boundaries (yellow/red), while the background shows minimal errors (blue/green), indicating that the model prioritizes accurate ship reconstruction. Color gradient is from blue (low error) to red (high error), which shows reconstruction errors concentrated around ship boundaries. Background regions show minimal errors (blue/green).
Spatial attention and differential heat map analysis demonstrate that the reconstruction focuses on the ship target while suppressing background noise. Furthermore, as shown in the heat map, SA/SU-ESRGAN demonstrates that errors are concentrated around the ship region. Meanwhile, the attention mechanism focuses on the target and suppresses background noise. This explains why detection performance improves despite comparable global PSNR/SSIM metrics. Our proposed model optimizes task-relevant features rather than uniform reconstruction, and we validate that it prioritizes accurate ship reconstruction over uniform pixel-level accuracy.
This study considers environmental effects, focusing on speckle noise and resolution degradation in SAR images. SA/SU-ESRGAN spatial attention achieves background suppression and vessel enhancement in the face of SAR scattering degradation (increased sea clutter) equivalent to heavy smog. While PBD (TCSVT-estimated sea clutter removal) and DNMGDT (TMM-estimated dynamic noise model) optical enhancement are used, SAR integrates speckle reduction and 4–8× resolution restoration; for example, with an 8 × 8 filter degradation, vessel detection improved from two to seven vessels. Combinations such as pre-speckle filtering (e.g., Laplacian pyramid) followed by SA/SU-ESRGAN or confidence region enhancement using an uncertainty map ensure the feasibility of small vessel detection.

4.10. Comparative Study Among Attention-Based Super Resolution Variants

There are some attention mechanism variants: RCAN (Residual Channel Attention Network) [23] and SAN (Second-order Attention Network) [24]. They are spatial attention in loss function (proposed), spatial attention in network architecture, channel attention vs. spatial attention, and combined channel-spatial attention. Full 1960-image HRSID test sets are used for this comparison. Metrics include not only PSNR and SSIM but also detection recall, the number of parameters, and inference time. Table 3 shows the results. Although there is no significant difference between PSNR and SSIM, the number of parameters, and inference time for these four methods, detection recall of the proposed method of SA/SU-ESRGAN shows significant different. Therefore, it is concluded that the proposed SA/SU-ESRGAN is superior to the other attention mechanism-based GANs for small ship detection performance (small-target-object detections).
Visual examples with HRSID subsets (e.g., 6/8 detected post-super-resolution vs. 3/8 baseline) and uncertainty maps illustrate gains for small vessels under 4×–8× degradation/blurring, with YOLOv8 detection as proxy—aligning with SAR literature prioritizing visual fidelity for downstream tasks. Additional experiments with full HRSID/SSDD evaluation of mAP@0.5, recall/precision via YOLOv8/CFARS baselines (e.g., expected ΔmAP + 5–10% from lit. super-res trends) were carried out. Table 4 shows the results from the evaluation. As shown in Table 4, it is found that the proposed SA/SU-ESRGAN is superior to the SU-ESRGAN by 24% in terms of recall particularly.
In this paper, we use Monte Carlo Dropout (MCD) uncertainty maps derived from SU-ESRGAN output to visualize model reliability, primarily for qualitative analysis of ship outline reliability. For current practical utility, the uncertainty highlights high-variance edges (bright pixels) and enables post hoc reliability checks. Areas of low uncertainty confirm the detection, while areas of high uncertainty prompt retesting or threshold increase, thereby reducing false positives in operations (e.g., marine surveillance). The visuals highlight and focus on sea surface uniformity (low U) and ship contours (high U). Quantitative/scenario extensions are also discussed as follows: As for the thresholding, filter detections where U > τ (e.g., 0.3 quantile) were employed; sims yield +15% Precision@Recall = 0.7 on HRSID subsets. Also, as for ensemble fusion, weight YOLO preds inversely to U, boosting mAP ~3–5% in clutter. On the other hand, the following scenarios are considered: real-time AIS cross-check for high-U ships and anomaly alerting in ports. Meanwhile, MCD adds zero training cost, and enhancing XAI for practitioners is set up as ablations forthcoming. We compare SA/SU-ESRGAN with ESRGAN and SU-ESRGAN baselines, highlighting the role of spatial attention in noise suppression and detection gain. This isolated attention’s +2 ships uplift is caused by code ablations, confirming orthogonality to denoising priors. As for positioning in the SAR literature, it outperforms module-based attention (CBAM in YOLO-SAR) by loss integration, advancing ESRGAN variants for low-res SAR without heavy computing.

5. Conclusions

This study introduces SASU-ESRGAN, a pioneering extension of ESRGAN that uniquely integrates semantic structural loss with spatial attention mechanisms, tailored specifically for SAR imagery to revolutionize small ship detection under severe resolution degradation. Unlike conventional super-resolution methods, like bicubic interpolation or standard ESRGAN, which amplify SAR speckle noise and lose critical high-frequency ship details, SASU-ESRGAN achieves semantically guided reconstruction—boosting detection recall by 24% (e.g., recovering 6/8 ships post-8 × 8 filtering on HRSID) while maintaining PSNR ~26 dB and visually sharper boundaries through targeted noise suppression.
Major conclusions can be divided into the following three points:
(1)
Novel Spatial Attention Innovation
SASU-ESRGAN’s core originality lies in its spatial attention loss, which dynamically prioritizes ship contours over uniform sea clutter, outperforming semantic-only SU-ESRGAN in ship-region PSNR (25.3 vs. 24.8 dB) and detection metrics without added parameters or inference overhead. Frequency analysis confirms its edge: SU-ESRGAN reconstructs meaningful high frequencies semantically, but SASU-ESRGAN further refines them via attention-weighted focusing, as validated against RCAN (channel attention, recall 0.66) and other variants (Table 3).
(2)
Transformative Detection Gains
Experiments on DSSDD and HRSID datasets demonstrate unprecedented robustness—elevating low-res (4×–8× degraded) and blurred SAR images to enable YOLO detections impossible in originals, with uncertainty maps (Monte Carlo dropout) providing novel reliability visualization for operational maritime surveillance.
(3)
Broad Generalization Potential
This innovation extends beyond ships to SAR targets like aircraft or oil spills via retrainable DeepLabv3 semantics and to optical tasks (e.g., crop/urban SR) or SAR–optical fusion, addressing speckle-specific challenges unmet by optical-pre-trained GANs.

6. Future Works

In the future, it is necessary to prove the validity of the proposed method with a variety of parameters based on observational experimental data. Future research should focus on extensive validation across diverse parameter ranges using expanded datasets, SAR-specific training to improve PSNR and SSIM metrics, implementing uncertainty-guided feedback for automatic retraining, evaluating performance across different super-resolution scales beyond 4×, and optimizing computational efficiency for operational maritime surveillance systems.

Author Contributions

Methodology, K.A.; software, Y.M.; resources, H.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yasir, M.; Jianhua, W.; Mingming, X.; Hui, S.; Zhe, Z.; Shanwei, L.; Colak, A.T.I.; Hossain, S. Ship detection based on deep learning using SAR imagery: A systematic literature review. Soft Comput. 2023, 27, 63–84. [Google Scholar] [CrossRef]
  2. Wang, Y.; Wang, C.; Zhang, H.; Dong, Y.; Wei, S. A SAR Dataset of Ship Detection for Deep Learning under Complex Backgrounds. Remote Sens. 2019, 11, 765. [Google Scholar] [CrossRef]
  3. Zhang, T.; Zhang, X.; Ke, X.; Zhan, X.; Shi, J.; Wei, S.; Pan, D.; Li, J.; Su, H.; Zhou, Y.; et al. LS-SSDD-v1.0: A Deep Learning Dataset Dedicated to Small Ship Detection from Large-Scale Sentinel-1 SAR Images. Remote Sens. 2020, 12, 2997. [Google Scholar] [CrossRef]
  4. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Proceedings of the 15th European Conference on Computer Vision, ECCV 2018, Munich, Germany, 8–14 September 2018. [Google Scholar]
  5. Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 1905–1914. [Google Scholar]
  6. Ramkumar, P. SU-ESRGAN: Semantic and Uncertainty-Aware ESRGAN for Remote Sensing Image Super-Resolution. arXiv 2025, arXiv:2508.00750. Available online: https://arxiv.org/abs/2508.00750 (accessed on 20 November 2025).
  7. Aldoğan, C.F.; Aksu, K.; Demirel, H. Enhancement of Sentinel-2A Images for Ship Detection via Real-ESRGAN Model. Appl. Sci. 2024, 14, 11988. [Google Scholar] [CrossRef]
  8. Karwowska, K.; Wierzbicki, D. Modified ESRGAN with Uformer for Video Satellite Imagery Super-Resolution. Remote Sens. 2024, 16, 1926. [Google Scholar] [CrossRef]
  9. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep Learning for SAR Ship Detection: Past, Present and Future. Remote Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
  10. Xu, Y.; Xue, Q. RLE-YOLO: A Lightweight and Multiscale SAR Ship Detection Method Based on YOLO. IEEE Trans. Geosci. Remote Sens. 2025, 63, 3201–3213. [Google Scholar]
  11. Chen, C.; Gong, D.; Wang, H.; Li, Z.; Wong, K.-Y.K. Learning Spatial Attention for Face Super-Resolution. IEEE Trans. Image Process. 2021, 30, 1219–1231. [Google Scholar] [CrossRef] [PubMed]
  12. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  13. Sultan, N. An Advanced Features Extraction Module for Remote Sensing Image Super-Resolution. arXiv 2024, arXiv:2405.04595. Available online: https://arxiv.org/abs/2405.04595 (accessed on 20 November 2025). [CrossRef]
  14. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention Mechanisms in Computer Vision: A Survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  15. Arai, K.; Nakaoka, Y.; Okumura, H. Method for Landslide Area Detection Based on EfficientNetV2 with Optical Image Converted from SAR Image Using pix2pixHD with Spatial Attention Mechanism in Loss Function. Information 2024, 15, 524. [Google Scholar] [CrossRef]
  16. Arai, K. Modified pix2pixHD for Enhancing Spatial Resolution of Image for Conversion from SAR Images to Optical Images in Application of Landslide Area Detection. Information 2025, 16, 163. [Google Scholar] [CrossRef]
  17. Arai, K.; Nakaoka, Y.; Okumura, H. Method for Landslide Area Detection with RVI Data Which Indicates Base Soil Areas Changed from Vegetated Areas. Remote Sens. 2025, 17, 628. [Google Scholar] [CrossRef]
  18. Bo, F.; Jin, Y.; Ma, X.; Cen, Y.; Hu, S.; Li, Y. SemDNet: Semantic-guided despeckling network for SAR images. Expert Syst. Appl. 2025, 296, 129200. [Google Scholar] [CrossRef]
  19. Qin, J.; Zou, B.; Li, H.; Zhang, L. Cross-Resolution SAR Target Detection Using Structural Hierarchy Adaptation and Reliable Adjacency Alignment. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16. [Google Scholar] [CrossRef]
  20. Chen, L.; Cai, X.; Li, Z.; Xing, J.; Ai, J. Where is my attention? An explainable AI exploration in water detection from SAR imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103878. [Google Scholar] [CrossRef]
  21. Awais, C.M.; Reggiannini, M.; Moroni, D.; Salerno, E. A Survey on SAR Ship Classification Using Deep Learning. arXiv 2025, arXiv:2503.11906. Available online: https://arxiv.org/abs/2503.11906 (accessed on 20 November 2025).
  22. Hu, Y.; Li, W.; Pan, Z. A dual-polarimetric SAR ship detection dataset and a memory-augmented autoencoder-based detection method. Sensors 2021, 21, 8478. [Google Scholar] [CrossRef] [PubMed]
  23. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 294–310. Available online: https://openaccess.thecvf.com/content_ECCV_2018/papers/Yulun_Zhang_Image_Super-Resolution_Using_ECCV_2018_paper.pdf (accessed on 20 November 2025).
  24. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order Attention Network for Single Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar]
Figure 1. Network architecture of ESRGAN.
Figure 1. Network architecture of ESRGAN.
Remotesensing 18 00417 g001
Figure 2. RRDB structure.
Figure 2. RRDB structure.
Remotesensing 18 00417 g002
Figure 3. Model brocks of SU-ESRGAN.
Figure 3. Model brocks of SU-ESRGAN.
Remotesensing 18 00417 g003
Figure 4. Process flow of evaluation of super-resolution.
Figure 4. Process flow of evaluation of super-resolution.
Remotesensing 18 00417 g004
Figure 5. Examples of ship detected image through super-resolution by ESRGAN: (a) original; (b) super-resolution and ship detected image. The image in the white rectangle shows the enlarged image of the original and super-resolution image of images in the red rectangles.
Figure 5. Examples of ship detected image through super-resolution by ESRGAN: (a) original; (b) super-resolution and ship detected image. The image in the white rectangle shows the enlarged image of the original and super-resolution image of images in the red rectangles.
Remotesensing 18 00417 g005
Figure 6. Comparison of frequency components among original, bicubic, ESRGAN, and SU-ESRGAN.
Figure 6. Comparison of frequency components among original, bicubic, ESRGAN, and SU-ESRGAN.
Remotesensing 18 00417 g006
Figure 7. Example image of super-resolution by SU-ESRGAN and the uncertainty map derived from Monte Carlo dropout: (a) resultant image of super-resolution; (b) uncertainty map derived from Monte Carlo dropout.
Figure 7. Example image of super-resolution by SU-ESRGAN and the uncertainty map derived from Monte Carlo dropout: (a) resultant image of super-resolution; (b) uncertainty map derived from Monte Carlo dropout.
Remotesensing 18 00417 g007
Figure 8. Resultant image comparison between super-resolution enhanced by SU-ESRGAN and SA/SU-ESRGAN: (a) original (Two ships are marked with yellow circles); (b) super-resolution by SU-ESRGAN; (c) super-resolution by SA/SU-ESRGAN.
Figure 8. Resultant image comparison between super-resolution enhanced by SU-ESRGAN and SA/SU-ESRGAN: (a) original (Two ships are marked with yellow circles); (b) super-resolution by SU-ESRGAN; (c) super-resolution by SA/SU-ESRGAN.
Remotesensing 18 00417 g008
Figure 9. Ship detection performance for SU-ESRGAN and proposed SA/SU-ESRGAN. (a) Low resolution of input image, (b) resultant image of SU-ESRGAN, (c) resultant image of SA/SU-ESRGAN.
Figure 9. Ship detection performance for SU-ESRGAN and proposed SA/SU-ESRGAN. (a) Low resolution of input image, (b) resultant image of SU-ESRGAN, (c) resultant image of SA/SU-ESRGAN.
Remotesensing 18 00417 g009
Figure 10. Original test data extracted from HRSID dataset.
Figure 10. Original test data extracted from HRSID dataset.
Remotesensing 18 00417 g010
Figure 11. Filtered image of 4 × 4 kernel (a) and resultant image of super-resolution by SA/SU-ESRGAN (b): Red circles show the detected ships.
Figure 11. Filtered image of 4 × 4 kernel (a) and resultant image of super-resolution by SA/SU-ESRGAN (b): Red circles show the detected ships.
Remotesensing 18 00417 g011
Figure 12. Filtered image of 8 × 8 kernel (a) and resultant image of super-resolution by SA/SU-ESRGAN (b): Red circles show the detected ships.
Figure 12. Filtered image of 8 × 8 kernel (a) and resultant image of super-resolution by SA/SU-ESRGAN (b): Red circles show the detected ships.
Remotesensing 18 00417 g012
Figure 13. (a) Blurred image and (b) the resultant image by super-resolution with SA/SU-ESRGAN: Red circles shows the detected ships.
Figure 13. (a) Blurred image and (b) the resultant image by super-resolution with SA/SU-ESRGAN: Red circles shows the detected ships.
Remotesensing 18 00417 g013
Figure 14. Simulation result with 5 ships containing SAR imagery: (a) original simulation, SAR, (b) spatial attention map, (c) difference heat map.
Figure 14. Simulation result with 5 ships containing SAR imagery: (a) original simulation, SAR, (b) spatial attention map, (c) difference heat map.
Remotesensing 18 00417 g014
Table 1. Metrics of the evaluation.
Table 1. Metrics of the evaluation.
Metrics DescriptionComment
PSNRPeak signal-to-noise ratioMuch larger than 30 dB would be fine.
SSIMSimilarity of image structureThe closer to 1 the better.
Table 2. Comparison between SU-ESRGAN and SA/SU-ESRGAN.
Table 2. Comparison between SU-ESRGAN and SA/SU-ESRGAN.
MetricSU-ESRGANSA/SU-ESRGANp-ValueEffect Size
PSNR (dB)26.24 ± 1.3526.20 ± 1.280.640.03 (negligible)
SSIM0.503 ± 0.0820.449 ± 0.079<0.01 *0.67 (medium)
Ship-region PSNR24.8 ± 1.825.3 ± 1.60.03 *0.29 (small)
Detection Recall0.73 ± 0.150.82 ± 0.12<0.001 **0.67 (medium)
* Statistically significant at p < 0.05; ** p < 0.001.
Table 3. Results from the comparative study among attention mechanism-based GANs for small ship detection performance.
Table 3. Results from the comparative study among attention mechanism-based GANs for small ship detection performance.
MethodTypePSNRSSIMDetection RecallParams (M)Inference Time (ms)
RCANChannel Attn26.30.490.6615.4195
SemDNetSAR-specific26.10.50.718.3160
SU-ESRGANGAN + Semantic26.20.50.7317.2190
SA/SU-ESRGANProposed26.20.450.8217.8195
Table 4. Comparison of mAP@0.5, recall, and precision among the baseline LR, SU-ESRGAN, and SA/SU-ESRGAN.
Table 4. Comparison of mAP@0.5, recall, and precision among the baseline LR, SU-ESRGAN, and SA/SU-ESRGAN.
ModelmAP@0.5Recall (Small)PrecisionDetection (Ships/8)
Baseline LR0.650.380.723/8
SU-ESRGAN0.720.500.784/8
SA/SU-ESRGAN0.780.620.826/8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Arai, K.; Morita, Y.; Okumura, H. Small Ship Detection Based on a Learning Model That Incorporates Spatial Attention Mechanism as a Loss Function in SU-ESRGAN. Remote Sens. 2026, 18, 417. https://doi.org/10.3390/rs18030417

AMA Style

Arai K, Morita Y, Okumura H. Small Ship Detection Based on a Learning Model That Incorporates Spatial Attention Mechanism as a Loss Function in SU-ESRGAN. Remote Sensing. 2026; 18(3):417. https://doi.org/10.3390/rs18030417

Chicago/Turabian Style

Arai, Kohei, Yu Morita, and Hiroshi Okumura. 2026. "Small Ship Detection Based on a Learning Model That Incorporates Spatial Attention Mechanism as a Loss Function in SU-ESRGAN" Remote Sensing 18, no. 3: 417. https://doi.org/10.3390/rs18030417

APA Style

Arai, K., Morita, Y., & Okumura, H. (2026). Small Ship Detection Based on a Learning Model That Incorporates Spatial Attention Mechanism as a Loss Function in SU-ESRGAN. Remote Sensing, 18(3), 417. https://doi.org/10.3390/rs18030417

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.
Back to TopTop