Article

A Novel U-Shaped Network Combined with a Hierarchical Sparse Attention Mechanism for Coastal Aquaculture Area Extraction in a Complex Environment

by Chengyi Wang 1, Yuyang Zhao 2, Lu Li 2,* and Tianyi Liu 2
1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 School of Automation, Beijing Information Science and Technology University, Beijing 100192, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3897; https://doi.org/10.3390/rs17233897
Submission received: 20 October 2025 / Revised: 21 November 2025 / Accepted: 24 November 2025 / Published: 30 November 2025

Highlights

What are the main findings?
  • The proposed HSAUNet model achieves state-of-the-art accuracy (93.42% IoU) in extracting coastal aquaculture areas from satellite imagery, demonstrating superior performance, especially in complex environments near saltpans.
  • The model’s novel components—the Dycross Sample Module for precise boundary delineation and the Sparse Attention Module for capturing global context—are key to its success in distinguishing spectrally similar features.
What is the implication of the main finding?
  • This work provides a highly reliable, automated tool for monitoring and managing coastal resources, offering critical information for the sustainable development of both the aquaculture and salt industries.
  • The architectural innovations present a valuable technical reference for the remote sensing community, advancing the capability of deep learning models in semantic segmentation tasks for complex geographical environments.

Abstract

Aquaculture pond extraction based on remote sensing (RS) plays a pivotal role in coastal resource utilization and production management. However, most existing studies have focused on limited coastal aquaculture pond extraction and neglected extraction around saltpans. There are two key challenges in aquaculture pond extraction. Firstly, aquaculture ponds are difficult to extract accurately owing to their spectral and spatial similarities with the evaporation ponds and brine concentration ponds within saltpans. Secondly, refining and delineating the boundaries of aquaculture ponds remains challenging. To address these issues, we propose a novel deep learning neural network, the U-shaped Network with Hierarchical Sparse Attention (HSAUNet), for coastal aquaculture pond extraction. We propose the Dycross Sample Module to dynamically generate learnable offsets, which enable the model to accurately capture edge-specific information under the guidance of lower-level feature maps, thus improving the precise perception of aquaculture boundaries. The Sparse Attention Module with a rolling mechanism is proposed to effectively capture global semantic relationships and contextual information in different directions, achieving clear differentiation between aquaculture ponds and the evaporation or brine ponds within saltpans. Our datasets are derived from multispectral Sentinel-2 satellite imagery and include aquaculture ponds around saltpans such as the Changlu Hangu, Huaibei, and Yinggehai salt fields, as well as other coastal aquaculture areas such as Shanwei Changsha Bay (Guangdong Province) and Dalian Biliuhe Bay (Liaoning Province). Experimental results demonstrate that HSAUNet outperforms other state-of-the-art methods on the test datasets, achieving an intersection over union (IoU) of 93.42%, which exceeds the best competing method, DeepLabv3+, with an IoU of 92.97%. Our proposed method greatly facilitates, and serves as a valuable reference for, resource management authorities in monitoring aquaculture ponds.


1. Introduction

With the rapid development of the aquaculture industry and the growing demand for aquatic products, increasing attention is being paid to marine resources [1]. A dynamic and precise grasp of the spatial distribution of aquaculture is crucial for the standardization and intelligent management of aquaculture and for policy-making [2]. With the rapid advancement of satellite remote sensing imagery (RSI), many researchers have employed satellite imagery, such as Sentinel-1, Sentinel-2, and GF-2 data, to extract aquaculture areas [3]. Machine learning techniques, including random forest [4], offer automated solutions for feature identification and extraction in remote sensing data. These methods minimize human involvement in the workflow while improving the efficiency of handling large-scale image datasets. Furthermore, algorithms such as the support vector machine (SVM) [5] and extreme gradient boosting (XGBoost) [6] have strengthened the capacity to discriminate between artificial ponds and natural water bodies, even under challenging environmental conditions. However, traditional machine learning paradigms often require hand-crafted feature engineering, which becomes a critical limitation when dealing with the spatial heterogeneity and complexity typical of coastal wetland environments.
Deep learning algorithms are well known for their powerful data-fitting capability and fast processing of information. Among them, the fully convolutional network (FCN) [7] was the first to apply deep learning to image segmentation, using an encoding and decoding structure for pixel-level classification. Compared with FCN, UNet [8] further fuses features at different levels. However, these methods always have a limited receptive field. To enlarge the receptive field, the authors of DeeplabV3 [9] and DeeplabV3+ [10] proposed and adopted the atrous spatial pyramid pooling (ASPP) module to optimize image feature representation. ASPP employs four atrous convolutions [11] with distinct atrous rates to extract multi-scale local feature maps, while integrating global average pooling to capture the overall contextual information of images, thus achieving strong performance in segmentation tasks.
Moreover, attention mechanisms have been widely adopted in diverse visual tasks owing to their inherent capacity to capture global information and model contextual relationships. Specifically, self-attention mechanisms have been introduced into scene segmentation tasks to characterize feature dependencies across both spatial and channel dimensions [12]. In the field of remote sensing, attention-based models have also found extensive application in object classification on various satellite images. Zeng et al. [13] employed a fully convolutional network (FCN) combined with the Row-wise and Column-wise Self-Attention (RCSA) mechanism for large-scale extraction of inland aquaculture ponds from high-spatial-resolution remote-sensing imagery. Ai et al. [14] proposed SAMALNet, a network combined with an improved ASPP module to obtain multiscale and contextual features, thereby enhancing its capability to extract coastal aquaculture areas more accurately.
However, these methods seldom establish an edge-aware module to refine the boundaries of coastal aquaculture ponds. Since aquaculture ponds are formed gradually by embankment, partition, and regularization of other land cover types, the importance of boundary segmentation for accurate extraction cannot be overlooked. Existing methods typically employ hierarchical designs and multiple downsampling operations to gradually reduce feature dimensions, but this process often leads to the loss of boundary information. To address this limitation, Dang et al. [15] created UPS-Net, a network based on the UNet structure that fuses boundary and contextual information to reduce “adhesion” in raft culture areas, achieving excellent experimental results.
Additionally, the above studies did not consider the extraction of aquaculture ponds around saltpans. Aquaculture ponds cannot be easily extracted because of the spectral and spatial similarities between aquaculture ponds and the evaporation ponds and brine concentration ponds within saltpans [16]. As shown in Figure 1c, evaporation ponds, which are contiguous and vary in size from 0.03 to 0.08 km², typically exhibit a blue-gray color. Brine concentration ponds, often rectangular or square with a small area and higher seawater concentration, are located on both sides of crystallization ponds and appear dark-gray or deep gray (Figure 1b). However, these ponds are spectrally similar to aquaculture ponds (Figure 1d), which are also blue-gray or deep-blue in color and mainly rectangular or small and square in shape. This similarity makes it difficult to distinguish between them. In contrast, crystallization ponds, where salt drying occurs, are adjacent to brine concentration ponds and are easily identifiable by their white-red color and small square shape (Figure 1a).
At present, two principal challenges hinder the automatic extraction of coastal aquaculture ponds. Firstly, we need to construct specific modules to distinguish coastal aquaculture ponds from saltpans. Secondly, we need to design an edge-aware module to refine and delineate the boundaries of the aquaculture ponds. For these purposes, a novel coastal aquaculture pond extraction method based on a deep learning neural network, the U-shaped Network Combined with a Hierarchical Sparse Attention Mechanism (HSAUNet), is proposed. We chose ResNet50-V1c [17] as the encoder. To alleviate the degradation of boundary information incurred during downsampling and acquire edge-aware capacity, inspired by the Dynamic Upsample [18], we propose a Dycross Sample Module that dynamically generates learnable offsets from lower-level detailed feature maps and then guides high-level semantic feature maps to sample. These offsets empower the model to accurately capture edge-specific information under the guidance of lower-level feature maps, thus improving the precise perception of aquaculture boundaries. Additionally, in order to address the similarity in the spectral characteristics of the evaporation ponds and the aquaculture areas, capturing global semantic relationships is of great importance. Therefore, we propose the Sparse Attention Module with a multi-directional rolling window strategy, which enables our model to have receptive fields in different directions and capture global information and semantic associations across the entire feature spectrum.
Therefore, the main contributions of the paper can be summarized as follows:
(1) HSAUNet is proposed to precisely extract coastal aquaculture areas, addressing the spectral and spatial similarities between aquaculture ponds and the evaporation and brine concentration ponds within saltpans, and to extract more precise boundary details.
(2) We propose the Dycross Sample Module to extract more detailed and refined aquaculture area boundaries, and the Sparse Attention Module with a multi-directional rolling window strategy to capture global semantic relationships and contextual information in different directions.
(3) Our proposed HSAUNet model demonstrates strong generalization performance for aquaculture area extraction around saltpans and in other coastal aquaculture areas in China.

2. Materials and Methods

2.1. Study Areas

As for aquaculture extraction around saltpans, we chose to study three major coastal saltpans in China—Changlu Hangu (Bohai Sea coast, Tianjin Province, Figure 2d), Huaibei (Yellow Sea coast, Jiangsu Province, Figure 2c), and Yinggehai (Hainan Province, Figure 2a) which integrate salt production and aquaculture. Changlu Hangu uses solar evaporation with favorable coastal and sunlight conditions; Huaibei specializes in industrial salt from mine/brine well resources, supported by coastal plains and a temperate climate; Yinggehai adopts natural evaporation benefiting from a tropical climate. All three contribute significantly to the national salt supply and local economies through coastal aquaculture, enhancing food security. Furthermore, two additional coastal aquaculture areas were incorporated into our study: Shanwei Changsha Bay (Guangdong Province, Figure 2b) and Dalian Biliuhe Bay (Liaoning Province, Figure 2e).

2.2. Data and Preprocessing

The satellite images utilized in our research were sourced from the Sentinel-2 satellite’s Multispectral Instrument (MSI). This state-of-the-art instrument is renowned for its ability to capture detailed imagery across 13 distinct spectral bands, spanning visible, near-infrared, and short-wave infrared wavelengths. Table 1 summarizes the key technical parameters of each dataset. Sentinel-2 imagery was obtained from the Copernicus Data Space Ecosystem. For the purposes of our study, we specifically focused on four key spectral bands: Band 2 (blue), Band 3 (green), Band 4 (red), and Band 8 (near-infrared). Each of these bands offers a spatial resolution of 10 m.
Preprocessing steps included the following: we used Level-2A products, in which the effects of the atmosphere on the light reflected from the Earth's surface and reaching the sensor have been removed, and all selected scenes have a cloud cover below 10%. Ground truth labels were generated through expert-assisted annotation in ENVI 5.6, in combination with high-resolution Google Earth imagery and field survey data. Three remote sensing experts independently annotated the samples, and disagreements were resolved through group discussion. Image mosaicking and cropping to patches were conducted using Python 3.9.
We categorized aquaculture ponds as a single semantic segmentation class, while all saltpan regions (evaporation ponds, crystallization ponds, narrow ponds, and brine ponds) were labeled as a separate class to enhance distinction from aquaculture areas. The original satellite imagery was split into 512 × 512 pixel tiles for dataset preparation: 4000 samples from the Changlu Hangu Salt Field, part of the Huaibei Salt Field, and part of Shanwei Changsha Bay served as the training and validation set, while part of the Huaibei Salt Field (yellow box in Figure 2c), all of the Yinggehai Salt Field, part of Shanwei Changsha Bay (yellow box in Figure 2b), and Dalian Biliuhe Bay were reserved as the test set. These test samples were drawn from non-overlapping local regions to ensure unbiased model evaluation.
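As a concrete illustration of this tiling step, the following minimal sketch splits a mosaicked four-band scene (already loaded as a NumPy array) into non-overlapping 512 × 512 patches; the tile_scene helper and the placeholder array are illustrative and not part of our processing pipeline.

```python
import numpy as np

def tile_scene(scene: np.ndarray, tile: int = 512):
    """Split a (H, W, bands) scene into non-overlapping tile x tile patches.
    Edge regions smaller than one full tile are discarded in this sketch."""
    h, w = scene.shape[:2]
    patches = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            patches.append(scene[top:top + tile, left:left + tile])
    return patches

# Placeholder for a 4-band Sentinel-2 mosaic (e.g., read beforehand with rasterio).
scene = np.zeros((10980, 10980, 4), dtype=np.float32)
patches = tile_scene(scene)
print(len(patches))  # 21 x 21 = 441 full tiles per 10,980 x 10,980 scene
```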
During testing, we employed test-time augmentation (TTA) to augment each image patch into six variations, including horizontal/vertical flipping and rotations of 90°, 180°, and 270°. All augmented patches were individually fed into the trained network to generate softmax-processed segmentation maps. These outputs were then reverted through the corresponding rotations and flips to align with the original orientation. Finally, we summed the softmax outputs and assigned each pixel the class with the maximum aggregated score, producing the final segmentation map.
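A hedged sketch of this TTA procedure is given below; it assumes the six views are the original patch plus the five listed transforms, and that model is a trained network returning per-class logits of shape (1, C, 512, 512).

```python
import torch

@torch.no_grad()
def tta_predict(model, patch):
    """Predict one (1, 4, 512, 512) patch with test-time augmentation: each view is
    predicted separately, mapped back to the original orientation, the softmax maps
    are summed, and the per-pixel argmax gives the final label map."""
    views = [
        (lambda x: x, lambda y: y),                                              # original
        (lambda x: torch.flip(x, dims=[3]), lambda y: torch.flip(y, dims=[3])),  # horizontal flip
        (lambda x: torch.flip(x, dims=[2]), lambda y: torch.flip(y, dims=[2])),  # vertical flip
        (lambda x: torch.rot90(x, 1, dims=[2, 3]), lambda y: torch.rot90(y, -1, dims=[2, 3])),  # 90 deg
        (lambda x: torch.rot90(x, 2, dims=[2, 3]), lambda y: torch.rot90(y, -2, dims=[2, 3])),  # 180 deg
        (lambda x: torch.rot90(x, 3, dims=[2, 3]), lambda y: torch.rot90(y, -3, dims=[2, 3])),  # 270 deg
    ]
    prob_sum = 0
    for forward_t, inverse_t in views:
        logits = model(forward_t(patch))
        prob_sum = prob_sum + inverse_t(torch.softmax(logits, dim=1))
    return prob_sum.argmax(dim=1)  # (1, 512, 512) category labels
```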

2.3. Model Architecture Design

As illustrated in Figure 3, we utilize ResNet-50V1c as the encoder of our proposed HSAUNet to extract multiscale feature maps. An input image of size 4 × H × W is processed by a stem and several ResNet stages, yielding multiscale hierarchical feature maps. Then, a neck composed of Dycross Sample Modules and Sparse Attention Modules is applied to effectively extract contextual relations, harvest different subregion representations, and aggregate the hierarchical feature maps. The Dycross Sample Module fuses the lower-level detailed feature maps with the higher-level semantic feature maps and dynamically extracts more detailed and refined aquaculture area boundaries. The Sparse Attention Module with a multi-directional rolling window strategy is employed to capture global semantic relationships and contextual information in different directions.
Specifically, our proposed HSAUNet comprises three primary components. The first is a hierarchical convolutional encoder based on the ResNet-50V1c network, which produces multiscale feature maps of the input four-band image. The second consists of three Dycross Sample Modules that fuse the lower-level detailed feature maps with the higher-level semantic feature maps and dynamically extract more detailed and refined aquaculture area boundaries; within each module, an Align Block generates learnable offsets from the lower-level feature maps. After the Dycross Sample Modules, the Sparse Attention Module is applied to effectively extract contextual information and harvest global representations in different directions. Finally, a linear projection is employed to predict the segmentation maps.
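The sketch below illustrates only the overall data flow of this design (encoder stages, neck stages that fuse each skip feature with the upsampled deeper feature, and a linear projection head). StubEncoder and FuseStage are simplified placeholders standing in for ResNet-50V1c and the Dycross Sample/Sparse Attention pairs described in the next two subsections, not the actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StubEncoder(nn.Module):
    """Stand-in for ResNet-50V1c: a stem plus four strided stages producing
    C2..C5 at strides 4, 8, 16, and 32 with ResNet-50 channel widths."""
    def __init__(self, in_ch=4, chs=(256, 512, 1024, 2048)):
        super().__init__()
        widths = (in_ch,) + chs
        self.stem = nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1)
        self.stages = nn.ModuleList(
            [nn.Conv2d(widths[i], widths[i + 1], 3, stride=2, padding=1) for i in range(4)]
        )

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [C2, C3, C4, C5]

class FuseStage(nn.Module):
    """Placeholder for one Dycross Sample + Sparse Attention pair: upsample the
    deeper map, fuse it with the skip feature, and refine the result."""
    def __init__(self, skip_ch, deep_ch, width=256):
        super().__init__()
        self.skip_proj = nn.Conv2d(skip_ch, width, 1)
        self.deep_proj = nn.Conv2d(deep_ch, width, 1)
        self.refine = nn.Conv2d(width, width, 3, padding=1)

    def forward(self, skip, deep):
        deep = F.interpolate(self.deep_proj(deep), size=skip.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.refine(self.skip_proj(skip) + deep)

class HSAUNetSketch(nn.Module):
    def __init__(self, num_classes=3, chs=(256, 512, 1024, 2048), width=256):
        super().__init__()
        self.encoder = StubEncoder(chs=chs)
        self.neck = nn.ModuleList([
            FuseStage(chs[2], chs[3], width),  # C4 + C5 -> N4
            FuseStage(chs[1], width, width),   # C3 + N4 -> N3
            FuseStage(chs[0], width, width),   # C2 + N3 -> N2
        ])
        self.head = nn.Conv2d(width, num_classes, 1)  # linear projection

    def forward(self, x):
        c2, c3, c4, c5 = self.encoder(x)
        n = self.neck[0](c4, c5)
        n = self.neck[1](c3, n)
        n = self.neck[2](c2, n)
        return F.interpolate(self.head(n), size=x.shape[2:],
                             mode="bilinear", align_corners=False)

logits = HSAUNetSketch()(torch.randn(1, 4, 512, 512))  # -> (1, 3, 512, 512)
```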

2.3.1. Sparse Attention Module

As illustrated in Figure 4, the Sparse Attention Module, composed of five Sparse Attention Blocks, is the core of our method. Different from traditional window-based attention, which only allows information flow within fixed local windows and suffers from a limited receptive field and boundary discontinuity, our module adopts a multi-directional rolling window strategy. Since aquaculture areas and saltpans usually cover large areas, it is crucial to extract attention and feature representations with a larger receptive field. We therefore introduce the Interlaced-ViT for a larger receptive field and add the roll mechanism to break interleaved-window isolation, enabling cross-window interaction without extra computation. Compared with the Swin Transformer, which requires multiple ViT and Swin-ViT blocks concatenated in different layers of the backbone to achieve global information exchange and cross-window interaction, our module uses only a small number of Transformer structures in the SKIP and Fusion layers, with fewer parameters. We also do not apply the mask operation, since aquaculture ponds and saltpans usually cover large areas and may have long-range contextual dependencies. The shifted-window size in the Swin Transformer is half of the size of the input feature maps, whereas our roll size is smaller, which breaks the interleaved-window isolation of the Interlaced-ViT.
Specifically, as shown in Figure 4, the feature map is cyclically shifted in four directions (bottom-right, top-left, top-right, bottom-left) with periodic padding. The roll size is half of the window size, and the window size is 16 in our experiments. Then, the Sparse Attention is employed to capture both local and long-range dependencies. Additionally, as shown in Figure 5, the Sparse Attention Block consists of two sequential Vision Transformer structures: the Interlaced-Windows ViT and a standard ViT. The structure of the Interlaced-ViT is illustrated in Figure 6; it can be decomposed based on the ideas of grouping and interleaving. Specifically, the elements are sampled and grouped at regular intervals within a grid, and window self-attention is computed within each group. The interlaced design allows for efficient long-range dependency modeling, while the standard ViT focuses on local refinement. Compared to the Interlaced-ViT alone, adding the roll mechanism breaks interleaved-window isolation for cross-window interaction without extra computation. It enhances multi-directional feature capture, mitigating the Interlaced-ViT's axis bias in large-area scenes. We set the sparsity ratio to 8 in our experiments. After sparse attention, a lightweight MLP is applied. Unlike conventional MLPs, as shown in Figure 7, our design assigns different weights to pixels across different channels, thereby enhancing channel attention perception.
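The sketch below shows one possible reading of the two core operations described above: the interlaced grouping, in which pixels sampled at a regular stride across the whole map form one attention group, and the multi-directional cyclic roll. Treating the sparsity ratio as the sampling stride is an assumption, and the projections, the standard ViT stage, and the modified MLP are omitted.

```python
import math
import torch

def interlaced_partition(x, stride=8):
    """Interlaced grouping: pixels taken every `stride` positions along each spatial
    axis fall into the same group, so one attention group spans the whole map.
    x: (B, H, W, C) -> (B * stride * stride, (H // stride) * (W // stride), C)"""
    b, h, w, c = x.shape
    gh, gw = h // stride, w // stride
    x = x.view(b, gh, stride, gw, stride, c)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(b * stride * stride, gh * gw, c)

def multi_directional_roll(x, roll=8):
    """Cyclic shifts (periodic padding) toward the four diagonal directions, so that
    tokens isolated in different interlaced groups can interact across views."""
    shifts = [(roll, roll), (-roll, -roll), (roll, -roll), (-roll, roll)]
    return [torch.roll(x, shifts=s, dims=(1, 2)) for s in shifts]

# Usage sketch: roll one view, group it, and run plain self-attention per group.
feat = torch.randn(1, 128, 128, 256)              # (B, H, W, C) feature map
rolled = multi_directional_roll(feat, roll=8)[0]  # roll size = window size // 2
tokens = interlaced_partition(rolled, stride=8)   # sparsity ratio taken as the stride
scores = tokens @ tokens.transpose(-2, -1) / math.sqrt(tokens.shape[-1])
out = torch.softmax(scores, dim=-1) @ tokens      # attention within each sparse group
```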

2.3.2. Dycross Sample Module

The architecture of the Dycross Sample Module is depicted in Figure 8. Drawing inspiration from the Dynamic Upsample method [18], we introduce the Dycross Sample Module, which dynamically generates learnable offsets from lower-level detailed feature maps and utilizes these offsets to guide the sampling of high-level semantic feature maps. Unlike the Dynamic Upsample approach [18], where offsets are derived from the input feature maps that need upsampling, our module sources these offsets from the lower-level detailed feature maps provided by the encoder. Our module requires two inputs: the lower-level detailed feature maps from the encoder and the high-level semantic feature maps that need upsampling. The generated offsets effectively guide the upsampling of the higher-level feature maps, after which the input feature maps are aggregated. To be specific, as shown in Figure 9, we incorporate an Align Block to align and merge multilevel feature maps: for each pixel, offsets are generated to adjust the sampling locations. These offsets are learned during training and can adaptively capture more detailed and smooth edge information. Finally, we perform a weighted fusion of the input feature map and the feature map output by the Align Block after grid sampling to obtain the output of the Dycross Sample Module. The formulas for a single Dycross Sample Module are as follows:
$$P'_{i-1} = f^{1}_{1\times 1}(P_{i-1})$$
$$N_{i-1} = \mathrm{Align}(P'_{i-1},\, C_i)$$
$$N'_{i-1} = \alpha P'_{i-1} + (1-\alpha) N_{i-1}$$
where $P_{i-1}\ (i=3,4,5) \in \mathbb{R}^{rH \times rW \times C}$ represents the lower-level feature map obtained from different stages of the ResNet-50V1c backbone, $C_i\ (i=3,4,5) \in \mathbb{R}^{H \times W \times C}$ represents the output feature map obtained from the corresponding Sparse Attention Module, $\alpha$ is a learnable parameter that controls the proportion of $P'_{i-1} \in \mathbb{R}^{rH \times rW \times C}$ when $P'_{i-1}$ and $N_{i-1} \in \mathbb{R}^{rH \times rW \times C}$ are added together, and $\mathrm{Align}$ denotes the output feature map after processing by the Align Block.
$$O_{i-1} = \mathrm{Offset}(P'_{i-1}) = 0.5\,\mathrm{Sigmoid}\!\left[f^{3}_{1\times 1}(P'_{i-1})\right] \times f^{3}_{1\times 1}(P'_{i-1})$$
$$N_{i-1} = \mathrm{Grid\_Sample}\!\left(C_i,\, O_{i-1} + G_{i-1}\right)$$
where $O_{i-1}\ (i=3,4,5) \in \mathbb{R}^{rH \times rW \times 2}$ represents the learnable relative offset $(x, y)$, and $G_{i-1}\ (i=3,4,5) \in \mathbb{R}^{rH \times rW \times 2}$ represents the absolute position offset, which can be predetermined and corresponds to the absolute position coordinates of each pixel. $\mathrm{Grid\_Sample}$ is a built-in function in PyTorch 1.13.0 that resamples the input feature map $C_i$ at the given sampling positions using bilinear interpolation.
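A minimal sketch of this offset-guided resampling, following the equations above, is given below; the separate gating and offset branches, the 1 × 1 projection layers, and the direct addition of offsets in normalized grid coordinates are illustrative assumptions rather than the exact Align Block design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DycrossSampleSketch(nn.Module):
    """Offsets are predicted from the projected lower-level map P_{i-1} and used to
    warp the sampling grid applied to the higher-level map C_i (layer names are
    illustrative), followed by the learnable weighted fusion."""
    def __init__(self, low_ch, high_ch, width=256):
        super().__init__()
        self.proj_low = nn.Conv2d(low_ch, width, 1)    # f_{1x1}: project P_{i-1} to P'_{i-1}
        self.proj_high = nn.Conv2d(high_ch, width, 1)
        self.to_scope = nn.Conv2d(width, 2, 1)         # gating branch inside Offset(.)
        self.to_offset = nn.Conv2d(width, 2, 1)        # raw (x, y) offset branch
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable fusion weight

    def forward(self, p_low, c_high):
        p = self.proj_low(p_low)                       # P'_{i-1}, at the higher resolution
        c = self.proj_high(c_high)                     # C_i, at the lower resolution
        b, _, h, w = p.shape
        # O_{i-1} = 0.5 * Sigmoid(f(P')) * f(P'): a gated, learnable relative offset.
        # Proper rescaling of the offsets to normalized grid units is omitted here.
        offset = 0.5 * torch.sigmoid(self.to_scope(p)) * self.to_offset(p)
        # G_{i-1}: absolute sampling positions, normalized to [-1, 1] for grid_sample.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=p.device),
            torch.linspace(-1, 1, w, device=p.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
        grid = grid + offset.permute(0, 2, 3, 1)       # add relative (x, y) offsets
        n = F.grid_sample(c, grid, mode="bilinear", align_corners=True)
        return self.alpha * p + (1 - self.alpha) * n   # N'_{i-1}: weighted fusion
```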

2.4. Training Details

For training the proposed network, the primary loss function is a multi-class cross-entropy loss between the predicted classification map and the ground truth label map:
$$\mathcal{L}_{\mathrm{main}} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\log(\hat{y}_{ic})$$
where $C = 3$ denotes the number of classes (background, aquaculture pond, and salt field), and $N = 512 \times 512$ is the total number of pixels in each image. Here, $y_{ic}$ is a binary indicator of whether the $i$-th pixel belongs to class $c$, and $\hat{y}_{ic}$ is the predicted probability from the final softmax layer.
To further enhance training stability and extract useful intermediate representations, we introduce an auxiliary loss. Specifically, the semantic features $C_3$, $C_4$, and $C_5$ are aggregated and fed into a separate classification head. The output is upsampled by a factor of 8 and compared with the ground truth using the same cross-entropy loss:
$$\mathcal{L}_{\mathrm{aux}} = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\log(\tilde{y}_{ic})$$
The final loss used for optimization is a weighted combination of the main and auxiliary losses:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{main}} + \lambda \cdot \mathcal{L}_{\mathrm{aux}}$$
In our experiments, we set $\lambda = 0.8$. Specifically, we varied this coefficient within the range of 0 to 2.0, sampled at intervals of 0.1, and performed a total of 20 experiments. We selected the coefficient corresponding to the highest F1-score on the validation set as the coefficient for the auxiliary loss; the F1-score reaches its highest value when the coefficient is set to 0.8.
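A minimal sketch of the combined objective is shown below, assuming the auxiliary head produces logits at 1/8 of the input resolution as described above.

```python
import torch.nn.functional as F

def total_loss(main_logits, aux_logits, target, lam=0.8):
    """L_total = L_main + lambda * L_aux, both multi-class cross-entropy losses.
    main_logits: (B, 3, 512, 512); aux_logits: (B, 3, 64, 64), upsampled by 8 before
    comparison; target: (B, 512, 512) integer class labels."""
    aux_up = F.interpolate(aux_logits, scale_factor=8, mode="bilinear", align_corners=False)
    return F.cross_entropy(main_logits, target) + lam * F.cross_entropy(aux_up, target)
```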

3. Results

3.1. Evaluation Criteria

In our research, we employed several evaluation metrics to assess the performance of the neural network algorithms: precision, recall, F1-score, intersection over union (IoU), overall accuracy (OA), and the Kappa coefficient. TP, FP, FN, and TN denote the numbers of true positive, false positive, false negative, and true negative pixels, respectively. Recall is defined as
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Precision is defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
IoU measures the overlap between the predicted segmentation area and the true segmentation area:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
The F1-score is the harmonic mean of precision and recall:
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The Kappa coefficient is also used to measure the extraction performance of the proposed method; in practical applications, the greater the Kappa coefficient, the better the method's performance. It is expressed as
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
where $P_o$ is the number of correctly classified samples divided by the total number of samples,
$$P_o = \frac{TP + TN}{TP + TN + FP + FN}$$
and $P_e$ is the sum of the products of the actual and predicted counts of each category, divided by the square of the total number of samples:
$$P_e = \frac{(TP + FP)(TP + FN) + (FP + TN)(FN + TN)}{(TP + TN + FP + FN)^2}$$
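These metrics can be computed directly from the four pixel counts; the helper below is an illustrative sketch for a single class treated as the positive category.

```python
import numpy as np

def pixel_metrics(pred, label):
    """Compute the metrics defined above from boolean masks marking membership of
    one class in the prediction and the ground truth."""
    tp = int(np.sum(pred & label))
    fp = int(np.sum(pred & ~label))
    fn = int(np.sum(~pred & label))
    tn = int(np.sum(~pred & ~label))
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    iou = tp / (tp + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / total                      # P_o
    pe = ((tp + fp) * (tp + fn) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return dict(recall=recall, precision=precision, iou=iou, f1=f1, oa=oa, kappa=kappa)
```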

3.2. Experiment Details

In our experiments, we employed a ResNet-50V1c model pre-trained on ImageNet to initialize the parameters of HSAUNet. All experiments were conducted within the mmsegmentation framework, which is based on PyTorch. We fine-tuned the model to further optimize its parameters. Training, validation, and testing were carried out on an NVIDIA GeForce RTX 4090 graphics card (NVIDIA Corporation, Santa Clara, CA, USA). Specifically, we set the batch size to 4 and used the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, momentum of 0.9, and weight decay of 0.0001. The model was trained for 40,000 iterations, during which the learning rate was progressively reduced from 0.01 to 0 using a decay strategy. The window size was set to 16, the roll size to 8, and the sparsity ratio to 8. Additionally, we applied simple data augmentation techniques such as random horizontal flipping, random scaling, and tensor transformations.
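For reference, the optimizer settings above translate into the sketch below; the polynomial form and power of the learning-rate decay are assumptions (the text only states that the rate is reduced from 0.01 to 0), and the one-layer placeholder network simply stands in for HSAUNet.

```python
import torch

model = torch.nn.Conv2d(4, 3, 1)  # placeholder standing in for HSAUNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
max_iters = 40_000

def decayed_lr(iteration, base_lr=0.01, power=0.9):
    """Decay the learning rate from 0.01 toward 0 over 40,000 iterations; the
    polynomial schedule and its power are assumed here."""
    return base_lr * (1 - iteration / max_iters) ** power

for it in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = decayed_lr(it)
    # One optimization step on a batch of four 512 x 512 patches would go here.
```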

3.3. Comparisons of Other Models

3.3.1. Comparative Methods

We conducted a comparative experiment to evaluate the performance of several state-of-the-art methods on test datasets, including ImprovedDeepLabV3+ [19], SegFormer [20], ImprovedPSPNet [21] and Swin-Transformer [22].
(1) ImprovedDeepLabv3+ [19]: The improved DeepLabv3+ network is a semantic segmentation-based method for extraction tasks from remote sensing images. It involves using ResNeSt as the backbone network, which incorporates channel attention mechanisms, to enhance feature representation. The network also integrates the ASPP module to capture multi-scale information, which helps to achieve higher accuracy and better connectivity in extraction tasks.
(2) SegFormer [20]: SegFormer is an end-to-end semantic segmentation method designed for multispectral images. It leverages the multispectral imaging capabilities to achieve higher segmentation accuracy through enhanced spectral information. The method directly learns the complex mapping between image pixels and semantic categories in an end-to-end manner.
(3) ImprovedPSPNet [21]: The Improved Pyramid Scene Parsing Network (PSPNet) is proposed to address the semantic segmentation challenge. A style transfer module is added to adapt the network to new tasks, enhancing generalization ability. An SE structure is integrated into PSPNet to weigh channel features, improving segmentation precision for small targets. Experiments demonstrate the effectiveness of the improved method.
(4) Swin-Transformer [22]: Swin-Transformer-UperNet is proposed for semantic segmentation tasks, integrating the UperNet framework with a Swin Transformer backbone. It inherits the unified perceptual parsing architecture of UperNet to fuse multi-scale features, strengthening scene understanding capability. A Swin Transformer backbone is adopted to leverage hierarchical feature extraction via shifted windows, enhancing the capture of long-range dependencies. Experiments verify the superior performance of this integrated model in semantic segmentation.

3.3.2. Experimental Results

Table 2 presents several comparative experiments and the results of our proposed HSAUNet method on our test dataset; the best result for each metric is indicated in the table. Across all reported metrics, HSAUNet is the clear front runner, topping both overall accuracy (0.970) and intersection over union (0.934) while also posting the highest F1-score (0.965) and a near-perfect balance of precision (0.967) and recall (0.963).
As shown in Figure 10, the visualization of Dycross sample features and predicted offsets highlights that our model can effectively comprehend the edges and textural information of aquaculture areas with diverse sizes and spectral features. This capability enables our model to differentiate aquaculture regions around saltpans.
In addition, we have chosen a number of images from the test set and applied various algorithms to generate segmentation maps. Figure 11 demonstrates the generalization performance of different methods in coastal areas and aquaculture regions with various appearances. To facilitate the interpretation of the comparison results, we use different colors to express different meanings. Blue indicates that both the prediction and label correspond to aquaculture areas. Green means the prediction of aquaculture areas; however, the ground truth is the background. Red represents that the prediction is mistakenly labeled as the background, whereas the true label is aquaculture areas. The first and second rows of images in Figure 11 are both from Changsha Bay, Shanwei, Guangdong Province, where the aquaculture ponds are mainly rectangular in shape with blurred boundaries. The third row of images in Figure 11 is from Biliuhe Bay, Dalian, Liaoning Province, where the aquaculture ponds have relatively regular shapes and distinct edges. HSAUNet effectively extracts coastal aquaculture areas of diverse appearances, producing results with detailed and smooth boundaries, as evidenced by Figure 11.
Additionally, Figure 12 is specifically focused on aquaculture areas around saltpans in a complex coastal environment. White indicates that both the prediction and label are saltpans, while blue shows that both correspond to aquaculture areas. Green means the prediction is either saltpans or aquaculture areas, but the actual ground truth is the background. Red represents two types of errors: one where the prediction is incorrectly identified as saltpans when the true label is an aquaculture area, and another where the prediction is mistakenly labeled as an aquaculture area when the true label is saltpans. The first image in Figure 12 depicts a test sample from the Changlu Hangu Salt Field, where dispersed aquaculture ponds adjacent to evaporation ponds exhibit striking spatial feature similarity to internal brine concentration ponds. However, aquaculture ponds are mainly smaller than evaporation ponds, generally less concentrated and more scattered, while evaporation ponds are often distributed continuously over large areas. The second image illustrates the Huaibei Salt Field, where economically marginal evaporation ponds have been converted to higher-yield aquaculture zones, resulting in substantial spectral similarity to evaporation features. The third image depicts scattered, small coastal aquaculture ponds distributed around the Yinggehai Salt Field. Although similar in appearance, aquaculture ponds are more scattered and smaller in shape. In comparison, our model produces accurate segmentation of aquaculture areas, clearly differentiating them from evaporation and brine concentration ponds in saltpans, with smoother and more precise boundary delineation. Overall, these findings indicate that our method is highly effective in extracting aquaculture ponds, particularly in complex environments.

3.4. Ablation Study

It is well known that many factors can affect the model results, such as the network structure and the parameter initialization method. In this section, we mainly investigate the influence of the Dycross Sample Module and the Sparse Attention Module on our proposed HSAUNet model. For these two factors, we conducted ablation experiments on our dataset.
(1) Impact of Dycross Sample Module: The Dycross Sample Module is designed to precisely fuse features from lower-level texture information and higher-level semantic information. Its learnable offsets enable the model to capture finer boundary details during sampling, thereby enhancing overall performance. Experimental validation confirms the effectiveness of this module. To evaluate its contribution, we replaced the module with normal bilinear interpolation upsampling. Experimental results in Table 3 demonstrate that the Dycross Sample Module outperforms normal bilinear interpolation upsampling. As shown in Figure 10, the learned offsets significantly improve the model’s sensitivity to edge information, enhancing overall performance.
(2) Impact of the Sparse Attention Module: The Sparse Attention Module is designed to capture global semantic information and contextual relationships in different directions. The roll mechanism effectively addresses the issue of information loss at window boundaries and gives our model a larger receptive field. To evaluate the contribution of the Sparse Attention Module, we removed it. Experimental results in Table 3 show that the full model outperforms the variant without the Sparse Attention Module, with a 0.95% improvement in IoU.
(3) Model Efficiency: As shown in Table 4, the proposed HSAUNet has fewer parameters and lower computational cost compared with other methods. The Dycross Sample Module and Sparse Attention Module play a crucial role in the model’s accurate classification performance. By continuously fusing and sampling deep semantic information, these modules not only have fewer parameters but also enable the model to better understand semantic information.

4. Discussion

The efficient and accurate extraction of aquaculture ponds greatly facilitates resource management authorities in monitoring the illegal expansion of aquaculture ponds and calculating the total area. However, aquaculture ponds are not easily extracted due to their spectral and spatial similarities with the evaporation ponds and brine concentration ponds within saltpans. In this paper, we address two key challenges that impede the automatic extraction of coastal aquaculture ponds. Firstly, aquaculture ponds are difficult to accurately extract owing to the spectral and spatial similarities with evaporation ponds and brine concentration ponds in saltpans. Secondly, refining and delineating the boundaries of aquaculture ponds remains challenging. To address these problems, we propose a deep learning-based method for extracting coastal aquaculture ponds using HSAUNet. This algorithm is validated in selected study areas. The results demonstrate that it can effectively extract aquaculture ponds with diverse morphological characteristics across different geographical environments, while accurately distinguishing aquaculture areas from evaporation ponds and brine ponds within saltpans. This capability provides crucial information for monitoring and managing both the coastal salt industry and aquaculture regions. In this section, we first experimentally analyze the factors affecting aquaculture extraction accuracy. Subsequently, we thoroughly discuss the uncertainties and limitations of the HSAUNet algorithm during the aquaculture extraction process.

4.1. The Impact of the Local and Limited Perspective

Although HSAUNet can effectively extract aquaculture areas, there are still uncertainties affecting extraction accuracy. Due to the large size of the original Sentinel-2 L2A scenes (approximately 10,000 × 10,000 pixels) and the model's limitation to process 512 × 512-pixel patches, we divide the images into non-overlapping 512 × 512-pixel blocks. However, this division may result in the partial or incomplete appearance of evaporation/brine ponds (especially those adjacent to crystallization ponds) within individual 512 × 512-pixel patches. Since crystallization ponds have distinct spectral features and are typically adjacent to evaporation/brine ponds, capturing spatial relationships between crystallization ponds and their surroundings is crucial for distinguishing aquaculture areas from saltpans. Unfortunately, crystallization ponds often occupy small areas and may not fully appear within a 512 × 512-pixel patch, leading the model to misclassify evaporation/brine ponds as aquaculture regions. As shown in Figure 13, the broad perspective image (1024 × 1024 pixels) clearly displays crystallization ponds and their adjacent features, making it easy to identify surrounding evaporation/brine ponds. When we crop a 512 × 512-pixel local region from the lower-left corner (outlined by the blue box), only partial evaporation ponds are visible. Consequently, the model incorrectly predicts these areas as aquaculture (blue regions in the prediction map), while the ground truth should be saltpan evaporation ponds (white regions). However, when we crop a 512 × 512-pixel local region outlined by the red box, the majority of the crystallization ponds are included. This allows our model to effectively capture the spatial relationship between the crystallization ponds and their adjacent evaporation ponds, enabling the model to accurately distinguish aquaculture areas from evaporation ponds. Therefore, our model is sensitive to input images obtained through different cropping methods.

4.2. Advantages, Limitations, and Potential Improvements

4.2.1. Advantages

The advantages of the proposed HSAUNet can be summarized in three aspects: first, HSAUNet demonstrates a strong capability in understanding and extracting spatial features of coastal aquaculture areas across different regions, particularly capturing multi-scale morphological characteristics under varying regional conditions. This implies our model exhibits excellent generalization performance for aquaculture area extraction in various environments. Second, HSAUNet efficiently identifies evaporation ponds by leveraging the local alternate distribution patterns between evaporation ponds and crystallization ponds, enabling precise detection through contextual spatial relationships. Third, as an end-to-end deep learning framework, HSAUNet pioneers the solution for aquaculture area extraction in various and complex environments. Therefore, HSAUNet provides valuable technical references for coastal resource utilization and production management.

4.2.2. Limitations

Despite the advantages of HSAUNet in aquaculture area extraction, two key issues require further investigation:
(1) Pseudo-labeling Issue: With the development of China’s aquaculture industry, the economic benefits of aquaculture have surged, while salt production benefits remain relatively low. This has led to a phenomenon where low-profit evaporation ponds near some salterns are being repurposed for aquaculture to pursue higher returns. Additionally, as exemplified by the Tianjin Changlu Hangu Saltern, certain evaporation ponds are also being utilized for aquaculture to maximize profits while still maintaining their salt production functionality. These transformations significantly increase the difficulty of accurately identifying aquaculture areas, inevitably introducing labeling errors during annotation.
(2) Limited Data: While our model has been validated for its generalization performance in aquaculture areas of the Yinggehai Saltpan and Dalian Biliuhe Bay, our current dataset exclusively focuses on aquaculture regions near typical saltpans in China, which inherently limits the model's applicability. Specifically, the model still exhibits limitations in extracting aquaculture areas across other domestic regions, foreign aquaculture sites, and more complex environments. Additionally, it lacks robust transferability to different data sources, such as images from other satellites like GF-2, and only performs effectively on Sentinel-2 datasets. Furthermore, we only utilized bands 2, 3, 4, and 8 as inputs; whether other bands can contribute to improving aquaculture area extraction is also worthy of further investigation. To address these limitations, future studies should expand the dataset to incorporate more diverse geographical contexts and explore additional potential improvements.

5. Conclusions

This article introduces HSAUNet, an innovative methodology for extracting aquaculture ponds in complex coastal environments, particularly around saltpans. By integrating the Dycross Sample Module, HSAUNet precisely delineates coastal aquaculture ponds and extracts more precise boundary details. Through the Sparse Attention Module, the proposed model effectively captures global semantic relationships and contextual information in different directions, enabling clear differentiation between aquaculture ponds and the evaporation and brine ponds within saltpans. Our proposed method greatly facilitates, and serves as a valuable reference for, resource management authorities in monitoring aquaculture ponds. Future improvements will focus on (1) applying semi-supervised learning and data augmentation techniques to address challenges related to small datasets and pseudo-label noise; and (2) optimizing image cropping strategies and dimensions, as the current cropping size is insufficient to fully capture entire evaporation ponds, while balancing computational efficiency and memory constraints. Consequently, future work will concentrate on refining the extraction algorithms to resolve these issues.

Author Contributions

Conceptualization, C.W. and Y.Z.; methodology, Y.Z. and L.L.; software, C.W. and Y.Z.; validation, C.W., Y.Z. and L.L.; formal analysis, T.L. and C.W.; investigation, Y.Z.; resources, C.W.; data curation, Y.Z. and T.L.; writing—original draft preparation, C.W. and Y.Z.; writing—review and editing, C.W. and Y.Z.; visualization, L.L. and T.L.; supervision, C.W. and Y.Z.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key R&D Program of Hainan Province (Grant No. ZDYF2025GXJS176) for the development of integrated navigation-communication-remote sensing equipment and its application in marine fisheries, and the National Natural Science Foundation of China (Grant No. 62471049).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Duan, Y.; Li, X.; Zhang, L.; Chen, D.; Liu, S.; Ji, H. Mapping national-scale aquaculture ponds based on the Google Earth Engine in the Chinese coastal zone. Aquaculture 2020, 520, 734666. [Google Scholar] [CrossRef]
  2. Du, S.; Huang, H.; He, F.; Luo, H.; Yin, Y.; Li, X.; Xie, L.; Guo, R.; Tang, S. Unsupervised stepwise extraction of offshore aquaculture ponds using super-resolution hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2023, 119, 103326. [Google Scholar] [CrossRef]
  3. Roy, D.P.; Wulder, M.A.; Loveland, T.R.; Ce, W.; Allen, R.G.; Anderson, M.C.; Helder, D.; Irons, J.R.; Johnson, D.M.; Kennedy, R.; et al. Landsat-8: Science and product vision for terrestrial global change research. Remote Sens. Environ. 2014, 145, 154–172. [Google Scholar] [CrossRef]
  4. Zhang, P.; Gui, F.; Feng, D.; Zhang, G. Remote sensing extraction of aquaculture ponds in China’s coastal zone based on random forest. J. Phys. Conf. Ser. 2024, 2863, 012018. [Google Scholar] [CrossRef]
  5. Zeng, Z.; Wang, D.; Tan, W.; Huang, J. Extracting aquaculture ponds from natural water surfaces around inland lakes on medium resolution multispectral images. Int. J. Appl. Earth Obs. Geoinf. 2019, 80, 13–25. [Google Scholar] [CrossRef]
  6. Xie, G.; Bai, X.; Peng, Y.; Li, Y.; Zhang, C.; Liu, Y.; Liang, J.; Fang, L.; Chen, J.; Men, J.; et al. Aquaculture ponds identification based on multi-feature combination strategy and machine learning from Landsat-5/8 in a typical inland lake of China. Remote Sens. 2024, 16, 2168. [Google Scholar] [CrossRef]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  9. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  10. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; p. 818. [Google Scholar] [CrossRef]
  11. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 848. [Google Scholar] [CrossRef] [PubMed]
  12. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; p. 3154. [Google Scholar] [CrossRef]
  13. Zeng, Z.; Wang, D.; Tan, W.; Yu, G.; You, J.; Lv, B.; Wu, Z. RCSANet: A full convolutional network for extracting inland aquaculture ponds from high-spatial-resolution images. Remote Sens. 2020, 13, 92. [Google Scholar] [CrossRef]
  14. Ai, B.; Xiao, H.; Xu, H.; Yuan, F.; Ling, M. Coastal aquaculture area extraction based on self-attention mechanism and auxiliary loss. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 2250–2261. [Google Scholar] [CrossRef]
  15. Dang, K.B.; Nguyen, M.H.; Nguyen, D.A.; Phan, T.T.H.; Giang, T.L.; Pham, H.H.; Nguyen, T.N.; Tran, T.T.V.; Bui, D.T. Coastal wetland classification with deep u-net convolutional networks and sentinel-2 imagery: A case study at the tien yen estuary of vietnam. Remote Sens. 2020, 12, 3270. [Google Scholar] [CrossRef]
  16. Jiao, X.; Shi, X.; Shen, Z.; Ni, K.; Deng, Z. Automatic Extraction of Saltpans on an Amendatory Saltpan Index and Local Spatial Parallel Similarity in Landsat-8 Imagery. Remote Sens. 2023, 15, 3413. [Google Scholar] [CrossRef]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; p. 778. [Google Scholar] [CrossRef]
  18. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; p. 6037. [Google Scholar] [CrossRef]
  19. Zhang, D.; Yang, Y.; Qu, F.; Liu, Y. Road Extraction from Remote Sensing Images Based on Improved Deeplabv3+ Network. In Proceedings of the 2024 4th International Conference on Computer Science and Blockchain (CCSB), Shenzhen, China, 6–8 September 2024; pp. 446–449. [Google Scholar]
  20. Nuradili, P.; Zhou, J.; Melgani, F. Wetland Segmentation Method for UAV Multispectral Remote Sensing Images Based on SegFormer. In Proceedings of the IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 6576–6579. [Google Scholar]
  21. Zhang, C.; Zhao, J.; Feng, Y. Research on semantic segmentation based on improved PSPNet. In Proceedings of the 2023 International Conference on Intelligent Perception and Computer Vision (CIPCV), Xi’an, China, 19–21 May 2023; pp. 1–6. [Google Scholar]
  22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. Available online: https://api.semanticscholar.org/CorpusID:232352874 (accessed on 17 August 2021).
Figure 1. Visualization of the typical aquaculture ponds (white lines) and crystallization ponds (blue lines), brine concentration ponds (red lines), and evaporation ponds (yellow lines) within saltpans (False Color).
Figure 2. Study area and aquaculture pond datasets. Global map showing the location of the study area. (ae) Locations of the five study areas: Yinggehai Salt Field, Shanwei Changsha Bay, Huaibei Salt Field, Changlu Hangu Salt Field, and Dalian Biliuhe Bay.
Figure 3. Framework overview of HSAUNet. HSAUNet extracts multiscale features from an input image using the ResNet50-V1c, followed by the neck composed of the Sparse Attention Modules and Dycross Sample Modules, and finally derives a segmentation map.
Figure 4. Overall architecture of the Sparse Attention Module.
Figure 5. Overview of the Sparse Attention Block.
Figure 6. Overview of the Interlaced-Windows ViT.
Figure 7. Overview of the modified and improved MLP.
Figure 8. Overall structure of the Dycross Sample Module.
Figure 9. Structure of the Align Block, which can generate learnable offsets for the grid sample.
Figure 10. Visualization of the process of the Dycross sample. We chose the C3 feature layer in Figure 3 as the input feature and N2 as the Dysampled feature. A part of the boundary in the red box is highlighted for a close view of the predicted offsets. We generate learnable offsets to construct new sampling points to resample the input feature map with bilinear interpolation. The new sampling positions are indicated by the arrowheads.
Figure 11. Visualization results of several methods on our test datasets, which are specifically focused on aquaculture ponds with various appearances in different coastal regions. For the convenience of viewing, we use different colors to express different meanings. Blue indicates that both prediction and label correspond to aquaculture areas (TP). Green means the prediction of aquaculture areas; however, the ground truth is the background (FP). Red represents that the prediction is mistakenly labeled as the background, whereas the true label is aquaculture areas (FN).
Figure 12. Visualization results of several methods on our test datasets, which are specifically focused on aquaculture ponds around saltpans in complex coastal environments. White indicates that both the prediction and label are saltpans (TP), while blue shows that both correspond to aquaculture areas (TP). Green means the prediction is either saltpans or aquaculture areas, but the actual ground truth is the background (FP). Red represents two types of errors: one where the prediction is incorrectly identified as saltpans when the true label is an aquaculture area, and another where the prediction is mistakenly labeled as an aquaculture area when the true label is saltpans.
Figure 13. Visualization of predicted results of two special samples on our dataset. The blue and red boxes in the broader perspective image indicate the position of two local regions. For the convenience of viewing, we use different colors to express different meanings. Blue indicates aquaculture areas and white indicates saltpans.
Table 1. Technical parameters of datasets.

| Study Area | Satellite | Spatial Resolution | Bands | Image Date | Image Size |
| --- | --- | --- | --- | --- | --- |
| Yinggehai Salt Field | Sentinel-2 | 10 m | 2, 3, 4, 8 | 2 March 2025 | 10,980 × 10,980 |
| Changlu Hangu Salt Field | Sentinel-2 | 10 m | 2, 3, 4, 8 | 17 July 2024 | 10,980 × 10,980 |
| Shanwei Changsha Bay | Sentinel-2 | 10 m | 2, 3, 4, 8 | 12 October 2024 | 10,980 × 10,980 |
| Huaibei Salt Field | Sentinel-2 | 10 m | 2, 3, 4, 8 | 6 November 2024 | 10,980 × 10,980 |
| Dalian Biliuhe Bay | Sentinel-2 | 10 m | 2, 3, 4, 8 | 26 June 2024 | 10,980 × 10,980 |
Table 2. Comparison results on the test set. The best value for each metric is achieved by HSAUNet (last row).

| Method | Pre | Rec | F1 | IoU | OA | Kappa |
| --- | --- | --- | --- | --- | --- | --- |
| Swin-ViT | 96.12 ± 0.02 | 96.24 ± 0.03 | 96.21 ± 0.03 | 92.75 ± 0.03 | 96.69 ± 0.03 | 94.29 ± 0.02 |
| PSPNet | 95.84 ± 0.02 | 96.25 ± 0.02 | 96.06 ± 0.03 | 92.43 ± 0.03 | 96.64 ± 0.02 | 94.28 ± 0.03 |
| DeepLabv3+ | 96.42 ± 0.03 | 96.31 ± 0.02 | 96.35 ± 0.02 | 92.97 ± 0.02 | 96.84 ± 0.02 | 94.68 ± 0.03 |
| SegFormer | 96.15 ± 0.01 | 96.32 ± 0.02 | 96.22 ± 0.03 | 92.87 ± 0.02 | 96.72 ± 0.02 | 94.56 ± 0.01 |
| HSAUNet | 96.66 ± 0.02 | 96.51 ± 0.02 | 96.58 ± 0.03 | 93.42 ± 0.01 | 97.07 ± 0.03 | 95.02 ± 0.02 |
Table 3. Ablation comparison results on the test set. The best value for each metric is achieved by the full HSAUNet (last row).

| Method | Pre | Rec | F1 | IoU | OA | Kappa |
| --- | --- | --- | --- | --- | --- | --- |
| HSAUNet without Dycross Sample Module | 96.32 | 96.21 | 96.39 | 93.06 | 96.81 | 94.52 |
| HSAUNet without Sparse Attention Module | 95.63 | 95.63 | 95.72 | 92.47 | 95.95 | 93.77 |
| HSAUNet | 96.66 | 96.51 | 96.58 | 93.42 | 97.07 | 95.02 |
Table 4. Model computational complexity and parameter count.

| Model | FLOPs (T) | Params (M) |
| --- | --- | --- |
| HSAUNet | 0.12 | 25.683 |
| DeepLabv3-ResNet50 | 0.177 | 41.246 |
| PSPNet-ResNet50 | 0.179 | 46.612 |
| UperNet Swin Transformer Base-Sized | 0.299 | 122.8 |
| SegFormer-b5 | 0.19 | 61.408 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

