1. Introduction
Mariculture is an important part of the global supply of aquatic products and plays an indispensable role in ensuring food security and promoting economic development in coastal areas. However, with the continuous expansion of farming scale, achieving sustainable development in the industry has become an urgent global issue. The Food and Agriculture Organization of the United Nations (FAO) is highly concerned about the ecological sustainability of mariculture and has put forward a series of policy recommendations and development goals. Strengthening the ecosystem management of mariculture is regarded as the top priority, including optimizing the layout and density of aquaculture, reducing environmental impacts, and achieving coordinated development of aquaculture and ecological protection. The FAO places special emphasis on promoting the interests of small-scale fisheries and on developing mariculture models suited to local economic development and employment in less-developed coastal areas. In this context, marine oyster farming, as a typical form of small-scale aquaculture, is ecologically friendly and socially inclusive, aligning well with the FAO's concept of sustainable development. This study focuses on the marine oyster farming industry in the Maowei Sea and explores the potential of applying remote sensing technology to optimize the layout and scale of local aquaculture, thereby providing scientific evidence for high-stakes decision-making in support of the sustainable development of the mariculture industry.
Since the continuous management and monitoring of marine aquaculture are essential for the sustainable development of the local economy and environmental conservation, high-resolution remote sensing imagery has become an invaluable tool in this domain, offering detailed insights that support effective planning and dynamic monitoring of aquaculture activities [1,2].
Accurate extraction of aquaculture areas from remote sensing imagery is critical for optimizing resource use, assessing production scales, and mitigating environmental impacts. However, such extraction presents several challenges, one of which is the presence of substantial noise in remote sensing imagery, including both coherent noise from optical sources and incoherent noise from sediments and floating objects on the sea surface. This noise complicates the segmentation process, making it difficult to accurately isolate aquaculture targets such as oyster rafts, which often occupy a minimal fraction of the overall pixel count and predominantly consist of small-scale objects [3,4,5]. The complexity of acquiring and curating remote sensing imagery datasets further hampers the robustness of tools designed specifically for aquaculture applications [6,7]. Traditional Convolutional Neural Network (CNN) architectures, characterized by successive pooling operations and large convolutional kernels, may reduce the resolution of feature maps, which is detrimental to the segmentation accuracy of small-scale targets [8].
Previously, conventional segmentation techniques such as threshold segmentation and mathematical transformations were employed to identify aquaculture areas based on characteristics like shape, color, grayscale, and texture [9,10,11,12,13,14,15,16]. While useful for preliminary classification and basic feature identification, these methods are resource-intensive and less efficient. They are also prone to disturbances from extraneous noise sources such as ships and floating debris, limiting their accuracy and applicability [11,12,14,16]. Recent advancements in Deep Learning (DL) have shown promise in addressing these challenges. For instance, Cheng et al. developed a semantic segmentation model based on a Hybrid Extended Convolutional Neural Network (HECNet), combining UNet and hybrid extended convolution to reduce the influence of similar objects on the sea surface and thus improve boundary segmentation accuracy [6]. Semi-supervised learning methods have been designed to tackle the issues of data acquisition and labeling, proving effective in emergency response and area estimation for aquaculture areas [7]. An improved version of UNet, incorporating an Atrous Spatial Pyramid Pooling (ASPP) module, has been applied to medium-resolution remote sensing imagery to address the boundary segmentation problem of densely distributed aquaculture areas [17]. However, these methods generally overlook the multi-scale characteristics of aquaculture areas, so critical small- and medium-scale features are lost as the network depth increases. Current DL models struggle to fully capture the features of aquaculture areas, particularly those with similar spacing, because they rely on the outputs of only a few network layers.
In order to overcome these limitations, this study introduces SwinNet, a novel semantic segmentation model designed to enhance the extraction of aquaculture areas in high-resolution remote sensing images, focusing on the Maowei Sea in the Beibu Gulf area of China. SwinNet employs the Swin Transformer as its backbone, leveraging its ability to model global semantic information through a self-attention mechanism without traditional convolution and pooling operations [18,19]. The model incorporates a novel Parallel Pooling Attention Module (PPAM) to minimize background noise and emphasize essential aquaculture areas. Furthermore, a multi-scale feature decoder merges high-level semantic details with multi-scale features, improving segmentation accuracy [20,21]. By addressing the shortcomings of traditional methods and existing DL models, SwinNet aims to provide a robust and accurate solution for the large-scale, high-precision monitoring of marine aquaculture targets. This model holds significant potential for enhancing the management and conservation of coastal aquaculture resources, particularly in the Maowei Sea region of the Beibu Gulf [1,2,9,22,23,24].
The rest of the paper is organized as follows. Section 2 introduces previous works related to the proposed method, and Section 3 describes the materials used in this study, including the study region and the data. Section 4 presents the proposed SwinNet for aquaculture area extraction and compares it with related works. Section 5 reports the experiments and results, Section 6 provides the discussion, and Section 7 draws the conclusion.
2. Related Work
In previous works, the Fully Convolutional Network (FCN) is regarded as a milestone of deep learning in the field of image segmentation [21]. FCN solves the pixel-level image segmentation problem by replacing the fully connected layers with convolutional layers. Another representative approach is UNet [25], which fuses features from each scale of the encoding phase with the corresponding decoding phase, thus obtaining rich contextual and spatial location information. The DeepLab family of algorithms has further advanced the field of semantic segmentation [26,27,28,29]. In particular, DeepLab V3+ [29] not only uses the Atrous Spatial Pyramid Pooling (ASPP) module to capture target information and its context at multiple scales but also introduces Depthwise Separable Convolution (DSConv) into the encoder, which effectively reduces the number of parameters and the amount of computation. In addition, to address the receptive field limitation caused by the fixed size of the convolutional kernel, KNet improves segmentation accuracy by unifying instance segmentation and semantic segmentation through a set of learnable convolutional kernel parameters [30].
The detailed information in the shallow outputs is particularly important for the accurate segmentation of small-scale targets. Nevertheless, CNN-based FCN variants rely only on the outputs of the deeper parts of the network for semantic segmentation, without taking full advantage of multi-scale features. In target detection, the Feature Pyramid Network (FPN) is an important multi-scale feature fusion strategy that is often used to capture small-scale targets [20]. FPNs offer equally powerful multi-scale feature fusion capabilities for semantic segmentation. For example, Kirillov et al. expanded the resolution of each layer of the FPN's feature maps through operations such as up-sampling and bilinear interpolation and fused them into a final feature map for segmentation, which effectively improved the segmentation of small-scale targets [31]. Since the scale of aquaculture areas is much smaller than that of land regions, in order to improve segmentation accuracy in remote sensing imagery, a multi-scale feature fusion decoder is designed in this paper to retain more feature channels in the FPN and to denoise the fused feature maps with an ASPP module built on depthwise separable convolution (DSConvASPP).
It is generally recognized that the Transformer has achieved great success in the field of natural language processing and has gained popularity in tasks such as image classification, target detection, and semantic segmentation due to its excellent sequence modeling capabilities and global information perception [32]. ViT was the first Transformer architecture demonstrated for the task of image classification [33]. Later, Zheng et al. proposed the SEgmentation TRansformer (SETR), which cast semantic segmentation as a sequence-to-sequence prediction task [34]. In a following effort, Segmenter uses ViT as an encoder for feature encoding and applies a Mask Transformer to decode the output of the encoder together with class embedding vectors [35]. In contrast, SegFormer was introduced with Mix Transformers to extract and fuse multi-scale features through an MLP decoder, achieving a combination of local and global attention [36].
Despite ViT's strong performance, it yields only single-scale, low-resolution feature maps and is generally computationally expensive. In order to extract the rich multi-scale features of aquaculture areas in remote sensing imagery, the Swin Transformer has drawn researchers' attention. In this research, the Swin Transformer is used as the backbone network, whose CNN-like hierarchical pyramid structure effectively reduces model complexity and computation cost [8].
In this study, the existing limitations in the field of semantic segmentation are addressed, particularly focusing on the application of CNNs such as the FCN, UNet, and the DeepLab algorithmic suite. While these methods have notably advanced segmentation accuracy and target recognition efficiency—especially through innovations in handling multi-scale features and reducing model parameters—significant research gaps remain, particularly in the context of remote sensing imagery and the segmentation of aquaculture areas. These gaps include:
- (1)
Accurate segmentation of small-scale targets: Despite advancements in multi-scale feature fusion strategies such as FPNs, challenges persist in the segmentation of very small-scale targets. This is especially pertinent in specialized application scenarios, like aquaculture area segmentation in remote sensing imagery, where the scale of interest significantly differs from larger terrestrial regions, necessitating enhanced feature capture capabilities.
- (2)
Effective fusion of deep and shallow features in UNet: Current methodologies tend to emphasize the output of deeper network layers for semantic segmentation. This approach often overlooks the potential of shallow layer features, which contain rich detail information critical for the accurate recognition of small-scale targets.
- (3)
Balancing computational efficiency and accuracy in DeepLab models: Although the introduction of depthwise separable convolution in DeepLab V3+ has reduced model parameters and computational requirements, further optimization is needed. Enhancing model efficiency, while maintaining or improving accuracy for resource-constrained scenarios, remains an essential research direction.
By addressing these gaps, this study aims to enhance semantic segmentation in terms of accuracy, efficiency, generalization, and robustness through targeted innovations.
3. Materials
3.1. Study Region
Located in Guangxi, China (21.5° N, 109.1° E), the Maowei Sea exemplifies the ideal conditions for sustainable oyster farming advocated by the FAO, such as clear water, moderate salinity, and rich plankton to support rapid oyster growth. Its warm climate, with winter temperatures above 10 °C, summer temperatures between 28 and 30 °C, and annual rainfall around 1500 mm, makes it one of the most suitable marine ranching areas for oyster farming. According to the 2023 Statistical Review of World Food and Agriculture, the Maowei Sea's oyster production increased from 190,000 tons in 2010 to 310,000 tons in 2021. As the largest oyster breeding bay in Southwest China, its oyster density per square meter has also been increased to meet market demand. The recent construction of the Pinglu Canal in Guangxi Province also brings new opportunities and challenges for marine aquaculture in the Maowei Sea.
Situated in the Beibu Gulf's northern sector, the Maowei Sea encompasses approximately 135 km², marking it as the Guangxi Zhuang Autonomous Region's sole marine park within China's expansive South China Sea. This region, teeming with plankton, serves as one of China's key aquaculture hubs [37]. For the communities lining the Maowei Sea, aquaculture stands as a pivotal economic pillar. However, the sector is currently navigating through a multitude of challenges, prompted by rapid economic growth [38,39,40]. These hurdles are fast becoming critical impediments to the region's sustainable economic progression. A notable concern is the practice of near-shore aquaculture, particularly the dense cultivation of oysters, which now surpasses the area's carrying capacity. This intensive farming has inflicted significant harm on marine biodiversity, impairing the sea's natural purification processes and leading to riverbed elevation and potential waterway blockages [38,39]. Moreover, discarded aquaculture plastics further exacerbate marine pollution, endangering local flora and fauna. The current approach to aquaculture management in the Qinzhou city area relies on a cumbersome system of community feedback and hierarchical reporting. Given aquaculture's extensive scope, this method is not only inefficient but also susceptible to inaccuracies and underreporting. This backdrop underscores the urgent need for a more effective technique to delineate oyster farming zones, thereby enhancing the stewardship of the Maowei Sea's aquaculture assets.
The sample data for this study were obtained from the Maowei Sea waters near Qinzhou City, Guangxi Province, China, using Google Earth high-resolution remote sensing imagery acquired on 5 November 2019. The imagery contains the three RGB channels of WorldView-2 with a spatial resolution of 0.593 m, as shown in Figure 1A. Figure 1 illustrates a detailed view of the aquaculture landscape in part of the area, at the geographic coordinates 108°32′ E and 21°47′ N (Figure 1B(a)). There are two main types of aquaculture in the Maowei Sea waters: raft culture and cage culture. Raft culture (Figure 1B(b)) employs floating rafts equipped with supports and ropes, identifiable by sporadic luminous markers along their peripheries, presenting as dark, cohesive rectangles from above. Cage culture (Figure 1B(c)), alternatively, utilizes natural materials like bamboo and wood, forming structures of varying shapes marked by distinct grid patterns, indicative of small enclosures used for marine cultivation.
3.2. Experimental Data
The dataset for this study, pivotal in training a model for remote sensing imagery segmentation specifically for aquaculture area identification, is constructed on the backbone of meticulously selected principles aimed at maximizing its effectiveness. These guiding principles—sample representation, accurate labeling, flexibility, discrimination, and computational feasibility—are integral for the precise and efficient analysis of the Maowei Sea’s aquaculture environments. Ensuring comprehensive representation, the dataset includes a wide array of aquaculture practices across different seasons and tidal conditions to capture environmental variabilities, enhancing the robustness of the model. Accurate labeling, performed with expert precision, defines the aquaculture areas, setting a clear ground truth essential for the model’s learning accuracy. Designed to be flexible, the dataset accommodates future changes in aquaculture practices and environmental shifts, maintaining its relevance. Discriminative capability is heightened by including distinct features that separate aquaculture zones from non-aquaculture areas, thus minimizing false positives and bolstering segmentation accuracy. Lastly, computational feasibility is carefully balanced against data quality and complexity, ensuring the dataset enriches the model’s training without imposing prohibitive computational demands. This holistic approach not only facilitates effective remote sensing imagery segmentation but also contributes significantly to the sustainable management and ecological conservation of aquaculture practices in the Maowei Sea, illustrating a commitment to both technological advancement and environmental stewardship.
3.2.1. Dataset Selection and Labeling
In recognition of the necessity for a dataset that accurately reflects the diverse environmental characteristics of aquaculture areas within the Maowei Sea, our selection focused on imagery from Longmen Harbor and Fortress Point of the Beibu Gulf. This selection strategy was guided by the aim to encompass a wide distribution of aquaculture zones, thereby enhancing the dataset's representativeness and diversity. Such diversity is crucial for improving the model's generalization capabilities, enabling it to perform reliably across large-scale remote sensing imagery segmentation tasks. In order to facilitate precise training and validation, we employed Labelme V4.5.7, an open-source data labeling software tool, for the meticulous labeling of each image within the dataset. This process involved categorizing the imagery into two principal classes: aquaculture areas and land. Concurrently, objects such as ships and sea surface floats, which do not belong to these categories, were labeled as complex background. This approach ensures in-class flexibility and out-of-class discrimination, critical for the model's accuracy and robustness.
3.2.2. Data Transformation and Augmentation
Acknowledging the challenges in acquiring extensive remote sensing imagery data, the collected dataset is relatively modest in size. To mitigate this limitation and enhance model performance, we adopted data augmentation techniques, notably random rotation transformations of the sample images. This method allows aquaculture areas to be extracted across various orientations and morphologies, significantly enriching the dataset. Examples of these transformations are illustrated in Figure 2, showcasing the augmented variability introduced into the model training process.
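As a concrete illustration, the snippet below sketches a paired random-rotation augmentation of the kind described above. It is a minimal example built on torchvision rather than the exact pipeline used in this study, and the set of candidate angles is assumed for illustration.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def random_rotate(image, label, angles=(90, 180, 270)):
    """Rotate an image/label pair by the same randomly chosen angle,
    keeping the pixel-level annotation aligned with the augmented image."""
    angle = random.choice(angles)
    image = TF.rotate(image, angle)
    # Nearest-neighbour interpolation keeps label values as valid class ids.
    label = TF.rotate(label, angle, interpolation=InterpolationMode.NEAREST)
    return image, label
```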
3.2.3. Dataset Configuration and Evaluation
The dataset was strategically divided into training and validation sets, adhering to an 8:2 ratio [41,42]. This division is aimed at optimizing the training process while reserving a significant portion of the data for model validation and testing. To enable a rigorous evaluation of the model's generalization performance, half of the data within the validation set were designated as a test set.
Table 1 details the dataset’s configuration, including the sizes of the sample images and labeled maps, both before and after augmentation. This table also includes information on the distribution of samples across training, validation, and test sets, providing a comprehensive overview of the dataset’s structure.
Through meticulous adherence to the outlined principles of dataset creation and preparation, we have developed a resource that is not only representative of the Maowei Sea’s aquaculture landscapes but is also tailored to maximize the performance of our remote sensing imagery segmentation model. This careful preparation underscores the dataset’s suitability for the task at hand, ensuring that it meets the computational demands of modern processing equipment while accommodating the nuanced requirements of aquaculture area identification through remote sensing imagery analysis.
4. Methodology
4.1. The Proposed SwinNet: The Model Structure
The multi-scale feature fusion-based semantic segmentation model SwinNet is shown in Figure 3. There are three components in the proposed SwinNet structure: an attention-enhanced encoder based on the Swin Transformer, a multi-scale feature fusion main decoder based on the Feature Pyramid Network (FPN), and an auxiliary decoder based on a residual structure. The Swin Transformer extracts multi-scale features from the input remote sensing imagery, and the Parallel Pooling Attention Module (PPAM) draws the focus onto aquatic regions in the global image and suppresses noise. The main decoder employs the FPN to fuse multi-scale features and complement the detailed information of the aquaculture areas, where the DSConvASPP module is applied to eliminate the noise caused by the fusion operation. During the training stage, the auxiliary decoder performs segmentation using the feature maps output by the encoder in stage 3; it incurs no computational cost during the deployment stage.
The combination of FPN and DSConvASPP significantly enhances the performance of semantic segmentation, particularly against noisy and complex backgrounds. The FPN structure fuses features from different levels to capture multi-scale information by constructing a pyramid. This approach maintains both global low-resolution features and detailed high-resolution features, thereby improving the recognition of small targets [28,43,44]. DSConvASPP, on the other hand, uses atrous convolution with varying sampling rates to capture multi-scale contextual information, expanding the receptive field without increasing computational complexity. This method comprehensively considers context information at multiple scales, enhancing segmentation accuracy and background noise suppression [28,43,44]. By applying DSConvASPP after the FPN, the multi-scale context information in the multi-level feature maps is further extracted, fully utilizing multi-scale features and context information to improve robustness and segmentation accuracy in complex and noisy environments.
4.2. Attention Enhanced Encoder Based on Swin Transformer and PPAM
The attention-enhanced encoder comprises four Swin Transformer stages, each followed by a PPAM, for the benefit of interference noise reduction in the remote sensing imagery [45]. The Swin Transformer backbone extracts multi-scale features from the input remote sensing imagery, and the PPAMs then perform feature filtering to increase the weight of the target area and suppress background and interference noise.
Figure 4 shows the detailed structure of the Swin Transformer block, containing the combined operations of Window Multi-head Self-Attention (W-MSA) [46] and Shifted Window Multi-head Self-Attention (SW-MSA) [47], in which neighboring filtering windows can interact with each other. Instead of applying convolution operations like a DCNN, the Swin Transformer divides the input image into 4 × 4 fixed-size, non-overlapping patches and maps them into embedding vectors through a linear embedding operation before extracting vector features with the self-attention mechanism. A patch merging operation is used instead of a pooling layer for window feature mapping, so that information loss is avoided in the Swin Transformer. Feature maps at multiple scale resolutions are then output from the Swin Transformer.
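For clarity, the following is a minimal sketch of the patch-partition and linear-embedding step described above. As in common Swin Transformer implementations, the 4 × 4 partition plus shared linear projection is realized with a strided convolution, which is mathematically equivalent; the embedding dimension of 96 is assumed to match the stage-1 channel count reported later.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into 4x4 non-overlapping patches and linearly embed
    each patch; a kernel-4/stride-4 convolution realizes both steps."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, 96, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, 96) token sequence
```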
The self-attention process over two consecutive blocks is given in Equations (1)–(4) such that,

$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1} \tag{1}$$

$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \tag{2}$$

$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l} \tag{3}$$

$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1} \tag{4}$$

where $\hat{z}^{l}$ ($\hat{z}^{l+1}$) and $z^{l}$ ($z^{l+1}$) denote the features obtained from the W-MSA (SW-MSA) and the MLP in the $l$-th ($l+1$-th) block, respectively, $z^{l-1}$ represents the output feature of the $(l-1)$-th block, and LN denotes layer normalization.
In the attention-enhanced encoder, the PPAM performs target region enhancement and noise reduction on the feature maps obtained from the four stages of the Swin Transformer; it comprises an average-pooled attention branch and a maximum-pooled attention branch. The structure of the PPAM is shown in Figure 5.
Assuming an input feature map $Z$ with height, width, and channel number of $H$, $W$, and $C$, respectively, the global average feature compression value vector $C_{afs}$ and the global maximum feature value vector $C_{mfs}$ corresponding to each channel in $Z$ are given in Equations (5) and (6), respectively, such that,

$$C_{afs} = F_{afs}(Z) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} Z(i, j) \tag{5}$$

$$C_{mfs} = F_{mfs}(Z) = \max_{1 \le i \le H,\, 1 \le j \le W} Z(i, j) \tag{6}$$

where $Z(i, j)$ denotes the value on the feature map representing the global sensory range of the corresponding channel, and $F_{afs}$ and $F_{mfs}$ are the global average pooling operation and the global maximum pooling operation, respectively.

In the fully connected (FC) layers, the correlation between different channels, based on the extracted global features, is modeled with a sigmoid activation function, and the weight vector of the feature map is generated to mark the importance of each channel. Equations (7) and (8) express the FC processing such that,

$$s_{a} = \sigma\left(W_{2}\, \delta\left(W_{1} C_{afs}\right)\right) \tag{7}$$

$$s_{m} = \sigma\left(W_{2}\, \delta\left(W_{1} C_{mfs}\right)\right) \tag{8}$$

where $\delta$ is the ReLU activation function, $\sigma$ is the Sigmoid activation function, $W_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$ are the parameters of the first and second FC layers, respectively, and $r$ is the compression factor of the channel, which is set to $r = 16$ in this experiment. The number of channels $C$ is reduced to $C/r$ after the first FC layer and then restored after the second FC layer [23]. The weight values of each channel are then calculated with the Sigmoid function, and the final output of the encoder is obtained such that,

$$Z' = Z \otimes \left(s_{a} + s_{m}\right) \tag{9}$$

where $\otimes$ represents the product operation, which adjusts the importance of each feature channel in the original feature map according to the evaluation results of the two attention branches, so that the features of important targets are enhanced and the background noise is suppressed.
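The sketch below renders the PPAM description above as PyTorch code. It is a hedged reconstruction, not the authors' released implementation: whether the two pooling branches share the FC weights, and how the two weight vectors are fused before re-weighting, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PPAM(nn.Module):
    """Parallel Pooling Attention Module (sketch): average- and max-pooled
    channel statistics pass through an FC bottleneck (reduction r), and
    the resulting sigmoid weights rescale the input channels."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),   # first FC: C -> C/r
            nn.ReLU(inplace=True),                # delta in Eqs. (7)-(8)
            nn.Linear(channels // r, channels),   # second FC: C/r -> C
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, z):                         # z: (B, C, H, W)
        b, c, _, _ = z.shape
        c_afs = z.mean(dim=(2, 3))                # global average pooling, Eq. (5)
        c_mfs = z.amax(dim=(2, 3))                # global max pooling, Eq. (6)
        s_a = self.sigmoid(self.fc(c_afs))        # Eq. (7)
        s_m = self.sigmoid(self.fc(c_mfs))        # Eq. (8)
        w = (s_a + s_m).view(b, c, 1, 1)          # fuse the two branches (assumed sum)
        return z * w                              # channel-wise product, Eq. (9)
```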
4.3. Main Decoder Based on FPN for Multi-Scale Feature Fusion
In order to address the problem of detail feature loss in aquaculture areas, a multi-scale feature fusion main decoder is designed in SwinNet [31], and Figure 6 shows the structure of this decoder. The decoder utilizes the FPN to fuse feature maps at different scales, and denoising is performed by an Atrous Spatial Pyramid Pooling (ASPP) module built on Depthwise-Separable Convolution (DSConv) to improve segmentation accuracy. DSConv reduces the number of parameters and the computational effort in comparison with conventional convolution.
The input to the decoder is the set of multi-scale features from the encoder, P = {P1, P2, P3, P4}, with resolutions of {1/4, 1/8, 1/16, 1/32} of the original input image and channel numbers of {96, 192, 384, 768}, respectively. In the decoding process, the number of channels of each level of P is first set to 512 using a 1 × 1 convolution to enable the fusion of features at different scales. Next, each low-resolution feature map is up-sampled by a factor of 2 and summed element by element with the adjacent higher-resolution feature map through the top-down feature fusion channel, yielding F = {F2, F3, F4}. The aliasing effect generated by the element-by-element summation is then eliminated by a 3 × 3 depthwise-separable convolution, DSConv3. The outputs of each level of the FPN are fused at the up-sampling rates of {8, 4, 2, 1}, respectively. The fused feature maps are generated using DSConv3 with a 3 × 3 depthwise-separable convolution kernel, which comprises depthwise and pointwise convolutions, and the outputs of all the convolutions are concatenated together.
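The following PyTorch sketch mirrors the top-down fusion just described (lateral 1 × 1 convolutions to 512 channels, ×2 up-sampling with element-wise summation, DSConv3 smoothing, and concatenation of all levels at 1/4 resolution). It is an interpretation of the text for illustration, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConv3(nn.Module):
    """3x3 depthwise-separable convolution: depthwise then pointwise."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class FPNFusion(nn.Module):
    """Top-down multi-scale fusion over P1..P4 (strides 4/8/16/32)."""
    def __init__(self, in_channels=(96, 192, 384, 768), mid=512):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in in_channels)
        self.smooth = nn.ModuleList(DSConv3(mid, mid) for _ in in_channels)

    def forward(self, feats):                       # [P1, P2, P3, P4]
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(lat) - 2, -1, -1):       # top-down pathway
            lat[i] = lat[i] + F.interpolate(lat[i + 1], scale_factor=2,
                                            mode="nearest")
        outs = [s(x) for s, x in zip(self.smooth, lat)]  # remove aliasing
        size = outs[0].shape[-2:]                   # 1/4 of the input image
        outs = [F.interpolate(o, size=size, mode="bilinear",
                              align_corners=False) for o in outs]
        return torch.cat(outs, dim=1)               # fused multi-scale map
```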
The DSConvASPP module is designed to denoise the fused feature maps. It is based on the ASPP module and contains one 1 × 1 convolution and three 3 × 3 depthwise-separable convolutions with dilation rates of 6, 12, and 18, respectively. Global context information is integrated into the feature map by up-sampling back to the input resolution after a global average pooling operation. This design not only expands the receptive field of the network but also reduces the number of parameters and the computational cost.
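A corresponding sketch of the DSConvASPP branch layout is given below, again as an illustrative reconstruction from the description; the normalization layers, output channel width, and the final 1 × 1 projection are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ds_atrous(in_ch, out_ch, rate):
    """3x3 depthwise-separable atrous convolution branch."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=rate, dilation=rate,
                  groups=in_ch, bias=False),        # dilated depthwise
        nn.Conv2d(in_ch, out_ch, 1, bias=False),    # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class DSConvASPP(nn.Module):
    """One 1x1 conv, three DSConv branches at rates 6/12/18, plus an
    image-pooling branch; outputs are concatenated and projected."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.conv1x1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                     nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.rate6 = ds_atrous(in_ch, out_ch, 6)
        self.rate12 = ds_atrous(in_ch, out_ch, 12)
        self.rate18 = ds_atrous(in_ch, out_ch, 18)
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                        nn.ReLU(inplace=True))
        self.project = nn.Conv2d(5 * out_ch, out_ch, 1, bias=False)

    def forward(self, x):
        size = x.shape[-2:]
        pooled = F.interpolate(self.image_pool(x), size=size,
                               mode="bilinear", align_corners=False)
        feats = [self.conv1x1(x), self.rate6(x), self.rate12(x),
                 self.rate18(x), pooled]            # global context recovered
        return self.project(torch.cat(feats, dim=1))
```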
4.4. Auxiliary Decoder Based on Residual Connection
In SwinNet, an auxiliary decoder based on residual connections is designed to further improve the segmentation accuracy for aquaculture areas, and Figure 7 shows its structure [43,48]. There are two DSConv3 operations in the auxiliary decoder with a residual operation. The residual connection in this decoder transforms the feature channels by concatenation before passing through the DSConv3, leading to the multiplexing of features in the channel dimension [49]. In practice, the auxiliary decoder helps to strengthen the gradient in the loss calculation during the training process and improves the model's segmentation accuracy on the target, while it is not involved in the inference process.
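One possible reading of the concatenation-style residual described above is sketched below; the channel sizes (stage-3 features with 384 channels) and the placement of the classifier are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AuxDecoder(nn.Module):
    """Auxiliary decoder (sketch): two DSConv3 steps with a concatenation
    residual that multiplexes features along the channel dimension. It only
    adds gradient signal during training and is dropped at inference."""
    def __init__(self, in_ch=384, mid=256, num_classes=3):
        super().__init__()
        self.dsconv1 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.ReLU(inplace=True))
        self.dsconv2 = nn.Sequential(
            nn.Conv2d(in_ch + mid, in_ch + mid, 3, padding=1,
                      groups=in_ch + mid, bias=False),
            nn.Conv2d(in_ch + mid, mid, 1, bias=False), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(mid, num_classes, 1)

    def forward(self, x):                            # stage-3 feature map
        y = self.dsconv1(x)
        y = self.dsconv2(torch.cat([x, y], dim=1))   # residual by splicing
        return self.classifier(y)
```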
4.5. The Loss Function
The purpose of this experiment is to perform semantic segmentation of the aquaculture areas in the remote sensing imagery, classifying oyster rafts, land, and background accordingly; the output of the proposed model is therefore a predictive-value image over the three categories. The output layer of SwinNet is activated with a Softmax function and uses cross entropy as the loss function. The final loss function combines the loss of the main decoder and the loss of the auxiliary decoder, each with its own weighting factor, such that,

$$L = \alpha L_{m} + \beta L_{a} \tag{10}$$

$$L_{m(a)} = -\frac{1}{N} \sum_{(i, j)} \sum_{c=0}^{2} y_{(i,j)}^{c} \log p_{(i,j)}^{c} \tag{11}$$

where $L$ is the total loss function, and $L_{m(a)}$ denotes the cross-entropy loss function of the main (auxiliary) decoder. α and β are the corresponding weighting factors, with α = 1.0 and β = 0.4 in this experiment. $N$ is the total number of samples, and $(i, j)$ is the positional coordinate of the pixel. $p_{(i,j)}^{c}$ is the predicted probability that the pixel at location $(i, j)$ belongs to class $c \in \{0, 1, 2\}$, where 0, 1, and 2 represent the background, aquaculture areas, and land, respectively, and $y_{(i,j)}^{c} \in \{0, 1\}$ takes 1 when $c$ is the true label of the pixel and 0 otherwise.
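In code, the combined objective of Equation (10) reduces to a few lines; the sketch below assumes logits already up-sampled to the label resolution.

```python
import torch.nn.functional as F

def swinnet_loss(main_logits, aux_logits, target, alpha=1.0, beta=0.4):
    """Weighted sum of main- and auxiliary-decoder cross-entropy (Eq. (10)).
    `target` holds class ids {0: background, 1: aquaculture, 2: land}."""
    loss_main = F.cross_entropy(main_logits, target)  # main decoder CE, Eq. (11)
    loss_aux = F.cross_entropy(aux_logits, target)    # auxiliary decoder CE
    return alpha * loss_main + beta * loss_aux
```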
4.6. Evaluation Metrics
In this study, the metrics used to evaluate the segmentation performance of the proposed model include the Pixel Accuracy (PA), the mean PA (mPA), the Intersection over Union (IoU), the mean IoU (mIoU), Precision, Recall, the F1-Score, and the mean F1-Score (mF1), which are given, respectively, as,

$$PA = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$mPA = \frac{1}{k+1} \sum_{i=0}^{k} PA_{i} \tag{13}$$

$$IoU = \frac{TP}{TP + FP + FN} \tag{14}$$

$$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} IoU_{i} \tag{15}$$

$$Precision = \frac{TP}{TP + FP} \tag{16}$$

$$Recall = \frac{TP}{TP + FN} \tag{17}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{18}$$

$$mF1 = \frac{1}{k+1} \sum_{i=0}^{k} F1_{i} \tag{19}$$

where it is assumed that the data contain a total of k + 1 categories (the extra class represents the empty or background class), TP denotes true positives, FP false positives, TN true negatives, and FN false negatives. PA is the percentage of pixels correctly predicted by the model out of the total number of pixels, and IoU denotes the ratio of the intersection to the union of the segmentation regions predicted by the model and the true segmentation regions, which is used to assess the overlap of the prediction results with the true labels. Precision is the ratio of the number of pixels correctly predicted as targets to the total number of pixels predicted as targets. Recall is the ratio of the number of pixels correctly predicted as targets to the total number of pixels that truly are targets, and the F1-Score reflects the ability to recognize and differentiate between the target and the background. mPA, mIoU, and mF1 are the averages of PA, IoU, and F1 over all categories, respectively.
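For reference, the metrics above can be computed from a confusion matrix as in the following sketch; this is a generic implementation of the standard formulas, not the authors' evaluation script, and it omits zero-division guards for brevity.

```python
import numpy as np

def segmentation_metrics(pred, label, num_classes=3):
    """Confusion-matrix-based PA / IoU / Precision / Recall / F1 per class."""
    mask = (label >= 0) & (label < num_classes)
    cm = np.bincount(num_classes * label[mask].astype(int) + pred[mask].astype(int),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)                      # cm[i, j]: true class i, predicted j
    fp = cm.sum(axis=0) - tp              # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp              # missed pixels of the class
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    pa = tp.sum() / cm.sum()              # overall pixel accuracy
    return {"PA": pa, "mIoU": iou.mean(), "mF1": f1.mean(),
            "IoU": iou, "Precision": precision, "Recall": recall}
```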
4.7. Comparison Experiment
In order to objectively evaluate the segmentation performance of the proposed SwinNet, several representative mainstream segmentation models are also tested on the aquaculture areas dataset, including CNN-based methods such as UNet [25], DeepLabV3+ [29], Semantic_FPN [31], and KNet [30,32], together with Transformer-based segmentation methods such as SETR [34], Segmenter [35], and SegFormer [36]. The parameter configurations and environments are the same for all tests.
In order to rigorously validate the proposed SwinNet model's effectiveness and superiority in the domain of remote sensing image segmentation, especially for aquaculture area delineation, this study undertakes a comprehensive comparative analysis against both CNN-based and Transformer-based algorithms. The CNN-based models selected for comparison include DeepLabV3+ [29], renowned for its enhanced feature extraction through atrous convolution and adept handling of objects at various scales; UNet [25], noted for its symmetric architecture that efficiently captures contextual information and achieves precise localization; Semantic FPN [31], which utilizes a Feature Pyramid Network to integrate multi-scale features; and KNet [30,32], a recent advancement that introduces kernel predictors for dynamic convolution. Alongside these, Transformer-based algorithms are also analyzed for a holistic evaluation: SETR [34], which pioneers the application of Transformers to semantic segmentation; Segmenter [35], employing a Vision Transformer (ViT) as the encoder with a straightforward decoder; and SegFormer [36], combining a hierarchical Transformer encoder with a lightweight MLP decoder for a balance between accuracy and efficiency.
The proposed SwinNet leverages the Swin Transformer’s capabilities within a specialized network architecture designed for the efficient and precise segmentation of aquaculture areas in remote sensing imagery. By harnessing the hierarchical structure and shift window mechanism of the Swin Transformer, SwinNet excels in capturing both the global context and fine details essential for distinguishing aquaculture zones from complex backgrounds. This comparative study serves multiple purposes: benchmarking SwinNet’s performance to highlight its accuracy, efficiency, and adaptability; demonstrating its specific advantages over existing CNN-based and Transformer-based models in addressing the unique challenges of aquaculture area segmentation; and substantiating its superiority in terms of segmentation precision and generalization capabilities. Through this extensive analysis, we aim to provide a clear, evidence-based affirmation of SwinNet’s superior performance, underscoring its significant contribution to advancing the state-of-the-art in remote sensing image segmentation for aquaculture monitoring and showcasing its potential for widespread application in the remote sensing domain.
4.8. Training and Optimization
In the training process, the AdamW optimizer is used to optimize the proposed model; it not only adapts to sparse gradients but also mitigates gradient oscillation [5,19]. The Poly dynamic learning rate adjustment strategy, based on an exponential transformation, is applied as given in Equation (20) such that,

$$lr = lr_{0} \times \left(1 - \frac{\text{Iteration}}{\text{Iteration}_{max}}\right)^{\rho} \tag{20}$$

where the initial learning rate $lr_{0}$ is set to 0.0001, $\rho$ is set to 0.9, Iteration denotes the current training iteration batch, and $\text{Iteration}_{max}$ is set to 20,000 as the maximum number of iterations. The weight decay of this optimization is set to 0.0005.
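The schedule of Equation (20) can be reproduced in a few lines, as sketched below; MMSegmentation's built-in `poly` policy implements the same decay, so this manual version is only illustrative.

```python
import torch

def poly_lr(optimizer, iteration, max_iter=20000, base_lr=1e-4, power=0.9):
    """Poly decay, Eq. (20): lr = base_lr * (1 - iter / max_iter) ** power."""
    lr = base_lr * (1 - iteration / max_iter) ** power
    for group in optimizer.param_groups:
        group["lr"] = lr
    return lr

# AdamW with the paper's settings (the `model` variable is illustrative):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-4)
```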
The program of this experiment was implemented under the MMSegmentation (Pytorch 1.11.0, mmcv-full 1.7.1, Python 3.8) development framework, and the experimental environment was based on the Windows operating system running with an Intel(R) Xeon(R) Platinum 8255C (CPU) and an NVIDIA GeForce RTX 3090 (GPU) with 24 GB of video memory.
4.9. Aquaculture Areas Monitoring
In general, semantic segmentation achieves pixel-level classification, assigning different semantic identifiers so that remote sensing imagery is segmented into regions. In this experiment, SwinNet is designed to semantically segment the remote sensing imagery of the Maowei Sea of the Beibu Gulf, and by counting the total number of pixels classified as aquaculture area, the aquaculture farming scale can be calculated accurately. Combined with the spatial resolution of the remote sensing imagery, an accurate estimate of the aquaculture area can be obtained through Equations (21) and (22) such that,

$$N_{as} = \sum_{(i,j)} \mathbb{1}\left[\hat{y}_{(i,j)} = 1\right] \tag{21}$$

$$S = N_{as} \times SSR \tag{22}$$

where $N_{as}$ is the total number of pixels classified as aquaculture area. SR and SSR are the calibration constants representing the spatial resolution of the input image and the actual area of each pixel, respectively; in this experiment, SR = 0.593 m and SSR = 0.337417 m². $S$ denotes the actual area of the aquaculture region. In addition, an error rate is introduced to measure the estimation accuracy of the proposed model, indicating the difference between the true value and the predicted value, such that,

$$E = \frac{\left| N_{as} - N_{t} \right|}{N_{t}} \times 100\% \tag{23}$$

where $E$ is the error rate, and $N_{t}$ represents the true total number of pixels belonging to the aquaculture area. A small value of $E$ means a higher prediction accuracy. To achieve this target, a dataset of aquaculture areas suitable for semantic segmentation is created from high-resolution remote sensing imagery of the Maowei Sea as training data; the trained SwinNet model is then used to extract aquaculture areas from the remote sensing imagery, and the area of the aquaculture regions is further computed from the segmentation results.
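The pixel-to-area conversion and error rate of Equations (21)–(23) amount to simple bookkeeping on the predicted label map, as in the sketch below (the function and argument names are illustrative).

```python
def aquaculture_area(pred, ssr=0.337417, true_pixels=None):
    """Convert a predicted label map to real-world area (Eqs. (21)-(22)).
    Class id 1 marks aquaculture pixels; `ssr` is the per-pixel area in m^2.
    If the true pixel count is known, also report the error rate E (Eq. (23))."""
    n_as = int((pred == 1).sum())       # pixels classified as aquaculture
    area_m2 = n_as * ssr                # Eq. (22): S = N_as * SSR
    if true_pixels is None:
        return area_m2
    error = abs(n_as - true_pixels) / true_pixels
    return area_m2, error
```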
6. Discussion
6.1. Labeling Image Tiles and Patches with Unique Identifiers
In the field of remote sensing image analysis, accurate data annotation is a key part of improving model performance. In this study, LabelMe V4.5.7 software was used to annotate remote sensing images to improve the recognition accuracy of aquaculture areas. LabelMe's ease of use and efficient support for large remote sensing images, especially its accurate image annotation capability, played a decisive role in ensuring high recognition accuracy in this study. However, inconsistency of label data values becomes a significant challenge in a multi-participant annotation process; it may be caused by subtle differences in the judgment criteria of different annotators, thus affecting the consistency of the label data and the accuracy of the final model. To address this issue, this study adopts the method of assigning Uniform Identification Codes (UICs) to objects, which reduces inconsistency between annotations by ensuring that each annotated object has a unique identification code. To further improve the consistency and quality of annotation, this study also developed detailed annotation guidelines and provided rigorous training to all annotators. These guidelines and training covered how to correctly distinguish aquaculture areas from non-aquaculture waters, how to assign an identification code to each labeled object, and how to use the LabelMe software for high-precision labeling. Furthermore, to ensure the accuracy of the labeling results, we established an annotation review process and adopted a multi-person cross-validation method, meaning that each annotated image is independently reviewed by multiple annotators, and any inconsistent annotations are rechecked and corrected. These measures significantly improve the consistency and quality of the labeled data, laying a solid foundation for subsequent model training. Although the introduction of UICs and other quality control measures increases the complexity of the annotation process, these steps are essential to ensure the accuracy of data annotation and to further improve the precision of aquaculture area identification. Going forward, with the development of automated annotation techniques, we expect to further reduce the labor intensity of manual annotation while maintaining or even improving the quality and consistency of data annotation.
Figure 11 shows an example of the labeling operation and the resulting annotation data from LabelMe software.
6.2. Discussion Based on the SwinNet Model Results
In the discussion of the results presented by our proposed SwinNet model, particularly as showcased in Figure 12, the model's capability to effectively segment large-size remote sensing imagery [50] and accurately compute the area of aquaculture regions is of notable significance [51]. The methodology employed, segmenting high-resolution remote sensing imagery of 2500 × 2500 pixels into smaller 500 × 500 images for SwinNet processing, demonstrates the model's adeptness not just in handling large datasets but also in maintaining precision at a granular level. The featured regions of Sipo Dun, Sanpo Dun, and Toupo Dun within the Maowei Sea area underscore the model's utility in real-world applications.
A critical observation from the segmented outputs, particularly the stitched images (Figure 12b) and the detailed aquaculture region extraction (Figure 12d), is SwinNet's precision in delineating the aquaculture boundaries and its minimal artifact generation at the seams of stitched images. This outcome reinforces the model's proficiency in extracting detailed features such as the edges of aquaculture areas, which is crucial for accurate area calculation and further applications in coastal aquaculture management. Moreover, the area statistics derived from the model's predictions, illustrated in Figure 12d,e, highlight the practical utility of SwinNet in providing tangible data for the aquaculture industry. The ability to translate pixel counts into real-world area measurements, as conducted for both local (approximately 24,281.01 m²) and global predictions (approximately 331,637.01 m²), offers an efficient and accurate method for managing and evaluating the scope of aquaculture operations.
The potential applications of SwinNet extend beyond single-time measurements. When historical remote sensing imagery data are available, SwinNet’s application can be expanded to track the development trends of aquaculture areas over time. This capability could provide invaluable insights into the sustainable development of the coastal aquaculture industry, allowing for the monitoring of expansion, contraction, or any changes in aquaculture practices. Such dynamic tracking aligns with the broader objectives of environmental sustainability and resource management, offering a tool not only for current operational assessment but also for future planning and conservation efforts.
Thus, SwinNet stands out not merely for its technical sophistication but for its practical implications, offering a robust solution for the challenges faced in large-scale aquaculture monitoring and management. The presented results and subsequent analysis affirm the model’s superiority and adaptability to various scales of operation, marking a significant advancement in the field of remote sensing applied to aquaculture.
6.3. Challenges and Scale to Large Datasets
The proposed SwinNet model has been validated in relatively small experimental areas, but scaling it up to larger areas may pose significant challenges, primarily due to computational power requirements. High-resolution remote sensing imagery of large study areas can be challenging to process with SwinNet, as it is designed to run on a single PC with limited RAM. Handling large datasets necessitates more powerful computing resources and memory, such as those offered by distributed and cloud computing systems [2]. Adapting the SwinNet model to distributed computing platforms will require redesign and optimization, potentially incurring additional costs. Large remote sensing imagery datasets also introduce complexity because they include diverse surface characteristics beyond aquaculture areas, such as offshore industries and island facilities, which can generate additional noise; SwinNet must therefore be robust enough to handle these disturbances [52,53]. Additionally, large remote sensing imagery datasets are often composed of images captured at different times, meaning the target aquaculture areas may vary in size, shape, color, or texture due to seasonal changes and varying meteorological conditions, further challenging the model's robustness. There are several potential ways to improve the generality of the proposed SwinNet, including a larger and more diverse dataset [54], data augmentation [55], model regularization [56], and early stopping [57].
However, in practical terms, oyster farming areas typically concentrate in designated offshore waters where conditions are favorable for breeding. Oyster farming in distant sea waters is not environmentally or economically feasible. Thus, large-scale application of the SwinNet model may not be necessary in most practical situations.
This study provides a foundation for remote sensing-based aquaculture monitoring using high-resolution imagery; however, several important directions remain for future exploration. For example, the robustness and generalizability of the proposed model could be enhanced through validation across diverse aquaculture environments with varying geographic, ecological, and operational characteristics. The consideration of satellite imagery of varying resolutions and from different sources will be critical for broadening the proposed model's practical applicability. While this work focuses on high-resolution images, extending the framework to medium- or low-resolution data would allow for integration with more widely available satellite platforms and facilitate regional or national-scale monitoring. The temporal dimension of aquaculture development, such as seasonal variability, farming cycles, and long-term structural changes, requires further investigation. Incorporating time-series analysis and multi-temporal imagery could significantly enhance the model's capacity to detect and interpret temporal trends and transitions in aquaculture areas. Reducing the model's computational expense without sacrificing accuracy remains an important challenge, particularly for deployment in real-time or resource-constrained environments. Future work should explore model compression techniques, lightweight architectures, and efficient inference strategies to support scalable implementation.
7. Conclusions
This study presents the development and application of the SwinNet model, a cutting-edge solution for the semantic segmentation of remote sensing imagery (RSI), specifically designed to identify aquaculture areas within the Maowei Sea, part of the Beibu Gulf in China. At its core, SwinNet leverages a Swin Transformer to adeptly extract multi-scale features from RSI, addressing the critical need for high-precision aquaculture delineation. The introduction of the PPAM within the encoder significantly boosts the model's capability to discern nuanced features of aquaculture zones, enhancing the overall accuracy of feature extraction. To capitalize on the rich semantic information and intricate details present in both low- and high-resolution features, a multi-scale feature decoder, inspired by the principles of the FPN and the DSConvASPP module, is meticulously designed. This decoder architecture ensures the effective fusion of diverse feature scales, thereby enriching the model's semantic segmentation output. Further augmenting SwinNet's segmentation precision is an auxiliary decoder, incorporating a residual structure to refine the segmentation accuracy. The performance of the SwinNet model is rigorously evaluated on a specially curated dataset representing a diverse range of aquaculture areas. Through comprehensive experimental tests, SwinNet has demonstrated remarkable segmentation proficiency, achieving mean Pixel Accuracy (mPA) and mean Intersection over Union (mIoU) values of 97.82% and 95.77%, respectively. These metrics not only underscore SwinNet's exceptional performance but also signify its superiority over existing mainstream segmentation models. SwinNet's innovation extends beyond its technical prowess; it offers a pragmatic solution for the aquaculture industry. By facilitating the precise and convenient extraction of aquaculture areas from RSI, SwinNet enables the efficient management and statistical analysis of aquaculture operations. This capability holds significant implications for the sustainable and effective oversight of aquaculture practices, paving the way for enhanced productivity and environmental stewardship within the sector.