Article

Fine-Grained Classification of Lakeshore Wetland–Cropland Mosaics via Multimodal RS Data Fusion and Weakly Supervised Learning: A Case Study of Bosten Lake, China

1 State Key Laboratory of Ecological Safety and Sustainable Development in Arid Lands, Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences, Urumqi 830011, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 China-Kazakhstan Joint Laboratory for Remote Sensing Technology and Application, Al-Farabi Kazakh National University, Almaty 050012, Kazakhstan
4 Key Laboratory of RS & GIS Application Xinjiang, Urumqi 830011, China
5 School of Geography, Geomatics and Planning, Jiangsu Normal University, Xuzhou 221116, China
6 Department of Electrical, Computer and Biomedical Engineering, University of Pavia, 27100 Pavia, Italy
* Author to whom correspondence should be addressed.
Land 2026, 15(1), 92; https://doi.org/10.3390/land15010092
Submission received: 24 November 2025 / Revised: 25 December 2025 / Accepted: 30 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue Challenges and Future Trends in Land Cover/Use Monitoring)

Abstract

High-precision monitoring of arid wetlands is vital for ecological conservation, yet traditional methods incur prohibitive labeling costs due to complex features. In this study, the wetland of Bosten Lake in Xinjiang is selected as a case area, where Pleiades and PlanetScope-3 multimodal remote sensing data are fused using the Gram–Schmidt method to generate imagery with high spatial and spectral resolution. Based on this dataset, we systematically compare the performance of fully supervised models (FCN, U-Net, DeepLabV3+, and SegFormer) with a weakly supervised learning model, One Model Is Enough (OME), for classifying 19 wetland–cropland mosaic types. Results demonstrate that: (1) SegFormer achieved the best overall performance (98.75% accuracy, 95.33% mIoU), leveraging its attention mechanism to enhance semantic understanding of complex scenes. (2) The weakly supervised OME, using only image-level labels, matched fully supervised performance (98.76% accuracy, 92.82% F1-score) while drastically reducing labeling effort. (3) Multimodal fusion boosted all models’ accuracy, most notably increasing U-Net’s mIoU by 63.39%. (4) Models exhibited complementary strengths: U-Net excelled in wetland vegetation segmentation, DeepLabV3+ in crop classification, and OME in preserving spatial details. This study validates a pathway integrating multimodal fusion with WSL to balance high accuracy and low labeling costs for arid wetland mapping.

1. Introduction

Wetlands are unique ecosystems sustained by perennial or seasonal water saturation, supporting distinctive biotopes through specialized hydrology, soils, and vegetation [1]. As critical transition zones between terrestrial and aquatic systems, they provide irreplaceable ecological services, including biodiversity conservation, hydrological regulation, biogeochemical cycling, and climate modulation [2,3]. A prime example is Xinjiang’s Bosten Lake, the largest inland freshwater wetland in China’s arid zone. It plays a vital role in maintaining regional ecological security and serves as a key stopover for Central Asian migratory birds [4]. However, climate change and intensive human activities have transformed its littoral zone into a classic wetland-cropland mosaic [5]. This intricate landscape pattern, while allowing agricultural drainage to partially recharge wetlands, also intensifies water competition and introduces non-point source pollution risks, thereby directly threatening the wetland ecosystem’s integrity [6]. Consequently, accurately characterizing the spatiotemporal distribution of wetlands and croplands within this composite landscape is a fundamental prerequisite for informing critical management initiatives, such as delineating ecological protection redlines, assessing biodiversity habitats, and optimizing agricultural water allocation.
At the technical level, remote sensing (RS) technology has become an indispensable tool for wetland vegetation identification and information extraction due to its large coverage, periodic observation, and multi-scale capabilities [7,8,9]. However, wetland ecosystems are generally characterized by prominent spectral mixing phenomena, fuzzy object boundaries, and significant seasonal changes, which, together with the inherent noise interference in RS imagery, present major challenges for traditional classification methods in terms of feature expression and model robustness [10]. In recent years, deep learning (DL) has revolutionized RS image classification [11]. Particularly in semantic segmentation, architectures such as the fully convolutional network (FCN) and U-Net can automatically extract multi-level spatial-semantic features, significantly improving accuracy in complex scenes and establishing themselves as the mainstream method in this field [12,13]. However, such fully supervised DL methods are heavily dependent on large volumes of high-quality labeled samples, and the complexity and heterogeneity of wetland feature classes, coupled with the extremely high cost of manual labeling, severely limit their practical application in wetland classification. Therefore, the development of intelligent classification methods with strong adaptability and high labeling efficiency has become a critical issue to be addressed in the field of wetland remote sensing.
Weakly Supervised Learning (WSL) offers an efficient solution to the core challenge of scarce labeled data in wetland classification [14]. Unlike traditional methods that rely on pixel-level fine annotations, WSL requires only coarse-grained weak labels such as image-level labels, point annotations, or bounding box annotations to achieve accurate semantic segmentation [15,16]. Recent years have witnessed significant advances in this technology in remote sensing, with the technical pathway maturing from class activation map (CAM) optimization methods (e.g., CAM [17], AdvCAM [18], SC-CAM [19]), which use the activation maps of classification networks to coarsely localize target regions, to adversarial erasing methods that progressively localize target regions to improve segmentation accuracy [20]. However, these methods are mostly limited to binary segmentation or simple scenes, making it difficult to address the requirements for fine-grained classification in complex landscapes. To overcome this limitation, the recently proposed "One Model Is Enough" (OME) framework introduces a sample filtering mechanism and a co-occurrence evaluation metric to optimize training sample quality, and further proposes an uncertainty-driven pixel-level weighted masking method based on multi-class CAM, which significantly improves segmentation performance in multi-class scenarios [21]. Nevertheless, such methods still lack systematic validation in arid wetland-cropland mosaics characterized by complex spectral-spatial heterogeneity [22]. Furthermore, they face a practical data dilemma: high-resolution imagery is often cost-prohibitive, whereas more accessible medium-resolution data lacks the spatial detail required for fine-grained discrimination within wetlands. To bridge these gaps, this study constructs a multimodal dataset by fusing high-resolution Pleiades imagery with high-temporal-frequency PlanetScope data. We then systematically evaluate state-of-the-art WSL models, including the OME framework, within this challenging arid wetland context. Our work aims to establish an effective paradigm for leveraging multi-source remote sensing data to achieve fine-grained wetland classification under weak supervision.
At the data level, the development of multimodal RS data fusion technology opens new avenues for the fine classification and dynamic monitoring of wetlands [23]. Data from different sensors have their own advantages and limitations [24,25]. Optical imagery can reflect rich details of feature appearance but is susceptible to meteorological conditions; hyperspectral imagery possesses continuous and abundant spectral information, facilitating fine feature differentiation, but its spatial resolution is usually limited; radar imagery can penetrate clouds and rain, making it suitable for all-weather monitoring, but its signals are susceptible to surface roughness and dielectric properties [26]. Complementary information from different sources can be integrated by fusing multimodal RS data [27]. Prior studies in wetland contexts have demonstrated their potential to enhance the monitoring of hydrological dynamics and vegetation succession [28,29]. For the specific task of arid wetland-cropland mosaic classification, the key is to fuse high spatial detail with rich spectral information while preserving spectral fidelity. Among various fusion algorithms, the Gram-Schmidt (GS) method is recognized for its strong spectral preservation in multi-band fusion [30]. Therefore, this study employs the GS method to integrate high-resolution Pleiades panchromatic imagery with PlanetScope multispectral data, generating a fused product designed to optimally support the subsequent weakly supervised classification.
Despite the progress outlined above, fine-grained classification of arid-zone wetlands still confronts the dual challenges of high annotation costs and complex landscapes, and a significant gap remains in the systematic performance evaluation of advanced deep learning models within specific, representative contexts. Therefore, this study proposes a classification framework that integrates multi-source remote sensing data with weakly supervised learning (WSL), aiming to achieve a synergy between classification accuracy and efficiency. Specifically, this study first applies the Gram–Schmidt method to fuse Pleiades (0.5 m panchromatic) and PlanetScope-3 (3 m multispectral) imagery to simultaneously achieve high spatial resolution and rich spectral information, thereby providing a data foundation for robust feature extraction. We then systematically compare, through quantitative experiments, the two paradigms of fully supervised learning (FCN, U-Net, DeepLabV3+, and SegFormer) and weakly supervised learning, represented by the "One Model Is Enough" (OME) framework, in this specific wetland scenario, focusing on the following three core questions: (1) How effective is multimodal remote sensing data fusion in improving the recognition accuracy of boundaries between fine-grained features within the Bosten Lake wetland? (2) How well does the weakly supervised learning approach adapt to the complex wetland environment? (3) What is the optimal model selection strategy and its potential application in monitoring wetlands in arid areas? The objectives corresponding to these questions are: (a) to quantify the accuracy gain from multimodal fusion for fine boundaries; (b) to evaluate the performance and robustness of WSL under label scarcity; and (c) to derive a practical model selection guideline.
The main contribution of this study is threefold: (1) a systematic comparison of the classification performance of fully supervised and weakly supervised learning in complex arid wetland scenarios, which clarifies the relative strengths and weaknesses in identifying land-cover boundaries and decomposing mixed pixels within the wetland-farmland ecotone; (2) the verification of the applicability of weakly supervised learning under scarce annotation conditions in this setting, demonstrating that semantic segmentation methods using image-level labels can achieve accuracy levels comparable to fully supervised methods in fine wetland classification; and (3) a quantitative evaluation of the improvement in fine-grained feature recognition accuracy achieved through the fusion of high-resolution panchromatic and multispectral imagery, confirming the practical value of this method in providing a high-quality data basis for arid zone wetland monitoring as exemplified by our study.

2. Study Area and Dataset Sources

2.1. Study Area

Bosten Lake (86°40′–87°56′ E, 41°56′–42°14′ N) is located in the central part of the Bayin’guoleng Mongol Autonomous Prefecture, Xinjiang, China. It is the largest inland freshwater lake in China’s arid zone and serves as a critical ecological barrier for the Tarim Basin [31]. The lake area is flat and contains a well-developed water system, primarily recharged by the Kaidu and Peacock Rivers, which gives it typical inland wetland ecological characteristics [32]. This study focuses on the northern wetland ecosystem of Bosten Lake as the core experimental area. It systematically analyzes the spatio-temporal heterogeneity and spectral mixing phenomena in transition zones among typical wetland landscapes—including open water, reed communities, degraded reeds, mudflats, farmland, and bare ground—with a specific focus on evaluating the capability of fully supervised and weakly supervised learning paradigms to recognize land classes in these complex scenarios. This study aims to address the challenge of finely segmenting areas with fuzzy boundaries, such as the vegetation-water interspersion zone and the degraded vegetation-bare land transition zone. Collectively, this work provides critical methodological support for the high-precision mapping and dynamic monitoring of arid zone wetland ecosystems (Figure 1).

2.2. Multimodal Remote Sensing Data Sources

In order to take into account the high spatial resolution and the spectral characteristics of different features, two data sources, Pleiades and PlanetScope-3, are selected to construct the multimodal remote sensing dataset for this study (Table 1).
The Pleiades data, provided by Airbus Defence and Space, offer global coverage with a spatial resolution of 0.5 m in the panchromatic band and 2 m in the multispectral (B/G/R/NIR) bands, with an imaging swath of about 20 km, which allows fine delineation of wetland vegetation community structure, the spatial transition zone between water and land, and their heterogeneity [33]. Data acquisition in the summer of 2024 was timed to match the peak vegetation growth period in the study area, enabling the extraction of spectral features from typical ground objects.
The PlanetScope-3 data were provided by Planet Labs, Inc. and contain eight spectral bands with a spatial resolution of 3 m [34]. In this study, Level 3B (L3B) products were used. These products have undergone sensor radiometric correction, Sentinel-2-based geometric alignment, and orthometric correction. Additionally, atmospheric correction was performed using the 6S radiative transfer model with MODIS near-real-time atmospheric parameters. This processing effectively eliminated the influence of aerosols and water vapor, ensured the radiometric consistency of multi-temporal data, and provided a reliable spectral benchmark for subsequent classification.

2.3. Data Preparation and Pre-Processing

This study followed a standardized, multi-tool workflow to systematically prepare raw remote sensing data into datasets suitable for deep learning. All image data were uniformly projected to the WGS_1984_UTM_Zone_45N coordinate system to ensure spatial reference consistency and geo-location accuracy.
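For illustration only, the reprojection step could be scripted as in the sketch below, assuming GeoTIFF inputs and the open-source rasterio library; file names are placeholders, and the study's own preprocessing was carried out with ArcGIS rather than this code.

```python
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

DST_CRS = "EPSG:32645"  # WGS 84 / UTM zone 45N (WGS_1984_UTM_Zone_45N)

# Reproject a multi-band raster to the common UTM 45N grid (illustrative sketch).
with rasterio.open("scene_raw.tif") as src:
    transform, width, height = calculate_default_transform(
        src.crs, DST_CRS, src.width, src.height, *src.bounds)
    profile = src.profile.copy()
    profile.update(crs=DST_CRS, transform=transform, width=width, height=height)

    with rasterio.open("scene_utm45n.tif", "w", **profile) as dst:
        for band in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, band),
                destination=rasterio.band(dst, band),
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=transform,
                dst_crs=DST_CRS,
                resampling=Resampling.nearest)  # nearest keeps label/DN values intact
```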

2.3.1. Field Survey and Sample Labeling

Data preparation followed a standardized workflow from field collection to digital processing. First, a field survey of wetland vegetation was conducted in the summer of 2024 in Heshuo, Hejing, Yanqi Hui Autonomous County and Bohu County in Bayin’guoleng Mongol Autonomous Prefecture, Xinjiang. Handheld Global Positioning System (GPS) devices were used to accurately locate different land use and cover types in the study area, and a total of 697 sample points representing 19 typical feature types were collected. These samples covered both natural land cover types (e.g., roads, settlements, water, reeds, degraded reeds, mudflats, bare ground, grasslands, sparse vegetation, shrubland) and major crop types (e.g., grapes, fennel, sugar beet, wheat, maize, stevia, chili peppers, tomatoes, licorice). Field views of the various feature types are shown in Figure 2. After fieldwork, geospatial data processing was carried out based on the ArcGIS 10.2 platform, which included pre-processing steps such as fusion of multimodal remote sensing data, geometric fine correction, image cropping, and band synthesis. Using the processed high-resolution imagery as a basemap, sample areas were accurately delineated and labeled by integrating field GPS data with visual interpretation guided by expert knowledge. This process established the preliminary labeled dataset.

2.3.2. Data Augmentation Strategies and Dataset Construction

To meet the stringent requirements of deep learning models for fixed-size inputs, the preprocessed data were further processed. Using the PyCharm 2025.2.4 (Community Edition) development environment, the large-size raw images and their corresponding label masks were systematically cropped into 30 × 30 pixel image patches. An overlap sampling strategy was used in the cropping process, setting the overlap area between neighboring image patches to 3 pixels. This strategy has two advantages: first, it generates samples with subtle differences through the sliding window mechanism, enabling efficient data augmentation; second, it significantly alleviates the common issue of decreased prediction accuracy at patch edges, which lays a solid foundation for the seamless mosaicking of the subsequent prediction results, ensuring the final predictions exhibit both overall consistency and spatial continuity.
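The overlap-based cropping described above can be summarized by the short NumPy sketch below; it is an illustrative reconstruction, not the exact project script, and assumes (H, W, C) images with (H, W) label masks and the 27-pixel stride implied by a 30-pixel patch with a 3-pixel overlap.

```python
import numpy as np

def extract_patches(image: np.ndarray, mask: np.ndarray,
                    patch: int = 30, overlap: int = 3):
    """Slide a window over an (H, W, C) image and its (H, W) label mask,
    returning aligned patch pairs; neighbouring patches share `overlap` pixels."""
    stride = patch - overlap          # 30 - 3 = 27 pixels between window origins
    h, w = mask.shape
    pairs = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            img_patch = image[top:top + patch, left:left + patch, :]
            msk_patch = mask[top:top + patch, left:left + patch]
            pairs.append((img_patch, msk_patch))
    return pairs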
To address the class imbalance in the dataset, especially the pronounced shortage of samples in the "grassland" category, a targeted data augmentation strategy was applied. By rotating the original grassland samples by 90°, 180°, and 270°, we tripled the grassland sample count, and these augmented samples were added to the dataset. This approach not only mitigated the class imbalance but also enhanced the model's recognition accuracy and generalization performance for grassland by incorporating rotational invariance.
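A minimal sketch of this rotation-based augmentation is given below, assuming (H, W, C) image patches paired with (H, W) label masks.

```python
import numpy as np

def augment_by_rotation(img_patch: np.ndarray, msk_patch: np.ndarray):
    """Return the 90°, 180°, and 270° rotated copies of an image/label pair,
    used here to enlarge the under-represented 'grassland' class."""
    rotated = []
    for k in (1, 2, 3):  # number of 90° counter-clockwise turns
        rotated.append((np.rot90(img_patch, k=k, axes=(0, 1)),
                        np.rot90(msk_patch, k=k, axes=(0, 1))))
    return rotated
```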
Based on the above processing, a large-scale semantic segmentation dataset was successfully constructed. The dataset contains a total of 9296 paired image-label samples, which were strictly partitioned according to standard machine learning practice into three subsets with an 8:1:1 ratio: the training set contains 7414 samples for model parameter learning and optimization; the validation set contains 941 samples for monitoring model performance during training and tuning hyperparameters to prevent overfitting; the test set also contains 941 samples as an independent set completely withheld from the training process, used for the final objective assessment of the model’s generalization ability and robustness. Detailed statistics on the dataset size and class distribution are shown in Figure 3.

2.3.3. Data Storage and Organizational Structure

In terms of data storage, the original input image is stored in TIFF format to preserve spatial details and spectral information with minimal loss, while the corresponding annotation labels are stored in PNG format as a lossless grayscale image with pixel values ranging from 1 to 19 corresponding to 19 predefined semantic categories. All data are organized in a unified directory structure. Image files are stored in an “images” folder and label files are stored in a “masks” folder. Both of these main folders are further divided into three subdirectories, namely “train”, “val”, and “test”, to ensure that each image file can be accurately matched with its corresponding label file, providing a reliable foundation for model training.
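A representative layout consistent with this description is shown below; the file names are illustrative.

```text
dataset/
├── images/                 # 8-band fused patches, TIFF
│   ├── train/              # e.g., 0001.tif, 0002.tif, ...
│   ├── val/
│   └── test/
└── masks/                  # single-band label masks, PNG, values 1–19
    ├── train/              # e.g., 0001.png, 0002.png, ...
    ├── val/
    └── test/
```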

3. Methodology

3.1. Overview of the Proposed Method

To address the core challenges of high sample annotation costs and insufficient synergistic use of multimodal data in wetland remote sensing classification, this study developed a technical framework. Utilizing the ArcGIS 10.2 and PyCharm 2025.2.4 (Community Edition) platforms, it integrates remote sensing data from Pleiades and PlanetScope-3 satellites through the Gram–Schmidt fusion method with a weakly supervised learning method to achieve high-precision classification using only a small number of annotated samples. The technical process aims to provide an efficient and reusable solution for wetland feature classification. In the data pre-processing stage, the cloud-free high-quality image acquired in June 2024 was selected, and the Gram–Schmidt method was used to fuse the 0.5 m panchromatic data from Pleiades satellite with the 3 m multispectral data from PlanetScope-3 satellite to generate high-precision fused imagery that possesses both high spatial resolution and multi-band spectral information, thereby laying a data foundation for the subsequent fine classification of wetlands. For sample preparation, a standardized sample database covering typical ground object categories was constructed using systematic field surveys, supplemented by manual visual interpretation and GPS verification. This process ensures the representativeness of the samples in terms of category and spatial distribution, providing a reliable basis for model training as well as accuracy verification. In terms of classification model construction, this study compares and analyzes the classification performance of multiple traditional supervised classifiers (FCN, U-Net, DeepLabV3+, SegFormer) and weakly supervised learning methods represented by OME in different scenarios, and systematically evaluates the performances of each type of method in terms of feature expression, generalization ability and boundary maintenance. The objective is to identify effective paradigms applicable to high-resolution remote sensing classification of wetlands and to provide a reference for studies in similar ecological regions, as shown in Figure 4.

3.2. Semantic Segmentation Methods

To systematically evaluate the performance of semantic segmentation under different supervision paradigms, this study selected representative fully supervised and weakly supervised methods for comparison. The selected fully supervised methods include classical architectures such as FCN, U-Net, DeepLabV3+, and SegFormer, all of which rely on pixel-level annotations for end-to-end training (Figure 5). The weakly supervised approach is represented by OME, which generates pseudo-masks based on image-level labels only and follows a two-stage framework of “image-level labeling → pseudo-label generation → segmentation network training”, aiming to explore the feasibility of achieving accurate wetland feature segmentation in the absence of pixel-level annotations (Figure 6). The following sections provide a brief description of the basic principles of the selected methods.

3.2.1. Fully Supervised Methods

  • FCN: The Fully Convolutional Network (FCN) proposed by Long et al. (2015) [12] pioneered end-to-end pixel-level prediction for semantic segmentation. Its core idea is to replace the fully connected layers of a classification network with convolutional layers so that inputs of arbitrary size can be processed. FCN adopts an encoder–decoder structure, where the encoder performs feature extraction and the decoder upsamples feature maps using transposed convolution. A key innovation is the introduction of skip connections, which fuse deep semantic information with shallow spatial details, significantly improving boundary accuracy and laying the foundation for the field.
  • U-Net: Ronneberger et al. (2015) [13] proposed U-Net, a symmetric U-shaped encoder–decoder architecture based on FCN. Its core innovation is the symmetric skip connections, which concatenate the high-resolution features of the encoder layers directly with the corresponding decoder layers, effectively preserving spatial details. This design enables U-Net to achieve accurate localization even with a small amount of labeled data, making it particularly suitable for tasks with limited annotations, such as medical image analysis [35].
  • DeepLabV3+: Chen et al. (2018) [36] proposed DeepLabV3+, a further development of the DeepLab [37] family of models. As an extension of DeepLabV3, the model introduces an encoder–decoder structure and incorporates the Xception backbone network to enhance feature extraction. The encoder side employs the Atrous Spatial Pyramid Pooling (ASPP) to capture contextual information at multiple scales through dilated convolution. The decoder side up-samples high-level semantic features and fuses them with the low-level features of the encoder to balance semantic richness and boundary details. The model significantly improves the segmentation performance of complex scenes by expanding the receptive field while maintaining the resolution through dilated convolution.
  • SegFormer: SegFormer, proposed by Xie et al. (2021) [38], represents a newer, Transformer-based trend. It couples a hierarchical Transformer encoder that models global contextual relationships via self-attention with a lightweight MLP decoder that uniformly upsamples, concatenates, and fuses multi-scale features. The design largely dispenses with convolutional inductive biases and explicit positional encodings, strengthening its ability to capture long-range dependencies, and it excels in both efficiency and accuracy, providing a new paradigm for semantic segmentation.

3.2.2. Weakly Supervised Method

OME (One Model is Enough) [21] is an image-level weakly supervised semantic segmentation framework designed for multiclass land cover classification of remote sensing imagery (RSIs). The framework constructs high-quality, image-level training sample sets by integrating sample filtering and dataset co-occurrence evaluation mechanisms. It then employs Multi-class Activation Maps (Multi-class CAMs) to generate pixel-level weighted masks, significantly reducing noise in pseudo-labels. A key contribution of OME is its uncertainty-driven weighted loss function, which mitigates the impact of label noise in pseudo-masks during model training. This method incorporates a pixel-level weighting mask W, defined as W = 1 − U (where U represents an uncertainty mask), into the standard cross-entropy loss. By dynamically adjusting the loss contribution of each pixel, the weight mask enhances segmentation performance by directing the model’s focus toward high-confidence labeled regions. The specific weight allocation strategy follows Equation (1):
$$Y_{ij} = \begin{cases} 1, & 0.5 < W_{ij} < 1 \\ w_t, & 0.2 < W_{ij} < 0.5,\ t = 0, 1, 2 \\ 0, & 0 < W_{ij} < 0.2 \end{cases} \quad (1)$$
According to this strategy, a pixel label is considered highly reliable and assigned full weight ($Y_{ij} = 1$) if its weight value is high ($0.5 < W_{ij} < 1$). If the weight is medium ($0.2 < W_{ij} < 0.5$), the label is considered uncertain and assigned a reduced weight $w_t$. If the weight is low ($0 < W_{ij} < 0.2$), the label is deemed highly unreliable, and the pixel's contribution to the loss is ignored. The framework ultimately employs a weighted cross-entropy loss for model optimization, significantly enhancing robustness to label noise while maintaining training stability.
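The sketch below illustrates how such an uncertainty-weighted cross-entropy could be implemented in PyTorch. It is a minimal re-implementation of the idea, not the official OME code; the default value of the intermediate weight w_t is an assumed placeholder.

```python
import torch
import torch.nn.functional as F

def weighted_ce_loss(logits, pseudo_labels, uncertainty, w_t=0.5):
    """Pixel-wise weighted cross-entropy in the spirit of OME.

    logits:        (B, C, H, W) raw network outputs
    pseudo_labels: (B, H, W) long tensor of pseudo-masks from multi-class CAMs
    uncertainty:   (B, H, W) uncertainty mask U in [0, 1]; the weight mask is W = 1 - U
    w_t:           reduced weight for medium-confidence pixels (assumed value)
    """
    weight_mask = 1.0 - uncertainty                      # W = 1 - U
    pixel_weights = torch.zeros_like(weight_mask)
    pixel_weights[weight_mask > 0.5] = 1.0               # high confidence: full weight
    mid = (weight_mask > 0.2) & (weight_mask <= 0.5)
    pixel_weights[mid] = w_t                             # medium confidence: reduced weight
    # low-confidence pixels (W <= 0.2) keep weight 0 and are ignored by the loss

    per_pixel_ce = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (per_pixel_ce * pixel_weights).sum() / pixel_weights.sum().clamp(min=1.0)
```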

3.3. Image Fusion Method

In this study, the Gram–Schmidt (GS) method was used to fuse the Pleiades high-resolution panchromatic image (0.5 m) with the PlanetScope-3 multispectral image (3 m, 8 bands), representing a resolution ratio of 6:1. The selection of this method was motivated by its comprehensive advantages in terms of spectral fidelity, spatial detail injection capability, and algorithmic applicability. The GS algorithm effectively preserves the spectral independence and structural integrity of the eight bands via its orthogonal transformation, providing a reliable spectral foundation for subsequent land cover classification and vegetation index inversion. Furthermore, the component replacement strategy effectively integrates the fine spatial details from the Pleiades image, thereby significantly enhancing the spatial resolution of the fused product. Compared with PCA and NNDiffuse methods, the GS algorithm demonstrates a superior balance among spectral preservation, computational efficiency, and detail retention, making it suitable for large-scale data processing. The final output is an 8-band multispectral image with a 0.5 m resolution that fully retains the original spectral information while incorporating fine spatial features, providing a high-quality database for deep learning-based semantic segmentation. A visual comparison of the images before and after fusion is presented in Figure 7.
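To make the component-substitution idea behind GS fusion concrete, the simplified NumPy sketch below operates on co-registered arrays. The mean-based synthetic panchromatic band and the covariance-based injection gains are simplifying assumptions; the study's actual fusion was performed within its ArcGIS-based preprocessing workflow, not with this code.

```python
import numpy as np

def gs_like_pansharpen(ms_up: np.ndarray, pan: np.ndarray) -> np.ndarray:
    """Simplified Gram–Schmidt-style component substitution.

    ms_up: (bands, H, W) multispectral image already resampled to the pan grid
    pan:   (H, W) high-resolution panchromatic band
    Returns a (bands, H, W) fused image.
    """
    # 1. Simulate a low-resolution panchromatic band as the band mean
    #    (operational GS uses sensor-specific weights; the mean is an assumption).
    pan_synth = ms_up.mean(axis=0)

    # 2. Match the real pan band to the synthetic one in mean and standard
    #    deviation so that substitution does not shift the radiometry.
    pan_matched = ((pan - pan.mean()) / (pan.std() + 1e-12)
                   * pan_synth.std() + pan_synth.mean())

    # 3. Inject the spatial detail into each band, scaled by the band's
    #    covariance with the synthetic pan (the injection gain).
    detail = pan_matched - pan_synth
    fused = np.empty_like(ms_up, dtype=np.float64)
    for b in range(ms_up.shape[0]):
        gain = (np.cov(ms_up[b].ravel(), pan_synth.ravel())[0, 1]
                / (pan_synth.var() + 1e-12))
        fused[b] = ms_up[b] + gain * detail
    return fused
```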

3.4. Experimental Design and Setup

3.4.1. Experimental Environment

The experiments in this study were conducted on high-performance workstations. The specific hardware and software configurations are detailed in Table 2. This configuration provided the necessary computational resources for efficient processing of large-scale datasets and ensured the reproducibility of all experiments.

3.4.2. Experimental Parameter Setting

The model hyperparameters were configured to ensure a fair comparison across all experiments, with full details provided in Table 3. In brief, we used the Stochastic Gradient Descent (SGD) optimizer for all models. A key distinction in the loss function was applied: standard Cross-Entropy Loss was used for fully supervised methods, while the weakly supervised OME employed a weighted variant to mitigate noise in its pseudo-labels. Models were trained for 50 epochs, and the checkpoint with the highest mIoU on the validation set was selected for final evaluation.
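For concreteness, a schematic of this shared training procedure is given below. It is a hedged sketch: values not stated in the text or Table 3 (e.g., learning rate and batch size) are placeholders, and the model is assumed to map image batches directly to per-class logits.

```python
import torch
from torch.utils.data import DataLoader

def train_segmentation_model(model, train_set, val_set, num_classes,
                             epochs=50, lr=0.01, momentum=0.9,
                             weight_decay=1e-4, batch_size=16):
    """Shared loop for the fully supervised baselines: SGD optimizer,
    cross-entropy loss, and selection of the checkpoint with the highest
    validation mIoU. Labels are assumed to be 0-indexed long tensors."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=momentum, weight_decay=weight_decay)
    criterion = torch.nn.CrossEntropyLoss()
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)

    best_miou, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()

        # Validation mIoU accumulated from a confusion matrix.
        model.eval()
        conf = torch.zeros(num_classes, num_classes, dtype=torch.long)
        with torch.no_grad():
            for images, masks in val_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                idx = masks.view(-1) * num_classes + preds.view(-1)
                conf += torch.bincount(idx, minlength=num_classes ** 2
                                       ).reshape(num_classes, num_classes)
        inter = conf.diag().float()
        union = conf.sum(0).float() + conf.sum(1).float() - inter
        miou = (inter / union.clamp(min=1)).mean().item()

        if miou > best_miou:  # keep the checkpoint that performs best on validation
            best_miou = miou
            best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}

    model.load_state_dict(best_state)
    return model, best_miou
```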

3.4.3. Evaluation Metrics

To scientifically evaluate the performance and applicability of the deep learning models for feature classification in the Bosten Lake area and to analyze the influence of fusion technology on wetland classification accuracy, we employed the following metrics: accuracy, mean Intersection over Union (mIoU), mean Accuracy (mAcc), recall, precision, and F1-score. The corresponding confusion matrix is shown in Table 4.
Accuracy (Acc)
Accuracy is the proportion of total predictions that a model correctly makes, reported as a percentage. It is computed as the number of correct predictions divided by the total number of predictions, as defined in Equation (2).
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \quad (2)$$
Mean Accuracy (mAcc)
The mean accuracy is the arithmetic mean of the per-class accuracy values and is used to assess the model’s average predictive performance across all classes, expressed as a percentage. It is calculated by first determining the accuracy for each class and then averaging these values. This approach provides a balanced measure of the model’s overall classification ability and helps mitigate assessment bias caused by class imbalance. The mathematical expression is provided in Equation (3).
$$mAcc = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \times 100 \quad (3)$$
Mean Intersection over Union (mIoU)
Intersection over Union (IoU), for a given class, is the ratio of the area of overlap between the predicted and ground truth pixel sets to the area of their union. The mean Intersection over Union (mIoU) is calculated by averaging the IoU values across all classes in a semantic segmentation task, yielding a percentage value. This metric provides a robust measure of the model’s overall segmentation accuracy. The mathematical expression is provided in Equation (4).
$$mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i} \times 100 \quad (4)$$
In Equations (3) and (4), N denotes the number of categories.
Recall
Recall measures a model’s ability to identify all relevant positive instances, calculated as the ratio of correctly predicted positive observations to all actual positives, presented as a percentage. A higher recall value indicates that the model misses fewer positive class samples. The mathematical expression is provided in Equation (5).
$$Recall = \frac{TP}{TP + FN} \times 100 \quad (5)$$
Precision
Precision, also referred to as user’s accuracy, measures the proportion of samples that truly belong to the positive class among all samples predicted as positive by the model, given in percentage form. A high precision value indicates that the model’s positive identifications are reliable. The mathematical expression is provided in Equation (6).
$$Precision = \frac{TP}{TP + FP} \times 100 \quad (6)$$
F1-Score
F1 Score combines the precision and recall metrics of the model and is defined as the harmonic mean of the model’s precision and recall, multiplied by 100 for percentage reporting. It represents a balance between the model’s precision and recall. The F1 score is high only when both precision and recall are high, making it a robust metric for imbalanced datasets. The mathematical expression is provided in Equation (7).
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \times 100 = \frac{2TP}{2TP + FP + FN} \times 100 \quad (7)$$
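All six metrics can be derived from a single multi-class confusion matrix. The NumPy sketch below is illustrative: it assumes 0-indexed class labels, computes overall accuracy as the multi-class counterpart of Equation (2) (correct pixels over all pixels), and macro-averages the per-class precision, recall, and F1 values.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray) -> dict:
    """Compute the metrics of Section 3.4.3 from an N x N confusion matrix
    whose rows are ground-truth classes and columns are predicted classes."""
    total = conf.sum()
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as class i but belonging elsewhere
    fn = conf.sum(axis=1) - tp          # class i pixels that were missed
    tn = total - tp - fp - fn

    acc = tp.sum() / total * 100                                    # overall accuracy, cf. Eq. (2)
    macc = np.mean((tp + tn) / (tp + tn + fp + fn)) * 100           # Eq. (3)
    miou = np.mean(tp / np.maximum(tp + fp + fn, 1)) * 100          # Eq. (4)
    recall = np.mean(tp / np.maximum(tp + fn, 1)) * 100             # Eq. (5), macro-averaged
    precision = np.mean(tp / np.maximum(tp + fp, 1)) * 100          # Eq. (6), macro-averaged
    f1 = np.mean(2 * tp / np.maximum(2 * tp + fp + fn, 1)) * 100    # Eq. (7), macro-averaged
    return dict(Acc=acc, mAcc=macc, mIoU=miou,
                Recall=recall, Precision=precision, F1=f1)
```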

4. Results

4.1. Comparative Analysis of Overall Accuracy

Evaluation based on various accuracy metrics (Table 5 and Figure 8) reveals significant performance differences among the five models in the complex classification task of the Bosten Lake wetland, highlighting the varying capabilities of different network architectures for semantic segmentation. The SegFormer model, based on the Transformer architecture, achieved the best performance across all core metrics: an accuracy of 98.75%, a mIoU of 95.33%, a mAcc of 97.36%, a precision of 97.50%, a recall of 97.31%, and an F1-score of 97.47%. This indicates that the model exhibits strong extraction capability and consistency across all feature classes, demonstrating the excellent adaptability of its powerful global contextual modeling to complex wetland environments. DeepLabV3+ and U-Net, as representative convolutional neural network-based models, achieved the second-best results. While they were clearly outperformed by SegFormer, their metrics were still significantly higher than those of the traditional baseline model FCN. Among them, DeepLabV3+ achieved a slightly higher mIoU of 86.58% compared to U-Net’s 85.65%, and also showed marginal improvements in Precision and Recall. This demonstrates the consistent, albeit modest, advantage of its multi-scale atrous convolution module in contextual information fusion. In contrast, FCN, as the benchmark fully supervised method, had the lowest scores, reflecting its limitations in handling complex spectral-spatial structures and multi-class fine classification tasks.
The weakly supervised model OME demonstrates significant advantages under limited sample conditions. Results show that OME performs comparably to the optimal fully supervised model, SegFormer, in terms of Accuracy (98.76%) and Recall (96.20%), indicating excellent target discovery capability and regional coverage completeness with a very low omission rate. Although its mIoU (87.94%) and precision (91.72%) remain slightly below the top fully supervised models, its overall classification performance is significantly better than the traditional fully supervised baseline FCN and is comparable to the widely used U-Net and DeepLabV3+. This finding underscores the important research value and considerable application potential of weakly supervised learning methods for high-precision remote sensing feature classification.
A comparison of the evaluation metrics (Figure 8a) reveals that the mean Intersection over Union (mIoU) is consistently the lowest metric for each model, while accuracy remains the highest. This discrepancy stems from their distinct computational properties. Accuracy measures the overall rate of correct pixel classification and is dominated by large, homogeneous categories (e.g., open water, bare ground), making it relatively insensitive to boundary errors and to misclassification of smaller categories. mIoU, in contrast, accounts for both the accuracy and the completeness of the segmentation and is far more sensitive to boundary misclassification and to confusion among minority categories. For every model, fully and weakly supervised alike, mIoU is significantly lower than accuracy, precision, and recall. In particular, for the weakly supervised model OME, the gap between mIoU (87.94%) and accuracy (98.76%) exceeds 10 percentage points, which is much larger than the corresponding gaps in the fully supervised models. Even for the best-performing SegFormer model, whose accuracy reaches 98.75%, an in-depth analysis of the per-class IoU values shows that certain difficult categories (e.g., crops with similar spectral features) still have relatively low IoU, which lowers the overall mIoU. This characteristic makes mIoU a rigorous metric for evaluating segmentation quality, as it more accurately reflects a model's practical capability in complex environments. The high accuracy values are mainly due to correct classification of large homogeneous areas, while the relatively low mIoU reveals deficiencies in resolving fine details and precisely delineating boundaries.
The loss curves of the training process further corroborate these performance differences. As shown in Figure 8b, after 50 epochs of training the loss of each model converges, but the optimization processes and convergence trajectories differ markedly. From top to bottom, the loss curves rank as FCN, U-Net, and DeepLabV3+, with SegFormer and OME converging at the lowest level. This ranking is highly consistent with the models' performance on the test set.
Specifically, among the fully supervised models, SegFormer and DeepLabV3+ show relatively better optimization characteristics. The loss function of SegFormer and DeepLabV3+ decreases the fastest and eventually stabilizes at a lower value level, indicating a more efficient and stable optimization process and stronger feature learning capability. In contrast, the loss curve of FCN always stays in the higher value range, fluctuates significantly during training, converges slowly, and has the highest final convergence value, consistent with its poorer generalization performance. The overall loss level of U-Net, although better than FCN, is still significantly higher than DeepLabV3+ and SegFormer, showing some optimization deficiencies. The weakly supervised model OME, however, exhibits unique convergence behavior. Its loss values remain in a very low range throughout training, with a starting value significantly lower than all fully supervised models. It converges rapidly in the early stages and then remains nearly flat, indicating a distinct learning dynamic. This phenomenon arises from both the intrinsic differences in the objective functions between weakly and fully supervised learning and OME’s ability to efficiently utilize image-level signals to rapidly approach its optimization target. Despite OME’s extremely low loss value, its performance gap on fine-grained metrics such as mIoU suggests that there is still room for optimization of weakly supervised learning for pixel-level localization tasks.

4.2. Comparative Analysis of Overall Classification Effectiveness

Based on the overall classification results (Figure 9), the macroscopic performance of each model over the large-scale space of Bosten Lake region is systematically compared. Figure 9a–f present the overall classification results for the original remote sensing image and the five models (FCN, U-Net, DeepLabV3+, SegFormer, OME), respectively. Overall, all models can achieve the basic distinction among the 19 major feature classes, but there are significant differences in semantic consistency, boundary accuracy, and noise control.
Specifically, the classification results of the FCN model exhibit obvious “salt-and-pepper noise” and feature fragmentation. Large systematic misclassifications also occur, such as misclassifying continuous sparse vegetation as “stevia” or “grassland,” and misclassifying large crop areas as “licorice” or “grassland.” The boundaries of the extracted features are fuzzy. Only features with distinct spectral characteristics, such as open water, are well-classified, making the results suitable only for preliminary reference. The U-Net model showed improved overall performance, successfully distinguishing most feature boundaries. However, significant noise from multiple feature classes persisted within regions. Local misclassifications, such as confusing sparse vegetation with “grape” and “settlement”, coupled with insufficient boundary preservation in fine-grained landscapes like settlements, indicate limitations in modeling semantic consistency within complex scenes. In contrast, the SegFormer and DeepLabV3+ models show optimal macro-classification performance, with complete and continuous patches, clear boundaries, and a significant reduction in misclassifications and noise compared to the other models, reflecting the strong spatial context modeling ability and the precise semantic parsing ability of complex wetland landscapes.
The weakly supervised OME model produces a visually coherent map with well-defined boundaries for features like settlements and roads. However, a detailed comparison reveals a distinct behavioral pattern: OME tends to over-classify extensive agricultural areas as a single dominant crop class, whereas the fully supervised models (SegFormer, DeepLabV3+) correctly delineate a more heterogeneous crop mosaic. This discrepancy highlights a key characteristic—and current limitation—of the image-level supervision paradigm. Without pixel-level guidance, OME may learn an oversimplified spatial prior for dominant classes, favoring large, homogeneous patches over fine-grained inter-class distinctions. Despite this bias in labeling large uniform regions, OME’s ability to maintain high overall semantic consistency and achieve competitive quantitative metrics remains noteworthy, demonstrating its potential as a data-efficient solution where exhaustive pixel annotation is unavailable.

4.3. Detailed Performance Analysis of Typical Features

This study provides a detailed performance analysis of typical features by examining the classification details in localized areas (Figure 10). The analysis focuses on three major categories: natural and impervious surfaces, wetland vegetation and hydrological systems, and agricultural crops. The first column shows the original remote sensing image, and the second to sixth columns present the segmentation results of the five models (FCN, U-Net, DeepLabV3+, SegFormer, and OME) for the same localized areas. Specifically, Regions I–III and IX–XI focus on the classification of natural and impervious surfaces (bare ground, sparse vegetation, grasslands, shrubland, settlements, roads); Regions IV–VIII focus on the differentiation of wetland vegetation and hydrologic systems (reeds, degraded reeds, water, mudflats); and Regions XII–XVI analyze the classification of various crops (wheat, maize, grapes, fennel, stevia, chili peppers, tomatoes, licorice, and sugar beet) in depth.
By systematically comparing the performance of each model in terms of boundary integrity, category consistency, and misclassification, the discriminative ability of different models for spectrally characteristically similar features, as well as their performance in semantic understanding in complex wetland-agricultural hybrid landscapes, is revealed. This multi-level comparison from the global level to the local level provides a solid basis for a comprehensive assessment of the applicability of the model in the real geographic context.

4.3.1. Comparative Analysis of Per-Class Performance

To move beyond aggregate metrics and identify the specific strengths and weaknesses of each model, we conducted a per-class performance analysis using F1-score (Table 6). The F1-score heatmap (Figure 11) visually summarizes the performance hierarchy across all 19 land cover classes.
The analysis reveals that among the 19 land cover classes, only approximately 10.5% present a significant risk of confusion. The grassland category emerges as the most problematic. While a tendency for over-prediction is already evident in the fully supervised models, this issue escalates into a systematic failure in the weakly supervised OME model, which achieves an F1-score of merely 37.38%, a deficit of 51.49 percentage points relative to the best fully supervised model, SegFormer. This indicates that the model has a strong propensity to misclassify spectrally ambiguous regions as grassland. The phenomenon underscores a fundamental limitation inherent to weakly supervised approaches when dealing with feature-ambiguous categories: when the target class shares high spectral and textural similarity with other features such as roads or wheat, reliance on image-level labels alone is insufficient to guide the model toward pixel-level discriminative characteristics.
Settlements exhibit a different pattern. Fully supervised models including U-Net and DeepLabV3+ tend to capture only prototypical settlement features while overlooking morphological variations, frequently confusing them with spectrally similar classes such as roads. Notably, OME performs well on settlements, achieving an F1-score of 92.05% and ranking third among all models. This suggests that for structurally complex categories, weakly supervised methods may have an advantage due to their capacity for holistic semantic understanding.
The performance of OME exhibits a clear dichotomy; it excels on classes with distinct spectral–spatial features but underperforms on confusable ones. This highlights the sensitivity of weakly supervised learning to inter-class similarity and its lower stability versus fully supervised methods. Furthermore, SegFormer leads in multiple challenging categories, suggesting its superior capability in modeling fine-grained distinctions. In contrast, the chili pepper class shows high consistency across all models, demonstrating that distinctive spectral features can significantly mitigate classification difficulty.
In summary, this study delineates a clear applicability boundary for weakly supervised methods in remote sensing land cover classification. While these methods can substantially reduce annotation costs and deliver strong performance for the majority of classes, they may fail completely on feature-ambiguous, easily confusable categories. Therefore, future research should develop hybrid supervision strategies. These strategies would employ weak supervision for the majority of spectrally distinct classes, while for a minority of critical yet challenging categories, they would integrate pixel-level annotations with targeted model refinements. This approach aims to achieve an optimal balance between classification accuracy and annotation efficiency.

4.3.2. Analysis of Classification Details for Natural and Impervious Surfaces

In the task of categorizing natural surfaces versus impervious surfaces (including bare ground, sparse vegetation, grasslands, shrubland, settlements, and roads), the models demonstrated significant differences in capability (regions I–III & IX–XI).
For road recognition (region I), the U-Net and OME models perform best in extracting linear features, maintaining road continuity and boundary integrity; DeepLabV3+ and SegFormer achieve high overall classification accuracy, but their road edges are fragmented and noisy, indicating a deficiency in local feature modeling. In bare ground identification (region II), FCN showed serious confusion, misidentifying extensive bare ground as stevia, which exposes its limitations in differentiating spectrally similar features. OME, despite strong overall performance, still confused bare ground with sparse vegetation. In contrast, U-Net, DeepLabV3+, and SegFormer were more stable in bare ground recognition and accurately distinguished bare ground from the surrounding vegetation types. In the extraction of settlement details (region IX), the OME model clearly outlined buildings and recognized small gaps, whereas the other models were relatively coarse. In the area where settlements and roads are intertwined (region X), OME accurately distinguished the two fine feature types and maintained clear boundaries, while the remaining models showed more misclassification and boundary blurring. In the mixed area of bare land, settlements, and shrubland (region XI), DeepLabV3+ was optimal in category discrimination, with almost no obvious misclassification, while OME was superior in detail accuracy with finer boundaries, reflecting the trade-off between categorical accuracy and spatial precision.
Overall, U-Net is stable in linear feature recognition; DeepLabV3+ stands out in category differentiation accuracy; SegFormer has high overall accuracy but slightly less detail; and OME demonstrates clear advantages in detail preservation and complex scene processing. These findings suggest that the model should be selected according to the specific application requirements. OME is recommended for fine boundary extraction, and DeepLabV3+ or SegFormer is suitable for focusing on the accuracy of large-scale categorization.

4.3.3. Analysis of Classification Details for Wetland Vegetation and Hydrologic Systems

The task of classifying wetland vegetation and hydrologic systems (including reeds, degraded reeds, water, and mudflats) was a key component in testing the ability of the models to discriminate subtle spectral differences (corresponding to regions IV to VIII). All models were able to extract open water with high accuracy, but showed significant differences in distinguishing spectrally similar features such as reeds, degraded reeds, and mudflats.
Through detailed analysis of typical areas (regions IV–VIII, marked by red circles), FCN showed serious confusion in the classification of degraded reeds, often systematically misclassifying them as spectrally similar categories such as grassland, stevia, or maize, demonstrating a poor ability to discriminate the fine spectral features of wetland vegetation. SegFormer and DeepLabV3+ performed better at maintaining the overall structural boundaries of reed beds; SegFormer was more accurate in vegetation categorization, although its results exhibited noticeable pixelation, while DeepLabV3+ misclassified some reeds as shrubland. U-Net performed best in the fine distinction of wetland vegetation, especially in recognizing reeds and degraded reeds, clearly separating their subtle spectral differences while maintaining high semantic consistency. OME performed poorly on such tasks, misclassifying a large number of degraded reeds as sparse vegetation with significant under-detection, highlighting a key challenge for weakly supervised methods: recognizing fine spectral differences without pixel-level labels. However, OME achieved the highest accuracy in distinguishing mudflats from water and accurately identified transition boundaries, and DeepLabV3+ and OME performed best in water body boundary extraction, accurately delineating complex shorelines.
In summary, U-Net demonstrates obvious advantages in the fine classification of wetland vegetation; OME shows the highest accuracy in the distinction between mudflats and water; and DeepLabV3+ performs well in water body boundary extraction. This result reveals the differentiated response of different models to wetland environmental features and provides a basis for model selection for targeted feature types. U-Net is preferred for vegetation fine classification, while DeepLabV3+ or OME is suitable for water body-beach boundary extraction.

4.3.4. Analysis of Classification Details for Various Crops

Crop classification posed the greatest challenge in this study (corresponding to regions XII to XVI), primarily due to the high spectral similarity among crops such as wheat, maize, grapes, fennel, stevia, chili peppers, tomatoes, licorice, and sugar beet. This spectral ambiguity places extremely high demands on the models’ fine-grained discrimination capabilities. A systematic evaluation of the models’ performance in this scenario revealed clear differences in their ability to capture subtle spectral variations and interpret spatial contextual features.
In the mixed cropping region (region XII), both U-Net and DeepLabV3+ misclassified chili peppers as tomatoes, while all models except OME systematically misclassified wheat as licorice. In terms of overall recognition accuracy, SegFormer and OME performed best; in terms of field structure preservation, U-Net and OME were most effective in boundary extraction, with OME performing particularly well in wheat recognition. In the tomato recognition task (region XIII), most models were inaccurate: FCN and U-Net misidentified tomato as licorice and reed, whereas OME performed best in tomato identification, although it occasionally misidentified some chili peppers as maize. In grape classification (region XIV), all models recognized the main grape areas, but local misclassifications remained: FCN misclassified small amounts of grapes as grassland, U-Net as roads, and SegFormer confused grapes with chili peppers. In contrast, DeepLabV3+ and OME demonstrated the best overall performance for grape classification, with OME being particularly effective in maintaining field boundaries. In the distinction between maize and chili peppers (region XV), all models exhibited some degree of confusion, with FCN showing the highest error rate and DeepLabV3+ the best results; FCN and U-Net also misclassified some stevia as fennel. For maize, chili pepper, and stevia, DeepLabV3+ achieved the best recognition results, with almost no noise and clear boundaries. For sugar beet recognition (region XVI), FCN misclassified chili peppers as sugar beet, OME made the reverse error, and DeepLabV3+ had the highest accuracy.
Overall, DeepLabV3+ has the best overall performance in crop classification, especially in the classification of easily confused crops such as maize, pepper, stevia and sugar beet, and its classification results have clear boundaries and very low noise. This performance reflects the significant advantages of multi-scale feature fusion architectures in agricultural remote sensing applications. Surprisingly, the weakly supervised model OME performs well, not only in wheat and tomato recognition with the highest accuracy, but also in field boundary preservation, proving the potential of weakly supervised methods in crop structure extraction tasks. The SegFormer model, although with higher overall accuracy, still suffers from local spectral confusion. The traditional models FCN and U-Net face greater challenges and generally suffer from systematic misclassification problems, especially when dealing with spectrally similar crop pairs such as chili-tomato, wheat-licorice, and stevia-fennel, with high error rates, highlighting the limitations of their context-awareness capabilities.

4.4. Analysis of Ablation Experiments

We conducted rigorous ablation experiments (Table 7) to systematically validate two key aspects: the effectiveness and necessity of fusing Pleiades and PlanetScope-3 multimodal remote sensing data, and the performance differences between model architectures under fully supervised versus weakly supervised paradigms. All models were trained under the same conditions and tested multiple times to mitigate the effects of random variations, and the results were averaged.
The experimental results demonstrate that multimodal data fusion, combined with advanced model architectures, contributed significantly to performance gains. Among them, the SegFormer architecture and the weakly supervised OME method delivered notably outstanding performance. Regardless of model architecture, models trained on fused Pleiades and PlanetScope-3 data significantly outperform their counterparts trained on a single data source. In the case of DeepLabV3+, for example, its F1-score on the single PlanetScope-3 data was 78.42%, while it improved to 92.04% on the fused data, representing an increase of 13.62 percentage points. This finding demonstrates that multimodal data fusion effectively compensates for the limitations of individual images in terms of spatial and temporal resolution and spectral features. By enhancing the model’s feature representation capability through complementary information, it significantly improves segmentation accuracy.
In addition, the choice of model architecture had a significant impact on segmentation performance. Models with traditional encoder–decoder architectures (e.g., FCN and U-Net) performed poorly on the PlanetScope-3 data, which present complex feature characteristics; U-Net, for instance, achieved an F1-score of only 31.02%, indicating its limitations in handling complex remote sensing scenes. In contrast, the Transformer-based SegFormer model achieved an F1-score of 97.42% on the fused data, significantly outperforming the baseline models. This demonstrates that SegFormer, with its global attention mechanism and multi-scale feature fusion capability, can efficiently capture long-range dependencies and contextual information in remote sensing images, making it well suited to interpreting complex and diverse scenes.
Notably, on the fused data, the weakly supervised OME model achieved an F1-score of 92.84% using only image-level annotations, which not only significantly outperforms traditional fully supervised methods such as FCN and U-Net but is also comparable to powerful fully supervised models such as SegFormer (97.42%). This finding demonstrates that weakly supervised learning can greatly reduce labeling cost while maintaining highly competitive performance, offering a practical solution to the challenge of scarce labeled data in remote sensing.
Further analysis of the interaction between data and models reveals a clear coupling between model capability and data richness. Lower-performing models benefit more from multimodal data fusion; U-Net's mIoU, for example, improved by as much as 63.39% on the fused data. In contrast, a more capable model such as SegFormer achieves solid performance even on a single data source (F1 = 80.21% on PlanetScope-3) and near-saturated performance on the fused data. This pattern suggests a complementary relationship between model capability and data richness: strong architectures exploit the information in rich data more fully, while data fusion can partially compensate for limited model expressiveness.
In summary, these ablation experiments show that in order to achieve optimal remote sensing image segmentation performance, advanced model architectures such as SegFormer should be adopted and combined with multimodal data fusion strategies. Meanwhile, the OME weakly supervised approach shows great application potential and provides a new technical path for deploying high-performance models in real-world scenarios where the annotation cost is limited. These findings provide important guidance for the practical application of intelligent interpretation of remote sensing images.

5. Discussion

5.1. Evolution of the Model Architecture

The results of this study clearly depict a path of model performance improvement with architecture evolution: FCN → U-Net/DeepLabV3+ → SegFormer. This improvement is not accidental but stems from a qualitative leap in the model’s feature representation capability, which aligns with the overall development trend in remote sensing semantic segmentation.
The traditional FCN model [12] is limited by its local convolution operations and coarse upsampling, which inevitably generate "salt-and-pepper" noise and intra-class inconsistency in scenes with fragmented features and complex boundaries. U-Net [13] mitigates this loss of detail to some extent by introducing skip connections and better preserves the boundaries of linear and structured features such as roads and settlements, a capability that has led to its widespread use in medical imaging and remote sensing segmentation. For example, Yan et al. [39] improved the U-Net architecture and effectively increased the accuracy of forest-land remote sensing classification by fusing shallow details with deep semantic features through its skip-connection structure. DeepLabV3+ [36] strengthens the model's ability to capture multi-scale features through the Atrous Spatial Pyramid Pooling (ASPP) module, achieving more robust average per-class accuracy while maintaining effective boundary preservation and demonstrating the value of multi-scale feature fusion in complex geographic environments [40].
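As a concrete illustration of the multi-scale mechanism described above, the following is a minimal sketch of an ASPP-style block in PyTorch; the dilation rates and channel sizes are assumptions for illustration and do not reproduce the exact DeepLabV3+ configuration used in this study.

```python
# Minimal ASPP-style block: parallel atrous convolutions capture context at
# several scales (illustrative sketch, not the study's exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates]
        )
        self.image_pool = nn.Sequential(           # global context branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

# e.g. ASPP(2048, 256)(torch.randn(1, 2048, 32, 32)).shape -> (1, 256, 32, 32)
```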
However, the most significant performance jump in this experiment came from SegFormer [38], which is based on the Transformer architecture. Its exceptionally high scores across nearly all metrics demonstrate the significant advantage of the Self-Attention (SA) mechanism in modeling the global contextual information of remote sensing images. Our findings are consistent with recent studies by Roy et al. [41] and Jamali et al. [42], both of which reported the superior performance of the Transformer architecture in remote sensing scene understanding tasks. Feature distributions in wetland ecosystems are strongly spatially correlated (e.g., water vs. mudflats, reeds vs. degraded reeds). SegFormer is able to efficiently capture the long-range dependencies between all pixels in the whole image, thus achieving deep semantic understanding of complex scenes and reducing local misclassifications caused by CNN-based models due to limited receptive fields. This insight highlights a future direction for remote sensing visual model design: the need to more efficiently fuse local detailed features with global contextual information [43].
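To make the global-context argument concrete, the snippet below sketches plain multi-head self-attention over image patch tokens in PyTorch. It is illustrative only: SegFormer's encoder uses a more efficient attention variant, and the patch grid and embedding size here are assumptions.

```python
# Global self-attention over patch tokens: every token can attend to every
# other token in the scene, unlike the limited receptive field of a CNN.
import torch
import torch.nn as nn

patches = torch.randn(1, 64 * 64, 96)   # (batch, tokens, embed_dim): a 64 x 64 patch grid
attn = nn.MultiheadAttention(embed_dim=96, num_heads=4, batch_first=True)

# A reed patch on one lakeshore can attend to water or mudflat patches anywhere.
out, weights = attn(patches, patches, patches)
print(out.shape, weights.shape)          # (1, 4096, 96) and (1, 4096, 4096)
```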

5.2. Potential for Weakly Supervised Learning

An important finding of this study is that the weakly supervised OME model delivers highly competitive performance, matching or approaching the best fully supervised methods in overall metrics and excelling at identifying certain land cover types. This result is of considerable practical significance: it echoes the potential that weakly supervised learning has shown for natural images [44] and is validated here, for the first time, in a complex wetland scenario such as Bosten Lake. It demonstrates that image-level labels alone (rather than pixel-level annotations) can drive a model to learn highly discriminative feature representations, offering a way to free researchers from the burden of costly pixel-level annotation and greatly improving the feasibility of deep learning applications in remote sensing. This success is consistent with the class activation map (CAM) principle proposed by Zhou et al. [17] and its subsequent refinements [45], which show that image-level labels are sufficient to guide a model toward the most discriminative regions.
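For readers unfamiliar with the CAM principle cited above, the following minimal PyTorch sketch shows how image-level supervision alone can yield a coarse localization map. The toy backbone, tile size, and 19-class head are assumptions for illustration and do not reproduce OME's architecture.

```python
# CAM idea in brief: train a classifier with global average pooling on
# image-level labels, then re-project its class weights onto the feature maps
# to localize the most discriminative regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(                 # toy convolutional feature extractor
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
)
classifier = nn.Linear(128, 19)           # 19 classes, trained with image-level labels only

x = torch.randn(1, 3, 64, 64)             # one image tile
feats = backbone(x)                       # (1, 128, 64, 64)
logits = classifier(F.adaptive_avg_pool2d(feats, 1).flatten(1))

# CAM for the predicted class c: weighted sum of feature maps using c's weights.
c = logits.argmax(dim=1).item()
cam = torch.einsum("k,bkhw->bhw", classifier.weight[c], feats)
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
print(cam.shape)                          # (1, 64, 64) coarse localization map
```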
However, OME also exposes the inherent limitations of weakly supervised learning, consistent with the consensus in the field. Its suboptimal performance in fine-grained wetland vegetation classification stems primarily from the absence of pixel-level localization guidance in weak supervisory signals, which leaves the model insensitive to subtle spectral differences. This limitation is widely documented in previous studies [46,47] and indicates that while current weakly supervised methods are maturing for image classification tasks, they still face significant challenges in achieving high accuracy at the level of pixel-wise localization.
The performance of OME, particularly its competitive results on many classes, must also be interpreted within the context of our multimodal data fusion strategy. The fusion of high-spatial-resolution Pleiades and rich-spectral PlanetScope data created a feature space with enhanced separability. This likely amplified the saliency of classes with distinct spectral-spatial signatures, making them more easily localized from image-level cues alone, thereby allowing OME to perform on par with fully supervised models on these categories. Conversely, OME’s failure on spectrally ambiguous classes like ‘Grassland’—despite the fused data—underscores a more fundamental limitation: multimodal fusion can improve feature discriminability, but it cannot fully compensate for the absence of pixel-level guidance when classes inherently overlap in the feature space. This reveals a conditional efficacy of data fusion for WSL: it primarily aids in the discrimination of already-separable classes rather than resolving core ambiguities intrinsic to weak supervision.
Future research should therefore advance toward more refined weakly supervised paradigms that directly address the core challenge of generating high-fidelity pseudo-labels. Key directions include: refining initial class activation maps through advanced attention mechanisms [48]; enhancing the localization capability of transformer-based models via contrastive token learning [49]; developing dedicated architectures for sparse point-supervised segmentation [50]; and designing task-specific modules that encode domain-aware priors for applications such as post-disaster analysis [51] and fine-grained vegetation mapping. This evolution toward native, context-sensitive frameworks represents a crucial pathway to bridging the persistent gap between weak image-level supervision and pixel-accurate segmentation in remote sensing.

5.3. Gain Effect of Fusion of Data from Multiple Models

The ablation experiments strongly confirmed that fusing multimodal remote sensing data is central to improving wetland classification accuracy, a finding consistent with the general consensus in remote sensing information processing [52]. The high spatial resolution (0.5 m) of the Pleiades data and the richer spectral information (8 bands) of the PlanetScope-3 data are highly complementary, in line with the concept of "spatial-spectral information synergy" emphasized by Xia et al. [53]. The Pleiades data enable precise delineation of feature boundaries and mitigate the mixed-pixel problem of coarser-resolution imagery, while the PlanetScope-3 data enhance the separability of spectrally similar features and compensate for the limited spectral channels of the high-spatial-resolution data.
This performance improvement is generalizable across all models but is most pronounced for those with weaker baseline performance, such as U-Net. This suggests that for models whose architectural limitations create a bottleneck in feature extraction, richer and more diverse data can effectively compensate. This finding corroborates the work of Xie et al. [54] in complex scene categorization, which showed that high-quality data inputs can mitigate the limitations of a model’s architecture.
This insight offers a practical strategy for resource-constrained teams: when state-of-the-art models are inaccessible or computationally prohibitive, investing in multimodal data acquisition and fusion provides a viable path to enhancing existing models. This approach is particularly significant for enabling high-accuracy remote sensing interpretation under computational constraints.
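To illustrate the component-substitution idea behind the Gram–Schmidt fusion used here, the sketch below shows a simplified band-wise detail-injection scheme in NumPy. It assumes co-registered inputs resampled to a common grid and is not the exact implementation applied in this study.

```python
# Simplified component-substitution sketch in the spirit of Gram-Schmidt
# pan-sharpening: inject high-frequency spatial detail from a high-resolution
# band into each (resampled) multispectral band.
import numpy as np

def gs_like_fusion(ms: np.ndarray, pan: np.ndarray) -> np.ndarray:
    """ms: (bands, H, W) multispectral resampled to the pan grid; pan: (H, W)."""
    ms = ms.astype(np.float64)
    pan = pan.astype(np.float64)

    simulated_pan = ms.mean(axis=0)                     # low-resolution intensity
    # Match the real pan to the simulated one (mean/std histogram matching).
    pan_matched = (pan - pan.mean()) / (pan.std() + 1e-12)
    pan_matched = pan_matched * simulated_pan.std() + simulated_pan.mean()

    detail = pan_matched - simulated_pan                # high-frequency spatial detail
    fused = np.empty_like(ms)
    for k in range(ms.shape[0]):
        # Band-wise injection gain from the covariance with the simulated pan.
        gain = (np.cov(ms[k].ravel(), simulated_pan.ravel())[0, 1]
                / (simulated_pan.var() + 1e-12))
        fused[k] = ms[k] + gain * detail
    return fused

# e.g. gs_like_fusion(np.random.rand(8, 256, 256), np.random.rand(256, 256))
```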

5.4. Research Limitations and Future Perspectives

Although this study provides systematic insights into the classification of complex wetlands, some limitations remain that point the way to future research.
First and foremost, it is crucial to acknowledge the geographic specificity of this work. The evaluation metrics, model performance conclusions, and the validated framework are derived from and primarily applicable to the specific study area—the arid zone wetland-cropland mosaic of Bosten Lake, China. The exceptional results achieved under the local spectral, spatial, and ecological conditions cannot be directly extrapolated to wetland ecosystems with fundamentally different characteristics (e.g., coastal mangroves, tropical peatlands, or urban wetlands) without further validation. This constitutes the primary boundary for interpreting the findings of this case study.
Second, although representative, the model comparison in this study does not cover emerging architectures such as Swin Transformer [55] and Mask2Former [56]. Future benchmarking incorporating these models will provide a more comprehensive and up-to-date picture of performance. Third, this study focused on single-time-phase analysis and did not explore the great potential of multi-temporal data in distinguishing phenological features. The introduction of time series analysis, drawing on methods such as the ViT-BiLSTM LUCC Prediction Framework [57], will be the key to improving the accuracy of monitoring the dynamics of crop and wetland vegetation. In addition, although OME performs well in this scenario, its generalization still needs to be verified on different geographic landscapes and sensor data, a necessary step for weakly supervised methods to move from the lab to real-world applications [58]. Finally, the high computational cost of Transformer models hinders their deployment. Future work must therefore pursue efficient architectures, translating general advances in lightweight Transformers [59] into dedicated networks for remote sensing segmentation [60], to enable real-time, on-edge applications.

6. Conclusions

Focused on the Bosten Lake wetland, this case study establishes that integrating multimodal data fusion with weakly supervised learning provides an effective approach for fine-grained wetland-cropland classification in this arid landscape. Key findings substantiate this: the Transformer-based SegFormer achieved the highest accuracy (98.75%) and mIoU (95.33%); the weakly supervised OME model matched this leading accuracy (98.76%) using only image-level labels; and multimodal fusion universally improved performance, most notably increasing U-Net’s mIoU by 63.39%.
These results and the validated framework are primarily applicable to arid zone wetland mosaics with characteristics similar to the study area. They offer a practical, cost-effective strategy for high-precision monitoring in such contexts. Future work is essential to rigorously evaluate the framework’s transferability to other ecosystem types and to integrate temporal analysis for dynamic assessment.

Author Contributions

Conceptualization, J.Z. and A.S.; methodology, J.Z. and A.S.; software, J.Z.; validation, J.Z. and A.S.; formal analysis, J.Z.; investigation, J.Z., E.Z. and W.L.; resources, A.S.; data curation, J.Z. and A.S.; writing—original draft preparation, J.Z.; writing—review and editing, J.Z., A.S. and E.L.; visualization, J.Z.; supervision, A.S.; project administration, A.S.; funding acquisition, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 42371389), the Tianshan Talent Development Program (Grant No. 2022TSYCCX0006), and the Western Young Scholars Project of the Chinese Academy of Sciences (Grant No. 2022-XBQNXZ-001).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Available from the corresponding author upon reasonable request.

Acknowledgments

The authors are particularly grateful to all researchers and institutions for providing data support for this study. The authors would also like to thank the editors and reviewers for their valuable comments, which significantly helped improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mitsch, W.; Gosselink, J. Wetlands; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  2. Davidson, N.C.; Fluet-Chouinard, E.; Finlayson, C.M. Global extent and distribution of wetlands: Trends and issues. Mar. Freshw. Res. 2018, 69, 620–627. [Google Scholar] [CrossRef]
  3. Mitsch, W.J.; Gosselink, J.G. The value of wetlands: Importance of scale and landscape setting. Ecol. Econ. 2000, 35, 25–33. [Google Scholar] [CrossRef]
  4. Moreno-Mateos, D.; Power, M.E.; Comín, F.A.; Yockteng, R. Structural and functional loss in restored wetland ecosystems. PLoS Biol. 2012, 10, e1001247. [Google Scholar] [CrossRef]
  5. Tang, X.; Xie, G.; Deng, J.; Shao, K.; Hu, Y.; He, J.; Zhang, J.; Gao, G. Effects of climate change and anthropogenic activities on lake environmental dynamics: A case study in Lake Bosten Catchment, NW China. J. Environ. Manag. 2022, 319, 115764. [Google Scholar] [CrossRef]
  6. Liu, W.; Ma, L.; Abuduwaili, J. Anthropogenic influences on environmental changes of lake Bosten, the largest inland freshwater lake in China. Sustainability 2020, 12, 711. [Google Scholar] [CrossRef]
  7. Wulder, M.A.; Coops, N.C.; Roy, D.P.; White, J.C.; Hermosilla, T. Land cover 2.0. Int. J. Remote Sens. 2018, 39, 4254–4284. [Google Scholar] [CrossRef]
  8. Li, J.; Pei, Y.; Zhao, S.; Xiao, R.; Sang, X.; Zhang, C. A review of remote sensing for environmental monitoring in China. Remote Sens. 2020, 12, 1130. [Google Scholar] [CrossRef]
  9. Xie, Y.; Sha, Z.; Yu, M. Remote sensing imagery in vegetation mapping: A review. J. Plant Ecol. 2008, 1, 9–23. [Google Scholar] [CrossRef]
  10. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  11. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  12. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  14. Zhou, Z.H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2018, 5, 44–53. [Google Scholar] [CrossRef]
  15. Papandreou, G.; Chen, L.-C.; Murphy, K.P.; Yuille, A.L. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1742–1750. [Google Scholar]
  16. Wei, Y.; Liang, X.; Chen, Y.; Shen, X.; Cheng, M.-M.; Feng, J.; Zhao, Y.; Yan, S. Stc: A simple to complex framework for weakly-supervised semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2314–2320. [Google Scholar] [CrossRef] [PubMed]
  17. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  18. Lee, J.; Kim, E.; Yoon, S. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4071–4080. [Google Scholar]
  19. Chang, Y.-T.; Wang, Q.; Hung, W.-C.; Piramuthu, R.; Tsai, Y.-H.; Yang, M.-H. Weakly-supervised semantic segmentation via sub-category exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8991–9000. [Google Scholar]
  20. Wei, Y.; Feng, J.; Liang, X.; Cheng, M.-M.; Zhao, Y.; Yan, S. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1568–1576. [Google Scholar]
  21. Li, Z.; Zhang, X.; Xiao, P. One model is enough: Toward multiclass weakly supervised remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4503513. [Google Scholar] [CrossRef]
  22. Gxokwe, S.; Dube, T.; Mazvimavi, D. Multispectral remote sensing of wetlands in semi-arid and arid areas: A review on applications, challenges and possible future research directions. Remote Sens. 2020, 12, 4190. [Google Scholar] [CrossRef]
  23. Cao, R. Multi-Source Data Fusion for Land Use Classification Using Deep Learning. Doctoral Dissertation, University of Nottingham, Nottingham, UK, 2021. [Google Scholar]
  24. Fauvel, M.; Benediktsson, J.A.; Chanussot, J.; Sveinsson, J.R. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Trans. Geosci. Remote Sens. 2008, 46, 3804–3814. [Google Scholar] [CrossRef]
  25. Schmitt, M.; Zhu, X.X. Data fusion and remote sensing: An ever-growing relationship. IEEE Geosci. Remote Sens. Mag. 2016, 4, 6–23. [Google Scholar] [CrossRef]
  26. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36. [Google Scholar] [CrossRef]
  27. Yokoya, N.; Grohnfeldt, C.; Chanussot, J. Hyperspectral and multispectral data fusion: A comparative review of the recent literature. IEEE Geosci. Remote Sens. Mag. 2017, 5, 29–56. [Google Scholar] [CrossRef]
  28. Shao, Z.; Wu, W.; Guo, S. IHS-GTF: A fusion method for optical and synthetic aperture radar data. Remote Sens. 2020, 12, 2796. [Google Scholar] [CrossRef]
  29. Feng, X.; He, L.; Cheng, Q.; Long, X.; Yuan, Y. Hyperspectral and multispectral remote sensing image fusion based on endmember spatial information. Remote Sens. 2020, 12, 1009. [Google Scholar] [CrossRef]
  30. Laben, C.A.; Brower, B.V. Process for Enhancing the Spatial Resolution of Multispectral Imagery Using Pan-Sharpening. U.S. Patent 6,011,875, 4 January 2000. [Google Scholar]
  31. Tang, Q.; Liu, X.; Zhou, Y.; Wang, P.; Li, Z.; Hao, Z.; Liu, S.; Zhao, G.; Zhu, B.; He, X.; et al. Climate change and water security in the northern slope of the Tianshan Mountains. Geogr. Sustain. 2022, 3, 246–257. [Google Scholar] [CrossRef]
  32. Hao, T.; Lin, L.; Zhengyong, Z.; Guining, Z.; Shan, N.; Ziwei, K.; Tongxia, W. Evaluation on the critical ecological space of the economic belt of Tianshan northslope. Acta Ecol. Sin. 2021, 41, 401–414. [Google Scholar] [CrossRef]
  33. Wei, X.; Yongjie, P. Analysis of new generation high-performance small satellite technology based on the Pleiades. Chin. Opt. 2013, 6, 9–19. [Google Scholar] [CrossRef]
  34. Roy, D.P.; Huang, H.; Houborg, R.; Martins, V.S. A global analysis of the temporal availability of PlanetScope high spatial resolution multi-spectral imagery. Remote Sens. Environ. 2021, 264, 112586. [Google Scholar] [CrossRef]
  35. Hongfei, G.; Yongzhong, Z.; Haowen, Y.; Shuwen, Y.; Qiang, B. Vegetation extraction from remote sensing images based on an improved U-Net model. J. Lanzhou Jiaotong Univ. 2025, 44, 139–146. [Google Scholar] [CrossRef]
  36. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  37. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  38. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar] [CrossRef]
  39. Yan, C.; Fan, X.; Fan, J.; Wang, N. Improved U-Net Remote Sensing Classification Algorithm Based on Multi-Feature Fusion Perception. Remote Sens. 2022, 14, 1118. [Google Scholar] [CrossRef]
  40. Chang, Z.; Li, H.; Chen, D.; Liu, Y.; Zou, C.; Chen, J.; Han, W.; Liu, S.; Zhang, N. Crop Type Identification Using High-Resolution Remote Sensing Images Based on an Improved DeepLabV3+ Network. Remote Sens. 2023, 15, 5088. [Google Scholar] [CrossRef]
  41. Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5515620. [Google Scholar] [CrossRef]
  42. Jamali, A.; Roy, S.K.; Ghamisi, P. WetMapFormer: A unified deep CNN and vision transformer for complex wetland mapping. Int. J. Appl. Earth Obs. Geoinf 2023, 120, 103333. [Google Scholar] [CrossRef]
  43. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep-learning-based semantic segmentation of remote sensing images: A survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 8370–8396. [Google Scholar] [CrossRef]
  44. Zhang, M.; Zhou, Y.; Zhao, J.; Man, Y.; Liu, B.; Yao, R. A survey of semi-and weakly supervised semantic segmentation of images. Artif. Intell. Rev. 2020, 53, 4259–4288. [Google Scholar] [CrossRef]
  45. Jo, S.; Yu, I.-J. Puzzle-cam: Improved localization via matching partial and full features. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 639–643. [Google Scholar]
  46. Li, Z.; Zhang, X.; Xiao, P.; Zheng, Z. On the effectiveness of weakly supervised semantic segmentation for building extraction from high-resolution remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3266–3281. [Google Scholar] [CrossRef]
  47. Zhang, W.; Tang, P.; Corpetti, T.; Zhao, L. WTS: A Weakly towards strongly supervised learning framework for remote sensing land cover classification using segmentation models. Remote Sens. 2021, 13, 394. [Google Scholar] [CrossRef]
  48. Zhang, J.; Zhang, Q.; Gong, Y.; Zhang, J.; Chen, L.; Zeng, D. Weakly supervised semantic segmentation with consistency-constrained multiclass attention for remote sensing scenes. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621118. [Google Scholar] [CrossRef]
  49. Hu, Z.; Gao, J.; Yuan, Y.; Li, X. Contrastive tokens and label activation for remote sensing weakly supervised semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5620211. [Google Scholar] [CrossRef]
  50. Zhao, Y.; Sun, G.; Ling, Z.; Zhang, A.; Jia, X. Point based weakly supervised deep learning for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5638416. [Google Scholar] [CrossRef]
  51. Qiao, W.; Shen, L.; Wang, J.; Yang, X.; Li, Z. A weakly supervised semantic segmentation approach for damaged building extraction from postearthquake high-resolution remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6002705. [Google Scholar] [CrossRef]
  52. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf 2022, 112, 102926. [Google Scholar] [CrossRef]
  53. Xia, J.; Yokoya, N.; Iwasaki, A. Fusion of hyperspectral and LiDAR data with a novel ensemble classifier. IEEE Geosci. Remote Sens. Lett. 2018, 15, 957–961. [Google Scholar] [CrossRef]
  54. Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; Le, Q.V. Unsupervised data augmentation for consistency training. Adv. Neural Inf. Process. Syst. 2020, 33, 6256–6268. [Google Scholar] [CrossRef]
  55. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  56. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  57. Mohanrajan, S.N.; Loganathan, A. Novel vision transformer–based bi-LSTM model for LU/LC prediction—Javadi Hills, India. Appl. Sci. 2022, 12, 6387. [Google Scholar] [CrossRef]
  58. Cordeiro, F.R.; Carneiro, G. A survey on deep learning with noisy labels: How to train your model when you cannot trust on the annotations? In Proceedings of the 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Porto de Galinhas, Brazil, 7–10 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 9–16. [Google Scholar]
  59. Fournier, Q.; Caron, G.M.; Aloise, D. A practical survey on faster and lighter transformers. ACM Comput. Surv. 2023, 55, 304. [Google Scholar] [CrossRef]
  60. Yue, J.; Wang, Y.; Pan, J.; Liang, H.; Wang, S. Less Is More: A Lightweight Deep Learning Network for Remote Sensing Imagery Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4418913. [Google Scholar] [CrossRef]
Figure 1. Overview map of the study area.
Figure 2. Field survey map of various types of objects.
Figure 3. Detailed division of dataset.
Figure 4. Overall workflow chart.
Figure 5. Structure diagram of fully supervised model: (a) FCN. (b) U-Net. (c) DeepLabV3+. (d) SegFormer.
Figure 6. OME framework (OME structure diagram comes from the paper "One Model Is Enough: Toward Multiclass Weakly Supervised Remote Sensing Image Semantic Segmentation").
Figure 7. Comparison before and after image fusion: (a) Pleiades original image: (I) the whole area, (II) typical wetland area, (III) typical cropland area, (IV) typical residential area. (b) PlanetScope-3 original image: (I–IV) Same areas as (a). (c) The fused image: (I–IV) Same areas as (a,b).
Figure 8. Comprehensive evaluation of the different models: (a) Quantitative comparison of evaluation metrics. (b) Training loss convergence.
Figure 9. Classification results of Bosten Lake area: (a) The fused image. (b) OME. (c) FCN. (d) U-Net. (e) DeepLabV3+. (f) SegFormer.
Figure 10. Comparison of semantic segmentation effects on the multimodal fused imagery in typical remote sensing ground object scenes. Regions I–III, IX–XI: Natural and impervious surfaces. Regions IV–VIII: Wetland vegetation and hydrologic systems. Regions XII–XVI: Various crops.
Figure 11. Heatmap of F1-scores across 19 land cover classes for all models.
Table 1. Main parameters and image information of each satellite.
Parameters | Pleiades | PlanetScope-3
Orbit Type | Sun-synchronous orbit | Sun-synchronous orbit
Orbit Altitude (km) | 694 | 475–600
Inclination (°) | 97 | 97
Revisit Period (days) | 1–3 | 1
Sensor Type | TDI CCD & Multispectral | Push-broom
Number of Bands | 4 | 8
Spatial Resolution | 0.5 m (Panchromatic), 2 m (Multispectral) | 3 m
Swath Width (km) | 20 × 20 | 32.5 × 19.5
Image ID | 202406100519053_1206_02316; 202406100519168_1004_04564; 202406100518356_0707_03390 | 20240611_050805_88_24e6; 20240611_050808_13_24e6; 20240611_051017_27_24b7; 20240611_051019_53_24b7; 20240611_051021_78_2467
Acquisition Date | 10 June 2024 | 11 June 2024
Table 2. Computer software and hardware configuration.
Name | Specification
Workstation | Dell Precision 3660 (Dell Inc., Round Rock, TX, USA)
CPU | 13th Gen Intel(R) Core(TM) i9-13900K
GPU | NVIDIA RTX A6000
Random Access Memory (RAM) | 128 GB
Hard Disk | 12 TB
Operating System | Ubuntu 22.04
Programming Language | Python 3.8.5
Deep Learning Framework & CUDA | PyTorch 1.10.1, CUDA 11.1
Table 3. Model hyperparameter configuration and description.
Parameter | Value | Description
Number of Classes | 19 | Total number of semantic categories to segment
Image Size | 30 × 30 | Spatial dimensions of the input image in pixels
Batch Size | 16 | Number of samples processed per parameter update
Epochs | 50 | Number of complete passes through the training dataset
Maximum Iterations | 22,750 | Total number of training iterations (Epochs × (Training Samples / Batch Size))
Optimizer | SGD | Stochastic Gradient Descent
Initial Learning Rate | 0.01 | Learning rate at the start of training
Loss Function | Cross-Entropy / Weighted Cross-Entropy | Standard Cross-Entropy (supervised methods) / Weighted Cross-Entropy (OME)
Evaluation Interval | 5 Epochs | Performance evaluated on the validation set every 5 epochs
Model Selection Criterion | Best mIoU | Model with the highest mIoU on the validation set is selected
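The snippet below sketches how the optimizer and loss settings in Table 3 might be instantiated in PyTorch; the stand-in model and the uniform class weights for OME's weighted cross-entropy are placeholders, not the study's exact configuration.

```python
# Sketch of the Table 3 optimizer/loss setup (assumed usage, placeholders only).
import torch
import torch.nn as nn

model = nn.Conv2d(8, 19, kernel_size=1)                    # stand-in for any segmentation model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # SGD, initial learning rate 0.01

criterion_supervised = nn.CrossEntropyLoss()               # FCN / U-Net / DeepLabV3+ / SegFormer
class_weights = torch.ones(19)                             # illustrative weights for 19 classes
criterion_ome = nn.CrossEntropyLoss(weight=class_weights)  # weighted CE used by OME

logits = model(torch.randn(16, 8, 30, 30))                 # batch of 16 tiles, 30 x 30 pixels
target = torch.randint(0, 19, (16, 30, 30))
loss = criterion_supervised(logits, target)
loss.backward()
optimizer.step()
```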
Table 4. Confusion matrix table.
Actual \ Predicted | Positive | Negative
True | TP (True Positive) | TN (True Negative)
False | FP (False Positive) | FN (False Negative)
Note: For our multi-class segmentation task (with N = 19 classes), the terms TP, FP, FN, and TN used in the following formulas are defined on a per-class basis. When calculating metrics for a specific class i, that class is treated as the “positive” class, and all other classes are aggregated as the “negative” class. Thus, TPᵢ denotes pixels correctly predicted as class i; FPᵢ denotes pixels incorrectly predicted as class i; TNᵢ denotes pixels correctly predicted as not being class i; and FNᵢ denotes pixels of class i incorrectly predicted as other classes. All metric values are reported as percentages.
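The following sketch computes the per-class metrics defined in this note from an N × N confusion matrix; the function and variable names are illustrative.

```python
# Per-class precision, recall, F1, and IoU from an N x N confusion matrix
# (rows: actual class, columns: predicted class), reported as percentages.
import numpy as np

def per_class_metrics(cm: np.ndarray) -> dict:
    """cm[i, j] = number of pixels of class i predicted as class j."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp

    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    iou = tp / (tp + fp + fn + 1e-12)

    return {
        "precision": precision * 100, "recall": recall * 100,
        "f1": f1 * 100, "iou": iou * 100,
        "mIoU": iou.mean() * 100,
        "accuracy": tp.sum() / cm.sum() * 100,   # overall pixel accuracy
    }

# e.g. per_class_metrics(np.random.randint(0, 100, (19, 19)))["mIoU"]
```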
Table 5. Comparison of experimental results of different models.
Model | Accuracy (%) | mIoU (%) | mAcc (%) | Precision (%) | Recall (%) | F1-Score (%)
FCN | 91.65 | 83.57 | 90.35 | 87.75 | 90.45 | 88.48
U-Net | 93.92 | 85.65 | 91.54 | 91.24 | 91.42 | 91.59
DeepLabV3+ | 94.14 | 86.58 | 92.05 | 92.61 | 92.12 | 92.08
SegFormer | 98.75 | 95.33 | 97.36 | 97.50 | 97.31 | 97.47
OME | 98.76 | 87.94 | 95.93 | 91.72 | 96.20 | 92.82
Table 6. Per-class F1-scores of all models on the fused imagery.
Id | Name | FCN (%) | U-Net (%) | DeepLabV3+ (%) | SegFormer (%) | OME (%)
1 | road | 86.84 | 81.46 | 81.98 | 95.78 | 90.03
2 | settlement | 96.15 | 77.57 | 79.52 | 98.55 | 92.05
3 | water | 99.02 | 99.91 | 99.94 | 100.00 | 99.95
4 | reed | 94.24 | 97.74 | 97.95 | 99.04 | 99.82
5 | degraded reed | 99.03 | 99.54 | 99.72 | 99.36 | 95.01
6 | mudflat | 100.00 | 100.00 | 100.00 | 100.00 | 97.13
7 | bare ground | 96.94 | 89.05 | 90.20 | 96.98 | 89.02
8 | grassland | 67.66 | 83.96 | 85.34 | 88.87 | 37.38
9 | sparse vegetation | 89.18 | 89.35 | 90.14 | 97.18 | 99.97
10 | shrubland | 96.69 | 99.38 | 99.55 | 98.76 | 93.55
11 | grape | 99.02 | 100.00 | 100.00 | 100.00 | 82.66
12 | fennel | 89.74 | 92.02 | 93.11 | 99.87 | 97.02
13 | sugar beet | 100.00 | 99.72 | 99.94 | 99.58 | 96.51
14 | wheat | 85.17 | 95.10 | 95.73 | 98.54 | 98.33
15 | maize | 88.58 | 96.32 | 96.85 | 99.95 | 94.52
16 | stevia | 94.26 | 93.85 | 94.52 | 99.92 | 96.01
17 | chili pepper | 93.85 | 91.44 | 92.31 | 100.00 | 99.34
18 | tomato | 94.24 | 96.98 | 97.53 | 100.00 | 98.95
19 | licorice | 99.04 | 98.53 | 99.04 | 100.00 | 96.96
Table 7. Results of the ablation experiments.
Type | Model | Dataset | Accuracy (%) | mIoU (%) | mAcc (%) | Precision (%) | Recall (%) | F1-Score (%)
Full supervision | FCN | Pleiades | 93.31 | 83.22 | 89.72 | 87.64 | 89.86 | 88.44
Full supervision | FCN | PlanetScope-3 | 69.44 | 37.04 | 49.95 | 48.45 | 49.94 | 48.05
Full supervision | FCN | Pleiades & PlanetScope-3 | 91.65 | 83.52 | 90.34 | 87.77 | 90.45 | 88.48
Full supervision | U-Net | Pleiades | 88.35 | 77.25 | 85.85 | 87.83 | 85.82 | 85.95
Full supervision | U-Net | PlanetScope-3 | 56.14 | 22.28 | 31.96 | 43.24 | 31.95 | 31.02
Full supervision | U-Net | Pleiades & PlanetScope-3 | 93.95 | 85.67 | 91.55 | 91.25 | 91.44 | 91.50
Full supervision | DeepLabV3+ | Pleiades | 92.02 | 83.65 | 90.54 | 91.44 | 90.56 | 90.40
Full supervision | DeepLabV3+ | PlanetScope-3 | 85.44 | 66.82 | 79.41 | 80.26 | 79.57 | 78.42
Full supervision | DeepLabV3+ | Pleiades & PlanetScope-3 | 94.11 | 86.53 | 92.02 | 92.68 | 92.18 | 92.04
Full supervision | SegFormer | Pleiades | 95.22 | 88.25 | 92.63 | 94.27 | 92.66 | 93.35
Full supervision | SegFormer | PlanetScope-3 | 84.63 | 68.04 | 83.88 | 77.68 | 83.85 | 80.21
Full supervision | SegFormer | Pleiades & PlanetScope-3 | 98.75 | 95.33 | 97.37 | 97.59 | 97.34 | 97.42
Weak supervision | OME | Pleiades | 94.54 | 84.85 | 91.28 | 90.15 | 89.55 | 90.81
Weak supervision | OME | PlanetScope-3 | 86.35 | 64.12 | 81.95 | 76.82 | 82.22 | 77.93
Weak supervision | OME | Pleiades & PlanetScope-3 | 98.76 | 87.94 | 95.96 | 91.71 | 96.20 | 92.84
Note: The best result for each evaluation metric is highlighted in bold.