Article

Cropland Extraction Based on PlanetScope Images and a Newly Developed CAFM-Net Model

1 College of Geographical Science, Harbin Normal University, Harbin 150025, China
2 Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130102, China
3 Laboratory of Applied Disaster Prevention in Water Conservation Engineering of Jilin Province, Changchun Institute of Technology, Changchun 130103, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Remote Sens. 2026, 18(4), 646; https://doi.org/10.3390/rs18040646
Submission received: 9 January 2026 / Revised: 8 February 2026 / Accepted: 18 February 2026 / Published: 19 February 2026

Highlights

What are the main findings?
  • A novel dual-branch fusion network is developed for cropland extraction, integrating local spatial details and global contextual information.
  • The CAFM model with edge-assisted supervision enhances boundary delineation and small cropland detection.
What are the implications of the main findings?
  • The developed CAFM model accurately achieves fine-scale cropland mapping in high-resolution remote sensing images.
  • Improved cropland boundary and small parcel detection accuracy supports agricultural monitoring, cropland protection, and precision land management.

Abstract

Cropland constitutes a foundational resource for global food security and agricultural sustainability, and its accurate extraction from high-resolution remote sensing imagery is essential for agricultural monitoring and land management. However, existing deep learning-based segmentation methods often struggle to balance global contextual modeling and fine-grained boundary representation, leading to boundary blurring and omission of small cropland parcels. To address these challenges, this study proposes a novel CNN–Transformer dual-branch fusion network, named CAFM-Net, which integrates a convolution and attention fusion module (CAFM) and an edge-assisted supervision head (EH) to jointly enhance global–local feature interaction and boundary delineation capability. Experiments were conducted on a self-built PlanetScope cropland dataset from Suihua City, China, and the GID public dataset to evaluate the effectiveness and generalization ability of the proposed model. On the self-built dataset, CAFM-Net achieved an overall accuracy (OA) of 96.75%, an F1-score of 96.80%, and an Intersection over Union (IoU) of 93.79%, outperforming mainstream models such as UNet, DeepLabV3+, TransUNet, and Swin Transformer by a clear margin. On the GID public dataset, CAFM-Net obtained an OA of 94.58%, an F1-score of 94.19%, and an IoU of 89.02%, demonstrating strong robustness across different data sources. Ablation experiments further confirm that the CAFM contributes most significantly to performance improvement, while the EH module effectively enhances boundary accuracy. Overall, the proposed CAFM-Net provides a quantitatively validated and robust solution for fine-grained cropland segmentation from high-resolution remote sensing imagery, with clear advantages in boundary precision and small-parcel detection.

1. Introduction

Cropland is a strategic national resource underpinning food security, agricultural productivity, and ecological sustainability [1]. Its extent and condition directly govern the resilience and adaptive capacity of national food systems, serving as critical metrics for assessing a country’s agricultural competitiveness, land resource endowment, and environmental carrying capacity [2]. In recent decades, rapid socioeconomic transformation, driven by industrial expansion, infrastructure development, and urban residential sprawl, has intensified cropland conversion and fragmentation [3,4], significantly reducing cropland area and fragmenting its spatial distribution. This degradation not only weakens comprehensive agricultural production capacity and agroecosystem functionality but also exacerbates land-use conflicts, undermining regional ecological integrity and environmental security [5]. Consequently, high-precision, timely, and spatially explicit mapping of cropland area and its configuration is no longer merely a technical objective in land resource inventory; it is an operational prerequisite for evidence-based land governance, effective implementation of the national food security strategy, targeted ecological conservation planning, and balanced regional development [6,7,8,9].
Despite the growing strategic importance of cropland resources, substantial challenges persist in obtaining accurate, consistent, and spatially explicit information on their extent and distribution. Conventional manual surveys and visual interpretation methods, while capable of achieving high classification accuracy, are inefficient, time-consuming, and costly, rendering them inadequate for large-scale and dynamic monitoring requirements [10,11]. Although prior studies have shown that integrating multispectral remote sensing data with rule-based expert knowledge can improve classification consistency in specific contexts [12], such approaches remain limited by low scalability, poor transferability across diverse agroecological regions, and insufficient adaptability to seasonal crop dynamics or land-use transitions [13,14,15]. In recent years, automated texture feature extraction from high-resolution imagery using nonlinear functions has emerged as a scalable alternative to manual delineation [16]. Compared to visual interpretation, this approach is more objective and significantly more efficient in data processing. Nevertheless, such methods remain highly reliant on spectral and textural features and are susceptible to interference from terrain variations, farming practices, crop coverage, and building shadows, thereby resulting in omission errors and fragmented boundaries. These issues compromise both the accuracy and spatial continuity of the extracted results [17].
To date, a large number of studies have applied machine learning algorithms to object-oriented cropland identification [18]. Among these, the Support Vector Machine (SVM) has been widely adopted for classifying multi-source remote sensing imagery due to its strong generalization capability, significantly improving the accuracy of cropland mapping and demonstrating robust performance [19,20,21]. The Random Forest (RF) algorithm has also garnered extensive attention owing to its resilience in high-dimensional feature spaces and superior resistance to overfitting [22,23,24]. Additionally, the K-Nearest Neighbors (KNN) method has shown promising application potential in regional studies, proving suitable for object-level feature classification and delivering stable recognition outcomes in small-scale areas [25,26,27]. However, despite their advantages in computational efficiency, interpretability, and compatibility with moderate-resolution satellite data, traditional machine learning methods face persistent operational constraints in fine-grained cropland mapping. Specifically, boundaries between cropland and non-cropland are often indistinct in remote sensing images—particularly in regions with complex topography, heterogeneous vegetation cover, or natural ecotones—leading to classification ambiguity and misclassification [28]. Moreover, conventional methods rely heavily on manual feature extraction, which limits feature representation capacity, weakens model generalization, and results in a notable decline in recognition performance when applied to high-resolution imagery and complex land-cover backgrounds [29].
Deep learning has emerged as a pivotal technical paradigm for cropland identification from remote sensing imagery [30,31]. Methodologically, deep learning-based approaches have evolved across two distinct phases: an initial stage dominated by convolutional neural networks (CNNs) and a current phase defined by hybrid CNN–Transformer architectures that jointly leverage local feature discrimination and global contextual reasoning [32]. Early studies predominantly adopted established CNN-based segmentation models such as UNet, DeepLabV3+, and PSPNet, achieving high per-pixel accuracy in controlled, small-scale experimental settings [33,34,35]. However, these approaches still suffer from issues such as boundary blurring and misclassification since CNNs rely on local receptive fields and struggle to model long-range spatial dependencies, while lacking selective attention to critical regions. Moreover, in real-world cropland scenes, many cultivated parcels are small and fragmented, often surrounded by complex backgrounds including buildings, vegetation, and water bodies, which further complicates model recognition and hinders continuous performance improvement [36]. To address these limitations, attention mechanisms have been integrated into segmentation tasks, enabling models to emphasize key features through dynamic weighting and effectively mitigating CNN weaknesses in global perception, long-range dependency modeling, and susceptibility to background interference. Following their success in natural language processing, transformer models based on self-attention have matured rapidly and are now widely adopted in computer vision and other domains [37]. In addition, transformers can model dependencies between any spatial positions, significantly enhancing global context integration because of their strong capacity for capturing contextual associations [38].
In order to better meet the demands of pixel-level semantic segmentation, researchers have adapted by combining CNNs for multi-scale feature extraction or designing synergistic CNN–Transformer fusion frameworks for superior performance [39]. Nevertheless, existing networks still face two major challenges: inadequate coordination between global context and local details, resulting in imprecise boundary localization, and the phenomenon of land parcel adhesion or boundary artifacts, which in turn affects the spatial accuracy of the results [40].
Suihua City is an important grain production base in Northeast China, with abundant and diverse types of cultivated land resources that are widely distributed. Therefore, high-precision cultivated land mapping is crucial for agricultural monitoring, protection, and precise management [41]. Given that the identification of cultivated land in this region depends on both local texture features and global spatial structure, and that plot boundaries are complex, this study proposes a new CAFM-Net model, a fusion network based on an improved CNN–Transformer architecture. This model employs a parallel dual-branch encoder to capture local details and global context, respectively. Additionally, a convolution and attention fusion module (CAFM) integrates multi-scale features and enhances salient information, and an edge-assisted supervision head (EH) is embedded to improve boundary sensitivity. By jointly optimizing feature representation, semantic understanding, and boundary refinement, the CAFM-Net model achieves precise cultivated land mapping in complex agricultural environments, effectively addressing common issues such as blurred boundaries, plot adhesion, and false boundaries and improving spatial integrity, positioning accuracy, and overall performance.

2. Materials and Methods

2.1. Study Area

Suihua City (125°18′–128°04′E, 45°03′–48°02′N) lies in the central Heilongjiang Province of China and occupies the agricultural core of the Songnen Plain (Figure 1). Its terrain is predominantly flat and open, under a temperate continental monsoon climate characterized by pronounced seasonality and an average annual precipitation of 500–600 mm, providing favorable thermal and moisture conditions for crop growth [42]. The dominant soil types in Suihua City are black soil and meadow soil, characterized by deep profiles and high organic matter content, offering excellent natural conditions for agricultural production. Cropland is spatially highly concentrated, typically forming large contiguous blocks distributed around towns and along river corridors, exhibiting both strip-shaped and patchy patterns with regular and irregular plot geometries [43]. Nevertheless, due to the combined effects of topographic variation and human activities, cropland boundaries in certain areas are indistinct.

2.2. Data Sources and Preprocessing

Planet satellite data (https://www.planet.com, accessed on 15 April 2025) are acquired by the Earth observation satellite constellation operated by Planet Labs PBC (San Francisco, CA, USA), a commercial remote sensing company. The constellation’s principal satellites currently in orbit are the PlanetScope (Dove) and SkySat optical remote sensing satellites. In this study, PlanetScope satellite imagery was selected as the main remote sensing data source for the Suihua study area (shown in Table 1). PlanetScope imagery offers a spatial resolution of 3 m and comprises four spectral bands (red, green, blue, and near-infrared (NIR)), enabling high-fidelity representation of ground object spectral signatures at fine spatial scales. Moreover, its high temporal revisit frequency provides exceptional capability for large-scale agricultural monitoring and land-use dynamics analysis, allowing for robust capture of crop phenological development stages and progressive land-cover transitions.
This study utilized PlanetScope imagery acquired over Suihua City during the bare soil period (April to May 2024) as the primary data source. Images were rigorously selected based on three criteria: zero cloud cover, high radiometric quality, and full spatial coverage of the study area. Preprocessing included radiometric calibration, atmospheric correction, seamless mosaicking, and precise region-of-interest clipping. The resulting RGB composite images constituted the input image component of the Suihua Cultivated Land Identification Dataset. For label generation, binary farmland masks were produced via single-class pixel-level annotation—exclusively targeting cultivated land—using ArcGIS Pro version 3.0.2. A total of 14,198 parcel-level ground-truth samples were manually delineated. To ensure inter-annotator consistency and labeling reliability, all annotations adhered to a standardized protocol and underwent independent cross-verification by three domain-expert annotators. To augment dataset size and enhance model generalizability and robustness, a systematic data augmentation pipeline was applied. First, each original image–mask pair was randomly cropped into non-overlapping 512 × 512 pixel patches, preserving local spatial structure and semantic coherence. Second, geometric transformations including horizontal and vertical flipping and 90°, 180°, and 270° rotations were applied to all patches to explicitly increase invariance to plot orientation and shape variability. Following augmentation, the final self-annotated dataset comprised 12,672 image–mask pairs. These were stratified into training and validation subsets at an 8:2 ratio (10,138 and 2534 samples, respectively), ensuring balanced class distribution and statistically representative evaluation. Representative label examples are shown in Figure 2.
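The geometric augmentation pipeline described above (flips plus quarter-turn rotations of each image–mask pair) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the assumption that images and masks are NumPy arrays are ours.

```python
import numpy as np

def augment_pair(image: np.ndarray, mask: np.ndarray):
    """Return the original pair plus the geometric variants described in the
    text: horizontal/vertical flips and 90-, 180-, and 270-degree rotations.
    The same transform is always applied to image and mask together."""
    variants = [(image, mask)]
    # horizontal flip (axis 1) and vertical flip (axis 0)
    variants.append((np.flip(image, axis=1).copy(), np.flip(mask, axis=1).copy()))
    variants.append((np.flip(image, axis=0).copy(), np.flip(mask, axis=0).copy()))
    # rotations by k quarter-turns in the spatial plane
    for k in (1, 2, 3):
        variants.append((np.rot90(image, k).copy(), np.rot90(mask, k).copy()))
    return variants
```

Applied to 512 × 512 patches, each original pair yields six pairs, consistent with the expansion of the dataset described above.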
To assess the model’s generalization across different data sources, this study used the GID public dataset derived from GF-2 imagery. The dataset includes two subsets: GID-5 (five land-cover classes: building, farmland, forest, grassland, and water) and GID-15 (fifteen fine-grained classes). GID-5 contains 150 images (6800 × 7200 pixels each), all expert-annotated at the pixel level for high accuracy. For cropland extraction, GID-5 was converted to a binary task comprising cropland and non-cropland. Using ArcGIS Pro, the data were cropped into 512 × 512-pixel image and label patches. The same geometric augmentation strategy as applied to the self-built dataset was adopted, including horizontal and vertical flipping and 90°, 180°, and 270° rotations (as shown in Figure 3). The final dataset was split into training (25,200 images) and validation (2534 images) sets at an 8:2 ratio.

2.3. Cropland Extraction Model

2.3.1. Dual-Branch Encoder of CNN–Transformer

The CNN–Transformer model [44] combines the advantages of CNNs in local feature extraction and the modeling capabilities of transformers for global dependencies. It enhances the overall spatial understanding while preserving fine details, thereby improving the accuracy and generalization ability for high-resolution remote sensing image classification, cropland identification, and change detection. This model adopts a U-shaped encoder–decoder structure, with a parallel dual-branch design in its encoder. Specifically, the CNN branch uses ResNet50 [45] as the backbone network, consisting of five stages (initial feature extraction and downsampling), followed by four stages of stacked residual blocks (Figure 4a). Through progressive downsampling, it captures multi-level local features, and the final feature map has a resolution of 1/32 of the input.
For the transformer branch, this study adopted the Swin Transformer [46] as the backbone network. The input image was first divided into patches with a side length of one quarter of the original resolution through the patch embedding module and then processed by the Swin Transformer to extract contextual semantic features. The Swin Transformer core consists of two attention mechanisms: window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) (Figure 4b). Specifically, W-MSA captures local features by computing self-attention within non-overlapping windows, while SW-MSA achieves cross-window interaction through shifted windows, thereby enhancing global context modeling [47]. A multi-layer perceptron (MLP) was introduced to perform nonlinear feature transformation after each attention block [48]. Additionally, LayerNorm (LN) was applied before W-MSA, SW-MSA, and MLP to stabilize features, and a residual connection followed each module to support stable training and gradient flow. The entire Swin Transformer block can be defined by the following equations:
$$\hat{z}^{l} = \mathrm{W\text{-}MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}$$
$$z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}$$
$$z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}$$
where $\hat{z}^{l}$ and $z^{l}$ denote the output features of the (S)W-MSA module and the MLP module in layer $l$, respectively.
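The pre-norm/residual ordering of these two consecutive layers can be illustrated in PyTorch. This is a structural sketch only: plain multi-head self-attention stands in for the windowed W-MSA/SW-MSA, and the embedding dimension, head count, and MLP ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Two consecutive Swin-style layers: LayerNorm before each attention
    and MLP, with a residual connection after each module. Windowed
    attention is replaced by plain MHSA for brevity."""
    def __init__(self, dim: int = 32, heads: int = 4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ln3, self.ln4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.wmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.swmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                  nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                  nn.Linear(4 * dim, dim))

    def forward(self, z):                       # z: (B, N, dim) token sequence
        h = self.ln1(z)
        z = self.wmsa(h, h, h)[0] + z           # z_hat^l  (W-MSA + residual)
        z = self.mlp1(self.ln2(z)) + z          # z^l      (MLP + residual)
        h = self.ln3(z)
        z = self.swmsa(h, h, h)[0] + z          # z_hat^{l+1} (SW-MSA + residual)
        z = self.mlp2(self.ln4(z)) + z          # z^{l+1}  (MLP + residual)
        return z
```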

2.3.2. The Improved Dual-Branch Network Model: CAFM-Net

Although the CNN–Transformer model has made progress in feature fusion and edge information extraction, it is still constrained by limitations. The fixed receptive field of CNNs limits their ability to capture long-range semantic dependencies. Meanwhile, while transformers excel in global modeling, they perform poorly in fine texture representation and boundary delineation, often losing local details [49]. During the fusion process, the lack of effective cross-modal interaction leads to semantic inconsistency between the convolutional local features and the attention-based global features. These issues are further exacerbated in high-resolution remote sensing images with complex boundaries and ambiguous transition regions, reducing the model’s performance in fine-grained classification and semantic segmentation. To achieve high-precision spatial modeling for identifying farmlands and overcome the limitations of traditional fusion methods in global–local feature integration and edge perception, this study optimized the network architecture in the feature extraction and decoding stages. The overall structure is shown in Figure 5.
After feature extraction through the dual-branch encoder, a convolution and attention fusion module (CAFM) [50] was introduced to jointly capture local and global image features (Figure 6). The CAFM combines convolutional operations with self-attention mechanisms, leveraging their complementary advantages to model local details and global context, thereby enhancing feature representation and denoising performance. In the local branch, CNN feature XC is first adjusted in channel dimension using a 1 × 1 convolution, then processed with channel shuffling and a 3 × 3 × 3 convolution to capture spatial details. In the global branch, queries (Q) are generated from transformer feature XT, while keys (K) and values (V) come from CNN feature XC, using 1 × 1 and 3 × 3 convolutions respectively. This design allows global context to guide attention weighting on local features for more accurate aggregation. Finally, the outputs of both branches are added element-wise via the detail fusion module [51] to produce the final CAFM output.
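The two-branch structure just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the published CAFM: channel counts, head counts, and group counts are placeholders, and a 2D 3 × 3 convolution stands in for the 3 × 3 × 3 convolution used in the paper.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups (as in the local branch)."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class CAFMSketch(nn.Module):
    """Illustrative convolution-and-attention fusion: a local conv branch on
    the CNN feature X_C, and a global attention branch whose queries come
    from the transformer feature X_T while keys/values come from X_C."""
    def __init__(self, channels=32, heads=4, groups=4):
        super().__init__()
        self.groups = groups
        self.local_proj = nn.Conv2d(channels, channels, 1)          # 1x1 channel adjust
        self.local_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.q_proj = nn.Conv2d(channels, channels, 1)              # Q from X_T
        self.kv_proj = nn.Conv2d(channels, 2 * channels, 3, padding=1)  # K, V from X_C
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x_c, x_t):
        # local branch: 1x1 conv -> channel shuffle -> spatial conv
        local = self.local_conv(channel_shuffle(self.local_proj(x_c), self.groups))
        # global branch: attention with Q from the transformer feature
        b, c, h, w = x_c.shape
        q = self.q_proj(x_t).flatten(2).transpose(1, 2)             # (B, HW, C)
        k, v = self.kv_proj(x_c).chunk(2, dim=1)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        glob = self.attn(q, k, v)[0].transpose(1, 2).reshape(b, c, h, w)
        return local + glob   # element-wise fusion of the two branches
```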
To enhance the spatial accuracy of farmland boundary localization, this study integrated an auxiliary edge detection module (edge head, EH) [52] into the network. As an auxiliary supervision branch, the EH explicitly constrains the backbone network to focus on the farmland boundary regions during training. By introducing edge-level supervision, the EH guides the main segmentation branch to learn more discriminative boundary representations, thereby reducing boundary blurring and plot adhesion. The EH fuses shallow features from the CNN and Swin Transformer branches through concatenation and processes the combined features with three lightweight convolutional layers to compress channels, compact the representation, and generate a single-channel edge probability map. Additionally, it achieves edge refinement while maintaining a lightweight design. All convolutional layers except the last one were followed by batch normalization (BN) and ReLU activation to stabilize features and enhance nonlinearity. Finally, the output was upsampled by a factor of 4 through bilinear interpolation to match the original input resolution. In our framework, both the main segmentation branch and the edge head (EH) branch were supervised with standard cross-entropy loss, enabling the EH to serve as an effective auxiliary boundary constraint. Through shared feature learning, it enhanced boundary perception while maintaining a simple and stable training strategy.
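The edge head described above can be sketched as follows; the channel widths are assumptions, since the text does not specify them, but the structure (concatenation, three light convolutions with BN + ReLU after all but the last, single-channel output, ×4 bilinear upsampling) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeHead(nn.Module):
    """Auxiliary edge head sketch: fuses shallow CNN and transformer
    features and predicts a one-channel edge probability map."""
    def __init__(self, cnn_ch=64, swin_ch=96, mid_ch=32):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(cnn_ch + swin_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.conv3 = nn.Conv2d(mid_ch, 1, 1)   # single-channel edge logits

    def forward(self, f_cnn, f_swin):
        x = torch.cat([f_cnn, f_swin], dim=1)   # concatenate shallow features
        x = self.conv3(self.conv2(self.conv1(x)))
        # x4 bilinear upsampling back to the input resolution
        return F.interpolate(x, scale_factor=4, mode="bilinear",
                             align_corners=False)
```

In training, the sigmoid of these logits would be supervised with the cross-entropy edge loss alongside the main segmentation loss.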
All experiments were implemented in PyTorch and optimized using the Adam optimizer with an initial learning rate of 0.01 and a weight decay of 1 × 10⁻⁵. Models were trained for 50 epochs with a batch size of 4, and early stopping was applied as training converged after approximately 50 epochs. Cross-entropy loss was used for optimization. For fair comparison, all baseline models employed the same experimental setup, including the dataset, data augmentation strategy, optimizer, learning rate, batch size, training epochs, and evaluation metrics. Network-specific parameters followed the default configurations recommended in the original papers or official implementations. The experiments were carried out on a workstation equipped with an Intel Core i9-14900K CPU, an NVIDIA RTX 4080 Ti GPU (16 GB), and 48 GB RAM.

2.4. Ablation Experiment

To evaluate the contribution of the proposed CAFM and the auxiliary EH modules, systematic ablation experiments were conducted. By incrementally adding or removing key components, the impact of each module on classification accuracy and generalization ability was assessed. Experiments were performed on both the self-built farmland dataset (Suihua City) and GID public dataset to validate model performance and cross-dataset generalization. Four configurations were designed: Test 1—removing both the CAFM and EH, retaining only the CNN branch decoding structure as the baseline; Test 2—introducing the CAFM while removing the EH to evaluate the effect of the feature fusion mechanism; Test 3—retaining the EH but removing the CAFM to assess the edge detection module’s independent contribution and indirectly reflect the effectiveness of the transformer branch and fusion strategy; Test 4—retaining all enhanced modules to form the final model, evaluating the overall performance improvement from architectural optimization.

2.5. Comparison Experiment

To evaluate CAFM-Net against mainstream models, comparative experiments were conducted on both the self-built Suihua City dataset and GID public dataset. Segmentation accuracy was introduced as the primary metric for assessing model performance. The commonly used models include UNet [53], PSPNet [54], DeepLabV3+ [55], TransUNet [56], and Swin Transformer, representing both convolutional encoder–decoder and transformer-based approaches. Through the comparison experiments, structural complexity and computational efficiency were analyzed to validate CAFM-Net’s balance of lightweight design and performance, as well as its accuracy, robustness, and generalization ability. In addition to accuracy evaluation, model efficiency was evaluated using the number of parameters, floating point operations (FLOPs), and average inference time. Specifically, FLOPs were computed as the total number of floating point operations required for a single forward pass. The average inference time was obtained by averaging multiple forward passes conducted under identical hardware configurations and input settings [57].
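The parameter count and average inference time described above can be measured as in the sketch below (the function names are ours). FLOPs counting typically relies on an external profiling tool and is omitted here; the timing protocol (warm-up passes followed by averaging over repeated forward passes on fixed hardware and input size) mirrors the one in the text.

```python
import time
import torch
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def mean_inference_ms(model: nn.Module,
                      input_shape=(1, 3, 512, 512),
                      warmup: int = 3, runs: int = 10) -> float:
    """Average forward-pass time in milliseconds over repeated runs,
    after a few warm-up passes."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):        # warm-up passes are not timed
        model(x)
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - t0) / runs * 1000.0
```

On a GPU, `torch.cuda.synchronize()` should be called before reading the timer so that asynchronous kernels are included in the measurement.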

2.6. Accuracy Evaluation

This study followed the ISPRS benchmark protocol [58] and adopted three standard classification metrics, namely overall accuracy (OA), precision, and recall, to evaluate the performance of the model. To further assess segmentation quality in detail, F1 score, Dice coefficient, and Intersection over Union (IoU) were also introduced. The mathematical definitions of all metrics are provided below:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2TP}{2TP + FP + FN} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}$$
where TP, TN, FP and FN denote the numbers of correctly classified positive samples, correctly classified negative samples, negative samples erroneously predicted as positive, and positive samples erroneously predicted as negative.
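Given the four confusion-matrix counts, the standard definitions of these metrics for a binary cropland/non-cropland map can be computed directly (the function name is ours):

```python
def segmentation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-segmentation metrics from confusion-matrix counts."""
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)            # intersection over union
    dice = 2 * tp / (2 * tp + fp + fn)   # equals F1 for binary masks
    return {"OA": oa, "Precision": precision, "Recall": recall,
            "F1": f1, "IoU": iou, "Dice": dice}
```

Note that Dice and F1 coincide for binary masks, which is why the tables in Section 3 report identical values for the two.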

3. Results

3.1. Results of Ablation Experiments

3.1.1. Self-Built Dataset

Table 2 shows the accuracy results of the self-built Suihua City cropland dataset under four ablation experiments with different module combinations. Specifically, Test 1 achieves an overall accuracy (OA) of 93.60%, a precision of 92.11%, and a recall of 95.33%, with both an F1-score and Dice coefficient of 93.75% and an IoU of 88.24%. With the introduction of the CAFM, Test 2 obtains an OA of 96.71%, precision of 96.10%, and recall of 97.43%, while the F1-score and Dice coefficient both reach 96.76% and the IoU reaches 93.73%. After introducing the EH module, Test 3 achieves an OA of 93.73%, precision of 92.72%, and recall of 95.00%, with the F1-score and Dice coefficient both reaching 93.85% and an IoU of 88.41%. When both the CAFM and EH modules are combined, Test 4 achieves an OA of 96.75%, precision of 96.27%, and recall of 97.32%, with the F1-score and Dice coefficient both reaching 96.80% and the IoU reaching 93.79%.
Figure 7 shows representative areas of remote sensing images in Suihua City, their land-type labels, and the results of the four ablation experiments. Note that white regions represent cropland, and black regions refer to other land types. It can be seen from the figure that Test 1 produced coherent segmentation but lost details, missed small areas, and had blurred boundaries; Test 2 improved contour continuity and shape clarity but still had edge classification errors; Test 3 achieved a smoother spatial distribution and better boundary retention compared to Test 1; Test 4 used both the CAFM and EH simultaneously, producing the most complete result, with enhanced small-target detection capability and clearer shapes, although a few local noise artifacts and gaps remained.

3.1.2. GID Public Dataset

The accuracy of four ablation experiments using the GID public dataset under different module combinations is shown in Table 3. The overall accuracy of Test 1 is 92.20%, with precision at 89.56% and recall at 94.53%, while the F1-score and Dice coefficient both reach 94.13% and the IoU reaches 88.92%. In Test 2 introducing the CAFM, the overall accuracy increases to 94.54%, with precision and recall reaching 93.45% and 95.05% respectively, and the corresponding F1-score and Dice coefficient both reaching 94.25%, with an IoU of 89.12%. After adding the EH module, Test 3 achieves an overall accuracy of 92.38%, precision of 89.95%, and recall of 94.33%, with the F1-score and Dice coefficient both reaching 92.09% and an IoU of 85.34%. In the experiment integrating both the CAFM and EH modules in Test 4, the overall accuracy reaches 94.58%, precision is 94.97%, and recall is 93.42%, while the F1-score and Dice coefficient both reach 94.19% and the IoU reaches 89.02%.
Figure 8 shows remote sensing images, ground-truth labels, and segmentation results from the ablation experiments on typical areas of the GID public dataset, where white and black regions represent croplands and non-croplands, respectively. Test 1 showed clear edge errors and missed small targets, with overly smoothed boundaries and severe detail loss in complex texture areas near rivers and towns. Test 2 improved small-target detection and overall integrity but still produced local misclassifications (such as the noise points in complex backgrounds labeled in the red box), indicating limited interference suppression. The performance of Test 3 was similar to that of Test 1, with almost no improvement in boundary sharpness or fine structures, and it also exhibited over-smoothing, suggesting that merely adding an edge module does not significantly enhance segmentation quality. In contrast, Test 4 performed best, with better spatial consistency, clearer boundaries, and more accurate recovery of small targets.

3.2. Results of Comparison Experiments

3.2.1. Model Efficiency and Computational Complexity Analysis

As shown in Table 4, CAFM-Net contains 69.8 M parameters and requires 153.6 GFLOPs for a 512 × 512 input image. In comparison, UNet has 31.4 M parameters and 49.8 GFLOPs, while TransUNet has 102.7 M parameters and 196.8 GFLOPs. The average inference time of CAFM-Net is 44.5 ms per image. The corresponding inference times of UNet, DeepLabV3+, Swin Transformer, and TransUNet are 21.6 ms, 38.7 ms, 33.1 ms, and 63.9 ms, respectively.

3.2.2. Self-Built Dataset

Figure 9 presents cropland extraction results from representative areas of Suihua City, comparing six deep learning models. Specifically, the UNet model produced structurally coherent segmentation but suffered from small-plot omission and edge degradation near road intersections and greenhouse zones. PSPNet captured the main object contours effectively yet exhibited discontinuous boundaries and missed edges for small-scale targets. DeepLabV3+ showed notable inaccuracies in complex boundary regions; as indicated by the red dashed box, built-up areas were misclassified as cropland, accompanied by scattered noise. TransUNet maintained moderate continuity in large-area extraction but experienced fragmentation and local misclassification in linear feature zones. Swin Transformer preserved the overall shape of large cultivated fields but yielded blurred boundaries and poor detection of small plots, leading to suboptimal segmentation accuracy. In contrast, the proposed CAFM-Net achieved superior performance across diverse scenarios (including regular farmland, greenhouses, and complex terrains) by enhancing boundary delineation, preserving spatial continuity of fine structures, and delivering more accurate and detailed feature representation.
Table 5 shows the segmentation accuracy of different models on the Suihua City dataset. Both traditional CNN-based models (UNet, PSPNet, and DeepLabV3+) and transformer-based models (Swin Transformer and TransUNet) differ markedly in performance. Specifically, UNet and DeepLabV3+ achieved high F1-scores and IoU values, whereas PSPNet and Swin Transformer showed lower accuracy. TransUNet achieved balanced results, with an F1-score and Dice coefficient of 90.23%, an IoU of 82.41%, and an OA of 91.66%. CAFM-Net performed best across all metrics: the F1-score and Dice coefficient reached 96.80%, IoU was 93.79%, OA reached 96.75%, precision reached 96.27%, and recall achieved 97.32%.
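For reference, the accuracy metrics reported in Tables 2–6 follow the standard confusion-matrix definitions. A minimal sketch is given below; note that for binary (cropland vs. non-cropland) masks the F1-score and the Dice coefficient are algebraically identical, which is why the tables report the same value for both columns.

```python
# Standard binary segmentation metrics computed from flattened
# prediction/label arrays (1 = cropland, 0 = non-cropland).
def metrics(pred, label):
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, label))  # true positives
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, label))  # false positives
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, label))  # false negatives
    tn = sum(p == 0 and y == 0 for p, y in zip(pred, label))  # true negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # equals Dice for binary masks
    iou = tp / (tp + fp + fn)                           # Intersection over Union
    oa = (tp + tn) / len(pred)                          # overall accuracy
    return {"OA": oa, "Precision": precision, "Recall": recall, "F1": f1, "IoU": iou}
```

In practice these would be accumulated pixel-wise over the whole test set rather than per list, but the formulas are the same.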

3.2.3. GID Public Dataset

Figure 10 shows cropland extraction results of different models on the GID dataset. Specifically, UNet produced coherent segmentation but with blurred boundaries near rivers. PSPNet misclassified river edges and vegetation as farmland and failed to segment narrow water bodies. DeepLabV3+ showed fragmentation and weak spatial continuity. Swin Transformer preserved large shapes but lost details and also suffered from blurred boundaries. TransUNet had discontinuous segmentation in linear feature zones like elongated fields or ditches, with poor detection of narrow structures. CAFM-Net achieved sharper boundaries, better small-plot detection, and strong adaptation to complex terrain. However, as shown in Figure 10, spectral confusion between rivers, vegetation, and surrounding areas caused a slight drop in accuracy along river-adjacent regions.
Table 6 shows the segmentation performance of different models on the GID dataset. Both traditional CNN models (UNet, PSPNet, and DeepLabV3+) and transformer-based models (Swin Transformer and TransUNet) showed clear differences across the evaluation metrics, consistent with the results on the self-built dataset. Specifically, UNet and DeepLabV3+ achieved high F1-scores and OA, while PSPNet performed worse. Swin Transformer and TransUNet showed limited boundary and detail representation, with F1-scores of 81.96% and 83.11%, IoU values of 69.99% and 71.22%, and OA values of 83.25% and 83.70%, respectively. CAFM-Net achieved the best results across all metrics: the F1-score and Dice coefficient reached 94.19%, IoU reached 89.02%, OA was 94.58%, precision achieved 94.97%, and recall reached 93.42%.

4. Discussion

Although cropland and non-cropland generally exhibit distinct spectral and spatial characteristics, local similarity and spatial adjacency may still occur under certain conditions, leading to boundary adhesion and ambiguous transitions, which pose challenges for multi-scale feature extraction and precise boundary identification [59]. While traditional CNNs excel at capturing local details, they struggle to model long-range semantic dependencies; in contrast, transformers offer strong global context modeling but perform poorly in preserving fine-grained spatial structures. To address this issue, CAFM-Net was developed as a U-shaped network with a parallel dual-branch encoder and a CAFM for efficient local–global feature fusion, improving segmentation in complex agricultural scenes. The model also integrates an EH module in the backbone network to exploit shallow high-resolution features for edge response and employs a boundary supervision mechanism to explicitly learn contours, thereby enhancing sensitivity to transition areas, spatial continuity, and boundary accuracy.
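The dual-branch fusion idea can be illustrated with a minimal numpy sketch. This is not the paper's exact CAFM (whose operations are defined in the methodology section); it shows only the general pattern of concatenating same-resolution CNN and transformer feature maps and re-weighting the fused channels with a simple pooling-and-gating vector, all in the (B, C, H, W) layout used in Figure 6.

```python
import numpy as np

def fuse(x_cnn, x_trans):
    """Illustrative local-global fusion: concatenate along channels, then
    gate each channel with a sigmoid of its global average (squeeze-style
    attention). Not the exact CAFM formulation."""
    x = np.concatenate([x_cnn, x_trans], axis=1)   # (B, Cc + Ct, H, W)
    w = x.mean(axis=(2, 3), keepdims=True)         # global average pool -> (B, C, 1, 1)
    w = 1.0 / (1.0 + np.exp(-w))                   # sigmoid gate per channel
    return x * w                                   # channel-weighted fused features

b, cc, ct, h, wd = 1, 4, 4, 8, 8
fused = fuse(np.ones((b, cc, h, wd)), np.zeros((b, ct, h, wd)))
print(fused.shape)  # (1, 8, 8, 8)
```

The key property shared with the CAFM is that both branches contribute to one feature map whose channels are re-weighted jointly, so global context can modulate local detail and vice versa.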
Ablation experiments in Table 2 and Table 3 demonstrate that the CAFM significantly enhances model performance. By fusing multi-scale local features with global context, it improves the model’s ability to capture the spatial distribution patterns and boundary structures of cropland, increasing overall accuracy by 3.11% and 2.34% on the two datasets, respectively. Concurrent improvements in precision and recall also indicate a better balance between suppressing misclassification and reducing omission errors. Compared with Test 1, Test 3 exhibits a noticeable decrease in recall, which can be attributed to the behavior of the EH when used independently. Although the EH enhances boundary sensitivity by emphasizing edge features, the absence of the CAFM limits the effective integration of global contextual information. As a result, the model tends to focus on boundary regions while suppressing ambiguous or small cropland areas, increasing omission errors. This particularly affects fragmented or spectrally heterogeneous cropland parcels, where insufficient global semantic support causes some true cropland pixels to be misclassified as background, thereby reducing recall. The results also show that relying on boundary supervision alone cannot guarantee high recall, and that effective global–local feature fusion is essential for balancing boundary accuracy and target integrity. Additionally, Figure 7 shows that while Test 1 identifies the major croplands, it suffers from poor detail preservation and blurred boundaries. Introducing the CAFM (Test 2) yields clearer contours but still exhibits misclassifications in certain edge regions. Using the EH module alone (Test 3) enhances boundary sensitivity to some extent but still performs poorly on small-scale targets, leading to under-segmentation.
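The edge-assisted supervision idea can likewise be sketched. The boundary-extraction rule (a pixel is a boundary pixel if any 4-neighbour of the ground-truth mask differs) and the loss weight `lam` below are illustrative assumptions, not the paper's exact formulation of the EH loss.

```python
import numpy as np

def boundary_map(mask):
    """Mark pixels whose value differs from any 4-neighbour in the binary
    ground-truth mask; this yields a thin two-sided boundary band that an
    auxiliary edge head can be supervised against."""
    m = mask.astype(bool)
    edge = np.zeros_like(m)
    edge[1:, :] |= m[1:, :] != m[:-1, :]   # differs from pixel above
    edge[:-1, :] |= m[:-1, :] != m[1:, :]  # differs from pixel below
    edge[:, 1:] |= m[:, 1:] != m[:, :-1]   # differs from pixel to the left
    edge[:, :-1] |= m[:, :-1] != m[:, 1:]  # differs from pixel to the right
    return edge.astype(np.uint8)

def total_loss(seg_loss, edge_loss, lam=0.5):
    """Joint objective: segmentation loss plus weighted auxiliary edge loss.
    The weight lam is an illustrative placeholder."""
    return seg_loss + lam * edge_loss

mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:4, 1:4] = 1  # a 3x3 cropland parcel
print(boundary_map(mask))
```

Supervising the edge head against such a map penalizes blurred or displaced contours explicitly, which is the mechanism the ablation credits for the EH's boundary gains.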
In contrast, Test 4 shows that when the CAFM and EH are combined, better performance can be achieved in both boundary accuracy and fine-structure recognition, and this trend is consistent in both the GID public dataset and the self-built dataset. Overall, the CAFM enhances the deep integration of multi-scale features and global semantics, while the EH optimizes the fine-grained boundary expression. Their complementary roles bring about the best performance, confirming their robustness and generalization ability across datasets.
CAFM-Net outperforms mainstream models on both datasets in accuracy and visual quality, demonstrating high segmentation performance and strong generalization. Figure 9 and Figure 10 show clear boundaries and good preservation of small plots across farmland, greenhouses, roads, buildings, rivers, vegetation, and urban areas. Table 5 indicates large performance gaps among models on the Suihua dataset. CNNs (UNet, PSPNet, and DeepLabV3+) perform well on large areas: UNet and DeepLabV3+ achieve high F1-scores and IoU but struggle with small targets and complex boundaries, while PSPNet has low OA and poor detail recovery due to limited resolution restoration. Transformers (Swin Transformer and TransUNet) model global context well, but Swin Transformer lacks local detail and boundary continuity, and TransUNet achieves balanced results yet still underperforms on narrow features. CAFM-Net achieves the best scores in all metrics, excelling in overall accuracy, fine-target detection, and boundary precision. On the GID dataset, despite the complex terrain, CAFM-Net remained robust and accurately segmented small plots and complex backgrounds, ranking first in F1-score, IoU, precision, recall, and OA. Compared to UNet, the second-best model, it improves OA by 4.10%, precision by 5.17%, recall by 3.30%, F1-score by 4.24%, and IoU by 7.20%. Both CNNs and transformers have limitations: UNet’s local receptive field limits long-range modeling, resulting in blurred edges and missed detections [60]; PSPNet’s pyramid pooling leads to over-smoothing and loss of small targets in complex areas [61]; DeepLabV3+ expands the receptive field but misclassifies buildings as farmland due to weak fine-structure expression [62]; and Swin Transformer captures global context but shows boundary discontinuities due to limited cross-window interaction [63]. Although TransUNet combines the advantages of CNNs and transformers, insufficient feature fusion leads to detail loss and edge fractures [64].
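The improvement margins over UNet quoted above can be recomputed directly from the Table 6 values:

```python
# GID-dataset scores from Table 6 (percent).
cafm = {"OA": 94.58, "Precision": 94.97, "Recall": 93.42, "F1": 94.19, "IoU": 89.02}
unet = {"OA": 90.48, "Precision": 89.80, "Recall": 90.12, "F1": 89.95, "IoU": 81.82}

# Per-metric margin of CAFM-Net over UNet, in percentage points.
deltas = {k: round(cafm[k] - unet[k], 2) for k in cafm}
print(deltas)  # {'OA': 4.1, 'Precision': 5.17, 'Recall': 3.3, 'F1': 4.24, 'IoU': 7.2}
```

The computed margins match the figures quoted in the text (4.10, 5.17, 3.30, 4.24, and 7.20 percentage points).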
Table 4 indicates that CAFM-Net introduces moderate computational overhead due to its dual-branch CNN–Transformer architecture, with 69.8 M parameters and 153.6 GFLOPs for a 512 × 512 input. While this is higher than pure CNN-based models such as UNet (31.4 M and 49.8 GFLOPs), it remains lower than TransUNet (102.7 M and 196.8 GFLOPs) and other hybrid models. Despite the increased complexity, CAFM-Net achieves higher segmentation accuracy, particularly in boundary delineation and small-parcel detection, indicating a favorable performance–efficiency trade-off. In terms of efficiency, CAFM-Net achieves an average inference time of 44.5 ms per image, which is acceptable for offline or near-real-time agricultural mapping. Although CAFM-Net is not the most lightweight model, its computational cost is justified by the accuracy gains, and future work will focus on model compression and lightweight design.
Despite its strong accuracy and cross-dataset generalization, CAFM-Net faces three well-defined limitations: misclassification or omission of small targets in complex scenes such as rivers, urban areas, and irregularly shaped parcels; increased model size and computational overhead from the dual-branch encoder, which limits deployment in real-time or edge-constrained settings; and insufficient validation of generalization across datasets and geographical regions. To comprehensively assess the generalization capability of CAFM-Net, future work should incorporate a wider range of self-annotated and publicly available datasets at larger spatial scales, with training samples covering different soil types, land-use patterns, terrain characteristics, and crop varieties. Further improvements can also be made in fine-grained boundary modeling, adaptive multi-scale feature fusion, hardware-aware model lightweighting, low-latency inference optimization, domain-adaptive multi-source learning, and explainable analysis integrating uncertainty quantification.

5. Conclusions

This study proposed CAFM-Net, a CNN–Transformer dual-branch fusion network for fine-grained cropland segmentation from high-resolution remote sensing imagery. By integrating the convolution and attention fusion module (CAFM) with an auxiliary edge-assisted supervision head (EH), the model effectively balances global contextual modeling and local boundary representation. Experimental results on both the self-built PlanetScope dataset and the GID public dataset demonstrate that CAFM-Net consistently outperforms mainstream segmentation models. On the self-built dataset, CAFM-Net achieved an overall accuracy of 96.75%, an F1-score of 96.80%, and an IoU of 93.79%. On the GID dataset, it reached an overall accuracy of 94.58%, an F1-score of 94.19%, and an IoU of 89.02%, confirming its robustness and cross-dataset generalization ability. Ablation experiments further verify that the CAFM provides the primary performance gains, while the EH module contributes to improved boundary delineation. Although CAFM-Net shows clear advantages in accuracy and boundary precision, challenges remain in highly complex scenes and in reducing model complexity. Future work will focus on lightweight design and further boundary refinement to enhance efficiency and applicability in large-scale agricultural monitoring.

Author Contributions

Conceptualization, G.M.; methodology, Y.J.; software, Y.J.; validation, X.Z.; formal analysis, K.L.; investigation, J.R.; resources, S.L.; data curation, J.R.; writing—original draft preparation, Y.J. and J.R.; writing—review and editing, J.R.; visualization, X.Z.; supervision, G.M.; project administration, G.M.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 42371381 and No. 42171333), the Natural Science Foundation of Jilin Province of China (No. YDZJ202501ZYTS466), the National Key Research and Development Program Project (No. 2021YFD1500101), the Program for Young Talents of Basic Research in Universities of Heilongjiang Province (No. YQJH2024113), and the Natural Science Foundation of Heilongjiang Province of China (No. PL2025D015).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, A.; He, H.; Wang, J.; Li, M.; Guan, Q.; Hao, J. A study on the arable land demand for food security in China. Sustainability 2019, 11, 4769. [Google Scholar] [CrossRef]
  2. He, X.; Liu, W. Coupling coordination between agricultural eco-Efficiency and urbanization in China considering food security. Agriculture 2024, 14, 781. [Google Scholar] [CrossRef]
  3. Ma, E.; Cai, J.; Lin, J.; Guo, H.; Han, Y.; Liao, L. Spatiotemporal evolution and influencing factors of global food security pattern from 2000 to 2014. Acta Geogr. Sin. 2020, 75, 332–347. (In Chinese) [Google Scholar]
  4. Sun, X.; Xiang, P.; Cong, K. Research on early warning and control measures for arable land resource security. Land Use Policy 2023, 128, 106601. [Google Scholar] [CrossRef]
  5. Liao, Y.; Lu, X.; Liu, J.; Huang, J.; Qu, Y.; Qiao, Z.; Xie, Y.; Liao, X.; Liu, L. Integrated Assessment of the Impact of Cropland Use Transition on Food Production Towards the Sustainable Development of Social–Ecological Systems. Agron. J. 2024, 14, 2851. [Google Scholar] [CrossRef]
  6. Bren d’Amour, C.; Reitsma, F.; Baiocchi, G.; Barthel, S.; Güneralp, B.; Erb, K.; Haberl, H.; Creutzig, F.; Seto, K.C. Future urban land expansion and implications for global croplands. Proc. Natl. Acad. Sci. USA 2017, 114, 8939–8944. [Google Scholar] [CrossRef]
  7. Song, D.; Ding, W.; Zhou, W. Temporal and spatial variation characteristics and sustainable utilization strategy of main cropland reserve resources in China. J. Plant Nutr. Fert. 2024, 30, 1437–1446. (In Chinese) [Google Scholar]
  8. Zhao, S.; Yin, M. Change of urban and rural construction land and driving factors of arable land occupation. PLoS ONE 2023, 18, e0286248. [Google Scholar] [CrossRef]
  9. Li, H.; Song, W. Spatial transformation of changes in global cropland. Sci. Total Environ. 2023, 859, 160–194. [Google Scholar] [CrossRef]
  10. Cai, Z.; Hu, Q.; Zhang, X.; Yang, J.; Wei, H.; He, Z.; Song, Q.; Wang, C.; Yin, G.; Xu, B. An adaptive image segmentation method with automatic selection of optimal scale for extracting cropland parcels in smallholder farming systems. Remote Sens. 2022, 14, 3067. [Google Scholar] [CrossRef]
  11. Hossain, M.; Chen, D. Segmentation for Object-Based Image Analysis (OBIA): A review of algorithms and challenges from remote sensing perspective. ISPRS J. Photogramm. Remote Sens. 2019, 150, 115–134. [Google Scholar] [CrossRef]
  12. Yang, Y.; Meng, Z.; Zu, J.; Cai, W.; Wang, J.; Su, H.; Yang, J. Fine-scale mangrove species classification based on uav multispectral and hyperspectral remote sensing using machine learning. Remote Sens. 2024, 16, 3093. [Google Scholar] [CrossRef]
  13. Agnoletti, M.; Cargnello, G.; Gardin, L.; Santoro, A.; Bazzoffi, P.; Sansone, L.; Pezza, L.; Belfiore, N. Traditional landscape and rural development: Comparative study in three terraced areas in northern, central and southern Italy to evaluate the efficacy of GAEC standard 4.4 of cross compliance. Ital. J. Agron. 2011, 6, 121–139. [Google Scholar] [CrossRef]
  14. Martínez-Casasnovas, J.; Ramos, M.; Cots-Folch, R. Influence of the EU CAP on terrain morphology and vineyard cultivation in the Priorat region of NE Spain. Land Use Policy 2010, 27, 11–21. [Google Scholar] [CrossRef]
  15. Zhao, B.; Ma, N.; Yang, J.; Li, Z.; Wang, Q. Extracting features of soil and water conservation measures from remote sensing images of different resolution levels: Accuracy analysis. Bull. Soil Water Conserv. 2012, 32, 154–157. [Google Scholar]
  16. Li, X.; Li, Y.; Ai, J.; Shu, Z.; Xia, J.; Xia, Y. Semantic segmentation of UAV remote sensing images based on edge feature fusing and multi-level upsampling integrated with Deeplabv3+. PLoS ONE 2023, 18, e0279097. [Google Scholar] [CrossRef]
  17. Han, H.; Feng, Z.; Du, W.; Guo, S.; Wang, P.; Xu, T. Remote sensing image classification based on multi-spectral cross-sensor super-resolution combined with texture features: A case study in the Liaohe planting area. IEEE Access 2024, 12, 16830–16843. [Google Scholar] [CrossRef]
  18. Hofmann, P.; Blaschke, T.; Strobl, J. Quantifying the robustness of fuzzy rule sets in object-based image analysis. Int. J. Remote Sens. 2011, 32, 7359–7381. [Google Scholar] [CrossRef]
  19. Huang, X.; Zhang, L. An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery. IEEE Trans. Geosci. Remote Sens. 2012, 51, 257–272. [Google Scholar] [CrossRef]
  20. Yan, S.; Yao, X.; Zhu, D.; Liu, D.; Zhang, L.; Yu, G.; Gao, B.; Yang, J.; Yun, W. Large-scale crop mapping from multi-source optical satellite imageries using machine learning with discrete grids. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102485. [Google Scholar] [CrossRef]
  21. Go, S.H.; Park, J.H. Improving field crop classification accuracy using GLCM and SVM with UAV-acquired images. Korean J. Remote Sens. 2024, 40, 93–101. [Google Scholar]
  22. Wang, M.; Huang, L.; Tang, B.H.; Yu, Y.; Zhang, Z.; Wu, Q.; Cheng, J. Mapping cropland in Yunnan Province during 1990–2020 using multi-source remote sensing data with the Google Earth Engine Platform. Geocarto Int. 2024, 39, 2392848. [Google Scholar] [CrossRef]
  23. Saini, R. Integrating vegetation indices and spectral features for vegetation mapping from multispectral satellite imagery using AdaBoost and random forest machine learning classifiers. Geomat. Environ. Eng. 2023, 17, 57–74. [Google Scholar] [CrossRef]
  24. Wan, L.; Kendall, A.D.; Rapp, J.; Hyndman, D.W. Mapping agricultural tile drainage in the US Midwest using explainable random forest machine learning and satellite imagery. Sci. Total Environ. 2024, 950, 175283. [Google Scholar] [CrossRef]
  25. Thanh Noi, P.; Kappas, M. Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using Sentinel-2 imagery. Sensors 2017, 18, 18. [Google Scholar] [CrossRef]
  26. Moharram, M.A.; Sundaram, D.M. Spatial–spectral hyperspectral images classification based on Krill Herd band selection and edge-preserving transform domain recursive filter. Appl. Remote Sens. 2022, 16, 044508. [Google Scholar] [CrossRef]
  27. Aziz, N.; Minallah, N.; Hasanat, M.; Ajmal, M. Geographic Object-based Image Analysis for Small Farmlands using Machine Learning Techniques on Multispectral Sentinel-2 Data. Proc. Pak. Acad. Sci. A Phys. Comput. Sci. 2024, 61, 41–49. [Google Scholar] [CrossRef]
  28. Rangel, R.; Lourenço, V.; Oldoni, L.; Bonamigo, A.; Santos, W.; Oliveira, B.; Barreto, M. A unified framework for cropland field boundary detection and segmentation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 636–644. [Google Scholar]
  29. Shen, Q.; Deng, H.; Wen, X.; Chen, Z.; Xu, H. Statistical texture learning method for monitoring abandoned suburban cropland based on high-resolution remote sensing and deep learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3060–3069. [Google Scholar] [CrossRef]
  30. Papadopoulou, E.; Mallinis, G.; Siachalou, S.; Koutsias, N.; Thanopoulos, A.; Tsaklidis, G. Agricultural land cover mapping through two deep learning models in the framework of EU’s CAP activities using sentinel-2 multitemporal imagery. Remote Sens. 2023, 15, 4657. [Google Scholar] [CrossRef]
  31. Li, H.; Du, Y.; Xiao, X.; Chen, Y. Remote Sensing Identification Method of cropland at Hill County of Sichuan Basin Based on Deep Learning. Smart Agric. 2024, 6, 34. [Google Scholar]
  32. Voelsen, M.; Lauble, S.; Rottensteiner, F.; Heipke, C. Transformer Models for Multi-Temporal Land Cover Classification Using Remote Sensing Images. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 10, 981–990. [Google Scholar] [CrossRef]
  33. Yang, S. Performance and Analysis of FCN, U-Net, and SegNet in Remote Sensing Image Segmentation Based on the LoveDA Dataset. ITM Web Conf. 2025, 70, 03023. [Google Scholar] [CrossRef]
  34. Liu, Y.; Bai, X.; Wang, J.; Li, G.; Li, J.; Lv, Z. Image semantic segmentation approach based on DeepLabV3 plus network with an attention mechanism. Eng. Appl. Artif. Intell. 2024, 127, 107260. [Google Scholar] [CrossRef]
  35. Gao, X.; Liu, L.; Gong, H. MMUU-Net: A robust and effective network for farmland segmentation of satellite imagery. J. Phys. Conf. Ser. 2020, 1651, 012189. [Google Scholar] [CrossRef]
  36. Hu, L.; Qin, M.; Zhang, F.; Du, Z.; Liu, R. RSCNN: A CNN-based method to enhance low-light remote-sensing images. Remote Sens. 2020, 13, 62. [Google Scholar] [CrossRef]
  37. Popel, M.; Bojar, O. Training tips for the transformer model. arXiv 2018, arXiv:1804.00247. [Google Scholar] [CrossRef]
  38. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  39. Qi, L.; Zuo, D.; Wang, Y.; Tao, Y.; Tang, R.; Shi, J.; Gong, J.; Li, B. Convolutional neural network-based method for agriculture plot segmentation in remote sensing images. Remote Sens. 2024, 16, 346. [Google Scholar] [CrossRef]
  40. Lingwal, S.; Bhatia, K.; Singh, M. Semantic segmentation of landcover for cropland mapping and area estimation using Machine Learning techniques. Data Intell. 2023, 5, 370–387. [Google Scholar] [CrossRef]
  41. Zhang, H. Automatic Extraction of Non-grain and Non-agriculturalization Use Patterns of Cultivated Land Based on Satellite Remote Sensing Images. Geomatics. Spat. Inf. Technol. 2025, 6, 87–90. (In Chinese) [Google Scholar]
  42. Xie, Y.; Zeng, H.; Tian, F.; Zhang, M.; Hu, Y. Study on sample dependence and model space extrapolation of crop remote sensing classification. Nat. Remote Sens. Bull. 2024, 28, 2878–2895. (In Chinese) [Google Scholar]
  43. Zhang, X.; Li, S.; Wang, X.; Song, K.; Chen, Z.; Zheng, K. Quantitative remote sensing retrieval of soil total nitrogen in Suihua City, Heilongjiang Province Based on sentinel-2 satellite image. Trans. Chin. Soc. Agric. Eng. 2023, 39, 144–151. (In Chinese) [Google Scholar]
  44. Peng, Z.; Huang, W.; Gu, S.; Xie, L.; Wang, Y.; Jiao, J.; Ye, Q. Conformer: Local features coupling global representations for visual recognition. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: Piscataway, NJ, USA, 2021; pp. 367–376. [Google Scholar]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  46. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022. [Google Scholar]
  47. Huang, J.; Fang, Y.; Wu, Y.; Wu, H.; Gao, Z.; Li, Y.; Del Ser, J.; Xia, J.; Yang, G. Swin transformer for fast MRI. Neurocomputing 2022, 493, 281–304. [Google Scholar] [CrossRef]
  48. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  49. Xia, L.; Mi, S.; Zhang, J.; Luo, J.; Shen, Z.; Cheng, Y. Dual-stream feature extraction network based on CNN and transformer for building extraction. Remote Sens. 2023, 15, 2689. [Google Scholar] [CrossRef]
  50. Hu, S.; Gao, F.; Zhou, X.; Dong, J.; Du, Q. Hybrid convolutional and attention network for hyperspectral image denoising. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5504005. [Google Scholar] [CrossRef]
  51. Yang, Y.; Zhou, Y.; Chen, Y.; Zhang, Z.; Ma, Z.; Yuan, C.; Li, B.; Song, L.; Gao, J.; Li, P.; et al. DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval. arXiv 2025, arXiv:2505.17796. [Google Scholar]
  52. Pu, M.; Huang, Y.; Liu, Y.; Guan, Q.; Ling, H. Edter: Edge detection with transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1402–1412. [Google Scholar]
  53. Singh, N.J.; Nongmeikapam, K. Semantic segmentation of satellite images using deep-unet. Arab. J. Sci. Eng. 2023, 48, 1193–1205. [Google Scholar] [CrossRef]
  54. Zhao, H.; Shi, J.; Qi, X.; Wang, H.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2881–2890. [Google Scholar]
  55. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Computer Vision—ECCV 2018; Springer: Cham, Switzerland, 2018; pp. 801–818. [Google Scholar]
  56. Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 2024, 97, 103280. [Google Scholar] [CrossRef]
  57. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar] [CrossRef]
  58. Rottensteiner, F.; Sohn, G.; Jung, J.; Gerke, M.; Baillard, C.; Benitez, S.; Breitkopf, U. The ISPRS Benchmark on Urban Object Classification and 3D Building Reconstruction. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. 2012, I-3, 293–298. [Google Scholar] [CrossRef]
  59. Zhang, J.; He, Y.; Yuan, L.; Liu, P.; Zhou, X.; Huang, Y. Machine learning-based spectral library for crop classification and status monitoring. Agron. J. 2019, 9, 496. [Google Scholar] [CrossRef]
  60. Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. Joint Deep Learning for land cover and land use classification. Remote Sens. Environ. 2019, 221, 173–187. [Google Scholar] [CrossRef]
  61. Li, W.; Dong, R.; Fu, H.; Yu, L. Large-scale oil palm tree detection from high-resolution satellite images using two-stage convolutional neural networks. Remote Sens. 2018, 11, 11–31. [Google Scholar] [CrossRef]
  62. Xu, Y.; Xue, X.; Sun, Z.; Gu, W.; Cui, L.; Jin, Y.; Lan, Y. Deriving agricultural field boundaries for crop management from satellite images using semantic feature pyramid network. Remote Sens. 2023, 15, 2937. [Google Scholar] [CrossRef]
  63. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Computer Vision—ECCV 2022 Workshops; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  64. Liu, B.; Wang, W.; Wu, Y.; Gao, X. Attention Swin Transformer UNet for Landslide Segmentation in Remotely Sensed Images. Remote Sens. 2024, 16, 44–64. [Google Scholar] [CrossRef]
Figure 1. Study area of Suihua City.
Figure 2. Partially labeled samples of the self-labeled dataset (the base image is the PlanetScope remote sensing image; the color frame is the label sample).
Figure 3. Original imagery and arable land label map for the GID dataset. (a) Image; (b) Ground truth.
Figure 4. Main structure of dual encoder model. (a) ResNet50 structure; (b) W-MSA and SW-MSA structures.
Figure 5. CAFM-Net network architecture: a dual-branch structure combining the CNN and transformer branches with the CAFM and EH modules.
Figure 6. CAFM model architecture (X_T ∈ ℝ^(B×C_T×h×W) is the transformer-branch feature, X_C ∈ ℝ^(B×C_C×h×W) is the CNN-branch feature, B is the batch size, and h and W are the spatial dimensions of the feature map).
Figure 7. Experimental results of ablation based on self-labeled dataset; the red frame is false positive (FP), and the blue frame is false negative (FN).
Figure 8. Ablation experiment results for GID public dataset; the red frame is false positive (FP), and the blue frame is false negative (FN).
Figure 9. Comparative experiment results based on self-labeled dataset; the red frame is false positive (FP), and the blue frame is false negative (FN).
Figure 10. Comparative experiment results based on GID public dataset; the red frame is false positive (FP), the blue frame is false negative (FN).
Table 1. Technical specifications of the PlanetScope satellite.
| Country of Origin | Orbit | Spectral Band | Spatial Resolution | Revisit Period | Width |
|---|---|---|---|---|---|
| America | Sun-synchronous orbit (465–700 km); International Space Station orbit (about 420 km) | Red: 610–700 nm; Green: 500–590 nm; Blue: 420–530 nm; Near-infrared: 760–860 nm | 3–5 m | 1–2 days | 24 km |
Table 2. Ablation experiments for self-labeled dataset (Modules marked with ‘√’ were added).
| Ablation Experiment | CNN | CAFM | EH | OA (%) | Precision (%) | Recall (%) | F1_Score (%) | Dice (%) | IOU (%) |
|---|---|---|---|---|---|---|---|---|---|
| Test 1 | √ | | | 93.60 | 92.11 | 95.33 | 93.75 | 93.75 | 88.24 |
| Test 2 | √ | √ | | 96.71 | 96.10 | 97.43 | 96.76 | 96.76 | 93.73 |
| Test 3 | √ | | √ | 93.73 | 92.72 | 95.00 | 93.85 | 93.85 | 88.41 |
| Test 4 | √ | √ | √ | 96.75 | 96.27 | 97.32 | 96.80 | 96.80 | 93.79 |
Table 3. Ablation experiments for GID public dataset (Modules marked with ‘√’ were added).
| Ablation Experiment | CNN | CAFM | EH | OA (%) | Precision (%) | Recall (%) | F1_Score (%) | Dice (%) | IOU (%) |
|---|---|---|---|---|---|---|---|---|---|
| Test 1 | √ | | | 92.20 | 89.56 | 94.53 | 94.13 | 94.13 | 88.92 |
| Test 2 | √ | √ | | 94.54 | 93.45 | 95.05 | 94.25 | 94.25 | 89.12 |
| Test 3 | √ | | √ | 92.38 | 89.95 | 94.33 | 92.09 | 92.09 | 85.34 |
| Test 4 | √ | √ | √ | 94.58 | 94.97 | 93.42 | 94.19 | 94.19 | 89.02 |
Table 4. Summary of the efficiency comparison results.
| Comparative Experiment | Parameters (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|
| UNet | 31.4 | 49.8 | 21.6 |
| PSPNet | 47.2 | 172.6 | 46.3 |
| Deeplabv3+ | 41.1 | 142.9 | 38.7 |
| Swin Transformer | 28.3 | 92.4 | 33.1 |
| TransUNet | 102.7 | 196.8 | 63.9 |
| CAFM-Net | 69.8 | 153.6 | 44.5 |
Input size = 512 × 512 and batch size = 1; GPU = NVIDIA RTX 4080 Ti.
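Per-image latencies such as those in Table 4 are conventionally obtained by averaging many repeated forward passes after a warm-up phase that absorbs one-off costs (lazy initialization, kernel compilation). The paper does not publish its timing harness, so the following is only an illustrative stdlib sketch; the function name `time_inference` and the toy callable are hypothetical stand-ins, not CAFM-Net itself.

```python
import time
import statistics

def time_inference(fn, n_warmup=10, n_runs=100):
    """Average wall-clock latency (ms) of a callable.

    Warm-up runs are executed first and discarded so that one-off
    start-up costs do not inflate the reported mean.
    """
    for _ in range(n_warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # seconds -> ms
    return statistics.mean(samples)

# Toy stand-in for a model forward pass on a 512 x 512 input.
latency_ms = time_inference(lambda: sum(range(10_000)))
print(f"mean latency: {latency_ms:.3f} ms")
```

Note that for GPU inference the measured call must be synchronized (e.g., `torch.cuda.synchronize()` in PyTorch) before each timestamp, since CUDA kernels launch asynchronously; a wall-clock timer alone would otherwise under-report the latency.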
Table 5. Accuracy evaluation of comparative experiments on the self-labeled dataset.

| Comparative Experiment | F1_Score (%) | Dice (%) | IoU (%) | OA (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| UNet | 87.82 | 87.82 | 78.61 | 89.74 | 90.01 | 86.31 |
| PSPNet | 77.79 | 77.79 | 64.81 | 82.51 | 83.82 | 75.61 |
| Deeplabv3+ | 86.79 | 86.79 | 76.83 | 81.50 | 83.87 | 90.74 |
| Swin Transformer | 76.77 | 76.77 | 62.61 | 78.03 | 76.33 | 79.79 |
| TransUNet | 90.23 | 90.23 | 82.41 | 91.66 | 91.79 | 89.10 |
| CAFM-Net | 96.80 | 96.80 | 93.79 | 96.75 | 96.27 | 97.32 |
Table 6. Accuracy evaluation of comparative experiments on the GID public dataset.

| Comparative Experiment | F1_Score (%) | Dice (%) | IoU (%) | OA (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| UNet | 89.95 | 89.95 | 81.82 | 90.48 | 89.80 | 90.12 |
| PSPNet | 77.79 | 77.79 | 64.81 | 82.51 | 83.82 | 75.43 |
| Deeplabv3+ | 89.04 | 89.04 | 80.59 | 86.48 | 85.99 | 92.96 |
| Swin Transformer | 81.96 | 81.96 | 69.99 | 83.25 | 82.30 | 82.07 |
| TransUNet | 83.11 | 83.11 | 71.22 | 83.70 | 82.68 | 83.99 |
| CAFM-Net | 94.19 | 94.19 | 89.02 | 94.58 | 94.97 | 93.42 |
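The metrics in Tables 2–6 follow the standard pixel-wise definitions over the binary confusion matrix (TP, FP, FN, TN); for single-class masks the Dice coefficient is algebraically identical to the F1-score, which is why those two columns coincide throughout. A minimal pure-Python sketch of these definitions (the function name is illustrative; masks are 0/1 nested lists):

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise binary segmentation metrics from 0/1 masks of equal shape."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1          # cropland predicted and present
            elif p and not t:
                fp += 1          # cropland predicted, background in truth
            elif t:
                fn += 1          # cropland missed
            else:
                tn += 1          # background correctly rejected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "OA": (tp + tn) / (tp + fp + fn + tn),
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "Dice": 2 * tp / (2 * tp + fp + fn),   # equals F1 for binary masks
        "IoU": tp / (tp + fp + fn),
    }
```

For example, a 2 × 2 prediction with one true positive, one false positive, and one false negative yields OA = 0.5, F1 = Dice = 0.5, and IoU = 1/3, illustrating that IoU is always the strictest of the overlap scores.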
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ren, J.; Jing, Y.; Zheng, X.; Li, S.; Li, K.; Mu, G. Cropland Extraction Based on PlanetScope Images and a Newly Developed CAFM-Net Model. Remote Sens. 2026, 18, 646. https://doi.org/10.3390/rs18040646


