Article

Effect Analysis of Spectral and Spatial Variations on Attention-Based Cropland Extraction Networks

1 School of Geographic Sciences, Xinyang Normal University, Xinyang 464000, China
2 Key Laboratory of Investigation, Monitoring, Protection and Utilization for Cultivated Land Resources, MNR, Chengdu 610045, China
3 Sichuan Institute of Land Science and Technology (Sichuan Center of Satellite Application Technology), Chengdu 610045, China
4 School of Electronic Information and Artificial Intelligence, Wuzhou University, Wuzhou 543002, China
5 College of Surveying and Geo-Informatics, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
6 Xinyang Key Laboratory of Land Surface Ecological Remote Sensing Monitoring in the Huai River Basin, Xinyang 464000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(10), 1501; https://doi.org/10.3390/rs18101501
Submission received: 7 April 2026 / Revised: 5 May 2026 / Accepted: 6 May 2026 / Published: 10 May 2026

Highlights

What are the main findings?
  • Spectral and spatial resolutions exhibit clear linear relationships with cropland segmentation accuracy in attention-based models.
  • A spectral–spatial coupling model based on Iso-IoU effectively quantifies the trade-off between band number and spatial resolution.
What are the implications of the main findings?
  • Spectral information can partially compensate for spatial resolution loss, especially for models with stronger spectral utilization capability.
  • The proposed framework provides practical guidance for optimizing input configurations and model selection in agricultural remote sensing applications.

Abstract

Accurate extraction of cropland is essential for optimizing regional land-use structure and ensuring food security. Although attention-based deep learning has advanced cropland extraction, the lack of a quantitative framework for evaluating the trade-off between spectral band count and spatial resolution hinders optimal sensor configuration. To address this gap, we employ two representative attention-based segmentation networks, BsiNet and REAUnet, to conduct controlled spectral–spatial variation experiments, and propose an equivalent IoU (Iso-IoU) model to quantify their complementary relationship. By conducting experiments with multiple band combinations and multi-scale spatial resolutions, we quantitatively evaluate the respective contributions of spectral and spatial information to model performance and further analyze their coupling relationship. The results show that: (1) model performance is positively correlated with spectral richness (i.e., band count), with four-band configurations achieving an IoU improvement of approximately 1.5–4% over single-band inputs. While the inclusion of the near-infrared (NIR) band consistently yields the highest accuracy within each band-count group, the total number of available spectral bands remains the primary driver of segmentation performance; (2) model performance is more sensitive to spatial resolution, with the IoU decreasing by about 5–7% on average when the spatial resolution is degraded to one-quarter of the original; (3) a quantifiable complementary relationship exists between spectral band combinations and spatial resolution, which can be described by the proposed Iso-IoU model; (4) the two attention-based networks examined in this study exhibit stable error tendencies in cropland extraction, with consistent false-positive and false-negative patterns. These findings provide practical guidance for cropland extraction from remote sensing images.
Prioritizing NIR information and maintaining sufficient spatial resolution are critical for preserving segmentation accuracy, while the Iso-IoU model enables quantitative optimization of spectral–spatial configurations under sensor constraints.

1. Introduction

Cropland is a fundamental resource for agricultural production, and its sustainable use relates to both national food security [1] and ecosystem balance [2]. Therefore, accurate mapping of cropland distribution is of great importance for food-security assessment, agricultural management optimization, and macro-level decision-making [3]. With the rapid development of high-resolution remote sensing imagery, deep learning-based semantic segmentation has become one of the mainstream approaches for cropland extraction [4]. Representative deep neural network (DNN) models, such as U-Net, DeepLabv3+, and SegNet [5,6,7], have shown strong performance in extracting land-cover objects [8,9,10,11]. These methods can learn end-to-end mappings from image pixels to semantic labels, thereby reducing dependence on manually designed features [12,13]. Traditional handcrafted methods rely heavily on manually designed features and segmentation parameters to obtain spatial information of agricultural parcels [14,15]. In contrast, DNN-based methods can automatically learn hierarchical spatial textures and spectral characteristics, and thus exhibit stronger generalization ability in complex backgrounds and fragmented parcel scenarios [16,17,18]. More recently, attention mechanisms embedded in segmentation networks have further improved the representation of cropland boundaries and internal structures by adaptively emphasizing informative features [19], and they have been widely applied in various remote sensing tasks [20,21]. Technically, these mechanisms enhance the extraction of discriminative features by modeling two key dimensions of remote sensing data: the spectral and the spatial [22]. They are generally categorized into two types: channel attention (CA), which recalibrates feature responses along the channel dimension, and spatial attention (SA) [23,24], which assigns different weights to spatial positions.
CA typically estimates the importance of feature channels through global statistics, thereby emphasizing more informative spectral and semantic responses [25]. In contrast, SA highlights target-related regions by modeling spatial correlations or local response differences across positions [26,27]. These two types of attention correspond, respectively, to modeling spectral (or feature) responses and spatial structures (or boundary cues). These properties make them particularly suitable for cropland extraction tasks, which rely on both spectral separability and parcel morphology [28]. However, despite the potential of attention mechanisms, their practical application is still constrained by the inherent complexities of remote sensing imagery. The commonly observed phenomena of “same object with different spectra” and “different objects with the same spectrum” in high-resolution images [29,30], as well as issues such as irregular cropland morphology, complex terrain, and high parcel fragmentation, pose significant challenges to accurate cropland segmentation [31,32].
Although attention-based networks have achieved significant progress in cropland extraction, the effect of the spatial resolution and spectral bands of remote sensing imagery on model performance remains insufficiently understood [33,34]. Firstly, spectral bands determine the dimensionality of spectral information available to the model [35], and different band combinations play different roles in distinguishing cropland from similar land-cover types such as bare soil and roads [36,37]. Moreover, spatial resolution directly affects texture representation and boundary clarity, thereby providing an essential basis for the accurate delineation of cropland parcels [38,39]. In practical remote sensing systems, the number of spectral bands and spatial resolution often involve a trade-off. Imagery with higher spatial resolution can better preserve parcel boundaries, but usually contains limited spectral information [40,41]; images with more spectral bands provide richer spectral information and can help distinguish land-cover types, yet they are often associated with lower spatial resolution, which tends to cause blurred boundaries [42,43]. Existing studies mostly focus on network architecture design or performance improvements under a single-data-source condition. Nevertheless, few studies systematically examine, from a model-oriented perspective, how variations in spectral band combinations and spatial resolution affect the performance of cropland extraction. In particular, when attention-based networks are applied to cropland extraction, quantitative investigations into whether these two factors exhibit a complementary or trade-off relationship remain limited [44,45,46]. Moreover, it is still unclear whether attention mechanisms exhibit stable error tendencies under different resolution combinations and, if so, what factors drive such tendencies. Current research in cropland extraction generally follows two distinct paths. 
One is model-centric, focusing on improving performance through architecture refinements such as attention mechanisms and multi-scale feature fusion [47,48]. Though these approaches achieve high accuracy, the influence of intrinsic data properties—particularly the interaction between spectral and spatial resolution—on model behavior is often not explicitly analyzed. The other is data-centric, which focuses on integrating very-high-resolution (VHR) imagery, multi-temporal sequences, or ancillary data like digital elevation models (DEM) and synthetic aperture radar (SAR) [49,50]. However, these studies mainly emphasize performance improvements brought by additional data, while the underlying coupling relationships among different data characteristics are less explored. A key research gap lies in the lack of a unified understanding of how spectral and spatial variations jointly affect attention-based models in cropland extraction. By proposing an equivalent Intersection over Union (Iso-IoU) model, we bridge this gap and provide a quantitative framework to evaluate the joint effects of spectral and spatial variations.
By evaluating two representative attention-based networks, BsiNet [51] and REAUnet [52], on datasets from the Jilin-1 and GF-2 satellites, we systematically investigate the effects of spectral band combinations and spatial resolution on cropland segmentation performance. The two models are selected to represent different attention design paradigms: BsiNet adopts a multi-task learning strategy with boundary and distance supervision, while REAUnet incorporates coupled spatial and channel attention mechanisms. These representative attention-based networks enable a comparative analysis of their responses to spectral and spatial variations. Specifically, we aim to: (1) quantify the independent and coupled effects of spectral and spatial variations on cropland segmentation accuracy using a controlled experimental framework; (2) propose a spectral–spatial coupling analysis method based on the Iso-IoU model, thereby providing a quantitative perspective for characterizing the linear complementary relationship between spectral and spatial information and offering practical guidance for identifying configurations that deliver comparable performance; (3) introduce an error tendency analysis method to reveal the stable patterns and structural consistency of false-positive (FP) and false-negative (FN) errors in cropland extraction.
The remainder of this paper is organized as follows: Section 2 introduces the study areas and datasets; Section 3 presents the deep learning models and experimental methods; Section 4 presents the experimental design; Section 5 reports the experimental results; Section 6 provides comprehensive discussions of the findings and limitations; and Section 7 concludes the paper with future work.

2. Study Areas and Datasets

To systematically analyze the impacts of variations in spectral band combinations and spatial resolution on cropland extraction performance, we employ two types of high-resolution remote sensing imagery with different imaging characteristics and scene complexities: the iFLYTEK public cropland dataset [53] and the Shanglin dataset, acquired by the Jilin-1 and GF-2 satellites, respectively. The two datasets differ substantially in spatial resolution, scene complexity, and cropland morphology, thereby providing a suitable basis for comparative analysis of spectral and spatial variation effects.
The iFLYTEK cropland dataset originates from the 2021 iFLYTEK Challenge. The dataset contains four bands: blue (B), green (G), red (R), and near-infrared (NIR), with a spatial resolution (i.e., ground sampling distance, GSD) ranging from 0.75 to 1.1 m. The iFLYTEK dataset covers multiple typical agricultural regions in China, particularly in mid-to-high-latitude regions, with good diversity across regions and climate zones. As shown in Figure 1, the dataset includes 31 large remote sensing scenes, each with a size of at least 3000 × 3000 pixels, together with corresponding cropland parcel labels in Shapefile format. Most cropland parcels in this dataset are relatively regular and large, and fragmented and small parcels are less common; however, the boundaries between adjacent parcels are often relatively blurred.
The Shanglin dataset is derived from GF-2 imagery, and the study area is located in Shanglin County, Guangxi, China. The region is characterized by evident terrain undulations, dominated by low and medium mountains (as shown in Figure 2). Croplands are interwoven with other land-cover types, such as forest, bare land, and roads, resulting in strong textural heterogeneity and highly fragmented structures. These characteristics make the dataset particularly suitable for analyzing the effect of spatial resolution variation on model performance. The GSDs of the GF-2 panchromatic and multispectral bands are 0.8 m and 3.2 m, respectively. In this study, we use the provided 0.8 m standardized multispectral imagery, which undergoes standard preprocessing steps including radiometric calibration, atmospheric correction, and spatial fusion. All images are subsequently resampled to a unified spatial resolution for consistency across experiments. To ensure label reliability, cropland boundaries are meticulously annotated through a multi-tier manual visual interpretation workflow in ArcMap 10.8. Initially, independent interpretation is conducted by trained researchers with the aid of national geographic condition survey data. Subsequently, any ambiguous regions are resolved through consensus discussions, and the final annotations are rigorously reviewed by a senior expert to ensure spatial plausibility. This comprehensive quality control process resulted in a highly reliable pixel-level dataset containing 159,273 agricultural parcels.
To ensure fair evaluation and sufficient generalization, a spatially independent split strategy is adopted for both datasets. The Shanglin dataset is divided into four spatial quadrants, with the northwest quadrant used as the test set and the remaining three quadrants used as the training set. For the iFLYTEK dataset, 25 image scenes are used as the training set and the remaining 6 scenes are used as the test set. All images and labels are cropped into 256 × 256-pixel patches using a sliding-window strategy with an overlap of 10%. After cropping, the training set patches are further divided into training and validation subsets at a ratio of 8:2 through random sampling. While the 10% sliding-window overlap inevitably introduces minor shared boundaries between adjacent patches within the training and validation pools, the final performance evaluation is strictly conducted on the spatially independent test set. This strategy helps reduce performance overestimation caused by spatial autocorrelation and ensures that the experimental results more reliably reflect model performance in unseen areas. The key parameters and statistics of the datasets are summarized in Table 1.
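The sliding-window cropping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `patch_origins` and the final edge-flush patch are illustrative choices, not the authors' exact implementation.

```python
def patch_origins(size, patch=256, overlap=0.10):
    """Top-left coordinates of sliding-window patches along one image axis."""
    stride = int(patch * (1.0 - overlap))  # 230 px for a 10% overlap
    origins = list(range(0, size - patch + 1, stride))
    # add a final patch flush with the image edge so no pixels are dropped
    if origins[-1] + patch < size:
        origins.append(size - patch)
    return origins

# e.g., one axis of a 3000-pixel iFLYTEK scene
xs = patch_origins(3000)
```

With a 256-pixel patch and 10% overlap the stride is 230 pixels, so consecutive patches share a 26-pixel band.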

3. Models

To systematically investigate the effects of spectral and spatial variations on cropland segmentation, we employ two representative attention-based segmentation networks, BsiNet and REAUnet. Both models are built on the U-Net encoder-decoder architecture, but they differ in their attention designs [54]. Specifically, BsiNet mainly improves feature representations through group-wise feature reweighting, whereas REAUnet explicitly combines CA and SA for joint feature refinement. Therefore, these two models can be regarded as representatives of two typical attention strategies: the former emphasizes lightweight recalibration of feature channels based on grouped features, while the latter emphasizes explicit spectral–spatial joint modeling. Comparing their performance under different spectral and spatial conditions can help reveal how different attention designs affect model responses and error patterns.

3.1. BsiNet

BsiNet is a multi-task segmentation network that jointly predicts cropland masks, parcel boundaries, and distance maps through a shared encoder and multiple decoder branches. This design strengthens the representation of parcel contours and spatial structures, and is particularly beneficial for cropland scenes with fragmented parcels and blurred boundaries. BsiNet serves as a representative model for group-wise feature reweighting. Its core enhancement component is the Spatial Group-wise Enhance (SGE) module [55]. This module performs feature modulation by dividing channels into multiple groups. It then generates a shared spatial response for each individual group to refine the features. Generally, the feature refinement process of a CA module can be written as:
$$X' = A(X) \odot X$$
where $X \in \mathbb{R}^{C \times H \times W}$ denotes the input feature map, $A(X)$ denotes the attention weights inferred from $X$, and $\odot$ denotes element-wise multiplication.
Unlike conventional CA, which usually assigns one scalar weight to each channel, SGE adopts a group-wise spatial reweighting strategy. Specifically, the input feature map is first divided into $g$ groups along the channel dimension:
$$X = [X_1, X_2, \ldots, X_g], \quad X_i \in \mathbb{R}^{C/g \times H \times W}$$
For the $i$-th feature group $X_i$, SGE computes a shared spatial attention map:
$$M_i = \sigma(\phi(X_i)), \quad M_i \in \mathbb{R}^{1 \times H \times W}$$
where $\phi(\cdot)$ denotes the group-wise mapping function and $\sigma(\cdot)$ is the sigmoid activation function. The attention map is then used to recalibrate the grouped feature:
$$X_i' = X_i \odot M_i$$
where $M_i$ is broadcast along the channel dimension and shared by all channels within group $i$. Finally, all enhanced groups are concatenated to form the output feature map:
$$X' = \mathrm{Concat}(X_1', X_2', \ldots, X_g')$$
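The group-wise recalibration above can be sketched in NumPy. This is a simplified illustration, assuming the mapping function is the dot product between each position's feature vector and the group's global-average descriptor; the actual SGE module also applies normalization and learnable scale/shift parameters, which are omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sge_forward(x, g):
    """Group-wise spatial enhancement sketch.
    x: feature map of shape (C, H, W); g: number of channel groups.
    Each group gets one shared spatial attention map that rescales
    all of its channels."""
    C, H, W = x.shape
    groups = x.reshape(g, C // g, H, W)
    out = np.empty_like(groups)
    for i in range(g):
        xi = groups[i]                                  # (C/g, H, W)
        desc = xi.mean(axis=(1, 2), keepdims=True)      # group descriptor
        # similarity of each position to the descriptor -> spatial map
        m = sigmoid((xi * desc).sum(axis=0, keepdims=True))  # (1, H, W)
        out[i] = xi * m                                 # broadcast over channels
    return out.reshape(C, H, W)
```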
Figure 3 illustrates the SGE module structure. The module recalibrates features along the channel dimension using a grouping strategy. This approach enables BsiNet to enhance locally consistent responses while simultaneously suppressing noisy and irrelevant activations within each channel group. Although BsiNet does not explicitly couple channel attention and spatial attention, it still modulates CA and SA through the group mechanism: each grouped feature is calibrated along the channel dimension through CA with a shared weight, while the weight is derived from intra-group SA.

3.2. REAUnet

Unlike BsiNet which enhances features through group-wise reweighting, REAUnet is a representative model that integrates channel-spatial attention by coupling CA and SA within the same attention block. This design allows the model to simultaneously exploit channel-wise spectral responses and spatial structural cues [56]. This design is particularly suitable for cropland extraction, where accurate segmentation depends not only on spectral separability among land-cover types but also on spatial textures such as parcel boundaries and shapes. Distinguishing parcels from other land-cover types primarily relies on spectral bands, whereas the parcel layouts are determined by image textures, which are closely related to spatial resolution.
REAUnet is also built on a U-Net encoder-decoder architecture. During encoding, residual convolutional blocks, edge enhancement modules, and channel-spatial attention blocks (AttBLK) are introduced sequentially. The core attention module of REAUnet is illustrated in Figure 4. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the channel attention branch first computes two channel descriptors through global average pooling and global max pooling along the spatial dimensions, $f_{\mathrm{avg}}^{c} = \mathrm{GAP}(X)$ and $f_{\mathrm{max}}^{c} = \mathrm{GMP}(X)$, with $f_{\mathrm{avg}}^{c}, f_{\mathrm{max}}^{c} \in \mathbb{R}^{C \times 1 \times 1}$. These two descriptors are then fed into a shared multilayer perceptron (MLP) and fused to generate the channel attention map:
$$M_c = \sigma\big(\mathrm{MLP}(f_{\mathrm{avg}}^{c}) + \mathrm{MLP}(f_{\mathrm{max}}^{c})\big)$$
where $M_c \in \mathbb{R}^{C \times 1 \times 1}$ denotes the channel attention map and $\sigma(\cdot)$ is the sigmoid activation function. The CA branch highlights channels with stronger semantic responses and suppresses less informative (e.g., noisy) feature dimensions.
The SA branch focuses on the importance of different spatial locations. Average pooling and max pooling are first performed along the channel dimension to obtain two spatial descriptors, $f_{\mathrm{avg}}^{s} = \mathrm{Avg}_c(X)$ and $f_{\mathrm{max}}^{s} = \mathrm{Max}_c(X)$, where $f_{\mathrm{avg}}^{s}, f_{\mathrm{max}}^{s} \in \mathbb{R}^{1 \times H \times W}$. Through channel-wise concatenation, the combined spatial descriptor is formulated as $F_s = [f_{\mathrm{avg}}^{s}; f_{\mathrm{max}}^{s}]$, $F_s \in \mathbb{R}^{2 \times H \times W}$. The SA map is then obtained by
$$M_s = \sigma\big(\mathrm{Conv}(F_s)\big)$$
where $M_s \in \mathbb{R}^{1 \times H \times W}$ denotes the SA map, and $\mathrm{Conv}(\cdot)$ denotes a two-dimensional convolution operator. The SA branch emphasizes regions that are more relevant to parcel structures and boundary delineation. Finally, REAUnet adopts a parallel residual fusion strategy to combine the original features, the channel-refined features, and the spatial-refined features:
$$X' = X + X \odot M_c + X \odot M_s$$
where $\odot$ denotes element-wise multiplication, and $X' \in \mathbb{R}^{C \times H \times W}$ is the enhanced output feature map. AttBLK in REAUnet models channel-wise spectral responses and spatial structural cues simultaneously. These joint attentions allow the network to capture both spectral differences and boundary-related information, thereby maintaining feature separability and ultimately improving segmentation accuracy. In our proposed framework, REAUnet serves as a representative model for analyzing how CA and SA respond to spectral and spatial variations in cropland extraction.
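The AttBLK computation can be sketched in NumPy as follows. This is an illustrative simplification: the shared-MLP weights are passed in explicitly, and the two-dimensional convolution of the SA branch is reduced to a weighted sum of the two spatial descriptors (a 1×1 kernel stand-in), which is not the exact REAUnet layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attblk_forward(x, w1, w2, k_avg=1.0, k_max=1.0):
    """Channel-spatial attention sketch with parallel residual fusion.
    x: feature map (C, H, W); w1, w2: shared-MLP weights;
    k_avg, k_max: stand-in weights for the SA-branch convolution."""
    C, H, W = x.shape
    # channel attention branch: pooled descriptors through a shared MLP
    f_avg, f_max = x.mean(axis=(1, 2)), x.max(axis=(1, 2))   # (C,)
    mlp = lambda f: w2 @ np.maximum(w1 @ f, 0.0)             # ReLU hidden layer
    m_c = sigmoid(mlp(f_avg) + mlp(f_max)).reshape(C, 1, 1)
    # spatial attention branch: channel-wise avg/max descriptors
    s_avg, s_max = x.mean(axis=0), x.max(axis=0)             # (H, W)
    m_s = sigmoid(k_avg * s_avg + k_max * s_max)[None, :, :]
    # parallel residual fusion of original and refined features
    return x + x * m_c + x * m_s
```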

3.3. Implementation Details

To ensure fair comparisons across different spectral and spatial configurations, consistent data preprocessing and dataset splitting strategies are adopted for both datasets. All experiments are implemented in PyTorch 2.4 and conducted on an NVIDIA GeForce RTX 4070 GPU.
For data preparation, all input spectral bands are normalized to the range [0, 1]. During training, REAUnet employs basic geometric augmentations, including random horizontal/vertical flips and 90° rotations, while BsiNet applies no additional spectral perturbations, to preserve the physical integrity of the input bands. To ensure fair comparison across varying spectral configurations (1–4 bands), both models are trained from scratch using Kaiming initialization [57] without pretrained weights. A fixed random seed (seed = 42) is used for dataset splitting and training initialization to ensure reproducibility.
For BsiNet, the Adam optimizer [58] is used with an initial learning rate of 1 × 10−4 and no weight decay. A cosine annealing learning-rate scheduler is applied, consistent with the original implementation. The loss function is a multi-task objective, computed as an equally weighted sum of the losses from the segmentation, boundary-detection, and distance-transform branches.
For REAUnet, the AdamW optimizer [59] is adopted with an initial learning rate of 3 × 10−4 and a weight decay of 1 × 10−4. A ReduceLROnPlateau scheduler is employed, reducing the learning rate by a factor of 0.1 if the validation loss does not improve for 30 consecutive epochs. The loss combines binary cross-entropy and Dice loss, with deep supervision applied to intermediate outputs via a weighted loss strategy.
The batch sizes are set to 12 for BsiNet and 6 for REAUnet, with a maximum of 150 and 200 epochs, respectively. Early stopping is applied with a patience of 50 epochs to mitigate overfitting. For final prediction, a threshold of 0.5 is used to convert probability maps into binary masks.

4. Experiments

The performance of the two DNN models is primarily evaluated using IoU, which is defined as follows:
$$IoU = \frac{TP}{TP + FP + FN}$$
where $TP$ (true positives), $FP$ (false positives), and $FN$ (false negatives) denote the numbers of correctly segmented cropland pixels, falsely segmented pixels, and missed cropland pixels, respectively. In addition, precision, recall, F1-score, and accuracy are used as secondary evaluation metrics. To ensure a rigorous sensitivity analysis, “spectral and spatial variations” are defined as controlled simulations. Specifically, spectral variations are implemented by systematically selecting subsets of the four native multispectral bands (blue, green, red, and NIR) to form 15 different band combinations, without any artificial spectral synthesis, cross-season mixing, or noise addition. Spatial variations, in turn, are generated by downsampling the original high-resolution imagery to three predefined levels (baseline $L = 0$, half resolution $L = 1$, and quarter resolution $L = 2$) to form a controlled grid, without applying any additional blurring or noise injection.
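The evaluation metrics above can be computed directly from binary masks; the following NumPy sketch (function name illustrative) shows the pixel-wise counting.

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel-wise IoU, precision, recall, and F1 for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # correctly segmented cropland
    fp = np.logical_and(pred, ~gt).sum()     # falsely segmented pixels
    fn = np.logical_and(~pred, gt).sum()     # missed cropland pixels
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1
```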
All spectral band combination experiments are conducted under a fixed spatial resolution. For radiometric preprocessing, all spectral bands are normalized using a consistent linear scaling method. The raw gray levels of the 8-bit satellite imagery are rescaled to a range of [0, 1] by applying a constant divisor of 255.0. This strategy ensures that the inherent relative radiometric differences between spectral bands are preserved, allowing the models (BsiNet and REAUnet) to leverage the original spectral contrast and physical information content. This protocol eliminates the risk of artificial distribution shifts that might arise from independent per-band standardization, thereby ensuring the reported spectral sensitivity is a reflection of the actual information gain from additional bands. Different band combinations, ranging from single-band to four-band inputs, are constructed to examine how variations in spectral band combinations affect cropland segmentation performances. All band combinations apply the same dataset splitting, preprocessing pipelines, and training hyperparameters to ensure that any performance differences could be attributed solely to changes in band combinations.
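For reference, the 15 band combinations correspond to the non-empty subsets of the four native bands and can be enumerated with the standard library:

```python
from itertools import combinations

# All non-empty subsets of {B, G, R, NIR}: C(4,1)+C(4,2)+C(4,3)+C(4,4) = 15
bands = ("B", "G", "R", "NIR")
combos = [c for r in range(1, 5) for c in combinations(bands, r)]
```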
Besides, all spatial resolution experiments are performed using the original high-resolution imagery as the baseline. In our experimental framework, the input patch size is fixed to 256 × 256 pixels across all spatial resolution tiers. Though this results in a variable geographic field of view (FOV), where coarser resolutions cover a larger ground area, this design is chosen to maintain a constant model architecture and parameter count. To eliminate spatial bias, all patches are spatially aligned to identical geographic center coordinates, ensuring that the performance variations are primarily attributable to the loss of high-frequency spatial details rather than differences in the underlying land-cover objects. Datasets are generated at different spatial resolutions through multi-scale resampling. Multispectral images are downsampled using bicubic interpolation, whereas label maps are resampled using nearest-neighbor interpolation to preserve semantic consistency. For each spatial resolution setting, the best-performing band combination identified in the spectral experiments is adopted to guarantee the results primarily reflect the influence of spatial resolution variations.
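The degradation protocol can be sketched as follows. Assumptions are flagged: block averaging stands in for the bicubic interpolation used in the paper (bicubic needs an imaging library), and a single-band image is shown for brevity; labels use stride-based nearest-neighbour sampling so they remain binary.

```python
import numpy as np

def degrade(img, label, level):
    """Resolution degradation sketch for level L (downsampling factor s = 2**L).
    img: 2-D float array; label: 2-D binary array of the same shape."""
    s = 2 ** level
    H, W = img.shape
    h, w = H // s, W // s
    # block averaging as a stand-in for bicubic interpolation
    img_lr = img[:h * s, :w * s].reshape(h, s, w, s).mean(axis=(1, 3))
    # nearest-neighbour (stride) sampling preserves semantic labels
    lab_lr = label[::s, ::s][:h, :w]
    return img_lr, lab_lr
```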
Based on the above single-factor experiments, we further conduct spectral–spatial coupling experiments. With the number of bands ($N_{band}$) and the spatial resolution degradation level ($L$) as independent variables and IoU as the response variable, we fit a linear model as follows:
$$IoU \approx c_0 + c_1 \cdot N_{band} + c_2 \cdot L$$
where $L$ denotes the degradation level of the current resolution relative to the original spatial resolution (denoted as $GSD_{base}$). To maintain consistency across the iFLYTEK dataset (native GSD: 0.75–1.1 m), $GSD_{base}$ is defined as the baseline resolution of each individual image. Thus, $L$ is formulated as a dimensionless relative scaling factor, $L = \log_2(s)$, where $s \in \{1, 2, 4\}$ denotes the uniform downsampling factor. This ensures that $L$ quantifies the proportional loss of spatial information, rendering the analysis independent of subtle variations in the absolute GSD across different scenes. For example, $L = 1$ corresponds to a 2× spatial resolution degradation, and $L = 2$ corresponds to a 4× degradation. In this model, $c_0$ is a constant term, $c_1$ represents the accuracy gain obtained by adding one spectral band (spectral gain coefficient), and $c_2$ represents the accuracy loss caused by a twofold degradation in spatial resolution, which is typically negative. Based on the fitted model, Iso-IoU contour lines are drawn to characterize the compensation relationship between the number of spectral bands and spatial resolution at a given accuracy level.
To facilitate understanding of the model construction process, we provide a step-by-step derivation using the performance of BsiNet on the iFLYTEK dataset as an example. At the original spatial resolution ($L = 0$), the average IoU values under one- to four-band inputs are 77.33%, 78.52%, 79.73%, and 80.36%, respectively. A least-squares linear fit between IoU and $N_{band}$ then yields the following relationship:
$$IoU \approx 1.03 \cdot N_{band} + 76.41$$
Thus, the spectral gain coefficient is obtained as $c_1 = 1.03$, indicating that each additional spectral band improves the IoU by approximately 1.03 percentage points. The coefficient of determination for this fit is $R^2 = 0.98$, suggesting a strong linear relationship between IoU and $N_{band}$ (i.e., spectral band count). Similarly, when the four-band input is fixed, the IoU values under $L = 0$, $L = 1$, and $L = 2$ are 80.36%, 79.50%, and 74.97%, respectively. A least-squares linear fit between IoU and $L$ yields:
$$IoU \approx -2.70 \cdot L + 80.97$$
Accordingly, the spatial sensitivity coefficient is obtained as $c_2 = -2.70$, meaning that each twofold degradation in spatial resolution reduces the IoU by approximately 2.70 percentage points. The coefficient of determination for this fit is $R^2 = 0.87$, indicating that the linear approximation still captures the overall trend of performance degradation with increasing degradation level. To ensure that the equivalent model is anchored to the observed baseline configuration, $c_0$ is determined using the reference point ($N_{band} = 4$, $L = 0$, IoU = 80.36%). Substituting $c_1 = 1.03$ into Equation (10), we obtain:
$$c_0 = 80.36 - 1.03 \times 4 = 76.24$$
Therefore, the final equivalent equation for BsiNet on the iFLYTEK dataset is:
$$IoU \approx 76.24 + 1.03 \cdot N_{band} - 2.70 \cdot L$$
Based on the fitted parameters, the number of spectral bands required to maintain a target accuracy level $IoU_{target}$ can be derived from Equation (10):
$$N_{band} \geq \frac{IoU_{target} - c_0 - c_2 \cdot L}{c_1}$$
The Iso-IoU contours, plotted in the $(L, N_{band})$ coordinate system, have slopes that represent the substitution rate between spectral information and spatial detail. This provides a quantitative basis for selecting an appropriate spectral–spatial configuration under sensor resource constraints.
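The worked example above can be reproduced numerically. The least-squares fits below use only the IoU values reported in the text (BsiNet on the iFLYTEK dataset); variable names are illustrative.

```python
import numpy as np

# Mean IoU by band count at the original resolution (L = 0)
n_band = np.array([1, 2, 3, 4])
iou_by_band = np.array([77.33, 78.52, 79.73, 80.36])
# IoU by degradation level with the four-band input fixed
levels = np.array([0, 1, 2])
iou_by_level = np.array([80.36, 79.50, 74.97])

c1, _ = np.polyfit(n_band, iou_by_band, 1)   # spectral gain per extra band
c2, _ = np.polyfit(levels, iou_by_level, 1)  # loss per twofold degradation
c0 = 80.36 - c1 * 4                          # anchor at (N_band = 4, L = 0)

def iso_iou_bands(iou_target, L):
    """Band count needed to keep IoU >= target at degradation level L."""
    return (iou_target - c0 - c2 * L) / c1
```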
In addition, error tendency experiments are designed to investigate the error characteristics of the models under different spectral combination and spatial resolution settings. Specifically, we construct binary error masks for the FPs and FNs, and analyze their spatial overlap across different experimental conditions. The Reference Overlap Rate (ROR) is defined as:
$\mathrm{ROR}(M^{(c)}, M^{(b)}) = \dfrac{|M^{(c)} \cap M^{(b)}|}{|M^{(b)}|} \times 100\%$
where $M^{(c)}$ denotes the FP or FN mask under experimental condition $c$, $M^{(b)}$ denotes the corresponding mask under the baseline condition $b$, and $|\cdot|$ denotes the number of pixels. To ensure statistical stability, the baseline error pixel count $|M^{(b)}|$ is calculated globally across the entire test set; this large denominator avoids the variance instability caused by small sample sizes. Essentially, ROR measures the extent to which error regions observed under the baseline condition are re-observed under different experimental settings. A higher ROR indicates that error-prone regions are spatially stable and tend to recur across different spectral or spatial configurations, reflecting a consistent error tendency pattern.
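The ROR is straightforward to compute from the binary error masks; a minimal NumPy sketch with toy masks (names illustrative):

```python
import numpy as np

def ror(mask_c: np.ndarray, mask_b: np.ndarray) -> float:
    """Reference Overlap Rate: share of baseline error pixels that are
    re-observed under condition c, i.e. |M_c ∩ M_b| / |M_b| * 100."""
    baseline = int(mask_b.sum())
    if baseline == 0:
        return float("nan")  # undefined when the baseline has no error pixels
    overlap = int(np.logical_and(mask_c, mask_b).sum())
    return 100.0 * overlap / baseline

# Toy 3x3 FP masks: condition c re-observes 2 of the 4 baseline error pixels
m_b = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]], dtype=bool)
m_c = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)
print(ror(m_c, m_b))  # 50.0
```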

5. Results

5.1. Spectral Variations Experiment Results

The results of the spectral band combination experiments are shown in Figure 5. Overall, the number of input bands has a significant impact on the cropland segmentation performance of both models, and REAUnet performs slightly better than BsiNet. The main findings can be summarized from three aspects.
First, the overall performance improves as the number of input bands increases, while the influence of specific band combinations is relatively limited. As statistically summarized in Table 2 and Table 3, the IoU values of BsiNet and REAUnet generally exhibit an upward trend on both datasets as the input changes from single-band to four-band settings. For example, on the iFLYTEK dataset, the mean IoU of BsiNet increases from 77.33% ($N_{band}$ = 1) to 80.36% ($N_{band}$ = 4). By comparison, the mean IoU of REAUnet increases from 84.70% to 86.17% under the same setting. A similar trend is also observed on the Shanglin dataset. Notably, the performance variation among different band combinations within the same band count is generally small, with the range remaining within 0.28% to 2.27%. These results indicate that model performance is primarily influenced by the amount of available spectral information (i.e., band count), rather than by specific band identities. In some cases, the minimum IoU of a higher band-count group exceeds the maximum IoU of a lower band-count group (e.g., for REAUnet on iFLYTEK, the $N_{band}$ = 3 minimum of 85.60% exceeds the $N_{band}$ = 2 maximum of 85.22%), further supporting this observation.
Second, band combinations containing the near-infrared (NIR) band generally achieve slightly better segmentation performance. Under the same band count, inputs containing NIR produce overall higher IoU, and this trend is particularly evident for BsiNet. For example, for three-band inputs on the iFLYTEK dataset, the IoU of the non-NIR combination (i.e., B3 + B2 + B1) is 78.36%, whereas NIR-including band combinations B4 + B3 + B2, B4 + B3 + B1, and B4 + B2 + B1 achieve 80.05%, 80.21%, and 80.31%, respectively. Similar patterns are also observed on the Shanglin dataset, although the magnitude of improvement varies across different band combinations. This result is consistent with the spectral characteristics of cropland, because vegetation typically exhibits a strong response in the NIR band [36,41]. Therefore, when NIR information is available, it should be included in the model input whenever possible for cropland segmentation tasks.
In summary, the spectral band number experiments suggest that: (1) increasing the number of bands is generally beneficial, (2) the NIR band is helpful, and (3) explicitly integrating SA with CA improves the robustness.

5.2. Spatial Variations Experiment Results

The results of the spatial resolution experiments are summarized in Table 4 and Table 5. Overall, variations in spatial resolution lead to more pronounced performance changes than changes in spectral inputs under the tested configurations.
First, decreasing spatial resolution leads to significant accuracy degradation. As the spatial resolution progressively decreases, the performance metrics of the two models on the two datasets exhibit a stable downward trend. When the resolution decreases from the original high resolution (0.75–1.1 m for the iFLYTEK dataset or 0.8 m for the Shanglin dataset) to a medium resolution (1.5–2.2 m or 1.6 m, i.e., half of the original resolution), the IoU decreases by approximately 2–3% on average. When the resolution further decreases to a low resolution (3.0–4.4 m or 3.2 m, i.e., one-quarter of the original resolution), the cumulative decline expands to about 5–7%. Notably, the degradation in spatial resolution affects the model through a different physical mechanism than the reduction of spectral bands. While the spectral experiments primarily reduce the spectral discriminability by decreasing the band count, spatial degradation directly compromises the geometric structure and textural integrity of the imagery [38,43]. Our results show that the performance decline caused by spatial resolution reduction (5–7%) is generally larger than the performance change observed when reducing the spectral input from four bands to a single band (1.5–4%). These results indicate that, for attention-based cropland extraction in fragmented landscapes, preserving spatial structural information is particularly important, while spectral information provides complementary but less dominant contributions under the current experimental setting.
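For reference, a degradation level L (a 2^L coarsening of the ground sampling distance while the network input stays a fixed 256 × 256 pixel patch) can be emulated as below. The block-averaging kernel and the patch-sampling helper are illustrative assumptions, not necessarily the resampling actually used in the experiments:

```python
import numpy as np

def degrade_scene(scene: np.ndarray, level: int) -> np.ndarray:
    """Coarsen an (H, W, C) scene by a factor of 2**level via block averaging.
    Illustrative area resampling; the paper's exact kernel may differ."""
    f = 2 ** level
    h, w, c = scene.shape
    h, w = h - h % f, w - w % f  # trim so dimensions divide the factor
    blocks = scene[:h, :w].reshape(h // f, f, w // f, f, c)
    return blocks.mean(axis=(1, 3))

def sample_patch(scene: np.ndarray, level: int, top: int = 0, left: int = 0,
                 size: int = 256) -> np.ndarray:
    """Fixed-size pixel patch from the degraded scene: the pixel grid is constant,
    so the geographic footprint grows by 2**level (cf. Section 6.1)."""
    coarse = degrade_scene(scene, level)
    return coarse[top:top + size, left:left + size]

scene = np.random.rand(1024, 1024, 4).astype(np.float32)  # toy four-band scene
p0 = sample_patch(scene, level=0)  # baseline resolution
p1 = sample_patch(scene, level=1)  # half resolution, 2x ground coverage
print(p0.shape, p1.shape)  # both (256, 256, 4)
```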
In summary, two instructive conclusions can be drawn from the spatial experiments: (1) higher spatial resolution consistently yields better performance, and (2) SA coupling with CA is generally beneficial, but its effectiveness depends on the availability of sufficient spatial details and may be weakened under severe spatial degradation.

5.3. Spectral–Spatial Coupling Experiments Results

In addition to the single-factor experiments, we quantitatively model the coupling relationship between these two distinct information channels: the spatial structural channel and the spectral discriminative channel. Using the joint spectral–spatial regression model developed in Equation (10), we fit eight models (four for IoU versus $L$, four for IoU versus $N_{band}$) based on the experimental results obtained from the two networks and the two datasets, respectively. The fitted models are presented in Figure 6. The eight models reveal linear relationships between IoU and both the band count ($N_{band}$) and the spatial resolution degradation level ($L$), with all coefficients of determination ($R^2$) exceeding 0.87. This indicates that the complementary relationship between the geometric cues provided by spatial resolution and the thematic cues provided by spectral bands can be quantified in a linear model. Since $L$ lies in the continuous real number space, we generate Iso-IoU contour maps for the different models and datasets.
As shown in Figure 7, each contour line represents a constant performance level (i.e., Iso-IoU). Moving along a given contour line indicates that different spectral–spatial configurations can achieve the same performance level (i.e., IoU). For example, in Figure 7c, point A ($L$ = 1, $N_{band}$ = 1) and point B ($L$ = 2, $N_{band}$ = 3) lie on the same 78% Iso-IoU contour line, suggesting that the configuration of point A (i.e., a single band at 2× degraded spatial resolution) and that of point B (i.e., three bands at 4× degraded spatial resolution) achieve a comparable IoU of 78%. To verify the predictive power of the Iso-IoU equivalent model, an out-of-sample validation is conducted using a cross-dataset protocol. The spatial sensitivity parameters are first calibrated on the iFLYTEK (Jilin-1) dataset and then used to predict the performance on the Shanglin (GF-2) dataset under unseen configurations. The results demonstrate high predictive accuracy; for instance, the model's predicted IoU for the Shanglin 1.6-m configuration is 81.43%, which deviates from the actual value (81.42%) by only 0.01%.
In addition, the contour slope reflects the relative sensitivity between spectral and spatial information. To further characterize this tendency, the fitted coefficients for spectral sensitivity ($\alpha$) and spatial sensitivity ($\beta$) are summarized in Figure 8. As shown in Figure 8a, BsiNet exhibits higher spectral sensitivity ($\alpha \approx$ 1.0–1.2) than REAUnet ($\alpha \approx$ 0.5), suggesting that BsiNet relies more strongly on spectral information. Conversely, all models show negative spatial sensitivity in Figure 8b, with REAUnet exhibiting stronger sensitivity to spatial degradation on the Shanglin dataset ($\beta \approx$ −4.2), consistent with its steeper contour patterns in Figure 7. We emphasize that the Iso-IoU model is intended as a descriptive approximation over a limited discrete experimental grid and should not be interpreted as a globally valid functional form.
In summary, two conclusions can be drawn: (1) spectral and spatial information exhibit approximately linear complementarity within the studied range, and (2) when spatial resolution is constrained, increasing spectral information can partially compensate for performance loss.

5.4. Error Tendency Experiment Results

To further investigate the error sources in cropland extraction, we analyze the spatial and spectral layouts (i.e., image textures and land types) of the FNs and FPs using the ROR. The ROR results for different band combinations (note that band combination B4 + B3 + B2 + B1 is used as the baseline) are presented in Figure 9. Three main conclusions can be drawn from Figure 9. Firstly, BsiNet and REAUnet both exhibit error tendencies in both the iFLYTEK and Shanglin datasets (almost all RORs exceed 50%). Secondly, the error tendency in the Shanglin dataset is generally more severe. Thirdly, the error tendency of REAUnet is more pronounced than that of BsiNet.
Figure 9a,c show the FN and FP RORs of BsiNet, respectively. The experiments show that most FN RORs on the Shanglin dataset exceed 70%, indicating that BsiNet is prone to missing certain ground-truth cropland parcels.
Figure 9b,d show the ROR results of REAUnet. It can be clearly seen that the FP RORs are markedly higher on the Shanglin dataset (shown in the first row of Figure 9d), while the FN RORs are comparable on the Shanglin and iFLYTEK datasets. This tendency becomes more evident when the NIR band is retained. For example, as shown in the second row of Figure 9b, the FN RORs of B4 + B1, B4 + B3 + B1, and B4 + B3 + B2 on the iFLYTEK dataset reach up to 83.1%, 82.6%, and 80.3%, respectively. This means that regardless of the band combination employed for REAUnet, as long as the NIR band is included in the input, the network fails to correctly recognize certain land types as cropland.
In summary, two instructive conclusions can be drawn from the error tendency experiments from the perspective of band combinations: (1) spectral information is generally not fully utilized in the attention-based networks; and (2) the error tendency generally exists even when more bands are provided.
We also examine the error tendency of the two networks from the perspective of spatial resolution degradation, and the results are presented in Table 6 (note that in these experiments, all four spectral bands are used). It can be clearly seen that the error tendency is present across all spatial resolution settings, even at the highest resolution. However, its level is much lower than that observed in the spectral band experiments.
As shown in Table 6, for BsiNet at the moderate degradation level (L = 1, i.e., half resolution), both datasets show a consistent trend in which the ROR decreases as spatial resolution degrades. This indicates that lower spatial resolution disrupts the spatial consistency of errors, leading to lower RORs than at the high-resolution baseline. The same pattern is observed for REAUnet: the ROR decreases as spatial resolution degrades. However, the error tendency of REAUnet is more moderate than that of BsiNet. For example, on the Shanglin dataset, when the resolution is reduced to half of the baseline (i.e., degradation level L = 1), the FN and FP RORs of BsiNet are 72.07% and 67.67%, respectively, while for REAUnet under the same setting, the corresponding RORs decrease to 55.40% and 66.00%. Furthermore, comparisons under the other spatial resolution setups show that most RORs of REAUnet are lower, confirming that its error tendency is more moderate.
Examples in Figure 11 further show that BsiNet is prone to producing FN segmentations (quantitatively verified in Figure 9a), whereas REAUnet, as shown in Figure 11c,d, is apt to generate FP segmentations (quantitatively verified in Figure 9d).
To further investigate the boundary delineation performance under mixed-pixel conditions, we provide zoomed-in visual comparisons in Figure 12. As shown, both BsiNet and REAUnet are able to preserve the overall structure of cropland parcels under challenging boundary scenarios, including linear roads, irregular parcel shapes, and tree/shadow occlusions. However, their behaviors exhibit noticeable differences. BsiNet tends to produce more regularized and conservative boundaries, which often leads to under-segmentation or missed detections (FN) in regions affected by spectral ambiguity. In contrast, REAUnet shows improved sensitivity to local details, maintaining better structural continuity for complex parcels, although it is more prone to minor false positives (FP) in texture-similar areas. These observations are consistent with the quantitative ROR analysis and further suggest that different attention mechanisms contribute differently to boundary modeling by enhancing spectral–spatial feature interactions. Nevertheless, mixed-pixel effects at parcel boundaries remain a major source of error, especially in fragmented landscapes and under low spatial resolution conditions. Furthermore, a spatially decoupled analysis using a morphological boundary buffer reveals that false positives are predominantly concentrated in boundary regions, while false negatives are primarily distributed in parcel interiors. This confirms that boundary ambiguity and intra-parcel spectral variability drive distinct error mechanisms.
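The boundary-buffer decoupling described above can be sketched with morphological operations; the buffer width and mask names here are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_buffer(gt: np.ndarray, width: int = 2) -> np.ndarray:
    """Morphological boundary zone of a binary parcel mask:
    dilation minus erosion, each by `width` iterations."""
    return binary_dilation(gt, iterations=width) & ~binary_erosion(gt, iterations=width)

def split_errors(err: np.ndarray, gt: np.ndarray, width: int = 2):
    """Split an FP/FN mask into boundary-zone and remaining (interior/exterior) parts."""
    buf = boundary_buffer(gt, width)
    return err & buf, err & ~buf

# Toy example: FP pixels forming a ring just outside a square parcel
gt = np.zeros((12, 12), dtype=bool)
gt[3:9, 3:9] = True
fp = binary_dilation(gt, iterations=1) & ~gt  # errors hugging the boundary
fp_boundary, fp_rest = split_errors(fp, gt, width=2)
print(fp_boundary.sum(), fp_rest.sum())  # all FP pixels fall inside the buffer
```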
In summary, two instructive conclusions can be drawn from the error tendency experiments from the spatial resolution perspective: (1) spatial information is generally the dominant factor in alleviating the error tendency problem; and (2) the error tendency persists even when higher spatial resolution images are supplied.

6. Discussion

6.1. Spectral–Spatial Trade-Offs and the Role of Global Context

To fully understand the performance variations observed in the results, it is essential to analyze the underlying mechanisms driving these changes, particularly the trade-off between global context and spatial detail. As detailed in our experimental framework, we fixed the input patch size at 256 × 256 pixels across all spatial resolution tiers. Consequently, when the spatial resolution is degraded (e.g., from 0.8 m to 3.2 m), the actual geographical FOV covered by the 256 × 256 patch increases significantly. As a result, the global context available to the attention mechanism expands, encompassing broader macro-structures of the landscape. Generally, an expanded global context provides richer semantic information, which typically enhances the performance of attention mechanisms by helping them better understand the scene's global structure. However, in our experiments, we observe a consistent drop in segmentation performance as spatial resolution degrades. Since the expanded global context should theoretically favor the model, the observed performance degradation strongly indicates that the negative impact of losing high-frequency spatial details (such as boundary clarity and fine-grained textures) is the dominant factor. The loss of these crucial geometric cues overrides any potential benefits gained from processing a broader global context.
This effect is more evident on the Shanglin dataset, where stronger terrain undulations and higher cropland fragmentation are present. As the results demonstrate, the IoU of REAUnet decreases from 86.17% to 80.89% on the iFLYTEK dataset, and from 85.78% to 77.47% on the Shanglin dataset. The underlying reason is that high-spatial-resolution images provide detailed textures with relatively low ambiguity; CA primarily suppresses noise and is therefore limited in its ability to exploit these fine spatial details, whereas SA focuses on spatial structures and can leverage the detailed textures to enhance segmentation performance.
Furthermore, our analysis reveals why the two models exhibit different sensitivities to spatial resolution variations. Although both models experience performance degradation as spatial resolution decreases, BsiNet shows better stability across different resolutions, whereas REAUnet exhibits a more pronounced decline, especially on the Shanglin dataset. BsiNet mainly focuses on channel-wise feature representations derived from spectral information and is therefore less sensitive to spatial resolution changes. In contrast, REAUnet integrates both CA and SA within a dual-branch architecture, enabling it to jointly exploit spatial and spectral information. The SA branch explicitly models spatial structures and benefits from fine-grained texture details, particularly in fragmented cropland regions such as those in the Shanglin dataset [11,30]. However, when spatial resolution decreases, textural details gradually disappear and cropland boundaries become blurred. As a result, the SA mechanism cannot effectively generate meaningful spatial weights from degraded inputs, leading to a more pronounced decline in segmentation performance compared with BsiNet. This suggests that although the integration of SA and CA offers high peak performance, it also introduces a greater dependency on spatial structural clarity.

6.2. Error Tendency and Attention Mechanisms

These observations help explain the underlying reasons behind the performance differences observed in Section 5. Beyond the macroscopic spectral–spatial trade-offs, the specific architectural designs of the attention mechanisms in BsiNet and REAUnet intrinsically dictate their distinct error tendencies. Regarding spectral utilization, we observe a saturation effect under certain conditions. For instance, on the Shanglin dataset, although the mean IoU of BsiNet increases from 82.00% at N b a n d = 3 to 83.05% at N b a n d = 4, the improvement becomes relatively smaller compared to earlier increments, suggesting a potential diminishing return effect. One possible explanation is that the group-wise channel enhancement mechanism in BsiNet may be more effective at suppressing less informative channels than at fully exploiting additional spectral details. In highly fragmented cropland scenes, the inclusion of extra bands may introduce additional variability that is not fully exploited by the model, which may contribute to the observed saturation effect. Conversely, the joint use of CA and SA in REAUnet can partially buffer the impact of spectral variation. When spectral information becomes limited, the SA branch can still exploit texture and structural cues to compensate for the reduced discriminative power of channel-wise spectral responses [24,56]. As a result, REAUnet shows stronger robustness to band-combination changes than BsiNet.
In terms of specific error types, the two models diverge significantly. The main reason for BsiNet's higher FN tendency lies in its use of grouped CA to aggregate spectral information, which emphasizes structured channel-wise feature selection and noise suppression. Small parcels, which are common in the Shanglin dataset, suffer from spectral mixed-pixel problems [60,61]. These mixed pixels are often treated as noise, which may contribute to BsiNet's higher error tendency on the Shanglin dataset (examples are shown in Figure 10a,b). Because BsiNet suppresses features along the channel dimension, which carries the spectral information, it cannot distinguish cropland from other land types when the cropland exhibits abnormal color characteristics. When higher-spatial-resolution images are provided, the FN areas shrink markedly, since the higher spatial resolution compensates for the spectral information in the segmentation.
On the other hand, the higher FP tendency of REAUnet may be related to its coupling strategy of SA and CA. Even when more bands are introduced, the additional bands do not help REAUnet segment fragmented parcels (as shown in Figure 10c,d). Because the boundaries of these parcels depend heavily on image textures, the error tendency caused by fragmented parcels may not be fully alleviated without higher-resolution images, even when the spectral information is fully exploited. Interestingly, the FP areas show a more complex response to spatial resolution variation. At the baseline resolution (i.e., degradation level L = 0), the FP segmentations resemble cropland both in color (corresponding to spectral information) and in texture (corresponding to spatial information). As the spatial resolution decreases, texture details are lost and the spatial information declines. Consequently, spectral information dominates the segmentation and the FP areas decrease.

6.3. Limitations

This study provides quantitative insights into the spectral–spatial trade-offs in attention-based cropland extraction. However, several limitations should be considered when interpreting the results. Regarding geographical coverage, the datasets are limited to two representative regions in China (i.e., Shanglin and iFLYTEK), which may restrict the generalizability of the findings across diverse ecological and agricultural environments. Furthermore, while the ground truth labels are carefully generated, manual visual interpretation inevitably introduces subjective bias and boundary ambiguity, particularly in fragmented landscapes.
From a methodological standpoint, the spatial and spectral degradation processes are simulated using resampling-based strategies. While this isolates variables for sensitivity analysis, it does not fully capture physical sensor effects, such as point spread function variations and radiometric distortions. Additionally, the controlled experimental design is not fully orthogonal, potentially limiting our ability to capture complex, higher-order interactions between spectral and spatial features. Furthermore, regarding evaluation metrics, while this study relied on widely used region-based metrics (e.g., IoU and F1 score) to systematically evaluate spectral and spatial trade-offs, we acknowledge that these metrics do not fully capture topological consistency. Future work should incorporate boundary-aware and object-level metrics (such as the boundary F-score or Global Over-Classification, Global Under-Classification, and Global Total-Classification indicators) to provide deeper insights into the geometric quality of parcel delineation, particularly for models that explicitly employ boundary supervision.

7. Conclusions and Future Work

This study systematically investigates the joint effects of spectral band combinations and spatial resolution on cropland extraction using two attention-based segmentation networks, BsiNet and REAUnet. Importantly, this research yields several novel findings regarding these interactions. Under controlled experimental conditions, we quantitatively demonstrate that spatial resolution exerts a more dominant impact on model performance than spectral richness. Beyond evaluating their independent effects, we reveal a quantifiable complementary relationship between spectral and spatial information. To characterize this interaction, we propose the interpretable Iso-IoU framework, which offers an explicit mathematical approach to describe how improvements in spectral information can partially compensate for spatial degradation.
Furthermore, through spatial decoupling and ROR analysis, we identify structurally stable error tendencies inherent to different attention mechanisms. Specifically, our findings show that channel-wise attention systematically tends to induce FNs driven by intra-parcel spectral heterogeneity, whereas spatial–channel coupling inherently induces FPs at boundaries due to mixed-pixel effects when geometric cues are lost.
Future work will extend this study by incorporating additional spectral bands (e.g., the red-edge band) and multi-source satellite data fusion. Moreover, ensemble-based, multi-run evaluations will be conducted to improve the robustness and generalizability of the proposed framework.

Author Contributions

Conceptualization, S.C.; methodology, L.C. and C.D.; validation, L.C. and H.L.; formal analysis, C.D. and H.L.; data curation, L.C. and Y.Z.; writing—original draft preparation, S.C., Z.L. and C.Z.; writing—review and editing, Z.L. and C.D.; visualization, C.Z.; supervision, S.C.; project administration, S.C.; funding acquisition, C.D., S.C. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China under Grant 41901402, in part by Open Fund of Key Laboratory of Investigation, Monitoring, Protection and Utilization for Cultivated Land Resources, Ministry of Natural Resources under Grant CLRKL2024KP04, in part by Ministry-Province Cooperative Project under the Ministry of Natural Resources under Grant 2024ZRBSHZ045, in part by Research Project of the Department of Natural Resources of Sichuan Province under Grant KJ-2025-008, in part by Open Fund of Key Laboratory of Land Use, Ministry of Natural Resources under Grant 2025No.215-6, in part by Fundamental Scientific Research Business Expense under Grant 2026JDKY0019-3, in part by Nanhu Scholars Program for Young Scholars of XYNU under Grant 2017037, and in part by Key Scientific and Technological Research Project of Henan Province under Grant 252102110354.

Data Availability Statement

The data will be shared on the Baidu cloud disk, and training codes will be available on https://github.com/csyhy1986 (accessed on 4 May 2026) once the paper is accepted.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xie, Z.; Zhang, Q.; Jiang, C.; Yao, R. Cropland Compensation in Mountainous Areas in China Aggravates Non-Grain Production: Evidence from Fujian Province. Land Use Policy 2024, 138, 107026. [Google Scholar] [CrossRef]
  2. Yang, R.; Yang, Z.; Zhong, C.; Yang, S.; Cao, L. Study on the Relationship between Farmland and Grain Abundance and Households’ Income in China. J. Nat. Resour. 2024, 39, 2619–2638. [Google Scholar] [CrossRef]
  3. Zhao, S.; Yin, M. Change of Urban and Rural Construction Land and Driving Factors of Arable Land Occupation. PLoS ONE 2023, 18, e0286248. [Google Scholar] [CrossRef] [PubMed]
  4. Aung, H.L.; Uzkent, B.; Burke, M.; Lobell, D.; Ermon, S. Farm Parcel Delineation Using Spatio-Temporal Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 340–349. [Google Scholar] [CrossRef]
  5. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W., Frangi, A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
  6. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 October 2018; Springer: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar] [CrossRef]
  7. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  8. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  9. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  10. Wang, M.; Wang, J.; Cui, Y.; Liu, J.; Chen, L. Agricultural Field Boundary Delineation with Satellite Image Segmentation for High-Resolution Crop Mapping: A Case Study of Rice Paddy. Agronomy 2022, 12, 2342. [Google Scholar] [CrossRef]
  11. Cai, Z.; Hu, Q.; Zhang, X.; Yang, J.; Wei, H.; He, Z.; Song, Q.; Wang, C.; Yin, G.; Xu, B. An Adaptive Image Segmentation Method with Automatic Selection of Optimal Scale for Extracting Cropland Parcels in Smallholder Farming Systems. Remote Sens. 2022, 14, 3067. [Google Scholar] [CrossRef]
  12. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  13. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  14. Blaschke, T. Object Based Image Analysis for Remote Sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16. [Google Scholar] [CrossRef]
15. Chang, B.; Wang, J.; Luo, Y.; Wang, Y.; Wang, Y. Cultivated Land Extraction Based on GF-1/WFV Remote Sensing in Shenwu Irrigation Area of Hetao Irrigation District. Trans. Chin. Soc. Agric. Eng. 2017, 33, 188–195. (In Chinese)
16. Wang, Y.; Yang, M.; Zhang, T.; Hu, S.; Zhuang, Q. DAENet: A Deep Attention-Enhanced Network for Cropland Extraction in Complex Terrain from High-Resolution Satellite Imagery. Agriculture 2025, 15, 1318.
17. Zhang, X.; Huang, J.; Ning, T. Progress and Prospect of Cultivated Land Extraction from High-Resolution Remote Sensing Images. Geomat. Inf. Sci. Wuhan Univ. 2023, 48, 1582–1590.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
19. Wu, Y.H.; Zhang, S.C.; Liu, Y.; Zhang, L.; Zhan, X.; Zhou, D.; Feng, J.; Cheng, M.M.; Zhen, L. Low-Resolution Self-Attention for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 8180–8192.
20. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3141–3149.
21. Zhang, Z.; Huang, L.; Tang, B.-H.; Le, W.; Wang, M.; Cheng, J.; Wu, Q. MATNet: Multiattention Transformer Network for Cropland Semantic Segmentation in Remote Sensing Images. Int. J. Digit. Earth 2024, 17, 2392845.
22. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
23. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7132–7141.
24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19.
25. Zhao, Q.; Liu, J.; Li, Y.; Zhang, H. Semantic Segmentation with Attention Mechanism for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5403913.
26. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015); Curran Associates: Red Hook, NY, USA, 2015; pp. 2017–2025.
27. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. PSANet: Point-Wise Spatial Attention Network for Scene Parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 267–283.
28. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11531–11539.
29. Price, J.C. Spectral Band Selection for Visible-near Infrared Remote Sensing: Spectral–Spatial Resolution Tradeoffs. IEEE Trans. Geosci. Remote Sens. 1997, 35, 1277–1285.
30. Xu, F.; Yao, X.; Zhang, K.; Yang, H.; Feng, Q.; Li, Y.; Yan, S.; Gao, B.; Li, S.; Yang, J.; et al. Deep Learning in Cropland Field Identification: A Review. Comput. Electron. Agric. 2024, 222, 109042.
31. Zhang, L.; Zhang, L.; Du, B. Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40.
32. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep-Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8370–8396.
33. Skosana, T.E.; Esler, K.J.; Rebelo, A.J. Exploring the Trade-Offs between Spatial and Spectral Resolution in Mapping Invasive Alien Trees. Ecol. Inform. 2025, 92, 103448.
34. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep Learning-Based Classification of Hyperspectral Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107.
35. Plaza, A.; Benediktsson, J.A.; Boardman, J.W.; Brazile, J.; Bruzzone, L.; Camps-Valls, G.; Chanussot, J.; Fauvel, M.; Gamba, P.; Gualtieri, A.; et al. Recent Advances in Techniques for Hyperspectral Image Processing. Remote Sens. Environ. 2009, 113, S110–S122.
36. Thenkabail, P.S.; Smith, R.B.; De Pauw, E. Hyperspectral Vegetation Indices and Their Relationships with Agricultural Crop Characteristics. Remote Sens. Environ. 2000, 71, 158–182.
37. Xie, Y.; Sha, Z.; Yu, M. Remote Sensing Imagery in Vegetation Mapping: A Review. J. Plant Ecol. 2008, 1, 9–23.
38. Radoux, J.; Defourny, P. A Quantitative Assessment of Boundaries in Automated Forest Stand Delineation Using Very High Resolution Imagery. Remote Sens. Environ. 2007, 110, 468–475.
39. Chang, M.; Li, S.; Peng, S.; He, Z.; Anders, K. Cropland Segmentation Leveraging a Synergistic Edge Enhancement and Temporal Difference-Aware Network with Sentinel-2 Time-Series Imagery. Int. J. Digit. Earth 2025, 18, 2554350.
40. Jia, J.; Chen, J.; Zheng, X.; Wang, Y.; Guo, S.; Sun, H.; Jiang, C.; Karjalainen, M.; Karila, K.; Duan, Z.; et al. Tradeoffs in the Spatial and Spectral Resolution of Airborne Hyperspectral Imaging Systems: A Crop Identification Case Study. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5510918.
41. Im, J.; Jensen, J.R. Hyperspectral Remote Sensing of Vegetation. Geogr. Compass 2008, 2, 1943–1961.
42. Forzieri, G.; Moser, G.; Catani, F. Assessment of Hyperspectral MIVIS Sensor Capability for Heterogeneous Landscape Classification. ISPRS J. Photogramm. Remote Sens. 2012, 74, 175–184.
43. Atkinson, P.M.; Aplin, P. Spatial Variation in Land Cover and Choice of Spatial Resolution for Remote Sensing. Int. J. Remote Sens. 2004, 25, 3687–3702.
44. Sun, H.; Zheng, X.; Lu, X.; Wu, S. Spectral–Spatial Attention Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3232–3245.
45. Cui, Y.; Yu, Z.; Han, J.; Gao, S.; Wang, L. Dual-Triple Attention Network for Hyperspectral Image Classification Using Limited Training Samples. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5504705.
46. Zhu, Y.; Pan, Y.; Hu, T.; Zhang, D.; Zhao, C.; Gao, Y. A Generalized Framework for Agricultural Field Delineation from High-Resolution Satellite Imageries. Int. J. Digit. Earth 2024, 17, 2297947.
47. Cui, W.; Wang, F.; He, X.; Zhang, D.; Xu, X.; Yao, M.; Wang, Z.; Huang, J. Multi-Scale Semantic Segmentation and Spatial Relationship Recognition of Remote Sensing Images Based on an Attention Model. Remote Sens. 2019, 11, 1044.
48. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607713.
49. Blaes, X.; Vanhalle, L.; Defourny, P. Efficiency of Crop Identification Based on Optical and SAR Image Time Series. Remote Sens. Environ. 2005, 96, 352–365.
50. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep Learning Classification of Land Cover and Crop Types Using Remote Sensing Data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782.
51. Long, J.; Li, M.; Wang, X.; Stein, A. Delineation of Agricultural Fields Using Multi-Task BsiNet from High-Resolution Satellite Images. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102871.
52. Lu, R.; Zhang, Y.; Huang, Q.; Zeng, P.; Shi, Z.; Ye, S. A Refined Edge-Aware Convolutional Neural Networks for Agricultural Parcel Delineation. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104084.
53. Zhao, Z.; Liu, Y.; Zhang, G.; Tang, L.; Hu, X. The Winning Solution to the IFLYTEK Challenge 2021 Cultivated Land Extraction from High-Resolution Remote Sensing Images. In Proceedings of the International Conference on Advanced Computational Intelligence (ICACI 2022), Nanjing, China, 15–17 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 376–380.
54. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A Fully Convolutional Neural Network for Automatic Building Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2020, 12, 1050.
55. Li, X.; Hu, X.; Yang, J. Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks. arXiv 2019, arXiv:1905.09646.
56. Roy, A.G.; Navab, N.; Wachinger, C. Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2018; Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11070, pp. 421–429.
57. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Las Condes, Chile, 11–18 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1026–1034.
58. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
59. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
60. Chen, X.; Wang, D.; Chen, J.; Wang, C.; Shen, M. The Mixed Pixel Effect in Land Surface Phenology: A Simulation Study. Remote Sens. Environ. 2018, 211, 338–344.
61. Plaza, A.; Martinez, P.; Perez, R.; Plaza, J. A New Approach to Mixed Pixel Classification of Hyperspectral Imagery Based on Extended Morphological Profiles. Pattern Recognit. 2004, 37, 1097–1116.
Figure 1. Overview of the iFLYTEK dataset. (A–F) are samples from six provinces of China: Gansu, Heilongjiang, Anhui, Sichuan, Xinjiang, and Guizhou. The red boxes indicate the locations of the zoomed-in examples. Sub-figures (1–12) show the parcel details.
Figure 2. Overview of the Shanglin dataset. (A,B) show the location of the research area. (C) is an overview of the remote sensing image, where the red boxes indicate the locations of the zoomed-in samples. Sub-figures (1–4) refer to the details of the parcels.
Figure 3. Schematic illustration of the SGE module in BsiNet. The module performs group-wise spatial reweighting by splitting the input features into groups, recalibrating each group with a shared SA map, and concatenating the enhanced groups to generate the output feature map.
Figure 4. Schematic illustration of the AttBLK in REAUnet. The module performs explicit channel-spatial joint reweighting by generating in parallel a CA map and an SA map. The two maps are then used to recalibrate the input features along the channel and spatial dimensions. The refined features are fused by summing with the original feature to generate the final feature map.
Figure 5. Quantitative evaluation of cropland extraction performance across models and datasets under different band combinations. (a,b) show the performance of BsiNet on the iFLYTEK and Shanglin datasets; (c,d) show the performance of REAUnet on the same two datasets.
Figure 6. Relationships between image resolution and IoU among different networks and datasets. (a) Regressed linear model of spatial resolution degradation level L and IoU, and (b) regressed linear model of spectral resolution N_band and IoU.
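The linear trends plotted in Figure 6 can be reproduced from the tabulated results. The sketch below fits ordinary least-squares lines to the BsiNet iFLYTEK values (IoU versus degradation level L from Table 4, and mean IoU versus band count N_band from Table 2); it is illustrative of the regression procedure, not the paper's exact fits for every network/dataset pair.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit y = a*x + b; returns (a, b, R^2)."""
    a, b = np.polyfit(x, y, deg=1)
    pred = a * np.asarray(x, dtype=float) + b
    ss_res = np.sum((np.asarray(y, dtype=float) - pred) ** 2)
    ss_tot = np.sum((np.asarray(y, dtype=float) - np.mean(y)) ** 2)
    return a, b, 1.0 - ss_res / ss_tot

# BsiNet on iFLYTEK at degradation levels L = 0, 1, 2 (IoU values from Table 4)
L = [0, 1, 2]
iou_spatial = [80.36, 79.50, 74.97]
slope, intercept, r2 = fit_linear(L, iou_spatial)
print(f"IoU = {slope:.2f}*L + {intercept:.2f} (R^2 = {r2:.3f})")

# BsiNet on iFLYTEK: mean IoU vs. band count N_band (values from Table 2)
n_band = [1, 2, 3, 4]
iou_spectral = [77.33, 78.86, 79.48, 80.36]
slope_b, intercept_b, r2_b = fit_linear(n_band, iou_spectral)
print(f"IoU = {slope_b:.2f}*N_band + {intercept_b:.2f} (R^2 = {r2_b:.3f})")
```

The negative spatial slope and positive spectral slope match the directions of the trends shown in Figure 6.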
Figure 7. Iso-IoU contour maps derived from the spectral–spatial regression models. Maps show the coupled effects of spatial resolution degradation L and spectral band count N_band on model performance: (a) BsiNet + iFLYTEK; (b) REAUnet + iFLYTEK; (c) BsiNet + Shanglin; (d) REAUnet + Shanglin. The color gradient indicates IoU (%), and the black contours represent configurations with equal performance (i.e., Iso-IoU). Points A and B represent two configurations located on the same 78% Iso-IoU contour line.
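To illustrate how an Iso-IoU contour encodes the spectral–spatial trade-off, the sketch below assumes a bilinear coupling surface IoU(L, N_band) = b0 + b1·L + b2·N_band. The coefficients b0, b1, b2 are placeholder values, not the paper's fitted model; the point is that moving along a contour quantifies how many additional bands are needed to offset one level of spatial degradation.

```python
# Hypothetical bilinear coupling surface (coefficients are assumed
# placeholders, not the paper's fitted values).
b0, b1, b2 = 77.0, -2.7, 0.97

def iou_surface(L, n_band):
    """Predicted IoU (%) at degradation level L with n_band bands."""
    return b0 + b1 * L + b2 * n_band

def bands_needed(target_iou, L):
    """Band count required to stay on the target Iso-IoU contour at level L."""
    return (target_iou - b0 - b1 * L) / b2

# Moving from L = 0 to L = 1 along the 78% Iso-IoU contour:
print(bands_needed(78.0, 0))  # roughly 1 band suffices at full resolution
print(bands_needed(78.0, 1))  # nearly 4 bands needed at half resolution
```

Under this assumed surface, each degradation level costs −b1/b2 ≈ 2.8 bands of spectral compensation, which is the slope of the straight Iso-IoU contours in the (L, N_band) plane.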
Figure 8. Quantitative sensitivity of different models to spectral and spatial variations. (a) Spectral sensitivity; (b) Spatial sensitivity.
Figure 9. Error tendency heatmaps based on the ROR metric. (a,c) show the FN and FP ROR results for BsiNet, respectively. (b,d) display the FN and FP ROR results for REAUnet, respectively. Each heatmap illustrates the consistency of error distributions across various band combinations relative to the baseline (B4 + B3 + B2 + B1). Higher RORs indicate a persistent error tendency: false segmentations remain consistent despite different spectral combinations.
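The ROR heatmaps measure how consistently error pixels recur across input configurations. A minimal sketch, assuming ROR is the share of baseline error pixels that persist in a test configuration (the paper's exact formula is not reproduced here, so treat this definition as an assumption):

```python
import numpy as np

def error_overlap_ratio(err_base, err_test):
    """Percentage of baseline error pixels that recur in the test
    configuration. One plausible reading of ROR; both inputs are
    boolean error masks (e.g., FN or FP maps)."""
    base = np.asarray(err_base, dtype=bool)
    test = np.asarray(err_test, dtype=bool)
    if base.sum() == 0:
        return 0.0
    return 100.0 * np.logical_and(base, test).sum() / base.sum()

# Toy masks: 3 of 4 baseline FN pixels persist under a reduced band set
fn_baseline = np.array([[1, 1, 0], [1, 1, 0]], dtype=bool)
fn_reduced  = np.array([[1, 1, 0], [0, 1, 1]], dtype=bool)
print(error_overlap_ratio(fn_baseline, fn_reduced))  # 75.0
```

Applied per band combination against the baseline (B4 + B3 + B2 + B1), such a ratio yields one heatmap cell per configuration, matching the structure of Figure 9 and Table 6.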
Figure 10. Qualitative comparison of cropland extraction results with different band combinations. Samples (a,b) are from BsiNet on the Shanglin dataset, while (c,d) are from REAUnet on the iFLYTEK and Shanglin datasets, respectively. FP and FN exhibit stable error tendencies regardless of the band combination. Note: “TP” and “TN” stand for true positive and true negative, respectively.
Figure 11. Examples of variations of FP and FN with the decrease in spatial resolution. (a,b) are the results of BsiNet on the iFLYTEK and Shanglin datasets, respectively; and (c,d) are the results of REAUnet on the iFLYTEK and Shanglin datasets, respectively. The parameters L = 0, L = 1 and L = 2 correspond to the baseline, half-resolution, and quarter-resolution, respectively. Predictions in the red boxes and blue boxes highlight the FP and FN variations, respectively. Note: for a better comparison, the predictions are resampled to the same resolution as the baseline.
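The degradation levels used above (L = 0, 1, 2) can be simulated by block-averaging the baseline imagery. A minimal sketch, assuming simple average pooling by a factor of 2^L as the degradation operator (the paper's actual resampling scheme may differ):

```python
import numpy as np

def degrade(img, level):
    """Simulate coarser spatial resolution by average-pooling with a
    factor of 2**level. img: (H, W, C) array with H and W divisible
    by 2**level. A proxy for sensor degradation, not a sensor model."""
    f = 2 ** level
    h, w, c = img.shape
    return img.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

img = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
print(degrade(img, 0).shape)  # (4, 4, 1) -- baseline, L = 0
print(degrade(img, 1).shape)  # (2, 2, 1) -- half resolution, L = 1
print(degrade(img, 2).shape)  # (1, 1, 1) -- quarter resolution, L = 2
```

As in the figure note, the degraded predictions would be resampled back to the baseline grid before computing per-pixel comparisons.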
Figure 12. Zoomed-in visual comparisons of cropland boundary delineation by REAUnet and BsiNet, illustrating the models’ capabilities in handling mixed pixels. (a) Overviews of the study areas in the iFLYTEK and Shanglin datasets, with yellow boxes indicating the selected challenging boundary regions. (b) Detailed comparisons across three typical boundary interference scenarios: linear roads (A), irregular shapes (B), and tree/shadow occlusions (C). Ground truth boundaries are delineated by black lines. Predicted boundaries by REAUnet and BsiNet are denoted by red and blue lines, respectively. Dashed boxes highlight regions with large (yellow) and moderate (cyan) boundary errors.
Table 1. The detailed parameters of the Shanglin and iFLYTEK datasets.

| Datasets | Satellite | Resolution | Bands (ID) | Train/Val/Test Number |
|---|---|---|---|---|
| Shanglin | GF-2 | 0.8 m | Blue (1), Green (2), Red (3), NIR (4) | 4214/1054/1060 |
| iFLYTEK | Jilin-1 | 0.75–1.1 m | Blue (1), Green (2), Red (3), NIR (4) | 14613/3654/3736 |
Table 2. Statistical summary of BsiNet IoU performance across different spectral band counts.

| Datasets | Band Count (N_band) | Mean IoU (%) | Min IoU (%) | Max IoU (%) | Range (%) |
|---|---|---|---|---|---|
| iFLYTEK | 1 | 77.33 | 76.24 | 78.44 | 2.20 |
| iFLYTEK | 2 | 78.86 | 78.08 | 79.60 | 1.52 |
| iFLYTEK | 3 | 79.48 | 78.36 | 80.31 | 1.95 |
| iFLYTEK | 4 | 80.36 | 80.36 | 80.36 | 0.00 |
| Shanglin | 1 | 79.22 | 78.64 | 79.97 | 1.33 |
| Shanglin | 2 | 81.04 | 80.27 | 82.54 | 2.27 |
| Shanglin | 3 | 82.00 | 81.16 | 83.13 | 1.97 |
| Shanglin | 4 | 83.05 | 83.05 | 83.05 | 0.00 |
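Each row of Tables 2 and 3 summarizes IoU over all C(4, N_band) band combinations, which is why N_band = 4 always has a range of 0.00: only one four-band combination exists. The sketch below illustrates the aggregation with hypothetical per-combination scores (the scoring formula is a placeholder, not the paper's measured per-combination results):

```python
from itertools import combinations
from statistics import mean

# Hypothetical per-combination IoU scores keyed by band-ID tuples
# (illustrative placeholder values only).
iou_by_combo = {c: 76.0 + len(c) + 0.1 * sum(c)
                for k in range(1, 5)
                for c in combinations((1, 2, 3, 4), k)}

# Aggregate per band count, as in Tables 2 and 3
for k in range(1, 5):
    scores = [v for c, v in iou_by_combo.items() if len(c) == k]
    print(f"N_band={k}: {len(scores)} combos, "
          f"mean={mean(scores):.2f}, min={min(scores):.2f}, "
          f"max={max(scores):.2f}, range={max(scores) - min(scores):.2f}")
```

The combination counts per band count are 4, 6, 4, and 1, so the statistics for N_band = 1 are averaged over four single-band runs while N_band = 4 reflects a single run.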
Table 3. Statistical summary of REAUnet IoU performance across different spectral band counts.

| Datasets | Band Count (N_band) | Mean IoU (%) | Min IoU (%) | Max IoU (%) | Range (%) |
|---|---|---|---|---|---|
| iFLYTEK | 1 | 84.70 | 84.28 | 85.23 | 0.95 |
| iFLYTEK | 2 | 84.76 | 84.20 | 85.22 | 1.02 |
| iFLYTEK | 3 | 85.73 | 85.60 | 85.88 | 0.28 |
| iFLYTEK | 4 | 86.17 | 86.17 | 86.17 | 0.00 |
| Shanglin | 1 | 84.24 | 83.30 | 84.83 | 1.53 |
| Shanglin | 2 | 85.00 | 84.65 | 85.25 | 0.60 |
| Shanglin | 3 | 85.24 | 85.01 | 85.70 | 0.69 |
| Shanglin | 4 | 85.78 | 85.78 | 85.78 | 0.00 |
Table 4. Metrics of BsiNet and REAUnet on the iFLYTEK dataset across different spatial resolutions.

| Model | Band Combination | Resolution | ACC (%) | F1 (%) | IoU (%) | Pre (%) | Recall (%) |
|---|---|---|---|---|---|---|---|
| BsiNet | B4 + B3 + B2 + B1 | 0.75–1.1 m | 89.50 | 89.07 | 80.36 | 88.95 | 89.20 |
| BsiNet | B4 + B3 + B2 + B1 | 1.5–2.2 m | 88.66 | 88.57 | 79.50 | 88.51 | 88.65 |
| BsiNet | B4 + B3 + B2 + B1 | 3.0–4.4 m | 85.70 | 85.69 | 74.97 | 85.8 | 85.72 |
| REAUnet | B4 + B3 + B2 + B1 | 0.75–1.1 m | 92.16 | 91.74 | 86.17 | 92.14 | 92.71 |
| REAUnet | B4 + B3 + B2 + B1 | 1.5–2.2 m | 91.68 | 90.31 | 83.81 | 92.02 | 89.74 |
| REAUnet | B4 + B3 + B2 + B1 | 3.0–4.4 m | 90.99 | 88.65 | 80.89 | 90.56 | 87.81 |

Note: ACC = Accuracy; Pre = Precision; F1 = F1-Score; IoU = Intersection over Union.
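The metrics reported in Tables 4 and 5 follow the standard pixel-wise definitions and can be computed directly from confusion counts; the sketch below uses toy counts for illustration.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Standard pixel-wise segmentation metrics, as percentages.
    tp/fp/fn/tn are true/false positive/negative pixel counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)   # overall accuracy
    pre = tp / (tp + fp)                    # precision
    rec = tp / (tp + fn)                    # recall
    f1 = 2 * pre * rec / (pre + rec)        # harmonic mean of pre/rec
    iou = tp / (tp + fp + fn)               # intersection over union
    return {k: round(100 * v, 2) for k, v in
            {"ACC": acc, "Pre": pre, "Recall": rec,
             "F1": f1, "IoU": iou}.items()}

# Toy confusion counts (not taken from the paper)
print(segmentation_metrics(tp=800, fp=100, fn=100, tn=1000))
```

Note that IoU penalizes both FP and FN in its denominator, which is why it degrades faster than ACC or F1 as resolution coarsens in Tables 4 and 5.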
Table 5. Metrics of BsiNet and REAUnet on the Shanglin dataset across different spatial resolutions.

| Model | Band Combination | Resolution | ACC (%) | F1 (%) | IoU (%) | Pre (%) | Recall (%) |
|---|---|---|---|---|---|---|---|
| BsiNet | B4 + B3 + B2 | 0.8 m | 90.88 | 90.78 | 83.13 | 90.89 | 90.69 |
| BsiNet | B4 + B3 + B2 | 1.6 m | 89.90 | 89.75 | 81.42 | 89.68 | 89.83 |
| BsiNet | B4 + B3 + B2 | 3.2 m | 88.35 | 87.37 | 77.74 | 87.24 | 87.50 |
| REAUnet | B4 + B3 + B2 + B1 | 0.8 m | 92.42 | 92.06 | 85.78 | 93.78 | 91.07 |
| REAUnet | B4 + B3 + B2 + B1 | 1.6 m | 91.29 | 89.16 | 81.20 | 89.65 | 89.90 |
| REAUnet | B4 + B3 + B2 + B1 | 3.2 m | 91.12 | 87.00 | 77.47 | 87.56 | 87.36 |

Note: ACC = Accuracy; Pre = Precision; F1 = F1-Score; IoU = Intersection over Union.
Table 6. RORs of FP and FN at different spatial resolutions for BsiNet and REAUnet.

| Network | Dataset | Resolution | Type | ROR (%) | Type | ROR (%) |
|---|---|---|---|---|---|---|
| BsiNet | iFLYTEK | 1.5–2.2 m | FN | 69.31 | FP | 58.44 |
| BsiNet | iFLYTEK | 3.0–4.4 m | FN | 53.77 | FP | 51.80 |
| BsiNet | Shanglin | 1.6 m | FN | 72.07 | FP | 67.67 |
| BsiNet | Shanglin | 3.2 m | FN | 65.37 | FP | 58.73 |
| REAUnet | iFLYTEK | 1.5–2.2 m | FN | 68.46 | FP | 47.78 |
| REAUnet | iFLYTEK | 3.0–4.4 m | FN | 64.97 | FP | 39.44 |
| REAUnet | Shanglin | 1.6 m | FN | 55.40 | FP | 66.00 |
| REAUnet | Shanglin | 3.2 m | FN | 51.26 | FP | 59.05 |

Share and Cite

MDPI and ACS Style

Cheng, L.; Deng, C.; Zhou, C.; Zhang, Y.; Lu, H.; Li, Z.; Chen, S. Effect Analysis of Spectral and Spatial Variations on Attention-Based Cropland Extraction Networks. Remote Sens. 2026, 18, 1501. https://doi.org/10.3390/rs18101501
