Article

Multi-Scale Feature Learning for Farmland Segmentation Under Complex Spatial Structures

by
Yongqi Han
1,
Yuqing Wang
1,
Yun Zhang
2,
Hongfu Ai
1,
Chuan Qin
1,3 and
Xinle Zhang
1,*
1
College of Information Technology, Jilin Agricultural University, Changchun 130118, China
2
College of Computer Science, Changchun Humanities and Sciences College, Changchun 130117, China
3
State Key Laboratory of Black Soils Conservation and Utilization, Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130102, China
*
Author to whom correspondence should be addressed.
Entropy 2026, 28(2), 242; https://doi.org/10.3390/e28020242
Submission received: 18 January 2026 / Revised: 13 February 2026 / Accepted: 18 February 2026 / Published: 19 February 2026
(This article belongs to the Section Multidisciplinary Applications)

Abstract

Fragmented, irregular, and scale-heterogeneous farmland parcels introduce high spatial complexity into high-resolution remote sensing imagery, leading to boundary ambiguity and inter-class spectral confusion that hinder effective feature discrimination in semantic segmentation. To address these challenges, we propose CSMNet, which adopts a ConvNeXt V2 encoder for hierarchical representation learning and a multi-scale fusion architecture with redesigned skip connections and lateral outputs to reduce semantic gaps and preserve cross-scale information. An adaptive multi-head attention module dynamically integrates channel-wise, spatial, and global contextual cues through a lightweight gating mechanism, enhancing boundary awareness in structurally complex regions. To further improve robustness, a hybrid loss combining Binary Cross-Entropy and Dice loss is employed to alleviate class imbalance and ensure reliable extraction of small and fragmented parcels. Experimental results from Nong’an County demonstrate that the proposed model achieves superior performance compared with several state-of-the-art segmentation methods, attaining a Precision of 95.91%, a Recall of 93.95%, an F1-score of 94.92%, and an IoU of 90.85%. The IoU exceeds that of UNet++ by 8.92% and surpasses PSPNet, SegNet, DeepLabv3+, TransUNet, SeaFormer and SegMAN by more than 15%, 10%, 7%, 6%, 5% and 2%, respectively. These results indicate that CSMNet effectively improves information utilization and boundary delineation in complex agricultural landscapes.

1. Introduction

The effective monitoring and management of agricultural resources has become a crucial issue for sustainable development in recent years, driven by accelerating population growth and the demands of food security [1]. Global food security, the optimal use of agricultural resources, and sustainable development all depend on the accurate extraction and dynamic monitoring of cropland, the fundamental geographical unit of agricultural production and management [2,3]. While arable land continues to decline as a result of urbanization and ecological degradation, the Food and Agriculture Organization of the United Nations (FAO) projects that world food demand will increase by almost 60% by 2050 [4]. Highly efficient and precise farmland monitoring technology has therefore emerged as a pivotal instrument for addressing the challenge of food security. With the growing volume of remote sensing imagery and the increasing refinement of ground-object visual features, remote sensing has become one of the predominant methods for extracting farmland information, mapping, and conducting dynamic analysis, owing to its large-scale, multi-temporal, and multi-spectral data acquisition capabilities; it provides crucial data support for understanding farmland distribution [5,6,7,8]. In smallholder-dominated agricultural systems, fragmented farmland parcels exhibit two defining characteristics: blurred boundaries with topological ambiguity, and pronounced spatial heterogeneity manifesting as irregular geometries and multi-scale parcel size distributions. These features severely limit the accuracy of cropland information extraction using conventional remote sensing techniques [9,10]. Owing to their reliance on manual feature design, traditional techniques such as maximum likelihood classification (MLC), support vector machines (SVM), random forests (RF), and object-based image analysis (OBIA) suffer 20–30% lower classification accuracy in fragmented farmland areas and struggle to handle spectral confusion and spatial heterogeneity in complex farmland scenes [11,12,13].
Recent advances in deep learning have revolutionized farmland extraction from remote sensing imagery. Building on the convolutional neural network (CNN) [14], the fully convolutional network (FCN) uses a weight-sharing mechanism over local receptive fields to accomplish end-to-end mapping from raw images to semantic segmentation [15]. By aggregating multi-scale contextual information, the Pyramid Scene Parsing Network (PSPNet) [16] adds a pyramid pooling module that allows the model to capture both global and local characteristics at various resolutions, enhancing semantic segmentation performance. To increase segmentation accuracy while preserving detail, DeepLab v3+ [17] combines multi-scale feature fusion with atrous (dilated) convolution, which broadens the receptive field. Both the Segmentation Network (SegNet) [18] and U-Net [19] employ an encoder-decoder structure: U-Net improves feature reuse by efficiently combining low-level and high-level features in the decoding stage through skip connections, while SegNet performs semantic segmentation by downsampling the image and reusing max-pooling indices for upsampling. In semantic segmentation tasks, the Transformer [20] uses a self-attention mechanism to model global contextual information; this mechanism efficiently captures long-distance dependencies, boosting feature representation and segmentation accuracy, particularly in complex images. Nested U-Net (U-Net++) [21] extends U-Net with additional skip connections and dense paths that enhance feature reuse and information flow, further improving segmentation performance. TransUNet [22] combines the U-Net and Transformer structures to leverage the Transformer’s strength in context modeling for semantic segmentation. The Squeeze-enhanced Axial Transformer (SeaFormer) [23] is a lightweight model for mobile semantic segmentation that builds on the Transformer’s attention mechanism by adding axial squeezing and local-information enhancement to boost performance at low resolutions. Deep learning can automatically learn multi-scale and multi-level feature representations, greatly increasing the precision and robustness of field extraction in intricate agricultural environments. However, these models were originally designed for general semantic segmentation tasks (e.g., natural scenes, medical images) and lack targeted optimization for farmland’s uniquely complex spatial characteristics (e.g., irregular shapes, significant scale differences, fragmented distribution). The boundaries of arable land in remote sensing images are less distinct than those of buildings and roads, so even though these models can increase segmentation accuracy, neural network performance in farmland extraction tasks has not yet attained the expected level of accuracy [24].
To overcome this difficulty, researchers both domestically and internationally have recently put forward a number of enhanced neural network-based models that significantly increase segmentation performance in challenging farmland scenes. Li et al. proposed DBBANet, which employs ResNet-50 as the encoder; its dual-branch architecture leverages both farmland-specific semantic information and detailed boundary information [25]. Chen et al. used a U-2-Net++ model based on the RSU module, depthwise separable convolution, and a channel-spatial attention module to extract different types of fields [26]. Lu et al. improved AttMobile-DeeplabV3+: a boundary-tracing function tracks the boundaries of the binary image, and the least-squares method fits the boundary line [27]. Zhang et al. integrated a bilateral feature encoder of CNNs and Transformers with a global-local information mining module to enhance global context extraction and improve cropland separability [28]. Wang et al. proposed a Multitask Deformable UNet combined enhanced network for farmland boundary segmentation [29]. Zhong et al. proposed an improved deep learning method named the Multi-Swin Mask Transformer [30]. Lu et al. proposed DASFNet, combining a dual attention mechanism with multi-scale feature fusion, to extract cropland from a GaoFen-2 image of Xinjiang, China [31]. Xu et al. proposed DSCUnet, an improved U-Net with depthwise separable convolution, for the fine extraction of cultivated land parcels over large areas [32]. Cao et al. proposed a Mask R-CNN instance segmentation method based on a dual-attention feature pyramid network for the automatic delineation of fields on small farms [33]. Zhang et al. developed a modified PSPNet (MPSPNet) for mapping large areas of cropland at very high resolution [34]. Miao et al. proposed a twin network (SNUNet3+) based on a full-scale connected UNet, whose decoder subnetwork incorporates a scSE attention mechanism together with a deep supervision module [35]. The MATNet architecture proposed by Zhang et al. fuses a CNN encoder with a Transformer decoder; the encoder incorporates spatial and channel reconstruction units alongside multiple attention mechanisms, enhancing the model’s ability to extract features from fine-grained agricultural parcels [36]. Xu et al. proposed a multiscale edge-guided network for accurate cultivated-land parcel boundary extraction, consisting of an edge enhancement module, a dual-pyramid structure, and a de-overlap operation [37]. To efficiently collect farmland ridge information, Hong et al. proposed a segmentation method based on an encoder-decoder network with a strip pooling module and an ASPP module [38]. Wang et al. proposed the AAMS-YOLO model for farmland parcel detection, which incorporates AMA and EMA attention modules and integrates the Attention Scale Sequence Fusion P2 Network with the Triple Feature Encoder and Scale Sequence Feature Fusion modules [39]. These models focus on farmland extraction and have improved boundary detection or scale adaptation to a certain extent.
However, most of them address only a single aspect of complex spatial structure (e.g., dual-branch structures for boundary and semantic information, or single-scale attention mechanisms), failing to comprehensively solve the integrated challenges of boundary ambiguity, scale heterogeneity, and class imbalance in complex farmland scenes.
Three fundamental obstacles still exist in farmland extraction, despite tremendous advancements and encouraging outcomes:
(1)
The fragmented distribution, irregular geometries, and significant scale differences of farmland parcels under China’s smallholder-dominated cultivation mode lead to blurred parcel boundaries and inter-class spectral confusion in remote sensing imagery, which severely limit the completeness and boundary accuracy of farmland extraction.
(2)
Farmland-containing remote sensing imagery typically has complex backgrounds, making it difficult to accurately detect and extract farmland features from these datasets.
(3)
Large-scale, high-resolution farmland mapping is not effectively automated by current methods. The precision of parcel area statistics and the ensuing agricultural management are directly impacted by these three issues.

2. Materials and Methods

2.1. Overview of the Study Area

The study area is located in Nong’an County, Changchun City, Jilin Province, China, situated in the hinterland of the Songliao Plain, with geographical coordinates ranging from 124°31′ E to 125°45′ E longitude and 43°55′ N to 44°55′ N latitude (as shown in Figure 1).
The terrain generally exhibits a topographical trend of higher elevation in the southwest and lower elevation in the northeast. Characterized by flat topography and distinct geomorphological features, the region is particularly suitable for agricultural production. According to official data from the Nong’an County People’s Government, the county covers a total area of approximately 5400 km², of which cultivated land accounts for 4153.33 km², representing a substantial proportion that highlights its significance as a major agricultural county. Climatologically, Nong’an County experiences a temperate continental climate with four distinct seasons. The mean annual temperature is approximately 4.7 °C, with a frost-free period of around 145 days and average annual precipitation of 507.7 mm, providing favorable climatic conditions for crop growth. The abundant arable land resources combined with suitable climatic conditions establish a robust natural foundation for agricultural production, while simultaneously providing distinct surface features for cropland identification using remote sensing imagery.

2.1.1. Sources and Production of Sentinel-2 Datasets

The whole of Nong’an County was covered by the satellite image data used in this study, which were obtained from the Google Earth Engine (GEE) platform on 15 September 2025, using Sentinel-2 with a spatial resolution of 10 m (as shown in Figure 1b). The Sentinel-2 imagery comprises 13 spectral bands, encompassing visible, near-infrared (NIR), and short-wave infrared (SWIR) spectral regions, which provide rich spectral characteristics for clearly delineating cropland distribution and effectively distinguishing agricultural fields from other land cover types. The rationale for selecting the September imagery for Nong’an County is based on the following considerations:
(1)
Phenological characteristics: The crop maturation period in September exhibits distinctive spectral signatures that facilitate accurate cropland segmentation.
(2)
Data quality: Compared to summer months, September typically has reduced cloud cover, ensuring clearer imagery, while the red-edge bands enhance feature extraction.
(3)
Algorithm compatibility: The high spatial resolution and multi-spectral bands optimally support deep learning model training.
(4)
Regional adaptability: This temporal window aligns with the optimal growing season for major agricultural regions in the Northern Hemisphere.
To guarantee the physical consistency and spatial geometric precision of the data, the acquired images underwent a number of pre-processing procedures, including radiometric calibration, atmospheric correction, and orthorectification. In ArcGIS 10.8, each agricultural field was digitized as a .shp file by employing the ‘pointing’ method on the remote sensing imagery. Following file consolidation, rasterization was applied to convert the .shp annotation files into .tif label files (as illustrated in Figure 1d). The entire satellite image was cropped using the sliding-window method, which guarantees a certain amount of overlap between cropped image blocks and increases the diversity of the training data. The window size was set to 256 × 256 pixels, and the overlap rate was set to 0.1. After cropping was finished, image patches dominated by irrelevant background were identified and removed, leaving 1441 usable farmland image patches. The dataset was expanded using horizontal flipping, vertical flipping, and diagonal mirroring as data augmentation techniques, which further improved the model’s capacity for generalization (as shown in Figure 2). After augmentation, the dataset had grown to 5764 images. The processed dataset was randomly split into training and testing sets at an 8:2 ratio, with strict spatial separation maintained between the two subsets to prevent data leakage (the label files underwent the same operations). This partitioning strategy ensures the independence of the test set for unbiased model evaluation.
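As an illustration of this tiling and augmentation step, the following minimal Python sketch reproduces the 256 × 256 sliding window with a 0.1 overlap rate and the three flip/mirror augmentations; the background-filter threshold, binary label assumption, and all function names are illustrative assumptions, not the project’s released code.

```python
# Minimal sketch of the tile-extraction and augmentation pipeline described
# above, assuming the scene raster (C, H, W) and binary label raster (H, W)
# are already loaded as NumPy arrays.
import numpy as np

def sliding_window_tiles(image, label, window=256, overlap=0.1):
    """Crop aligned image/label tiles with the stated 10% overlap."""
    stride = int(window * (1 - overlap))          # 256 px window, ~230 px stride
    tiles = []
    h, w = label.shape
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            img = image[:, top:top + window, left:left + window]
            lab = label[top:top + window, left:left + window]
            # Discard tiles dominated by irrelevant background
            # (1% farmland threshold is an assumption).
            if lab.mean() > 0.01:
                tiles.append((img, lab))
    return tiles

def augment(img, lab):
    """Original tile plus horizontal flip, vertical flip, and diagonal mirror."""
    yield img, lab
    yield img[:, :, ::-1], lab[:, ::-1]           # horizontal flip
    yield img[:, ::-1, :], lab[::-1, :]           # vertical flip
    yield img.transpose(0, 2, 1), lab.T           # diagonal mirror (square tiles)
```

Applied to the 1441 retained tiles, this fourfold expansion yields the 5764 images reported above.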

2.1.2. Public Datasets

To validate the generalization capability of the improved model in ultra-high-resolution scenarios, this study incorporates the Jilin-1 satellite remote sensing public dataset constructed by iFLYTEK (Jiangxi iFLYTEK Co., Ltd., Hefei, China), a leading enterprise in China’s intelligent speech and computer vision technology sector (hereafter, references to iFLYTEK denote the iFLYTEK public dataset). Covering diverse agricultural landscapes across North and East China, this dataset complements the localized Nong’an County dataset to form a complementary validation framework for both general and specialized applications. Jilin-1 is one of China’s most significant optical remote sensing satellites, developed by Changguang Satellite Technology Co., Ltd. (Changchun, China). The 2021 iFLYTEK dataset comprises Jilin-1 high-resolution remote sensing imagery with resolutions ranging from 0.75 to 1.1 m, featuring four spectral bands (B, G, R, and NIR). The labels of this dataset were manually annotated by professional remote sensing interpreters [40]. From the Feitian dataset’s 16 TIF images of varying dimensions (rows and columns ≥ 3000), 7 remote sensing images (rows and columns ≥ 6000) were selected. These images cover typical farmland types, including contiguous cornfields in the North China Plain, terraced fields in Shandong’s hilly regions, and paddy fields in Jiangsu. Each image was cropped using the sliding-window method with a window size of 256 × 256 pixels and zero overlap. After removing irrelevant images, the dataset comprised 5417 images, which were subsequently divided into training and testing sets at an 8:2 ratio.

2.2. Model Architecture

2.2.1. Improvements to the U-Net++ Model

To separate cropland from Sentinel-2 satellite data, our research proposes a deep learning model referred to as CSMNet (as shown in Figure 3). UNet++, an enhanced variant of the traditional U-Net architecture, serves as the foundation for CSMNet. The U-Net model itself consists of an encoder and a decoder that use skip connections to merge low-resolution and high-resolution information. U-Net++ extends U-Net with multiple skip connections that provide several connection channels between each layer and its successor. This improves the flow of contextual information, gives the network receptive fields of varying sizes, and helps to better integrate features from several layers. However, the dense connectivity between encoder and decoder nodes increases computational complexity, as each decoder node must process all intermediate connections with the corresponding encoder nodes. To address this, we redesigned the skip connections to prioritize the most salient connections and feature fusions, achieving more efficient computation while maintaining performance. The new skip connections reduce computational complexity and parameter count while making effective use of the principal features, so the model’s performance is maintained rather than degraded. Different levels of feature information are passed from the encoder to the decoder, which enhances image segmentation performance and enables the model to detect farmland areas of varying sizes more precisely.
The output is mapped into the lateral output layer, which aggregates the high-level and low-level feature maps of various scales after upsampling. This improves the detail and semantic information in the segmentation results, reduces information loss during transmission, and strengthens the capture of small structures and boundaries, ensuring that the model’s segmentation output closely matches the real distribution of cropland:
$$x^{i,j} = \begin{cases} C\left(D\left(x^{i-1,j}\right)\right), & i + j < 4 \\ C\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1},\, U\left(x^{i+1,j-1}\right)\right]\right), & i + j = 4, \end{cases}$$
where $C(\cdot)$ denotes the convolution operation, $D(\cdot)$ denotes the downsampling layer, $U(\cdot)$ denotes the upsampling layer, and $[\cdot]$ denotes feature map concatenation. Here, $x^{i,j}$ denotes the output feature map of a node (including the encoder’s feature maps), where superscript $i$ denotes the skip-connection depth and superscript $j$ denotes the skip-connection width. The feature maps of the decoder are passed to the lateral output layer and stitched together for the final farmland segmentation.
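The node update above can be sketched in PyTorch as follows, assuming $C(\cdot)$ is a 3 × 3 convolution block, $D(\cdot)$ is 2× max-pool downsampling, and $U(\cdot)$ is bilinear upsampling; the dictionary-based wiring and all names are illustrative only, not the released CSMNet code.

```python
# Sketch of one node of the redesigned skip-connection scheme.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    """C(.): a plain 3x3 conv block (assumed form)."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def node_output(i, j, feats, blocks):
    """Compute x^{i,j}; feats[(i, k)] caches previously computed node outputs,
    blocks[(i, j)] holds the conv_block for node (i, j)."""
    if i + j < 4:                                    # backbone path: C(D(x^{i-1,j}))
        x = blocks[(i, j)](F.max_pool2d(feats[(i - 1, j)], 2))
    else:                                            # i + j == 4: fuse skips with U(x^{i+1,j-1})
        up = F.interpolate(feats[(i + 1, j - 1)], scale_factor=2,
                           mode="bilinear", align_corners=False)
        skips = [feats[(i, k)] for k in range(j)]    # [x^{i,0}, ..., x^{i,j-1}]
        x = blocks[(i, j)](torch.cat(skips + [up], dim=1))
    feats[(i, j)] = x
    return x
```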
ConvNeXt V2 [41] is a modern convolutional neural network (CNN) architecture that integrates design principles from the Swin Transformer [42] and ResNet [43], bringing Transformer-style macro design to purely convolutional operations. To further improve performance on image processing tasks at scale, it adopts a staged structure, with each stage consisting of multiple residual blocks, so that its feature extraction hierarchy mirrors the Swin Transformer’s. Furthermore, ConvNeXt V2 employs depthwise convolution and Layer Normalization, together with a Global Response Normalization (GRN) layer and a fully convolutional masked autoencoder (FCMAE) pre-training framework. Compared to traditional encoder architectures, ConvNeXt V2 enhances generalization capability via self-supervised learning while optimizing feature representation, making it particularly suitable for processing high-resolution satellite imagery and effective at capturing fine-grained details such as agricultural field boundaries, textures, and semantic features. In this study, we adopt the ConvNeXt-Tiny configuration with per-stage block counts of [3, 3, 9, 3] and output feature map channels of [96, 192, 384, 768] (as shown in Figure 4).
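As a hedged illustration, the ConvNeXt V2-Tiny backbone can be instantiated as a hierarchical feature extractor with the timm library, whose convnextv2_tiny definition matches the stage depths [3, 3, 9, 3] and channel widths [96, 192, 384, 768] cited above; the exact encoder wiring in CSMNet may differ.

```python
# Sketch: ConvNeXt V2-Tiny as a four-level feature pyramid (strides 4-32).
import timm
import torch

encoder = timm.create_model("convnextv2_tiny", pretrained=False,
                            features_only=True)

x = torch.randn(1, 3, 256, 256)   # one 256 x 256 tile (3 bands assumed here)
for f in encoder(x):
    print(f.shape)                # channel widths: 96, 192, 384, 768
```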

2.2.2. Multi-Headed Attention

This work presents a Multi-Headed Attention Module (MHA) to address the varied appearance of ground objects in high-resolution remote sensing data and the complex, changing character of rural scenes (as shown in Figure 5). This module seeks to improve the model’s accuracy in detecting and segmenting farmland targets by effectively combining local features and global contextual information through a structured attention coordination method. MHA builds a multidimensional orthogonal attention division scheme in which feature optimization is separated into three parallel, specialized paths, complemented by a dynamic fusion mechanism:
(1)
The channel attention pathway adopts an efficient channel attention strategy without reducing dimensionality. Leveraging adaptive 1D convolutions to dynamically adjust channel weights, it focuses on enhancing feature channels closely linked to crop spectral responses, thereby enabling effective differentiation between diverse crop varieties and distinct growth stages.
(2)
For the spatial attention pathway, a traditional spatial attention mechanism is employed. It merges features extracted through channel-wise average pooling and max-pooling operations, then constructs spatial weight maps using fixed-size 7 × 7 convolution kernels. This process strengthens the regular geometric boundaries, internal texture structures, and spatial layout characteristics of farmland parcels.
(3)
The global context path uses a linear-complexity self-attention mechanism. It models semantic relationships between farmland and adjacent areas by creating long-range dependencies between pixels, offering crucial scene-level prior information for local discrimination.
(4)
A dynamic adaptive fusion mechanism is also introduced. A lightweight gate network is designed to overcome the drawbacks of fixed-weight fusion. Conditioned on the initial input features, this network generates in real time a set of spatially adaptive weight maps corresponding to the outputs of the three attention paths. The module achieves input-adaptive optimization of feature fusion by dynamically allocating contributions from each path based on the local content of the input imagery, using element-wise weighted summation.
In remote sensing-based crop field extraction, the Channel Attention Module (CAM) suppresses interference from irrelevant channels while selectively enhancing channels responsive to crop spectral properties. After a global description of each channel is collected via global average pooling, one-dimensional convolutions capture cross-channel interaction information to create the channel weight vector. For the input feature map $F$, the channel descriptor $z$ is obtained using a global average pooling operation; $z$ is then transformed by a one-dimensional convolution to capture cross-channel interactions. To guarantee sufficient coverage of the channel neighborhood, the kernel size $k$ is adaptively chosen based on the number of channels. The channel weight vector $\omega$ is obtained by passing the output of the one-dimensional convolution through a Sigmoid activation function. This procedure can be expressed as:
$$\omega = \sigma\left(\mathrm{Conv1D}_{k}(z)\right),$$
A channel-wise multiplication between the channel weight vector $\omega$ and the original input feature map $F$ yields the channel-enhanced feature map $F_c$:
$$F_c = \omega \otimes F,$$
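A minimal PyTorch sketch of this channel attention pathway is given below; the adaptive rule for the kernel size $k$ follows the original ECA formulation and is an assumption here, since the paper does not state its exact form.

```python
# Sketch: ECA-style channel attention without dimensionality reduction.
import math
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))   # adaptive kernel (ECA rule)
        k = k if k % 2 else k + 1                         # force odd kernel size
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, f):                                 # f: (B, C, H, W)
        z = f.mean(dim=(2, 3))                            # global average pooling -> (B, C)
        w = self.conv(z.unsqueeze(1))                     # cross-channel 1D conv
        w = torch.sigmoid(w).squeeze(1)                   # channel weights, omega
        return f * w[:, :, None, None]                    # F_c = omega (x) F
```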
The Spatial Attention Module (PAM) suppresses spatial responses in non-farmland areas while enhancing the regions of the feature maps that are sensitive to the geometric structure of farmland. To capture the regular geometric features of farmland parcels and contextual information at a broader scale, a fixed 7 × 7 convolution kernel is used. For the input feature map $F$, average and max pooling are performed along the channel dimension and the results are concatenated; the composite feature map is then subjected to a 7 × 7 convolution. This maintains computational efficiency while offering a receptive field large enough to cover the average spatial scale of farmland units. The spatial attention map is produced by a Sigmoid activation after the convolution, calculated as follows:
$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)\right]\right)\right),$$
Element-wise multiplication between the generated spatial attention map and the original input feature map yields the spatially augmented feature map:
$$F_s = M_s \otimes F,$$
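The spatial pathway can be sketched as a CBAM-style module; the code below is an illustrative reconstruction from the equations above, not the released implementation.

```python
# Sketch: spatial attention from channel-wise average and max pooling.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, f):                          # f: (B, C, H, W)
        avg = f.mean(dim=1, keepdim=True)          # AvgPool along channels
        mx, _ = f.max(dim=1, keepdim=True)         # MaxPool along channels
        m = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return f * m                               # F_s = M_s (x) F
```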
This module successfully captures the regular boundaries of farmland parcels, their underlying textural structures, and the spatial relationships between adjacent parcels, while suppressing the irregular spatial elements of non-agricultural regions such as buildings and highways. It works effectively for farmland scenes with clear geometric regularity, including trapezoidal terraces and rectangular plots. It also lessens spatial discontinuity problems caused by shadows, cloud cover, or uneven crop growth, greatly increasing the model’s accuracy in identifying field boundaries and improving internal-area consistency, ultimately producing more accurate farmland segmentation.
The global context module first generates the query ($Q$), key ($K$), and value ($V$) matrices from the input feature map via 1 × 1 convolutions. The context matrix is then obtained by computing the matrix product between key and value, and the attention output is obtained by multiplying this by the query matrix. The procedure is assembled as follows:
$$Q, K, V = \mathrm{Conv}_{1 \times 1}(F),$$
$$C = K^{T} \cdot V,$$
$$O = Q \cdot C,$$
The attention output $O$ is projected through a 1 × 1 convolution layer and combined with the original input features via a residual connection, yielding the globally context-enhanced feature map $E$:
$$E = \mathrm{Proj}(O) + F,$$
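A sketch of this linear-complexity pathway is shown below; the softmax normalization of the keys is an added assumption for numerical stability, in the spirit of efficient-attention variants, and the class name is illustrative.

```python
# Sketch: global context via a C x C context matrix (linear in pixel count).
import torch
import torch.nn as nn

class GlobalContext(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, 1)   # Q, K, V via 1x1 conv
        self.proj = nn.Conv2d(channels, channels, 1)      # output projection

    def forward(self, f):                                 # f: (B, C, H, W)
        b, c, h, w = f.shape
        q, k, v = self.qkv(f).flatten(2).chunk(3, dim=1)  # each (B, C, HW)
        k = k.softmax(dim=-1)                             # normalize keys (assumption)
        ctx = k @ v.transpose(1, 2)                       # context matrix C: (B, C, C)
        o = (ctx.transpose(1, 2) @ q).reshape(b, c, h, w) # O = Q . C
        return self.proj(o) + f                           # E = Proj(O) + F (residual)
```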
By creating long-range relationships between pixels, this module successfully integrates contextual information between farmland and nearby landmarks (such as buildings, roads, and bodies of water). It fills in gaps in farmland areas caused by occlusion and helps the model discern semantically distinct but spectrally similar areas (e.g., vegetation-covered farmland versus woodland). This improves the model’s ability to make boundary decisions in complicated scenarios and maintain semantic consistency.
The Dynamic Gate Module (DGM) is responsible for adaptively fusing the output features from the three branches, dynamically adjusting the contribution weights of each branch based on the content of the input image to achieve optimal feature fusion. This module first extracts multi-scale contextual information from the raw input features, then generates three-branch fusion weights through a lightweight gate network. For the input feature map $F$, multi-scale features are extracted using multiple average pooling layers at different scales. These features are concatenated with the original features to obtain a rich contextual feature representation. A gating network composed of two 1 × 1 convolution layers (with a ReLU activation function in between) generates a three-channel base weight map $W_{\mathrm{base}}$. This process can be represented as:
$$F_{\mathrm{ms}} = \mathrm{Concat}\left(\mathrm{AvgPool}_1(F), \mathrm{AvgPool}_2(F), \ldots, F\right),$$
$$W_{\mathrm{base}} = \mathrm{Conv}_{1 \times 1}\left(\mathrm{ReLU}\left(\mathrm{Conv}_{1 \times 1}(F_{\mathrm{ms}})\right)\right),$$
To ensure that the three weights at each spatial location sum to 1, Softmax normalization is applied along the channel dimension, yielding the final fusion weight maps $W_0$, $W_1$, $W_2$. The feature maps from the three branches are multiplied element-wise with their corresponding weight maps and then summed to obtain the fused feature map:
$$F_{\mathrm{fused}} = W_0 \otimes F_c + W_1 \otimes F_s + W_2 \otimes E,$$
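The gate can be sketched as follows; the pooling scales and the hidden width of the gate network are assumptions, since the paper does not specify them.

```python
# Sketch: Dynamic Gate Module fusing the three attention branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGate(nn.Module):
    def __init__(self, channels, hidden=32, pool_sizes=(2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        in_ch = channels * (1 + len(pool_sizes))
        self.gate = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 1))               # three-channel base weights

    def forward(self, f, f_c, f_s, e):             # branch outputs F_c, F_s, E
        h, w = f.shape[2:]
        # Multi-scale context: pooled copies of F resampled back to full size.
        ctx = [f] + [F.interpolate(F.avg_pool2d(f, p), size=(h, w),
                                   mode="bilinear", align_corners=False)
                     for p in self.pool_sizes]
        w_maps = self.gate(torch.cat(ctx, dim=1)).softmax(dim=1)
        w0, w1, w2 = w_maps.chunk(3, dim=1)        # weights sum to 1 per location
        return w0 * f_c + w1 * f_s + w2 * e        # F_fused
```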
Depending on the local features of the input image, this module can adaptively modify the weights of each branch. For example, it increases the weight of global context in congested urban–rural periphery areas, intensifies the importance of spatial attention in fragmented farms with complex boundaries, and increases the contribution of channel attention in uniformly cultivated fields with noticeable spectral patterns. This greatly enhances the model’s capacity for generalization and segmentation accuracy in a variety of settings by enabling adaptive feature fusion across a range of farming conditions.
Furthermore, MHA is integrated into the backbone network via residual connections. This approach substantially enhances feature representation capabilities while introducing minimal additional parameters and computational overhead, ensuring practicality and deployment flexibility when processing large-scale remote sensing imagery.

2.2.3. Combine Loss Function

In fine-grained extraction of farmland from remote sensing imagery, farmland areas constitute only a small proportion of the total image area, while non-farmland areas dominate overwhelmingly. This directly leads to a class imbalance between farmland (positive samples) and non-farmland (negative samples). The imbalance not only reduces the model’s sensitivity in identifying farmland areas but also frequently leads to critical defects such as blurred, discontinuous, or even missing farmland boundaries. These issues severely limit the accuracy and practicality of remote sensing-based farmland extraction. To mitigate the adverse effects of class imbalance, particularly in boundary-region identification, and to achieve refined optimization of farmland boundaries, we adopt a hybrid loss function. Its first component, the Binary Cross-Entropy loss, is well suited to classification tasks, particularly binary classification, where each pixel is assigned a probability:
$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i)\right],$$
The Dice loss, in contrast, is designed expressly to assess overlapping regions, improving the model’s sensitivity to farmland areas and helping handle imbalanced datasets:
$$L_{\mathrm{Dice}} = 1 - \frac{2 \cdot \sum_{i=1}^{N}\left(y_i \cdot \hat{y}_i\right)}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i},$$
$$L_{\mathrm{Com}} = \alpha \cdot L_{\mathrm{BCE}} + \beta \cdot L_{\mathrm{Dice}},$$
where $\alpha$ and $\beta$ are weight coefficients with $\alpha, \beta \in [0, 1]$; $y_i$ is the true label of the $i$th pixel, $\hat{y}_i$ is the predicted probability that the $i$th pixel is farmland, and $N$ is the total number of model-predicted pixels.
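A minimal sketch of the combined loss follows, using the weights selected in Section 3.1 ($\alpha$ = 0.8, $\beta$ = 1.0); the smoothing term eps is an assumption added to avoid division by zero.

```python
# Sketch: hybrid BCE + Dice loss as defined above.
import torch
import torch.nn.functional as F

def combine_loss(pred, target, alpha=0.8, beta=1.0, eps=1e-6):
    """pred: raw logits (B, 1, H, W); target: binary mask of the same shape."""
    l_bce = F.binary_cross_entropy_with_logits(pred, target)
    p = torch.sigmoid(pred)                       # per-pixel farmland probability
    inter = (p * target).sum()
    l_dice = 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)
    return alpha * l_bce + beta * l_dice
```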

2.3. Accuracy Assessment

To validate the accuracy of the extraction results obtained using the proposed method, this study employs precision, recall, F1-Score, and Intersection over Union (IoU) as evaluation metrics:
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
Precision measures the percentage of predicted farmland pixels that correspond to actual farmland. True Positive (TP) denotes the number of correctly extracted farmland pixels, whereas False Positive (FP) denotes the number of pixels incorrectly labeled as farmland.
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
Recall in farmland segmentation refers to the percentage of all true farmland pixels that the model correctly identifies. False Negative (FN) denotes the number of farmland pixels incorrectly classified as non-farmland.
$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
The F1-Score, the harmonic mean of precision and recall, reflects the model’s overall performance on the farmland segmentation task.
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}.$$
The IoU estimates the degree of overlap between actual and predicted farmland areas.
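The four metrics can be computed from binarized predictions as in the following sketch; the threshold, smoothing term, and tensor names are illustrative.

```python
# Sketch: pixel-wise Precision, Recall, F1, and IoU from TP/FP/FN counts.
import torch

def metrics(pred, target, thresh=0.5, eps=1e-6):
    """pred: probabilities in [0, 1]; target: binary ground-truth mask."""
    p = (pred >= thresh).float()
    tp = (p * target).sum()                       # correctly extracted farmland
    fp = (p * (1 - target)).sum()                 # non-farmland labeled farmland
    fn = ((1 - p) * target).sum()                 # farmland labeled non-farmland
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```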

2.4. Experimental Environment

With Windows 10 as the operating system, an Intel Xeon Gold 6246R (Intel Corporation, Santa Clara, CA, USA) as the central processor, a Quadro RTX 8000 (NVIDIA Corporation, Santa Clara, CA, USA) as the graphics card model, Python 3.8.20 as the programming language, Torch 1.13.1 + cu117 as the model training architecture, and CUDA 11.7 as the acceleration platform, the study’s model training and validation are carried out on the same device. Table 1 displays specific parameter settings.

3. Results

This section uses CSMNet to conduct ablation experiments. Additionally, CSMNet is compared with seven other semantic segmentation methods (PSPNet, SegNet, DeepLabv3+, UNet++, TransUNet, SeaFormer, and SegMAN) on two datasets (Nong’an County and iFLYTEK) in order to assess its segmentation performance. To ensure result reliability, each model was trained and tested five times with different random seeds, and the mean and standard deviation of the five runs were reported. The ‘±’ in the tables below denotes the standard deviation of the experimental results, reflecting each model’s stability under different random initialization conditions.

3.1. Weighting of the Combine Loss Function

Experiments were performed with various combinations of the weight coefficients α and β for the modified model in order to assess the effect of the loss-function weights on the results. Table 2 confirms the importance of tailoring loss-function weights to the application scenario, farmland extraction in particular, in semantic segmentation tasks. For farmland extraction, loss combinations skewed toward regional consistency are more advantageous than the equal-weight approach used for general tasks. The final combination chosen was α = 0.8, β = 1.0.

3.2. Evaluation of Ablation

To evaluate the impact of ConvNeXt V2, the redesigned skip-connection architecture, the Multi-Headed Attention (MHA) mechanism, and the hybrid loss function on model performance, we selected a typical segmentation model (UNet++) as the baseline and constructed the target model CSMNet by progressively integrating the above components: in CSMNet, “C” represents the integration of ConvNeXt V2, “S” corresponds to the multi-scale fusion structure with redesigned skip connections and lateral output layers, and “M” stands for the boundary optimization structure combining MHA and the hybrid loss function. Figure 6 visually presents the performance of these models on the Sentinel-2 dataset across four core evaluation metrics: Precision, Recall, F1-Score, and IoU (all in percent). The experimental results (Figure 6) show that Precision, Recall, F1-Score, and IoU improve continuously as each component is added.
Individual module ablation (Table 3) clearly demonstrates the independent contributions of each module.
Compared with the baseline UNet++, adding the Combine Loss alone increased Recall markedly from 87.16% to 89.91%. This shows that, by forcing the model to concentrate on difficult pixels (such as small objects and boundary regions), the Combine Loss successfully reduces class imbalance and significantly improves recall. Introducing the ConvNeXt V2 encoder alone produced the most extensive performance improvement of any single-module experiment: all metrics reached single-module optimality, with the F1-score and IoU rising to 92.41% and 86.18%, respectively, demonstrating its strong integrated representation capability. Adding the Multi-Scale Fusion alone also produced consistent improvements over the baseline. Its F1 and IoU were marginally lower than with ConvNeXt V2, although recall (89.98%) remained strong; during decoding, features from several encoder stages were effectively merged through the redesigned lateral outputs and skip connections, reducing the semantic gap between the encoder’s high-level semantic features and the decoder’s low-level detail. When the MHA module was applied separately, its high precision showed that false positives are effectively suppressed, bringing prediction boundaries closer to the real targets; in complex scenarios it directly enhances boundary accuracy while maintaining high precision. By dynamically combining channel, spatial, and global context through a lightweight gating mechanism, the model can adaptively focus on structurally complicated regions (e.g., borders and fine junctions), learning to emphasize important parts and disregard unimportant background. This greatly improves edge perception and the capacity to identify intricate structures.

3.3. Results of the Sentinel-2 Dataset

To evaluate the applicability of the proposed CSMNet model, we conducted comparative experiments with other mainstream deep learning-based image segmentation models, including PSPNet, DeepLabv3+, SegNet, TransUNet, SeaFormer, UNet++, and SegMAN [44]. All models were trained under identical conditions, without pre-trained weights, and were trained and evaluated on the same dataset to ensure a fair comparison. Figure 7 displays the test visualization extraction results.
CSMNet and TransUNet performed better in segmentation accuracy and were better able to preserve the edges and details of the original image. PSPNet performed poorly overall, with blurred edges and erroneous omissions in the majority of regions; its segmentation results appear somewhat rough, with background aliasing and mis-segmentation and a poor capacity to capture fine features, although it generally does well when handling long-distance features. SegNet’s segmentation maps exhibit erratic breaks and fine-connectivity issues in certain areas, so some errors and omissions are visible when compared with the labelled maps. Excessive smoothing in DeepLab V3+ causes the loss of existing detail, particularly at complicated edges, where over-adhesion is clearly visible. Although the final product appears decent overall, UNet++ exhibits redundant segmentation, with some fields having jagged edges and several small sections mislabeled. TransUNet achieves a certain level of consistency, but it still falls short in some complex areas, particularly in maintaining the sharpness and detail of object edges; its results show blurred edges, missing details, and background noise. SeaFormer avoids the “overall collapse” issue of conventional models thanks to its strong semantic consistency and excellent recovery of global structure, but it still falls short in capturing the intricacies of the grid layout, which results in the incorrect identification of irregularly shaped cropland; the model does not adequately correlate the semantic information of nearby regions and does not learn enough about “field continuity.” SegMAN achieves comprehensive segmentation while avoiding excessive edge smoothing; however, it exhibits minor shortcomings in handling highly irregular region shapes and suppressing background noise, occasionally resulting in insufficient boundary refinement. CSMNet outperforms the other models in capturing particular morphology and texture, as well as in handling smaller segmented areas, producing smoother edges with no sticking.
Table 4 displays the four accuracy evaluation criteria for the models used to map agricultural land in the Nong’an County study region. All metrics are reported as mean ± standard deviation across five independent runs, capturing stability over varied spatial samples and random initializations. The precision, recall, F1-score, and IoU indices show clear differences in performance among the models. With the highest Precision (95.91%), Recall (93.95%), F1-score (94.92%), and IoU (90.85%), CSMNet outperforms the other models on all four quantitative metrics. This suggests that the model rarely generates false alarms when predicting fields, captures field boundaries very accurately, and produces predictions that overlap strongly with the actual areas. SegNet’s recall (85.79%) suggests that missed detections occur when working with small or fragmented areas; its precision (93.75%) indicates relatively few false alarms, while its IoU (80.02%) shows that stability remains inadequate compared with more sophisticated models. By combining the benefits of the Transformer and the conventional U-Net architecture, TransUNet forecasts field boundaries more precisely, achieving a comparatively high recall (91.78%) that demonstrates a distinct advantage in identifying small objects and intricate forms, along with reasonably strong IoU (83.90%); its precision (90.72%), however, suggests minor inaccuracy. DeepLab V3+ and UNet++ outperform PSPNet in comparable and notable ways. DeepLab V3+’s enhanced encoder-decoder architecture and atrous spatial pyramid pooling (ASPP) module enable it to efficiently capture multi-scale contextual information, improving precision (92.77%) and IoU (82.28%). Nevertheless, its recall (86.72%) is comparatively low, most likely because atrous convolution cannot adequately capture local features, causing some fine fields or fragmented regions to be overlooked. Conversely, UNet++ improves feature reuse through dense skip connections, allowing it to marginally surpass DeepLab V3+ in recall (87.16%); however, its intricate structure may introduce some noise, yielding slightly lower precision (92.01%) and IoU (81.93%). PSPNet performs poorly on all four metrics, with recall (79.90%) and IoU (74.91%) falling well short of the other models. This could be because its Pyramid Pooling Module (PPM) over-smooths local features when collecting global context, leading to a high omission rate and erroneous field boundary segmentation. SeaFormer’s nearly equal precision (92.08%) and recall (91.95%) show that the model achieves a fair compromise between accuracy and completeness; false detections could result from global attention’s tendency to “over-associate” semantics and misclassify similar features (such as road texture versus farmland greenery). SegMAN’s overall performance (F1: 93.81%, IoU: 88.78%) demonstrates its effectiveness in integrating local attention mechanisms with state-space models for feature fusion and spatial detail preservation; nevertheless, its metrics consistently fall slightly short of CSMNet’s across all evaluation criteria.
Furthermore, CSMNet improved on the baseline UNet++ by 3.9 percentage points in Precision, 6.79 in Recall, 5.41 in F1-score, and 8.92 in IoU.
Across the accuracy metrics (precision, recall, F1, and IoU), CSMNet emerged as the clear leader, guaranteeing high dependability in farmland boundary extraction. Its inference speed (30.18 FPS) and computational cost (35.72 M parameters, 44.85 G FLOPs), as reported in Table 5, stay within a very reasonable range, making it viable and effective for real-world agricultural remote sensing applications without stringent real-time requirements. Compared with the lightweight SeaFormer model, CSMNet attains significant accuracy improvements at a manageable speed expense, and it outperforms the heavier TransUNet in both efficiency and performance. For extraction tasks, CSMNet is therefore a strong and practical option, striking a good balance between accuracy and efficiency.
The CSMNet model’s mapping results for the overall distribution of farmland in Nong’an County are displayed in Figure 8. Farmland is represented by the white areas in the figure; its outline is largely complete, essentially reflecting the spatial layout with a degree of regularity. The smooth edge lines show that the model matches most borders correctly, and widely dispersed regions of expansive cropland are identified precisely. For macro land-use analysis of remote sensing imagery, the CSMNet model thus works well for core identification of farmland areas and can faithfully depict the contours and large-scale structure of major features.

3.4. Results of the iFLYTEK Dataset

The segmentation results for a 9049 × 6806 test image tile from the iFLYTEK dataset are displayed in Figure 9, which shows places where the respective models missed detections (red regions) and regions they identified incorrectly (green regions; masks with confidence values greater than 0.5). The figure clearly shows that CSMNet produces the most accurate extraction of farmland regions, the cleanest edge and structure detection, and the fewest green and red marks. The next best results are from TransUNet, SeaFormer, and SegMAN, which retain morphology better and show fewer green and red marks. Small target details (small farming plots) may be overlooked by DeepLabV3+ because of its heavy dependence on a large receptive field. PSPNet is not sensitive enough in predicting border pixels and shows a few more errors and omissions. SegNet and UNet++ perform moderately: UNet++’s strong focus on local detail makes it more likely to miss farmland in regions with spectral complexity or blurred boundaries, while SegNet’s feature transfer scheme limits its contextual modeling capability and caps its achievable performance.
Table 6 shows that CSMNet identifies farmland in fine detail with little contamination from non-farmland areas. It has the greatest Precision (97.23%), indicating the fewest false detections; SegNet’s precision is the lowest at 94.32%. With the highest Recall (96.39%), CSMNet captures the great majority of farmland and misses very little, whereas SegNet misses the most, with the lowest recall (92.25%); every other model is below 95%. With the highest F1-score (96.94%), CSMNet demonstrates the best overall balance, while the traditional models remain below 95%, suggesting that they cannot adequately capture the intricate farmland structure; SegMAN and SeaFormer rank second and third. With an IoU of 93.69%, CSMNet’s predicted area coincides very closely with the real area and fits the boundaries well, the best result by a wide margin, with all other models falling below 92.5%.

3.5. Extraction Results in Challenging Scenarios

In the “small fragmented plots” scenario (the first two rows in Figure 10), all models showed varying degrees of performance degradation, demonstrating the intrinsic complexity of the problem. Models such as PSPNet, SegNet, DeepLab V3+, and UNet++ capture the edge contours of micro-plots only imprecisely. Because the lines separating small plots from their backgrounds or neighboring plots are blurry, many micro-plots appear visually “fused together”; this extreme lack of fineness makes it challenging to accurately segment the separate shapes of fragmented plots. TransUNet showed better detail preservation thanks to its Transformer architecture and dense skip connections, but it still produced many small misclassifications and contour coalescence. SeaFormer performed at an intermediate level, although it struggled to maintain the integrity of small parcels. SegMAN occasionally merges very small parcels into adjacent ones. In this comparison, CSMNet proved to be the most reliable performer: it correctly detects the existence of the great majority of minute patches and produces more complete and smoother boundaries. Nevertheless, in the fine patches inside the red-boxed region of the first sample row, CSMNet also merged a few extremely tiny patches into nearby ones, suggesting there is still room for the model to reach sub-pixel-level localization at extreme scales.
In the last two rows of examples targeting “shadowed areas,” strong occlusion of important spectral information caused all models to show significant classification uncertainty and boundary blurring. In darkened regions, models such as PSPNet, SegNet, DeepLab V3+, UNet++, and TransUNet demonstrated much lower segmentation accuracy and border coherence; their outputs differed significantly from the actual object shapes, showing that they could not reconstruct the overall morphology of objects under shadow. SeaFormer’s strong global context modeling, which infers coherent semantics for occluded objects, mitigates this problem to some degree; it still lacks adequate continuity, however, at “shadow-object” intersections (such as the line separating the road and shadow in the fourth image). SegMAN’s predictions still exhibit some leakage in shadowed areas, similar to the other models. CSMNet performs best in shadowed scenes, successfully reconstructing most of the shadowed farmland, which illustrates how it can compensate for missing spectral information by exploiting spatial context.
CSMNet’s predictions still show uncertainty at shadow boundaries, for example rough edges and partial misclassifications. This suggests that the perception of low-contrast “shadow-object” boundaries remains a bottleneck for the model.

4. Discussion

Nong’an County, Jilin Province, presents a challenging yet representative case for farmland parcel extraction due to its traditional smallholder farming structure. This model generates a landscape of high spatial entropy, characterized by significant scale variance, irregular geometric patterns, and extreme parcel fragmentation. In remote sensing imagery, this complexity manifests as informational ambiguity, primarily through blurred parcel boundaries and high spectral confusion (e.g., between crops, rural settlements, and bare soil), which severely limits the completeness and precision of automated extraction methods. To address this challenge rooted in information uncertainty, we proposed and validated CSMNet.
The architecture of CSMNet is designed to manage the high information entropy inherent in such scenes. It integrates a ConvNeXt V2 encoder for robust hierarchical representation, redesigned skip connections with lateral outputs to facilitate multi-scale information flow and reduce semantic gaps, an adaptive multi-head attention module for dynamic context integration, and a hybrid Binary Cross-Entropy and Dice loss to counter class imbalance. Evaluations on multi-resolution data, including high-resolution Jilin-1 and 10-m Sentinel-2 imagery, confirm the model’s robustness. Notably, on Sentinel-2 data, CSMNet maintains distinct boundaries and edge details, effectively mitigating the common issue of boundary dissolution into noise in medium-resolution images, and outperforms benchmark models across all key metrics (Precision, Recall, F1-score, IoU).
Ablation studies provide critical insights into the informational contributions of each component. The hybrid loss function was essential for resolving the severe foreground-background class imbalance, ensuring small or irregular parcels were not lost—a direct mitigation of information loss for minority classes. The structural innovations of lateral outputs and modified skip connections worked synergistically to enhance cross-scale feature fusion. This mechanism preserves the fine-grained spatial information that is often attenuated in standard U-Net variants, effectively bridging the encoder–decoder semantic gap and retaining critical details about small-scale structures. Fundamentally, CSMNet’s performance stems from the synergistic information processing between the ConvNeXt V2 encoder and the multi-head attention modules. While the encoder establishes a strong hierarchical feature foundation with reduced spatial redundancy, the attention mechanism dynamically reinforces salient spatial and contextual dependencies. This synergy allows the model to resolve ambiguities in spectrally confused regions by focusing on long-range dependencies and precise boundary semantics, directly addressing the core causes of feature loss and boundary blurring.

5. Conclusions

This study introduced CSMNet, a novel deep learning framework designed to address the high-entropy problem of accurate farmland parcel extraction in fragmented, smallholder agricultural landscapes. The model’s integrated innovations—a powerful modern encoder, context-aware attention mechanisms, optimized multi-scale feature fusion pathways, and an imbalance-aware loss function—collectively tackle persistent challenges such as boundary ambiguity, spectral confusion, and the omission of small objects.
The principal theoretical implication is that the information bottleneck between high-level semantics and low-level spatial accuracy in remote sensing segmentation can be effectively narrowed through the deliberate co-design of advanced visual backbones, adaptive attention mechanisms, and refined feature-flow topologies. From an applied perspective, CSMNet generates accurate, field-scale geospatial information that is vital for intelligent agricultural decision-support systems. This includes precise crop health monitoring, variable-rate input application, and optimized irrigation scheduling. By minimizing both false positives (reducing unnecessary inputs on non-farm areas) and false negatives (ensuring complete farmland coverage), the model contributes directly to enhancing agricultural productivity, improving resource-use efficiency, and reducing environmental impact—key objectives of sustainable precision agriculture.
There is still potential for improvement in sub-pixel boundary localization, as the model occasionally merges adjacent plots when parcels are very small and fragmented. Compared with lightweight models such as SeaFormer (16.8 G FLOPs), CSMNet’s computational complexity (44.85 G FLOPs) is higher, which may limit its use on low-resource devices such as edge computing platforms for field monitoring.
Future studies should focus on three directions. First, rigorous validation across diverse agro-ecological zones, cropping systems, and seasonal conditions is necessary to fully establish the model’s robustness and generalization capability under varying informational complexities. Second, model compression, neural architecture search (NAS), or lightweight design (e.g., depthwise separable convolution) could reduce computational cost while maintaining high segmentation accuracy, enhancing operational efficiency for large-scale or near-real-time monitoring and enabling deployment on edge devices. Finally, to realize its full potential, CSMNet should be integrated into scalable agricultural intelligence platforms. Such integration would enable not only parcel delineation but also synergistic analysis for yield prediction, soil assessment, and dynamic management, forming a critical component of a holistic, intelligent, and sustainable farm management system.

Author Contributions

Conceptualization, Y.H.; methodology, Y.H., H.A. and Y.W.; software, Y.W. and C.Q.; validation, Y.H. and Y.W.; formal analysis, Y.Z.; investigation, Y.Z. and C.Q.; resources, Y.H. and H.A.; data curation, Y.H. and Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W.; visualization, Y.H. and Y.W.; supervision, X.Z.; project administration, Y.H. and X.Z.; funding acquisition, Y.H. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key R&D Program of China (2021YFD1500100), the Jilin Provincial Department of Education Scientific Research Science and Technology Project (JJKH20261576KJ), the Jilin Province Science and Technology Development Plan Project (No. 20240101043JC) and the Jilin Agricultural University Introduction of Talents Project (No. 202020010).

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request; they are not publicly available due to privacy restrictions.

Acknowledgments

The authors thank the students who assisted with the fieldwork and data collection, as well as the instructors for their constructive comments, which helped improve this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402.
2. Waldner, F.; Diakogiannis, F.I. Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network. Remote Sens. Environ. 2020, 245, 111741.
3. Watkins, B.; Van Niekerk, A. A comparison of object-based image analysis approaches for field boundary delineation using multi-temporal Sentinel-2 imagery. Comput. Electron. Agric. 2019, 158, 294–302.
4. Food and Agriculture Organization of the United Nations (FAO). The Future of Food and Agriculture—Trends and Challenges. Available online: https://www.fao.org/3/a-i7962e.pdf (accessed on 13 February 2026).
5. Sykas, D.; Sdraka, M.; Zografakis, D.; Papoutsis, I. A Sentinel-2 multiyear, multicountry benchmark dataset for crop classification and segmentation with deep learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3323–3339.
6. Raei, E.; Akbari Asanjan, A.; Nikoo, M.R.; Sadegh, M.; Pourshahabi, S.; Adamowski, J.F. A deep learning image segmentation model for agricultural irrigation system classification. Comput. Electron. Agric. 2022, 198, 106977.
7. Matton, N.; Canto, G.S.; Waldner, F.; Valero, S.; Morin, D.; Inglada, J.; Arias, M.; Bontemps, S.; Koetz, B.; Defourny, P. An automated method for annual cropland mapping along the season for various globally-distributed agrosystems using high spatial and temporal resolution time series. Remote Sens. 2015, 7, 13208–13232.
8. Zhou, X.; Zheng, H.B.; Xu, X.Q.; He, J.Y.; Ge, X.K.; Yao, X.; Cheng, T.; Zhu, Y.; Cao, W.X.; Tian, Y.C. Predicting grain yield in rice using multi-temporal vegetation indices from UAV-based multispectral and digital imagery. ISPRS J. Photogramm. Remote Sens. 2017, 130, 246–255.
9. Long, J.; Li, M.; Wang, X.; Stein, A. Delineation of agricultural fields using multi-task BsiNet from high-resolution satellite images. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102871.
10. Li, M.; Long, J.; Stein, A.; Wang, X. Using a semantic edge-aware multi-task neural network to delineate agricultural parcels from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 200, 24–40.
11. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16.
12. Mountrakis, G.; Im, J.; Ogole, C. Support vector machines in remote sensing: A review. ISPRS J. Photogramm. Remote Sens. 2011, 66, 247–259.
13. Blaschke, T.; Hay, G.J.; Kelly, M.; Lang, S.; Hofmann, P.; Addink, E.; Feitosa, R.Q.; Van der Meer, F.; Van der Werff, H.; Van Coillie, F.; et al. Geographic object-based image analysis—Towards a new paradigm. ISPRS J. Photogramm. Remote Sens. 2014, 87, 180–191.
14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
16. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
17. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
19. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
21. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. In Proceedings of the 4th Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (DLMIA 2018), Granada, Spain, 16–20 September 2018; pp. 3–11.
22. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Strasbourg, France, 27 September–1 October 2021; pp. 66–78.
23. Wan, Q.; Huang, Z.; Lu, J.; Yu, G.; Zhang, L. SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation. arXiv 2023, arXiv:2301.13156v4.
24. Shunying, W.; Ya'nan, Z.; Xianzeng, Y.; Li, F.; Tianjun, W.; Jiancheng, L. BSNet: Boundary-semantic-fusion network for farmland parcel mapping in high-resolution satellite images. Comput. Electron. Agric. 2023, 206, 107683.
25. Li, J.; Wei, Y.; Wei, T.; He, W. A Comprehensive Deep-Learning Framework for Fine-Grained Farmland Mapping From High-Resolution Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5601215.
26. Long, C.; Wenlong, S.; Tao, S.; Yizhu, L.; Wei, J.; Jun, L.; Hongjie, L.; Tianshi, F.; Rongjie, G.; Abbas, H.; et al. Field Patch Extraction Based on High-Resolution Imaging and U2-Net++ Convolutional Neural Networks. Remote Sens. 2023, 15, 4900.
27. Lu, H.; Wang, H.; Ma, Z.; Ren, Y.; Fu, W.; Shan, Y.; Hu, S.; Zhang, G.; Meng, Z. Farmland boundary extraction based on the AttMobile-DeeplabV3+ network and least squares fitting of straight lines. Front. Plant Sci. 2023, 14, 1228590.
28. Zhang, J.; Li, Y.; Tong, Z.; He, L.; Zhang, M.; Niu, Z.; He, H. GLCANet: Global–Local Context Aggregation Network for Cropland Segmentation from Multi-Source Remote Sensing Images. Remote Sens. 2024, 16, 4627.
29. Wang, Y.; Gu, L.; Jiang, T.; Gao, F. MDE-UNet: A Multitask Deformable UNet Combined Enhancement Network for Farmland Boundary Segmentation. IEEE Geosci. Remote Sens. Lett. 2023, 20, 3001305.
30. Zhong, B.; Wei, T.; Luo, X.; Du, B.; Hu, L.; Ao, K.; Yang, A.; Wu, J. Multi-Swin Mask Transformer for Instance Segmentation of Agricultural Field Extraction. Remote Sens. 2023, 15, 549.
31. Lu, R.; Wang, N.; Zhang, Y.; Lin, Y.; Wu, W.; Shi, Z. Extraction of Agricultural Fields via DASFNet with Dual Attention Mechanism and Multi-scale Feature Fusion in South Xinjiang, China. Remote Sens. 2022, 14, 2253.
32. Lu, X.; Ming, D.; Du, T.; Chen, Y.; Dong, D.; Zhou, C. Delineation of cultivated land parcels based on deep convolutional networks and geographical thematic scene division of remotely sensed images. Comput. Electron. Agric. 2022, 192, 106611.
33. Cao, Y.; Zhao, Z.; Huang, Y.; Lin, X.; Luo, S.; Xiang, B.; Yang, H. Case instance segmentation of small farmland based on Mask R-CNN of feature pyramid network with double attention mechanism in high resolution satellite images. Comput. Electron. Agric. 2023, 212, 108073.
34. Zhang, D.; Pan, Y.; Zhang, J.; Hu, T.; Li, N.; Chen, Q. A generalized approach based on convolutional neural networks for large area cropland mapping at very high resolution. Remote Sens. Environ. 2020, 247, 111912.
35. Miao, L.; Li, X.; Zhou, X.; Yao, L.; Deng, Y.; Hang, T. SNUNet3+: A Full-Scale Connected Siamese Network and a Dataset for Cultivated Land Change Detection in High-Resolution Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4400818.
36. Zhang, Z.; Huang, L.; Tang, B.H.; Le, W.; Wang, M.; Cheng, J.; Wu, Q. MATNet: Multiattention Transformer network for cropland semantic segmentation in remote sensing images. Int. J. Digit. Earth 2024, 17, 2392845.
37. Xu, Y.; Zhu, Z.; Guo, M.; Huang, Y. Multiscale Edge-Guided Network for Accurate Cultivated Land Parcel Boundary Extraction From Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4501020.
38. Hong, Q.; Zhu, Y.; Liu, W.; Ren, T.; Shi, C.; Lu, Z.; Yang, Y.; Deng, R.; Qian, J.; Tan, C. A Segmentation Network for Farmland Ridge Based on Encoder-Decoder Architecture in Combined with Strip Pooling Module and ASPP. Front. Plant Sci. 2024, 15, 1328075.
39. Wang, B.; Zhou, Y.; Zhu, W.; Feng, L.; He, J.; Wu, T.; Luo, J.; Zhang, X. AAMS-YOLO: Enhanced Farmland Parcel Detection for High-Resolution Remote Sensing Images. Int. J. Digit. Earth 2024, 17, 2432532.
40. Zhao, Z.; Liu, Y.; Zhang, G.; Tang, L.; Hu, X. The winning solution to the iFLYTEK challenge 2021 cultivated land extraction from high-resolution remote sensing images. In Proceedings of the 14th International Conference on Advanced Computational Intelligence (ICACI), Wuhan, China, 18–20 March 2022; pp. 376–380.
41. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. arXiv 2023, arXiv:2301.00808.
42. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
44. Fu, Y.; Lou, M.; Yu, Y. SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation. arXiv 2025, arXiv:2412.11890v2.
Figure 1. Location of the study area. (a) Geographic location of the study area; (b) ground truth for the study area; (c) original images; (d) farmland distribution.
Figure 2. Data augmentation.
Figure 3. CSMNet network architecture.
Figure 4. ConvNeXt V2 as an encoder.
Figure 5. Multi-head attention module.
Figure 6. Results of the step-by-step ablation experiment.
Figure 7. Segmentation results of different methods on typical areas of the Nong'an County Sentinel-2 dataset. White areas indicate farmland parcels; black areas indicate background (non-farmland).
Figure 8. Farmland mapping results for the study area (white represents farmland).
Figure 9. Segmentation results of different methods on a 9049 × 6806 image from the iFLYTEK dataset. Red regions mark areas missed by the corresponding model; green regions mark areas it incorrectly detected.
Figure 10. Cases in challenging scenarios (the first two rows show fragmented farmland; the last two rows show shaded farmland).
Table 1. Specific parameter settings.
Configuration            Contents
Optimizer                SGD
Scheduler                CosineAnnealingLR
Batch size               8
Total epochs             150
Initial learning rate    0.001
Minimum learning rate    0.00001
Weight decay             0.0001
Table 2. Effect of different weighting coefficient combinations on segmentation results.
α      β      IoU (%)
1      0      88.95
1      0.5    89.43
1      0.8    89.87
1      1      89.95
0.8    1      90.85
0.5    1      90.12
0      1      89.18
Table 3. Results of the separate ablation experiment; each variant adds a single component (Combine Loss, ConvNeXt V2, Multi-Scale Fusion, or MHA) to the UNet++ baseline.
Method                        Precision (%)   Recall (%)   F1 (%)   IoU (%)
UNet++ (baseline)             92.01           87.16        89.51    81.93
UNet++ + Combine Loss         92.65           89.91        –        83.78
UNet++ + ConvNeXt V2          93.47           91.86        92.41    86.18
UNet++ + Multi-Scale Fusion   92.88           89.98        91.42    84.48
UNet++ + MHA                  93.12           90.72        91.90    85.16
Table 4. Effectiveness of different methods on the Nong'an County Sentinel-2 dataset.
Method        Precision (%)   Recall (%)     F1 (%)         IoU (%)
PSPNet        89.28 ± 0.68    79.90 ± 0.77   84.30 ± 0.71   74.91 ± 0.78
SegNet        90.90 ± 0.65    85.79 ± 0.74   88.26 ± 0.68   80.02 ± 0.75
DeepLabV3+    92.77 ± 0.60    86.72 ± 0.68   89.63 ± 0.63   82.28 ± 0.70
UNet++        92.01 ± 0.62    87.16 ± 0.71   89.51 ± 0.65   81.93 ± 0.72
TransUNet     91.72 ± 0.59    91.78 ± 0.69   91.24 ± 0.62   83.90 ± 0.69
SeaFormer     92.08 ± 0.55    91.95 ± 0.64   92.11 ± 0.57   85.24 ± 0.66
SegMAN        94.86 ± 0.48    92.87 ± 0.61   93.81 ± 0.52   88.78 ± 0.62
CSMNet        95.91 ± 0.45    93.95 ± 0.57   94.92 ± 0.49   90.85 ± 0.59
Table 5. Comparison of method parameters, FLOPs, and frames per second (FPS).
Method        Parameters (M)   FLOPs (G)   FPS
PSPNet        29.81            27.27       87.34
SegNet        18.62            24.31       94.66
DeepLabV3+    39.63            41.03       57.3
UNet++        32.16            34.90       40.51
TransUNet     95.71            75.56       17.5
SeaFormer     10.1             16.8        127.8
SegMAN        28.2             38.13       35.5
CSMNet        35.72            44.85       30.18
Table 6. Effectiveness of different methods on the iFLYTEK dataset.
Method        Precision (%)   Recall (%)     F1 (%)         IoU (%)
PSPNet        94.32 ± 0.55    92.25 ± 0.69   93.72 ± 0.64   88.79 ± 0.70
SegNet        94.93 ± 0.50    93.28 ± 0.67   94.06 ± 0.61   89.12 ± 0.68
DeepLabV3+    95.59 ± 0.45    93.10 ± 0.63   94.54 ± 0.57   90.09 ± 0.62
UNet++        95.43 ± 0.47    93.11 ± 0.65   94.23 ± 0.59   89.56 ± 0.65
TransUNet     95.94 ± 0.43    93.69 ± 0.60   94.77 ± 0.54   90.49 ± 0.63
SeaFormer     95.50 ± 0.45    94.68 ± 0.56   95.01 ± 0.51   90.73 ± 0.60
SegMAN        96.16 ± 0.39    95.26 ± 0.50   95.93 ± 0.46   92.48 ± 0.54
CSMNet        97.23 ± 0.35    96.39 ± 0.48   96.94 ± 0.42   93.69 ± 0.51