1. Introduction
The precise delineation of farmland parcels is fundamental to the digital transformation of agriculture and relies critically on high-quality labeled datasets. Under the twin pressures of population growth and climate change, precision agriculture requires highly accurate field-boundary information to support intelligent agronomic decisions—such as site-specific fertilization, precision irrigation, and targeted pest and disease control—with high-quality boundary datasets serving as the data foundation for reliable parcel recognition [
1].
As large-scale farming advances globally, agricultural practices remain constrained by local conditions. In China, by the end of 2024, high-standard farmland had been established on over 50% of the total cultivated area [
2,
3,
4,
5,
6], while non-standard farmland still accounted for more than 40%. Especially in regions with dense river networks or rugged terrain, fine-scale fragmentation and complex field boundaries—legacies of topography and local practices—persist. In this paper, we define farmland type from both the spatial configuration of plots and their suitability for concentrated field operations. Non-standard farmland refers to farmland where the cultivated area within a local neighborhood is divided into several small, non-adjacent patches/parcels that are separated by paths, canals, buildings, woodland or other non-crop features, and these patches/parcels are irregular in shape and cannot be integrated into a single contiguous block for continuous, large-scale machine operations [
7,
8]. As a result, field work has to be organized in a fragmented, piece-by-piece manner. In contrast, standard farmland denotes relatively large, contiguous patches/parcels with more regular shapes that can be treated as consolidated operation units and are well suited to mechanized field management [
6]. In our UAV-image-based dataset, image tiles containing multiple small, spatially isolated plots were annotated as non-standard farmland, whereas tiles dominated by a single large, contiguous plot were annotated as standard farmland.
A farmland cultivation zone (FCZ) is a spatial unit delineated to achieve specific agricultural objectives (e.g., planting, irrigation, fertilization) [
9]. Compared with traditional parcel-based analysis, FCZ research offers three core advantages: (1) overcoming the limitations of static parcel boundaries in representing cultivation dynamics, thereby providing greater spatial flexibility [
10]; (2) stronger operational guidance, supporting decision-making for precision spraying, seeding plans, and harvest scheduling [
11,
12,
13]; and (3) economic benefits that help shift agricultural management from land-centric control to process optimization [
14,
15]. Most mainstream parcel-segmentation studies focus on large, intensive farms in Europe and North America, whose regular shapes contrast sharply with the fragmented fields prevalent elsewhere. In typical Chinese agricultural regions—such as terraced fields in the southwest and sloping farmlands in hilly areas—complex landscapes cause notable performance degradation of generic segmentation models, underscoring the need for FCZ-specific research [
9]. As China transitions toward standardized farmland [
16], building a high-quality dataset for FCZ scenarios that mix fragmented and standardized fields is not only practically meaningful for localized management platforms, but also provides a key technical reference for intelligent monitoring of fragmented farmlands worldwide.
Against this background, this study pursues two overarching objectives. At the methodological level, we aim to design a UAV-based sampling and tiling strategy that can simultaneously: (i) preserve fine-scale FCZ boundaries in fragmented, non-standard plots; and (ii) avoid redundant coverage of homogeneous interiors in standardized fields, so that segmentation models are trained on tiles that are truly informative for mixed farmland structures. At the data level, we aim to build an open, multi-temporal UAV benchmark of farmland cultivation zones in a representative mixed-farmland region, enabling systematic evaluation of models in terms of boundary sensitivity, temporal robustness across growth stages, and cross-regional generalization.
Early farmland datasets were primarily constructed from medium- to low-resolution satellite imagery, leveraging wide spatial coverage for global and regional cropland studies. Medium-resolution (10–30 m) national satellite programs—such as ESA Sentinel, NASA Landsat-8, and China’s Gaofen series—are institutionally managed and support macro-scale topics including global cropland inventories, crop-type mapping, and inter-annual change monitoring [
17]. While such data ensure spatiotemporal continuity, their resolution limits the depiction of fine farm boundaries. To overcome this, researchers have developed high-resolution labeled datasets. Representative examples include Agriculture-Vision [
18], which provides 0.1 m aerial imagery with pixel-level boundary labels across more than 94,000 tiles in the midwestern United States, enabling fine-grained studies such as small-parcel segmentation and bund identification and serving as a benchmark. The
19] offers semantic labels over diverse agricultural landscapes worldwide (e.g., tropical terraces, drylands) and is often used for large-scale land-cover segmentation and pattern analysis; however, its boundary precision is relatively limited. Addressing the lack of datasets representing global agricultural diversity, the Fields of the World (FTW) benchmark [
20] spans 24 countries across Europe, Africa, Asia, and South America, substantially expanding sample size, diversity of landscapes, and metadata for cross-regional comparisons and studies of farming-system adaptability. Most satellite-based studies emphasize spectral-feature fusion and object-based classification to improve parcel continuity and classification robustness [
21,
22,
23,
24,
25].
However, satellite imagery suffers from limitations such as relatively low resolution and single-source acquisition modes, which can lead to missed detections of fine plots, boundary misclassification, and cloud contamination [
25,
26]. Long revisit cycles also hinder capturing rapid boundary changes induced by cultivation activities. To address single-source shortcomings, multi-temporal data fusion has gained traction; time-series satellite data have become mainstream for farmland identification and monitoring [
27,
28,
29]. Some agencies have released time-series products (e.g., Sentinel-2) [
29], enabling dynamic analyses such as phenology tracking, multiple-cropping index estimation, and disaster response. Within this context, PASTIS (Panoptic Agricultural Satellite Time Series) [
30] pioneered a benchmark for panoptic agricultural parcel segmentation, providing Sentinel-2 multispectral time series with panoptic labels (instance IDs + semantics) for 2433 French parcels, thereby bridging time-series imagery with instance-level annotation. Despite improvements in parcel continuity, cloud and mixed-pixel issues persist. To mitigate clouds, SAR–optical fusion exploits SAR’s cloud-penetrating capability, enhancing representation of complex parcel structures via feature-level fusion; yet the higher algorithmic complexity can introduce geometric distortions [
31,
32,
33].
The small-scale, highly heterogeneous patterns of non-standard farmland impose stringent requirements on parcel-extraction techniques: traditional medium- to low-resolution remote sensing struggles to depict bund boundaries accurately, especially where farmland is fragmented into small, non-contiguous plots, while ground surveys, though accurate, are too costly and inefficient for such fine-scale analysis. Unlike satellite imagery, which has difficulty defining the boundaries of small, fragmented plots, UAV imagery offers a much clearer representation of farmland cultivation zone (FCZ) boundaries, making UAV platforms key data carriers for precision agriculture in complex, heterogeneous agricultural landscapes.
In recent years, UAV platforms—with centimeter-level resolution—have pushed farmland parcel recognition toward greater detail and become an emerging data source for high-precision agricultural applications. Compared with satellite imagery, UAV acquisition is more flexible, less affected by cloud cover, and can capture more complete surface information across multiple spatial scales [
34]. With continual resolution improvements, UAV imagery captures fine planting structures and has drawn wide attention in precision agriculture [
35]. The UAV-HSI-Crop dataset [
36] collects hyperspectral imagery of 27 crop types with complex planting structures and labels 29 long-tailed classes, addressing fine-grained classification under scarce samples of rare crops and providing a challenging benchmark for mapping heterogeneous farmland. Nevertheless, publicly available UAV datasets specifically targeting FCZs remain scarce; due to authorization or confidentiality constraints, many UAV datasets for parcel extraction are not open. Moreover, single-date acquisitions cannot capture boundary dynamics driven by cultivation activities. There is therefore an urgent need for an open, high-quality benchmark dataset.
This study establishes a publicly available dataset specifically designed for complex mixed farmland segmentation and proposes a dedicated data construction strategy to tackle key technical bottlenecks such as boundary preservation and spatial heterogeneity. Concretely, our contributions are threefold. First, we introduce BP-MOPS, a tile-generation method that operates on pre-acquired orthomosaics and their corresponding annotation masks to produce boundary-enriched, low-overlap training samples, overcoming the limitations of conventional sliding-window tiling in retaining boundary information in fragmented cultivation zones and avoiding redundant sampling in homogeneous field interiors. Second, we construct and open-source the MPFCZ dataset, comprising 6467 image patches of 1024 × 1024 pixels over three consecutive growing seasons at 4.5 cm spatial resolution, capturing cultivation patterns from large mechanized farmlands to fragmented smallholdings and effectively representing the spatial heterogeneity of mixed agricultural landscapes. And third, we perform extensive benchmarking to evaluate the dataset’s utility in semantic segmentation, multi-temporal change detection, and cross-region generalization, showing that models trained on MPFCZ can recognize complex farmland structures, adapt to seasonal variations, and generalize to unseen regions, thereby providing a valuable resource for seasonal cultivation-pattern analysis, farmland dynamic monitoring, and agricultural sustainability assessment.
Figure 1 illustrates the research framework of this study. The workflow begins with the acquisition and preprocessing of multi-temporal UAV imagery across three distinct phenological stages. Subsequently, instance-level annotation and rigorous quality control are performed to generate precise semantic masks. The core innovation follows, where the proposed BP-MOPS strategy is applied to generate a balanced and boundary-enhanced tile dataset from the annotated orthomosaics. The constructed MPFCZ dataset then undergoes a two-phase experimental evaluation, each phase with a distinct focus: the first employs the dataset to train various deep learning models to verify its usefulness, and the second assesses the utility of the dataset for farmland change detection over time and its generalization to new imagery.
3. Dataset Generation Strategy: Boundary Probes and Minimum-Overlap Poisson-Disk Sampling (BP-MOPS)
In UAV-based FCZ segmentation, the spatial pattern is complex: large standardized fields coexist with numerous irregular fragmented operation zones. This causes the conventional “row-wise sliding with overlap” [
46] tiling to oversample large parcels while under-sampling small ones, biasing training toward large fields and degrading segmentation of small targets, boundaries, and heterogeneous regions.
We therefore introduce boundary probes and minimum-overlap Poisson-disk sampling (BP-MOPS), a dataset-generation method that combines boundary probes with minimum-overlap Poisson-disk sampling (MOPS) [
47] and enhances boundary information through direct mask analysis. This approach operates directly on the original image and its corresponding mask, requiring no model training or iterative learning. The BP-MOPS method is guided by the principle of leveraging annotated masks to enable boundary-aware tile selection. The process consists of two stages: the first is boundary-aware sampling, and the second is spatially uniform sampling. In the first stage, boundary key points are detected through probe-based analysis of the mask pattern, directly identifying geometrically significant locations from the mask data. This effectively increases the coverage density of boundary areas and fragmented cultivation zones. In the second stage, the MOPS method is applied to select a spatially uniform subset from the boundary points, thereby achieving low inter-tile overlap and maintaining spatial representativeness.
This two-stage strategy overcomes the limitations of traditional row-wise sliding methods while remaining universally applicable to any scenario where image–mask pairs are available, without dependency on specific model architectures or retraining.
In the first stage, we design an adaptive sampling scheme to select, from the many randomly distributed candidate points inside the farmland mask, those that can effectively “sense” nearby boundaries. The underlying idea is geometric: if a point lies close to a true field edge, then lines cast from this point in multiple directions will cross from foreground (farmland) to background (non-farmland), whereas points deep inside homogeneous plots rarely exhibit such foreground–background transitions.
Operationally, we first generate a dense set of candidate seed points across the farmland region (
Figure 6(a1,a2)). For each candidate, we use Bresenham’s line algorithm [
48] to cast probes in eight directions at 45° intervals on the mask (
Figure 7) and record in which directions the line crosses from foreground to background. A candidate is retained as a boundary-aware valid point only if the configuration of these transitions satisfies one of the following criteria.
Criterion 1 (three non-consecutive directions,
Figure 7a): the lines cross foreground and background in at least three mutually non-adjacent directions. This pattern typically appears near corners or junctions of highly irregular boundaries, which are common in fragmented smallholder plots—for example at sharp bends in bunds or at intersections between multiple FCZs. In such locations, the boundary changes orientation abruptly and extends into several separated angular sectors, so non-adjacent probes all encounter the edge. Using three non-consecutive directions therefore allows us to preferentially keep points near tortuous or intersecting boundaries.
Criterion 2 (at least four consecutive directions,
Figure 7b): the lines cross foreground and background in at least four consecutive directions. This pattern indicates proximity to a locally straight or smoothly curved boundary segment, as often found along the edges of standardized, large fields where the boundary extends smoothly in one general direction. In these cases, the boundary occupies a contiguous arc around the point, causing a block of adjacent probes to intersect the edge. Requiring four consecutive directions thus selects points lying near smooth, extended boundary segments.
These thresholds were empirically validated to achieve an effective balance: looser settings (e.g., only two directions) admitted many points from interior textured regions, while stricter ones (e.g., five or more directions) excluded valid points near short or complex boundary sections. The core concept of aggregating multi-directional evidence is consistent with classical edge-detection principles (e.g., Canny’s method [
49]), but here it is specifically tailored to boundary-proximal seed selection.
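As an illustration, the eight-direction probe test and the two retention criteria can be sketched as follows. This is a minimal sketch assuming a binary NumPy mask (1 = farmland, 0 = background); the probe length and helper names are our own illustrative choices, not taken from the released implementation. For the 45°-multiple directions used here, unit (dy, dx) stepping coincides exactly with Bresenham's line algorithm.

```python
import numpy as np
from itertools import combinations

# Eight probe directions at 45-degree intervals, as (dy, dx) unit steps.
DIRS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def probe_transitions(mask, y, x, length=64):
    """Return the set of direction indices whose probe line crosses
    from foreground (1) to background (0) within `length` steps."""
    hits = set()
    h, w = mask.shape
    for d, (dy, dx) in enumerate(DIRS):
        prev = mask[y, x]
        for t in range(1, length + 1):
            yy, xx = y + t * dy, x + t * dx
            if not (0 <= yy < h and 0 <= xx < w):
                break
            cur = mask[yy, xx]
            if prev == 1 and cur == 0:  # foreground -> background crossing
                hits.add(d)
                break
            prev = cur
    return hits

def is_boundary_seed(hits):
    """Criterion 2: >= 4 consecutive hit directions (circularly).
    Criterion 1: >= 3 pairwise non-adjacent hit directions."""
    flags = [d in hits for d in range(8)]
    run = best = 0
    for f in flags + flags:  # unroll the circle once to catch wrap-around runs
        run = run + 1 if f else 0
        best = max(best, min(run, 8))
    if best >= 4:
        return True
    for trio in combinations(sorted(hits), 3):
        if all((a - b) % 8 not in (1, 7) for a, b in combinations(trio, 2)):
            return True
    return False
```

A point near a convex corner of the mask triggers Criterion 2 (a contiguous arc of crossing directions), while a point deep inside a homogeneous plot yields no transitions within the probe length and is discarded.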
This selective process ensures that the final seed set is concentrated in boundary-vicinity regions, as illustrated by the refined distribution of probes-based seeds in
Figure 6(b1,b2). Large, homogeneous areas within standardized farmland—where boundary information is sparse—are naturally assigned far fewer seeds, whereas fragmented, irregular plots are sampled more densely. Consequently, tiles centered on these seeds are highly likely to encapsulate meaningful boundary segments, providing models with rich contextual information for learning accurate segmentation of complex field geometries.
The second stage enforces spatial uniformity and independence among selected seeds. Building on the classic Poisson-disk minimum-distance constraint [
47], MOPS introduces a greedy minimization of inter-tile overlap: at each iteration, for every candidate, compute the maximum normalized overlap between its tile and already selected tiles, and choose the candidate with the smallest maximum overlap, repeating until the preset sample size is reached.
Let $C$ denote the set of all candidate center points, $S$ the set of selected slice centers, and $W(p)$ a $1024 \times 1024$ window centered at a point $p$. In each iteration, we define the maximum normalized overlap rate $o(p)$ of a candidate point $p \in C \setminus S$ as in

$$o(p) = \max_{q \in S} \frac{|W(p) \cap W(q)|}{|W(p)|}, \tag{1}$$

select the point that minimizes the overlap,

$$p^{*} = \arg\min_{p \in C \setminus S} o(p), \tag{2}$$

and add that point to $S$. The global objective equation is written as

$$\min_{S} \; \max_{\substack{p, q \in S \\ p \neq q}} \frac{|W(p) \cap W(q)|}{|W(p)|}. \tag{3}$$

Equation (3) aims to minimize the maximum pairwise area-overlap among the final tiles.
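The greedy minimum-overlap selection can be sketched as follows. This is a simplified illustration, not the authors' released implementation: it assumes axis-aligned square tiles (so window intersection has a closed form) and omits the Poisson-disk minimum-distance pre-filter.

```python
def overlap_ratio(p, q, tile=1024):
    """Normalized area overlap |W(p) ∩ W(q)| / |W(p)| between two
    axis-aligned square tiles of side `tile` centered at p and q."""
    dy = max(0, tile - abs(p[0] - q[0]))
    dx = max(0, tile - abs(p[1] - q[1]))
    return (dy * dx) / (tile * tile)

def mops_select(candidates, n_tiles, tile=1024):
    """Greedy minimum-overlap selection: at each step, pick the candidate
    whose worst-case (maximum) overlap with already selected tiles is
    smallest, until the preset sample size is reached."""
    remaining = list(candidates)
    selected = [remaining.pop(0)]  # seed with the first candidate
    while len(selected) < n_tiles and remaining:
        best_i, best_o = 0, float("inf")
        for i, p in enumerate(remaining):
            o = max(overlap_ratio(p, q, tile) for q in selected)
            if o < best_o:
                best_i, best_o = i, o
        selected.append(remaining.pop(best_i))
    return selected
```

Starting from any seed, the candidate farthest from all selected centers (zero overlap) is always preferred, which is what drives the spatially uniform distribution.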
Figure 8(a1,a2) illustrates the point-selection process using minimum-overlap Poisson-disk sampling (MOPS), which ensures that the spatial overlap between any two tiles is minimized. The overlap rate is used to estimate the number of pre-selected tiles, yielding a uniformly distributed set of sample points with maximized inter-point distances. This sampling strategy reduces overlap between square tiles while maximizing spatial coverage.
Figure 8(b1,b2) show the process of using BP-MOPS-filtered sample points as tile center points, cropping image tiles of a preset size, and constructing a standardized UAV imagery dataset. The BP-MOPS strategy ensures that each tile in this dataset maximally captures farmland boundary information.
To determine when the greedy selection should terminate, we employ a coverage model that integrates two key characteristics of our sampling framework: (i) the boundary-biased seed distribution produced in the first stage, and (ii) the prescribed inter-tile overlap. Because BP-MOPS retains only points exhibiting multi-directional label transitions, most seed points lie within a narrow band around the farmland boundary. In the idealized case, the centers of all tiles can therefore be regarded as located directly on, or infinitesimally close to, the boundary. Under this boundary-anchored assumption, each tile necessarily straddles the interface between farmland and background.
To estimate the spatial domain that such boundary-centered tiles are capable of covering, we expand the farmland mask outward by half the tile width in all directions. This outward expansion is equivalent to a one-step morphological dilation using a square structuring element whose half-side length matches half the tile dimension. Geometrically, this dilation reproduces the footprint of tiles centered on the boundary—each of which can extend half a tile into the exterior—and also smooths narrow gaps, thin appendices, and concave indentations. Therefore, the area of this expanded mask is defined as the effective region that must be covered, and this expanded region is denoted as $M_{\text{exp}}$.
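The half-tile outward expansion can be sketched with SciPy. This is a sketch assuming a binary NumPy mask; iterating a 3 × 3 square dilation `tile // 2` times is equivalent (under the Chebyshev metric) to a single dilation with a square structuring element of half-side `tile // 2`, and is much cheaper than building the full-size element.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def expand_mask(mask, tile=1024):
    """Expand a binary farmland mask outward by half the tile width,
    i.e., dilate with a square structuring element of half-side tile//2,
    implemented as tile//2 iterations of a 3x3 square dilation."""
    half = tile // 2
    return binary_dilation(mask.astype(bool),
                           structure=np.ones((3, 3), dtype=bool),
                           iterations=half)
```

The area of the result (its pixel count) is then the effective target region used in the tile-count estimate.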
For a locally straight boundary, a tile positioned on the boundary is bisected into two equal halves, implying that only half of its area overlaps with the expanded region. While this suggests a theoretical effective coverage ratio of approximately 0.5, empirical observations on real UAV farmland imagery, combined with the smoothing introduced by the dilation step, yield higher values, typically in the range of 0.6–0.7. This range was derived by analyzing the proportion of foreground and background in the study area. Accordingly, we adopt $\eta \approx 2/3$ as the expected fraction of each tile that contributes to covering the expanded region.
To regulate redundancy, we impose a target inter-tile overlap ratio of $\rho = 0.2$. Under an ideal arrangement with uniform overlap, the effective non-redundant coverage per tile is $\eta\,(1-\rho)\,A_{\text{tile}}$, where $A_{\text{tile}} = 1024 \times 1024$ is the tile area in pixels. After determining the area $A_{\exp}$ of the expanded mask (the effective target region to be covered by the tiles), the total number of theoretical tiles $N$ needed to cover the expanded region is calculated as

$$N = \left\lceil \frac{A_{\exp}}{\eta\,(1-\rho)\,A_{\text{tile}}} \right\rceil. \tag{4}$$
Using this formulation, our dataset requires 6467 tiles to cover the expanded region under the parameter setting $\eta = 2/3$ and $\rho = 0.2$. The greedy minimum-overlap sampling process produces an average empirical tile overlap of 19.5%, which closely matches the intended 20% redundancy. This agreement demonstrates that the selected values of $\eta$ and $\rho$ accurately capture the spatial behavior of boundary-anchored tiles and provide a reliable estimate of the appropriate sampling density.
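The tile-count estimate amounts to a few lines of arithmetic. In the sketch below, the expanded-mask area passed to the function is a hypothetical value chosen for illustration only, not the measured area of our study region.

```python
import math

def required_tiles(expanded_area_px, tile=1024, eta=2/3, overlap=0.2):
    """Estimate the number of boundary-anchored tiles needed to cover the
    expanded mask: each tile contributes eta * (1 - overlap) of its area
    as effective, non-redundant coverage."""
    effective_per_tile = eta * (1.0 - overlap) * tile * tile
    return math.ceil(expanded_area_px / effective_per_tile)

# Hypothetical expanded-mask area (in pixels), for illustration only:
n = required_tiles(expanded_area_px=3_616_100_000)
```

With $\eta = 2/3$ and a 20% target overlap, each 1024 × 1024 tile contributes roughly 0.53 of its area as effective coverage, which is what makes the estimate well above a naive area ratio.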
In practice, the greedy selection iterates until the number of selected tiles reaches this analytically derived target. To summarize: in the first stage, boundary-biased seed points are generated using the probe-based approach that detects multi-directional transitions at farmland boundaries; Criterion 1 selects points near tortuous or intersecting boundaries, and Criterion 2 selects points near locally straight or smoothly curved boundaries. In the second stage, the algorithm applies a greedy approach to minimize inter-tile overlap, selecting points iteratively until the target number of tiles is reached. This process ensures spatial uniformity and coverage efficiency, with the final selection guided by the expanded mask area and the prescribed overlap ratio.
4. Experimental Evaluation of BP-MOPS Versus Row-Wise Dataset Generation Strategies
This section provides a systematic comparison between the proposed BP-MOPS and the conventional row-wise method under a controlled experimental design. One training dataset was generated with each strategy (both with overlap), and the same set of models was trained on each. Model performance was first evaluated on the respective validation sets using standard segmentation metrics (mIoU, mAcc, mDice, and mRecall). To further examine generalization, we constructed a non-overlapping test set (zero overlap, covering the entire study area) and evaluated all models on this unified benchmark using mIoU as the primary metric. As summarized in
Table 3, results show that BP-MOPS produced 6467 tiles with an average overlap of 19.5%, compared to 6174 tiles at 20% overlap with the row-wise method, highlighting its superiority in both sample quantity and distribution quality.
4.1. Experimental Design
Experiments were conducted on a high-performance platform (NVIDIA RTX 4090D GPU, Intel Core i9-14900KF, 128 GB RAM). The training pipeline integrated logging, visualization, learning-rate scheduling, and checkpointing. Validation and testing adopted the same resizing and preprocessing pipeline as training to ensure rigor and reproducibility.
We used the open-source MMsegmentation framework [
50], organizing the datasets in strict compliance with its format. Images and their pixel-level annotations were housed in separate directories, and the dataset was partitioned into distinct training, validation, and test subsets at a 7:1:2 ratio to ensure an objective and reproducible evaluation. We fixed the random seed to ensure the reproducibility of the dataset splits.
Given the large scale variation, strong illumination disturbances, and pronounced spatial heterogeneity in UAV farmland imagery, we applied data augmentations—random scale resizing, random cropping, horizontal flipping, and photometric perturbations—to improve model generalization. We used the momentum SGD optimizer [
51] (initial LR 0.01, momentum 0.9, weight decay 0.0005) with the Poly learning rate scheduler [
52] (power 0.9, minimum LR 1e-4) in the training process. We trained for 40,000 steps and ran validation every 4000 steps.
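The Poly schedule with these hyperparameters can be expressed as a small helper. This is a sketch of the standard poly formulation with the settings listed above; the actual training runs rely on the framework's built-in scheduler.

```python
def poly_lr(step, total_steps=40_000, base_lr=0.01, power=0.9, min_lr=1e-4):
    """Poly learning-rate decay: lr = base_lr * (1 - step/total)^power,
    clipped below at min_lr."""
    lr = base_lr * (1.0 - step / total_steps) ** power
    return max(lr, min_lr)
```

The rate starts at 0.01, decays smoothly, and is floored at 1e-4 as training approaches the final step.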
This study selected representative models from mainstream semantic segmentation architectures, including Transformer-based models (UperNet-SwinT [
53,
54], Segformer [
55]), deep Convolutional Neural Network (CNN)-based models (PSPNet [
56](r101/r101b [
57]), DeepLabv3 [
58] (r101/r101b), DeepLabv3+ [
59] (r101/r101b)), and the innovative Transformer model RMT-S [
60], totaling ten typical models. These models each have advantages in remote sensing and natural-scene segmentation. For example, transformer architectures (such as UperNet-SwinT and Segformer) excel at global modeling and multi-scale fusion, while deep CNN models (such as PSPNet, DeepLabv3, and DeepLabv3+) remain competitive in capturing boundary detail and in classic segmentation tasks. RMT-S achieves explicit spatial priors and efficient global modeling through a Manhattan-distance-based spatial decay matrix and a bidirectionally decomposed attention mechanism.
Table 4 summarizes the models used in this study along with their main parameter settings.
Model performance was evaluated using standard semantic segmentation metrics, including overall accuracy (aAcc), mean intersection over union (mIoU), mean accuracy (mAcc), and mean Dice coefficient (mDice). Additionally, detection performance was assessed using mean recall (mRecall) and mean F1 score (mF1-score) [
18,
61]. To specifically evaluate performance across the three key farmland growth stages (DP, VGP, IGP), we compute the IoU for each stage individually, denoted as DPIoU, VGPIoU, and IGPIoU.
Furthermore, to quantify the model’s temporal generalization capability (i.e., consistency of performance across different stages), we propose a novel metric, IoU-cv, defined as the coefficient of variation (the standard deviation divided by the mean) of the IoU values across the three stages: dormant period IoU (DPIoU), vigorous growing period IoU (VGPIoU), and intermediate growing period IoU (IGPIoU) [
62]. A lower IoU-cv value indicates more stable and consistent performance across the stages, reflecting better temporal generalization; conversely, a higher value indicates greater performance fluctuation and poorer temporal generalization.
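The metric can be computed directly from the three per-stage IoUs. A minimal sketch follows; we assume the population standard deviation here, since the definition above does not pin down the population/sample choice.

```python
def iou_cv(dp_iou, vgp_iou, igp_iou):
    """Coefficient of variation of the per-stage IoUs: population
    standard deviation divided by the mean. Lower values indicate more
    stable performance across growth stages."""
    vals = (dp_iou, vgp_iou, igp_iou)
    mean = sum(vals) / 3
    std = (sum((v - mean) ** 2 for v in vals) / 3) ** 0.5
    return std / mean
```

Identical per-stage IoUs give IoU-cv = 0, and the value grows as stage-to-stage performance diverges; being a ratio, it is invariant to whether IoUs are given as fractions or percentages.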
4.2. Model Performance on BP-MOPS Generated Dataset–MPFCZ
The multi-model semantic segmentation results for the dataset generated by the BP-MOPS strategy are summarized in
Table 5. It can be observed that the transformer-based UperNet-SwinT outperforms all other models across all evaluation metrics, with an mIoU of 89.60% and per-stage IoUs (DPIoU, VGPIoU, IGPIoU) of 94.24%, 91.55%, and 91.60%, respectively. This result demonstrates that transformer architectures, particularly the Swin Transformer, have strong multi-scale feature representation abilities and spatial adaptability. The Swin Transformer effectively captures detailed information in complex backgrounds and dynamic operation zones, providing stable segmentation performance across multiple growing seasons. However, the high computational complexity of transformer models means that training on large datasets may require more time and computational resources.
Segformer (mit-b5) also performs well, achieving an mIoU of 85.83%. Although slightly lower than UperNet-SwinT (especially during the intermediate growing period, with an IGPIoU of 89.81%), it still demonstrates strong segmentation capabilities. Compared to the CNN architectures, Segformer shows better robustness in handling multi-scale images and adapts well to scale variations. However, its performance decreases slightly in areas with complex textures, indicating some limitations in handling complex backgrounds.
The CNN-based DeepLabv3+ (r101b) achieved an mIoU of 77.21%, outperforming the other CNN models and RMT-S in the dormant and vigorous growing stages (DPIoU: 86.65%, VGPIoU: 79.80%). This indicates that the deep ResNet backbone combined with the ASPP spatial pyramid structure is effective in capturing spatial heterogeneity and cross-scale target features in remote sensing scenes, especially in areas with continuous textures during the dormant and part of the growing periods. However, during the intermediate growing period (IGPIoU: 79.55%), segmentation performance still shows limitations, indicating that CNNs need further improvement in expressing complex textures and fine boundaries.
RMT-S, a recently proposed model, achieved an mIoU of 75.87% under the BP-MOPS strategy, with IoUs for the three periods of 85.52%, 79.91%, and 77.64%. Although its overall performance is slightly lower than DeepLabv3+ (r101b), the model still performs well across different growth stages without requiring large external datasets for pre-training. Its structural advantage lies in maintaining balanced segmentation performance across stages even without pre-training, though it still lags slightly behind the pre-trained models, particularly during the vigorous growing period with complex textures and significant crop changes.
PSPNet (r101b/r101) performs relatively poorly on most metrics; with the r101 backbone in particular, mIoU drops to 43.91–70.70%. The model’s segmentation performance in the vigorous and intermediate growing periods is especially weak, as reflected in the large differences in per-stage metrics such as DPIoU and IGPIoU. These results suggest that, in the absence of pre-training, PSPNet still has considerable room for improvement when dealing with class imbalance and complex boundaries.
In summary, transformer-based models demonstrate stronger generalization and boundary representation across all three periods. Deep CNN models (especially r101b) remain competitive during the dormant and some growing stages but are limited in handling complex textures and spatially heterogeneous regions. The recently proposed RMT-S model shows relatively strong performance in small- to medium-scale farmland semantic segmentation tasks, even without pre-training.
4.3. Model Performance on Dataset Generated by Conventional Row-Wise Sliding
Table 6 presents the performance of the same semantic segmentation models in
Table 5 on the UAV remote sensing dataset generated by the conventional row-wise sliding-window method with overlap. It can be observed that regardless of whether transformer architectures (e.g., UperNet-SwinT, Segformer) or deep CNN structures (e.g., DeepLabv3+, PSPNet) are used, models trained on the dataset generated under the BP-MOPS strategy (i.e., the proposed MPFCZ dataset) consistently outperform those trained on the traditional row-wise dataset across all three periods. For example, UperNet-SwinT achieves an mIoU of 89.60% under the new strategy, with IoUs for all three periods exceeding 91%. This dataset generation strategy effectively enhances the recognition capability of the tested model architectures in mixed scenarios involving both large and small farmland plots.
The relative ranking of models remains highly consistent across the two data-generation strategies: UperNet-SwinT consistently leads, followed by Segformer, with DeepLabv3+/RMT-S and PSPNet trailing. This stability indicates that the inherent differences in expressive capacity between model architectures are independent of the dataset generation strategy. The performance gains from the BP-MOPS strategy stem from its improvement in the representativeness and balance of the training data—controlled uniform segmentation reduces redundancy in large homogeneous fields, while boundary-aware sampling enhances the retention of complex junction regions—without altering the intrinsic characteristics of the models themselves.
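The boundary-prioritized sampling with overlap control described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's exact implementation: the function name, the candidate budget `n_seeds`, and the `max_overlap` threshold are hypothetical. Candidate tile centres are drawn from annotated boundary pixels, and a greedy pass rejects any candidate whose tile would overlap an already-accepted tile by more than the threshold, which keeps the accepted tiles near-uniformly spread while concentrating them on boundary-rich areas.

```python
import numpy as np

def greedy_tile_centers(boundary_mask, tile=1024, max_overlap=0.25,
                        n_seeds=2000, rng=None):
    """Pick tile centres on annotated parcel boundaries, greedily rejecting
    candidates whose tile overlaps an accepted tile by more than max_overlap.

    boundary_mask: 2-D bool array, True on annotated boundary pixels.
    Returns a list of (row, col) centres.
    """
    rng = rng or np.random.default_rng(0)
    h, w = boundary_mask.shape
    half = tile // 2
    # Candidate seeds: boundary pixels far enough from the image edge
    # for a full tile to fit around them.
    rows, cols = np.nonzero(boundary_mask)
    valid = (rows >= half) & (rows < h - half) & (cols >= half) & (cols < w - half)
    rows, cols = rows[valid], cols[valid]
    if len(rows) == 0:
        return []
    idx = rng.choice(len(rows), size=min(n_seeds, len(rows)), replace=False)
    centers = []
    for r, c in zip(rows[idx], cols[idx]):
        ok = True
        for rr, cc in centers:
            # Overlap of two axis-aligned square tiles, as a fraction
            # of a single tile's area.
            dy = max(0, tile - abs(int(r) - rr))
            dx = max(0, tile - abs(int(c) - cc))
            if dy * dx / (tile * tile) > max_overlap:
                ok = False
                break
        if ok:
            centers.append((int(r), int(c)))
    return centers
```

The greedy pass is quadratic in the number of accepted tiles; for whole-survey imagery a spatial index would be used instead, but the acceptance criterion is the same.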
As shown in
Figure 9, the uniform segmentation guided by the BP-MOPS strategy significantly improves the spatial and class representativeness of the training set: the random but controlled spacing of sampled points avoids redundant coverage of large fields, while boundary-aware sampling retains samples from fragmented operation zones and field junctions, providing richer discriminative patterns during training. All ten models achieve an average improvement of 5–10 percentage points in the three geometry- or recall-dominated metrics—mIoU, mDice, and mRecall—demonstrating the positive effect of adding boundary and small-target samples in reducing false negatives and improving spatial matching. Meanwhile, the IoU-cv metric (the coefficient of variation of IoU across periods), which gauges cross-period adaptability, improves overall, indicating a more balanced performance distribution across the three growth stages. However, under the ‘one-class-per-tile’ setting of this task, the new strategy, while enhancing spatial representativeness, inevitably introduces more difficult classification scenarios. Its boundary-aware sampling deliberately includes many pixels along boundaries, field ridges, and small operational zones—areas of high heterogeneity and classification uncertainty. As a result, the models (especially CNN-based backbones) misclassify many background pixels whose textures resemble the foreground, so false-positive pixels (FP) increase faster than true-positive pixels (TP). This highlights an inherent characteristic of the BP-MOPS strategy: while it injects diverse spatial context that markedly improves the model’s ability to discern complex geometries (reflected in the gains in mIoU and recall), it also makes pixel-level classification harder, limiting mAcc.
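The IoU-cv metric used above is not given a formula in this section; a standard definition of a coefficient of variation, assumed here, is the standard deviation of the per-period IoUs divided by their mean, with lower values meaning more balanced cross-period performance:

```python
import numpy as np

def iou_cv(period_ious):
    """Coefficient of variation of per-period IoU: std / mean.
    Lower values indicate more balanced performance across growth stages."""
    ious = np.asarray(period_ious, dtype=float)
    return float(ious.std() / ious.mean())

# RMT-S per-period IoUs from Section 4.2, expressed as fractions:
print(iou_cv([0.8552, 0.7991, 0.7764]))
```

A model with identical IoU in every period has an IoU-cv of exactly zero; one that collapses in a single period is penalized sharply.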
This preliminary finding suggests that model evaluation requires weighing multiple metrics together, balanced according to the specific task objectives.
4.4. Performance Comparison and Statistical Significance of Models Trained on BP-MOPS and Row-Wise Sliding Datasets
To further validate the above observations, this section presents a detailed comparison of models trained on datasets generated by the BP-MOPS strategy and the traditional row-wise sliding approach. The evaluation used a consistent test set generated by a non-overlapping row-wise sliding-window method. This test set, consisting of 803 slices of 1024 × 1024 pixels, covers the study area defined in this paper, ensuring a fair and representative assessment that captures the real-world complexities of the segmentation task. The primary objective of this analysis is to assess how well models trained under these two data-generation strategies generalize on the same standardized test set, and to determine the statistical significance of the differences observed in key metrics such as Intersection over Union (IoU), accuracy, Dice coefficient, and recall. To evaluate these differences, t-tests were conducted, with the null hypothesis that there is no significant difference in results between models trained on the BP-MOPS and row-wise segmented datasets. Significance levels were reported using thresholds of * (p < 0.05), ** (p < 0.01), and *** (p < 0.001), indicating varying degrees of confidence in the observed performance changes.
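As a concrete illustration of this testing protocol, the sketch below computes a paired t statistic over per-slice scores and maps the p-value to the star notation. Two assumptions are ours, not the paper's: that the test is paired (both models are scored on the same 803 slices, which makes pairing natural), and that a large-sample normal approximation to the t distribution is acceptable at n = 803.

```python
import math

def paired_t(a, b):
    """Paired t statistic and a two-sided p-value via the normal
    approximation (adequate for large n such as 803 test slices)."""
    n = len(a)
    d = [x - y for x, y in zip(a, b)]
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    p = math.erfc(abs(t) / math.sqrt(2))  # = 2 * (1 - Phi(|t|))
    return t, p

def stars(p):
    """Star notation used in Figure 10."""
    if p < 0.001:
        return '***'
    if p < 0.01:
        return '**'
    if p < 0.05:
        return '*'
    return 'n.s.'

# Hypothetical per-slice mIoU scores (not the paper's data):
t, p = paired_t([0.81, 0.78, 0.85, 0.80, 0.79, 0.83],
                [0.70, 0.66, 0.75, 0.71, 0.68, 0.74])
print(stars(p))  # prints '***'
```

In practice `scipy.stats.ttest_rel` would be used instead of the hand-rolled statistic; the star mapping is unchanged.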
The results of
t-tests reveal that models trained on the BP-MOPS dataset show significant improvements (
Figure 10), particularly in boundary-sensitive metrics such as IoU, Dice coefficient, and recall. For instance, DeepLabv3+ (r101) and PSPNet (r101b) exhibit notable improvements in mIoU, with differences of 0.145 and 0.253, respectively, both of which are highly significant (
p < 0.001). These findings suggest that the BP-MOPS strategy enhances the model’s ability to capture complex spatial features, especially in fragmented regions and along field boundaries, which are critical for accurate segmentation in agricultural environments. Additionally, DeepLabv3 (r101) and Segformer-mit-b5 show consistent improvements in Dice coefficient and recall, with changes ranging from 0.126 to 0.147 for Dice and 0.169 to 0.183 for recall, all of which are statistically significant (
p < 0.001). These results demonstrate that BP-MOPS effectively addresses the challenge of small-target recognition, ensuring that complex, heterogeneous regions are well-represented during training, leading to more accurate segmentation results.
However, a closer look at the accuracy metric presents a more complex picture. Despite clear improvements in boundary-aware metrics, several models show a decline in mean accuracy (mAcc), a global measure of overall pixel-wise classification performance. For example, DeepLabv3+ (r101) and DeepLabv3 (r101b) show accuracy changes of −0.064 and −0.028, respectively. This reduction, coupled with improvements in the other metrics, points to a fundamental trade-off between boundary precision and overall pixel-wise accuracy.
The decrease in accuracy can be attributed to the sampling strategy inherent in BP-MOPS. By emphasizing boundary-aware sampling and giving more weight to small, boundary-rich regions, BP-MOPS can increase false positives (FPs), especially in homogeneous areas of the image that are less well represented in the training set. Models trained with BP-MOPS thus become more sensitive to fine-grained features, yielding sharper boundary delineation but potentially weaker classification of larger, simpler regions. This shift in emphasis from global accuracy to boundary fidelity explains the observed drop in accuracy for some models despite the improvements in the other segmentation metrics.
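This trade-off follows directly from how the metrics are computed from pixel confusion counts. In the toy binary example below (the counts are invented purely for illustration), recovering boundary pixels cuts false negatives but adds false positives, so IoU and recall rise while overall accuracy falls:

```python
def seg_metrics(tp, fp, fn, tn):
    """Foreground IoU, recall, and overall pixel accuracy
    from binary confusion counts."""
    iou = tp / (tp + fp + fn)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return iou, recall, accuracy

# Row-wise-style training: misses boundary pixels (high FN, few FP).
before = seg_metrics(tp=70, fp=5, fn=20, tn=105)
# Boundary-aware training: recovers boundary pixels but adds FP.
after = seg_metrics(tp=85, fp=25, fn=5, tn=85)
print(before, after)
```

Because accuracy counts true negatives while IoU does not, trading background (TN) pixels for extra FP hurts accuracy even when the foreground metrics improve.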
In conclusion, the BP-MOPS strategy significantly enhances model performance in key segmentation metrics such as IoU, Dice coefficient, and recall, with statistically significant improvements observed across most models. However, the observed decline in accuracy in several models underscores a trade-off between boundary detection precision and overall pixel-wise classification accuracy. While the BP-MOPS strategy excels in improving segmentation in complex regions with intricate boundaries and small targets, the increase in false positives and the associated decline in accuracy must be considered. This trade-off highlights the need for a more nuanced approach to segmentation, where the focus on boundary precision is balanced with the need for global accuracy. Future work could further refine the BP-MOPS strategy to address these limitations, potentially improving the balance between fine-grained segmentation and overall classification performance.
5. Farmland Change Detection Based on the MPFCZ Dataset and Its Generalization Experiment
To comprehensively evaluate the practical value of the MPFCZ dataset in complex agricultural scenarios, this section presents two experiments. The change detection experiment for farmland cultivation zones examines the ability of models trained on MPFCZ to capture farmland dynamics over time, while the generalization experiment assesses the transferability of MPFCZ-trained models to new regions not covered by the dataset.
5.1. Change Detection in Farmland Based on the MPFCZ Dataset
This experiment aims to systematically evaluate a model trained on the MPFCZ dataset for temporal change detection in farmland cultivation areas. First, an independent contiguous geographical area—covering imagery from three key phenological phases (VGP, DP, IGP)—was selected as the test region (pixel dimensions: 10,047 × 17,828). All image patches originating from this region were then excluded from MPFCZ so that they could not take part in any subsequent training or validation. The remaining image data were randomly divided into training and validation sets at an 8:2 ratio, and the UperNet-SwinT model was retrained on this partition, rigorously avoiding data leakage.
To comprehensively analyze the model’s responsiveness to farmland dynamics, a cross-phenological-phase experiment was designed. During the testing (inference) stage, the entire test region image was processed as a single input for end-to-end inference evaluation, simulating a real application scenario. To address the scale mismatch between large-scale remote sensing imagery and the model’s training input size, a four-direction sliding window inference strategy [
46] was employed during inference, extracting multi-scale features using a step size of 256 pixels on 1024 × 1024 pixel patches. To ensure spatial continuity along stitching boundaries, a Gaussian weight fusion method [
56] was applied to perform weighted averaging of predictions in overlapping regions, effectively suppressing edge artifacts. In addition, mirror padding [
57] was used during tile preprocessing to maintain image content continuity and boundary consistency, thereby improving the accuracy and reliability of segmentation results.
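The inference pipeline described above can be sketched as follows. This is a simplified single-pass version under our own assumptions: it does not reproduce the four-direction scheme of [46] exactly, and the Gaussian bandwidth `sigma_frac` is an illustrative choice. Overlapping tiles are predicted, each tile's logits are weighted by a 2-D Gaussian peaking at the tile centre, overlapping predictions are averaged by their accumulated weights, and mirror padding keeps every window full-size at the image borders.

```python
import numpy as np

def gaussian_weight(tile, sigma_frac=0.25):
    """2-D Gaussian weight map peaking at the tile centre, used to
    down-weight predictions near tile edges before fusion."""
    ax = np.linspace(-1.0, 1.0, tile)
    g = np.exp(-(ax ** 2) / (2 * sigma_frac ** 2))
    return np.outer(g, g)

def sliding_window_infer(image, predict, n_classes, tile=1024, step=256):
    """Fuse overlapping tile predictions by Gaussian-weighted averaging.
    Assumes `image` (H, W, C) is at least tile-sized in each dimension.
    predict: (tile, tile, C) array -> (tile, tile, n_classes) logits."""
    h, w = image.shape[:2]
    # Mirror-pad bottom/right so the window grid tiles the image exactly.
    pad_h = (-(h - tile)) % step
    pad_w = (-(w - tile)) % step
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode='reflect')
    H, W = padded.shape[:2]
    logits = np.zeros((H, W, n_classes))
    weight = np.zeros((H, W, 1))
    g = gaussian_weight(tile)[..., None]  # shared per-tile weight map
    for y in range(0, H - tile + 1, step):
        for x in range(0, W - tile + 1, step):
            patch = padded[y:y + tile, x:x + tile]
            logits[y:y + tile, x:x + tile] += g * predict(patch)
            weight[y:y + tile, x:x + tile] += g
    fused = logits / np.maximum(weight, 1e-8)  # weighted average
    return fused[:h, :w].argmax(-1)            # per-pixel class map
```

With `tile=1024` and `step=256`, interior pixels are covered by up to 16 windows, so edge artifacts from any single tile are strongly suppressed by the centre-weighted average.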
After semantic segmentation results were generated for the entire test region, three representative farmland samples were selected for detailed analysis to intuitively demonstrate the model’s performance across different landforms. Their pixel dimensions are 7365 × 6199 (standard farmland), 5240 × 2961 (fragmented farmland), and 8201 × 4569 (mixed farmland).
Figure 11 displays the temporal detection results for these three samples: (a1) to (c1) standard farmland areas show the model accurately capturing the complete cultivation cycle while maintaining distinct boundaries and spatial continuity; (a2) to (c2) fragmented farmland areas demonstrate high recognition consistency despite uneven evolutionary characteristics; (a3) to (c3) mixed-type areas validate the model’s adaptability to diverse farmland structures. Quantitative results (
Table 7) confirm the method’s effectiveness: the mean Intersection-over-Union (mIoU) exceeded 0.82 across all three phenological phases, with the VGP phase reaching 0.8472 and the DP phase achieving a precision of 0.9787, reflecting high reliability in the model’s recognition outputs. Although recall was slightly lower in the DP phase (0.8578), this is attributed to feature ambiguity caused by vegetation cover changes under drought conditions and an increased presence of small fragmented farmland parcels. Overall, the model exhibited strong stability and adaptability in capturing temporal change patterns across various farmland types.
5.2. Generalization Experiment on New Farmland Imagery
To validate the value of the MPFCZ dataset in constructing models with strong generalization capabilities, this study utilized the UperNet-SwinT model trained on the MPFCZ dataset (
Section 4.2) and tested it directly on a new dataset which was collected from the Yanzha Village area in Jiangling County, Hubei Province. The agricultural characteristics of Yanzha Village differ significantly from those of the training data source (Zhouhu Village in Zhijiang City). Specifically, Zhouhu Village comprises a mixed system of high-standard farmland and fragmented small plots, dominated by large-scale cultivation of staple crops like rice and rapeseed. In contrast, Yanzha Village features a relatively homogeneous farmland structure, focusing on the intensive management of economic crops such as sugarcane and vegetables. The test data, covering an area of approximately 659 hectares in Yanzha Village, was resampled to a spatial resolution of 0.045 m, resulting in a total of 1995 image slices, each with dimensions of 1024 × 1024 pixels.
The quantitative evaluation on the Yanzha Village dataset (
Table 8) reveals a nuanced performance profile of the model’s cross-regional generalization capability. The model demonstrates robust feature identification, achieving a high precision of 0.8817, which confirms that the MPFCZ dataset facilitates the learning of fundamental, transferable visual features of farmland. This capability allows the model to maintain reliable performance despite significant domain shifts, including transitions in farmland layout complexity, changes from staple to economic crops, and variations in image resolution (
Figure 12).
However, a critical analysis indicates that the overall segmentation performance, as measured by mIoU (0.5236) and Recall (0.5383), is modest. The notable disparity between high precision and lower mIoU suggests the model adopts an overly cautious approach when confronting unfamiliar domain characteristics. This conservative prediction strategy, while ensuring reliability, results in the under-segmentation of challenging regions—particularly along ambiguous boundaries and in areas with feature distributions that differ substantially from the training data. The performance drop is attributable to the inherent domain gap between the source and target regions, where differences in visual texture, structural geometry, and statistical distribution present challenges that exceed the model’s current generalization bounds.
This analysis provides valuable insights for future methodological improvements. To enhance model transferability, the incorporation of domain adaptation techniques, such as adversarial training or style transfer, could explicitly promote the learning of domain-invariant features, thereby increasing robustness to distribution shifts. Furthermore, the strong foundational features learned from the MPFCZ dataset establish it as an effective pre-training basis. Strategic fine-tuning using limited annotated samples from target regions presents a practical approach to efficiently adapt models to new environments, potentially bridging the performance gap with minimal annotation cost.
In conclusion, while the generalization experiment validates the MPFCZ dataset’s capability in developing models with transferable core knowledge, it also offers a critical perspective on the challenges in cross-regional deployment. The results not only confirm the dataset’s value as a pre-training resource but, more significantly, establish a methodological foundation for developing more adaptable and accurate farmland recognition models through the systematic integration of advanced domain adaptation techniques.
6. Conclusions and Future Work
This study establishes a robust methodological and data foundation for segmenting complex mixed farmland landscapes from UAV imagery by introducing the BP-MOPS data generation strategy. This two-stage synergistic process effectively addresses the limitations of conventional row-wise sampling by prioritizing boundary-sensitive seed points and enforcing a near-uniform tile distribution. Unlike traditional methods, BP-MOPS handles mixed farmland scenarios by directly incorporating annotated mask data. This approach utilizes the masks to guide the selection of tile centers and employs a greedy algorithm to minimize inter-tile overlap, making it particularly suitable for precision agriculture applications where boundary information is accessible. Experimental results demonstrate that this strategy efficiently generates datasets with superior boundary preservation capabilities and minimized spatial redundancy.
The practical value of BP-MOPS is demonstrated through the creation of the multi-period farmland cultivation zones (MPFCZ) dataset, which includes high-resolution, multi-temporal imagery and precise instance-level annotations tailored specifically for mixed farmland scenes. MPFCZ serves as a critical resource for advancing precision agriculture applications, such as high-fidelity parcel mapping, autonomous machinery navigation, and fine-scale cropping pattern monitoring.
Comprehensive benchmarking confirms that models trained on BP-MOPS data exhibit enhanced generalization, particularly for boundary-sensitive tasks, as reflected in significant improvements in mIoU and recall. However, this advancement reveals a fundamental trade-off: the enhanced boundary sensitivity concomitantly increases false positives, depressing pixel-level accuracy (mAcc). This underscores the challenge of balancing precise boundary delineation with classification specificity in highly heterogeneous scenes.
Future work will, therefore, focus on two pivotal directions to overcome these limitations.
Addressing Data Heterogeneity: Mixed scenes inherently present challenges of class imbalance and data heterogeneity. Developing adaptive sampling strategies and class-balanced loss functions is crucial to mitigate these issues.
Optimizing the Precision–Recall Trade-off: It is helpful to design novel, boundary-aware learning objectives or architectural modules capable of explicitly suppressing false positives while preserving the high recall achieved by BP-MOPS-generated data.
Pursuing these directions will foster the development of more robust and balanced segmentation models, enhancing their applicability in complex agricultural environments.