1. Introduction
With advancements in remote sensing technologies, large-scale Earth observation data has become increasingly available across various domains. Among these, agriculture has particularly benefited from such data-rich environments. In agricultural monitoring, agricultural parcel (AP) delineation is a fundamental and long-standing task that aims to identify cultivated fields [1]. Since agricultural parcels serve as the basic operational units in farming, accurate delineation is essential for a range of downstream applications, including yield estimation, resource allocation, precision agriculture, and sustainable agricultural planning [2].
Traditionally, mapping APs relies on field surveys and manual digitization, which are both time-consuming and resource-intensive. Given the need to scale mapping efforts across vast and frequently updated regions, automated solutions have become increasingly desirable. The rapid progress of artificial intelligence (AI) technologies has brought new momentum to AP delineation research. Owing to their strong capabilities in processing, optimizing, and learning from large-scale and high-dimensional data, AI-based methods naturally align with the characteristics of modern remote sensing data [3,4,5]. Consequently, a growing body of work has explored the use of machine learning algorithms, and more recently, deep neural networks, for automatic field delineation from satellite imagery and other geospatial sources [6,7].
In high-resolution satellite or aerial imagery, field boundaries are often subtle, irregular, or visually ambiguous. Adjacent parcels may have similar spectral or textural signatures, especially when crop types are homogeneous or when fields are in early growth stages. Furthermore, boundary cues are frequently degraded by natural occlusions (e.g., shadows, clouds) or imaging artifacts, making accurate delineation inherently difficult [8]. These challenges demand models that can jointly capture the global semantic context of agricultural scenes and local fine-grained boundary details.
Semantic segmentation has advanced considerably with the emergence of CNN-based models such as FCN and DeepLab, which demonstrate strong pixel-level performance [9,10]. However, these models often yield imprecise boundaries as fine structural details are lost in deep, down-sampled features, and interpolation-based up-sampling further smooths the reconstructed edges [11]. To mitigate this limitation, many boundary-aware CNNs have been proposed. For example, multi-task frameworks have been developed to couple semantic segmentation and edge detection via shared encoders and boundary-consistency losses [12]. BiSeNet [13] proposes a two-path architecture to enhance spatial detail. BASS [14] applies a boundary attention mechanism and an additional boundary-aware loss term to guide the learning process. Despite these advances, CNN-based networks remain constrained by limited receptive fields, making it difficult to maintain both global contextual understanding and fine boundary localization. The emergence of Transformer architectures has further advanced semantic segmentation by enabling global feature reasoning and long-range dependency modeling. Recent works such as BoundaryFormer [15] and EdgeFormer [16] have successfully integrated attention mechanisms with edge supervision, leading to improved contour delineation in general object segmentation tasks. The Segment Anything Model (SAM), as a strong general-purpose baseline, has demonstrated outstanding zero-shot and prompt-based segmentation performance across diverse imagery [17]. However, such general-purpose segmentation frameworks are often optimized for well-defined object categories and regular shapes, so their direct application to large-scale agricultural parcel delineation remains suboptimal.
Agricultural parcels exhibit several domain-specific challenges that make their delineation uniquely difficult. Unlike urban or man-made objects in general scenes (e.g., roads, buildings), agricultural parcels often feature subtle, irregular, or ambiguous divisions, marked by narrow ridges, natural growth patterns, or unstructured boundaries. These characteristics, combined with intra-field heterogeneity, crop phenology, and temporal inconsistency in remote sensing imagery, pose unique challenges for boundary localization and semantic discrimination. To address these issues, several networks have been specifically developed for agricultural parcel delineation. Single-branch frameworks focus on improving general segmentation quality while indirectly enhancing boundary precision, as shown in Figure 1a. For instance, ref. [18] develops a generalized CNN framework for large-area cropland mapping from very-high-resolution (VHR) imagery, demonstrating strong regional scalability but limited sensitivity to fine parcel boundaries. Similarly, ref. [19] employs a DeepLabV3-based model to map smallholder plots from WorldView-2 imagery, achieving high overall accuracy yet relying on standard semantic segmentation pipelines without explicit edge modeling. Such single-branch architectures excel at capturing large-scale land-cover semantics but often produce blurred or incomplete parcel edges. In contrast, multi-branch or multi-task frameworks explicitly model both parcel interiors and boundary structures, allowing edge features to guide semantic reasoning [20,21,22], as shown in Figure 1b. BsiNet [23] improves geometric precision in hilly terrains through a multi-task design. HBGNet [24] and BFINet [25] leverage dual-branch or hierarchical mechanisms to jointly learn semantic-body and edge information. DSTFNet [26] extends this idea by integrating spatiotemporal cues from VHR imagery to enhance robustness under seasonal variations. These studies collectively confirm that explicit boundary modeling and cross-scale feature fusion are crucial for improving delineation accuracy and spatial consistency in agricultural landscapes.
Nevertheless, despite these notable advances, most existing dual-branch frameworks still rely on unidirectional or loosely coupled interactions between semantic and boundary representations. While such designs have steadily improved boundary fidelity, they primarily emphasize feature-level concatenation or shallow fusion, without deeper iterative communication between high-level semantic representations and low-level spatial cues. There remains room for further progress in achieving deeper, bidirectional feature exchange and more consistent integration between semantic context and edge precision. Motivated by this observation, we propose the Interactive Dual-Branch Transformer (IDBT). IDBT is a framework specifically designed for precise agricultural parcel delineation from VHR imagery. It couples a Transformer-based semantic branch for global contextual understanding with a boundary branch for localized edge learning. We introduce two dedicated modules, the Boundary-Information Tokenization Module (BITM) and the Boundary-Information Feedback Module (BIFM). BITM transforms fine-grained edge features into compact boundary tokens and integrates them into the Transformer sequence, allowing global semantic reasoning to be guided by localized spatial cues. Complementarily, BIFM leverages high-level semantics to adaptively refine boundary predictions, enabling a bidirectional flow of information between the two branches. Feature fusion has been widely recognized as an effective strategy for improving spatial coherence and multi-scale consistency in RS vision models. Many prior works have explored attention-based feature interaction to strengthen global–local alignment [27,28]. Building upon this insight, we propose the Multi-level Feature Fusion Module (MFFM), which further unifies hierarchical representations across Transformer stages, ensuring structural completeness and spatial coherence. Additionally, we incorporate a Dual-PointRend supervision strategy. This design refines predictions at both coarse and boundary levels. A composite loss is devised to jointly supervise the segmentation and point-wise predictions, thus promoting both semantic coherence and edge sharpness. To validate the effectiveness of our approach, we conduct extensive experiments on two large-scale benchmark datasets for AP segmentation. Our proposed model consistently outperforms state-of-the-art (SOTA) methods across all evaluation metrics. IDBT also shows stronger transferability in cross-dataset experiments. These results highlight the robustness and practical applicability of our framework. The main contributions of this study can be summarized as follows.
We design a bidirectional dual-branch network aware of boundary information that explicitly models semantic and boundary cues for precise agricultural parcel delineation;
We introduce a set of modules to enhance multi-level feature exchange between branches, improving spatial detail retention and semantic understanding;
We propose a Dual-PointRend supervision strategy to enable more effective boundary learning by selectively refining uncertain or complex regions during training;
We conduct comprehensive experiments on two large-scale datasets, achieving superior performance and demonstrating strong generalization across diverse agricultural landscapes.
3. Experiments
3.1. Experimental Setup
3.1.1. Datasets
We evaluate our model on two large-scale datasets designed for agricultural parcel segmentation from very high-resolution (VHR) remote sensing imagery. FHAPD [24] is a VHR (1 m) remote sensing dataset covering seven regions in China with diverse agricultural landscapes. Images are cropped into pixel patches and annotated with binary labels (parcel vs. background). We select three representative subsets: FHAPD-JS, FHAPD-HB, and FHAPD-XJ. We randomly divide each regional subset (JS, HB, and XJ) into training, validation, and test sets using approximately a 70:10:20 ratio, ensuring spatial independence between samples.
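The 70:10:20 bookkeeping can be sketched as below. This is a minimal seeded random split for illustration only; the seed value is our assumption, and the spatial independence stated above would additionally require grouping spatially adjacent patches before splitting.

```python
import random

def split_dataset(sample_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle sample IDs with a fixed seed and cut into train/val/test.

    The 70:10:20 ratio follows the paper; the seed is a hypothetical choice.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle, reproducible runs
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Example with 100 patch IDs -> 70 train, 10 val, 20 test
train, val, test = split_dataset(range(100))
```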
The AI4Boundaries dataset [36] consists of orthophoto patches at 1 m resolution, with parcel masks derived from public cadastral data across several European countries. We adopt the official train/validation/test partitions and further divide each patch into non-overlapping pixel sub-patches. After filtering out patches without agricultural content, the final counts are 9751/1084/2388 for train/validation/test, respectively. This division corresponds to a ratio close to 70:10:20, consistent with the FHAPD subsets.
For both datasets, three types of ground-truth annotations are available: parcel masks, boundary maps, and distance maps. The distance maps are computed using the Euclidean transform from each interior pixel to the nearest boundary, serving as auxiliary supervision for geometric regularity. AI4Boundaries is characterized by large, homogeneous fields in relatively flat terrain, while FHAPD contains parcels with varied sizes, shapes, and topography.
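The distance-map supervision described above can be illustrated with a brute-force Euclidean transform. This is a minimal sketch, not the datasets' actual preprocessing: production pipelines would typically use `scipy.ndimage.distance_transform_edt`, and the 4-neighbor boundary definition here is our assumption.

```python
def distance_map(mask):
    """Euclidean distance from each parcel pixel to the nearest boundary pixel.

    A boundary pixel is a parcel (1) pixel with at least one 4-neighbor
    outside the parcel or outside the image. Brute force, illustration only.
    """
    h, w = len(mask), len(mask[0])

    def outside(i, j):
        return not (0 <= i < h and 0 <= j < w) or mask[i][j] == 0

    # Collect boundary pixels of the parcel mask
    boundary = [(i, j) for i in range(h) for j in range(w)
                if mask[i][j] == 1 and any(outside(i + di, j + dj)
                                           for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)))]
    dist = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 1:
                dist[i][j] = min(((i - bi) ** 2 + (j - bj) ** 2) ** 0.5
                                 for bi, bj in boundary)
    return dist

# A 5x5 all-parcel mask: the center pixel is 2 px from the nearest boundary
dmap = distance_map([[1] * 5 for _ in range(5)])
```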
Figure 5 presents representative examples from AI4Boundaries and the three FHAPD subsets. These samples visually highlight differences in field morphology and imaging characteristics, allowing a more intuitive understanding of the segmentation challenges and generalization potential.
3.1.2. Comparative Models
To thoroughly evaluate the effectiveness of the proposed IDBT model, we compare it with a set of representative methods used in AP delineation and RS image segmentation. The comparative baseline models include ResUNet [37], REAUNET [38], BsiNet [23], SEANet [39], BFINet [25], HBGNet [24], GPWFormer [40], RSMamba-ss [41], and BPT [42].
3.1.3. Evaluation Metrics
In the experiments, we adopt a set of widely used metrics that collectively assess both semantic segmentation quality and boundary delineation accuracy, following common practices in similar tasks [11,24,43]. The primary metric is the Intersection-over-Union (IoU), which measures the degree of overlap between the predicted segmentation and the ground truth. It can be computed as:

IoU_i = |P_i ∩ G_i| / |P_i ∪ G_i|,

where P_i and G_i refer to the predicted and true pixels of a sample for the category i, respectively. In addition to IoU, we report the F1 score and pixel accuracy (Acc), which further reflect classification precision and overall correctness. The F1 score is computed as the harmonic mean of precision and recall:

F1 = (2 × Precision × Recall) / (Precision + Recall),

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN). Pixel accuracy calculates the proportion of correctly classified pixels:

Acc = (TP + TN) / (TP + TN + FP + FN),

where TP stands for true positives (pixels correctly classified as positive), FP for false positives (negative pixels erroneously classified as positive), TN for true negatives (background correctly identified as background), and FN for false negatives (positive pixels mistakenly classified as negative).
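For reference, the pixel-level metrics above can be computed directly from the confusion counts, as in this minimal sketch for binary masks:

```python
def pixel_metrics(pred, gt):
    """IoU, F1, and pixel accuracy for binary masks given as flat 0/1 lists."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # false positives
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # false negatives
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))  # true negatives
    iou = tp / (tp + fp + fn)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return iou, f1, acc

# Toy example: 8 pixels, one false positive and one false negative
iou, f1, acc = pixel_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                             [1, 1, 0, 0, 0, 1, 1, 0])
```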
To better assess boundary delineation quality, we further report the boundary F1 score (F_b) computed within a narrow buffer region (3 m). F_b is the harmonic mean of completeness (Com) and correctness (Corr) [44,45,46]:

F_b = (2 × Com × Corr) / (Com + Corr).

Completeness measures the fraction of the reference boundary that is correctly detected, while correctness measures the fraction of the extracted boundary that matches the reference. Com and Corr can be defined as:

Com = |B_r ∩ buffer(B_e)| / |B_r|,  Corr = |B_e ∩ buffer(B_r)| / |B_e|,

where B_e denotes the extracted boundary, B_r represents the reference boundary, and buffer(·) denotes dilation by the tolerance width. In addition, we employ the Average Surface Distance (ASD) metric to complement F_b. ASD quantifies the mean Euclidean distance between corresponding points on the predicted and reference boundaries. It thereby captures the average geometric deviation between the two contours. Unlike overlap-based metrics, ASD directly measures spatial consistency and smoothness along boundaries, offering an intuitive and stable evaluation of delineation accuracy, especially for complex, fragmented agricultural parcels where small pixel shifts can accumulate into significant shape differences.
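The buffered boundary matching and ASD can be sketched as follows. This is a pure-Python illustration on boundaries represented as point sets; the 1-pixel buffer in the example and the symmetric averaging in ASD are our assumptions about the exact conventions.

```python
def min_dist(p, pts):
    """Euclidean distance from point p to the nearest point in pts."""
    return min(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 for q in pts)

def boundary_metrics(extracted, reference, buf=3.0):
    """Com, Corr, and F_b under a buffer tolerance, plus the symmetric ASD."""
    # Com: fraction of reference points with an extracted point within the buffer
    com = sum(min_dist(r, extracted) <= buf for r in reference) / len(reference)
    # Corr: fraction of extracted points with a reference point within the buffer
    corr = sum(min_dist(e, reference) <= buf for e in extracted) / len(extracted)
    fb = 2 * com * corr / (com + corr) if com + corr else 0.0
    # ASD: mean nearest-neighbor distance, averaged over both directions
    asd = (sum(min_dist(e, reference) for e in extracted)
           + sum(min_dist(r, extracted) for r in reference)) / (len(extracted) + len(reference))
    return com, corr, fb, asd

# Toy example: extraction shifted by one pixel, plus one spurious point
com, corr, fb, asd = boundary_metrics(
    extracted=[(1, 0), (1, 1), (1, 2), (3, 3)],
    reference=[(0, 0), (0, 1), (0, 2), (0, 3)],
    buf=1.0)
```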
Additionally, we report FLOPs (Floating Point Operations) to quantify the computational cost of different models. This metric captures the total number of arithmetic operations required during inference, providing an estimate of model complexity and runtime overhead.
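As a rough illustration of how such operation counts arise, the standard per-layer formula for a convolution can be evaluated directly. The factor of 2 (one multiply plus one add per multiply–accumulate) is one common convention; some tools count MACs instead, and the layer dimensions below are hypothetical.

```python
def conv2d_flops(h, w, c_in, c_out, k, stride=1):
    """Approximate FLOPs of one conv layer: 2 * K^2 * C_in * C_out * H_out * W_out.

    Assumes 'same' padding, so the output grid is only reduced by the stride.
    """
    h_out, w_out = h // stride, w // stride
    return 2 * k * k * c_in * c_out * h_out * w_out

# e.g., a hypothetical 3x3 conv, 64 -> 64 channels, on a 256x256 feature map
flops = conv2d_flops(256, 256, 64, 64, 3)  # about 4.8 GFLOPs
```

In practice, whole-model counts like those reported here are usually obtained with profiling tools rather than by hand.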
3.1.4. Implementation Details
All models are trained using the Adam optimizer with an initial learning rate of , a batch size of 8, and 100 training epochs. The balancing weights for the individual components () are set to [0.5, 0.3, 1, 1, 0.5]. For the Dual-PointRend strategy, 1024 points are sampled for each branch at each iteration. All experiments are conducted on a server equipped with an NVIDIA RTX 3090 GPU. To ensure robustness and reliability, all reported performance metrics are averaged over at least three independent runs using different random seeds.
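The composite loss is a weighted sum of its five components with the weights listed above. Since the component names are elided here, the term ordering and names in this sketch are hypothetical:

```python
def composite_loss(terms, weights=(0.5, 0.3, 1.0, 1.0, 0.5)):
    """Weighted sum of five loss terms; the weight values follow the paper.

    The mapping of weights to specific loss components is an assumption.
    """
    assert len(terms) == len(weights)
    return sum(w * t for w, t in zip(weights, terms))

# Hypothetical term ordering, for illustration only:
# (mask_loss, boundary_loss, seg_point_loss, edge_point_loss, distance_loss)
total = composite_loss((0.40, 0.62, 0.25, 0.31, 0.18))
```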
3.2. Results
3.2.1. FHAPD
The quantitative results on the three FHAPD regional subsets (JS, HB, and XJ) are summarized in Table 1. Our proposed IDBT model achieves the best overall performance across the three regional subsets. On the JS subset, IDBT achieves an IoU of 84.4%, F1 score of 91.6%, and pixel accuracy of 90.1%. The boundary F1 score (F_b) also reaches a leading 80.7%, surpassing the second-best model (HBGNet, 79.3%) by a notable margin of 1.4%. Similarly, IDBT attains the best IoU (86.9%), F1 (93.0%), and accuracy (91.6%) on the HB subset. For the XJ subset, IDBT again delivers a remarkable performance with IoU 95.1%, F1 97.4%, and accuracy 96.2%, all of which are the highest among all baselines. In terms of the distance-based boundary metric, IDBT also achieves the lowest ASD across all subsets, with values of 3.88, 2.77, and 4.12 pixels on JS, HB, and XJ, respectively. The ASD values indicate that the predicted parcel boundaries are, on average, only a few pixels away from the ground-truth contours, further confirming that IDBT achieves precise geometric alignment and smooth boundary transitions.
Regarding model efficiency, we compare the computational complexity using FLOPs. IDBT achieves the best performance while maintaining a moderate model size of 40.3 GFLOPs, which lies in the middle of the competing methods and is lower than the second-best-performing model, HBGNet (57.8 GFLOPs). We further assess practical deployment efficiency by reporting single-image inference latency and peak inference memory under a batch size of 1 (evaluated on FHAPD-JS). The evaluation results for competing models are provided in Appendix A, Table A1. IDBT demonstrates favorable runtime efficiency, achieving relatively low inference overhead while delivering the best segmentation accuracy.
These results demonstrate that IDBT is highly effective in accurately delineating agricultural parcels and preserving boundary details across diverse agricultural landscapes. Moreover, IDBT is also computationally efficient.
3.2.2. AI4Boundaries
The experimental results on the entire AI4Boundaries dataset are presented in Table 2. Since the network architecture and input dimensions remain consistent across datasets, the FLOP values for all models are identical to those reported in Table 1; for completeness, they are also included in Table 2 to indicate model efficiency. The proposed IDBT model achieves the highest scores across all evaluation metrics, including IoU (83.7%), F1 (91.1%), Acc (87.4%), and F_b (51.3%). Compared to other strong baselines such as SEANet and HBGNet, which both attain a high F1 score of 90.5–90.6% and accuracy of 86.7%, IDBT still leads with a +0.5–0.6% improvement in F1 and +0.7% in accuracy. Notably, in terms of boundary delineation, IDBT outperforms all methods, surpassing SEANet and HBGNet by 1.0% in F_b. IDBT also yields the lowest ASD of 7.43 pixels, indicating that its predicted parcel boundaries are spatially closer and more geometrically aligned with the ground truth compared to competing methods. The results further confirm the generalization capability of IDBT when applied to large, homogeneous fields typical of the AI4Boundaries dataset. To verify that the filtering of non-agricultural patches does not bias the evaluation, we conducted additional experiments using the unfiltered dataset. The results show that the filtering step affects the IoU by less than 0.5% and does not change the relative performance ranking among models.
In addition, although AI4Boundaries is primarily designed for semantic parcel segmentation, many of its regions contain relatively large and topologically complete agricultural fields compared with the small and densely tessellated parcels in FHAPD. This characteristic enables a scoped evaluation of instance-level delineation performance on a subset where parcel individuality is visually unambiguous. To provide complementary validation from this perspective, we construct a mini test set by selecting 178 AI4Boundaries images that exhibit clear and non-intersecting field boundaries. On this mini set, we compute instance Precision and instance F1 scores using an IoU ≥ 0.5 matching criterion. The results are reported in Appendix A, Table A2. IDBT achieves competitive instance Precision (ranking 3rd among tested models) while maintaining the highest instance F1 score. The results show IDBT’s consistency in parcel completeness and topology preservation under this evaluation setting. It should be noted that the official AI4Boundaries dataset only provides semantic parcel masks, and instance masks are derived from them through connected-component extraction. Therefore, while this assessment offers helpful complementary insights, its conclusions should be interpreted with appropriate caution given the inherent characteristics of agricultural parcel tessellation.
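The connected-component extraction and IoU ≥ 0.5 matching described above can be sketched as follows. This is a pure-Python illustration; the greedy one-to-one matching order is our assumption about the exact protocol.

```python
def connected_components(mask):
    """4-connected components of a binary mask; returns a list of pixel sets."""
    h, w = len(mask), len(mask[0])
    seen, comps = set(), []
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 1 and (i, j) not in seen:
                stack, comp = [(i, j)], set()
                seen.add((i, j))
                while stack:  # iterative flood fill
                    y, x = stack.pop()
                    comp.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1 \
                                and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                comps.append(comp)
    return comps

def instance_scores(pred_mask, gt_mask, thr=0.5):
    """Greedy one-to-one matching of predicted vs. reference parcels at IoU >= thr."""
    preds, gts = connected_components(pred_mask), connected_components(gt_mask)
    matched_gt, tp = set(), 0
    for p in preds:
        best, best_iou = None, 0.0
        for k, g in enumerate(gts):
            if k in matched_gt:
                continue
            iou = len(p & g) / len(p | g)
            if iou > best_iou:
                best, best_iou = k, iou
        if best is not None and best_iou >= thr:
            matched_gt.add(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy case: one parcel predicted exactly, the other shifted so IoU = 1/3 < 0.5
gt = [[1, 1, 0, 0, 1, 1],
      [1, 1, 0, 0, 1, 1],
      [0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0]]
pr = [[1, 1, 0, 0, 0, 0],
      [1, 1, 0, 0, 1, 1],
      [0, 0, 0, 0, 1, 1],
      [0, 0, 0, 0, 0, 0]]
precision, recall, f1 = instance_scores(pr, gt)
```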
3.3. Qualitative Visualizations
To provide a more intuitive understanding of model behavior across diverse agricultural scenarios, we present qualitative visualizations in Figure 5, Figure 6 and Figure 7.
Figure 5 showcases representative samples from the AI4Boundaries dataset and the three FHAPD subsets (JS, HB, and XJ). For each example, we display the input image, ground-truth mask label, the parcel delineation result predicted by our IDBT model, and the corresponding extracted boundary map. This visualization highlights the varied farming characteristics across datasets. Despite these variations, IDBT consistently generates complete and accurate parcel masks, maintaining boundary integrity even in the presence of noisy backgrounds, irregular field shapes, or densely clustered plots. The extracted boundaries are continuous and well-aligned with the ground truth. These segmentation results demonstrate strong geometric precision and topological coherence.
Figure 6 provides a broader comparison across all competing models. We selectively show two samples from two distinct geographic regions within AI4Boundaries. Here we visualize error maps relative to the ground truth for all models, where true negatives appear in black, true positives in white, false positives in red, and false negatives in blue. These error maps facilitate a detailed visual assessment of both interior classification accuracy and boundary delineation performance. The comparison reveals that IDBT produces fewer boundary-related misclassifications and maintains higher foreground completeness than other methods. Notably, the reductions in red/blue pixels along narrow ridges and field edges indicate better localization and topological consistency of IDBT.
To further highlight the difference, Figure 7 presents side-by-side visualizations of IDBT and the second-best-performing model (HBGNet). The figure includes two examples from AI4Boundaries and one example from each FHAPD subset, together with the error maps of the two compared methods. As seen in Figure 7, IDBT produces more complete and precise parcel delineations. Most regions in the IDBT outputs exhibit fewer errors and clearer boundaries. Moreover, we use green rectangles to highlight regions with visually significant improvements, where IDBT demonstrates sharper parcel boundaries and reduced structural distortions, particularly in irregular and densely interacting field regions. These visual advantages align with the quantitative improvements reported earlier.
3.4. Transfer Testing
Following the work of [24], we first conduct additional cross-subset transfer experiments within the FHAPD dataset. In these experiments, each model is trained on the training set of one regional subset (source domain) and directly tested on the test set of another (target domain) without any fine-tuning. This setup allows us to assess the transferability of our model across regions with diverse agricultural characteristics. The same training configuration as in the main FHAPD experiments is adopted. Three representative transfer configurations are designed: JS→XJ, HB→JS, and XJ→HB. For example, JS→XJ means that models are trained on the FHAPD-JS subset and evaluated on the FHAPD-XJ subset. These scenarios comprehensively assess the ability of each model to transfer beyond the spatial and phenological patterns of its training domain.
Table 3 summarizes the results. The IDBT model achieves the best performance across all transfer scenarios in terms of both IoU and F1 score. Specifically, IDBT attains 66.5% IoU and 79.9% F1 in JS→XJ, outperforming the second-best model (BPT) by a large margin of +6.4% IoU and +4.2% F1.
To further examine cross-domain generalization ability beyond dataset-internal shifts, we additionally evaluate a more challenging setting: training on FHAPD-JS and testing on the AI4Boundaries dataset (JS→AI4B), which introduces substantial domain discrepancies in parcel scale, topography, and geographic context. The results are also provided in Table 3. Although IDBT continues to outperform other methods in both IoU and F1 under this configuration, the performance of all deep models decreases significantly (IoU values for some baselines drop below 40%), indicating that a direct zero-shot transfer across such heterogeneous datasets remains difficult.
This performance gap reflects widely recognized challenges in agricultural remote sensing: strong geographic domain shifts, sensor differences, and landscape-pattern discrepancies can severely affect cross-dataset transferability [47,48,49]. Taken together, the results demonstrate that IDBT transfers robustly across subsets within FHAPD where data sources and cultivation patterns are more consistent, while much larger shifts such as FHAPD→AI4B remain challenging for all deep models. These findings suggest that IDBT already provides strong transferability in realistic operational scenarios where imagery is collected under similar sensing conditions, and they also highlight an important direction for future research toward more domain-adaptive agricultural parcel delineation.
3.5. Ablation Study
We conduct the ablation study in two scenarios to evaluate the contribution of each major component of IDBT: BITM&BIFM, FAM, MFFM, and the Dual-PointRend strategy.
The first scenario is in-domain testing on the FHAPD-JS subset, whose results are reported in Table 4. Starting from the baseline (Exp ), the introduction of BITM&BIFM alone brings a clear improvement in both IoU and F1 ( and , Exp ). Adding FAM (Exp ) or MFFM (Exp ) further enhances the performance. When FAM and MFFM are used together, the model achieves large gains compared to the baseline (Exp vs. ) and compared to BITM&BIFM alone (Exp vs. ). Combining BITM&BIFM, FAM, and MFFM (Exp ) yields a significant overall improvement of in IoU and in F1 over the baseline. The application of the Dual-PointRend strategy does not lead to a notable improvement in this setting ( in IoU and in F1, Exp ).
The second scenario evaluates the transferability of each component in the JS→XJ cross-subset transfer test. The results are shown in Table 5. Similar to the first scenario, introducing any single module or the combination of two modules leads to a clear improvement over the baseline. Notably, the performance gains are even more pronounced in this transfer test. For example, using FAM and MFFM together yields a improvement in IoU over the baseline. In addition, comparing Exp with , the Dual-PointRend strategy brings a more significant improvement, suggesting that its point-level refinement plays a larger role when the model is tested on unseen regions.
3.6. Discussion
Overall, the experimental results across both the FHAPD and AI4Boundaries datasets demonstrate that IDBT consistently outperforms state-of-the-art baselines in terms of segmentation accuracy and boundary delineation quality. The strong performance holds across diverse agricultural landscapes, indicating the model’s robustness under varying geographic and topographic conditions.
From the transfer testing experiments, it is evident that IDBT generalizes well to unseen regional distributions. Even without fine-tuning, it consistently maintains a clear lead in all cross-subset evaluations within FHAPD. This result highlights the strong transferability of IDBT within domains where sensing modality and field morphology remain comparable. It is an essential property for real-world deployment where labeled data may be limited or unavailable for certain regions.
The qualitative visualizations provide strong intuitive support for the quantitative results. The visualization results show how IDBT effectively handles diverse agricultural patterns. In addition, the error maps clearly illustrate IDBT’s superior ability in achieving both mask completeness and boundary precision compared to the competing models.
Moreover, the ablation study clarifies the contribution of each component. BITM&BIFM establish a strong foundation by explicitly encoding edge cues early in the network. The FAM and MFFM modules further enhance representation capacity by integrating multi-level features with spatial and semantic focus. Regarding the Dual-PointRend supervision, its improvement in the in-domain setting appears marginal because the base model already achieves high accuracy (F1 above 90%). Given that Dual-PointRend selectively refines a limited number of points, most of which are already correctly predicted in such cases, the refinement gain becomes numerically small. However, under domain-shift scenarios where the base segmentation is less accurate, the refinement-oriented learning of Dual-PointRend strengthens the model’s boundary sensitivity and generalization ability, leading to more evident performance gains in transfer tests.
Finally, it is worth noting that IDBT maintains strong accuracy while being relatively lightweight. As shown in the FLOPs comparison, our model achieves better or comparable results with lower computational cost than many complex baselines.
Although the current design emphasizes efficiency, future work may explore lightweight distillation or dynamic inference strategies to further reduce computational cost for edge deployment scenarios. In addition, the cross-dataset transfer observations (FHAPD-JS→AI4Boundaries) motivate further exploration of domain-adaptive and multi-sensor parcel delineation frameworks to enhance robustness across heterogeneous data sources. Moreover, the proposed framework demonstrates strong potential for broader geospatial applications. Its boundary-faithful and instance-consistent delineation capability provides a solid foundation for tasks that require legally traceable and high-confidence parcel geometries. These characteristics suggest that IDBT could be further extended toward compliance-oriented mapping and other practical geospatial analysis scenarios.