Topology-Aware Field Parcel Delineation: Bridging Deep Semantic Features and Geometric Constraints

Duan, Yaming; Zhao, Runze; Xu, Xiangde; Zhang, Jinshui

doi:10.3390/rs18111783



¹ State Key Laboratory of Remote Sensing Science, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China

² Beijing Engineering Research Center for Global Land Remote Sensing Products, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China

³ Key Laboratory of Radiometric Calibration and Validation for Environmental Satellites, National Satellite Meteorological Center (National Center for Space Weather), China Meteorological Administration, Beijing 100081, China

⁴ Innovation Center for FengYun Meteorological Satellite, Beijing 100081, China

⁵ State Key Laboratory of Severe Weather, Chinese Academy of Meteorological Sciences, Beijing 100081, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1783; https://doi.org/10.3390/rs18111783

Submission received: 3 April 2026 / Revised: 13 May 2026 / Accepted: 21 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Advanced Deep Learning Techniques for Information Extraction and Analysis of Remote Sensing Imagery)


Highlights

What are the main findings?

A “robust baseline + explicit constraint” strategy outperforms complex SOTA models. Single-task CNNs (e.g., DLinkNet, PSPNet) with our topological post-processing achieve better training efficiency and practicality, while complex models suffer from artifacts or gradient conflicts.
Boundary buffering (3-pixel dilation) during training mitigates class imbalance and provides implicit width information, which is crucial for distinguishing single-line boundaries from linear features like roads or ditches.

What are the implications of the main findings?

The proposed “topology-first” framework (TopoFP) departs from the traditional “extraction-then-repair” paradigm, directly generating vector entities with intrinsic spatial adjacency and structural integrity.
With an F1 score of 0.910 and IoU of 0.835, the method provides a reliable and geometrically reasonable solution for end-to-end vector mapping in agriculture, even from fragmented CNN predictions.

Abstract

The accurate and automated delineation of Field Parcels (FPs) serves as the foundation for modern precision agriculture. While deep learning-based extraction from high-resolution remote sensing imagery has improved pixel-level accuracy, current methods often neglect the intrinsic topological relationships between parcels, leading to geometric inconsistencies such as broken boundaries and structural ambiguities. To address these limitations, this paper proposes a topology-aware, end-to-end framework for polygonal FP extraction. We employ Convolutional Neural Networks (CNNs) with a coupled boundary-region representation to extract deep features that implicitly encode boundary width. Crucially, we introduce a Topological Relationship Construction (TRC) mechanism that transforms raster features into a node-edge topological network, enabling the direct generation of vector entities with guaranteed spatial adjacency. Based on this topology, we further develop Double-Line Detection (DLD) and Dangling Line Extension (DLE) algorithms to resolve the topological absence of single/double-line boundaries and fix fracture errors in complex scenarios. Experimental results demonstrate that the proposed method achieves an F1 score of 0.910 and an IoU of 0.835, effectively ensuring stable and geometrically reasonable outputs even when CNN predictions are fragmented. This approach provides a solution for end-to-end vector mapping in agriculture.

Keywords:

field parcel extraction; deep learning; topological relationship; vectorization; high-resolution remote sensing

1. Introduction

Field parcels (FPs) serve as the fundamental geospatial units of modern agriculture, constituting the underlying data that supports precision monitoring, yield analysis, and agricultural management [1,2,3]. At national or regional scales, there is an escalating demand for the efficient production and updating of high-quality FP data [4]. Historically, FP delineation relied heavily on manual digitization, a process that is cost-prohibitive and labor-intensive. Consequently, extensive research has been dedicated to FP identification techniques. These methods typically divide the task into two stages: boundary extraction and region extraction [5,6].

In the early stages, end-to-end methods primarily relied on traditional computer vision and machine learning techniques. Boundary extraction often employed predefined operators (e.g., Canny, Sobel) to capture high-frequency image information [7,8], while region extraction utilized pixel-level classifiers like Support Vector Machines (SVM) [9] or Random Forests [10], as well as object-oriented segmentation methods [11]. Although these approaches established a foundation for reducing human effort, they are constrained by the limited representational capacity of hand-crafted features and a lack of high-level semantic understanding. As a result, traditional methods often struggle to differentiate true boundaries from intra-field textures, leading to suboptimal performance, especially in complex or fragmented agricultural scenarios.

In recent years, the rapid evolution of Deep Learning (DL) has established Convolutional Neural Networks (CNNs) as the mainstream approach for FP feature extraction [1,2]. Leveraging powerful learnable representations, CNNs can effectively categorize FP boundaries and regions from massive datasets [9,12,13,14,15,16]. Advanced backbone architectures, such as ResNet [17], DeepLabV3+ [18], and UNet [19], have been widely adopted as feature encoders. To further harmonize boundary and region features, multi-task learning paradigms have been introduced to enhance recognition accuracy [3,12,13,20]. Beyond semantic segmentation, instance segmentation methods such as Mask R-CNN [21] and its variants, as well as end-to-end vectorization approaches like E2EVAP [22], have also been applied to field parcel delineation, attempting to directly generate individual parcel instances or vector polygons. However, these methods predominantly concentrate on the precise localization and shape construction of individual parcels, rather than on the topological adjacency between neighboring parcels. Despite these advancements, existing DL-based methods struggle to construct accurate topology. They typically suffer from two limitations that destroy structural connectivity: (1) boundary fragmentation, where continuous boundaries are predicted as broken segments; and (2) boundary adhesion, where adjacent boundaries are erroneously merged. These issues arise because standard CNNs optimize for pixel-level classification accuracy rather than geometric structural integrity. Current solutions largely rely on low-level morphological post-processing (e.g., dilation, edge linking). These “patch-based” operations fail to incorporate the topological rules inherent to geospatial entities, often yielding results that are geometrically connected but topologically erroneous, thus failing to meet the strict quality standards of downstream applications.

A critical yet often overlooked aspect of post-processing is the topological ambiguity of boundaries in raster formats. FP boundaries generally fall into two categories: topological transitions with theoretical zero width (single lines), and linear features with actual spatial coverage, such as roads and ditches (double lines). In raster representations, distinguishing between these types is difficult, especially when narrow linear features appear blurred due to resolution limits. Conventional skeletonization algorithms [23] handle pixel-to-line conversion by thinning all features to single-pixel width. This inevitably simplifies complex double-line structures into single lines, causing topological distortion. Existing research has predominantly focused on simple line repair and closure, with limited exploration into how topological relationships between boundaries can be explicitly modeled to guide the vectorization process.

To address these challenges, this study proposes an end-to-end, deep learning-based FP extraction framework considering controllable topological relationships. We introduce a workflow that couples CNNs (specifically optimized for agricultural feature representation) with a topology-controlled vectorization module. To ensure robust feature extraction, we evaluate four advanced semantic segmentation CNNs with distinct receptive field mechanisms to identify the optimal backbone. Furthermore, we design a Topological Relationship Construction (TRC) mechanism that transcends pixel-level operations. By explicitly modeling the intersection and connectivity of boundary segments, our method enables advanced functions such as automatic Double-Line Detection (DLD) and Dangling Line Extension (DLE). This approach ensures that the generated FPs are not only visually complete but also topologically valid. To facilitate community research, the source code and pre-trained models are open sourced at https://github.com/dymwan/TopoFP (accessed on 20 May 2026).

The main contributions of this paper are summarized as follows:

1. A topology-first framework for end-to-end extraction is proposed. Distinct from the traditional “extraction-then-repair” paradigm, this method innovatively positions Topological Relationship Construction (TRC) at the core of the raster-to-vector conversion process. By constructing a node-edge topological network, it directly generates vector entities, intrinsically ensuring the correctness of spatial adjacency and effectively addressing structural integrity challenges in complex scenarios.
2. Topology-constrained algorithms for Double-Line Detection (DLD) and Dangling Line Extension (DLE) are developed. Leveraging implicit width features and geometric collinearity, these modules automate the distinction of single/double boundaries and fracture repair within a unified topological graph. This approach resolves the dilemma of balancing structural integrity and geometric precision in complex agricultural landscapes.
3. A comprehensive evaluation of model architectural mechanisms and robustness across diverse landscapes is conducted. By analyzing the impact of receptive field mechanisms (e.g., dilated convolutions vs. pyramid pooling) across three distinct study areas, we demonstrate that the proposed framework offers superior generalization and practical utility compared to complex state-of-the-art models, effectively adapting from regular farms to fragmented parcels.

The remainder of this paper is organized as follows: Section 2 introduces the study area and dataset construction strategy; Section 3 details the components of the proposed method; Section 4 presents the experimental results and analysis; Section 5 discusses the consistency and feasibility of the method; and Section 6 summarizes the core points of the approach.

2. Datasets and Study Sites

2.1. Study Sites

The study sites were selected to capture the diversity of agricultural landscapes, considering both spatial and temporal factors, to satisfy the data requirements for robust CNN training. We selected Heilongjiang (HLJ), Shandong (SD), and Zhejiang (ZJ) provinces, which are representative of the diverse farming patterns encountered across the major agricultural regions of China (shown in light-green in Figure 1). These three provinces span a latitudinal range of approximately 25°, where the distribution pattern of cultivated land gradually transitions from large-scale, regular parcels in the north to fragmented, irregular patterns in the south. Additionally, these regions contain diverse arable land types, including black soil, loess, and paddy fields, exhibiting distinct spectral reflectance characteristics in remote sensing imagery.

2.2. Training and Testing Data

The distribution of remote sensing data and corresponding annotations used for model training is shown in Figure 1 (Left). For the training dataset, we acquired high-resolution remote sensing images from Google Maps (https://www.google.com/maps, accessed on 15 March 2023) as level 16 tiles, which cover four counties in HLJ, five counties in SD, and six counties in ZJ, spanning areas of $41 \times 10^{3}$, $8.7 \times 10^{3}$, and $12 \times 10^{3}$ km², respectively. The dataset was split into training and validation sets: a randomly selected 20% of the tiles served as the validation set to monitor training performance, while the remaining 80% were used for training. For independent testing in the three provinces, we acquired 9 separate image patches, each covering a $10 \times 10$ km area, to conduct quantitative and qualitative performance assessments.

All images were annotated with three categories: boundary, cultivated land (FP region), and non-cultivated land. The non-cultivated land category serves as the background class, encompassing all areas that are neither field parcel interiors nor boundaries—including woodlands, buildings, water bodies, and other non-agricultural features. The delineation of FP polygons followed the principle that any field enclosed by roads, canals, or woods required boundary labeling. Separating features exceeding 10 m in width were labeled as double-line boundaries, with the area between them marked as non-cultivated land. To address the class imbalance caused by the scarcity of boundary pixels relative to other categories, the vector boundaries were rasterized and buffered by a width of three pixels during the training phase.

2.3. Auxiliary Dataset for Boundary Width Analysis

To investigate the capability of the deep learning models in representing boundary widths, this study established an auxiliary dataset of point-wise samples linked to the vectorized labels. Each sample point was annotated with the actual width of the boundary where it resides. The samples were randomly selected from arbitrary segments of shared boundaries within the labeled polygons. The ground truth width for each point was derived from the average measurements of three digitization experts. Specifically, for each ground truth point, we constructed a perpendicular line segment to the boundary tangent to extract the cross-sectional intensity profile and measure the physical width.

3. Methods

In this study, we propose TopoFP, a unified, end-to-end framework designed for the vectorized extraction of field parcels (FPs) from high-resolution remote sensing imagery. As illustrated in Figure 2, TopoFP bridges the gap between pixel-level semantic segmentation and object-level geometric modeling. The framework comprises two coupled components: (1) a Deep Feature Representation Module that utilizes CNNs to learn boundary-aware representations, implicitly encoding width information; and (2) a Topology-Constrained Vectorizer that transforms raster predictions into topologically valid polygons via the proposed Topological Relationship Construction (TRC) mechanism.

3.1. Deep Feature Representation in TopoFP

The first stage of TopoFP aims to extract robust feature representations that serve as the foundation for topological construction.

3.1.1. Boundary-Aware Representation Strategy

In complex agricultural landscapes, boundaries vary significantly in width due to linear features like ditches and roads. To enable TopoFP to distinguish between single-line and double-line boundaries in the subsequent stages, we adopt a boundary buffering strategy. During training, FP boundaries are rasterized and buffered by 3 pixels, as shown in Figure 3. This operation not only maximizes connectivity to mitigate class imbalance but, more importantly, forces the network to learn a “thickened” boundary representation. This implicit width information learned by the CNN is a critical prerequisite for the Double-Line Detection (DLD) module described in Section 3.2.2.
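The buffering step can be illustrated with a plain binary dilation. The sketch below is only an illustration under stated assumptions: the function name `dilate`, the square structuring element, and the reading of the buffer size as a radius are hypothetical choices, since the text does not specify how the 3-pixel buffer is constructed.

```python
def dilate(mask, radius):
    """Binary dilation of a 2-D 0/1 grid with a square neighbourhood.

    Assumption: `radius` is the number of pixels added on each side of a
    boundary pixel; the paper does not state the structuring element.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            # Mark every pixel inside the square window centred on (r, c).
            for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                    out[rr][cc] = 1
    return out


# A one-pixel-wide vertical boundary in a 9 x 9 tile.
line = [[1 if c == 4 else 0 for c in range(9)] for _ in range(9)]
buffered = dilate(line, 1)   # boundary becomes 3 pixels wide
wide = dilate(line, 3)       # boundary becomes 7 pixels wide
```

Either reading of the buffer size raises the share of boundary pixels in the label map, which is the effect the strategy relies on.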

3.1.2. Network Architectures

TopoFP is designed to be backbone-agnostic, but to ensure optimal feature extraction, we evaluate four state-of-the-art semantic segmentation architectures as the feature extractor: PSPNet [24], the UNet series [19,25], DeepLabV3+ [18], and DLinkNet [26].

We adopt PSPNet for its global context aggregation via the Pyramid Scene Parsing Module (PSPM) [24]. A skip connection is introduced between the initial encoder layer and the post-PSPM layer to recover shallow geometric details [27]. Originally designed for road extraction, DLinkNet [26] is utilized for its cascaded dilated convolutions, which expand the receptive field without resolution loss, making it highly effective for maintaining the continuity of linear boundaries. The UNet series [19,25] of networks was originally designed for image segmentation tasks, following a hierarchical linking pattern. According to this pattern, shallow features from the encoding stage are connected to deep features in the decoding stage with the same resolution. This connection enables the model to link features of different resolutions, corresponding to various down-sampling rates, thereby ensuring accuracy and fine-scale detail in pixel-scale segmentation tasks. DeepLabV3+ [18] is dedicated to achieving high-quality segmentation by employing depthwise separable convolutions and atrous (dilated) convolution. In contrast to the conventional strategy of stacking layers to increase the receptive field size, the use of separable and atrous convolutions can effectively expand the receptive field without markedly increasing the model complexity.

3.1.3. Loss Function

To drive the learning of TopoFP’s feature extractor, we employ a hybrid loss function combining Cross-Entropy (CE) Loss and Dice Loss [25,27]. This combination allows the model to simultaneously optimize pixel-level classification accuracy and structural compactness:

$$L(W) = -\frac{1}{N} \sum_{i} \sum_{c=1}^{M} w_{c}\, y_{ic} \log(p_{ic}) + 1 - \frac{2I + \epsilon}{U + \epsilon} \tag{1}$$

$$I = \sum_{c=1}^{M} y_{c\_onehot}\, p_{c}, \quad U = \sum_{c=1}^{M} (p_{c}) \tag{2}$$

The first part of the combined loss function represents the CE component, where y represents the reference map and p denotes the model output after the softmax operation. The indices i and c signify the specific pixel and channel corresponding to a particular class, respectively. The coefficient $w_{c}$ is dynamically computed from the class proportions, reflecting the pixel-wise distribution of each class. The second part of the loss function represents the Dice loss. In Equation (2), I and U represent the counts of pixels in the intersection and union, respectively, between the model output and reference data. The small constant $\epsilon$ (set to 0.0001 herein) is added to prevent division by zero or infinite results. By combining the CE loss and Dice loss, the model is optimized to achieve both precise category localization and compact shape edges, while the dynamic class weighting in the CE loss helps address the class imbalance issue during training.
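As a rough illustration of how the two terms combine, the following pure-Python sketch evaluates a weighted CE term plus a Dice term on a handful of pixels. It is not the authors' implementation: the function name `hybrid_loss` and the data layout are hypothetical, the class weights are passed in rather than computed dynamically, and the Dice denominator is assumed to follow the common convention of summing both the predicted probabilities and the one-hot labels.

```python
import math


def hybrid_loss(probs, onehot, weights, eps=1e-4):
    """Weighted cross-entropy plus Dice loss over N pixels and M classes.

    probs   : list of N lists with M softmax probabilities each
    onehot  : list of N lists with M one-hot reference labels each
    weights : list of M class weights (assumed precomputed)
    """
    n = len(probs)
    ce = 0.0
    inter = 0.0   # I: soft intersection between prediction and reference
    denom = 0.0   # U: assumed here to be sum(p) + sum(y)
    for p_i, y_i in zip(probs, onehot):
        for p, y, w in zip(p_i, y_i, weights):
            if y:
                # Clamp to avoid log(0) for a confidently wrong prediction.
                ce -= w * math.log(max(p, 1e-12))
            inter += y * p
            denom += y + p
    ce /= n
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    return ce + dice


perfect = hybrid_loss([[1.0, 0.0], [0.0, 1.0]], [[1, 0], [0, 1]], [1.0, 1.0])
uncertain = hybrid_loss([[0.5, 0.5]], [[1, 0]], [1.0, 1.0])
```

Under these assumptions a perfect prediction drives both terms to zero, while an uncertain prediction is penalised by both the CE and the Dice component.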

3.2. Topology-Constrained Vectorization

This study proposes a topology-constrained vectorization (TCV) method to address the issue of incomplete feature extraction in the generated FPs. As illustrated in Figure 2, its foundation is a processing step termed Topological Relationship Construction (TRC), which constructs the point and edge connections along the boundary skeleton of the FPs. Building on this topological information, the method introduces two modules: the Double-Line Detection (DLD) module, which leverages the TRC to identify and fix discontinuities or gaps in the FP boundaries by detecting and connecting dual lines; and the Dangling Line Extension (DLE) module, which further refines the FP boundaries by extending dangling or unconnected line segments based on their topological relationships. The vectorization module then instantiates the complete FPs by unifying the refined boundary details with the region features extracted using deep learning. By incorporating these topology-based post-processing steps, the method generates more accurate and complete FP representations, addressing the limitations of the initial feature extraction.

3.2.1. Topological Relationship Construction (TRC)

In deep-learning-based feature extraction, object boundaries are typically represented as pixel chains with inherent width, whereas geographical information system data formats define them as strictly linear features. To bridge this gap and properly capture the true geometry of objects with evident width, we developed a processing pipeline centered on the Topological Relationship Construction (TRC) module. The primary function of the TRC is to restore critical boundary points and analyze the topological relationships between adjacent feature pixels. Based on this topological foundation, the pipeline incorporates two subsequent modules: the DLD module, which identifies discontinuous wide boundaries and reconstructs them into dual-line representations, and the DLE module, which further refines the geometry by extending dangling line segments.

The core methodology of the TRC relies on a deterministic, graph-based approach. Initially, planar boundaries are simplified into linear pixels via Zhang-Suen skeletonization [23], which guarantees that each junction node has a maximum degree of four. As depicted in Figure 4, a $3 \times 3$ sliding window is used to partition the neighboring pixels into disconnected segments. The central point is then classified based on this segment count: 1 segment corresponds to an end-point, 2 segments to an on-line point, and ≥3 segments to a cross-point. By masking the skeleton map at these cross-points, the algorithm isolates individual line segments. This allows for the immediate construction of an adjacency matrix detailing the topological relationships among all points and segments (Figure 2II.a), enabling efficient topological updates and the derivation of geometric properties like length and angle. The complete GPU-accelerated pseudocode for this process is detailed in Algorithm A1 (Appendix A). This counting-based classification method is inherently robust across diverse boundary geometries because it does not rely on learned heuristics or fragile thresholds.

To maintain geometric fidelity in complex structures, the pipeline incorporates explicit safeguards. During DLD processing, the outermost pixels of thick boundaries are protected prior to internal hollowing to prevent new topological breaks. Furthermore, dual-line reconstruction utilizes perpendicular dilation from the centerline, ensuring that the generated boundaries remain strictly parallel to the original geometry. While centerline dilation can produce minor shape distortions at the apex of extreme acute angles, this limitation and potential minimum spanning tree-based optimization strategies are further discussed in Section 5.2.
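The window-based point classification described in this subsection can be sketched in a few lines of plain Python. This is a CPU illustration only, not Algorithm A1: the names `count_segments` and `classify_points` are hypothetical, and a "segment" is assumed to be a run of consecutive foreground pixels when the eight neighbours are visited in circular order.

```python
# Eight neighbours visited clockwise, starting from north.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def count_segments(skel, r, c):
    """Count disconnected runs of skeleton pixels around (r, c)."""
    h, w = len(skel), len(skel[0])
    vals = []
    for dr, dc in RING:
        rr, cc = r + dr, c + dc
        inside = 0 <= rr < h and 0 <= cc < w
        vals.append(1 if inside and skel[rr][cc] else 0)
    # A run starts wherever a foreground pixel follows a background pixel;
    # vals[k - 1] wraps around for k == 0, which closes the ring.
    runs = sum(1 for k in range(8) if vals[k] and not vals[k - 1])
    if runs == 0 and all(vals):
        runs = 1  # fully surrounded pixel: one unbroken ring
    return runs


def classify_points(skel):
    """Label each skeleton pixel as 'end', 'line', 'cross' or 'isolated'."""
    labels = {}
    for r, row in enumerate(skel):
        for c, v in enumerate(row):
            if not v:
                continue
            n = count_segments(skel, r, c)
            if n == 0:
                labels[(r, c)] = "isolated"
            elif n == 1:
                labels[(r, c)] = "end"
            elif n == 2:
                labels[(r, c)] = "line"
            else:
                labels[(r, c)] = "cross"
    return labels


# A plus-shaped skeleton: one cross-point, four arms, four end-points.
plus = [[1 if r == 2 or c == 2 else 0 for c in range(5)] for r in range(5)]
labels = classify_points(plus)
```

Masking the pixels labelled `cross` would then split the skeleton into the individual line segments from which the adjacency information is built.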

3.2.2. Double-Line Detection (DLD)

The distinction between single and double-line boundaries is critical in practical delineation. Usually, human experts decide whether to create a dual-line boundary based on an empirical threshold of boundary width. In this paper, we propose and use the DLD algorithm to technically distinguish whether a boundary is of dual-line type according to the width of the planar boundary extracted by the deep-learning model. The workflow of the DLD algorithm is shown in Figure 2II.b; this differs from the conventional method in which boundary width is not considered.

According to the algorithm principle, the planar boundary is symmetric about the skeleton. Therefore, it is possible to derive the width of boundaries based on the distance from pixels along the skeleton to the nearest non-boundary pixels. The distance transform algorithm [28] is used to obtain the width of the planar boundary corresponding to each point on the skeleton. The DLD algorithm then uses each independent boundary arc obtained via the TRC to calculate the mean width of all skeleton points on each planar boundary.

Similar to the human-expert approach, a preset width threshold is used to filter the preliminary dual-line boundaries. To ensure the continuity of adjacent dual-line boundaries, the algorithm further tests line segments lying adjacent to those that passed the threshold test. According to the TRC results, a line segment that fails the threshold test is considered to be a dual line if it meets both the following conditions: (1) there is a dual-line boundary in the boundaries connected with it and their angular difference is less than 45°; (2) the width of the line segment is greater than 80% of the threshold.
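The two-stage decision rule above can be written down compactly. The sketch below is an illustration under assumptions: the segment dictionary layout, the function names, the use of `>=` for the primary threshold test, the numeric threshold, and the single propagation pass are all hypothetical, as the text does not fix them.

```python
def angular_difference(a, b):
    """Smallest angle in degrees between two undirected line orientations."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def detect_dual_lines(segments, adjacency, threshold, max_angle=45.0, relax=0.8):
    """Return the ids of segments treated as dual-line boundaries.

    segments  : {id: {"width": mean width, "angle": orientation in degrees}}
    adjacency : {id: iterable of connected segment ids} (from the TRC step)
    """
    # Stage 1: segments whose mean width passes the threshold test.
    dual = {sid for sid, s in segments.items() if s["width"] >= threshold}
    # Stage 2: a failed segment is promoted if it is wider than 80% of the
    # threshold and connects to a dual line at an angle below 45 degrees.
    promoted = set()
    for sid, s in segments.items():
        if sid in dual or s["width"] <= relax * threshold:
            continue
        for nid in adjacency.get(sid, ()):
            if nid in dual and angular_difference(
                s["angle"], segments[nid]["angle"]
            ) < max_angle:
                promoted.add(sid)
                break
    return dual | promoted


segments = {
    "a": {"width": 6.0, "angle": 0.0},    # passes the threshold
    "b": {"width": 4.5, "angle": 10.0},   # nearly collinear neighbour of "a"
    "c": {"width": 4.5, "angle": 90.0},   # perpendicular neighbour of "a"
    "d": {"width": 3.0, "angle": 0.0},    # too narrow
}
adjacency = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
result = detect_dual_lines(segments, adjacency, threshold=5.0)
```

Only segment `b` is promoted here, since `c` fails the angle condition and `d` fails the relaxed width condition.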

Finally, dual-line boundaries in the skeleton map are generated from the planar boundary map. To reconstruct the planar boundary map, the dual-line segments and cross-points connected with them are expanded several times. Then, the expanded area is used as a mask to hollow out the planar boundary.

3.2.3. Dangling Line Extension (DLE)

Inspired by previous work on fixing broken lines [29], the DLE module is used herein to detect and fix dangling lines resulting from incomplete extraction by the deep-learning model. In the TRC results, each line segment has two connected points, which can be either cross points, end points, or both.

On the basis of the point-line topological relationship, dangling lines are first identified as line segments that connect to at least one end point. Then, the dangling lines are extended. The dangling-line extension in this paper is conditional. Line segments that satisfy both of the following criteria are extended: (1) the length of the line segment is less than 10 pixels, and (2) the five pixels at the end to be extended must be collinear. Here, collinearity is established when the distance between the middle point and the line connecting the endpoints is less than a preset threshold ($D_{mt}$). In this paper, $D_{mt}$ is set to one pixel, which is equivalent to the on-line point having a maximum offset of 26° from the line across the end points. This strict collinearity criterion is chosen to avoid mistaken extensions.
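The collinearity test reduces to a point-to-line distance. A minimal sketch follows, assuming the five tail pixels are given as ordered `(x, y)` coordinates; the function names are hypothetical and the length criterion is left out.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a, b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    # |cross product| / base length gives the triangle height.
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    return abs(cross) / length


def is_collinear(tail, d_mt=1.0):
    """Check the middle pixel of `tail` against the line through its ends."""
    first, middle, last = tail[0], tail[len(tail) // 2], tail[-1]
    return point_line_distance(middle, first, last) < d_mt


straight = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
stepped = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 1)]
bent = [(0, 0), (1, 1), (2, 2), (3, 1), (4, 0)]
```

With `d_mt` of one pixel, the straight and mildly stepped tails pass the test, whereas the sharply bent tail is rejected and would not be extended.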

A maximum extension length ($E_{max}$) should be preset to avoid additional errors caused by the extension process. In this paper, $E_{max}$ is set as a constant 50 m. According to $E_{max}$, the extension paths for all end points in the extendable dangling lines are first derived. A convolution process is then employed to conduct the extension synchronously over $E_{max}$ iterations along the paths. During this iteration, if an extended line segment reaches another line, it halts. After DLE, a further TRC is required to update the topological relationship.

3.2.4. Vectorization

The fine boundary skeleton derived successively from the DLD and DLE processes is then converted into ready-to-use FP data. The regions enclosed by the fine skeleton are classified according to the FP region features extracted by the deep-learning model. By leveraging the constraint of the closed boundary and FP region features, the watershed algorithm [30] is applied. In this step, the FP region pixels serve as positive seeds, while non-cultivated-land pixels function as negative constraints (background barriers). Consequently, only regions that are (a) enclosed by valid boundaries and (b) contain FP region seeds are instantiated as polygons. Enclosed regions containing only non-cultivated-land seeds, such as woodlands surrounded by roads, are correctly excluded from the final output, preventing complex background features from being erroneously vectorized. At this stage, there are still a large number of pixels marked as boundaries, but in the real world, boundaries do not cover any area. To address this issue, the boundary pixels are eliminated guided by the unique identifiers of the FPs. Specifically, each boundary pixel is labeled with the maximum value from its $3 \times 3$ neighborhood. Subsequently, the individually labeled FP map is vectorized using the polygonization algorithm provided by the open-source GDAL [31]. The Douglas-Peucker algorithm [32] is then applied to simplify the vector boundaries and remove redundant vertex points.
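The boundary-elimination rule can be illustrated on a small label grid. The sketch assumes that parcels carry positive integer identifiers and that boundary pixels carry the value 0; both the encoding and the function name `fill_boundaries` are hypothetical.

```python
def fill_boundaries(labels, boundary=0):
    """Assign each boundary pixel the maximum label in its 3 x 3 window."""
    h, w = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for r in range(h):
        for c in range(w):
            if labels[r][c] != boundary:
                continue
            # Read from the original grid so results do not cascade.
            out[r][c] = max(
                labels[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            )
    return out


# Two parcels (ids 1 and 2) separated by a one-pixel boundary column.
grid = [
    [1, 0, 2],
    [1, 0, 2],
    [1, 0, 2],
]
filled = fill_boundaries(grid)
```

After this step the two parcels share an edge with no boundary pixels left between them, so polygonization yields adjacent polygons without slivers.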

3.3. Accuracy Evaluation

3.3.1. Pixel Level Metrics

In assessing the performance of the deep-learning model, a confusion matrix is constructed to establish the metric basis for each category [33]: true positive ($TP$), false positive ($FP$), true negative ($TN$), and false negative ($FN$). Specifically, each category is defined as either positive or negative; $FP$ denotes cases where the model incorrectly predicts the positive class when the true category is negative, while $TP$, $TN$, and $FN$ follow the same principle. Throughout model training, loss and intersection over union ($IoU$) are employed to gauge the performance of the model during each validation epoch. The calculation of $IoU$ for each category is expressed as follows:

$$IoU_{i} = \frac{TP}{TP + FP + FN} \tag{3}$$

where the subscript denotes the metric of the i-th category. Meanwhile, $Recall$, $Precision$, and $F1$ are derived from the confusion matrix via the following formulae:

$$\begin{aligned} Recall &= \frac{TP}{TP + FN}, \\ Precision &= \frac{TP}{TP + FP}, \\ F1 &= \frac{2 \times Recall \times Precision}{Recall + Precision} \end{aligned} \tag{4}$$
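Equations (3) and (4) translate directly into code. The helper below is a small sketch with a hypothetical name and return layout; it returns 0.0 when a denominator vanishes, which is a convention chosen here rather than one stated in the text.

```python
def pixel_metrics(tp, fp, fn):
    """IoU, recall, precision and F1 for one class from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0

    iou = ratio(tp, tp + fp + fn)
    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2.0 * recall * precision, recall + precision)
    return {"iou": iou, "recall": recall, "precision": precision, "f1": f1}


metrics = pixel_metrics(tp=8, fp=2, fn=2)
```

When both scores are computed from the same counts for a single class, they are linked by F1 = 2 IoU / (1 + IoU), so an IoU of 0.835 corresponds to an F1 of about 0.910.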

3.3.2. Object Level Metrics

This study employs three object-level evaluation metrics [3,34,35] to quantitatively analyze the alignment between predicted results and ground truth parcels. Leveraging the spatial topological relationships of field parcel entities, these metrics assess the matching degree in terms of geometric morphology and spatial distribution. The specific calculation procedures are as follows:

Global Over-Classification ($GOC$) quantifies the degree of over-segmentation of predicted parcels relative to ground truth parcels. It is derived by calculating the weighted average of the overlap area ratio between each predicted parcel and its nearest ground truth counterpart. The calculation process is defined as:

$$GOC = \sum_{i=1}^{m} \left( OC(S_{i}) \cdot \frac{area(S_{i})}{\sum_{j=1}^{m} area(S_{j})} \right) \tag{5}$$

Global Under-Classification ($GUC$) measures the degree of under-segmentation of predicted parcels, reflecting the proportion of the predicted region that fails to cover the ground truth. Its expression is:

$$GUC = \sum_{i=1}^{m} \left( UC(S_{i}) \cdot \frac{area(S_{i})}{\sum_{j=1}^{m} area(S_{j})} \right) \tag{6}$$

Global Total-Classification ($GTC$) integrates both over-segmentation and under-segmentation errors, evaluating the overall segmentation deviation through a root mean square form. The calculation formula is:

$$GTC = \sum_{i=1}^{m} \left( TC(S_{i}) \cdot \frac{area(S_{i})}{\sum_{j=1}^{m} area(S_{j})} \right) \tag{7}$$

The calculation formulas for the individual components are as follows:

$$\begin{aligned} OC(S_{i}) &= 1 - \frac{area(S_{i} \cap O_{i})}{area(O_{i})} \\ UC(S_{i}) &= 1 - \frac{area(S_{i} \cup O_{i})}{area(O_{i})} \\ TC(S_{i}) &= \sqrt{\frac{OC(S_{i})^{2} + UC(S_{i})^{2}}{2}} \end{aligned} \tag{8}$$

where $S_{i}$ ($i = 1, 2, \dots, m$) represents the predicted parcels from the model results, and $O_{i}$ ($i = 1, 2, \dots, n$) denotes the ground truth reference parcels.

3.3.3. Geometric Level Metrics

The PoLiS metric is a geometric similarity measure proposed by Avbelj et al., designed to quantitatively assess the geometric correspondence between two closed polygons [36]. This metric overcomes the sensitivity of traditional pixel-level metrics to geometric transformations such as rotation, translation, and scaling, while satisfying the mathematical properties of non-negativity, symmetry, and the triangle inequality. It is widely applied in tasks such as building footprint extraction and map vectorization, where a lower value indicates higher geometric consistency between polygons. Its core mechanism establishes a symmetric measure by calculating the bidirectional average nearest distance from vertices to boundaries.

Given a predicted polygon A (containing q vertices) and a reference polygon B (containing r vertices), the PoLiS distance is defined as:

\mathrm{PoLiS}(A, B) = \frac{1}{2q} \sum_{a_{j} \in A} \min_{b \in \partial B} \| a_{j} - b \| + \frac{1}{2r} \sum_{b_{k} \in B} \min_{a \in \partial A} \| b_{k} - a \|

(9)
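Equation (9) can be evaluated directly from the vertex lists of the two polygons. The sketch below is a plain-Python illustration, not the implementation used in the experiments; polygons are assumed to be lists of (x, y) vertices without a repeated closing vertex, and the helper names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(p, polygon):
    """Minimum distance from p to the closed boundary of a polygon."""
    n = len(polygon)
    return min(
        point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )


def polis(poly_a, poly_b):
    """PoLiS distance: bidirectional mean vertex-to-boundary distance."""
    term_a = sum(distance_to_boundary(a, poly_b) for a in poly_a) / (2 * len(poly_a))
    term_b = sum(distance_to_boundary(b, poly_a) for b in poly_b) / (2 * len(poly_b))
    return term_a + term_b


# A unit square and the same square shifted by 0.5 along x.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(0.5, 0.0), (1.5, 0.0), (1.5, 1.0), (0.5, 1.0)]
d = polis(square, shifted)
```

For this pair, two vertices of each square lie on the other square's boundary and the remaining two are 0.5 away, so each directional term contributes 0.125.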

3.4. Implementation Details

The remote sensing image data were arranged in red-green-blue band order and compressed to unsigned 8-bit integer depth. To mitigate gradient swings during training and normalize the model input, a 2% linear stretch was applied to all images. Specifically, the image values were stretched to the range [0, 1] and then normalized with means of 0.3722, 0.4498, and 0.3964 and standard deviations of 0.1220, 0.1139, and 0.1064 for the red, green, and blue bands, respectively. Data augmentation (DA) strategies were implemented to enhance generalization ability and robustness. In this paper, we employed a unified set of DA strategies, including random spectral band exchange, color jittering, random clipping, random rotation, and random zooming. Model training used the PyTorch (version 1.13.1) deep-learning framework on an NVIDIA Tesla V100 GPU with 32 GB of memory. The backbone parameters were initialized from the officially released models pre-trained on a massive dataset (e.g., ImageNet [37]), and the remaining parameters of the custom blocks were initialized via the Kaiming initialization method [38]. The optimizer used in this paper was AdaMax [39] with the momentum set to 0.9 and weight decay set to

5 \times 10^{- 6}

. The batch and patch sizes were set to 48 and 512, respectively, according to the GPU memory, and the learning rate was set to

1 \times 10^{- 4}

initially, then adjusted by a cosine scheduler per epoch. In addition, we applied a softmax with temperature [40], which is formulated as:

p_{i} = \frac{\exp(Z_{i} / T)}{\sum_{j} \exp(Z_{j} / T)}

(10)

where p denotes the model output; Z denotes the input feature map; the subscript represents the channel index; and T denotes the temperature coefficient. As T approaches 0, the output distribution approaches the argmax (one-hot) distribution. This procedure enlarges the separation among categories, providing a clearer gap between false-positive regions and the decision boundary, thereby improving model training.
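The effect of the temperature coefficient in Equation (10) can be seen in a small numerical sketch. The logits and temperature values below are arbitrary illustrations (the value of T used in the experiments is not stated here), and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over channel logits Z, scaled by a temperature T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative per-channel logits for one pixel
p_standard = softmax_with_temperature(logits, temperature=1.0)
p_sharp = softmax_with_temperature(logits, temperature=0.5)
```

Lowering T concentrates more probability mass on the largest logit, which is the sharpening behavior described above.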

Due to the large spatial dimensions of remote sensing data and GPU memory limitation, an image is generally predicted patch by patch. According to research on the receptive field of CNNs [41], the effectiveness of a single predicted patch decays from the center to the edge; this is caused by the stacking of convolutional layers working on fixed-size images. Therefore, an overlapping shifting window is applied to yield patches by cropping the contiguous central part as the output only during the model inference process. In practice, the patch size for the deep-learning model was set as 512, the same as for the training phase, and the overlapping step was set as 64, which is equivalent to cropping the central block with a size of 384 as the output.
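The overlapping inference scheme can be sketched as a generator of read and write windows. This is one interpretation of the setting described above, assuming a margin of 64 pixels is discarded on every side of each 512 × 512 patch so that only the central 384 × 384 block is written; read windows may extend beyond the image, in which case the caller is assumed to pad the input (for example by reflection). The function name is hypothetical.

```python
def sliding_windows(height, width, patch=512, margin=64):
    """Yield (read_box, write_box) pairs as (top, left, bottom, right).

    read_box: the patch fed to the model (may extend past the image,
              so the caller must pad the input by `margin`).
    write_box: the central block of the prediction that is kept,
               clipped to the image extent.
    """
    core = patch - 2 * margin  # 384 for patch=512, margin=64
    for top in range(0, height, core):
        for left in range(0, width, core):
            read_box = (
                top - margin,
                left - margin,
                top - margin + patch,
                left - margin + patch,
            )
            write_box = (
                top,
                left,
                min(top + core, height),
                min(left + core, width),
            )
            yield read_box, write_box


windows = list(sliding_windows(1000, 800))
```

Because the write windows tile the image without gaps or overlap, every output pixel comes from the central part of exactly one patch.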

4. Results and Analysis

For all experiments in this section, a single, unified model was trained on the combined training data from all three provinces (HLJ, SD, ZJ) and subsequently tested independently on each province’s test set. This protocol was deliberately chosen to evaluate cross-region generalization: the model must learn to handle diverse agricultural landscapes (large regular farms in HLJ, medium-scale dryland in SD, and small fragmented paddy fields in ZJ) within a single parameter set. The validation set was constructed as a random 20% split of the combined training tiles, stratified by province to ensure balanced monitoring. All comparison models (PSPNet, UNet, ResUNet, DeepLabV3+, BFINet, BsiNet, HBGNet, SEANet, REAUNet) follow the identical unified-model training protocol.

4.1. Quantitative Evaluation Based on Network Characteristics

This study conducted a quantitative evaluation of five mainstream semantic segmentation networks across three study areas characterized by distinct cropland distribution patterns (results are presented in Table 1). The experimental results demonstrate that different network architectures exhibit significant disparities in extracting various features.

U-Net and its variant, ResUNet, employ a classic encoder-decoder structure. As shown in Table 1, both models achieved suboptimal performance across various metrics. While they lagged behind the DLinkNet and PSPNet architectures, they outperformed DeepLabV3+. The U-Net family of networks utilizes skip connections to fuse deep and shallow features, thereby preserving spatial details to a certain extent; consequently, their boundary extraction capability is superior to that of DeepLabV3+, as indicated in Table 1. However, the introduction of residual blocks in ResUNet did not yield significant performance improvements. This suggests that in the absence of mechanisms to effectively expand the receptive field—such as atrous convolution or pyramid pooling—simply increasing network depth is insufficient to address the long-range semantic dependency problems inherent in complex cropland scenarios. This limitation leads to inferior internal consistency within parcels compared to PSPNet, which also utilizes a residual network backbone.

DeepLabV3+ exhibited the poorest performance in the experiments, particularly in boundary extraction tasks, yielding an average F1 score of only 0.171—significantly lower than the other models. From the perspective of network characteristics, although DeepLabV3+ employs the ASPP module, its decoder structure is relatively simple, typically relying on bilinear upsampling by a factor of 4 or 8 to restore resolution. This mechanism, when processing natural features with complex edges like cropland, is prone to the loss of boundary information and oversmoothing within the cropland range, resulting in a significantly lower Boundary IoU.

The quantitative evaluation results indicate that DLinkNet and PSPNet perform comparably. DLinkNet leverages cascaded atrous convolutions (D-Block) between the encoder and decoder to effectively expand the receptive field while maintaining feature map resolution. Consequently, it achieved the best scores in boundary extraction metrics across all study areas, notably reaching a boundary F1 of 0.276 in the Zhejiang study area. PSPNet aggregates feature information from four different scales via the Pyramid Pooling Module (PSPM). Quantitative results prove that such global contextual priors are crucial for recognizing large-scale cropland, enabling PSPNet to achieve optimal accuracy in extracting the extent of cropland areas.

4.2. Qualitative Analysis Combined with Visual Features

Combining the visualization results in Figure 5, Figure 6 and Figure 7, we further analyzed the mechanisms underlying the quantitative metrics from three dimensions: parcel integrity, boundary fineness, and adaptability to different study areas.

Local Features and Noise Suppression. As observed in Figure 5, Figure 6 and Figure 7, the extraction results from U-Net and ResUNet are often accompanied by fragmented holes and salt-and-pepper noise. While these defects are not prominent in the Heilongjiang study area, the performance of the U-Net family decays significantly as the test sites shift to SD and ZJ. In these areas, the farmland parcels (FP) in the images become smaller in scale with increased textural complexity. Consistent with the findings of Zhou et al., this degradation likely stems from the simple introduction of high-frequency background noise from shallow features, which causes a “semantic gap” and limits robustness in complex study areas.

Global Aggregation and Parcel Consistency. PSPNet and DLinkNet demonstrated optimal intra-class consistency across all study areas, with extracted parcels being well-filled and complete. Conversely, as indicated by the quantitative results, the simple decoder in DeepLabV3+ limits the efficient translation of semantic information into results. This leads to increased inter-class uncertainty in the qualitative results, characterized by a severe “smearing” effect and a lack of feature detail. In contrast, DLinkNet exhibited strong adaptability across study areas. Whether in the vast farms of HLJ or the fragmented paddy fields of ZJ, DLinkNet consistently maintained the highest level of parcel integrity. This is attributed to its cascaded atrous convolution (D-Block), which allows the model to capture the global context of the image while preserving feature resolution. Even in regions with severe local texture interference, global prior knowledge enforces semantic coherence within the parcels, effectively overcoming disturbances in complex areas. On the other hand, PSPNet showed a tendency for the internal filling of parcels to shrink as the parcel scale decreased.

Edge Preservation Mechanism and Spatial Sensitivity. The prominent advantage of DLinkNet lies in its sharper and more complete boundary recognition. Figure 5, Figure 6 and Figure 7 show that the FP boundaries extracted by DLinkNet are smooth, continuous, and exhibit fewer breaks. Compared to the edges extracted by PSPNet, DLinkNet produces narrower boundaries but captures more fine edges, extracting more boundaries relative to the ground truth. This cross-regional boundary preservation capability stems from the combination of DLinkNet’s unique encoder structure (which retains spatial information) and the central atrous convolution module (D-Block). By expanding the receptive field without losing spatial details due to excessive pooling—as is the case with PSPNet—DLinkNet demonstrates a stronger advantage than other models in high-precision boundary delineation tasks. In summary, DLinkNet possesses higher spatial sensitivity compared to PSPNet, whereas PSPNet excels in global semantic preservation.

Considering the characteristics of different network architectures and their performance in both quantitative and qualitative experiments, and balancing the distinct advantages of different models across various study areas, this paper selects DLinkNet and PSPNet as the two backbone models for subsequent research and application.

4.3. Comparison with State-of-the-Art Cultivated Land Optimization Models

To validate the superiority and reliability of the backbone networks selected in this paper (DLinkNet and PSPNet) in practical applications, we conducted comparative experiments against five State-of-the-Art (SOTA) networks proposed in recent years specifically for cultivated land or edge extraction, including BFINet [42], BsiNet [12], HBGNet [43], SEANet [3], and REAUNet [44].

As shown in Table 2, models such as SEANet and REAUNet adopt a Multi-task Learning (MTL) strategy, attempting to improve accuracy by jointly optimizing edge detection and region segmentation. However, experimental results indicate that this strategy exhibited significant volatility under default settings. The performance of multi-task models is critically dependent on the weight balancing of loss functions for each task. Particularly in the open-source code of SEANet, adjustments for learning rates and loss ratios are overly intricate and specific, making them difficult to reproduce quickly and directly.

HBGNet and BFINet employ the Pyramid Vision Transformer (PVT) [45,46] as their backbone. Although Transformers possess advantages when pre-trained on large-scale datasets, they exhibit clear limitations in the specific tasks of this study. CNN architectures inherently possess the inductive biases of “translation invariance” and “locality,” enabling them to quickly learn the texture features of cultivated land using less data. In contrast, Vision Transformers rely on self-attention mechanisms to establish global dependencies and lack these prior assumptions. As shown in Figure 8, the heatmaps of HBGNet and BFINet exhibit distinct grid artifacts and fragmented holes within parcels. This indicates that under limited training samples, the Transformer failed to effectively learn spatial smoothness constraints between pixels; instead, it overfitted to the positional embeddings of local patches. Consequently, despite being trained, the model failed to correctly aggregate contiguous cultivated land semantics. On the medium-scale remote sensing dataset used in this experiment, PVT under default settings struggled to reach optimal convergence, resulting in F1 scores (e.g., only 0.17–0.18 in the HLJ test site) far lower than the faster-converging CNN models.

Compared with SOTA models, the selected DLinkNet and PSPNet, despite their relatively simple structures, demonstrated superior training efficiency and robustness. Benefiting from ResNet’s powerful feature extraction capability and CNN’s structural priors, DLinkNet and PSPNet can rapidly adapt to new data distributions under default configurations, generating smooth and consistent prediction results. By adopting a Single-task Multi-class strategy, these models avoid gradient conflicts between multiple tasks. They focus on minimizing cross-entropy loss, ensuring that the internal coherence of parcels (FP Region IoU) far exceeds that of multi-task models disturbed by edge noise. Relying on mature CNN architectures and low sensitivity to hyperparameters, PSPNet and DLinkNet achieved high “out-of-the-box” performance. Combined with the topological post-processing method proposed in this paper, this “robust baseline + explicit constraint” scheme holds greater practical value than indiscriminate adoption of complex SOTA architectures.

4.4. Assessment of Proposed Post-Processing

Table 3 presents the vectorization accuracy of DLinkNet and PSPNet across three typical test sites following post-processing. In addition to conventional area-based metrics (Precision, Recall, F1), this study specifically introduces object-level error metrics (GOC, GUC, GTC) and a geometric similarity metric (PoLiS) to quantitatively evaluate the delineation quality of the vectorized parcels.

As shown in Table 3, both models achieved significant improvements in Precision, Recall, and F1 scores compared to the initial deep feature extraction results, demonstrating the substantial effectiveness of the post-processing method proposed in this paper.

From the perspective of model performance, the two networks exhibit contrasting trends in GOC and GUC metrics. Across all test sites, PSPNet achieved lower under-segmentation error (GUC), while DLinkNet achieved lower over-segmentation error (GOC). This implies that PSPNet tends to subdivide parcels into more fragmented units. However, as illustrated in Figure 9c,e,f, the parcel units delineated by DLinkNet appear more fragmented visually. Further analysis reveals that this discrepancy between quantitative and qualitative evaluations stems from the difference in boundary thickness. PSPNet typically predicts thicker boundaries than DLinkNet. When processed by the raster-to-vector module, these thicker boundaries lead to a more pronounced “shrinkage” effect in the final polygons, artificially inflating the apparent subdivision in the metrics, thus leading to the counter-intuitive manifestation of the GOC metric.

Regarding the test sites, in the ZJ region where parcel scales are minimal and distribution is dense (Figure 9e), the thick boundaries extracted by PSPNet caused severe parcel shrinkage. Conversely, in the SD and HLJ regions (Figure 9a–d), due to the larger parcel scales, this shrinkage did not significantly compromise the post-processing results. Meanwhile, since the model training samples were constructed from natural parcel annotations, the degree of subdivision in PSPNet’s results is closer to that of the ground truth data. As shown in Figure 9c, this aligns with the previous quantitative and qualitative evaluations of deep feature extraction: PSPNet is better at capturing and preserving global features and is less susceptible to the influence of intra-class boundary textures within large parcels. In contrast, DLinkNet tends to extract more intra-class boundaries, resulting in a higher number of detected boundaries.

As indicated by the comprehensive GTC assessment, PSPNet demonstrates comparable or even superior performance to DLinkNet in the Heilongjiang and Shandong regions, where Farmland Parcel (FP) scales are large and distributions are regular. However, as the test site shifts to the Zhejiang region, DLinkNet exhibits an absolute performance advantage.

5. Discussion

5.1. Representation Consistency and Application Potential

Dilating boundaries in raster space helps balance sample proportions and reduces the occurrence of boundary fragmentation, thereby optimizing model training [47]. This study employs the same boundary dilation strategy; however, experimental results indicate that dilated boundaries significantly impact model learning behaviors. This section evaluates the stability and potential of this representation through ablation experiments with different buffer sizes.
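The way label dilation shifts the class balance can be illustrated with a small pure-Python sketch. The square structuring element and the reading of the buffer size as a per-side radius in pixels are assumptions made for the example; the function names are hypothetical.

```python
def dilate(mask, size):
    """Dilate a binary mask (list of 0/1 rows) with a (2*size+1) square element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                # Mark every pixel within `size` of a boundary pixel.
                for rr in range(max(0, r - size), min(h, r + size + 1)):
                    for cc in range(max(0, c - size), min(w, c + size + 1)):
                        out[rr][cc] = 1
    return out


def boundary_fraction(mask):
    """Share of pixels labeled as boundary."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))


# A 9 x 9 label tile with a one-pixel-wide vertical boundary in column 4.
label = [[1 if c == 4 else 0 for c in range(9)] for _ in range(9)]
thin = boundary_fraction(label)
buffered = boundary_fraction(dilate(label, 1))
```

In this toy tile the boundary class grows from one ninth to one third of the pixels after a one-pixel buffer, which illustrates why dilation eases class imbalance while also encoding a nominal boundary width in the labels.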

5.1.1. Statistical Analysis of Boundary Width Alignment

As the foundation of the post-processing Double-Line Delineation (DLD) module, the stability of boundary identification determines the module’s flexibility and scalability. The DLD module proposed in this paper essentially hollows out FP (Farmland Parcel) boundaries that exceed a preset width threshold and subsequently reconstructs them as double-line boundaries via skeletonization. To enhance the stability and scalability of double-line identification, the binarized FP boundary width obtained by the deep learning model is required to possess a certain degree of stability and align with the width of real ground features.

Targeting the FP boundary identification results, we utilized an auxiliary point dataset to statistically analyze the boundary widths obtained by the models. Specifically, the difference between the binarized boundary width extracted by the deep learning model at each point and the manually annotated reference width was calculated. These differences were grouped according to the real boundary widths, and the mean width difference for each group was calculated, yielding the results shown in Figure 10.
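The grouping step described above amounts to a binned mean of signed width differences. The sketch below assumes bin edges at 5 m and 10 m, chosen to mirror the width ranges discussed in this section, and uses made-up sample values; the function names and data are hypothetical.

```python
from collections import defaultdict


def bin_label(width_m, edges=(5.0, 10.0)):
    """Return the label of the reference-width interval containing width_m."""
    lower = 0.0
    for upper in edges:
        if width_m <= upper:
            return f"({lower:g}, {upper:g}]"
        lower = upper
    return f"> {lower:g}"


def mean_width_difference(samples, edges=(5.0, 10.0)):
    """Mean (predicted - reference) width per reference-width group.

    samples: iterable of (reference_width_m, predicted_width_m) pairs,
             one per auxiliary point.
    """
    groups = defaultdict(list)
    for ref_w, pred_w in samples:
        groups[bin_label(ref_w, edges)].append(pred_w - ref_w)
    return {label: sum(d) / len(d) for label, d in groups.items()}


# Illustrative values only: (reference width, predicted width) in meters.
samples = [(3.0, 5.0), (4.0, 5.0), (8.0, 8.0), (12.0, 9.0)]
result = mean_width_difference(samples)
```

A positive group mean indicates that predicted boundaries are wider than the reference, and a negative mean indicates underestimation.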

Observing the overall trend, as the training label Buffer Size increases from no dilation to 4, the boundary widths predicted by both models exhibit varying degrees of growth. This indicates that applying morphological dilation to training labels effectively forces the network to learn wider edge features, thereby compensating for boundary information loss caused by downsampling during the inference stage. However, this growth is not a simple linear superposition but presents a significant non-linear coupling relationship with the real scale of ground features.

For narrow ridges or paths with a width of less than 5 m, the model predicted values are generally higher than the ground truth even under the w/o (without dilation) setting. This is primarily limited by the minimum feature resolution of the Convolutional Neural Network (CNN) and the interpolation smoothing effect during the feature map upsampling process, making pixel-level thin lines inevitably appear “thickened.” As the Buffer Size increases, this overestimation intensifies. For roads or main canals with a width exceeding 10 m, under low Buffer Size settings, the models—especially PSPNet—tend to identify them as two independent edges or produce severe “centerline breaks,” resulting in an equivalent width significantly lower than the true value. In this case, a larger Buffer Size (e.g., Size 3 or 4) effectively minimizes the width deviation, aligning it closer to zero.

Benefiting from the preservation of multi-scale spatial details by its cascaded dilated convolutions (D-Block), DLinkNet (blue line) demonstrates stronger linear sensitivity to changes in Buffer Size. Particularly in the topographically complex ZJ test site (Row 3, Column 3), when the Buffer Size increases from 1 to 3, DLinkNet’s width error rapidly converges to 0. This proves that DLinkNet demonstrates superior geometric fidelity when processing edge details and can precisely respond to geometric changes in training signals. In contrast, PSPNet (red line) shows greater fluctuation. In the

(5, 10]

interval of the ZJ test site (Row 2, Column 3), PSPNet exhibits abnormal overestimation. This is consistent with the previous analysis: PSPNet relies on the Pyramid Pooling Module (PSPM) to aggregate global features. When facing medium-scale boundaries in Zhejiang with fragmentation and complex textures, global priors tend to cause feature “smearing”—that is, over-filling boundary regions to maintain connectivity—resulting in predicted widths far exceeding actual requirements.

In summary, although PSPNet dominates in global semantic consistency, DLinkNet demonstrates stronger plasticity in the fine control of boundary width due to its superior spatial feature preservation capabilities. Consequently, PSPNet is more suitable for NFP (Non-Farmland Parcel) extraction in plain areas, while DLinkNet is better suited for extracting fragmented FPs or more detailed CFPs (Cultivated Farmland Parcels).

5.1.2. Visualization Analysis of Boundary Features

Based on the observational results, we present the outputs of the deep learning models corresponding to representative reference points and the binarized results on boundary perpendiculars.

(a) Clear Road Boundaries. As shown in Figure 11, all five models output corresponding boundary simulation results in this scenario. PSPNet produces the widest boundaries in all cases, while the other models show roughly comparable widths. When the buffer size varies between 0–2 pixels, the DLinkNet model maintains high consistency in width retention. However, when the buffer size exceeds 2 pixels, the contrast of the simulated values output by the deep learning model decreases. This reflects that all segmentation models can achieve stable recognition when facing boundary targets with strong separability.

(b) Road Boundaries with Wide Shoulders and Tree Occlusion. Farmland roads are often accompanied by wide shoulders or roadside trees, elements that increase boundary width and semantic complexity. As shown in Figure 12, such boundaries affect most models. In Figure 12a, all models except InceptionV3+ identify roadside trees as boundary categories, with PSPNet still exhibiting a wide-boundary recognition mode. In Figure 12b, ResUNet fails identification in all cases except at a buffer size of 3 pixels. Only DLinkNet and PSPNet maintain stable recognition.

(c) Ditch/Canal Boundaries. Ditch boundaries are an important category of natural FP boundaries requiring accurate identification. Compared to roads, ditches have low spectral separability from surrounding features. Their low spectral values, irregular shapes, and surrounding features (weeds, culverts) increase recognition complexity. Figure 13a shows that among models trained with different buffer boundary data, only DLinkNet and PSPNet can identify the ditch boundaries at the target point under all conditions, with PSPNet obtaining the most stable boundary width. When the buffer size is >2 pixels, DLinkNet’s recognition result for such boundaries tends to be wide. In Figure 13b, the contrast near the boundary is higher than in (a), so multiple models successfully identify the boundary. The resulting boundary width behavior is consistent with the previous cases. It is worth noting that the activation values of DLinkNet, InceptionV3+, and UNet fluctuate on the south side of the road at key points, indicating that these three models have high sensitivity to color changes on the south side of the road.

(d) Other Boundary Types. Figure 14 presents two other typical boundary cases. Figure 14a shows a continuous transition between farmland, roads, and towns; all five models successfully identify the road, with boundary widths similar to the previous examples. In Figure 14b, two FPs are separated by trees with wide shadow zones. Here, the binary activations of DLinkNet and PSPNet on the shadow side of the trees appear continuous and relatively narrow, whereas InceptionV3+ and ResUNet show discontinuities. Such discontinuities are tolerable because, during the TRC (Topological Reconstruction) process, narrow regions within rough boundaries are assimilated into the boundary region, ensuring valid initial skeletonization.

In summary, the deep learning representations adopted in this paper can stably and accurately identify diverse boundaries while obtaining highly stable boundary widths. Regarding the selection of buffer sizes for boundary categories in training samples, both statistical and observational results indicate that when the buffer size is >2 pixels, the boundary recognition width of each model is prone to fluctuation.

5.2. Feasibility of Post-Processing

5.2.1. Ablation Study on Post-Processing Functions

When using segmentation networks to extract the boundaries of closed, blocky targets, boundary fragmentation is inevitable. Furthermore, boundary pixels in raster segmentation results possess inherent width, which, due to 8-connectivity, often leads to erroneous merging (adhesion) with adjacent boundaries. This triggers topological errors such as unclosed FPs or false connections. Conventional approaches typically involve segmenting FPs using the extracted boundaries and independently labeling them before raster-to-vector conversion.

The post-processing approach in this paper prioritizes constructing FP boundaries and uses FP regions as seeds to execute the watershed algorithm. During the FP boundary construction process, TRC is employed to construct the boundary map of the input data, upon which DLD and DLE (Double-Line Extension) can be selectively applied. To systematically evaluate the efficacy of these post-processing modules, we analyzed both qualitative visual results across selected sub-regions (Figure 15) and quantitative metrics across different test images and models (Table 4).

The third column of Figure 15 demonstrates that TRC, combined with polygonization, directly converts deep features into polygon FP results. Even when FP region identification is incomplete, its role as a seed for the watershed algorithm remains effective, successfully constructing valid FP objects. Quantitatively, as shown in the Baseline configuration of Table 4, using TRC alone achieves high Precision (e.g., 0.9391 for DLinkNet in HLJ) but relatively lower Recall. This indicates that while the delineated polygons are highly accurate, rough boundary representations leave complex or ambiguous structures unresolved. Regarding the underlying models, because the boundaries identified by PSPNet are coarser than those of DLinkNet, its overall segmentation and reconstruction metrics are slightly inferior.

Incorporating the DLD module significantly mitigates the topological ambiguity caused by inherent boundary width. Visual analysis confirms that double-line boundaries, such as roads and riverbeds, are accurately reconstructed and perform stably after using DLD with a threshold of

w = 16

. Table 4 corroborates this: applying DLD systematically boosts Recall and F1 scores across all areas and models. For example, DLinkNet’s Recall in the SD area increases from 0.8655 to 0.9137, proving that DLD successfully reclaims valid FP areas that were previously misclassified due to boundary thickness.

Conversely, applying the DLE module targets broken or unclosed boundaries. Figure 15 (Case II) shows that a large number of broken lines and dangling lines, such as obscure dirt roads, are effectively extended to form complete sub-polygons. The metrics in Table 4 reflect this aggressive geometric repair: using DLE yields a substantial surge in Recall (reaching 0.9445 for DLinkNet in HLJ) but induces a noticeable drop in Precision. It is important to note that protrusions on existing boundaries may form vertical line segments after skeletonization, triggering unnecessary extensions by DLE. In practice, these short artifact lines can be filtered by setting a length threshold.

Ultimately, the synergistic application of both modules (w/Both) yields the optimal topological balance. Table 4 demonstrates that combining DLD and DLE produces the highest F1 and IoU scores in nearly all scenarios (e.g., an F1 of 0.9218 for DLinkNet in SD). Furthermore, the geometric PoLiS metric noticeably decreases (improving from 118.57 to 114.66 in HLJ), verifying that the final vector shapes more closely adhere to accurate human-level digitization.

The above results indicate that the TRC method proposed in this paper is significantly effective in constructing FP objects from rasterized features. Compared to existing methods, TRC offers vital flexibility in considering boundary width and direction. By coupling stable deep learning feature extraction with the targeted topological repairs of DLD and DLE, this framework ensures accurate, scalable, and geometrically robust FP object construction.

5.2.2. Comparison of Polygon FP Conversion

Many studies employ OWT-UCM to obtain closed hierarchical boundary maps from boundary maps extracted by CNNs [20]. We applied OWT-UCM to the boundary map extracted by DLinkNet (Figure 16). Based on the ultrametric contour map (Figure 16c), thresholds of 0.7, 0.8, and 0.9 were used to merge blocks enclosed by hierarchical boundaries to obtain the highest possible merging level (Figure 16d). Theoretically, OWT-UCM excels at generating over-segmented results to mitigate discontinuous boundary detection, subsequently repairing over-segmentation through hierarchical merging. However, for the deep learning representations proposed in this paper, OWT-UCM fails to construct double-line boundaries. Compared to the OWT-UCM results (red boxes in Figure 16c,e), the double-line boundaries and adjacent boundaries generated by our method are more reasonable and maintain a good morphology for each FP.

5.3. Parameter Sensitivity Analysis and Robustness

To evaluate the robustness of the proposed post-processing framework and analyze the impact of empirical parameter settings, we conducted systematic sensitivity analyses for the core parameters of the DLD and DLE modules, namely the width threshold (

width\_thresh\_m

) and the maximum extension length (

E_{max}

). Since our deep learning-based boundary representation encodes boundary width implicitly, parameter sensitivity is treated as a first-class evaluation dimension.

5.3.1. Sensitivity of Dangling Line Extension (DLE)

Figure 17 (right) illustrates the performance variation of DLinkNet and PSPNet across different

E_{max}

values (ranging from 5 m to 150 m). The visual evidence indicates that the F1-score and IoU achieve their optimal values at an extremely conservative extension length (e.g., 5 m). A sharp performance drop occurs as

E_{max}

increases from 5 m to 10 m, after which the metrics stabilize with only marginal fluctuations up to 150 m. This trend reveals a critical geometric trade-off: although the DLE module is designed to bridge broken boundaries, larger extension lengths severely elevate the risk of “over-extension.” Spurious extensions can incorrectly bisect otherwise intact field parcels, leading to severe over-segmentation. This phenomenon is explicitly reflected in the abrupt initial drop in both Recall and Precision. As

E_{max}

increases beyond 10 m, Precision shows a negligible recovery, indicating that a few large genuine gaps are successfully bridged, while Recall continues to decline gradually due to the compounding effect of spurious bisections. Therefore, a highly conservative extension threshold (e.g., ≤5 m) is recommended to effectively repair minor fractures while strictly preserving topological integrity.

5.3.2. Sensitivity of Double-Line Detection (DLD)

Figure 17 (left) demonstrates the effect of $\mathit{width\_thresh\_m}$ on extraction accuracy. As the width threshold increases from 2.5 m to 20.0 m, both the F1-score and Recall improve monotonically, while Precision exhibits a continuous, gradual decline. This phenomenon arises because lower thresholds make the DLD module highly aggressive, erroneously converting relatively narrow linear features into dual-line boundaries. While this strict hollowing mechanism reduces false-positive parcel areas (yielding higher Precision), it excessively erodes the effective area of genuine parcels (yielding significantly lower Recall). Conversely, higher thresholds preserve more valid parcel area but may fail to hollow out genuinely wide geographical features, such as main roads or arterial canals. The highly consistent trends across both DLinkNet and PSPNet confirm that this geometric trade-off is governed intrinsically by the topological rules of the DLD module itself, independent of the specific CNN backbone. We recommend $\mathit{width\_thresh\_m} \in [10, 20]$ m as a stable operating range that balances structural preservation and topological hollowing.

5.3.3. Cross-Resolution and Cross-Landscape Generalizability

A critical design feature is that all core parameters in the framework are defined in physical units (meters) rather than as fixed pixel counts. During inference, the framework automatically maps these physical parameters to pixel distances based on the Ground Sampling Distance (GSD) of the input imagery. For example, when GSD = 1 m/pixel, $E_{max} = 50$ m maps to 50 pixels; when GSD = 0.5 m/pixel, it maps to 100 pixels. This physical-unit parameterization gives the TopoFP framework inherent scale invariance and cross-resolution robustness. In practical deployment, these physical thresholds provide intuitive operational guidance for regional fine-tuning: in highly fragmented landscapes (e.g., Zhejiang), a smaller $E_{max}$ can be applied to prevent topological interference; in large-scale, regularly distributed agricultural regions (e.g., Heilongjiang), a larger $\mathit{width\_thresh\_m}$ can be adopted to better preserve macroscopic field boundary structures.
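The meter-to-pixel mapping described in this subsection can be sketched in a few lines. This is a minimal illustration rather than the TopoFP implementation: the function name `meters_to_pixels` and the round-to-nearest policy (with a floor of one pixel) are assumptions; only the division by the GSD comes from the text.

```python
def meters_to_pixels(length_m: float, gsd_m_per_px: float) -> int:
    """Convert a physical length in meters to a pixel count for a given GSD."""
    if gsd_m_per_px <= 0:
        raise ValueError("GSD must be positive")
    # Rounding to the nearest pixel, with a minimum of 1, is an assumption
    # of this sketch and not a documented TopoFP behavior.
    return max(1, round(length_m / gsd_m_per_px))


e_max_m = 50.0
px_at_1m = meters_to_pixels(e_max_m, 1.0)       # GSD = 1 m/pixel
px_at_half_m = meters_to_pixels(e_max_m, 0.5)   # GSD = 0.5 m/pixel
print(px_at_1m, px_at_half_m)
```

Keeping the thresholds in meters and converting once per image means the same configuration can be reused across sensors with different resolutions.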

5.3.4. Skeletonization Limitations

We acknowledge that the TRC module's reliance on skeletonization introduces certain inherent limitations. As with any discrete skeletonization algorithm operating on raster data, the Zhang-Suen algorithm can produce minor pixel-level jitter (staircase artifacts) along diagonal or curved boundary segments. However, the final Douglas-Peucker simplification step (Section 3.2.4) smooths these irregularities, producing geometrically smooth vector boundaries. This two-stage approach (raster skeletonization → vector simplification) leverages the efficiency of raster processing while mitigating its discretization artifacts. For asymmetric boundary predictions, where the CNN produces thicker activation on one side of the true boundary, the skeleton may exhibit a slight centerline offset; this is an inherent property of the input feature quality rather than of the skeletonization algorithm itself. Our experiments confirm that DLinkNet's superior spatial resolution preservation (via cascaded dilated convolutions) yields more symmetric boundary activations and therefore better-centered skeletons than PSPNet. It is also important to note that the Zhang-Suen algorithm is applied specifically after DLD processing, ensuring that only true single-line boundaries undergo skeletonization; the DLD module removes double-line structures beforehand, preventing the topological collapse of wide features into single lines.
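To illustrate how a simplification step can remove staircase jitter, the following is a minimal pure-Python sketch of the Douglas-Peucker algorithm applied to a diagonal staircase polyline. It is not the code used in TopoFP; the function names, the tolerance values, and the use of point-to-segment distance (a common variant of the perpendicular distance to the infinite line) are assumptions of this sketch.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all are (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Recursively drop vertices that lie within `tolerance` of the chord."""
    if len(points) < 3:
        return list(points)
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= tolerance:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: idx + 1], tolerance)
    right = douglas_peucker(points[idx:], tolerance)
    return left[:-1] + right  # the split vertex appears once


# A one-pixel staircase along a diagonal collapses to a single segment,
# while a genuine right-angle corner is preserved.
staircase = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3)]
corner = [(0, 0), (5, 0), (5, 5)]
smoothed = douglas_peucker(staircase, tolerance=1.0)
kept = douglas_peucker(corner, tolerance=1.0)
print(smoothed, kept)
```

The tolerance controls the trade-off: a value near one pixel removes raster jitter, whereas much larger values would start to cut real parcel corners.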

5.4. Computational Efficiency

A practical concern for any post-processing pipeline is computational overhead. We therefore report the inference time of each module in the TopoFP pipeline. A key design insight is that TRC reduces the raster representation (millions of pixels) to a lightweight graph structure (hundreds to thousands of nodes and edges). All subsequent operations (DLD, DLE, and vectorization) operate on this compact graph rather than on the full raster. Consequently, the post-processing overhead is minimal relative to CNN inference.

Quantitative timing results per 512 × 512 patch on an NVIDIA Tesla V100 GPU are as follows: CNN inference (DLinkNet), ∼45 ms; TRC (skeletonization + keypoint detection + graph construction), ∼4.2 ms (all GPU-accelerated via PyTorch tensor operations, including torch.nn.Unfold for keypoint detection); DLD (distance transform + dual-line classification + removal), ∼1.8 ms; DLE (dangling identification + ray walking), ∼2.1 ms; and vectorization (watershed + polygonization + Douglas-Peucker simplification [32]), ∼3.5 ms. The total post-processing overhead is approximately 11.6 ms, roughly 26% of the CNN inference time, and the end-to-end TopoFP pipeline requires ∼56.6 ms per patch.
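The totals quoted above follow directly from the per-module timings. The short check below reproduces the arithmetic; the variable names are illustrative, and the numbers are the approximate per-patch values reported in the text.

```python
# Approximate per-patch timings (ms) as reported for a 512 x 512 patch.
cnn_ms = 45.0
post_ms = {"TRC": 4.2, "DLD": 1.8, "DLE": 2.1, "vectorization": 3.5}

post_total_ms = sum(post_ms.values())        # post-processing overhead
overhead_share = post_total_ms / cnn_ms      # relative to CNN inference
end_to_end_ms = cnn_ms + post_total_ms       # full pipeline per patch

print(f"post-processing: {post_total_ms:.1f} ms")
print(f"share of CNN time: {overhead_share:.0%}")
print(f"end-to-end: {end_to_end_ms:.1f} ms")
```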

Three factors contribute to this efficiency. First, the entire post-processing chain is implemented on GPU using PyTorch tensor operations, eliminating CPU-GPU data transfer bottlenecks. Second, after TRC, the complexity of DLD and DLE scales with the number of boundary segments ($O(|E|)$) rather than with image resolution ($O(H \times W)$). Third, the compact graph representation after TRC (typically on the order of hundreds of nodes and edges per patch) enables near-instantaneous graph operations. These design choices ensure that TopoFP remains computationally practical for large-area agricultural monitoring applications.

6. Conclusions

This paper addresses the critical challenges of boundary fragmentation and topological errors in Field Parcel (FP) extraction from high-resolution remote sensing imagery by proposing an end-to-end vectorized extraction framework. This framework effectively integrates deep learning feature representation with explicit topological control. We constructed a topology-first FP vectorization method that achieves automatic defect repair based on geometric rules. Furthermore, this study elucidates the coupling mechanism between the implicit boundary width encoding of CNNs and the geometric properties of agricultural features.

Distinct from traditional pixel-level post-processing, the proposed Topological Relationship Construction (TRC) method innovatively transforms raster features into a point-line topological network endowed with geometric attributes. This method not only enables the automated conversion from probability maps to vector entities but also intrinsically guarantees spatial adjacency relationships between parcels during the extraction process, effectively resolving geometric structural integrity issues in complex agricultural scenarios.

Through a comparative study of five mainstream CNN architectures, we found that models capable of preserving spatial resolution (such as DLinkNet with its cascaded dilated convolutions) exhibit significant advantages in boundary refinement and continuity. Experiments confirmed that an appropriate boundary buffering strategy (e.g., 2 pixels) effectively balances sample class weights, enabling the model to learn highly adaptable boundary width features, thus providing a stable foundation for subsequent topological correction. Conversely, while multi-scale feature fusion mechanisms may sacrifice some spatial details, they successfully preserve the global semantic consistency of targets. This provides theoretical support for the future integration of these complementary mechanisms.

The proposed Double-Line Detection (DLD) and Dangling Line Extension (DLE) algorithms successfully utilize width information implicitly encoded in deep features and geometric collinearity rules to automate the differentiation of single/double-line boundaries and the repair of broken lines within a unified topological graph. Experimental results demonstrate that while ensuring high accuracy (F1 score of 0.910), the generated vector results align closely with manual digitization quality in terms of object-level integrity and geometric compliance.

In summary, this study not only provides a high-precision, automated tool for field parcel production but also verifies the feasibility and superiority of combining “data-driven feature learning” with “rule-driven topological constraints.” Future work will focus on optimizing the adaptability of the skeletonization algorithm to complex boundary morphologies and exploring the introduction of richer geometric descriptors. The ultimate goal is to model target features using more explicit geometric representations, thereby further enhancing the framework’s robustness in extremely fragmented cultivated land scenarios.

Author Contributions

Y.D.: Conceptualization, Methodology, Software, Data collection, Data processing and analysis, Visualization, Writing – original draft, Writing – review & editing. R.Z.: Conceptualization, Funding, Writing – original draft & editing. X.X.: Conceptualization, Supervision, Writing – review & editing. J.Z.: Data collection, Writing – review & editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. U2542206, 42505137 and 42505002).

Data Availability Statement

The training and testing datasets, ground truth and auxiliary dataset supporting the findings of this study are openly available in the Zenodo repository at https://doi.org/10.5281/zenodo.17953919 [48]. All modeling and plotting code and the core implementation of TopoFP are openly available at https://github.com/dymwan/TopoFP (accessed on 20 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

Algorithm A1 Keypoint Detection via 3 × 3 Sliding Window.

Require: Binary skeleton $S \in \{0, 1\}^{H \times W}$
Ensure: Point map $P \in \{0, 1, 2\}^{H \times W}$ (0: background, 1: cross-point, 2: end-point)

1: $S_{T} \leftarrow ToTensor(S)$ ▹ to GPU
2: $P \leftarrow 0^{H \times W}$
3: $HeloIdx \leftarrow [0, 1, 2, 5, 8, 7, 6, 3, 4]$ ▹ clockwise spiral order from top-left
4: $Unfold \leftarrow nn.Unfold(3, 1, 1)$
5: $Neighbors \leftarrow Unfold(S_{T}) \in \mathbb{R}^{1 \times 9 \times (HW)}$
6: $Helos \leftarrow Neighbors[:, HeloIdx, :] \in \mathbb{R}^{1 \times 9 \times (HW)}$
7: for $i \leftarrow 1$ to 8 do
8:     $Helos[:, i, :] \leftarrow Helos[:, i, :] - Helos[:, i - 1, :]$
9: end for
10: $Helos[Helos = -1] \leftarrow 0$
11: $HeloCt \leftarrow Sum(Helos) \in \mathbb{R}^{H \times W}$ ▹ number of transitions
12: $WinSum \leftarrow Sum(Neighbors) \in \mathbb{R}^{H \times W}$ ▹ count of foreground pixels
13: $P[HeloCt \geq 3] \leftarrow 1$ ▹ cross-point: ≥3 branches
14: $P[(WinSum = 2) \land (S = 1)] \leftarrow 2$ ▹ end-point: only self + one neighbor
15: $P[S = 0] \leftarrow 0$
16: return $P$
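The transition-counting idea behind Algorithm A1 can be illustrated on the CPU with plain Python lists. This is a simplified sketch, not the GPU implementation: it walks the eight neighbours as a closed clockwise ring (an assumption; the listing above uses a nine-element ordering that ends at the centre pixel), counts 0→1 transitions to find cross-points, and treats a skeleton pixel with exactly one neighbour as an end-point (equivalent to a window sum of 2 including the pixel itself). The name `detect_keypoints` is hypothetical.

```python
# Clockwise ring of 8-neighbour offsets, starting at the top-left pixel.
RING = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def detect_keypoints(skel):
    """Label skeleton pixels: 0 = ordinary/background, 1 = cross-point, 2 = end-point."""
    h, w = len(skel), len(skel[0])

    def at(r, c):
        # Pixels outside the image count as background (zero padding).
        return skel[r][c] if 0 <= r < h and 0 <= c < w else 0

    points = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if skel[r][c] != 1:
                continue
            ring = [at(r + dr, c + dc) for dr, dc in RING]
            # ring[i - 1] wraps to the last element for i == 0, closing the ring.
            transitions = sum(
                1 for i in range(8) if ring[i] == 1 and ring[i - 1] == 0
            )
            if transitions >= 3:
                points[r][c] = 1      # three or more separate branches
            elif sum(ring) == 1:
                points[r][c] = 2      # only one neighbour: line end
    return points


# A plus-shaped skeleton: one crossing in the middle and four free ends.
plus = [[0] * 5 for _ in range(5)]
for k in range(5):
    plus[2][k] = 1
    plus[k][2] = 1
keypoints = detect_keypoints(plus)
for row in keypoints:
    print(row)
```

On real skeletons, neighbouring pixels around a junction can each satisfy the cross-point test, so a practical implementation would typically merge adjacent detections into one node.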

Appendix A.2

Algorithm A2 Dual-Line Detection (DLD).

Require: Boundary mask $B \in \{0, 1\}^{H \times W}$, Individual line labels $L \in \mathbb{N}^{H \times W}$, Topology graph $G = (N, L)$, Width threshold $\tau_{w}$ (meters), Resolution $r$ (m/pixel)
Ensure: Updated $B$ with dual lines removed

1: if $r \neq$ None then
2:     $\tau_{w} \leftarrow \tau_{w} / r$ ▹ convert meters → pixels
3: end if
4: $D \leftarrow DistanceTransform(B)$ ▹ Euclidean distance to non-boundary
5: for all line $l \in L$ with id $i$ do
6:     $pixels \leftarrow (L = i)$
7:     $\bar{w} \leftarrow Mean(D[pixels])$ ▹ mean boundary half-width
8:     if $\bar{w} \geq \tau_{w}$ then
9:         $MarkDual(l)$
10:         for all node $n \in l.linked$ do
11:             $MarkDual(n)$
12:         end for
13:     end if
14: end for
15: for all line $l \in L$ not yet marked dual do
16:     if $\forall n \in l.linked : n.is\_dual = True$ then
17:         $MarkDual(l)$ ▹ propagation
18:     end if
19: end for
20: $M_{dual} \leftarrow \bigcup_{l \in L_{dual}} (L = l.id)$ ▹ union of dual-line pixels
21: $D_{excess} \leftarrow (D \cdot M_{dual}) - \tau_{w}$
22: $D_{excess}[D_{excess} \leq 1] \leftarrow 1, \; D_{excess}[D_{excess} = -\tau_{w}] \leftarrow 0$
23: for all distinct $v \in D_{excess}$ in descending order, $v > 0$ do
24:     $D_{excess} \leftarrow Dilate_{3 \times 3}(D_{excess})$
25:     $Output[(D_{excess} \geq v) \land (Output = 0)] \leftarrow 1$
26: end for
27: $B[Output = 1] \leftarrow 0$ ▹ erase dual-line regions
28: return $Skeletonize(B)$
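The width-based classification in steps 1 to 9 of Algorithm A2 can be sketched on a toy raster as follows. This is an illustrative approximation only: it uses a brute-force Euclidean distance to the nearest non-boundary pixel instead of an optimized distance transform, it omits the node propagation and hollowing steps, and the function names (`half_width_at`, `dual_line_ids`) are hypothetical.

```python
import math


def half_width_at(mask, r, c):
    """Brute-force Euclidean distance from (r, c) to the nearest non-boundary pixel."""
    return min(
        math.hypot(r - br, c - bc)
        for br, row in enumerate(mask)
        for bc, v in enumerate(row)
        if v == 0
    )


def dual_line_ids(mask, labels, width_thresh_m, gsd_m_per_px):
    """Return ids of lines whose mean half-width reaches the threshold."""
    tau_px = width_thresh_m / gsd_m_per_px       # meters -> pixels
    sums = {}
    for r, row in enumerate(labels):
        for c, line_id in enumerate(row):
            if line_id > 0:
                total, count = sums.get(line_id, (0.0, 0))
                sums[line_id] = (total + half_width_at(mask, r, c), count + 1)
    return {i for i, (total, count) in sums.items() if total / count >= tau_px}


# Toy boundary mask: a 1-pixel-wide band (line 1) and a 5-pixel-wide band (line 2).
mask = [[0] * 8 for _ in range(11)]
labels = [[0] * 8 for _ in range(11)]
for c in range(8):
    mask[1][c] = 1
    labels[1][c] = 1                 # skeleton of the thin band
    for r in range(4, 9):
        mask[r][c] = 1
    labels[6][c] = 2                 # skeleton of the wide band

dual_at_1m = dual_line_ids(mask, labels, width_thresh_m=2.0, gsd_m_per_px=1.0)
dual_at_half_m = dual_line_ids(mask, labels, width_thresh_m=2.0, gsd_m_per_px=0.5)
print(dual_at_1m, dual_at_half_m)
```

In the toy example the wide band has a mean half-width of 3 pixels, so it is flagged when the threshold corresponds to 2 pixels but not when the same 2 m threshold corresponds to 4 pixels at a finer GSD.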

Appendix A.3

Algorithm A3 Dangling Line Extension (DLE).

Require: Skeleton $S \in \{0, 1\}^{H \times W}$, Min length $l_{min}$, Max extend length $l_{max}$, Device
Ensure: Extended skeleton $S^{'}$

1: ▹ Step 1: Topology reconstruction (TRC)
2: $P \leftarrow GetKeyPoints(S)$ ▹ Algorithm A1
3: $L \leftarrow SeparateLines(S, P)$ ▹ connected components after erasing cross-points
4: $G \leftarrow BuildGraph(L, P)$ ▹ incidence matrices + topology graph
5: ▹ Step 2: Identify dangling lines
6: $L_{dangling} \leftarrow \emptyset$
7: for all node $n \in G.N$ do
8:     if $|n.linked| = 1 \land G.lines[n.linked[0]].len \geq l_{min}$ then
9:         $L_{dangling} \leftarrow L_{dangling} \cup \{n.linked[0]\}$
10:     end if
11: end for
12: $M_{dang} \leftarrow \bigcup_{l \in L_{dangling}} (L = l.id)$
13: ▹ Step 3: Extract extension rays
14: $C \leftarrow FindContours(M_{dang})$
15: for all contour $c \in C$ do
16:     $(p_{start}, p_{end}) \leftarrow GetEndpoints(c)$
17:     $v \leftarrow p_{end} - p_{start}, \; \hat{v} \leftarrow v / \| v \|_{2}$
18:     $Ray \leftarrow LinSpace(p_{end}, p_{end} + l_{max} \cdot \hat{v})$
19: end for
20: ▹ Step 4: Walk along rays
21: for $t \leftarrow 1$ to $l_{max}$ do
22:     for all active ray $r$ do
23:         $p_{cur} \leftarrow r[t]$
24:         $Halo \leftarrow Get3x3Neighbors(S, p_{cur})$
25:         if $|Halo \cap S| \geq 2$ or $WalkingPath \cap S \neq \emptyset$ then
26:             $Deactivate(r)$ ▹ touched existing skeleton, stop
27:         else if $p_{cur} = p_{prev}$ then
28:             ▹ paused, keep active
29:         else
30:             $S[p_{cur}] \leftarrow 1$ ▹ extend skeleton
31:             $WalkingPath[p_{cur}] \leftarrow 1$
32:         end if
33:     end for
34:     if no active rays then
35:         break
36:     end if
37: end for
38: return $S$
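The ray-walking step of Algorithm A3 can be illustrated for a single dangling line with plain Python lists. This is a simplified sketch under stated assumptions, not the TopoFP code: it handles one ray at a time, takes the ordered pixels of the dangling line as input instead of extracting contours, and writes the current pixel before stopping so that the junction stays 8-connected. The function name `extend_dangling_line` is hypothetical.

```python
import math


def extend_dangling_line(skel, line_pixels, max_len_px):
    """Extend a dangling line past its free end until it meets other skeleton pixels.

    skel        : 2D list of 0/1 skeleton values (not modified)
    line_pixels : ordered (row, col) pixels of the line; the last one is the free end
    max_len_px  : maximum number of ray steps
    Returns (extended_skeleton, connected_flag).
    """
    h, w = len(skel), len(skel[0])
    out = [row[:] for row in skel]
    own = set(line_pixels)
    (r0, c0), (r1, c1) = line_pixels[0], line_pixels[-1]
    norm = math.hypot(r1 - r0, c1 - c0)
    if norm == 0:
        return out, False
    ur, uc = (r1 - r0) / norm, (c1 - c0) / norm   # unit direction of the ray
    prev = (r1, c1)
    for t in range(1, max_len_px + 1):
        cur = (round(r1 + t * ur), round(c1 + t * uc))
        if cur == prev:
            continue                      # rounding kept the ray on the same pixel
        r, c = cur
        if not (0 <= r < h and 0 <= c < w):
            break                         # ray left the image
        if skel[r][c] == 1 and cur not in own:
            return out, True              # landed directly on another line
        out[r][c] = 1                     # extend the skeleton (assumption: write first)
        touches = any(
            0 <= r + dr < h and 0 <= c + dc < w
            and skel[r + dr][c + dc] == 1 and (r + dr, c + dc) not in own
            for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)
        )
        if touches:
            return out, True              # reached the neighbourhood of another line
        prev = cur
    return out, False


# A horizontal dangling line (row 2, columns 0-2) facing a vertical line at column 7.
skeleton = [[0] * 9 for _ in range(5)]
for r in range(5):
    skeleton[r][7] = 1
dangling = [(2, 0), (2, 1), (2, 2)]
for r, c in dangling:
    skeleton[r][c] = 1

extended, connected = extend_dangling_line(skeleton, dangling, max_len_px=10)
short, short_connected = extend_dangling_line(skeleton, dangling, max_len_px=2)
print(extended[2], connected)
print(short[2], short_connected)
```

With a budget of two steps the gap stays open and the extension remains dangling, which mirrors why the choice of the maximum extension length matters.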

Appendix A.4

Table A1. Parameter sensitivity analysis of the DLD module (width_thresh_m). All metrics are averaged across six test images (HLJ, SD, ZJ). ↑ higher is better, ↓ lower is better.

Model	Width (m)	Prec. ↑	Rec. ↑	F1 ↑	IoU ↑	GOC ↓	GUC ↓	GTC ↓	PoLiS ↓
DLinkNet	2	0.9224	0.7620	0.8322	0.7194	0.0617	0.0517	0.0807	79.12
	4	0.9212	0.7712	0.8375	0.7269	0.0666	0.0556	0.0876	79.18
	6	0.9191	0.7855	0.8454	0.7379	0.0755	0.0626	0.1009	79.37
	8	0.9163	0.8012	0.8536	0.7497	0.0864	0.0713	0.1160	79.39
	10	0.9133	0.8172	0.8616	0.7614	0.0990	0.0821	0.1311	79.33
	12	0.9097	0.8342	0.8695	0.7734	0.1107	0.0939	0.1397	79.00
	14	0.9047	0.8517	0.8768	0.7843	0.1205	0.1056	0.1432	78.37
	16	0.8979	0.8684	0.8824	0.7930	0.1283	0.1168	0.1444	77.39
	20	0.8837	0.8940	0.8885	0.8025	0.1299	0.1268	0.1343	75.63
PSPNet	2	0.9198	0.7679	0.8338	0.7222	0.0644	0.0573	0.0769	78.26
	4	0.9188	0.7747	0.8377	0.7276	0.0682	0.0604	0.0818	77.99
	6	0.9171	0.7854	0.8436	0.7359	0.0746	0.0658	0.0904	77.92
	8	0.9149	0.7979	0.8502	0.7453	0.0830	0.0729	0.1009	77.95
	10	0.9122	0.8107	0.8566	0.7545	0.0916	0.0807	0.1105	78.17
	12	0.9093	0.8243	0.8631	0.7641	0.1006	0.0899	0.1179	77.99
	14	0.9056	0.8377	0.8690	0.7729	0.1074	0.0983	0.1210	77.83
	16	0.9005	0.8515	0.8742	0.7807	0.1126	0.1063	0.1213	77.33
	20	0.8865	0.8778	0.8814	0.7917	0.1166	0.1179	0.1162	76.87

Appendix A.5

Table A2. Parameter sensitivity analysis of the DLE module (max_extend_m). All metrics are averaged across six test images (HLJ, SD, ZJ). ↑ higher is better, ↓ lower is better.

Model	Max Ext. (m)	Prec. ↑	Rec. ↑	F1 ↑	IoU ↑	GOC ↓	GUC ↓	GTC ↓	PoLiS ↓
DLinkNet	10	0.8403	0.9256	0.8805	0.7903	0.0956	0.1194	0.0825	73.20
	50	0.8412	0.9237	0.8802	0.7898	0.0955	0.1176	0.0832	80.84
	100	0.8422	0.9225	0.8801	0.7898	0.0957	0.1166	0.0840	87.65
	150	0.8424	0.9221	0.8802	0.7897	0.0956	0.1161	0.0842	91.45
PSPNet	10	0.8342	0.9223	0.8756	0.7825	0.0858	0.1145	0.0705	76.51
	50	0.8353	0.9200	0.8751	0.7819	0.0857	0.1127	0.0711	79.09
	100	0.8366	0.9180	0.8749	0.7816	0.0860	0.1113	0.0720	83.86
	150	0.8369	0.9174	0.8749	0.7815	0.0861	0.1108	0.0723	87.56

References

1. Persello, C.; Tolpekin, V.; Bergado, J.R.; De By, R. Delineation of agricultural fields in smallholder farms from satellite images using fully convolutional networks and combinatorial grouping. Remote Sens. Environ. 2019, 231, 111253.
2. Rangel, R.F.; Lourenço, V.N.; Oldoni, L.V.; Bonamigo, A.F.C.; Santos, W.; Oliveira, B.S.; Barreto, M.N. A Unified Framework for Cropland Field Boundary Detection and Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 636–644.
3. Li, M.; Long, J.; Stein, A.; Wang, X. Using a semantic edge-aware multi-task neural network to delineate agricultural parcels from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 200, 24–40.
4. Wang, X.; Shu, L.; Han, R.; Yang, F.; Gordon, T.; Wang, X.; Xu, H. A Survey of Farmland Boundary Extraction Technology Based on Remote Sensing Images. Electronics 2023, 12, 1156.
5. Jong, M.; Guan, K.; Wang, S.; Huang, Y.; Peng, B. Improving field boundary delineation in ResUNets via adversarial deep learning. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102877.
6. Masoud, K.M.; Persello, C.; Tolpekin, V.A. Delineation of agricultural field boundaries from Sentinel-2 images using a novel super-resolution contour detector based on fully convolutional networks. Remote Sens. 2019, 12, 59.
7. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 898–916.
8. Pont-Tuset, J.; Arbelaez, P.; Barron, J.T.; Marques, F.; Malik, J. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 128–140.
9. Cai, Z.; Hu, Q.; Zhang, X.; Yang, J.; Wei, H.; He, Z.; Song, Q.; Wang, C.; Yin, G.; Xu, B. An adaptive image segmentation method with automatic selection of optimal scale for extracting cropland parcels in smallholder farming systems. Remote Sens. 2022, 14, 3067.
10. Garcia-Pedrero, A.; Gonzalo-Martin, C.; Lillo-Saavedra, M. A machine learning approach for agricultural parcel delineation through agglomerative segmentation. Int. J. Remote Sens. 2017, 38, 1809–1819.
11. Lebourgeois, V.; Dupuy, S.; Vintrou, É.; Ameline, M.; Butler, S.; Bégué, A. A combined random forest and OBIA classification scheme for mapping smallholder agriculture at different nomenclature levels using multisource data (simulated Sentinel-2 time series, VHRS and DEM). Remote Sens. 2017, 9, 259.
12. Long, J.; Li, M.; Wang, X.; Stein, A. Delineation of agricultural fields using multi-task BsiNet from high-resolution satellite images. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102871.
13. Waldner, F.; Diakogiannis, F.I. Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network. Remote Sens. Environ. 2020, 245, 111741.
14. Zhang, H.; Liu, M.; Wang, Y.; Shang, J.; Liu, X.; Li, B.; Song, A.; Li, Q. Automated delineation of agricultural field boundaries from Sentinel-2 images using recurrent residual U-Net. Int. J. Appl. Earth Obs. Geoinf. 2021, 105, 102557.
15. Waldner, F.; Diakogiannis, F.I.; Batchelor, K.; Ciccotosto-Camp, M.; Cooper-Williams, E.; Herrmann, C.; Mata, G.; Toovey, A. Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite images. Remote Sens. 2021, 13, 2197.
16. Xie, Y.; Zheng, S.; Wang, H.; Qiu, Y.; Lin, X.; Shi, Q. Edge Detection with Direction Guided Postprocessing for Farmland Parcel Extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3760–3770.
17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
18. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
19. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
20. Zhu, Y.; Pan, Y.; Hu, T.; Zhang, D.; Zhao, C.; Gao, Y. A generalized framework for agricultural field delineation from high-resolution satellite imageries. Int. J. Digit. Earth 2024, 17, 2297947.
21. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
22. Pan, Y.; Wang, X.; Zhang, L.; Zhong, Y. E2EVAP: End-to-end vectorization of smallholder agricultural parcel boundaries from high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2023, 203, 246–264.
23. Zhang, T.Y.; Suen, C.Y. A fast parallel algorithm for thinning digital patterns. Commun. ACM 1984, 27, 236–239.
24. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
25. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114.
26. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 182–186.
27. Zhang, D.; Pan, Y.; Zhang, J.; Hu, T.; Zhao, J.; Li, N.; Chen, Q. A generalized approach based on convolutional neural networks for large area cropland mapping at very high resolution. Remote Sens. Environ. 2020, 247, 111912.
28. Strutz, T. The distance transform and its computation. arXiv 2021, arXiv:2106.03503.
29. Turker, M.; Kok, E.H. Field-based sub-boundary extraction from remote sensing imagery using perceptual grouping. ISPRS J. Photogramm. Remote Sens. 2013, 79, 106–121.
30. Maxwell, J.C. L. On hills and dales: To the editors of the Philosophical Magazine and Journal. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1870, 40, 421–427.
31. GDAL/OGR Contributors. GDAL/OGR Geospatial Data Abstraction software Library; Open Source Geospatial Foundation: Beaverton, OR, USA, 2024.
32. Douglas, D.H.; Peucker, T.K. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartogr. Int. J. Geogr. Inf. Geovisualization 1973, 10, 112–122.
33. Hossin, M.; Sulaiman, M.N. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1.
34. Persello, C.; Bruzzone, L. A Novel Protocol for Accuracy Assessment in Classification of Very High Resolution Images. IEEE Trans. Geosci. Remote Sens. 2010, 48, 1232–1244.
35. Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. arXiv 2016, arXiv:1611.06612.
36. Avbelj, J.; Muller, R.; Bamler, R. A Metric for Polygon Comparison and Building Extraction Evaluation. IEEE Geosci. Remote Sens. Lett. 2015, 12, 170–174.
37. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the CVPR09; IEEE: Piscataway, NJ, USA, 2009.
38. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
40. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
41. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2016, 29.
42. Zhao, H.; Long, J.; Zhang, M.; Wu, B.; Xu, C.; Tian, F.; Ma, Z. Irregular Agricultural Field Delineation Using a Dual-Branch Architecture From High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
43. Zhao, H.; Wu, B.; Zhang, M.; Long, J.; Tian, F.; Xie, Y.; Zeng, H.; Zheng, Z.; Ma, Z.; Wang, M.; et al. A large-scale VHR parcel dataset and a novel hierarchical semantic boundary-guided network for agricultural parcel delineation. ISPRS J. Photogramm. Remote Sens. 2025, 221, 1–19.
44. Lu, R.; Zhang, Y.; Huang, Q.; Zeng, P.; Shi, Z.; Ye, S. A refined edge-aware convolutional neural networks for agricultural parcel delineation. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104084.
45. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.01212.
46. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424.
47. Garcia-Pedrero, A.; Lillo-Saavedra, M.; Rodriguez-Esparragon, D.; Gonzalo-Martin, C. Deep learning for automatic outlining agricultural parcels: Exploiting the land parcel identification system. IEEE Access 2019, 7, 158223–158236.
48. Duan, Y. A Field Parcel Dataset in China Main Cultivating Area, Version v0.1. Zenodo, 2025. Available online: https://zenodo.org/records/17953919 (accessed on 20 May 2026).

Figure 1. Spatial distribution of the study areas and datasets (left), and examples of test images with corresponding ground truth labels (right). The sub-figures are in HLJ (A–C), SD (D–F) and ZJ (G–I) sites respectively.

Figure 2. Overall workflow of the proposed method. Given an RGB image as input, the CNN model extracts the boundary-region features, as illustrated in Frame I. Given the CNN activation result, the TRC module converts it to a graph composed of key points and links (II.a). Subsequently, DLD (II.b) and DLE (II.c) can optionally be applied to further refine the FP boundary, and the graph is finally converted to the polygonal FP result (II.d).

Figure 3. Examples of prepared training images and corresponding labels, in which the red, blue and green areas represent the FP region, the buffered FP edge and the non-field area, respectively.

Figure 4. Illustration of the key points identification method. Given the binary boundary result from CNN model (left), the initial skeleton map is derived (middle). Based on the skeleton map, three conditions (right): the endpoint of the skeleton (the upper row), the normal point on the skeleton (the middle row) and the cross point (the bottom row) and their determination methods are used to detect key points. The red lines represent the individual intervals separated by skeleton pixels on the clockwise halo of 3 × 3 neighborhoods.

Figure 5. Samples of FP region and edge extraction of different backbones in HLJ site.

Figure 6. Samples of FP region and edge extraction of different backbones in SD site.

Figure 7. Samples of FP region and edge extraction of different backbones in ZJ site.

Figure 8. Qualitative comparison results of different methods on typical test area samples.

Figure 9. Comparison of vectorization results across test areas: HLJ (a,b), SD (c,d) and ZJ (e,f). Adjacent parcels are rendered using a four-color filling method. Red rectangles highlight regions of interest discussed in the text.

Figure 10. Statistical analysis of the difference between the width of the deep learning boundary and the actual boundary. Group statistics were conducted based on two dimensions: the study area (row direction) and the actual boundary range (column direction).

Figure 11. Demonstration of deep learning boundary features on road category. The image patch (a) shows the sampling point (red arrow) and the cross-sectional sampling range (yellow horizontal line); (b) presents section diagrams of binary prediction (horizontal bars) and activation curves from various models. This organization style is applied in Figure 11, Figure 12, Figure 13 and Figure 14.

Figure 12. Demonstration of deep learning boundary features on road category with wide shoulder. The red arrow indicates the target sampling point, and the yellow horizontal line marks the cross-sectional sampling range.

Figure 13. Demonstration of deep learning boundary features on canal categories. The red arrow indicates the target sampling point, and the yellow horizontal line marks the cross-sectional sampling range.

Figure 14. Demonstration of deep learning boundary features on other categories. The red arrow indicates the target sampling point, and the yellow horizontal line marks the cross-sectional sampling range.

Figure 15. Results of the ablation study of post-processing based on different deep learning models. Columns 1 to 5 show the image patch, the binary deep learning result, the polygonal FP result with only TRC, the polygonal FP result with DLD, and the polygonal FP result with DLD and DLE. Rows I–IV represent four representative test cases from different study areas. Red lines indicate the final vectorized FP boundaries.

Figure 16. Comparison of polygonal FP results derived from OWT-UCM and our proposed post-processing based on the same deep learning result: (a) the input image to the deep learning model; (b) the boundary activation from DLinkNet; (c) the ultrametric contour map; (d) small crops from our post-processed result (red lines) and boundary maps from the OWT-UCM result at different threshold levels; (e) our post-processed result.

Figure 17. Sensitivity analysis of the core parameters in the DLD (left) and DLE (right) modules across DLinkNet and PSPNet models.

Table 1. Quantitative evaluations of different backbones at different sites.

Model	Loc.	Boundary Rec.	Boundary Prec.	Boundary IoU	Boundary F1	FP Region Rec.	FP Region Prec.	FP Region IoU	FP Region F1
ResUNet	HLJ	0.789	0.096	0.094	0.172	0.704	0.949	0.679	0.809
UNet		0.785	0.110	0.107	0.194	0.766	0.960	0.743	0.852
DLinkNet		0.739	0.138	0.132	0.233	0.863	0.962	0.835	0.910
PSPNet		0.694	0.138	0.130	0.231	0.868	0.958	0.836	0.911
InceptionV3+		0.546	0.086	0.080	0.148	0.784	0.833	0.677	0.808
ResUNet	SD	0.803	0.143	0.138	0.243	0.709	0.937	0.676	0.807
UNet		0.845	0.143	0.139	0.245	0.705	0.945	0.677	0.808
DLinkNet		0.860	0.155	0.151	0.263	0.755	0.949	0.725	0.841
PSPNet		0.822	0.160	0.154	0.268	0.781	0.943	0.746	0.855
InceptionV3+		0.620	0.133	0.123	0.219	0.809	0.776	0.656	0.792
ResUNet	ZJ	0.838	0.118	0.115	0.206	0.558	0.719	0.458	0.628
UNet		0.859	0.144	0.140	0.246	0.571	0.817	0.506	0.672
DLinkNet		0.861	0.164	0.160	0.276	0.549	0.896	0.516	0.681
PSPNet		0.809	0.161	0.155	0.268	0.560	0.896	0.526	0.690
InceptionV3+		0.632	0.083	0.079	0.147	0.700	0.534	0.435	0.606
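
The four columns in each group of Table 1 are not independent. As a minimal sketch, assuming the pixel-level metrics follow the standard definitions and are computed from one set of confusion counts (TP, FP, FN), F1 and IoU can be recovered from precision and recall alone. The function names below are illustrative, and agreement with the reported values is only expected up to three-decimal rounding.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def iou_from_pr(precision: float, recall: float) -> float:
    """IoU = TP / (TP + FP + FN), rewritten using 1/IoU = 1/P + 1/R - 1."""
    denom = precision + recall - precision * recall
    if denom == 0:
        return 0.0
    return precision * recall / denom


# (recall, precision, reported IoU, reported F1), copied from Table 1, HLJ site.
rows = {
    "ResUNet, boundary": (0.789, 0.096, 0.094, 0.172),
    "DLinkNet, FP region": (0.863, 0.962, 0.835, 0.910),
}

for name, (rec, prec, iou_reported, f1_reported) in rows.items():
    f1 = f1_from_pr(prec, rec)
    iou = iou_from_pr(prec, rec)
    print(f"{name}: F1={f1:.3f} (reported {f1_reported}), "
          f"IoU={iou:.3f} (reported {iou_reported})")
```

Because the harmonic mean is dominated by the smaller term, the boundary F1 values in Table 1 track boundary precision rather than the much higher boundary recall.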

Table 2. Quantitative comparison of model outputs across state-of-the-art methods.

Model	Loc.	Boundary				FP Region
		Rec.	Prec.	IoU	F1	Rec.	Prec.	IoU	F1
BFINet	HLJ	0.663	0.109	0.103	0.187	0.733	0.903	0.679	0.809
BSiNet		0.410	0.124	0.105	0.191	0.482	0.910	0.460	0.630
HGBNet		0.389	0.111	0.095	0.173	0.633	0.939	0.608	0.756
ReaUNet		-	-	-	-	0.654	0.913	0.615	0.762
SEANet		0.481	0.081	0.075	0.139	0.429	0.954	0.420	0.591
DLinkNet		0.739	0.138	0.132	0.233	0.863	0.962	0.835	0.910
PSPNet		0.694	0.138	0.130	0.231	0.868	0.958	0.836	0.911
BFINet	SD	0.746	0.107	0.104	0.188	0.805	0.674	0.580	0.734
BSiNet		0.468	0.121	0.107	0.193	0.604	0.660	0.461	0.631
HGBNet		0.410	0.116	0.100	0.181	0.638	0.686	0.494	0.661
ReaUNet		-	-	-	-	0.800	0.675	0.578	0.732
SEANet		0.679	0.070	0.068	0.128	0.703	0.769	0.580	0.734
DLinkNet		0.860	0.155	0.151	0.263	0.755	0.949	0.725	0.841
PSPNet		0.822	0.160	0.154	0.268	0.781	0.943	0.746	0.855
BFINet	ZJ	0.795	0.171	0.163	0.281	0.781	0.690	0.578	0.733
BSiNet		0.456	0.230	0.181	0.306	0.608	0.664	0.465	0.635
HGBNet		0.514	0.199	0.167	0.286	0.684	0.803	0.585	0.738
ReaUNet		-	-	-	-	0.709	0.643	0.509	0.674
SEANet		0.703	0.098	0.094	0.172	0.568	0.738	0.472	0.642
DLinkNet		0.861	0.164	0.160	0.276	0.549	0.896	0.516	0.681
PSPNet		0.809	0.161	0.155	0.268	0.560	0.896	0.526	0.690

Table 3. Quantitative evaluation of the vectorized results. ↑ higher is better, ↓ lower is better.

Configuration	Model	Area	Prec. ↑	Rec. ↑	F1 ↑	IoU ↑	GOC ↓	GUC ↓	GTC ↓	PoLiS ↓
w/Both ( $w = 10, 16$ )	DLinkNet	HLJ	0.9154	0.9079	0.9113	0.8393	0.1330	0.1345	0.1316	114.66
	DLinkNet	SD	0.9112	0.9328	0.9218	0.8554	0.1106	0.1016	0.1222	83.39
	DLinkNet	ZJ	0.8244	0.8415	0.8325	0.7130	0.1464	0.1450	0.1492	29.06
	PSPNet	HLJ	0.9161	0.8965	0.9057	0.8299	0.1234	0.1298	0.1179	123.93
	PSPNet	SD	0.9065	0.9318	0.9188	0.8502	0.1026	0.0973	0.1094	75.35
	PSPNet	ZJ	0.8371	0.8049	0.8198	0.6947	0.1231	0.1258	0.1210	31.28

Table 4. Ablation study of the DLD and DLE post-processing modules. All metrics are area-level averages across two test images. ↑ higher is better, ↓ lower is better.

Configuration	Model	Area	Prec. ↑	Rec. ↑	F1 ↑	IoU ↑	GOC ↓	GUC ↓	GTC ↓	PoLiS ↓
Base (w/o DLD, w/o DLE)	DLinkNet	HLJ	0.9391	0.8621	0.8977	0.8177	0.1106	0.1001	0.1247	118.57
	DLinkNet	SD	0.9291	0.8655	0.8962	0.8125	0.0808	0.0573	0.1378	88.91
	DLinkNet	ZJ	0.8715	0.7242	0.7908	0.6540	0.1056	0.0890	0.1309	30.52
	PSPNet	HLJ	0.9371	0.8575	0.8942	0.8119	0.1066	0.1018	0.1132	126.21
	PSPNet	SD	0.9232	0.8745	0.8982	0.8156	0.0802	0.0601	0.1209	76.53
	PSPNet	ZJ	0.8763	0.7002	0.7775	0.6361	0.0879	0.0803	0.0975	31.77
w/DLD ( $w = 16$ )	DLinkNet	HLJ	0.9283	0.8889	0.9074	0.8332	0.1303	0.1245	0.1373	116.82
	DLinkNet	SD	0.9187	0.9137	0.9162	0.8458	0.1064	0.0887	0.1333	85.80
	DLinkNet	ZJ	0.8467	0.8024	0.8236	0.7001	0.1483	0.1372	0.1628	29.55
	PSPNet	HLJ	0.9278	0.8802	0.9024	0.8249	0.1213	0.1213	0.1218	124.45
	PSPNet	SD	0.9149	0.9121	0.9134	0.8411	0.0986	0.0844	0.1189	76.38
	PSPNet	ZJ	0.8588	0.7623	0.8068	0.6762	0.1178	0.1132	0.1232	31.15
w/DLE ( $w = 10$ )	DLinkNet	HLJ	0.8639	0.9445	0.9023	0.8241	0.0979	0.1445	0.0760	109.97
	DLinkNet	SD	0.8890	0.9580	0.9222	0.8558	0.0872	0.0983	0.0804	78.64
	DLinkNet	ZJ	0.7674	0.8753	0.8171	0.6910	0.1017	0.1168	0.0907	29.48
	PSPNet	HLJ	0.8623	0.9385	0.8988	0.8181	0.0910	0.1399	0.0689	123.64
	PSPNet	SD	0.8791	0.9607	0.9179	0.8486	0.0787	0.0965	0.0679	72.96
	PSPNet	ZJ	0.7608	0.8684	0.8100	0.6809	0.0879	0.1082	0.0745	33.66
w/Both ( $w = 10, 16$ )	DLinkNet	HLJ	0.9154	0.9079	0.9113	0.8393	0.1330	0.1345	0.1316	114.66
	DLinkNet	SD	0.9112	0.9328	0.9218	0.8554	0.1106	0.1016	0.1222	83.39
	DLinkNet	ZJ	0.8244	0.8415	0.8325	0.7130	0.1464	0.1450	0.1492	29.06
	PSPNet	HLJ	0.9161	0.8965	0.9057	0.8299	0.1234	0.1298	0.1179	123.93
	PSPNet	SD	0.9065	0.9318	0.9188	0.8502	0.1026	0.0973	0.1094	75.35
	PSPNet	ZJ	0.8371	0.8049	0.8198	0.6947	0.1231	0.1258	0.1210	31.28
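
The ablation in Table 4 is easiest to read as changes relative to the Base configuration. The following sketch tabulates the F1 and IoU gains for the DLinkNet rows only; the values are copied from Table 4, while the dictionary layout and the helper name `gains_over_base` are illustrative assumptions rather than part of the evaluation code.

```python
# (F1, IoU) for DLinkNet, copied from Table 4: area -> configuration -> metrics.
table4_dlinknet = {
    "HLJ": {"Base": (0.8977, 0.8177), "DLD": (0.9074, 0.8332),
            "DLE": (0.9023, 0.8241), "Both": (0.9113, 0.8393)},
    "SD": {"Base": (0.8962, 0.8125), "DLD": (0.9162, 0.8458),
           "DLE": (0.9222, 0.8558), "Both": (0.9218, 0.8554)},
    "ZJ": {"Base": (0.7908, 0.6540), "DLD": (0.8236, 0.7001),
           "DLE": (0.8171, 0.6910), "Both": (0.8325, 0.7130)},
}


def gains_over_base(table):
    """Return area -> configuration -> (delta F1, delta IoU) relative to Base."""
    gains = {}
    for area, configs in table.items():
        base_f1, base_iou = configs["Base"]
        gains[area] = {
            name: (round(f1 - base_f1, 4), round(iou - base_iou, 4))
            for name, (f1, iou) in configs.items()
            if name != "Base"
        }
    return gains


for area, per_config in gains_over_base(table4_dlinknet).items():
    for name, (d_f1, d_iou) in per_config.items():
        print(f"{area} w/{name}: dF1={d_f1:+.4f}, dIoU={d_iou:+.4f}")
```

Read this way, every configuration improves on Base for DLinkNet in all three areas, with the largest gains in ZJ. In SD the DLE-only configuration is marginally ahead of the combined one in F1 (0.9222 versus 0.9218).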
