1. Introduction
Generating accurate and well-structured building footprints from remote sensing data is fundamental for understanding and managing the complex spatial dynamics of urban environments. Building footprints serve as basic spatial units in Geographic Information Systems (GISs), enabling downstream applications such as cartography, 3D city modeling [1], and cadastral mapping [2,3,4]. Compared with raster-based representations, vectorized building polygons offer compact and topologically consistent descriptions that better capture the geometric and semantic structure of urban areas. Consequently, developing accurate and computationally efficient building polygon extraction methods has become a crucial research frontier. While deep learning has significantly advanced this field, existing methods generally fall into four categories, each with distinct limitations.
The first category, segmentation-and-post-processing approaches [5,6,7,8], predicts binary building masks using neural networks, followed by geometric post-processing (e.g., polygon simplification) to regularize the output. A representative pipeline involves extracting building contours from segmentation networks such as Mask R-CNN [9] or U-Net [10], simplifying the contours using algorithms such as Douglas-Peucker [11], and further optimizing the polygons through descriptor-based strategies such as the Minimum Description Length (MDL) principle [12]. For example, Zhao et al. [6] combined Mask R-CNN with contour simplification and MDL optimization to refine polygon representations, while Daranagama et al. [13] directly generated shapefile-format polygons from U-Net predictions. Other studies [7,8] further enhanced polygon precision by improving the segmentation backbone or incorporating additional geometric constraints such as convex hull regularization. While straightforward and easy to implement, these methods often suffer from a lack of coherence between the prediction and post-processing stages, since geometric regularization is performed independently of feature learning. As a result, the final polygons may deviate from actual building contours, particularly in complex urban scenes.
To overcome this, a second class of methods [14,15,16,17,18,19,20,21,22] aims to directly predict polygon vertices or vector representations in an end-to-end fashion. Early vertex prediction methods, such as PolygonRNN [23] and PolygonRNN++ [24], were originally proposed for semantic image annotation and employed recurrent neural networks (RNNs) to sequentially generate polygon vertices. Building upon this idea, PolyMapper [14] extended the framework to building boundary regularization by introducing a bounding-box-based mechanism for extracting multiple building polygons and directly connecting sequentially predicted vertices. To improve vertex prediction accuracy, Zhao et al. [15] incorporated spatial attention and Conv-GRU [25] modules. To reduce training complexity, subsequent studies [19] replaced RNNs with Graph Convolutional Networks (GCNs) [26]. BuildingVec [21] enhances building polygon prediction accuracy by jointly predicting vertex positions and classifying whether each vertex represents a corner. In addition, ROIPoly [20] reduces computational costs by representing vertices as queries within a Region-of-Interest (ROI) framework, thereby limiting attention computation to pertinent areas.
Another approach adopted a two-stage strategy: first detecting an initial contour segmentation map, then extracting polygons and vertices from these contours, and subsequently deriving building polygons through vertex optimization and fine-tuning [22,27,28,29,30]. Other works directly detect vertex locations from images, refine them via Graph Neural Networks (GNNs), and establish edge connections to generate polygons [17,18].
With the emergence of the Transformer [31] architecture, global attention mechanisms have been increasingly adopted for modeling long-range dependencies in vertex prediction tasks. Related studies have shown that Transformer attention mechanisms (as used by Zhang et al. [32]) and the DETR [33] architecture (as used by Hu et al. [34]) offer an effective approach for building vertex sequence prediction.
Another line of research seeks to directly generate regularized building polygons by incorporating geometric constraints into the network training process [35]. Instead of relying on explicit post-processing or sequential vertex prediction, these methods integrate regularization objectives within the learning framework to encourage structured polygon outputs. For instance, Generative Adversarial Networks (GANs) [36] have been employed to regularize segmentation maps through adversarial supervision, guiding the network toward more geometrically consistent building boundaries [37,38]. Similarly, regularization loss functions [39,40] have been introduced to impose structural constraints during training, enabling the network to produce polygons with desired geometric properties.
More recently, end-to-end architectures have attempted to combine contour refinement and vectorized prediction within a unified framework. For example, Sun et al. [41] proposed a Transformer-based contour refinement head that jointly performs roof extraction and fine-grained classification to enhance polygon accuracy. While these direct extraction approaches demonstrate promising improvements in boundary regularization, they often involve adversarial optimization or tightly coupled network designs. Such training schemes tend to be computationally expensive and less stable than conventional supervised learning, and their structural complexity may further increase optimization difficulty, which limits their scalability and practical applicability in large-scale remote sensing scenarios.
Frame field-driven methods [42,43,44] offer a compelling alternative by introducing geometric priors into the learning process to facilitate boundary alignment and polygon regularization. Inspired by Marcos et al. [45], who combined the Active Contour Model (ACM) with CNNs, Girard et al. [42] introduced frame fields, shifting contour optimization from direct boundary evolution to skeleton-based graph representations.
Subsequent studies advanced the framework. Nguyen et al. [43] combined super-resolution, frame field learning, and polygonization to extract building footprints in dense building areas. Sun et al. [44] fused Digital Surface Models (DSMs) and RGB imagery, performing post-processing by comparing segmentation maps obtained from a U-Net with frame fields to derive regularized building polygons, which resulted in high-precision outputs. Xu et al. [46] proposed the Alignment-Free Module (AFM), where the network simultaneously predicts vertices, masks, and the AFM, matching them to obtain the final result. These studies collectively validate the effectiveness of frame field methods in building polygon extraction tasks and their potential in boundary regularization.
The existing frame field network architecture typically consists of a feature encoder for multi-scale image features and a decoder that predicts frame fields, binary masks, and edges. Despite their overall effectiveness, conventional decoders in such networks typically employ simple stacked convolution layers. Consequently, while these methods achieve good performance in regularized building extraction, their efficacy is dependent on the accuracy and completeness of the intermediate binary mask and edge predictions. Remote sensing images often contain vegetation occlusions, complex textures, and blurred boundaries, which make it difficult for simple decoders to simultaneously capture global structure and local details. This fundamental limitation poses significant challenges for further improving the performance of frame field methods.
To address these limitations, we propose an Edge-Attentive Dual-Branch Frame Field Network (EA-DBFFN), building upon existing frame field networks for high-precision building polygon extraction. Our method’s key insight is to separately optimize for regional consistency and boundary precision. We implement this through a Dual-Task Decoder, structured with a shared encoder, dual task-specific branches, and a dedicated Dual-Task Mutual Guidance Module. The mask branch integrates a dynamic receptive field module and a spatial attention mechanism to improve the consistency of regional representations. The edge branch leverages fixed Sobel and Laplacian filters to amplify high-frequency features, improving boundary localization accuracy. The mutual guidance module further facilitates residual-based cross-task optimization, enabling effective collaboration between mask and edge predictions. This strategy enhances the alignment between masks and frame fields, ultimately leading to higher-precision building polygons.
The main contributions of this work are as follows:
We propose a dual-branch decoding structure that explicitly separates regional and boundary modeling, addressing the limitation of conventional unified decoders in simultaneously capturing global structure and local boundary details, thereby enhancing the geometric precision of building boundaries.
We design lightweight and efficient sub-modules, including a dynamic receptive field module to handle multi-scale building variations, and an edge enhancement module incorporating fixed Sobel and Laplacian filters to mitigate boundary ambiguity caused by vegetation occlusion and cluttered backgrounds.
We introduce a dual-task mutual guidance mechanism that enables residual-based information exchange between mask and edge predictions, addressing structural misalignment and prediction inconsistency across the two branches, and ultimately improving the alignment between masks and frame fields for higher-precision polygon extraction.
2. Materials and Methods
2.1. Network Structure
We propose an Edge-Attentive Dual-Branch Frame Field Network (EA-DBFFN) for precise building polygon extraction, whose overall architecture is shown in Figure 1. Our end-to-end workflow is divided into four main stages: input images first undergo feature extraction via a backbone network; the extracted features are then fed into a dual-task decoder to concurrently generate refined building masks and edge predictions, which mutually enhance each other through collaborative optimization. Simultaneously, the network performs frame field generation, which is then fused with the aforementioned predictions in a frame field alignment module to enhance boundary orientation and geometric consistency. Finally, leveraging the fused information, a polygonization step yields high-precision vector building polygons. This explicit decoupling and synergistic enhancement of area and edge modeling mitigates the limitations of traditional unified decoding strategies, significantly improving the geometric accuracy and topological quality of building boundaries. This section details the core components of the network.
2.2. Dual-Task Decoder
To address structural discontinuities and tracking errors commonly encountered at corners or in occluded regions, we introduce a lightweight dual-branch decoder. It incorporates a mutual guidance mechanism that promotes joint optimization between mask and edge predictions. As illustrated in Figure 1, the decoder comprises a shared feature encoder and two task-specific branches: the mask branch employs a Dynamic Receptive Field module and a spatial attention mechanism to enhance regional consistency, while the edge branch utilizes fixed Sobel and Laplacian filters to strengthen boundary detection. These outputs are further refined via a Dual-Task Mutual Guidance Module using cross-task residual learning.
2.2.1. Shared Feature Encoder
To provide high-quality input to both branches, a shared feature encoder enhances shallow features from the backbone. It is built from two sequential convolutional modules, where each module includes a convolution layer followed by batch normalization and an ELU activation function. The first module projects the backbone output from 32 to 64 channels to enrich feature representation, while the second refines the expanded features to improve spatial consistency. This two-layer design strikes a balance between representational capacity and computational efficiency: a single layer is insufficient to bridge the gap between the backbone output and the task-specific branches, while deeper stacking would introduce unnecessary parameters without meaningful performance gain.
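For concreteness, the shared encoder described above can be sketched in PyTorch as follows. Only the channel widths (32 to 64, then 64 to 64) are specified in the text, so the 3 × 3 kernel size and padding are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SharedFeatureEncoder(nn.Module):
    """Two sequential Conv-BN-ELU modules: the first expands the backbone
    output from 32 to 64 channels, the second refines the expanded features.
    Kernel size 3 is an assumption; only channel widths come from the text."""
    def __init__(self, in_ch: int = 32, out_ch: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ELU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ELU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```

This sketch preserves spatial resolution so that both task branches receive feature maps of the same size as the backbone output.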
2.2.2. Mask Branch
At the core of the mask branch lies an attention module designed to efficiently fuse global contextual information with fine-grained local structural details, a critical factor for generating high-quality segmentation masks. As illustrated in Figure 2, this module achieves this goal through a dual mechanism combining the Dynamic Receptive Block and Spatial Gating.
First, the Dynamic Receptive Block captures multi-scale features through two parallel branches, producing F_d (dilated convolution, enlarged receptive field) and F_s (standard convolution, original receptive field). The attention weights w_d and w_s are dynamically generated by concatenating F_d and F_s along the channel dimension, followed by a 1 × 1 convolution to reduce the dimension to 2 channels and a channel-wise Softmax operation, ensuring that the attention weights at each spatial location are normalized:
[w_d, w_s] = Softmax(Conv_1×1([F_d; F_s])),
where [· ; ·] denotes channel-wise concatenation and the Softmax operation normalizes the attention weights at each spatial location across the two branches. The channel attention mechanism then adaptively fuses the two branches, yielding the channel-weighted output
F_c = w_d ⊙ F_d + w_s ⊙ F_s,
where ⊙ denotes element-wise multiplication.
Next, the spatial gating module employs lightweight 1 × 1 convolutions and a Sigmoid operation to generate a spatial attention mask G = σ(Conv_1×1(F_c)). This mask G suppresses background noise while enhancing features in the target region. The final masked feature F_m is obtained through gating and a 1 × 1 convolution:
F_m = Conv_1×1(G ⊙ F_c),
where σ denotes the Sigmoid operation.
This dual attention mechanism effectively fuses multi-scale information while suppressing spatial noise, significantly enhancing the masking branch’s perception of complex structural boundaries.
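A minimal sketch of this dual mechanism follows. The kernel sizes, the dilation rate of 2, and the default channel width are illustrative assumptions not fixed by the text; only the two-branch fusion, the two-channel Softmax weighting, and the Sigmoid spatial gate come from the design above:

```python
import torch
import torch.nn as nn

class DynamicReceptiveBlock(nn.Module):
    """Fuses a dilated-conv branch with an ordinary-conv branch via
    spatially normalized attention weights, then applies spatial gating.
    Dilation rate and kernel sizes are assumptions for illustration."""
    def __init__(self, ch: int = 64, dilation: int = 2):
        super().__init__()
        self.branch_d = nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation)
        self.branch_s = nn.Conv2d(ch, ch, 3, padding=1)
        self.attn = nn.Conv2d(2 * ch, 2, kernel_size=1)  # 1x1 conv -> 2 weight maps
        self.gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(ch, ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_d, f_s = self.branch_d(x), self.branch_s(x)
        # Per-location Softmax over the two branch weights.
        w = torch.softmax(self.attn(torch.cat([f_d, f_s], dim=1)), dim=1)
        fused = w[:, 0:1] * f_d + w[:, 1:2] * f_s
        g = self.gate(fused)          # spatial attention mask G in [0, 1]
        return self.proj(g * fused)   # gated feature after 1x1 projection
```

The two attention weight maps sum to one at every pixel, so the fusion is a convex combination of the two receptive fields.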
2.2.3. Edge Branch
To enhance boundary representation capability, we introduce an Edge Enhancement Block specifically for high-frequency feature modeling, whose structure is shown in
Figure 3.
This module first convolves the input feature map with three classical edge detection operators, the horizontal Sobel (S_x), vertical Sobel (S_y), and Laplacian (L) kernels, extracting the initial multi-directional edge features E_x, E_y, and E_L. These features are then concatenated along the channel dimension to form [E_x; E_y; E_L]. To fuse the multi-directional edge information and reduce dimensionality, we apply a 1 × 1 convolution combined with ELU activation to obtain the mixed edge feature E_m. Finally, applying a set of 3 × 3 convolutions followed by Sigmoid activation, the module generates a single-channel edge attention map A = σ(Conv(E_m)) to guide feature learning in the main branch, where σ denotes the Sigmoid operation.
The choice of fixed Sobel and Laplacian operators is motivated by their complementary frequency characteristics. Sobel captures first-order directional gradients, while the Laplacian responds to second-order intensity variations, together providing a diverse representation of boundary structures. Compared to learnable convolutional kernels, fixed operators introduce no additional trainable parameters, which helps control model complexity and may reduce the risk of overfitting, especially under limited training data. In this work, they are used as lightweight priors rather than replacements for learnable components. Although the Laplacian is sensitive to noise, its output in our design is not used as a direct prediction but instead serves as an intermediate feature that is subsequently fused and refined through learnable 1 × 1 and 3 × 3 convolutions, effectively suppressing noise-induced artifacts before the features influence the final edge attention map. Moreover, building boundaries in high-resolution remote sensing imagery typically exhibit strong and consistent gradients, allowing classical edge operators to provide useful structural cues. The overall performance improvements observed in our experiments also suggest the effectiveness of this design in practice. This design ensures the network explicitly models high-frequency boundary information, significantly enhancing edge accuracy in segmentation results.
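Under the design above, the fixed operators can be implemented as frozen depthwise convolutions, with the fused responses refined by learnable 1 × 1 and 3 × 3 convolutions as described. Channel widths and the exact depth of the learnable head are assumptions for illustration:

```python
import torch
import torch.nn as nn

def _fixed_filter(kernel: torch.Tensor, ch: int) -> nn.Conv2d:
    # One frozen depthwise conv per operator, applied to every channel.
    conv = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)
    conv.weight.data.copy_(kernel.view(1, 1, 3, 3).repeat(ch, 1, 1, 1))
    conv.weight.requires_grad_(False)  # fixed prior, no trainable parameters
    return conv

class EdgeEnhancementBlock(nn.Module):
    """Fixed Sobel/Laplacian filtering, channel concatenation, 1x1 fusion,
    and a learnable 3x3 head producing a single-channel edge attention map."""
    def __init__(self, ch: int = 64):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        laplacian = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.fx = _fixed_filter(sobel_x, ch)       # horizontal Sobel -> E_x
        self.fy = _fixed_filter(sobel_x.t(), ch)   # vertical Sobel   -> E_y
        self.fl = _fixed_filter(laplacian, ch)     # Laplacian        -> E_L
        self.mix = nn.Sequential(nn.Conv2d(3 * ch, ch, 1), nn.ELU(inplace=True))
        self.head = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ELU(inplace=True),
            nn.Conv2d(ch, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e = torch.cat([self.fx(x), self.fy(x), self.fl(x)], dim=1)  # [E_x; E_y; E_L]
        return self.head(self.mix(e))  # single-channel edge attention map A
```

Because the operator weights are registered with gradients disabled, the block adds only the fusion and head parameters to the model, consistent with the lightweight-prior motivation.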
2.2.4. Mutual Guidance Module
To exploit the complementarity between region and boundary predictions, we design a Mutual Guidance Module. Given the initial mask prediction M and edge prediction E, both paths share the same joint input formed by channel-wise concatenation, J = [M; E]. Each path applies two convolutional layers with an ELU activation in between to produce a residual correction, which is then added back to the original prediction and normalized via Sigmoid:
M′ = σ(M + R_M(J)), E′ = σ(E + R_E(J)),
where R_M and R_E denote the two independent refiners. This residual formulation ensures that each path corrects prediction errors rather than overwriting the original outputs, mitigating structural misalignment and enhancing consistency across branches.
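The residual cross-task refinement can be sketched as below. The hidden width of the two-layer refiners is an assumption (the ablation study reports a module total of 453 parameters, which this sketch does not attempt to match exactly), and the inputs are assumed to be pre-Sigmoid logits:

```python
import torch
import torch.nn as nn

class MutualGuidance(nn.Module):
    """Both paths read the concatenated (mask, edge) predictions, predict a
    residual correction, add it back, and normalize via Sigmoid.
    Hidden width 8 is an illustrative assumption."""
    def __init__(self, hidden: int = 8):
        super().__init__()
        def refiner() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(2, hidden, 3, padding=1),
                nn.ELU(inplace=True),
                nn.Conv2d(hidden, 1, 3, padding=1),
            )
        self.refine_mask = refiner()
        self.refine_edge = refiner()

    def forward(self, m: torch.Tensor, e: torch.Tensor):
        j = torch.cat([m, e], dim=1)                      # shared joint input J
        m_out = torch.sigmoid(m + self.refine_mask(j))    # residual correction
        e_out = torch.sigmoid(e + self.refine_edge(j))
        return m_out, e_out
```

Because each refiner only outputs a correction that is added to the original logits, a zero-initialized refiner would leave the initial predictions unchanged, which keeps the refinement stable early in training.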
2.3. Frame Field Alignment and Polygonization
To achieve the transition from pixel-level prediction to structured vector polygons for building contours, we introduce the Frame Field representation into the network. The frame field characterizes the principal direction and geometric consistency of local boundaries within an image. It is a direction field with second-order symmetry, described by a pair of complex coefficients, c_0 and c_2, at each pixel location, representing the local principal direction and its relative stability. In our network, the frame field is predicted from feature maps shared between the backbone network and decoder, serving as directional priors for edge structures.
After producing the frame field, we combine it with the mask and edge predictions to refine the contour. First, our algorithm employs Active Skeleton Model (ASM) optimization to refine the initial contour for precise alignment with the predicted frame field. The ASM constructs a skeleton graph from the predicted edge map and optimizes it by minimizing an energy function that jointly considers boundary probability, frame field alignment, and path length regularity. Subsequently, a critical corner detection step utilizes frame field information to identify key geometric vertices of buildings. Corner-aware simplification then minimizes non-corner vertices while preserving essential geometric structures. Finally, polygonization is performed, followed by polygon filtering to remove redundant or minute polygons that do not conform to building geometry standards, yielding the final high-precision vector building outline. Both the ASM optimization and corner detection procedures follow the implementation of Girard et al. [42].
2.4. Loss Function
To support accurate segmentation and stable geometric modeling, the network employs a composite loss comprising Segmentation Loss, Frame Field Loss, Output Coupling Loss, and a final weighted aggregation. This design jointly optimizes semantic prediction and structural consistency for vectorized building contour extraction.
2.4.1. Segmentation Loss
To achieve fine-grained modeling of regions and boundaries, we introduce a dual-branch segmentation head on top of the backbone network to separately predict the building interior regions ŷ_int and boundary regions ŷ_edge. A composite loss function, comprising a weighted blend of cross-entropy and Dice losses, is employed during training:
L_seg = α L_CE(ŷ, y) + β L_Dice(ŷ, y), computed for both the interior and the boundary predictions,
where y_int and y_edge denote the ground-truth masks for the building interior and boundary regions, respectively, and α and β are the balance coefficients.
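A hedged sketch of one such CE + Dice blend follows; the balance coefficients here are placeholders rather than the values used in the paper, and binary cross-entropy on probabilities is one plausible instantiation of the cross-entropy term:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Soft Dice loss; pred and target are probabilities in [0, 1], shape (B, 1, H, W).
    inter = (pred * target).sum(dim=(1, 2, 3))
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

def ce_dice_blend(pred: torch.Tensor, target: torch.Tensor,
                  alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # Weighted blend of binary cross-entropy and Dice; alpha, beta are placeholders.
    return alpha * F.binary_cross_entropy(pred, target) + beta * dice_loss(pred, target)

def segmentation_loss(p_int, p_edge, y_int, y_edge) -> torch.Tensor:
    # Applied to both the interior and the boundary predictions.
    return ce_dice_blend(p_int, y_int) + ce_dice_blend(p_edge, y_edge)
```

The Dice term compensates for the strong foreground/background imbalance of thin boundary masks, which plain cross-entropy handles poorly.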
2.4.2. Frame Field Loss
Building upon the semantic segmentation prediction, we introduce a frame field prediction branch to encode the directional field information of building boundaries. This branch takes the shared features as input and outputs a complex symmetric tensor field composed of c_0 and c_2, which fits the principal orientation of the building contours. Following Girard et al. [42], the two field directions are encoded as roots of the polynomial f(z) = z⁴ + c_2 z² + c_0, and we design three constraint terms:
Alignment Loss: averaging |f(τ)|² over boundary pixels, where τ is the unit tangent of the ground-truth contour, ensuring alignment with the contour tangential direction.
Orthogonal Alignment: applying the same alignment penalty to the 90°-rotated tangent, preventing the field from degenerating into a line field.
Smoothness Regularization: penalizing the spatial gradients of c_0 and c_2, enforcing spatial continuity to avoid local oscillations.
2.4.3. Output Coupling Loss
To enhance the collaborative relationship between different prediction branches, we introduce three output coupling terms:
Internal Segmentation and Frame Field Coupling: aligning the internal segmentation map's gradient direction with the frame field for geometric consistency.
Edge Segmentation and Frame Field Coupling: applying frame field alignment to the edge segmentation map's gradient for enhanced edge-direction coordination.
Internal Segmentation and Boundary Consistency: encouraging the edge map's strength to match the internal map's gradient magnitude at building boundaries, boosting the outer contour response.
2.4.4. Final Loss Composition
To prevent training instability due to differing numerical scales, each loss term undergoes statistical normalization with respect to the initial network parameters. These normalized losses are then linearly weighted and combined to form the final loss function:
L_total = Σ_i λ_i L_i,
where λ_i represents the weight for each individual loss term and L_i represents the loss term after normalization. The weights λ_i are set empirically based on the relative importance of each loss term: 10 for the segmentation loss, 1.5 for the edge loss, 1 for cross-field alignment, 0.2 for orthogonal alignment, and 0.005 for smoothness regularization. The three coupling-loss weights are gradually increased from 0 to 0.2 during the first 5 training epochs to stabilize early training, after which they remain fixed.
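The coupling-weight warm-up can be expressed as a simple schedule; a linear ramp is one plausible reading of "gradually increased from 0 to 0.2 during the first 5 training epochs":

```python
def coupling_weight(epoch: float, max_weight: float = 0.2,
                    warmup_epochs: int = 5) -> float:
    """Ramp a coupling-loss weight linearly from 0 to max_weight over the
    first warmup_epochs, then hold it fixed for the rest of training."""
    return max_weight * min(epoch / warmup_epochs, 1.0)
```

At the start of each epoch, the three coupling weights would be set to this value before the weighted sum is computed.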
2.5. Evaluation Metrics
Standard segmentation performance indicators, such as MS COCO's [47] Average Precision (AP) and Average Recall (AR) as well as their IoU-threshold variants (AP50, AP75, AR50, AR75), quantify regional overlap and matching performance. However, building contour extraction requires a stronger emphasis on geometric regularity and structural fidelity. Overlap-based measures often favor smooth or blurred boundaries and may not adequately capture the sharp angles and rectilinear structures characteristic of buildings, particularly under annotation noise. Therefore, we complement the standard metrics with geometry-oriented measures to more comprehensively evaluate contour quality.
2.5.1. Max Tangent Angle Error (MTA)
This metric [42] measures the maximum deviation between the tangent directions of the predicted and ground-truth contours, reflecting the preservation of local geometric morphology. Given the predicted and ground-truth contour curves, the tangent angle difference at each sampled point is the absolute difference between the tangent angles of the two curves at corresponding locations. The MTA is then defined as the maximum angle value over all sampled points.
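A minimal sketch of this computation is given below. It assumes the two contours are sampled with the same number of points and matched by index; the paper's point-matching scheme may differ:

```python
import math

def max_tangent_angle_error(pred, gt):
    """Maximum absolute difference (degrees) between tangent angles of
    corresponding edges on two closed contours given as vertex lists.
    Assumes equal-length sampling with matched indices (an assumption)."""
    def tangents(poly):
        n = len(poly)
        out = []
        for i in range(n):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
            out.append(math.atan2(y2 - y1, x2 - x1))
        return out

    diffs = []
    for a, b in zip(tangents(pred), tangents(gt)):
        d = abs(a - b) % (2 * math.pi)
        diffs.append(min(d, 2 * math.pi - d))  # wrap to [0, pi]
    return math.degrees(max(diffs))
```

A translated but otherwise identical polygon yields an error of zero, since tangent directions are invariant to translation.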
2.5.2. Invalid Polygon Ratio (IPR)
This metric [48] measures the proportion of predictions that fail to form valid closed polygons (e.g., self-intersections or missing edges), indicating the structural legality and robustness of the reconstructed contours.
2.5.3. Edge Smoothness (ES)
Computed as the standard deviation of internal polygon angles, this metric reflects boundary regularity. Lower values correspond to cleaner, more rectilinear edges and stronger corner preservation.
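One plausible implementation of this definition, assuming the polygon is given as an ordered vertex list without a repeated closing vertex:

```python
import math

def edge_smoothness(polygon):
    """Population standard deviation of a polygon's interior angles, in
    degrees. `polygon` is an ordered list of (x, y) vertices; this is one
    plausible reading of the ES metric."""
    n = len(polygon)
    angles = []
    for i in range(n):
        (ax, ay) = polygon[i - 1]          # previous vertex
        (bx, by) = polygon[i]              # current vertex
        (cx, cy) = polygon[(i + 1) % n]    # next vertex
        v1 = (ax - bx, ay - by)
        v2 = (cx - bx, cy - by)
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        angles.append(math.degrees(abs(math.atan2(cross, dot))))
    mean = sum(angles) / n
    return math.sqrt(sum((a - mean) ** 2 for a in angles) / n)
```

A perfect rectangle scores 0 (all interior angles are 90°), while irregular or jagged boundaries produce larger values.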
2.5.4. Abnormal Convex Hull Ratio (ACHR > 1.5)
This evaluates the proportion of polygons whose convex hull area exceeds the polygon area by a factor greater than 1.5, capturing anomalies such as excessive concavity or incomplete reconstruction.
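The per-polygon test behind this ratio can be sketched with a shoelace area and a monotone-chain convex hull; the helper names are ours, not from the paper:

```python
def shoelace_area(pts):
    """Absolute polygon area via the shoelace formula."""
    n = len(pts)
    s = 0.0
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def half(points):
        h = []
        for p in points:
            while len(h) >= 2 and ((h[-1][0] - h[-2][0]) * (p[1] - h[-2][1]) -
                                   (h[-1][1] - h[-2][1]) * (p[0] - h[-2][0])) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = half(pts), half(pts[::-1])
    return lower[:-1] + upper[:-1]

def is_abnormal(polygon, threshold=1.5):
    """Flags a polygon whose convex-hull area exceeds its own area by more
    than `threshold`, i.e., the ACHR > 1.5 criterion for a single polygon."""
    return shoelace_area(convex_hull(polygon)) > threshold * shoelace_area(polygon)
```

The ACHR metric is then the fraction of predicted polygons for which `is_abnormal` returns true.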
Together, these geometric metrics offer a multi-dimensional assessment of contour quality—covering structural correctness, edge sharpness, legality, and smoothness—thereby complementing traditional region-based accuracy measures.
2.6. Datasets
We compare the performance of our method with that of state-of-the-art methods on two publicly available datasets.
2.6.1. Inria Dataset
The Inria dataset [49] is a comprehensive benchmark for building extraction, comprising orthorectified RGB aerial imagery with a spatial resolution of 0.3 m and a tile size of 5000 × 5000 pixels. The imagery is collected from different cities, covering diverse urban morphologies, building densities, and architectural styles. This diversity makes the dataset well suited for evaluating model generalization. Given its high spatial resolution, the dataset requires models to accurately delineate both large building footprints and small, detailed structures. Variations in building shape, size, and orientation, along with background clutter, further increase the difficulty of achieving precise segmentation, making Inria a challenging benchmark for fine-grained building extraction. Following the official benchmark protocol, the dataset is split into disjoint training and testing areas. The training subset is used to learn model parameters, while all quantitative evaluations are conducted on the official test set only.
2.6.2. CrowdAI Dataset
The CrowdAI dataset [50] consists of high-resolution satellite imagery paired with annotated building outlines. Owing to its geographical diversity and the high fidelity of its labels, it serves as a benchmark for evaluating building extraction methods. The dataset is available in two variants: a full version and a small version. Both provide RGB images with a resolution of 0.3 m per pixel and a fixed spatial size of 300 × 300 pixels. The full version is a large-scale benchmark, containing 280,741 training images and 60,317 test images. The small version is utilized for resource-efficient experimentation, particularly for ablation studies, with 8366 training images and 1820 test images. For both versions, models are trained on the official training split, and evaluation metrics are computed on the corresponding test set.
2.7. Experimental Setup
All experiments were conducted under a unified hardware and software environment to comprehensively validate the proposed method. The hardware platform is equipped with an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), and the software environment is based on Windows, using PyTorch 2.3.0 and Python 3.9. All models are trained using the Adam optimizer, with the base learning rate scaled by the effective batch size and capped at a fixed maximum value. An exponential learning rate scheduler with a decay factor of 0.95 is applied throughout training. The batch size is set to 2 per GPU, and all models are trained for 150 epochs. No weight decay or dropout is applied.
To verify the effectiveness and generalization capability of EA-DBFFN, experiments were conducted on two datasets: Inria and CrowdAI. First, a series of ablation studies was designed to evaluate the contribution of each module within the dual-task decoder, all conducted on the small version of CrowdAI for efficiency. Second, to validate the representativeness and reliability of the results, a comparative analysis was performed between our proposed network and mainstream benchmarks (e.g., Mask R-CNN [9], PANet [51], TransBuilding [52], PolyMapper [14], Frame Field Network [42], and PolyWorld [17]) across the Inria and full CrowdAI datasets.
To maintain fair and consistent evaluation, slightly different strategies were adopted for the two datasets. On the Inria dataset, we report only COCO-style segmentation metrics, as the dataset focuses on large-scale region extraction and most existing methods provide only raster-level results without publicly available polygon predictions, making polygon-level comparison impractical. In contrast, the CrowdAI dataset provides richer annotations and is a mainstream dataset for polygon-based building extraction, so we conducted a more comprehensive evaluation, including COCO metrics, polygon quality indicators, and computational efficiency. Finally, visual comparisons in complex urban scenes are provided to highlight our method’s advantages in detailed boundary depiction, structural clarity, and overall polygon regularity.
3. Results
This section provides a thorough analysis of the results obtained using EA-DBFFN. We first report results from ablation studies to evaluate the contribution of each module within the dual-task decoder, followed by comparisons with several advanced networks on the Inria and CrowdAI datasets. To ensure fairness and consistency, COCO-style segmentation metrics are used for both datasets, while polygon-level and efficiency evaluations are conducted only on CrowdAI due to its richer annotations. Finally, qualitative visualizations further demonstrate the advantages of our approach in extracting building polygons in complex urban environments. Overall, the findings demonstrate the reliability and practical utility of the proposed method for precise building polygon extraction.
3.1. Ablation Studies
To assess the contribution of each component, we performed three sets of ablation experiments, focusing on the architectural design and task collaboration mechanisms. The configurations for these studies are illustrated in Figure 4.
We first evaluated each branch of the dual-task decoder independently. Two ablated models were built: an edge-only variant (using the baseline mask prediction, Figure 4a) and a mask-only variant (using the baseline edge prediction, Figure 4b), isolating the geometric and semantic modeling effects, respectively. We then removed the inter-branch feedback mechanism (Figure 4c) to assess the importance of mutual guidance; this tests how cross-task interaction improves structural consistency and prediction accuracy.
All ablation experiments were conducted on the small-scale CrowdAI dataset for 150 epochs. Models are trained on the corresponding training split, and performance is evaluated on the test set. Table 1 presents the accuracy results for each model.
The results show that enabling a single branch yields slight performance gains over the baseline, with a 1% AP improvement and marginal AR differences. The dual-branch structure (row 4) improves AP by 1.4% and AR by 1.6%, confirming the synergistic benefit of modeling both region and edge features. Finally, introducing the mutual guidance module (row 5) leads to the best performance, especially on stricter metrics such as AP75 (59.8%) and AR75 (66.9%), validating its effectiveness in promoting joint feature refinement. The parameter counts listed in Table 1 provide clear evidence that the performance gains are driven by the proposed architectural design rather than by increased model capacity. Although the mask and edge branches together introduce 87,685 additional parameters, the AP improvement from the baseline to the dual-branch variant without mutual guidance is only 1.4%. In contrast, the Mutual Guidance Module introduces merely 453 parameters, accounting for 0.31% of the total, yet produces the largest single performance jump of 4.3%. This disproportionate relationship between parameter count and performance gain confirms that the improvements stem from the architectural design itself.
Table 2 further presents the geometric quality metrics for each model variant, revealing a consistent improvement trend as modules are progressively added. The baseline model yields an MTA of 33.3, an IPR of 7.42‰, an ACHR of 4.88‰, and an ES of 145.1. Introducing the edge branch alone improves boundary localization, reducing MTA to 32.6 and IPR to 4.32‰. The mask branch contributes more notably to polygon regularity, with ACHR dropping to 2.45‰ and ES to 139.2. Combining both branches without mutual guidance achieves further improvement, particularly in ACHR (1.09‰). The full EA-DBFFN achieves the best results across all geometric indicators, with an MTA of 30.8, IPR of 4.07‰, ACHR of 1.08‰, and ES of 130.3, demonstrating that the mutual guidance module plays a critical role in refining both boundary precision and overall polygon regularity.
Figure 5 presents visual comparisons of different model variants across typical regions, with each row showing a scene and each column corresponding to a model configuration. Visually, the baseline model trained without any of the proposed modules exhibits contour blurring, edge discontinuities, and missed detections in complex or densely arranged buildings. Introducing an improved single branch helps enhance geometric consistency and sharpen edge details; for instance, in the first row, the “Only-Edge” model shows better contour closure than the dual-branch variant, reflecting its strength in localized detail refinement. Nevertheless, from an overall perspective, the dual-branch structure more comprehensively captures semantic shapes and geometric boundaries, producing clearer contours and more complete segmentation, with generally more regular predicted edges compared to single-branch alternatives. When the mutual guidance module is incorporated, inter-branch coordination is further strengthened, resulting in continuous boundaries, higher contour closure, and more regular shapes across most highlighted areas. This demonstrates stronger generalization, particularly in complex clusters or heavily occluded regions.
In summary, although single-branch models sometimes exhibit finer local details, the overall advantages of the dual-branch structure combined with the mutual guidance mechanism are more significant. This combination enhances object completeness while maintaining edge precision, substantially boosting the stability and robustness of building contour extraction.
3.2. Results on the Inria Dataset
This section presents a detailed evaluation of our method on the Inria dataset, with comparisons against several leading approaches, including Mask R-CNN, PANet, PolyMapper, and the original Frame Field Network.
Table 3 presents the quantitative comparison of several models under standard instance segmentation metrics. Our EA-DBFFN demonstrates superior overall performance, with AP and AP75 values of 46.8% and 49.5%, respectively, outperforming all other methods. The improvement at higher IoU thresholds highlights the model’s superior boundary localization and geometric precision in building extraction.
Moreover, EA-DBFFN achieves a significantly higher AR (64.4%) compared to the other models, indicating enhanced detection completeness across varying building scales and densities. These improved results are a consequence of employing a dual-branch framework, which jointly optimizes edge and region representations, allowing the network to better capture detailed boundaries while maintaining region consistency. While PolyMapper and PANet exhibit competitive performance, their drop at stricter IoU thresholds suggests weaker polygon refinement capability.
Figure 6 provides qualitative visualizations of the predicted building polygons on several Inria samples. It can be observed that our method is capable of generating geometrically regular and detailed polygonal structures that closely follow the true building outlines, regardless of variations in building density, size, or shape across different urban environments. This further confirms the stability and adaptability of our method on diverse aerial scenes.
3.3. Results on the CrowdAI Dataset
This section presents a detailed evaluation of the proposed method on the full CrowdAI dataset, benchmarking its performance against leading state-of-the-art approaches such as Mask R-CNN, PANet, TransBuilding, PolyMapper, the original Frame Field Network, and PolyWorld. We quantitatively evaluated each model on standard segmentation metrics, polygon geometric quality, and model efficiency, supplemented by visualizations, to comprehensively demonstrate our method’s effectiveness.
3.3.1. Analysis of Region-Based Standard Evaluation Metrics
Table 4 presents the performance of various methods under standard instance segmentation metrics. Our EA-DBFFN achieved an AP of 63.7%, outperforming all others (PolyWorld 63.3%, FrameField 61.3%). Notably, EA-DBFFN scored highest at 72.9% for AP75, indicating high overlap with ground-truth building regions even under stricter IoU thresholds.
While PolyWorld excelled in AR and its variants, EA-DBFFN’s AP advantage, especially at high IoU thresholds (e.g., AP75), is more critical for high-precision building polygon extraction. This suggests a trade-off between recall and geometric precision, which is examined further in the Discussion section.
3.3.2. Analysis of Polygon Geometric Quality Metrics
To comprehensively evaluate the geometric quality of the extracted building polygons, we used the Max Tangent Angle Error (MTA), Invalid Polygon Ratio (IPR), Abnormal Convex Hull Ratio (ACHR), and Edge Smoothness (ES).
Table 5 shows results for FrameField, PolyWorld, and EA-DBFFN.
Table 5 reveals EA-DBFFN’s significant advantages in polygon geometric quality. Our MTA decreased to 30.8, lower than PolyWorld (32.9) and FrameField (31.9). A smaller MTA indicates better alignment of extracted polygon boundaries with ground-truth tangents, demonstrating our advantage in capturing precise geometric shapes. EA-DBFFN’s Invalid Polygon Ratio (6.55‰) and Abnormal Convex Hull Ratio (2.06‰) are significantly lower than those of PolyWorld (19.44‰, 3.45‰) and FrameField (7.27‰, 2.57‰). This implies better internal consistency and topological correctness in our network’s masks and frame fields, substantially reducing polygon degradation or distortion and ensuring higher usability.
EA-DBFFN’s Edge Smoothness metric is 133.1, significantly better than PolyWorld’s 486.7 and FrameField’s 149.7. A lower Edge Smoothness value implies smoother, more natural extracted building polygon boundaries, reducing jagged or irregular edges, which is crucial for high-quality GIS data production.
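As a concrete illustration of two of these geometric quality checks, the sketch below computes per-mille ratios of degenerate polygons over a batch. The specific criteria used here (self-intersection for invalidity, and a polygon-to-convex-hull area ratio for hull abnormality, with an assumed threshold) are illustrative stand-ins, since the exact definitions behind IPR and ACHR are given elsewhere in the paper.

```python
# Illustrative polygon quality checks; the criteria below (self-intersection,
# polygon/hull area ratio with an assumed 0.5 threshold) are assumptions,
# not the paper's exact IPR/ACHR definitions.

def _orient(a, b, c):
    """Cross product of (b - a) and (c - a); sign gives turn direction."""
    return (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])

def _segments_cross(p1, p2, q1, q2):
    """True if segments p1p2 and q1q2 properly intersect."""
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def is_self_intersecting(poly):
    """Check every pair of non-adjacent edges of a closed ring."""
    n = len(poly)
    edges = [(poly[i], poly[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:  # adjacent across the ring closure
                continue
            if _segments_cross(*edges[i], *edges[j]):
                return True
    return False

def area(poly):
    """Shoelace formula (absolute value)."""
    n = len(poly)
    s = sum(poly[i][0]*poly[(i+1) % n][1] - poly[(i+1) % n][0]*poly[i][1]
            for i in range(n))
    return abs(s) / 2.0

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and _orient(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]

def quality_ratios(polygons, hull_ratio_thresh=0.5):
    """Per-mille ratios of invalid and abnormal-convex-hull polygons."""
    invalid = sum(is_self_intersecting(p) for p in polygons)
    abnormal = sum(area(p) / area(convex_hull(p)) < hull_ratio_thresh
                   for p in polygons)
    n = len(polygons)
    return 1000.0 * invalid / n, 1000.0 * abnormal / n
```

Under these assumed criteria, a simple square passes both checks, while a "bowtie" ring (a square with two swapped vertices) is flagged by both, since its edges cross and its signed area collapses relative to its convex hull.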
The geometric quality metrics also provide additional insight into the recall gap between EA-DBFFN and PolyWorld observed in Table 4. PolyWorld’s higher AR of 75.4% should be interpreted with caution. Its substantially higher Invalid Polygon Ratio of 19.44‰, compared to EA-DBFFN’s 6.55‰, suggests that a non-trivial proportion of its recalled instances are geometrically degraded polygons. Such polygons, despite being matched to ground truth under lower IoU thresholds, may not meet the geometric quality requirements of practical GIS applications. In this sense, PolyWorld’s higher AR may partly reflect lenient matching of low-quality predictions rather than genuinely complete building detection. In contrast, EA-DBFFN applies strict geometric constraints during polygonization to filter out incomplete or irregular predictions, which contributes to its lower recall but ensures that all detected buildings are represented with accurate and regularized boundaries. This property is critical for downstream GIS tasks such as cadastral mapping and 3D city modeling, where boundary accuracy directly affects the quality of subsequent analysis.
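This matching caveat can be made concrete with a toy mask-based IoU computation: a geometrically degraded prediction can clear the 0.5 matching threshold yet fail at 0.75. Grid cells stand in for pixels here; this is an illustrative sketch, not the COCO evaluator.

```python
# Toy illustration of IoU-threshold matching on rasterized masks.
# Integer grid cells stand in for pixels; this is a sketch, not the
# actual COCO evaluation implementation.

def iou(mask_a, mask_b):
    """Intersection-over-union of two sets of occupied cells."""
    inter = len(mask_a & mask_b)
    union = len(mask_a | mask_b)
    return inter / union if union else 0.0

def rect(x0, y0, x1, y1):
    """All integer cells in the half-open box [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

gt = rect(0, 0, 10, 10)            # 100-cell ground-truth building
good_pred = rect(0, 0, 10, 9)      # near-complete footprint
degraded_pred = rect(0, 0, 10, 6)  # geometrically degraded footprint

print(iou(gt, good_pred))      # 0.9 -> matched at both IoU 0.5 and 0.75
print(iou(gt, degraded_pred))  # 0.6 -> matched at 0.5, rejected at 0.75
```

The degraded prediction still counts as a recalled instance under an IoU-0.5 match, which is exactly why a high AR can coexist with a high invalid-polygon ratio.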
Figure 7 intuitively demonstrates the visual effect of our method’s building polygon extraction in different scenarios, further confirming the quantitative analysis results. For irregular or complex buildings, the original FrameField and PolyWorld often show overly smooth or distorted boundaries (red circles). EA-DBFFN precisely captures these fine geometric features, yielding polygon boundaries that better align with ground-truth contours and sharper edges, consistent with our lower MTA and higher AP75. In cases of tree canopy occlusion, as shown in rows 2 and 4, vertex-prediction methods like PolyWorld often miss vertices, leading to incomplete or entirely missed buildings. Our method, however, exhibits stronger robustness, effectively identifying and completely extracting these challenging targets, thus reducing abnormal convex hulls and invalid polygons.
3.3.3. Analysis of Model Efficiency and Inference Speed
To comprehensively evaluate the network’s computational performance, we compared its floating-point operations (FLOPs) and accuracy (AP75) with representative vertex- and mask-based methods, as shown in Table 6 and Figure 8.
Vertex-based approaches such as PolyMapper and PolyWorld achieve relatively high precision but at the expense of extremely high computational costs. In contrast, mask-based methods including Mask R-CNN and PANet exhibit lower complexity but limited accuracy. FrameField achieves a good balance between these two paradigms. Building upon this foundation, our EA-DBFFN achieves the highest accuracy (AP75 of 72.9%) while maintaining the same FLOPs (204.4G) as FrameField, clearly standing out in the upper-left region of Figure 8. Notably, although PolyMapper and PolyWorld require 3.5× and 2.2× the FLOPs of EA-DBFFN, respectively, and PolyMapper employs a larger VGG-16 backbone, both methods yield lower AP75 scores of 65.1% and 70.5%. These results demonstrate that the performance gains of EA-DBFFN are attributable to the proposed architectural design rather than to a larger computational budget or a more powerful backbone, and that the method achieves a favorable balance between accuracy and efficiency.
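The accuracy/compute trade-off can be tallied directly from the figures reported above. In the sketch below, the absolute FLOPs for PolyMapper and PolyWorld are derived from the stated 3.5× and 2.2× multipliers over EA-DBFFN’s 204.4 GFLOPs, so they are approximate.

```python
# Accuracy-per-compute tally using only the figures reported in the text.
# PolyMapper/PolyWorld FLOPs are derived from the stated 3.5x and 2.2x
# multipliers over EA-DBFFN's 204.4 GFLOPs, so they are approximate.

models = {
    "EA-DBFFN":   {"ap75": 72.9, "gflops": 204.4},
    "PolyMapper": {"ap75": 65.1, "gflops": 3.5 * 204.4},  # ~715.4 G
    "PolyWorld":  {"ap75": 70.5, "gflops": 2.2 * 204.4},  # ~449.7 G
}

for name, m in models.items():
    eff = m["ap75"] / m["gflops"]  # AP75 points per GFLOP
    print(f"{name:>10}: {m['gflops']:6.1f} GFLOPs, AP75 {m['ap75']:.1f}, "
          f"{eff:.3f} pts/GFLOP")
```

On this simple points-per-GFLOP measure, EA-DBFFN is more than twice as compute-efficient as PolyWorld and more than three times as efficient as PolyMapper, which is the quantitative content behind its placement in the upper-left region of Figure 8.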
In summary, our EA-DBFFN achieved excellent performance on the CrowdAI dataset. Its innovative dual-branch mask prediction and dual-task mutual guidance modules significantly enhanced the alignment accuracy between masks and frame fields. Quantitatively, the method demonstrated superior instance segmentation and notably improved polygon geometric quality, reducing the MTA, invalid polygon ratio, and abnormal convex hull ratio while substantially improving edge smoothness. Furthermore, its high computational efficiency and fast inference speed underscore its practicality for high-precision mapping tasks requiring efficient processing of large volumes of image data.