1. Introduction
Buildings are a fundamental component of urban structure and are essential to land use, environmental monitoring, disaster assessment, transportation planning, and other domains. The advancement of remote sensing technology offers a new method for highly accurate and large-scale building information collection. We can accomplish dynamic monitoring of urban spatial morphology and supply essential data support for smart city development and emergency response by extracting building contours from remote sensing images. However, the complicated background information, dense distribution, and varied appearance of buildings make it difficult to develop an efficient automatic building footprint extraction method. Traditionally, building contour extraction relies on a series of rule-based image processing techniques, including edge detection, threshold segmentation, morphological operations, region growing, watershed transform, etc. These methods mainly rely on the brightness changes, color, or texture features of the image to distinguish buildings from other objects. In recent years, with the rapid development of deep neural networks and the increasing abundance of building annotation data, building extraction methods based on deep learning have become mainstream and have been widely used and explored in many fields such as remote sensing, geographic information systems (GISs), and computer vision [1,2,3,4,5]. However, considerable regional variations in building color, shape, area, and material, as well as the high similarity and hazy boundaries between buildings and other landforms in remote sensing images, make it difficult to automatically and accurately extract building contours from these images [6].
Most deep learning-based studies formulate building extraction as a pixel-wise semantic labeling task, in which each pixel is classified as “building” or “non-building” by a carefully designed network architecture. Even though such methods perform well when creating binary segmentation masks, turning these masks into the set of building polygons needed for mapping and geographic applications is difficult. Although post-processing and regularization procedures can help these pixel-based methods generate higher-quality building polygons [7,8], the separation of the segmentation and regularization procedures means that such methods usually need to train multiple separate models for segmentation, regularization, and vectorization [9]. These pipelines are inefficient and complicated, and errors accumulate at every stage. The resulting polygons often have inaccurate edges and differ greatly from the actual building contours (Figure 1b), which limits their application in real engineering.
To meet the demands of practical applications in geospatial analysis, contour-based techniques [10,11,12] have emerged as an alternative to pixel-by-pixel segmentation, directly predicting the polygon of each building instance. Existing contour-based methods can be broadly categorized into two approaches. The first approach [10,11,12,13] constructs building polygons by predicting the precise coordinates and sequential arrangement of building vertices. Its core challenge lies in accurately determining both vertex positions and their ordering to form a polygon that faithfully represents the building’s geometry. Typically employing convolutional neural networks (CNNs), this approach detects all building corners across the input image and infers their connection order to form the polygon, offering a conceptually straightforward framework. However, it frequently suffers from missed corner detections, particularly for inconspicuous vertices, sometimes leading to entire building instances being overlooked (Figure 1c). Moreover, an incorrect prediction of one vertex often propagates errors to subsequent vertices, compromising the overall polygon quality. Crucially, the lack of an effective compensation mechanism limits its ability to represent irregular building shapes accurately. Additionally, the requirement for pixel-level vertex and direction prediction often necessitates complex models with higher parameter counts and longer inference times.
The second approach [14,15,16,17] initializes a fixed number of ordered polygon vertices and iteratively optimizes their positions to converge on the final building polygon. While simplifying the problem formulation, this method overlooks the inherent geometric diversity of buildings. Specifically, vertex positions are refined through iterative coordinate regression. A significant drawback is that the preset, fixed vertex count is inherently mismatched to the variable complexity of building outlines. This results in redundant vertices for simple buildings and insufficient vertices for complex ones (Figure 1d). Furthermore, these methods lack mechanisms to prune redundant points or prioritize key vertices, often leading to overly smoothed polygons. This excessive smoothing obscures critical geometric features such as right angles and sharp edges, diminishing the fidelity of the representation. Particularly for buildings with intricate internal structures or unusual geometries, this smoothing tendency can cause significant divergence between the predicted polygon and the actual building contour.
To address the aforementioned limitations, this paper proposes an end-to-end polygon dynamic adjustment algorithm (PDAA). PDAA is designed to adapt to complex and varied building shapes while reducing reliance on large-scale labeled data and maintaining low inference latency. Furthermore, it aims to enhance both the accuracy and geometric fidelity of the predicted building polygons. Our approach initially concentrates on localized regions of interest (RoIs). Leveraging the detection head of a deep learning model, we predict a set of building bounding boxes based on the feature map derived from the input image. These bounding boxes pinpoint the location of individual buildings. Crucially, a variable number of polygon vertices is then generated per instance to construct the initial building polygon. This strategy enables the capture of a building’s general structure while accommodating structures of diverse sizes and shapes.
PDAA incorporates four specialized modules:
- (1) Feature Enhancement Module: This focuses on capturing key feature points within the RoI, thereby deepening the model’s understanding of specific building details and refining polygon generation accuracy.
- (2) Contour Vertex Adjustment Module: This operates on the initial polygon vertices within the proposal. It extracts detailed coordinate information, encodes instance-specific features into the RoI features, and learns to predict displacements for each vertex, relocating them to more accurate positions.
- (3) Learnable Redundant Vertex Removal Module: This functions as a corner point classifier, specifically tasked with distinguishing true corner points from redundant ones and eliminating the latter. This ensures that the final polygon comprises only essential vertices, enhancing its quality and geometric consistency.
- (4) Missing Vertex Completion Module: This iteratively optimizes the positions of incorrectly predicted vertices, gradually recovering overlooked key feature points. This mechanism guarantees the generation of high-precision polygons, even for complex structures or small building features.
In summary, the main contributions of this paper include the following:
- (1)
We propose an end-to-end contour-based polygon dynamic adjustment algorithm (PDAA) for high-quality building contour extraction. The framework combines efficient inference with high precision: a dynamic vertex optimization mechanism driven by contour features directly generates regularized building contours consisting of key corner points, achieving both accuracy and efficiency in building contour extraction.
- (2)
A deep learning model is applied to generate the initial polygons, and a technique that concentrates on local features within the RoI is proposed. The prediction accuracy and geometric similarity are significantly improved through four core modules (feature enhancement, contour vertex adjustment, redundant vertex removal, and missing vertex completion).
- (3)
The prediction process is simplified, the computational complexity and running time are reduced, and polygon vertices are adaptively generated according to each building instance, making the prediction results closer to the geometric characteristics of real buildings and significantly improving both accuracy and visual quality.
3. Method
The overall process of PDAA is shown in Figure 2. The key idea of PDAA is to view the building polygon as an extension of the bounding box and to utilize four main components to generate the final prediction. Given an input image, we first apply the CNN backbone to extract the image feature map and then enhance it to obtain the enhanced feature map $F_E$. We then build a detection head on $F_E$ to predict a set of building detection boxes and construct the initial contour of each building. Subsequently, PDAA generates a building polygon for each instance through three key modules: contour vertex adjustment, redundant vertex removal, and missing vertex completion. The contour vertex adjustment module predicts the offset of each point of the initial building contour to adjust the contour position; the redundant vertex removal module then removes redundant vertices and retains the key vertices to generate the building polygon. After that, the missing vertex completion module takes the building polygon and the feature map $F_E$ as input, predicts the missing building vertices, and restores the polygon with the missing vertices as the final building polygon of the instance.
3.1. Initial Contour Generation
3.1.1. Backbone
The feature map is used to identify and locate building instances and should encode both semantic and spatial information. PDAA adopts a convolutional neural network (CNN) based on the deep layer aggregation architecture, namely DLA [38], as the backbone network to achieve an effective fusion of low-level features with strong spatial information and high-level features rich in semantic information. The CNN backbone network structure is shown in Figure 3; it generates a feature pyramid using a top-down and bottom-up approach, with a number of residual connections at each level to guarantee seamless information transfer between scales. Building detection performance is enhanced by the model’s ability to learn richer and more representative feature representations through this bidirectional information flow.
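To make the bidirectional fusion concrete, the following is a minimal PyTorch sketch of a top-down/bottom-up feature aggregation with residual (additive) connections. The channel widths, number of pyramid levels, and the name BidirectionalAggregation are illustrative assumptions, not the actual DLA implementation.

```python
import torch
import torch.nn as nn

class BidirectionalAggregation(nn.Module):
    """Minimal sketch of the top-down/bottom-up fusion described above.

    The real DLA backbone is more involved; channel sizes and the number
    of pyramid levels here are illustrative assumptions.
    """
    def __init__(self, channels=(64, 128, 256, 512), out_ch=64):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in channels)
        self.down = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
                                  for _ in channels[:-1])

    def forward(self, feats):  # feats: high-res -> low-res pyramid levels
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pass with residual (additive) connections
        for i in range(len(lat) - 2, -1, -1):
            lat[i] = lat[i] + nn.functional.interpolate(
                lat[i + 1], size=lat[i].shape[-2:], mode="bilinear",
                align_corners=False)
        # bottom-up pass re-injects strong spatial information
        for i in range(1, len(lat)):
            lat[i] = lat[i] + self.down[i - 1](lat[i - 1])
        return [s(f) for s, f in zip(self.smooth, lat)]
```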
3.1.2. Feature Enhancement
To address the challenges of the multi-scale object detection task, we designed a feature enhancement module (FEC) that processes the feature maps extracted by the backbone network to increase their representation capability. This module can efficiently capture objects of various scales in an input image while maintaining spatial resolution by merging information from receptive fields of various sizes. A global average pooling branch, three dilated convolution branches with varying dilation rates, a final fusion layer, and a 1 × 1 convolution branch make up the FEC module’s four primary components. The structure of the FEC network is shown in Figure 4. A 1 × 1 convolution kernel is used to reduce the number of channels and the computational complexity while retaining the original spatial information and alleviating the spatial information loss caused by multiple downsampling operations. The three dilated convolution branches use three distinct dilation rates (r = 6, 12, 18). These branches can expand the receptive field without increasing the number of model parameters, thereby capturing contextual information over a wider range. In this way, FEC is able to handle objects of multiple scales and improve the recognition accuracy for objects of different sizes. The global average pooling branch first applies an adaptive average pooling operation to the feature map, reducing it to a single pixel to obtain global contextual information. Then, we restore this feature map to its original size through bilinear interpolation to ensure consistency with the outputs of the other branches. Finally, we concatenate the outputs of all branches along the channel dimension and further integrate the information through a 1 × 1 convolution layer to generate the final feature map. This process not only integrates the information from each branch but also serves as a dimensionality reduction, making subsequent processing more efficient.
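The following PyTorch sketch illustrates the branch structure just described: a 1 × 1 convolution, three dilated convolutions with rates 6, 12, and 18, a global average pooling branch, and a 1 × 1 fusion layer. The 64-channel width matches the feature dimension used later in Section 3.2.1; normalization and activation details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEC(nn.Module):
    """Sketch of the feature enhancement module described above (ASPP-like)."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        # branch 1: 1x1 conv keeps spatial detail and reduces channels
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        # branches 2-4: dilated 3x3 convs with rates 6, 12, 18
        self.dilated = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (6, 12, 18))
        # branch 5: global average pooling for image-level context
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        # fusion: concatenate the five branches and project back with a 1x1 conv
        self.fuse = nn.Conv2d(5 * out_ch, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.conv1x1(x)] + [d(x) for d in self.dilated]
        # restore the pooled branch to the input resolution before fusing
        g = F.interpolate(self.gap(x), size=(h, w), mode="bilinear",
                          align_corners=False)
        return self.fuse(torch.cat(feats + [g], dim=1))
```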
The feature map from the backbone network is fed into the FEC module, where it passes through the four branches mentioned above. Each branch independently extracts information of a specific type or scale. Afterwards, the results of all branches are concatenated and processed through the fusion layer to produce the enhanced feature representation $F_E$.
As described in Figure 1, the detection head comprises two sibling branches, which are used to carry out a dense prediction of building center points and bounding boxes. For each detected building instance, the detection box is used to identify four side center points (top, left, bottom, and right). Subsequently, these center points are regressed to obtain four extreme points (topmost, leftmost, bottommost, and rightmost) [15], which collectively define the building’s bounding box. Next, four edges are constructed: each edge is centered on one extreme point, aligned with the corresponding bounding box side, and has a length equal to 1/4 of that side. Any edge segment extending beyond the bounding box boundary is truncated. Finally, an octagonal contour is formed by sequentially connecting the endpoints of these edges [39]. PDAA then uniformly samples vertices along this octagonal contour to generate the initial polygon.
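A minimal NumPy sketch of this construction is given below, assuming (x, y) pixel coordinates with the y-axis pointing down; the sampling count n = 128 is an illustrative default, not a value specified in the paper.

```python
import numpy as np

def octagon_from_extremes(ex, box):
    """Sketch of the initial-contour construction described above.

    ex:  extreme points (top, left, bottom, right), shape (4, 2) as (x, y)
    box: bounding box (x_min, y_min, x_max, y_max)
    Each edge is centered on an extreme point, lies on the corresponding
    box side, and has length 1/4 of that side; clipping truncates edges
    that extend beyond the box.
    """
    t, l, b, r = ex
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    pts = [
        (t[0] - w / 8, y0), (t[0] + w / 8, y0),  # edge through topmost point
        (x1, r[1] - h / 8), (x1, r[1] + h / 8),  # edge through rightmost point
        (b[0] + w / 8, y1), (b[0] - w / 8, y1),  # edge through bottommost point
        (x0, l[1] + h / 8), (x0, l[1] - h / 8),  # edge through leftmost point
    ]
    pts = np.array(pts, dtype=np.float32)
    pts[:, 0] = pts[:, 0].clip(x0, x1)           # truncate at the box boundary
    pts[:, 1] = pts[:, 1].clip(y0, y1)
    return pts

def uniform_sample(contour, n=128):
    """Uniformly resample n points along a closed polygonal contour."""
    closed = np.vstack([contour, contour[:1]])
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0], np.cumsum(seg)])
    targets = np.linspace(0, cum[-1], n, endpoint=False)
    idx = np.searchsorted(cum, targets, side="right") - 1
    t = (targets - cum[idx]) / np.maximum(seg[idx], 1e-8)
    return closed[idx] + t[:, None] * (closed[idx + 1] - closed[idx])
```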
This module efficiently generates an initial outline for each building instance. While existing contour-based methods also produce building polygons, they often incur higher inference times or yield overly smoothed shapes. In contrast, our approach eliminates the need for explicit vertex order prediction and dynamically adjusts the number of polygon vertices according to the specific building shape. This dual advantage significantly reduces inference time while simultaneously improving PolySim.
3.2. Contour Evolution Module
The initial contour only roughly indicates the approximate location of a building and cannot accurately represent its outline. The contour evolution stage therefore refines it in two steps. First, each vertex of the initial contour is encoded as a feature vector and fed to the contour vertex adjustment module, which predicts a per-vertex offset that is added to the corresponding vertex coordinates. Then, to keep the building polygon concise and efficient, the redundant vertex removal module removes redundant vertices and retains only the building inflection points, ensuring that the final generated polygon contains only necessary vertices and thereby improving its quality and geometric consistency.
3.2.1. Contour Vertex Adjustment
The initial contour obtained can only roughly represent the approximate location of the building; it cannot accurately represent the building contour. Therefore, each vertex of the initial contour is encoded as a feature vector and used as input to the contour vertex adjustment module, which predicts the offset of each point; the offset is added to the corresponding vertex coordinates to obtain the new vertex. Figure 5 shows the network structure for adjusting the contour points.
Let $X = \{x_i\}_{i=1}^{N}$ be the set of N initial points and $\Delta X = \{\Delta x_i\}_{i=1}^{N}$ be the predicted point offsets; then the prediction process is as follows:
$$\Delta X = f_{\theta}(F), \tag{1}$$
where $f_{\theta}$ represents the vertex offset prediction model with parameters θ. F is the input vector of the model, which is constructed as a set of feature vectors of the N initial points:
$$F = \{\,F_E(x_i) \oplus x_i\,\}_{i=1}^{N}. \tag{2}$$
Here, $\oplus$ is the concatenation operation, and $x_i$ is the coordinate of the i-th initial point. $F_E$ is the feature map extracted by the CNN backbone, and $F_E(x_i)$ means extracting the feature vector from the feature map $F_E$ at the coordinate $x_i$ using bilinear interpolation. The number of channels of the feature vector at position $x_i$ is the same as that of the feature map $F_E$ (i.e., 64), and $x_i$ is a two-dimensional vector. Therefore, the input of the vertex offset prediction model is a vector of shape N × 66. The network architecture of the model is shown in Figure 5. The model consists of several standard one-dimensional convolutional layers and circular convolution (CirConv) layers [15]. The kernel sizes of the circular convolution and the standard one-dimensional convolution are 9 and 1, respectively.
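The following PyTorch sketch illustrates how the N × 66 input can be assembled with bilinear sampling and processed over the closed contour. Circular padding is used here to emulate CirConv, and the layer count and hidden width are assumptions; the coordinates are concatenated in normalized form for convenience.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def point_features(fmap, pts, size):
    """Bilinearly sample feature vectors at the contour points.

    fmap: (B, 64, H, W) feature map; pts: (B, N, 2) in pixel coords;
    size: (W, H) of the input image. Returns (B, N, 66): sampled features
    concatenated with the (normalized) point coordinates.
    """
    w, h = size
    grid = pts.clone()
    grid[..., 0] = 2 * pts[..., 0] / (w - 1) - 1  # to [-1, 1] for grid_sample
    grid[..., 1] = 2 * pts[..., 1] / (h - 1) - 1
    feat = F.grid_sample(fmap, grid.unsqueeze(2), align_corners=True)  # (B,64,N,1)
    feat = feat.squeeze(-1).permute(0, 2, 1)                           # (B,N,64)
    return torch.cat([feat, grid], dim=-1)                             # (B,N,66)

class OffsetHead(nn.Module):
    """Sketch of the vertex-offset predictor: circular 1D convs (kernel 9)
    over the closed contour followed by a 1x1 conv; layer count is assumed."""
    def __init__(self, in_ch=66, mid=128):
        super().__init__()
        self.circ = nn.Sequential(
            nn.Conv1d(in_ch, mid, 9, padding=4, padding_mode="circular"),
            nn.ReLU(inplace=True),
            nn.Conv1d(mid, mid, 9, padding=4, padding_mode="circular"),
            nn.ReLU(inplace=True))
        self.head = nn.Conv1d(mid, 2, 1)  # per-point offset (dx, dy)

    def forward(self, x):                 # x: (B, N, 66)
        return self.head(self.circ(x.permute(0, 2, 1))).permute(0, 2, 1)
```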
According to the predicted vertex offsets $\Delta X$, the vertex set $\tilde{X} = \{\tilde{x}_i\}_{i=1}^{N}$ of the adjusted polygon is constructed as follows:
$$\tilde{X} = \{\,x_i + \Delta x_i\,\}_{i=1}^{N}. \tag{3}$$
During training, PDAA similarly uniformly samples N target points $\hat{X} = \{\hat{x}_i\}_{i=1}^{N}$ along the real building contours. If we let $\hat{V} = \{\hat{v}_i\}_{i=1}^{N}$ be the target vertex map, then $\hat{v}_i \in \{0, 1\}$ is defined as
$$\hat{v}_i = \mathbb{1}\,[\hat{x}_i \in G]. \tag{4}$$
Here, $\mathbb{1}[\cdot]$ is an indicator function, which takes 0 or 1. $G = \{g_j\}_{j=1}^{M}$ represents the set of M vertices of the ground-truth building polygon. To optimize the predicted vertex map $V = \{v_i\}_{i=1}^{N}$, a focal loss [40] is applied as follows:
$$L_{v} = -\frac{1}{N}\sum_{i=1}^{N}\begin{cases}(1 - v_i)^{\alpha}\log(v_i), & \hat{v}_i = 1\\ (1 - \hat{v}_i)^{\beta}\,v_i^{\alpha}\log(1 - v_i), & \text{otherwise.}\end{cases} \tag{5}$$
Here, following [41], α is set to 2 and β is set to 4. To optimize the predicted point offsets, a smooth L1 loss [42] is adopted:
$$L_{off} = \frac{1}{N}\sum_{i=1}^{N} w_i \cdot \mathrm{smooth}_{L1}\,(\Delta x_i,\; \hat{x}_i - x_i), \tag{6}$$
where $w_i$ is used to scale the loss of the i-th initial point and depends on the value of $\hat{v}_i$.
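A sketch of the two losses under the reconstructions above is shown below. With strictly binary targets the β term of the focal loss is inactive, so the general penalty-reduced form is given; the per-point weight w_i is passed in rather than derived, since its exact dependence on the target vertex map is not specified here.

```python
import torch
import torch.nn.functional as F

def vertex_focal_loss(pred, target, alpha=2, beta=4):
    """Penalty-reduced focal loss in the style referenced above (Equation (5) sketch).

    pred, target: (B, N) vertex probabilities / targets. With binary targets
    the beta term only matters for soft (Gaussian-smoothed) targets.
    """
    pos = target.eq(1)
    pos_loss = ((1 - pred) ** alpha * torch.log(pred.clamp(min=1e-6)))[pos]
    neg_loss = ((1 - target) ** beta * pred ** alpha *
                torch.log((1 - pred).clamp(min=1e-6)))[~pos]
    n_pos = pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / n_pos

def offset_loss(pred_off, gt_off, weight):
    """Weighted smooth-L1 loss over per-point offsets (Equation (6) sketch).

    weight: per-point scale w_i derived from the target vertex map."""
    per_point = F.smooth_l1_loss(pred_off, gt_off, reduction="none").sum(-1)
    return (weight * per_point).mean()
```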
3.2.2. Redundant Vertex Removal
In the process of building vector extraction, redundant vertex removal is a key step, which aims to filter out the key vertices that best represent the building contour from the initial prediction results. As shown in Figure 6, the algorithm consists of three core modules: multi-scale feature extraction, a self-attention mechanism, and learnable non-maximum suppression (NMS).
First, a multi-scale feature extractor is used to process the input data. A number of one-dimensional convolutional layers in this module are capable of capturing spatial features at various scales and offering a wealth of data for further processing. Next, a self-attention mechanism is used to enable the model to dynamically adjust the importance weight of each point according to the global context and enhance the key point recognition ability. Lastly, each point is given an importance score by a fully connected layer, and redundant points are eliminated by combining them with an enhanced non-maximum suppression algorithm.
The multi-scale feature extractor is designed to capture features of the input contour at different scales. Specifically, it consists of two consecutive 1D convolutional layers, using kernels of sizes 3 and 5, respectively, with corresponding padding strategies to ensure that no information is lost during feature extraction. Each convolution is followed by a batch normalization layer and a ReLU activation function to facilitate gradient flow during training. The output of the feature extractor is reshaped to [B, P, 64], where B is the batch size and P is the number of vertices, to facilitate subsequent processing by the self-attention mechanism.
The self-attention mechanism determines the importance weight of each point by calculating the similarity score between the query, key, and value. This approach allows the model to effectively capture the relationship between points, especially in complex scenarios. The self-attention module sets the embedding dimension to 64 and uses 4-head attention to calculate the global dependencies between vertices to improve processing efficiency and flexibility.
We propose a learnable NMS predictor to more precisely eliminate redundant vertices. Using the multi-scale feature extractor and self-attention mechanism, the predictor first encodes the input points and then assigns an importance score to every point via a fully connected layer. Based on these scores, an improved NMS algorithm is applied, whose dynamic screening strategy contains two constraints: (1) geometric constraint: a vertex is retained only if the distance between it and its adjacent retained vertices satisfies $d \ge d_{\min}$, where $d_{\min}$ is a preset distance threshold; (2) semantic constraint: a vertex is retained if its predicted probability achieves the maximum confidence in its local neighborhood. While considering the geometric distance, the learned importance score is also incorporated to more intelligently select which points to retain.
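The sketch below combines the pieces described in this subsection: the two-kernel feature extractor, 4-head self-attention with embedding dimension 64, a fully connected scoring layer, and a greedy NMS enforcing both constraints. The input channel count (raw vertex coordinates) and other architectural details are assumptions.

```python
import torch
import torch.nn as nn

class VertexScorer(nn.Module):
    """Sketch of the redundant-vertex scorer: two 1D convs (kernels 3 and 5),
    4-head self-attention (dim 64), and a linear layer producing one
    importance score per vertex."""
    def __init__(self, in_ch=2, dim=64):  # in_ch=2: vertex coords (assumption)
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_ch, dim, 3, padding=1), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Conv1d(dim, dim, 5, padding=2), nn.BatchNorm1d(dim), nn.ReLU())
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, pts):                                    # pts: (B, P, C)
        f = self.convs(pts.permute(0, 2, 1)).permute(0, 2, 1)  # (B, P, 64)
        f, _ = self.attn(f, f, f)                              # global context
        return torch.sigmoid(self.score(f)).squeeze(-1)        # (B, P) scores

def nms_filter(pts, scores, d_min):
    """Greedy NMS sketch: keep a vertex only if it is at least d_min away
    from every already-kept, higher-scoring vertex (geometric constraint);
    iterating in score order enforces the local-maximum semantic constraint."""
    order = scores.argsort(descending=True)
    keep = []
    for i in order.tolist():
        if all((pts[i] - pts[j]).norm() >= d_min for j in keep):
            keep.append(i)
    return sorted(keep)  # restore contour order
```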
3.3. Missing Vertex Completion
For building instances with more intricate geometries, certain subtle building vertices may remain unpredicted. To cope with this problem, PDAA proposes a missing vertex completion module, which iteratively adds new polygon vertices to the initial building polygons and matches these new vertices with the unpredicted ground-truth building vertices. The missing vertex completion module runs for T iterations. For the first iteration, the building polygon generated by the redundant vertex removal module is regarded as the initial polygon. For the t-th (2 ≤ t ≤ T) iteration, the building polygon generated by the (t − 1)-th iteration is regarded as the initial polygon. In each iteration, PDAA first uniformly samples a set of initial points along the initial polygon and then applies a vertex and offset prediction model, as shown in Figure 7. This module takes the initial points as input and predicts a vertex heatmap and point offsets, where the former indicates the probability of each initial point becoming a new polygon vertex (and removes redundant points), while the latter is used to refine the positions of the initial points.
Formally, for the t-th iteration, let $X^{t} = \{x_i^{t}\}_{i=1}^{N}$ be the set of N initial points uniformly sampled along the initial polygon. Then, the input features $F^{t}$ of the vertex and offset prediction model are constructed as
$$F^{t} = \{\,F_E(x_i^{t}) \oplus x_i^{t} \oplus b_i^{t}\,\}_{i=1}^{N}. \tag{7}$$
Here, $x_i^{t}$ is the coordinate of the i-th initial point in the t-th iteration, and $b_i^{t}$ takes 0 or 1, indicating whether the i-th initial point is a polygon vertex. $b_i^{t}$ is defined as
$$b_i^{t} = \mathbb{1}\,[x_i^{t} \in V^{t-1}], \tag{8}$$
where $V^{t-1}$ is the vertex set of the building polygon produced at the (t − 1)-th iteration. The network architecture of this model is the same as that shown in Figure 5, except that the shape of the input vector becomes N × 67. For the t-th iteration, let $V^{t} = \{v_i^{t}\}_{i=1}^{N}$ and $\Delta X^{t} = \{\Delta x_i^{t}\}_{i=1}^{N}$ be the predicted vertex map and point offsets, respectively. The vertex set $\tilde{V}^{t}$ of the building polygon is constructed as
$$\tilde{V}^{t} = \{\,x_i^{t} + \Delta x_i^{t} \mid v_i^{t} = 1 \ \text{or}\ b_i^{t} = 1\,\}, \tag{9}$$
where $b_i^{t} = 1$ indicates that the i-th initial point is an original polygon vertex. Then, the building polygon can be generated based on $\tilde{V}^{t}$. Considering that each iteration may introduce new polygon vertices, which may lead to vertex redundancy, PDAA solves this problem in two steps. First, in each iteration, PDAA aims to regress redundant vertices to their adjacent ground-truth building vertices. Second, PDAA adopts the redundant vertex removal module to remove redundant vertices around building vertices [43].
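Reusing the point_features helper from the sketch in Section 3.2.1, the N × 67 input of one completion iteration can be assembled as follows; the flag layout is an assumption consistent with Equation (7).

```python
import torch

def completion_inputs(fmap, pts, is_vertex, size):
    """Sketch of the N x 67 input of one completion iteration: the 66-d
    point features from the adjustment stage plus a 0/1 flag marking points
    that are already polygon vertices (Equation (8)).

    is_vertex: (B, N) boolean/0-1 tensor of the flags b_i^t."""
    base = point_features(fmap, pts, size)                              # (B, N, 66)
    return torch.cat([base, is_vertex.unsqueeze(-1).float()], dim=-1)  # (B, N, 67)
```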
3.4. Loss Function
During training, PDAA first matches the vertex set $\tilde{X}$ of the initial polygon with the vertex set G of the ground-truth building polygon. Considering the dependencies between polygon vertices, PDAA explores a variant of the dynamic time warping (DTW) algorithm [44] to implement the vertex matching process. Specifically, the DTW algorithm is first applied to match $\tilde{X}$ and G, which enables any vertex in G to be matched with one or more consecutive vertices in $\tilde{X}$, and vice versa. However, since a vertex $\tilde{x}_i$ can only match one ground-truth vertex in G, if $\tilde{x}_i$ is matched with multiple ground-truth vertices, then PDAA uniquely selects the one closest to $\tilde{x}_i$. For brevity, the matching process is denoted as $\pi$, and the ground-truth vertex that matches any $\tilde{x}_i$ is denoted as $\pi(\tilde{x}_i)$.
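The following NumPy sketch illustrates the matching scheme under the reconstruction above: a standard DTW alignment over Euclidean costs followed by the uniqueness rule that keeps, for each predicted vertex, only its cheapest matched ground-truth vertex. Cyclic alignment of closed contours is omitted for brevity.

```python
import numpy as np

def dtw_match(pred, gt):
    """Sketch of DTW-based vertex matching: pred (n, 2), gt (m, 2).

    Returns a dict mapping each matched predicted-vertex index to a single
    ground-truth vertex index (the uniqueness rule)."""
    n, m = len(pred), len(gt)
    cost = np.linalg.norm(pred[:, None] - gt[None, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # backtrack the warping path
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    # uniqueness: each predicted vertex keeps its single cheapest gt match
    match = {}
    for pi, gj in path:
        if pi not in match or cost[pi, gj] < cost[pi, match[pi]]:
            match[pi] = gj
    return match
```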
After matching the two vertex sets, PDAA samples N target points along the ground-truth building polygon. Formally, let $K = \{k_j\}$ be the set of indices of the initial points closest to the matched vertices in $\pi(\tilde{X})$, sorted in contour order. Then, the number of initial points between any two adjacent indices $k_j$ and $k_{j+1}$ is calculated as $n_j = k_{j+1} - k_j$, which is the number of target points uniformly sampled along the corresponding ground-truth edge. By collecting all sampled points on all edges, a set of N target points is constructed.
Let $\hat{X}^{t} = \{\hat{x}_i^{t}\}_{i=1}^{N}$ and $\hat{V}^{t} = \{\hat{v}_i^{t}\}_{i=1}^{N}$ be the target point set and target vertex map, respectively. Any $\hat{v}_i^{t}$ is defined as
$$\hat{v}_i^{t} = \mathbb{1}\,[\hat{x}_i^{t} \in \pi(\tilde{X})], \tag{10}$$
where $\pi(\tilde{X})$ represents the set of ground-truth vertices matching all vertices in $\tilde{X}$. At the t-th iteration, the loss function for optimizing the predicted vertex heatmap is defined as
$$L_{v}^{t} = -\frac{1}{N}\sum_{i=1}^{N}\begin{cases}(1 - v_i^{t})^{\alpha}\log(v_i^{t}), & \hat{v}_i^{t} = 1\\ (1 - \hat{v}_i^{t})^{\beta}\,(v_i^{t})^{\alpha}\log(1 - v_i^{t}), & \text{otherwise.}\end{cases} \tag{11}$$
Here, α and β follow the previous settings. The loss function for the offset prediction in the t-th iteration is defined as
$$L_{off}^{t} = \frac{1}{N}\sum_{i=1}^{N} w_i^{t} \cdot \mathrm{smooth}_{L1}\,(\Delta x_i^{t},\; \hat{x}_i^{t} - x_i^{t}), \tag{12}$$
where $w_i^{t}$ is a scalar used to scale the loss. The overall loss function for optimizing the missing vertex completion module is calculated as
$$L_{mvc} = \sum_{t=1}^{T}\left(L_{v}^{t} + L_{off}^{t}\right). \tag{13}$$
The process carried out in this study is shown in Algorithm 1.
Algorithm 1 PDAA Training
- Input: $I$: the set of training images; the set of ground-truth building polygons $G$ for each image; the detection model with its network parameters; the extreme point prediction model with its network parameters; and the vertex and offset prediction models with their network parameters.
1: for each training image in $I$ do
2:   Generate ground-truth detection boxes from $G$
3:   Extract the feature map $F_E$ and predict the detection boxes
4:   Predict the extreme points and generate their targets
5:   Sample initial points $X$ along the octagonal contour formed by the extreme points
6:   Construct the input vector $F$ with $F_E$ and $X$ (Equation (2))
7:   Predict the vertex map $V$ and offsets $\Delta X$
8:   Generate the targets $\hat{V}$ and $\hat{X}$ with $G$ (Equation (4))
9:   for t = 1 to T do
10:    Compute the vertex set from $V$ or $V^{t-1}$ (Equation (9))
11:    Sample points $X^{t}$ along the polygon formed by this vertex set
12:    Construct the input vector $F^{t}$ with $F_E$ and $X^{t}$ (Equation (7))
13:    Predict $V^{t}$ and $\Delta X^{t}$
14:    Generate the targets $\hat{V}^{t}$ and $\hat{X}^{t}$ with $\pi$ and $G$ (Equation (10))
15:  end for
16:  Compute the detection, vertex, and offset losses (Equations (5), (6), and (13))
17:  Update the network parameters by gradient descent
18: end for
4. Experiment
In all experiments, the models were trained for 150 epochs on an NVIDIA GTX 1080 Ti GPU using a mini-batch size of eight images. The initial learning rate of the Adam optimizer was 1 × 10⁻⁴ and was halved at 80 and 120 epochs. The models were trained with multi-scale data augmentation and tested without tricks. The CNN backbone was initialized with weights pre-trained on ImageNet, while the other layers were initialized as in [15].
4.1. Datasets and Evaluation Metrics
We evaluated our proposed method on three building datasets: the WHU dataset [45], the Vaihingen dataset [46], and the Inria building dataset [47]. The WHU dataset has a large number of highly accurate building labels; it covers an area of 450 square kilometers and includes more than 187,000 buildings of various architectures and uses. All aerial images are seamlessly cropped into 1024 × 1024 blocks. Following [8], the original images with a resolution of 0.075 m/pixel are downsampled to 0.2 m/pixel, and 130,000/14,500/42,000 buildings are used for the training/validation/testing datasets, respectively. The Vaihingen dataset covers the area of Vaihingen, a small town near Stuttgart, Germany, and consists of 168 high-resolution aerial images of size 512 × 512 pixels at a resolution of 0.09 m/pixel. Following [10], the dataset is split into 100/68 images for training/testing. The Inria building dataset contains 360 images of size 5000 × 5000 pixels and a resolution of 0.3 m/pixel collected from five different cities (Austin, Chicago, Kitsap, Tyrol, and Vienna). To adapt to specific experimental requirements, these original images are first padded to 5120 × 5120 pixels and then cropped into small patches of 512 × 512 pixels. Following [14], the city of Tyrol is selected from the Inria dataset to evaluate the proposed algorithm. In the experiments, it is divided into training and test sets at a ratio of 3:1.
To evaluate the performance of methods for generating building polygons, classic metrics from object detection and instance segmentation are used: AP (averaged over intersection-over-union (IoU) thresholds of 0.50:0.05:0.95), AP50 (IoU threshold of 0.5), and AP75 (IoU threshold of 0.75). These APs emphasize the importance of accurate and complete detection of building instances and are widely used in existing methods [13,15,24]. Unless otherwise specified, the IoU in these APs is based on building polygons rather than detection boxes. In addition, to evaluate the geometric similarity between the polygons generated by different methods and the ground-truth building polygons, we also evaluated the PolySim metric, as used in previous studies [10,48]. Specifically, PolySim is calculated as the product of the average difference in the orientation angles of all edges of two polygons and the IoU value between the two polygons.
Here, the first term is the average distance between each vertex $v_i$, $i = 1, …, m$, of one polygon and its closest point on the other polygon’s boundary, while the second term is the average distance between each vertex of the other polygon and its closest point on the first polygon’s boundary.
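As a concrete reading of the verbal definition above, the sketch below multiplies an edge-orientation agreement term by the polygon IoU (computed with Shapely). The mapping of the mean orientation difference into [0, 1] is our assumption; the exact normalization used in [10,48] may differ.

```python
import numpy as np
from shapely.geometry import Polygon

def edge_angles(poly):
    """Orientation angle of each polygon edge, wrapped to [0, pi)."""
    pts = np.asarray(poly)
    d = np.roll(pts, -1, axis=0) - pts
    return np.mod(np.arctan2(d[:, 1], d[:, 0]), np.pi)

def polysim(pred, gt):
    """Sketch of a PolySim-style score: orientation agreement times IoU."""
    a, b = edge_angles(pred), edge_angles(gt)
    # mean nearest-edge orientation difference, wrapped to [0, pi/2]
    diff = np.abs(a[:, None] - b[None, :])
    diff = np.minimum(diff, np.pi - diff)
    angle_term = 1.0 - diff.min(axis=1).mean() / (np.pi / 2)
    p, g = Polygon(pred), Polygon(gt)
    iou = p.intersection(g).area / p.union(g).area
    return angle_term * iou
```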
In addition, to evaluate the efficiency of different methods, the inference time (Time) and the number of learnable parameters (Params) were calculated.
4.2. Results and Analysis
We compare our proposed algorithm with the pixel-based method Mask R-CNN [24], as well as recent contour-based methods including MA-FCN [11], Polygon-RNN++ [15], APGA [12], Deep Snake [17], and HiSup [12]. For a fair comparison, the detection boxes generated by PDAA are fed to Polygon-RNN++ and APGA at inference time.
4.2.1. Results from WHU Dataset
Table 1 shows the numerical results of the different methods on the WHU dataset. As a typical representative of semantic segmentation, MA-FCN focuses on distinguishing between building and non-building areas without identifying specific building instances. This mechanism minimizes the number of learnable parameters required by MA-FCN and also significantly reduces its inference time. However, this advantage also brings limitations, especially when dealing with densely distributed building instances: MA-FCN has difficulty effectively distinguishing adjacent buildings, so its performance on the WHU dataset is lower than expected. Similarly, HiSup generates polygons from semantic segmentation results, which helps improve the overall segmentation quality but may also cause the generated polygon shapes to deviate from the actual building ground truth. In addition, HiSup simultaneously learns mask, line, and vertex information to improve segmentation results, which means it requires more learnable parameters and a longer inference time than MA-FCN. In contrast, methods that distinguish different instances, such as Mask R-CNN, Polygon-RNN++, APGA, and Deep Snake, show significant improvements in average precision (APs) and polygon similarity (PolySim).
On the WHU dataset, PDAA shows significant performance advantages, not only surpassing existing methods by more than 2% in AP but also achieving up to a 6% improvement across core evaluation indicators such as AP, AP50, and AP75. This is a considerable improvement over methods that already achieve good results on the WHU dataset. The results on the WHU dataset are visualized in Figure 8. MA-FCN faces particular challenges when dealing with adjacent building instances, and its use of an empirical regularization algorithm can cause the generated building contour to deviate from the target, producing inaccurate boundary predictions. Polygon-RNN++ misses several vertices in the process of generating buildings, producing incorrect building contours for building instances with complex shapes. Although APGA can generate relatively regular building shapes, its contours do not necessarily strictly fit the target building boundaries. HiSup faces great challenges in distinguishing adjacent building instances, which may lead to errors in building prediction. In contrast, PDAA performs better in both building instance recognition and accurate boundary prediction. The generated building polygons contain only necessary vertices, presenting a structure that is more consistent with the ground truth and making the prediction results closer to the geometric characteristics of real buildings, thereby improving the quality and geometric consistency of the polygons.
4.2.2. Results from Vaihingen Dataset and Inria Dataset
As shown in Table 2, this study conducted a multi-method comparison experiment on the Vaihingen dataset. The experiment shows that the performance of the Polygon-RNN++ method is significantly lower than that of other methods using contour information. This phenomenon is mainly attributed to the limitations of Polygon-RNN++ in processing fine structures in high-spatial-resolution images. In contrast, PDAA demonstrates its superiority, producing more accurate predictions and simpler building contours with fewer redundant points. Compared with HiSup, PDAA not only achieves a similar performance level on the PolySim indicator but also achieves a more than 2% improvement in AP, AP50, and AP75, which highlights the advantage of PDAA in achieving higher-accuracy predictions. To more intuitively demonstrate these results, some visualization examples from the Vaihingen dataset are presented in Figure 9.
The numerical results on the Inria dataset are reported in Table 3. PDAA performs better than APGA in terms of AP, AP75, and PolySim, with improvements of more than 6% in each. At the same time, PDAA also shows strong competitiveness on the AP50 indicator. Compared with other methods, PDAA’s significant improvements in AP and AP75 further demonstrate its excellent ability to accurately depict building contours.
Whether on the detailed Vaihingen dataset or the larger Inria dataset, PDAA shows obvious advantages over existing methods, especially in improving the accuracy of building contour prediction. The accuracy achieved on these two datasets differs markedly from that on the WHU dataset. This is because the large number of small auxiliary structures in the Vaihingen scenes places higher demands on local feature extraction, while the diversity of architectural styles in the Inria dataset tests the generalization ability of the model.
4.3. Ablation Experiment
In order to more deeply and comprehensively evaluate the effectiveness of each key module in the PDAA framework, a series of ablation experiments were conducted to carefully analyze the impact of each component on the final performance. All experiments were conducted on the WHU dataset, and it was ensured that they were performed in the same experimental environment. The training parameters (such as learning rate, batch size, etc.) were kept consistent to ensure that the experimental results were highly comparable and reliable.
Table 4 shows, in detail, the performance changes under different module combinations. Among them, the introduction of the FEC module significantly improves the performance of the model, which shows that FEC plays a vital role in optimizing feature extraction. In addition, the addition of the missing vertex completion module (MVCM) further improves the PolySim score to 84.8%, showing its great potential in improving contour shape accuracy. In particular, when the redundant vertex removal module (RVRM), the missing vertex completion module, and the FEC are used together, the AP value reaches 75.4%. This result not only reflects the importance of each of the three modules but also reveals the synergistic effect between them, especially the outstanding performance in dealing with complex building structure extraction tasks.
Figure 10 intuitively shows how the building extraction results change under different module combinations. It can be clearly seen from the figure that without RVRM, MVCM, and FEC, the predicted building contour exhibits obvious vertex deviation, inaccurate positioning, and a loss of important vertices (Figure 10b). With the gradual introduction of RVRM, the prediction of redundant vertices is effectively controlled (Figure 10c), while MVCM successfully restores some missing key vertices (Figure 10d). Although the overall predicted building contour is now quite close to the actual one, there are still certain errors in the positions of some vertices. Finally, by adding the FEC module, these subtle deviations are effectively corrected, achieving a more accurate building contour prediction (Figure 10e).
To further emphasize the importance of each module and the interaction mechanisms between them, we also analyzed the specific reasons for the improvement at each stage. RVRM reduces unnecessary computation and potential sources of error by identifying and removing redundant points; MVCM recovers missing vertices through iterative completion, thereby improving reconstruction accuracy; and the FEC module enhances feature expression capabilities, allowing the model to capture the detailed features of buildings more accurately. This multi-level optimization strategy enables the PDAA framework to demonstrate excellent performance in building extraction tasks.
4.4. Discussion
The end-to-end polygon dynamic adjustment algorithm (PDAA) proposed in this study has shown significant performance advantages in the task of building contour extraction. The core of this method is to solve many key challenges of traditional methods through the synergy of four modules. First, the local feature extraction strategy focusing on the region of interest (RoI) effectively reduces the redundancy of global features, allowing the model to more accurately capture the local geometric details of the building. This design is inspired by the sensitivity of the human visual system to local features. By positioning the RoI detection box, the model’s adaptability to complex building structures is significantly improved.
Second, the feature enhancement module, FEC, significantly enhances the model’s ability to detect key vertices in complex backgrounds by optimizing feature expression. This module compensates for the sensitivity of traditional vertex detection methods to illumination changes and texture interference, and it reduces the risk of missing vertices by enhancing the robustness of features. In the ablation experiment, the introduction of FEC improved the accuracy of the model on complex building examples by about 1%, which shows that its enhancement of key features cannot be ignored.
The synergy between the redundant vertex removal module and the missing vertex completion module is one of the core innovations of PDAA. Traditional methods often accumulate redundant vertices or lose key vertices due to the lack of a dynamic adjustment mechanism. The redundant vertex removal module effectively distinguishes redundant points from key points through a learnable classification mechanism, significantly reducing the number of redundant polygon vertices (as shown in Figure 8). At the same time, the missing vertex completion module restores vertices missed due to occlusion or noise interference through iterative optimization. This dynamic adjustment mechanism not only improves the geometric consistency of polygons but also avoids the shape distortion caused by a fixed number of vertices. In the experiments, the combined use of the redundant vertex removal module and the missing vertex completion module increased the PolySim score to 84.8%, proving its effectiveness in complex contour modeling.
This study conducted experiments on the WHU, Vaihingen, and Inria datasets, and the results showed significant performance differences among the datasets. The WHU dataset, due to its high-quality annotations and simple building forms, allows PDAA to fully utilize the advantages of its vertex adjustment mechanism and obtain higher scores. In contrast, the Vaihingen and Inria datasets contain more complex building forms and inconsistent annotations, which leads to a decrease in the performance of PDAA. These differences reveal the challenges faced by PDAA in dealing with complex building outlines and provide guidance for further optimization.
However, this study still has some limitations. PDAA’s reliance on RoI detection boxes may lead to a decline in its performance in dense building complexes. When the detection boxes overlap or have positioning deviations, it may affect the accuracy of subsequent vertex predictions. In addition, the current method’s reliance on high-quality labeled data still needs to be further reduced. In future studies, the model’s feature extraction capabilities for overlapping areas or fuzzy boundaries can be enhanced by introducing a multi-scale attention module, and the deviations of local RoI detection boxes can be corrected using global context information to improve the global consistency of vertex predictions. Semi-supervised or self-supervised learning strategies can be introduced to reduce the cost of manual labeling. Pseudo-annotations are dynamically generated during training, and consistency regularization constraints are added to improve the model’s generalization ability for unseen data.
It is worth mentioning that the good geometric consistency shown by PDAA not only reflects its effectiveness in building extraction tasks but also provides a solid foundation for practical applications. This method can be widely used in scenarios such as urban planning, disaster assessment, and the automatic interpretation of remote sensing images. The building polygons extracted by PDAA can be used as high-quality vector inputs: after natural disasters, the rapid extraction of damaged building outlines is helpful for loss assessment and reconstruction planning; accurate building boundary information can assist in the construction of high-precision maps and improve environmental perception capabilities. These potential application directions also further highlight the research value and promotional significance of the PDAA method.
5. Conclusions
In this paper, an end-to-end polygon dynamic adjustment algorithm (PDAA) is proposed to solve the challenge of extracting building contours in remote sensing images. By focusing on the collaborative optimization of modules such as the local feature extraction of RoI, redundant contour removal, and missing contour completion, PDAA has achieved significant improvements in the geometric similarity and extraction accuracy of complex building contours. Experiments show that PDAA achieves 75.4% AP and 84.8% PolySim scores on the WHU dataset, verifying its effectiveness in processing redundant vertices, recovering missing vertices, and adapting to changing building forms. Compared with existing methods, PDAA simplifies the prediction process, reduces dependence on large-scale annotated data, and generates polygons that are closer to real geometric features through a dynamic adjustment mechanism. The main contributions of this study include the following: (1) proposing an RoI local feature extraction strategy, which significantly improves the adaptability of complex buildings; (2) designing redundant point removal and missing point completion modules, which solves the shape distortion problem caused by the fixed number of vertices in traditional methods; and (3) through module design, efficient and high-precision end-to-end prediction is achieved, providing reliable technical support for practical engineering applications. Future work should focus on a lightweight optimization of the algorithm and improvement in cross-domain generalization capabilities to further promote the practical application of remote sensing image analysis technology.