Article

PolyReg: Autoregressive Building Outline Regularization via Masked Attention Sequence Generation

1 Institute of Geospatial Information, Information Engineering University, Zhengzhou 450052, China
2 State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System, Luoyang 471000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(9), 1650; https://doi.org/10.3390/rs17091650
Submission received: 18 March 2025 / Revised: 24 April 2025 / Accepted: 3 May 2025 / Published: 7 May 2025
(This article belongs to the Section Remote Sensing for Geospatial Science)

Abstract

High-resolution remote sensing imagery has become the primary data source for obtaining building information. Automatically extracting regularized building outline polygon vectors is crucial for improving vector mapping efficiency and geographic information system applications, but existing deep learning methods struggle to simultaneously achieve accurate detection, high pixel-level coverage, and geometric regularity. This paper proposes a novel two-stage building outline extraction method. In the first stage, the SegFormer model is used to extract image features, effectively capturing global context information. In the second stage, a polygon outline regularization model (PolyReg) based on a Masked Attention Encoder is innovatively introduced. The PolyReg model draws on the sequence generation idea from natural language processing, transforming the outline regularization task into a sequence generation problem. Through a cleverly designed self-attention mask matrix, it achieves an autoregressive output of regularized building outline coordinates, eliminating the need for cumbersome post-processing steps. Experimental results show that on the Inria Aerial Image Labeling Dataset, compared with traditional methods and existing deep learning methods, the proposed method demonstrates significant advantages in metrics such as IoU, C-IoU, and Hausdorff distance. It effectively improves the regularity and geometric accuracy of building outlines while maintaining high pixel-level coverage.

1. Introduction

In recent years, the rapid development of remote sensing and Earth observation technologies has made high-resolution remote sensing images the primary data source for obtaining building information [1]. High-resolution remote sensing images contain rich geographic object features, including color, size, shape, texture, and spatial relationships between objects [1]. Compared to traditional pixel-based spectral feature extraction methods, deep learning and machine learning techniques can extract building outlines from high-resolution remote sensing images more quickly and accurately [1,2,3] and update building outline data in a timely manner. Automatic extraction of building outlines is an important way to improve vector mapping efficiency [4] and has consistently been a focus and challenge in remote sensing applications and cartography research [5,6]. In particular, buildings represented in polygon format are widely adopted in geographic information systems due to their portability and flexibility. Therefore, the automatic extraction of regularized building polygon vectors from high-resolution remote sensing images has significant research and application value.
Building outline extraction and regularization methods can mainly be divided into three categories: traditional image processing-based methods, auxiliary information-based methods, and deep learning-based methods.
Traditional image processing-based methods primarily rely on image morphology, geometric features, and other classical image processing techniques. Image morphology methods optimize contours through operations such as erosion and dilation, for example, the grid-filling method proposed by [7] and the multi-star constrained segmentation method devised by [8]. Geometric correction methods use geometric constraints to regularize contours, such as the Douglas–Peucker algorithm [9] and its improved variants, as well as the Minimum Bounding Rectangle (MBR) and the Hausdorff distance. Although these methods are simple and practical, they usually require manually designed features and rules, and the regularization results are often not ideal. For instance, the Douglas–Peucker algorithm and its improved versions tend to lose detail when processing complex contours [10]. Additionally, these methods cannot directly resolve the irregular contours present in the mask outputs of currently popular deep learning models.
Auxiliary information-based methods utilize auxiliary data such as LiDAR point cloud data, Digital Surface Models (DSMs), and shadows to improve the accuracy of building extraction. For example, ref. [11] combined a DSM and image information for building detection and regularization, while ref. [12] integrated LiDAR point cloud data with high-resolution remote sensing images. Although auxiliary information can improve extraction accuracy, these methods often depend on additional data sources, limiting their applicability.
The rapid development of deep learning in the field of image processing has brought a series of approaches for extracting buildings from high-resolution remote sensing images, which can be divided into two categories: deep learning-based raster segmentation methods and deep learning-based vector polygon extraction methods. Deep learning-based raster segmentation methods treat building extraction as a pixel-level classification problem, outputting raster-form building masks. For example, networks such as U-Net [13], FCN [14], DeepLab [15], and ViT [16] are widely applied for pixel-level segmentation. To distinguish individual building instances, researchers have also adopted instance segmentation methods, such as Mask R-CNN [17], and applied them to building extraction [5]. However, the building segmentation masks obtained by these methods typically have blurred boundaries and numerous redundant points, and are not ideal formats for cartography and GIS applications. To address this issue, some researchers have proposed methods for regularizing segmentation results. For example, ref. [18] used minimum description length techniques to regularize building roof shapes based on airborne LiDAR data. Ref. [6] used the main-direction concept for fine regularization through an improved Douglas–Peucker algorithm, and ref. [19] used polygon partition refinement to polygonize building segmentation masks. Although these methods can effectively address the irregular boundaries of building segmentation masks, the post-processing steps are cumbersome, and geometric boundary optimization methods depend heavily on hand-designed features and rules, requiring more human intervention [3].
Deep learning-based vector polygon extraction methods produce buildings directly in vector format by detecting vertices or corner points and connecting them [1,20,21]. Polygon-RNN [22], PolyMapper [1], and their improved versions [23] sequentially predict polygon vertices through RNNs. Some studies have also utilized GCNs for polygon vertex prediction, such as Curve-GCN [24] and its improved versions [25], as well as PolyWorld [21]. Furthermore, some end-to-end methods, such as PolygonCNN [26], the frame field learning method [2], Deep Snake [27], PolarMask [28], DANCE [29], BuildMapper [30], and PolyBuilding [31], achieve direct mapping from images to regularized building contours. However, existing direct prediction methods still face challenges in generating structured building polygon vectors. For example, these methods typically suffer from vertex redundancy [25] or missed detections due to occlusion [21], and some local modeling approaches (such as one-dimensional convolution [32] and circular convolution [27,29]) tend to produce overly smoothed contours. In summary, existing methods still struggle to simultaneously achieve accurate building detection, high pixel-level coverage, and geometrically regular polygons with low complexity.
In response to the research status and limitations of existing methods, this paper proposes a two-stage method for extracting building outlines from high-resolution remote sensing images. The method draws inspiration from UNILM [33] and large language models in natural language processing, transforming the building outline extraction task into a sequence generation problem. First, the SegFormer model [34] is used to obtain image feature maps. This model adopts a Transformer structure that can effectively capture the global contextual information of images, providing stronger feature representation for subsequent building outline extraction. Then, unlike the common practice of pairing an encoder–decoder with object detection and multiple coordinate-regression prediction heads, this paper proposes Polygonal Outline Regularization via a Masked Attention Encoder (PolyReg), which uses only a Transformer encoder and achieves autoregressive output of regularized building outline coordinates through appropriately designed self-attention mask matrices. Specifically, the model represents building outline coordinates as a sequence and predicts coordinate points one by one in an autoregressive manner, aiming to generate more structured contours. The main contributions of this paper can be summarized as follows:
(1)
Proposing a new two-stage building outline extraction method that combines the feature extraction capabilities of SegFormer with the sequence generation capabilities of Transformer models.
(2)
Drawing inspiration from UNILM and large language models to construct the PolyReg model, which is based on the Transformer encoder and achieves sequence-to-sequence generation through cleverly designed self-attention mask matrices. This model can directly output building outline coordinates in an autoregressive manner without cumbersome post-processing steps.
(3)
Introducing the advantages of sequence generation models to the building outline extraction task. Compared to currently popular methods that regress endpoints and corner point coordinates, the proposed method is not limited to predicting fixed point sets or predefined vertices. By modeling the entire contour through sequence generation, it can flexibly generate a variable number of contour points according to requirements, better adapting to buildings of different shapes.

2. Materials and Methods

2.1. Model Architecture

The overall model architecture, as shown in Figure 1, adopts a two-stage approach. The first stage utilizes SegFormer to extract building masks from high-resolution remote sensing imagery. These binary masks are then converted into initial vector polygon representations. Specifically, we employ a standard contour-finding algorithm [35] to trace the boundary pixels of each detected building mask. Subsequently, the Douglas–Peucker algorithm [9] with a small tolerance value (e.g., 1.0 pixel) is applied to remove collinear points and reduce initial vertex redundancy, resulting in a sequence of coordinates $[P_1, P_2, \ldots, P_m]$ for each building contour. These vector sequences then serve as the input for the second stage. SegFormer is a lightweight, hierarchical Transformer that overcomes the limitations of other Transformer architectures, such as SETR and Swin, which struggle to effectively connect contextual information. It is specifically optimized for semantic segmentation tasks, enabling efficient and rapid extraction of multi-scale features.
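To make the first stage concrete, the following sketch shows one way to implement this mask-to-polygon conversion, assuming OpenCV is used: cv2.findContours implements the border-following algorithm of [35], and cv2.approxPolyDP implements Douglas–Peucker simplification [9]. The function name and default tolerance are illustrative.

```python
import cv2
import numpy as np

def mask_to_polygons(mask: np.ndarray, tolerance: float = 1.0):
    """Convert a binary building mask into initial vector polygons."""
    # Border following (Suzuki's algorithm) traces the boundary pixels
    # of each detected building mask.
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE
    )
    polygons = []
    for contour in contours:
        # Douglas-Peucker with a small tolerance removes near-collinear
        # points and reduces initial vertex redundancy.
        simplified = cv2.approxPolyDP(contour, epsilon=tolerance, closed=True)
        polygons.append(simplified.reshape(-1, 2))  # [P_1, P_2, ..., P_m]
    return polygons
```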
The second stage involves the PolyReg network, which employs an autoregressive model for generative prediction on the extracted vector contour sequences, achieving regularization of the original contours. This concept is based on Natural Language Processing (NLP), treating the sequence as a “sentence” with geographic meaning. By training a domain-specific geographic information model, a generative model capable of understanding geometric relationships is achieved.

2.2. SegFormer

The SegFormer network serves as a crucial component in the first stage of this paper, selected for its ability to efficiently extract pixel-level semantic masks from high-resolution remote sensing images. This network employs a lightweight, hierarchical Transformer structure, addressing the issue of insufficient contextual information in Transformer architectures for semantic segmentation tasks.
As shown in Figure 2, SegFormer consists of a hierarchical Transformer encoder and a lightweight all-MLP decoder. The encoder is inspired by the Vision Transformer but optimized for semantic segmentation. It divides the input image into small, overlapping patches, ensuring continuity between local regions. These patches are then processed by a multi-layer Transformer encoder, producing multi-level feature maps. Since the feature maps from all stages of the Transformer are merged, the shallow, high-resolution feature maps preserve fine-grained detail, while the deep, low-resolution feature maps capture coarse-grained semantic information, achieving multi-resolution information fusion. A 3 × 3 depthwise separable convolution (Mix-FFN) is used to fuse the multi-resolution feature maps. This approach also integrates positional information, avoiding the need for explicit positional encoding.
The decoder is a simple Multi-Layer Perceptron (MLP) structure. The feature maps output by the encoder are upsampled and then fused and predicted through multiple MLP layers, outputting the final semantic mask at the original resolution. SegFormer’s decoder structure is lighter and can leverage the non-local attention obtained in the Transformer encoder, effectively expanding the receptive field, making it suitable for semantic mask generation in this study.
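As an illustration of how the first stage can be assembled in practice, the sketch below obtains a building mask from the HuggingFace transformers implementation of SegFormer. The checkpoint name and two-class head are illustrative assumptions; the paper fine-tunes ImageNet-pre-trained weights on the Inria dataset (Section 3.2).

```python
import numpy as np
import torch
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# "nvidia/mit-b2" is an ImageNet-pre-trained SegFormer encoder; the two-class
# segmentation head (building / background) is randomly initialized here and
# would be fine-tuned on Inria as described in Section 3.2.
processor = SegformerImageProcessor.from_pretrained("nvidia/mit-b2")
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/mit-b2", num_labels=2)

def predict_mask(image: np.ndarray) -> np.ndarray:
    """image: HxWx3 uint8 array -> HxW binary building mask."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # predicted at 1/4 resolution
    logits = torch.nn.functional.interpolate(
        logits, size=image.shape[:2], mode="bilinear", align_corners=False
    )
    return logits.argmax(dim=1)[0].cpu().numpy()
```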

2.3. PolyReg

The PolyReg model achieves regularization of the vector contour sequences extracted by SegFormer through an autoregressive generative approach. PolyReg’s design is inspired by the field of Natural Language Processing (NLP), treating vector contours as “sentences” with geographic meaning. By training a domain-specific model, it learns to understand and generate contours with geometric regularity. The core of the model is a Transformer encoder, modified with Mask-Attention. This allows the encoder-only architecture to possess Seq2Seq capabilities for sequence generation. A gated MLP is also introduced to enhance the final generation performance.

2.3.1. PolyReg Network Architecture

PolyReg utilizes a Transformer encoder architecture based on multi-head self-attention. The model architecture is illustrated in Figure 3. Since the self-attention mechanism lacks explicit positional information, positional encoding is required when inputting the sequence. Instead of using direct absolute positional encoding, this paper modifies the self-attention mechanism to enable the model to perceive relative positions.
The input to PolyReg is the contour coordinate sequence extracted and vectorized by SegFormer, represented as $S = [P_1, P_2, \ldots, P_m]$, where $P_i = (X_i, Y_i)$ is the coordinate of the i-th vertex. To effectively capture the geometric features of the contour, each coordinate $P_i$ is first transformed into a vector $e_i$ containing a positional embedding and a feature embedding, $e_i = e_i^{pos} + e_i^{feat}$. The feature embedding $e_i^{feat}$ is learned by the model, while the positional embedding $e_i^{pos}$ uses sinusoidal positional encoding. The final input representation is $E = [e_1, e_2, \ldots, e_m]$, from which the Transformer computes attention:
$$q_i = e_i W^Q, \qquad k_j = e_j W^K, \qquad v_j = e_j W^V, \qquad a_{i,j} = \mathrm{softmax}\big(q_i k_j^\top\big), \qquad o_i = \sum_j a_{i,j} v_j \tag{1}$$
Expanding $QK^\top$ results in:
$$QK^\top = (E W^Q)(E W^K)^\top = E W^Q W^{K\top} E^\top \tag{2}$$
Substituting $e_i = e_i^{pos} + e_i^{feat}$ into Equation (2) results in:
$$q_i k_j^\top = e_i^{feat} W^Q W^{K\top} e_j^{feat\top} + e_i^{feat} W^Q W^{K\top} e_j^{pos\top} + e_i^{pos} W^Q W^{K\top} e_j^{feat\top} + e_i^{pos} W^Q W^{K\top} e_j^{pos\top} \tag{3}$$
where $e_i^{feat} W^Q W^{K\top} e_j^{feat\top}$ is the attention between the inputs themselves and is independent of position. $e_i^{feat} W^Q W^{K\top} e_j^{pos\top}$ and $e_i^{pos} W^Q W^{K\top} e_j^{feat\top}$ capture the relationship between position and geometric features in the sequence, which is extremely important for extracting building contour features; a value $R_{ij}$ that depends only on the relative position is therefore used in place of the absolute positional embeddings in these two components. $e_i^{pos} W^Q W^{K\top} e_j^{pos\top}$ is the attention between two positional encodings; since positional relationships are already captured by the relative term, this part is dropped. The final positional information thus enters the attention mechanism only through relative values:
$$q_i k_j^\top = e_i^{feat} W^Q W^{K\top} e_j^{feat\top} + R_{ij} W^Q W^{K\top} e_j^{feat\top} + e_i^{feat} W^Q W^{K\top} R_{ij}^\top \tag{4}$$
The value of $R_{ij}$ follows the sinusoidal positional encoding [36], evaluated at the relative offset $i - j$:
$$R_{ij} = PE_{(i-j,\,2k)} = \sin\!\left(\frac{i-j}{10000^{2k/d_{model}}}\right), \tag{5}$$
$$R_{ij} = PE_{(i-j,\,2k+1)} = \cos\!\left(\frac{i-j}{10000^{2k/d_{model}}}\right). \tag{6}$$
This encoding effectively captures relative positional information, helping the model learn the relative positional relationships between adjacent vertices and between different parts of the building outline, thereby enhancing performance. In addition, relative positional encoding handles long-distance dependencies better, allowing the model to manage more complex building outlines.
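A minimal, single-head sketch of the attention logits in Equation (4) is given below (unbatched, assuming an even embedding dimension; all names are illustrative, not the authors' implementation):

```python
import math
import torch

def sinusoidal_rel_table(max_len: int, d_model: int) -> torch.Tensor:
    """R_{ij} lookup table indexed by relative offset i - j, Eqs. (5)-(6).
    Assumes d_model is even."""
    pos = torch.arange(-max_len + 1, max_len, dtype=torch.float32)  # 2L-1 offsets
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    table = torch.zeros(pos.numel(), d_model)
    table[:, 0::2] = torch.sin(pos[:, None] * div)
    table[:, 1::2] = torch.cos(pos[:, None] * div)
    return table

def rel_attention_logits(e_feat: torch.Tensor, w_q: torch.Tensor,
                         w_k: torch.Tensor) -> torch.Tensor:
    """Eq. (4): content-content term plus the two cross terms in which the
    relative encoding R_{ij} replaces the absolute positional embeddings.
    e_feat: (L, d) feature embeddings; w_q, w_k: (d, d) projections."""
    L, d = e_feat.shape
    # e_i^feat W_Q W_K^T e_j^feat^T
    content = e_feat @ w_q @ (e_feat @ w_k).T
    rel = sinusoidal_rel_table(L, d)                                # (2L-1, d)
    idx = torch.arange(L)[:, None] - torch.arange(L)[None, :] + (L - 1)
    R = rel[idx]                                                    # (L, L, d)
    # R_{ij} W_Q W_K^T e_j^feat^T
    pos_to_feat = torch.einsum('ijd,jd->ij', R @ w_q @ w_k.T, e_feat)
    # e_i^feat W_Q W_K^T R_{ij}^T
    feat_to_pos = torch.einsum('id,ijd->ij', e_feat @ w_q @ w_k.T, R)
    return content + pos_to_feat + feat_to_pos  # logits before softmax
```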
The encoder of the model is designed with six identical stacked Transformer layers. Each Transformer layer consists of a multi-head self-attention layer, a Gated Multi-Layer Perceptron (Gated MLP), skip connections, and layer normalization.
The input coordinate sequence is processed by the input representation module to obtain the Query ($Q$), Key ($K$), and Value ($V$) matrices. The attention weights are computed from the logits of Equation (4), which already incorporate the relative positional encoding, and Scaled Dot-Product Attention is used to normalize them, with the normalization formula as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V. \tag{7}$$
This gives the attention of a single head. The attention of eight heads is computed in parallel in the same way, and the outputs of all heads are concatenated and linearly projected, as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_8)\, W^O. \tag{8}$$
To enhance the model’s ability to express sequence features, we improve the feedforward network of a single hidden layer by introducing a gating mechanism, constructing a Gated MLP (Figure 4). By introducing parallel linear layers and gating units, the model can better model nonlinear interactions between features. The concatenated multi-head attention is processed by the Gated MLP as the output of the entire Transformer layer. The calculation process is as follows:
$$\mathrm{GMLP}(x) = \big(\mathrm{ReLU}(x W_{Gate} + b_{Gate}) \odot (x W_{Up} + b_{Up})\big)\, W_{Down} + b_{Down}, \tag{9}$$
where $\odot$ represents element-wise multiplication, $\mathrm{ReLU}$ is the activation function, and $W_{Gate}$, $W_{Up}$, and $W_{Down}$ are the three linear transformation matrices in the Gated MLP. The gate and up projections map the input to the expanded dimension in parallel. The output of the gate projection is passed through the nonlinear activation function ReLU and then element-wise multiplied with the output of the up projection, achieving gated selection of features. The feature dimension is subsequently restored to the hidden-state dimension through the down projection, yielding the final output of the Gated MLP. This gating mechanism enables the model to adaptively adjust the information flow based on the input, thereby enhancing the flexibility and effectiveness of the feature representation.
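A minimal PyTorch sketch of Equation (9) follows; the dimensions come from Section 3.2, and ReLU is used as written in Equation (9) (Section 3.2 mentions GELU for the final model, so the activation is interchangeable here):

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Gated feedforward block of Eq. (9)."""

    def __init__(self, d_model: int = 256, d_ff: int = 1024):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff)  # GateLinear
        self.up = nn.Linear(d_model, d_ff)    # UpLinear
        self.down = nn.Linear(d_ff, d_model)  # DownLinear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GMLP(x) = (ReLU(x W_Gate + b_Gate) ⊙ (x W_Up + b_Up)) W_Down + b_Down
        return self.down(torch.relu(self.gate(x)) * self.up(x))
```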
In addition, residual connections and layer normalization are added between each sub-layer to ensure model stability and accelerate training.

2.3.2. Mask-Attention

PolyReg uses an autoregressive generative approach to achieve contour regularization. To ensure the integrity of the sequence, it is necessary to mark the start and end positions of the sequence. Therefore, special coordinate points are defined: $P_S(1, 1)$ as the start marker, $P_C(2, 2)$ as the separator between different contour sequences, and $P_E(3, 3)$ as the end marker of the sequence.
During the training phase, the original contour sequence is $S = [P_1, P_2, \ldots, P_m]$ and the corresponding expected regularized output is $S' = [P'_1, P'_2, \ldots, P'_n]$. The input to the PolyReg model is then $\hat{S} = [P_S, P_1, P_2, \ldots, P_m, P_C, P'_1, P'_2, \ldots, P'_n]$, and the target of the model is $\hat{S}' = [P_S, P_1, P_2, \ldots, P_m, P'_1, P'_2, \ldots, P'_n, P_E]$, i.e., the input shifted by one position, as shown in Figure 5. This design allows the model to learn over the entire output sequence at once during training, without step-by-step autoregressive decoding, greatly improving training efficiency.
During the inference phase, the model generates vertices autoregressively one by one. First, $\hat{S} = [P_S, P_1, P_2, \ldots, P_m, P_C]$ is input to generate the first regularized vertex $P'_1$. Then, $\hat{S} = [P_S, P_1, P_2, \ldots, P_m, P_C, P'_1]$ is input to generate the second vertex $P'_2$, and so on, until the end marker $P_E$ is generated. The output is the regularized contour.
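The sketch below illustrates both phases: the teacher-forced training pair of Figure 5 and the step-by-step inference loop. The model's prediction interface (a coordinate sequence in, the coordinates predicted at the last position out) is an assumption for illustration.

```python
import torch

P_S, P_C, P_E = (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)  # markers defined above

def build_training_pair(src, tgt):
    """Teacher forcing: the target is the input shifted by one position,
    so the whole output sequence is learned in a single forward pass."""
    inp = [P_S] + src + [P_C] + tgt      # \hat{S}
    out = [P_S] + src + tgt + [P_E]      # \hat{S}'
    return torch.tensor(inp), torch.tensor(out)

@torch.no_grad()
def regularize(model, src, max_steps: int = 64):
    """Autoregressive inference: append one generated vertex at a time
    until the end marker P_E is produced."""
    seq = [P_S] + src + [P_C]
    result = []
    for _ in range(max_steps):
        x = torch.tensor(seq).unsqueeze(0)       # (1, L, 2)
        nxt = tuple(model(x)[0, -1].tolist())    # next vertex P'_i
        if tuple(round(c) for c in nxt) == P_E:  # crude end-marker test
            break
        result.append(nxt)
        seq.append(nxt)
    return result
```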
However, a problem arises during training: when generating $P'_i$, the attention mechanism is bidirectional, so the encoder can directly see $P'_i$ in the input sequence; the model can simply copy the answer, and training fails.
To solve this problem, PolyReg borrows from the idea of UNILM and introduces a Mask-Attention mechanism to avoid this leakage problem. The network structure is shown in Figure 6.
In the figure, $S$ represents the original building coordinate sequence and $S'$ represents the regularized building coordinate sequence. In the input sequence $S$, each vertex can attend to every other vertex, i.e., a bidirectional attention mechanism is used. In the target sequence $S'$, each vertex can only attend to itself and the previously generated vertices, using a unidirectional attention mechanism that corresponds to the lower-triangular part of the mask matrix. In this way, each vertex in the target sequence can attend to all vertices in the input sequence and to the points generated before itself, while vertices in the input sequence cannot attend to the target sequence, preventing information leakage. This masking strategy keeps the model consistent between training and inference: during generation, each vertex can only rely on the information of the vertices before it, thus achieving autoregressive generation. In practice, the mask attention matrix sets the element $M_{ij}$ at each blocked position to $-\infty$ to suppress the attention from $x_j$; after the softmax, the corresponding weight becomes 0, thus avoiding information leakage from $x_j$ to $x_i$, that is:
$$Q = H^{l-1} W_l^Q, \qquad K = H^{l-1} W_l^K, \qquad V = H^{l-1} W_l^V, \tag{10}$$
$$M_{ij} = \begin{cases} 0, & \text{allow to attend} \\ -\infty, & \text{prevent from attending} \end{cases} \tag{11}$$
$$A_l = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V_l \tag{12}$$
Thus, through the Mask-Attention mechanism, the Seq2Seq function is realized based on a single encoder, improving the training and prediction efficiency of the model.
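A sketch of the additive mask for a source segment of length m followed by a target segment of length n, matching Equations (11) and (12) and the attention pattern of Figure 6:

```python
import torch

def seq2seq_mask(m: int, n: int) -> torch.Tensor:
    """UNILM-style mask M: bidirectional over the source, causal over the
    target; added to the attention logits before the softmax (Eq. (12))."""
    L = m + n
    M = torch.full((L, L), float("-inf"))
    # Every position may attend to the whole source sequence.
    M[:, :m] = 0.0
    # Target positions attend causally to themselves and earlier targets;
    # source rows keep -inf in target columns, blocking leakage.
    M[m:, m:] = torch.full((n, n), float("-inf")).triu(diagonal=1)
    return M
```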

2.3.3. Pre-Training Task Design

Obtaining sample pairs for building regularization is costly. To use fewer samples and achieve more generalized regularization capabilities, PolyReg employs a self-supervised contour reconstruction pre-training task. By collecting regularized building contours and randomly perturbing their coordinate sequences to simulate different degrees of geometric deformation and data noise, the model is then trained to reconstruct the original coordinate sequences. The specific processing methods include vertex masking, vertex replacement, and vertex reordering, as shown in Figure 7 and sketched in code after the list below.
Vertex Masking: Randomly select some building vertices and replace them with special identifiers to simulate missing information.
Vertex Replacement: Randomly select a portion of vertices and replace them with new, random points.
Vertex Reordering: Randomly reorder the vertex sequence of the building contour, forcing the model to learn the relative positional relationships between vertices and understand the overall shape of the building contour.
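A compact sketch of these three corruptions, assuming a contour is a list of (x, y) tuples; the masking token, perturbation probability, and the reading of "reordering" as a random cyclic shift (a full shuffle is an alternative) are illustrative assumptions:

```python
import random

MASK_TOKEN = (0.0, 0.0)  # assumed special identifier for masked vertices

def perturb(contour, p: float = 0.3):
    """Corrupt a regular contour for self-supervised reconstruction (Figure 7);
    the clean contour remains the reconstruction target."""
    pts = list(contour)
    op = random.choice(["mask", "replace", "reorder"])
    if op == "mask":       # vertex masking: hide vertices behind a marker
        pts = [MASK_TOKEN if random.random() < p else v for v in pts]
    elif op == "replace":  # vertex replacement: inject random noise points
        pts = [(random.random(), random.random()) if random.random() < p else v
               for v in pts]
    else:                  # vertex reordering: rotate the ring start
        k = random.randrange(len(pts))
        pts = pts[k:] + pts[:k]
    return pts
```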
The complete process of the self-supervised reconstruction task is shown in Figure 8. The original coordinates of the shape are $[P_1, P_2, \ldots, P_5]$. After preprocessing, they become the perturbed sequence $[P'_1, P'_2, \ldots, P'_5]$. This sequence is then input into the PolyReg network, and the training target is to reconstruct the original coordinate sequence, i.e., $[P_1, P_2, \ldots, P_5]$.
The pre-training task, on the one hand, greatly reduces the model's dependence on labeled data; on the other hand, raw coordinate sequences carry no explicit shape features such as angle, curvature, or direction, and the self-supervised task of reconstructing the original building contour from noisy input forces the model to learn these implicit geometric relationships. This greatly improves the efficiency and generalization of subsequent supervised training across different scales and simplification tasks.
After pre-training is completed, the model has the ability to understand and reconstruct building contours. Next, a small amount of regularized sample data is used to fine-tune the model. The fine-tuning process removes the processing steps of the pre-training task, as shown in Figure 9. The model takes the vectorized building contours of the semantic mask output by SegFormer as the input, and the manually annotated simplified target sequence as the target. The model learns the internal rules of contour regularization through the cross-entropy loss function.
By combining the geometric feature reconstruction pre-training task and the regularization fine-tuning task, the construction of the PolyReg model is completed.

3. Experiments

3.1. Dataset

The data used in this study are derived from the Inria Aerial Image Labeling Dataset [37], which encompasses high-resolution remote sensing imagery from five different cities, covering a total area of over 400 square kilometers and containing annotations for over 200,000 individual building instances. Figure 10 shows examples of the imagery and corresponding building annotation data from this dataset.
To train the SegFormer and PolyReg models, we selected data from three cities as the training set: Austin, TX, USA; Kitsap County, WA, USA; and Vienna, Austria. For the testing phase, we randomly selected portions of areas from Chicago, IL, USA, and West Tyrol, Austria, as the test set to evaluate the model's generalization ability. The data for each city have a resolution of 0.3 m and cover an area of 81 square kilometers. Table 1 provides detailed information about the datasets for each city.

3.2. Model Parameter Settings

The models are implemented using Python 3.8 and the PyTorch 1.10 framework. The experiments were conducted on a laptop equipped with 64 GB of memory, an AMD R9-5900HX CPU, and an NVIDIA GeForce RTX 3080 GPU. The equipment was manufactured by Lenovo, located in Beijing, China.
For the SegFormer model, we adopted weights pre-trained on ImageNet. During fine-tuning on the Inria dataset, we employed the AdamW optimizer with an initial learning rate of $6 \times 10^{-5}$, a weight decay of 0.01, and a batch size of 4. The model was trained for 50 epochs using a cross-entropy loss function for semantic segmentation. Standard data augmentation techniques such as random flipping and scaling were applied.
For the PolyReg model, the core architecture consists of eight stacked Transformer encoder layers. Each encoder layer contains eight attention heads and encodes shape features into a 256-dimensional hidden vector. The intermediate dimension of the Gated MLP is set to 1024, and GELU is used as the activation function. The maximum sequence length supported by the model is set to 64. During training, the batch size was set to 32. We used the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$ and a weight decay coefficient of $1 \times 10^{-5}$. The model was trained until convergence, monitored by the validation loss, typically around 50 epochs for pre-training and 10 epochs for fine-tuning. The cross-entropy loss was used to optimize the sequence generation task.
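A minimal sketch of this training configuration is shown below, using PyTorch's stock Transformer encoder as a stand-in for PolyReg (the real model adds relative positional encoding, the Gated MLP, and the Mask-Attention of Section 2.3):

```python
import torch.nn as nn
from torch.optim import AdamW

# Hyperparameters from Section 3.2.
d_model, n_heads, n_layers, d_ff, max_len, batch_size = 256, 8, 8, 1024, 64, 32

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=d_ff, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=n_layers)

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()  # sequence-generation objective
```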

3.3. Evaluation Metrics

To comprehensively evaluate the performance of the proposed regularization method, we established evaluation metrics in two dimensions: overall segmentation performance and the quality of the regularized contours.
For overall segmentation performance, drawing on common evaluation methods for semantic segmentation tasks, we use the F1-Score and Intersection over Union (IoU) to measure the overall performance of the model’s regularization. Their calculation formulas are as follows:
$$\mathrm{Precision}(S_{result}, S_{reference}) = \frac{\mathrm{Area}(S_{result} \cap S_{reference})}{\mathrm{Area}(S_{result})}, \tag{13}$$
$$\mathrm{Recall}(S_{result}, S_{reference}) = \frac{\mathrm{Area}(S_{result} \cap S_{reference})}{\mathrm{Area}(S_{reference})}, \tag{14}$$
$$F1\text{-}\mathrm{Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{15}$$
$$\mathrm{IoU}(S_{result}, S_{reference}) = \frac{\mathrm{Area}(S_{result} \cap S_{reference})}{\mathrm{Area}(S_{result} \cup S_{reference})} \tag{16}$$
where $S_{result}$ represents the contour predicted by the model, and $S_{reference}$ is the manually annotated reference contour of the building. $S_{result} \cap S_{reference}$ and $S_{result} \cup S_{reference}$ are the intersection and union of the surfaces formed by the model prediction and the reference contour, respectively, and $\mathrm{Area}$ denotes the area of a region.
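These area-based metrics can be computed directly on polygon geometries; a sketch assuming the shapely library:

```python
from shapely.geometry import Polygon

def segmentation_metrics(result: Polygon, reference: Polygon) -> dict:
    """Eqs. (13)-(16) on a predicted polygon and its reference polygon."""
    inter = result.intersection(reference).area
    union = result.union(reference).area
    precision = inter / result.area
    recall = inter / reference.area
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"Precision": precision, "Recall": recall,
            "F1-Score": f1, "IoU": inter / union}
```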
To more meticulously evaluate the quality of the regularized building contours, we introduce the following vector-level evaluation metrics:
Hausdorff Distance (HD) measures the maximum distance between two contours, effectively reflecting the difference in vertex positions between the regularized contour $S' = [P'_1, P'_2, \ldots, P'_n]$ and the original contour $S = [P_1, P_2, \ldots, P_m]$. A smaller Hausdorff distance indicates that the regularized contour is closer to the original contour. It is defined as:
$$HD(S, S') = \max\big(h(S, S'),\, h(S', S)\big), \tag{17}$$
$$h(S, S') = \max_{i \in [1, m]}\, \min_{j \in [1, n]}\, \lVert P_i - P'_j \rVert. \tag{18}$$
Area Change Ratio (ACR) measures the change in building size before and after regularization:
$$ACR(S, S') = \frac{\mathrm{Area}(S')}{\mathrm{Area}(S)} \times 100\% \tag{19}$$
Perimeter Change Ratio (PCR) measures the change in building perimeter before and after regularization:
$$PCR(S, S') = \frac{\mathrm{Perimeter}(S')}{\mathrm{Perimeter}(S)} \times 100\%. \tag{20}$$
N-Ratio evaluates the change in the number of building points before and after regularization:
$$N\text{-}ratio(S, S') = \frac{N(S')}{N(S)} \times 100\%. \tag{21}$$
The calculation basis for the relative metrics ACR and PCR consistently evaluates the geometric similarity to the ground-truth reference polygon. In contrast, N-Ratio employs a dual reference: for the initial vectorization $S_{vec}$, it compares the ground-truth vertex count with the vectorization's vertex count, indicating the initial complexity relative to the reference; for the final regularized output $S_{reg}$, it measures the simplification effect by comparing the initial vectorization's vertex count with the regularized count.
Shape Regularity Percent (SRP) quantifies the regularity of the regularized contour. Each interior angle of the regularized contour is evaluated for closeness to 90 or 270 degrees; if its deviation is less than a threshold, the angle is considered a regularized interior angle. The proportion of regularized interior angles among all interior angles is calculated as follows:
$$SRP(S') = \frac{\sum_{angle_i \in S'} Regular(angle_i)}{N}, \tag{22}$$
$$Regular(angle_i) = \begin{cases} 1, & \min\big(\lvert angle_i - \tfrac{\pi}{2} \rvert,\, \lvert angle_i - \tfrac{3\pi}{2} \rvert\big) < \theta_{thred} \\ 0, & \min\big(\lvert angle_i - \tfrac{\pi}{2} \rvert,\, \lvert angle_i - \tfrac{3\pi}{2} \rvert\big) \ge \theta_{thred} \end{cases} \tag{23}$$
Complexity-aware Intersection over Union (C-IoU) is used to evaluate the balance between overall performance and contour geometric characteristics. It measures the model’s ability to generate simpler contours while preserving the details of the original building. Its calculation formula is as follows:
$$C\text{-}IoU(S, S') = \big(1 - NG(S)\big)\, IoU(S, S'), \tag{24}$$
$$NG(S) = \frac{\lvert N(S) - N(S_{reference}) \rvert}{N(S) + N(S_{reference})}. \tag{25}$$
where $NG(S)$ is the normalized difference between the number of points in a predicted contour and the number of points in the reference (ground-truth) contour; $N(S)$ is the number of points in the predicted contour, and $N(S_{reference})$ is the number of points in the ground-truth contour. This ratio is used to weight the IoU. C-IoU achieves a higher value when a prediction balances shape fidelity (captured by IoU) and polygonal complexity (captured by NG); the metric decreases significantly when the regularized contour is too simple or retains too many redundant vertices.
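A sketch of the vector-level metrics, again assuming shapely; the orientation of the ACR, PCR, and N-Ratio quotients (regularized over original) follows our reading of the definitions above, and shapely's hausdorff_distance realizes the max-min construction of Equation (18) in both directions:

```python
from shapely.geometry import Polygon

def contour_metrics(reg: Polygon, orig: Polygon, ref: Polygon) -> dict:
    """HD, ACR, PCR, N-Ratio, and C-IoU for a regularized contour `reg`,
    the initial vectorization `orig`, and the ground truth `ref`."""
    # The exterior ring repeats its first coordinate, hence the -1.
    n_reg = len(reg.exterior.coords) - 1
    n_orig = len(orig.exterior.coords) - 1
    n_ref = len(ref.exterior.coords) - 1
    iou = reg.intersection(ref).area / reg.union(ref).area
    ng = abs(n_reg - n_ref) / (n_reg + n_ref)          # Eq. (25)
    return {
        "HD": reg.hausdorff_distance(orig),            # Eqs. (17)-(18)
        "ACR": reg.area / orig.area * 100.0,           # Eq. (19)
        "PCR": reg.length / orig.length * 100.0,       # Eq. (20), perimeter
        "N-Ratio": n_reg / n_orig * 100.0,             # Eq. (21)
        "C-IoU": (1.0 - ng) * iou,                     # Eq. (24)
    }
```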

3.4. Experiments and Analysis

In the experimental phase, we used two regions from the cities of Chicago, IL, USA, and West Tyrol, Austria, as test sets. These two regions represent two different architectural styles, as shown in Figure 11. The buildings in West Tyrol are relatively sparsely distributed but have more complex shapes, representing a typical suburban area. The Chicago test area represents a densely populated residential area in a large city, with dense buildings but relatively regular shapes.
We first used the SegFormer model to process the remote sensing images of these two regions to obtain initial building masks, which were then converted into vector contours. Next, we input these initial contours into the trained PolyReg model for regularization, obtaining the final regularized contour results.
Figure 12 shows a visual comparison of some building contours before and after regularization. It can be seen intuitively that the contours processed by PolyReg (Figure 12d) are smoother and more regular than the original contours generated directly by SegFormer (Figure 12c). Redundant vertices are effectively removed, while the shape characteristics of the original buildings are still well preserved.
To quantitatively evaluate the performance of the PolyReg model, we used the annotated building reference contours in the dataset as ground truth and calculated the evaluation metrics for both the direct vectorization results of SegFormer and the results after PolyReg regularization. Table 2 details the experimental results on these two test sets.
Analyzing the data in Table 2, it can be found that, compared with the direct vectorization results of SegFormer, the results after PolyReg regularization show significant improvements across the evaluation metrics. On the Chicago test set, PolyReg increased the IoU value from 0.67 to 0.80 and the C-IoU value from 0.57 to 0.71. On the West Tyrol test set, the IoU value increased from 0.60 to 0.79, and the C-IoU value increased from 0.43 to 0.57. These results indicate that PolyReg significantly reduces the complexity of the contours while maintaining a high pixel-level coverage rate, achieving an improvement in overall performance.
Further observing the indicators related to the regularization effect, it can be found that the building contours regularized by PolyReg have significantly lower HD values. On the Chicago test set, the HD value decreased from 1.20 to 0.60; on the West Tyrol test set, the HD value decreased from 1.78 to 0.58. At the same time, the ACR and PCR indicators also improved significantly. For example, on the Chicago test set, the ACR increased from 0.79 to 0.88, and the PCR decreased from 1.92 to 1.03; on the West Tyrol test set, the ACR increased from 0.69 to 0.80, and the PCR decreased from 1.90 to 0.83. In addition, the N-Ratio indicator also shows that PolyReg effectively simplifies building contours. These results indicate that PolyReg can effectively remove redundant vertices, generate more regular polygons, and better maintain the shape and size of the original contours.
To more intuitively evaluate the performance of the proposed method across different landscapes, this paper presents, in Figure 13, the regularization results for large-scale representative areas selected from the Chicago and West Tyrol test datasets. As can be seen from the visualization results, the proposed model demonstrates strong generalization ability in handling diverse scenes, capable of generating regularized building outlines over broad areas such as urban and suburban regions. However, the presence of occlusion (indicated by the yellow circle in Figure 13a) and shadows (indicated by the yellow circle in Figure 13b) can have a certain adverse effect on the regularization results.
The improvement in the performance of the PolyReg model is mainly attributed to the design of its pre-training tasks and its Mask-Attention-based sequence generation capability. The pre-training tasks simulate various deformations and noises of building contours, enabling the model to learn their inherent geometric rules; this enhances the model's generalization when faced with different building shapes. The Mask-Attention-based sequence generation mechanism enables the model to fully exploit the differences between the SegFormer predictions and the ground truth when generating sequences, allowing PolyReg to learn the latent structural knowledge of building contours. The model's generative ability therefore makes it possible to recover the true outline of a partially occluded shape, overcoming inherent problems of mask-based prediction, such as areas that are too large, too small, or overly rounded.

4. Discussion

4.1. Comparative Analysis

The method proposed in this study was compared with a traditional method and two deep learning-based methods using the test set. The evaluation was conducted through multiple parameters assessing overall performance and contour quality to determine the advantages and disadvantages of different methods.
The traditional method used is the Contour Regularization Method (CRM) [6], while the two deep learning-based methods are Machine-Learned Regularization (MLR) [3] and Hisup [38]. CRM is a traditional building regularization method based on classic image processing techniques that incorporates multiple geometric constraint rules. MLR is a building regularization method based on Generative Adversarial Networks (GANs), which optimizes vertex positions to regularize contours. Hisup is a multi-task network with hierarchical supervision used for building contour extraction and regularization.
All methods were evaluated on the same test set (Chicago and West Tyrol) using the same metrics. For the traditional method, we used the optimal parameter settings provided in its original publication. For the deep learning-based methods, we used the pre-trained models provided by their respective authors. The experimental results are detailed in Table 3.
The proposed Segformer + PolyReg method outperforms others across several metrics, such as F1-Score, IoU, HD, ACR, PCR, and C-IoU. This indicates that our method can generate more accurate and regular building contours while maintaining high pixel-level coverage.
Specifically, on the Chicago test set, significant advantages in the F1-Score (0.89), IoU (0.80), and C-IoU (0.71) are observed. These values indicate that the generated building masks exhibit the highest overlap with the real masks and the best overall performance. Furthermore, HD (0.59) and ACR (0.88) on this dataset are superior to other methods, demonstrating high geometric accuracy and good preservation of the original building area.
On the West Tyrol test set, a comparable performance to the Hisup method is observed, with both methods achieving good regularization results. Specifically, the presented method exhibits a slight advantage in F1-Score (0.85), IoU (0.79), and C-IoU (0.57). Conversely, Hisup demonstrates marginally better performance in HD (0.57), ACR (0.83), PCR (0.83), and SRP (0.52). This suggests that when applied to diverse datasets, the presented method tends to produce segmentation results that closely align with the ground truth masks, while Hisup demonstrates a marginal advantage in maintaining geometric accuracy, area, perimeter, and regularity.
PolyReg delivers consistently strong performance in the building contour regularization task, with advantages in overall segmentation accuracy and shape matching. Hisup shows superior performance in maintaining geometric accuracy, area, perimeter, and regularity. CRM sacrifices some performance to generate more regular contour shapes and may produce incorrect regularization results for certain buildings. The GAN-based method experiences performance degradation when the main direction of the contour is offset, leading to jagged edges. The visualization results in Figure 14 further substantiate this analysis, indicating that contours generated by PolyReg are smoother, more regular, and more closely resemble real building contours.
To further assess the practical applicability of the proposed methods, average inference times were compared across the test sets. Specifically, 1000 image tiles of size 512 × 512 were extracted from the test set, and the complete inference time for each algorithm was measured on every image tile. The inference time for each algorithm was measured on these 1000 image tiles, and the mean value was subsequently calculated, as summarized in Table 4. The traditional method, CRM, achieves the fastest processing speed, attributable to its reliance on simpler geometric operations. Among the deep learning approaches, Hisup is the fastest, followed by MLR. The proposed Segformer + PolyReg framework yields a running time of 0.98 s per image tile, which is slower than both Hisup and MLR in this comparison. Although the method presented in this paper requires more computational resources compared to traditional techniques such as CRM, and its inference time is slightly longer than the other deep learning models tested here, the overall time cost remains acceptable and comparable. Furthermore, the improvement in output quality justifies the additional computational cost.

4.2. Impact of Different Backbones on PolyReg

To investigate the influence of the initial segmentation quality on the final regularization performance of PolyReg, this section compares SegFormer with three other commonly used semantic segmentation networks, U-Net [13], U-Net++ [39], and DeepLabv3+ [40], combining each with the PolyReg model for experiments.
The experimental process is as follows: First, the same training set (Austin, Kitsap County, Vienna) is used to train the four backbones to obtain building masks. Then, the masks generated by the four backbones are used to train the PolyReg model separately. Finally, the same test set (Chicago, West Tyrol) is used to evaluate the four model combinations. The experimental results are shown in Table 5.
From Table 5, it can be seen that PolyReg significantly improves all metrics under every backbone, especially C-IoU, which again verifies its effectiveness. SegFormer, as a Transformer-based semantic segmentation model, captures global context better and thus achieves the best performance when combined with PolyReg. However, even with the traditional U-Net and its variants, PolyReg still achieves near-optimal performance, indicating that PolyReg is reasonably robust and not strongly dependent on the quality of the initial segmentation results.
The results presented in Table 5 and Figure 15 indicate an interaction between the chosen segmentation backbone and the PolyReg regularization module. PolyReg demonstrates the capability to refine initial masks generated by various backbones. However, the data suggest that the quality of the initial segmentation, influenced by the backbone's characteristics such as SegFormer's ability to capture global context, impacts the final outcome. Using SegFormer as the backbone resulted in higher overall performance metrics after PolyReg regularization compared to the other tested backbones. Figure 15 provides a visual illustration of this point, showing the initial contours generated by different backbone networks as well as the results after PolyReg regularization. For example, in cases where corners are arc-shaped, PolyReg is able to correct them into right angles, as demonstrated in the UnetPlusPlus example of group (C). Even when the initial contours have defects, such as pronounced jagged edges and redundant vertices in the contours generated by U-Net, PolyReg regularization (as shown in the Unet+PolyReg column) can still produce relatively regular contours. These results indicate that, although PolyReg can improve imperfect initial segmentations, its performance is optimal when combined with a stronger segmentation backbone.

4.3. Limitations

Despite the good experimental results achieved by the method proposed in this paper, there are still some limitations:
First, for buildings with complex geometric shapes, such as those containing numerous curves or concavities, the regularization effect of PolyReg may still have room for improvement. Although pre-training tasks can enable the model to learn certain geometric prior knowledge, for overly complex shapes, the model may struggle to generate fully regularized results.
Second, the method's performance can be compromised by challenging image conditions. Severe building occlusion (Figure 13a), strong shadows (Figure 13b), or significant boundary blurring (simulated here using 8× downsampling; Figure 16) can decrease SegFormer's segmentation accuracy. This lower-quality input directly impacts the PolyReg stage, potentially leading to unsatisfactory regularization results such as false detections, omissions, and an overall degradation of contour quality (highlighted by yellow circles in Figure 16).
Third, the performance of the PolyReg model benefits from the self-supervised contour reconstruction pre-training task, but the quality and diversity of pre-training data affect the model’s generalization ability. If the pre-training data lack certain specific types of building contours, the model may not perform well when processing these types of buildings.
Fourth, the computational complexity of Transformer models is relatively high, especially when processing longer contour sequences. Although the PolyReg model proposed in this paper only uses the encoder part, its computational cost is still higher than that of some traditional regularization methods.
Finally, while the method was compared against relevant traditional and deep learning approaches, including the recent Hisup, future work could benchmark against an even broader spectrum of the latest state-of-the-art techniques as the field continues to evolve.

5. Conclusions

This paper proposes a novel two-stage method for high-resolution remote sensing building contour extraction and regularization. The method effectively combines the powerful feature extraction capabilities of Segformer with the sequence generation ability of the Transformer-based PolyReg model. By adapting concepts from natural language processing (NLP), PolyReg frames the building regularization task as a sequence generation problem and employs a cleverly designed Mask-Attention mechanism for autoregressive contour coordinate prediction.
Experiments conducted on the Inria Aerial Image Labeling Dataset demonstrate that the proposed method achieves excellent performance in the building contour regularization task, significantly improving contour regularity and geometric precision while maintaining high pixel-level coverage. Compared to the traditional CRM approach and deep learning-based methods such as MLR and Hisup, our method exhibits notable superiority across most evaluation metrics. Additionally, experiments with different segmentation backbones further validate PolyReg's robustness and its ability to deliver near-optimal performance even with lower-quality initial masks.
While limitations exist, such as challenges with complex geometric shapes, sensitivity to occlusions, and relatively high computational costs, this work demonstrates the theoretical significance and practical value of PolyReg for high-resolution remote sensing building contour regularization. Future research could explore methods to address these limitations and improve the model’s handling of complex shapes, such as by incorporating more advanced loss functions, introducing multi-source remote sensing data, or lowering computational costs through more efficient sequence-generation strategies. Additionally, combining PolyReg with other tasks, such as building instance segmentation, may enable end-to-end solutions for building extraction and regularization.

Author Contributions

Conceptualization, L.C. and H.Q.; Data curation, C.L. and X.C.; Formal analysis, C.L.; Methodology, L.C.; Software, L.C.; Supervision, X.W. and H.Q.; Validation, L.C.; Visualization, L.C. and X.W.; Writing—original draft, L.C.; Writing—review and editing, X.C., X.W. and H.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 42271463, 42101453.

Data Availability Statement

The dataset can be downloaded from Inria Aerial Image Labeling Dataset (https://project.inria.fr/aerialimagelabeling/ (accessed on 21 February 2025)). The code is available by contacting the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Z.; Wegner, J.D.; Lucchi, A. Topological map extraction from overhead images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1715–1724. [Google Scholar]
  2. Girard, N.; Smirnov, D.; Solomon, J.; Tarabalka, Y. Polygonal Building Extraction by Frame Field Learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 5887–5896. [Google Scholar]
  3. Zorzi, S.; Bittner, K.; Fraundorfer, F. Machine-learned Regularization and Polygonization of Building Segmentation Masks. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3098–3105. [Google Scholar]
  4. Yang, H.L.; Yuan, J.; Lunga, D.; Laverdiere, M.; Rose, A.; Bhaduri, B. Building extraction at scale using convolutional neural network: Mapping of the united states. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2600–2614. [Google Scholar] [CrossRef]
  5. Zhao, K.; Kang, J.; Jung, J.; Sohn, G. Building extraction from satellite images using mask R-CNN with building boundary regularization. In Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 247–251. [Google Scholar]
  6. Wei, S.; Ji, S.; Lu, M. Toward automatic building footprint delineation from aerial images using CNN and regularization. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2178–2189. [Google Scholar] [CrossRef]
  7. Wang, W.; Du, J.; Li, X.; Hu, H.; Xu, W.; Guo, H.; Ding, Y. A grid filling based rectangular building outlines regularization method. Geomat. Inf. Sci. Wuhan Univ. 2018, 43, 318–324. [Google Scholar] [CrossRef]
  8. Yazhou, D.; Fajie, F.; Junping, L.; Yan, H.; Weihong, C. Right-angle buildings extraction from high-resolution aerial image based on multi-stars constraint segmentation and regularization. Acta Geod. Cartogr. Sin. 2018, 47, 1630. [Google Scholar]
  9. Douglas, D.H.; Peucker, T.K. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica 1973, 10, 112–122. [Google Scholar] [CrossRef]
  10. Xiang, H.; Jianhua, W.; Ning, W.; Haowen, X. Building contour optimization method for multi-source data. Acta Opt. Sin. 2023, 43, 1228012. [Google Scholar] [CrossRef]
  11. Mousa, Y.A.; Helmholz, P.; Belton, D.; Bulatov, D. Building detection and regularisation using DSM and imagery information. Photogramm. Rec. 2019, 34, 85–107. [Google Scholar] [CrossRef]
  12. Yunfan, L.; Gong, W.; Lin, Y.; Wang, B. The extraction of building boundaries based on LiDAR point cloud data and imageries. Remote Sens. Land Resour. 2014, 26, 54–59. [Google Scholar]
Figure 1. Two-stage building contour regularization framework. $P_S$, $P_C$, and $P_E$ denote the start token, contour separator, and end token, respectively. The numbers in the subplot at the bottom indicate the sequential numbering of the vertices.
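To make the sequence formulation in Figure 1 concrete, the sketch below shows one plausible way to serialize a set of building contours into a single token stream delimited by $P_S$, $P_C$, and $P_E$. The specific token IDs, the 8-bit coordinate quantization, and the helper name are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch: serialize building contours into one token sequence.
# Token IDs and the 0-255 coordinate quantization are assumptions.
P_S, P_C, P_E = 256, 257, 258          # start, contour-separator, end tokens
                                        # (coordinate values occupy IDs 0..255)

def serialize_contours(contours):
    """Flatten a list of contours (each a list of (x, y) vertices already
    quantized to integers in [0, 255]) into a single token sequence."""
    tokens = [P_S]
    for i, contour in enumerate(contours):
        if i > 0:
            tokens.append(P_C)          # separate successive contours
        for x, y in contour:
            tokens.extend([x, y])       # interleave x and y coordinates
    tokens.append(P_E)
    return tokens

# Example: two small rectangular footprints
seq = serialize_contours([[(10, 10), (50, 10), (50, 40), (10, 40)],
                          [(80, 80), (120, 80), (120, 110), (80, 110)]])
print(seq)  # [256, 10, 10, 50, 10, ..., 257, 80, 80, ..., 258]
```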
Figure 2. SegFormer network architecture for semantic mask extraction.
Figure 3. PolyReg transformer encoder architecture.
Figure 4. Gated multi-layer perceptron (Gated MLP) structure.
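For reference alongside Figure 4, the following is a minimal sketch of a generic GLU-style gated MLP, in which one linear branch gates the other element-wise before the output projection. The activation choice, hidden width, and class name are assumptions; the paper's exact variant may differ.

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """Generic GLU-style gated MLP sketch: a sigmoid-activated gate branch
    modulates a value branch element-wise before the output projection."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.proj = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gating: sigmoid(gate) in (0, 1) scales the value branch.
        return self.proj(torch.sigmoid(self.gate(x)) * self.value(x))

x = torch.randn(2, 16, 128)          # (batch, sequence length, d_model)
print(GatedMLP(128, 256)(x).shape)   # torch.Size([2, 16, 128])
```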
Figure 5. Self-supervised reconstruction task setup. $P_S$, $P_C$, and $P_E$ denote the start token, contour separator, and end token, respectively; $S$ represents the original contour sequence, and $S'$ represents the desired regularized output sequence. The numbers in the subplot at the bottom indicate the sequential numbering of the vertices.
Figure 6. Mask-Attention mechanism for sequence generation. $P_S$, $P_C$, and $P_E$ denote the start token, contour separator, and end token, respectively; $S$ represents the original contour sequence, and $S'$ represents the desired regularized output sequence. On the right, green solid lines indicate the bidirectional attention paths allowed within the input sequence, while blue dashed lines indicate the forward-only (causal) attention paths allowed for the output sequence; all other paths are masked. The numbers in the subplot at the bottom indicate the sequential numbering of the vertices.
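The mask structure that the Figure 6 caption describes can be illustrated with a small NumPy sketch in the spirit of the UniLM sequence-to-sequence mask: positions in the input (source) segment attend bidirectionally among themselves, while positions in the output (target) segment attend to the full source plus only their already-generated prefix. The function name and boolean convention are our assumptions.

```python
import numpy as np

def seq2seq_attention_mask(n_src: int, n_tgt: int) -> np.ndarray:
    """Boolean matrix M where M[i, j] = True means query position i may
    attend to key position j. Source rows see the whole source; target
    rows see the whole source plus the causal target prefix."""
    n = n_src + n_tgt
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_src, :n_src] = True                  # source <-> source: bidirectional
    mask[n_src:, :n_src] = True                  # target -> source: always visible
    mask[n_src:, n_src:] = np.tril(             # target -> target: causal
        np.ones((n_tgt, n_tgt), dtype=bool))
    return mask

print(seq2seq_attention_mask(3, 3).astype(int))
```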
Figure 7. Transformations for noising the input shape. The numbers (0–5) indicate the order of the vertices in the building contour, and red font highlights the vertex indices affected by the transformation.
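For intuition, two representative noising transformations of the kind Figure 7 depicts might look like the sketch below: small vertex jitter and insertion of redundant midpoints, so that the reconstruction task forces the model to recover the clean, low-complexity outline. The exact transformations, magnitudes, and probabilities used in the paper are assumptions here.

```python
import random

def jitter_vertices(contour, max_offset=2):
    """Perturb every vertex by a small random integer offset."""
    return [(x + random.randint(-max_offset, max_offset),
             y + random.randint(-max_offset, max_offset))
            for x, y in contour]

def insert_midpoints(contour, prob=0.5):
    """Randomly insert redundant midpoint vertices on edges; the
    reconstruction target remains the clean contour, so the model must
    learn to drop such superfluous points."""
    noisy = []
    n = len(contour)
    for i, (x, y) in enumerate(contour):
        noisy.append((x, y))
        nx, ny = contour[(i + 1) % n]   # next vertex (contour is closed)
        if random.random() < prob:
            noisy.append(((x + nx) // 2, (y + ny) // 2))
    return noisy

square = [(10, 10), (50, 10), (50, 40), (10, 40)]
print(insert_midpoints(jitter_vertices(square)))
```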
Figure 8. Self-supervised reconstruction network architecture. The numbers (0–5) indicate the order of the vertices in the building contour, and red font highlights the vertex indices affected by the transformation.
Figure 9. End-to-end building simplification network architecture. The numbers (0–5) indicate the order of the vertices in the building contour, and red font highlights the vertex indices affected by the transformation.
Figure 10. Examples from the Inria Aerial Image Labeling Dataset. (a) High-resolution aerial imagery sample depicting an urban area; (b) Corresponding building annotation map, where building footprints are shown in white and background in black.
Figure 11. Examples of test regions in (a) Chicago, IL, USA and (b) West Tyrol, Austria.
Figure 12. Visual comparison of building contours before and after regularization. (a) Original image; (b) ground truth; (c) SegFormer output; (d) PolyReg output.
Figure 13. Visualization of regularization results on larger areas from the (a) Chicago and (b) West Tyrol test datasets. The yellow circles indicate areas where regularization failed.
Figure 14. Visual comparison of regularization results from different methods. (a–e) show five different sample regions.
Figure 15. Visual comparison of initial contours generated by different backbones and their regularization results after PolyReg processing. (a–e) show five different sample regions.
Figure 16. Failure cases under simulated severe boundary blur (8× downsampling). The three yellow circles highlight examples of (i) false positive, (ii) missed detection, and (iii) degraded outline quality.
Table 1. Composition of the Inria Aerial Image Labeling Dataset regions used for training and testing.

| Dataset | City | Total Area | Total Buildings |
|---|---|---|---|
| Train | Austin, TX, USA | 81 km² | 52,275 |
| Train | Kitsap County, WA, USA | 81 km² | 24,066 |
| Train | Vienna, Austria | 81 km² | 32,229 |
| Test | Chicago, IL, USA | 81 km² | 80,652 |
| Test | West Tyrol, Austria | 81 km² | 17,458 |
Table 2. Experimental results on the Chicago and West Tyrol test sets.

| City | Method | IoU | HD | ACR | PCR | SRP | N-Ratio | C-IoU |
|---|---|---|---|---|---|---|---|---|
| Chicago | SegFormer | 0.67 | 1.20 | 0.79 | 1.92 | 0.88 | 2.98 | 0.57 |
| Chicago | SegFormer + PolyReg | 0.80 | 0.60 | 0.88 | 1.03 | 0.63 | 0.63 | 0.71 |
| West Tyrol | SegFormer | 0.60 | 1.78 | 0.69 | 1.91 | 0.90 | 2.57 | 0.43 |
| West Tyrol | SegFormer + PolyReg | 0.79 | 0.58 | 0.80 | 0.83 | 0.55 | 0.51 | 0.57 |
Table 3. Performance comparison of different methods on the Chicago and West Tyrol datasets.

| City | Method | F1-Score | IoU | HD | ACR | PCR | SRP | C-IoU |
|---|---|---|---|---|---|---|---|---|
| Chicago | CRM | 0.75 | 0.58 | 0.72 | 0.67 | 1.02 | 0.05 | 0.54 |
| Chicago | MLR | 0.34 | 0.25 | 2.89 | 0.58 | 3.34 | 0.69 | 0.13 |
| Chicago | HiSup | 0.70 | 0.58 | 1.83 | 0.65 | 1.53 | 0.45 | 0.48 |
| Chicago | SegFormer + PolyReg | 0.89 | 0.80 | 0.59 | 0.88 | 1.03 | 0.63 | 0.71 |
| West Tyrol | CRM | 0.79 | 0.68 | 1.45 | 0.76 | 0.62 | 0.12 | 0.49 |
| West Tyrol | MLR | 0.58 | 0.45 | 3.27 | 0.53 | 1.76 | 0.67 | 0.27 |
| West Tyrol | HiSup | 0.83 | 0.75 | 0.57 | 0.83 | 0.83 | 0.52 | 0.54 |
| West Tyrol | SegFormer + PolyReg | 0.85 | 0.79 | 0.58 | 0.80 | 0.83 | 0.55 | 0.57 |
Table 4. Comparison of average inference time.

| Method | Average Inference Time (s per image tile) |
|---|---|
| CRM | 0.44 |
| MLR | 0.89 |
| HiSup | 0.70 |
| SegFormer + PolyReg | 0.98 |
Table 5. Performance comparison of different backbone networks combined with PolyReg on the Chicago and West Tyrol test sets.

| City | Method | P | R | IoU | HD | ACR | PCR | SRP | N-Ratio | C-IoU |
|---|---|---|---|---|---|---|---|---|---|---|
| Chicago | U-Net | 0.773 | 0.831 | 0.568 | 1.641 | 0.701 | 1.574 | 0.755 | 3.244 | 0.512 |
| Chicago | U-Net + PolyReg | 0.922 | 0.856 | 0.799 | 0.632 | 0.873 | 0.995 | 0.588 | 0.697 | 0.638 |
| Chicago | U-Net++ | 0.812 | 0.808 | 0.639 | 1.399 | 0.731 | 1.691 | 0.721 | 3.098 | 0.537 |
| Chicago | U-Net++ + PolyReg | 0.921 | 0.849 | 0.791 | 0.755 | 0.871 | 1.009 | 0.463 | 0.579 | 0.676 |
| Chicago | DeepLabv3+ | 0.815 | 0.838 | 0.690 | 1.029 | 0.797 | 1.878 | 0.782 | 2.976 | 0.594 |
| Chicago | DeepLabv3+ + PolyReg | 0.928 | 0.839 | 0.788 | 0.777 | 0.863 | 0.996 | 0.435 | 0.791 | 0.694 |
| Chicago | SegFormer | 0.877 | 0.781 | 0.673 | 1.197 | 0.785 | 1.917 | 0.883 | 2.978 | 0.571 |
| Chicago | SegFormer + PolyReg | 0.920 | 0.862 | 0.797 | 0.594 | 0.882 | 1.029 | 0.628 | 0.625 | 0.706 |
| West Tyrol | U-Net | 0.888 | 0.656 | 0.553 | 1.927 | 0.631 | 1.958 | 0.937 | 3.135 | 0.411 |
| West Tyrol | U-Net + PolyReg | 0.941 | 0.808 | 0.764 | 0.983 | 0.805 | 0.990 | 0.496 | 0.434 | 0.543 |
| West Tyrol | U-Net++ | 0.770 | 0.821 | 0.613 | 2.224 | 0.711 | 1.9235 | 0.825 | 4.111 | 0.299 |
| West Tyrol | U-Net++ + PolyReg | 0.947 | 0.791 | 0.755 | 1.075 | 0.798 | 0.989 | 0.562 | 0.434 | 0.543 |
| West Tyrol | DeepLabv3+ | 0.810 | 0.612 | 0.587 | 1.878 | 0.671 | 2.192 | 0.750 | 3.058 | 0.414 |
| West Tyrol | DeepLabv3+ + PolyReg | 0.942 | 0.800 | 0.758 | 1.047 | 0.795 | 0.986 | 0.463 | 0.377 | 0.539 |
| West Tyrol | SegFormer | 0.851 | 0.677 | 0.599 | 1.784 | 0.685 | 1.905 | 0.900 | 2.565 | 0.431 |
| West Tyrol | SegFormer + PolyReg | 0.875 | 0.826 | 0.793 | 0.582 | 0.801 | 0.827 | 0.545 | 0.506 | 0.574 |