RegionGraph: Region-Aware Graph-Based Building Reconstruction from Satellite Imagery

Li, Lei; Fang, Chenrong; Li, Wei; Chen, Kan; Li, Baolong; Sun, Qian

doi:10.3390/jimaging12040161


Lei Li ¹, Chenrong Fang ², Wei Li ¹, Kan Chen ³,*, Baolong Li ¹ and Qian Sun ¹,*

¹ School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China

² College of Intelligence and Computing, Tianjin University, Tianjin 300072, China

³ Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore

* Authors to whom correspondence should be addressed.

J. Imaging 2026, 12(4), 161; https://doi.org/10.3390/jimaging12040161

Submission received: 25 January 2026 / Revised: 11 March 2026 / Accepted: 31 March 2026 / Published: 8 April 2026

(This article belongs to the Special Issue Progress, Challenges, and Future Trends in Computer Vision and Pattern Recognition)


Abstract

Structural reconstruction, an essential computer vision task for understanding visual content, infers the spatial relationships and object layouts in a scene. However, it remains challenging due to the high complexity of scene structural topologies in real-world environments. To address this challenge, this paper proposes RegionGraph, a novel method for structural reconstruction of buildings from a satellite image. It utilizes a layout region graph construction and graph contraction approach, introducing a primitive (layout region) estimation network named ConPNet for detecting and estimating different structural primitives. By combining structural extraction and rendering synthesis processes, RegionGraph constructs a graph structure with layout regions as nodes and adjacency relationships as edges, and transforms the graph optimization process into a node-merging-based graph contraction problem to obtain the final structural representation. Experiments demonstrate that RegionGraph achieves a 4% improvement in average F1 score across three types of primitives and exhibits higher regional completeness and structural coherence in the reconstructed structure.

Keywords: structural reconstruction; remote sensing image reconstruction; primitive detection; region-aware; subgraph solving; graph contraction

1. Introduction

Building structure extraction from top-down remote sensing imagery [1] has been an active research topic in both the computer vision and remote sensing communities. Buildings are key objects that convey rich geographic information in remote sensing images, and their extraction is important for many applications, such as land cover classification [2], urban planning [3], and geographic information database updating. In the context of smart city development, accurate and automated building extraction is particularly important, as it directly affects the efficiency and reliability of downstream applications. Structural information extracted from remote sensing images provides valuable support for urban planning, land resource management, and real estate registration. Moreover, such information promotes the intelligent use of remote sensing data and improves its practical value. In military scenarios, remote sensing imagery can also be used to analyze the structure and layout of buildings, providing useful information for reconnaissance tasks.

Despite its importance, building structure extraction from remote sensing images remains challenging. A major difficulty lies in handling building layouts with arbitrary and complex topologies. In particular, for outdoor building vectorization from satellite images [1], as illustrated in Figure 1, the traditional Manhattan-world assumption often fails due to long-range imaging effects and perspective distortion. These factors make accurate structure extraction more difficult. Therefore, developing an accurate, efficient, and automated method for building structure extraction from remote sensing images is of significant practical value.

Early building reconstruction methods for remote sensing imagery mainly relied on heuristic algorithms [4,5]. Although effective, these methods often had a high computational cost. With the rapid development of deep neural networks (DNNs), remote sensing image processing has undergone significant progress. DNNs [6] show strong performance in detecting low-level primitives, such as corner points, greatly improving detection accuracy and efficiency.

However, despite these advances, understanding high-level geometric structures, such as global topology and graph relationships, remains challenging for DNN-based methods. Most state-of-the-art approaches still rely on traditional optimization techniques to infer high-level structures after low-level primitives are detected. While these optimization methods are effective, they often involve complex formulations and extensive engineering to encode structural constraints, resulting in complicated and time-consuming pipelines. Therefore, improving high-level structural reasoning while maintaining efficient low-level primitive detection remains an open problem in building reconstruction from remote sensing imagery.

To address these challenges, this paper proposes a structured reconstruction framework based on graph construction and graph optimization for building structure extraction from remote sensing images. The core idea is to use a graph model to represent and integrate topological information in a scene, and to iteratively refine this representation through optimization. Based on this framework, we introduce a new method, RegionGraph. The experimental results show that the proposed method achieves competitive performance on building structure extraction tasks. Compared with traditional heuristic approaches and DNN-based low-level primitive detection methods, RegionGraph provides more accurate structural reconstruction and demonstrates higher robustness when handling complex scenes and topologies. Our main contributions are summarized as follows:

We propose a structured reconstruction framework based on graph construction and graph optimization. In the graph construction stage, topological information is encoded into an initial graph representation. In the graph optimization stage, this representation is refined by automatically adjusting graph nodes and edges to improve reconstruction accuracy and efficiency.
We introduce RegionGraph for building structure extraction from overhead remote sensing images. The method incorporates a primitive estimation network, ConPNet, which uses regional heatmap context to estimate primitive locations and attributes. By representing building topology as a region-based graph and formulating graph optimization as a node-merging contraction process, RegionGraph improves both regional completeness and structural consistency.
We conduct comparison and ablation experiments on the SpaceNet dataset. The results demonstrate that the proposed method performs well in terms of regional completeness and structural relationship modeling, achieving performance comparable to state-of-the-art methods.

Note that we compare RegionGraph with a set of representative methods that span the major paradigms in building structure extraction from remote sensing imagery. Given the large body of existing work, these baselines are not intended to be exhaustive, but are selected to reflect commonly adopted reconstruction approaches. Through this comparison, we demonstrate that explicitly incorporating region-aware representations together with graph-based structural optimization leads to improved structural completeness and topological consistency. Notably, the proposed region-level graph formulation is method-agnostic and can be readily integrated into other primitive-based or wireframe reconstruction pipelines, highlighting its general applicability beyond the specific instantiation used in RegionGraph.

The remainder of this paper is organized as follows. Section 2 reviews related work on building structure extraction from remote sensing images. Section 3 describes the proposed RegionGraph method in detail. Section 4 presents the experimental results, including comparison and ablation studies, as well as qualitative visualizations. Finally, Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Structural Reconstruction and Structural Reasoning

Structural reconstruction belongs to the broader domain of structural reasoning, together with tasks such as human pose estimation [7] and semantic relational reasoning in images [8]. These tasks share a common goal of extracting vectorized structural representations from input data. Depending on the reconstruction target, such representations may include line segments [9], planar surfaces [10,11,12], room layouts [13,14], and polygonal rings [15].

Research on building extraction from remote sensing images can be traced back to the 1990s [16]. Early approaches mainly treated building extraction as a pixel-level image segmentation problem. However, such methods often produce jagged, irregular, and fragmented building boundaries, which are insufficient for practical applications. To address this issue, vector-based reconstruction methods were introduced, where buildings are represented as closed polygonal rings. More recently, reconstruction methods have further shifted their focus toward modeling internal and finer-grained structural details in order to obtain more accurate and complete building vector representations. The increasing complexity and variability of building topologies make building structure extraction from remote sensing images both challenging and impactful.

2.2. Traditional Structural Reconstruction Methods

Traditional structural reconstruction methods mainly rely on basic image processing techniques, such as histograms [17], Hough transforms, superpixel segmentation, and geometric fitting. For example, Okorn et al. [17] detect vertical planes in 3D point clouds by constructing histograms of vertical point distributions to generate floor plans. Furukawa et al. [10] and Silberman et al. [11] perform planar reconstruction using graph-cut formulations. Cabral et al. [18] recover floor layouts through dynamic programming, while Delage [19] estimates room layouts using Bayesian networks.

These traditional approaches rely heavily on heuristic rules and handcrafted assumptions, making them sensitive to noise and complex scenes. As a result, their performance often degrades when applied to real-world data with irregular structures. With the development of neural networks, structural reconstruction has gradually shifted toward data-driven solutions, enabling more robust modeling of complex structural patterns.

2.3. Deep Learning for Building Extraction from Remote Sensing Images

In recent years, deep learning-based methods have significantly improved the accuracy of building extraction from remote sensing images. Early approaches based on Convolutional Neural Networks (CNNs) [6] and Fully Convolutional Networks (FCNs) [20] typically formulate building extraction as a semantic segmentation problem, where each pixel is assigned a category label [21,22,23]. While these methods can identify building regions, they are unable to distinguish individual building instances.

To overcome this limitation, instance segmentation methods such as Mask R-CNN [24], SOLO [25], YOLACT [26], and PANet [27] have been applied to building extraction tasks. These methods predict individual building masks using bounding boxes or prototype-based representations. However, instance segmentation methods still face challenges, such as interference between overlapping instances and inaccurate boundary localization [28]. Moreover, both semantic and instance segmentation approaches produce pixel-level outputs that require extensive post-processing and still fall short of manual-level boundary delineation.

2.4. Contour-Based and Topology-Aware Methods

Another line of research treats building extraction as a contour regression problem, where the vertex coordinates of building polygons are directly predicted [29,30,31,32,33,34]. Compared to pixel-based segmentation methods, contour-based approaches are theoretically more efficient because they directly generate vector representations and reduce the need for post-processing steps such as raster-to-vector conversion.

However, simple contour representations often fail to capture complex building shapes, especially in scenarios where building blocks are fragmented, overlapping, or vertically varied. To address these limitations, recent studies have begun to explore the internal topology of building structures. For example, Zhang et al. [35] proposed a convolutional message-passing neural network that models buildings as structured graphs and extracts vector representations of building planar components from satellite images. This work highlights the importance of topological reasoning and graph-based modeling in structured reconstruction.

2.5. Summary

Despite recent progress, reconstructing building exterior structures remains challenging due to the presence of arbitrary and complex graph topologies. First, neural network-based heuristic optimization methods often require substantial computational resources, as they rely on iterative optimization over large search spaces. For complex building layouts, this process can become computationally expensive and time-consuming.

Second, end-to-end reconstruction pipelines may struggle to produce compact and well-closed structural representations. Although these methods can directly generate reconstruction results from input images, the diversity and complexity of building structures may lead to redundant or overly complex outputs. Therefore, achieving a balance between reconstruction accuracy, structural completeness, and computational efficiency remains a key research direction in structured building reconstruction from remote sensing imagery.

Recent transformer-based contour and graph reconstruction models further explore global attention mechanisms. RegionGraph differs by emphasizing region-aware graph initialization and contraction for structural compactness.

3. Method

3.1. Overall Architecture

The overall architecture of the proposed RegionGraph is illustrated in Figure 2.

The method represents the vectorized reconstruction result as a region-based topological graph, where nodes correspond to region (room-like) primitives and edges encode adjacency relationships between neighboring regions, as shown in Figure 3. This graph representation provides an explicit and structured description of building topology.

RegionGraph follows a two-stage pipeline. In the first stage, contour sampling points are obtained through primitive extraction. Based on these sampled points, a triangulation process is applied, and each triangulated unit is treated as a node to construct an initial graph structure. This step transforms low-level geometric primitives into a graph representation that preserves local spatial relationships. Figure 4 details the internal architecture of ConPNet used in the first stage.
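The step "each triangulated unit is treated as a node" can be illustrated with a minimal sketch. It assumes the triangulation is already available as a list of vertex-index triples (a hypothetical input format, not one specified in the paper) and links two triangle nodes whenever they share a side. The function name `build_triangle_graph` is illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def build_triangle_graph(triangles):
    """Return adjacency edges between triangles that share a side.

    triangles: list of 3-tuples of vertex indices (assumed input format).
    Each triangle index becomes a graph node.
    """
    side_to_tris = defaultdict(list)
    for t_idx, tri in enumerate(triangles):
        # Register the three sides of this triangle under a canonical key.
        for a, b in combinations(sorted(tri), 2):
            side_to_tris[(a, b)].append(t_idx)

    edges = set()
    for tris in side_to_tris.values():
        # Triangles registered under the same side are adjacent.
        for i, j in combinations(tris, 2):
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Three triangles in a strip: 0 and 1 share side (1, 2); 1 and 2 share side (2, 3).
triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
graph_edges = build_triangle_graph(triangles)
print(graph_edges)
```

The resulting edge list is the kind of initial, over-segmented adjacency graph that the second stage then contracts.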

In the second stage, a node-merging strategy is applied to the initial graph to perform graph contraction. By iteratively merging nodes according to predefined criteria, the graph is simplified and refined, leading to the final reconstruction result. This region-based graph construction and optimization strategy effectively improves both the regional completeness and the structural accuracy of building extraction from remote sensing images.

3.2. Graph Construction: ConPNet

We present the proposed ConPNet, a novel contextual information-based primitive estimation network. In the graph construction stage, RegionGraph first initializes the graph by estimating keypoint information through a keypoint heatmap regression task. For building structure extraction from remote sensing images, structural lines correspond to the contours of building roof components. Therefore, ConPNet focuses on regional features of building rooftops and extracts keypoint information based on the contextual information provided by regional heatmaps. The predicted geometric primitives at keypoint locations form the basis for subsequent topological graph construction. These keypoints include corner points and sampling points, which correspond to contour intersection points and uniformly distributed samples along contour edges, respectively.

ConPNet performs the transformation from remote sensing images to building structural primitive heatmaps, which can be viewed as an image-to-image translation task under a constrained scenario. Given a remote sensing image X containing a building, the image includes intrinsic structural information that remains invariant under different environmental conditions. This structural information is denoted as S:

$$S = \{E, V\}, \quad V = \{(x, y) \mid x, y \in S\}, \quad E = \{(v_{i}, v_{j}) \mid v_{i}, v_{j} \in V\} \tag{1}$$

Here, V denotes the set of building corner points, and E denotes the set of contour edges. The combination of V and E uniquely determines the building structure. Since corners and edges belong to different geometric categories, we adopt a sampling-based representation to unify these primitives and approximate S using $S'$:

$$S' = \{P\}, \quad P = \{(x, y) \mid (x, y) \text{ on } e,\ e \subseteq E\} \tag{2}$$

Here, P represents the set of sampling points uniformly distributed along building contour edges, and $V \subseteq P$. Therefore, for a given input image X, the task of ConPNet can be expressed as:

$$R(S') = F(X) \tag{3}$$

Here, $F(X)$ denotes the heatmap extraction operation, and $R(S')$ represents the kernelized heatmap generated from the point set $S'$. Based on this formulation and the structural properties of different heatmaps, ConPNet decomposes the extraction of $S'$ into two stages: structural extraction and rendering synthesis, defined as:

$$F(X) = F_{f}(F_{ve}(X)) \tag{4}$$

$$F_{ve}(X) = [F_{v}(X), F_{e}(X)] \tag{5}$$

1. Structure extraction operation $F_{ve}$: This step aims to remove environmental and appearance variations while recovering vectorized structural information. Specifically, corner reconstruction $F_{v}$ and edge reconstruction $F_{e}$ are applied to suppress background noise and façade textures, and to recover building contour corners and edges. The outputs of $F_{v}$ and $F_{e}$ are concatenated along the channel dimension before being passed to the subsequent rendering synthesis module.
2. Rendering synthesis operation $F_{f}$: This step performs discrete sampling of boundary structures based on detected corner points. Sampling points are generated along boundary heatmaps according to corner locations to produce the final sampled-point representation.

Since building reconstruction from remote sensing images emphasizes the main building region, and because direct extraction of corners and contours is challenging, ConPNet incorporates regional geometric priors into the structure extraction process. The transformation is therefore expressed as:

$$F(X) = F_{f}(F_{ve}(F_{r}(X))), \tag{6}$$

where $F_{r}$ denotes the regional heatmap reconstruction operation.

Based on this design, ConPNet consists of two main components: a structural heatmap reconstruction network and a rendering synthesis network. The structural heatmap reconstruction network includes a Region Reconstruction Network (RRN), a Boundary Reconstruction Network (BRN), and a Point Reconstruction Network (PRN), which correspond to $F_{r}$, $F_{e}$, and $F_{v}$, respectively. The Fusion Network (FN) implements the rendering synthesis operation $F_{f}$. The separation of region-, boundary-, and corner-aware branches is motivated by their distinct structural roles in graph initialization and contraction.

Note that all reconstruction modules in ConPNet, including the RRN, BRN, PRN, and FN, are trained end-to-end using supervised heatmap regression losses. No predefined operators or handcrafted rules are used in the structure extraction or rendering synthesis stages.

3.2.1. Structural Design of ConPNet

As illustrated in Figure 4, ConPNet adopts a two-module architecture composed of the structure extraction module and the rendering synthesis module. The structure extraction module processes the input image to generate multiple primitive heatmaps, while the rendering synthesis module integrates these primitives to produce the final sampled point heatmap.

The detailed architecture of each module is shown in Figure 5. Given a remote sensing image $F \in \mathbb{R}^{H \times W \times 3}$, a backbone network first extracts low-level features, producing a feature map $F_{1} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times N}$. The backbone consists of a $3 \times 3$ convolution layer, a residual ConvBlock, and another $3 \times 3$ convolution layer. The extracted feature map $F_{1}$ is then fed into the RRN to obtain intermediate region-aware features $R_{in}$ and the corresponding regional heatmap R.

The regional heatmap R is upsampled and combined with $F_{1}$ and $R_{in}$ through residual fusion to produce $F_{2}$, which serves as the input to both the PRN and the BRN. These two networks generate corner-aware features $P_{in}$ and edge-aware features $B_{in}$, respectively, which are further transformed into the corner heatmap P and the boundary heatmap B. Finally, $P_{in}$ and $B_{in}$ are concatenated along the channel dimension to form $F_{3}$, which is passed to the Fusion Network to produce the sampled-point heatmap $WP$. While channel attention improves feature weighting during fusion, alternative lightweight fusion strategies may also be explored in future work.

The following subsections describe the structure extraction and rendering synthesis modules in detail.

3.2.2. Structure Extraction

The goal of the structure extraction stage is to generate multiple primitive heatmaps that encode building structural information. Since Hourglass networks [36] have demonstrated strong performance in multi-scale feature extraction and localization tasks [37], we adopt Hourglass-style architectures as the backbone of the structure extraction module. The parameter settings of the base convolutional blocks are summarized in Table 1, and the configurations of selected network modules are shown in Table 2.

The RRN aims to suppress complex background information and extract region-aware features corresponding to the building body. It consists of a three-layer Hourglass network [36] followed by a RegionHead. The Hourglass network produces a region-aware feature map $R_{in}$, which is further transformed by the RegionHead to generate the regional heatmap R. This heatmap serves as a geometric prior that constrains subsequent primitive heatmap generation. Due to the smooth spatial distribution of region heatmaps, the RRN is trained using a mean-squared-error loss. The configuration of RegionHead is presented in Table 2.

The PRN infers the spatial distribution of corner points along building contours. Similarly to the RRN, the PRN adopts a three-layer Hourglass architecture followed by a PointHead. The fused feature map incorporating regional priors is processed to produce corner-aware features and the corresponding corner heatmap P. The PRN is trained using an $L_{1}$ loss to encourage accurate localization of corner points. The parameter settings of the PointHead are provided in Table 2.

The BRN is designed to recover building contour boundaries from the fused feature representation. It shares the same Hourglass-based architecture as the PRN and outputs an edge heatmap B through a BoundaryHead. To handle the imbalance between boundary and non-boundary pixels, the BRN is trained using the Adaptive Wing Loss [37]. The detailed configuration of BoundaryHead is also shown in Table 2.

3.2.3. Rendering Synthesis

The rendering synthesis stage aims to generate the sampled-point heatmap by discretely sampling boundary information guided by corner cues. This process is implemented by the Fusion Network (FN), which takes corner-aware and edge-aware feature maps as inputs and outputs the sampled-point heatmap $WP$.

To effectively fuse multi-scale structural information, the FN incorporates a channel attention mechanism. Channel attention assigns adaptive weights to feature channels, enabling the network to emphasize informative channels while suppressing less relevant ones. As illustrated in Figure 6, the channel attention module follows a Squeeze–Excitation–Transform pipeline, where global average pooling is used for feature squeezing, fully connected layers model inter-channel dependencies, and a sigmoid activation normalizes the channel weights.

Specifically, the FN applies multiple convolutional layers with different kernel sizes (from $3 \times 3$ to $25 \times 25$) to capture both local and global structural information. The resulting multi-scale features are concatenated and passed through the channel attention module to produce scale-adaptive feature weights. The weighted features are then fused through element-wise multiplication and addition, followed by a convolution layer to generate the final sampled-point heatmap $WP$.
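The squeeze, excitation, and reweighting steps of the channel attention module can be sketched in plain Python. This is only an illustration under assumptions: the ReLU between the two fully connected layers, the weight shapes, and the names `channel_attention`, `w1`, and `w2` are not taken from the paper, and the weights here are hand-picked rather than learned.

```python
import math


def channel_attention(features, w1, w2):
    """Reweight channels with a squeeze-and-excitation style gate.

    features: list of C channels, each a 2D list of floats.
    w1: hidden x C weight matrix, w2: C x hidden weight matrix (assumed shapes).
    """
    # Squeeze: global average pooling per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation: FC -> ReLU (assumed) -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Transform: scale every value of a channel by that channel's weight.
    scaled = [[[v * a for v in row] for row in ch] for ch, a in zip(features, weights)]
    return scaled, weights


# Two 2x2 channels: one fully active, one empty.
features = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, -1.0]]      # one hidden unit
w2 = [[2.0], [-2.0]]    # one output weight per channel
scaled, weights = channel_attention(features, w1, w2)
print(weights)
```

In this toy input the active channel receives the larger weight, which is the behaviour the module relies on to emphasize informative channels.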

The Fusion Network is trained using a combination of Adaptive Wing Loss [37] and $L_{1}$ loss:

$$L_{wp} = \lambda_{1} L_{AWL} + \lambda_{2} L_{1}. \tag{7}$$

Here, both loss terms are computed between the ground-truth sampled-point heatmap $y_{wp}$ and the predicted heatmap $\hat{y}_{wp}$. The weights $\lambda_{1}$ and $\lambda_{2}$ balance the contributions of the Adaptive Wing Loss and $L_{1}$ terms. They were selected based on validation-set experiments to ensure stable training and balanced performance across the region, boundary, and corner reconstruction tasks. In all experiments, the Adaptive Wing Loss parameters $\omega$, $\epsilon$, $\alpha$, and $\theta$ were set to 14, 1, 2.1, and 0.5, respectively, and the weights $\lambda_{1}$ and $\lambda_{2}$ were both set to 0.5.
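To make Equation (7) concrete, the following sketch evaluates the combined loss per pixel using the Adaptive Wing Loss as defined in [37], with the parameter values reported above. The mean reduction over pixels, the flattened-list input format, and the function names are assumptions made for illustration; the paper does not state how the terms are reduced.

```python
import math


def adaptive_wing(y, y_hat, omega=14.0, eps=1.0, alpha=2.1, theta=0.5):
    """Per-pixel Adaptive Wing Loss; assumes the target y lies in [0, 1]."""
    diff = abs(y - y_hat)
    p = alpha - y  # exponent adapts to the ground-truth pixel value
    if diff < theta:
        return omega * math.log(1.0 + (diff / eps) ** p)
    # Linear branch; a and c make the two branches join smoothly at theta.
    r = (theta / eps) ** p
    a = omega * (1.0 / (1.0 + r)) * p * ((theta / eps) ** (p - 1.0)) / eps
    c = theta * a - omega * math.log(1.0 + r)
    return a * diff - c


def sampled_point_loss(target, pred, lam1=0.5, lam2=0.5):
    """L_wp = lam1 * L_AWL + lam2 * L_1, averaged over pixels (assumed reduction)."""
    n = len(target)
    awl = sum(adaptive_wing(t, p) for t, p in zip(target, pred)) / n
    l1 = sum(abs(t - p) for t, p in zip(target, pred)) / n
    return lam1 * awl + lam2 * l1


# Flattened toy heatmaps: a perfect prediction and a slightly wrong one.
target = [0.0, 0.2, 1.0, 0.2]
print(sampled_point_loss(target, target))
print(sampled_point_loss(target, [0.1, 0.2, 0.6, 0.2]))
```

Because the exponent depends on the target value, errors on foreground pixels (targets near 1) are penalized more sharply than the same errors on background pixels.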

3.3. Graph Optimization: Graph Shrinkage via Node Merging

This section introduces the second stage of RegionGraph, namely graph optimization. The initial region-based graph constructed from ConPNet outputs is simplified into a target graph that represents complete building components.

The optimization stage in RegionGraph is formulated as a geometric simplification problem over a triangulated planar graph rather than as a relational representation learning task. Since the initial graph is constructed by triangulating sampled contour points, over-segmentation naturally appears at the primitive level. Node-merging-based graph contraction is therefore structurally aligned with this initialization: it consolidates adjacent triangular regions into coherent roof components while preserving planarity and connectivity. Compared with learning-based graph optimization methods such as graph neural networks, or global partitioning techniques such as spectral clustering and min-cut formulations, our approach emphasizes deterministic, interpretable, and computationally efficient structural refinement without introducing additional learnable parameters at the optimization stage.

Given an initial graph $G = \{E, V\}$, where V denotes nodes and E denotes edges, the optimization objective is to minimize the following energy function:

$$E_{\mathrm{graph}} = E_{\mathrm{node}}(V) + E_{\mathrm{edge}}(E), \tag{8}$$

where the two terms measure node completeness and inter-region relationships, respectively, as shown in Figure 7.

The energy formulation focuses on enforcing local region completeness and adjacency consistency. Higher-order global topological constraints, such as cycle consistency or global connectivity regularization, are not explicitly encoded in the current model. In practice, the planar triangulation initialization and adjacency-preserving contraction implicitly maintain structural validity for footprint-level reconstruction. However, incorporating explicit global constraints may further improve robustness in highly complex building layouts.

Node term ($E_{\mathrm{node}}$). The node term evaluates whether a node corresponds to a complete building roof component. It combines the regional heatmap R and boundary heatmap B to measure the structural completeness of each polygonal node:

$$E_{\mathrm{node}}(V) = \sum_{v \in V} \Big( \sum_{e_{j} \in v} E_{\mathrm{heat}}(e_{j}) \Big) / v_{\mathrm{num}}. \tag{9}$$

The edge heat penalty is defined as:

$$E_{\mathrm{heat}}(e_{j}) = \frac{1}{|e_{j}|} \int_{0}^{1} \big( R(e_{j}(u)) - B(e_{j}(u)) \big)\, du, \tag{10}$$

$$e_{j}(u) = u\, e_{j}^{1} + (1 - u)\, e_{j}^{2}, \tag{11}$$

where $|e_{j}|$ is the edge length and $e_{j}^{1}, e_{j}^{2}$ are its endpoints. Smaller values of $E_{\mathrm{heat}}$ indicate a higher likelihood that an edge corresponds to a true structural boundary. Minimizing $E_{\mathrm{node}}$ encourages each node to represent a complete building component. In the implementation, the continuous integral in Equation (10) is approximated by uniform discrete sampling along each edge. Specifically, points are sampled at one-pixel intervals along the edge segment, and bilinear interpolation is used to obtain heatmap values at subpixel locations. The integral is then computed as the average of sampled values multiplied by the edge length.
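The discrete approximation described above can be sketched as follows. The sketch takes one reading of Equation (10) together with the implementation note: the integral is the sample average times the edge length, so after the $1/|e_{j}|$ normalization the edge heat reduces to the mean of R minus B along the edge. Heatmaps are assumed to be row-major nested lists indexed as `heatmap[row][col]`, and the function names are hypothetical.

```python
import math


def bilinear(heatmap, x, y):
    """Bilinearly interpolate heatmap at column x, row y (clamped to the image)."""
    h, w = len(heatmap), len(heatmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = heatmap[y0][x0] * (1 - fx) + heatmap[y0][x1] * fx
    bottom = heatmap[y1][x0] * (1 - fx) + heatmap[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def edge_heat(region, boundary, p1, p2):
    """Mean of R - B along the segment p1 -> p2, sampled about once per pixel."""
    length = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    n = max(int(round(length)), 1)  # number of one-pixel steps
    total = 0.0
    for k in range(n + 1):
        u = k / n
        # Equation (11): e(u) = u * e1 + (1 - u) * e2
        x = u * p1[0] + (1 - u) * p2[0]
        y = u * p1[1] + (1 - u) * p2[1]
        total += bilinear(region, x, y) - bilinear(boundary, x, y)
    return total / (n + 1)


# 5x5 toy heatmaps: the whole patch is "inside", with a boundary along row 2.
region = [[1.0] * 5 for _ in range(5)]
boundary = [[1.0 if r == 2 else 0.0 for _ in range(5)] for r in range(5)]
on_boundary = edge_heat(region, boundary, (0, 2), (4, 2))
off_boundary = edge_heat(region, boundary, (0, 0), (4, 0))
print(on_boundary, off_boundary)
```

An edge lying on a predicted boundary yields a low value and an edge crossing the interior of a region yields a high one, matching the interpretation of $E_{\mathrm{heat}}$ given in the text.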

Edge term ($E_{\mathrm{edge}}$). The edge term measures region merging based on adjacency relationships. A small average $E_{\mathrm{heat}}$ along a shared boundary indicates a true structural separation, while a large value suggests that two regions belong to the same component. The formulation is:

$$E_{\mathrm{edge}} = \sum_{e^{g} \in E} \Big( \sum_{e_{j} \in e^{g}} E_{\mathrm{heat}}(e_{j}) \Big) / e_{\mathrm{num}}^{g}, \tag{12}$$

where $e^{g}$ denotes a relational edge and $e_{\mathrm{num}}^{g}$ is the number of triangular edges associated with it. Minimizing $E_{\mathrm{edge}}$ reduces unnecessary region merging and preserves structural independence.

Overall, minimizing $E_{\mathrm{graph}}$ is formulated as a graph contraction problem based on node merging, where regions corresponding to the same building component are progressively merged.

Node Merging Strategy

The goal of graph contraction is to reduce the number of nodes and edges while preserving graph connectivity. The contraction process consists of two stages:

Triangular-region merging. Each triangular node must belong to the same component as at least one neighboring node. To reduce the search space, nodes connected by the longest edge in each triangle are merged. Given an initial graph $G = \{E, V\}$, the merged graph $G' = \{E', V'\}$ is defined as:

$$G' = \{E', V'\}, \tag{13}$$

$$V' = V \cup \{w\} - \{u, v \mid e^{uv} \in \{e_{t}^{\max}\},\ t \in T\}, \tag{14}$$

$$E' = \{(w, x) \mid (u, x) \in E,\ x \neq v\} \cup \{(w, y) \mid (v, y) \in E,\ y \neq u\}, \tag{15}$$

where $e_{t}^{\max}$ is the longest edge of triangle t, T is the triangle set, and w is the merged node.
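The triangular-region merging rule can be sketched with a union-find structure: every triangle is merged with the neighbor on the other side of its longest edge. This is a minimal sketch under assumptions. Triangles are given as vertex-index triples, a longest edge on the outer boundary (no neighboring triangle) is simply left unmerged rather than attached to a background node, and all names are illustrative.

```python
import math
from collections import defaultdict


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_across_longest_edges(points, triangles):
    """Return one component label per triangle after longest-edge merging."""
    side_to_tris = defaultdict(list)
    for t, tri in enumerate(triangles):
        for k in range(3):
            a, b = tri[k], tri[(k + 1) % 3]
            side_to_tris[(min(a, b), max(a, b))].append(t)

    parent = list(range(len(triangles)))
    for t, tri in enumerate(triangles):
        sides = [(tri[k], tri[(k + 1) % 3]) for k in range(3)]
        # Longest side of this triangle by Euclidean length.
        a, b = max(sides, key=lambda s: math.dist(points[s[0]], points[s[1]]))
        for other in side_to_tris[(min(a, b), max(a, b))]:
            if other != t:
                parent[find(parent, other)] = find(parent, t)
    return [find(parent, t) for t in range(len(triangles))]


# A unit square split along its diagonal, plus a thin triangle attached on the right.
points = [(0, 0), (1, 0), (1, 1), (0, 1), (3, 0.5)]
triangles = [(0, 1, 2), (0, 2, 3), (1, 4, 2)]
labels = merge_across_longest_edges(points, triangles)
print(labels)
```

The two halves of the square share the diagonal as their longest edge and collapse into one node, while the thin triangle's longest side lies on the outer boundary and therefore stays separate in this sketch.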

Relational-edge-based merging. After triangular merging, polygonal nodes are further merged using relational-edge confidence. The procedure consists of three steps:

1. Compute $E_{\mathrm{heat}}(e^{g})$ for each relational edge $e^{g} \in E'$.
2. Remove edges with confidence greater than a threshold $T_{\mathrm{merge}}$:

$$E'' = E' - \{e^{g} \mid E_{\mathrm{heat}}(e^{g}) > T_{\mathrm{merge}}\}. \tag{16}$$

3. Merge nodes connected by the removed edges:

$$V'' = V' \cup \{w'\} - \{u', v' \mid e^{u'v'} \in E' - E''\}. \tag{17}$$
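The three steps above can be sketched as a single thresholded contraction. The sketch assumes the per-edge heat values have already been computed and are supplied as a dictionary keyed by node pairs; the threshold value 0.5 and the function name are placeholders, since the value of $T_{\mathrm{merge}}$ is not given in this section.

```python
def contract_by_threshold(num_nodes, edge_heat, t_merge):
    """Merge nodes joined by relational edges whose heat exceeds t_merge.

    edge_heat: dict mapping (u, v) node pairs to E_heat values (assumed input).
    Returns (labels, remaining_edges) for the contracted graph.
    """
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Steps 2 and 3: edges above the threshold are removed and their
    # endpoints merged into a single node.
    for (u, v), heat in edge_heat.items():
        if heat > t_merge:
            parent[find(u)] = find(v)

    labels = [find(i) for i in range(num_nodes)]
    # Surviving edges are re-attached to the merged nodes.
    remaining = set()
    for (u, v) in edge_heat:
        a, b = labels[u], labels[v]
        if a != b:
            remaining.add((min(a, b), max(a, b)))
    return labels, sorted(remaining)


# Four regions in a chain; only the middle edge looks like a real boundary.
heats = {(0, 1): 0.8, (1, 2): 0.1, (2, 3): 0.7}
labels, remaining = contract_by_threshold(4, heats, t_merge=0.5)
print(labels, remaining)
```

Regions 0 and 1 merge, regions 2 and 3 merge, and the low-heat edge between the two groups survives as the only remaining adjacency.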

Figure 8 illustrates the overall contraction process. The initial graph is first simplified by triangular-region merging, followed by relational-edge-based merging. Finally, the dummy background node introduced during graph construction is removed to obtain the final vectorized structural representation.

4. Experiments and Analysis

4.1. Dataset and Sample Processing

4.1.1. Dataset

The experiments are conducted on high-resolution satellite RGB imagery from the SpaceNet corpus [1], hosted on Amazon Web Services (AWS) as part of the SpaceNet Challenge. We adopt the benchmark dataset introduced by Nauata et al. [1], which contains 2001 building crops from three cities (Atlanta, Paris, and Las Vegas). Each image is cropped into a $256 \times 256$ RGB patch.

The 2001 samples are split into 1601/50/350 for training, validation, and testing. Each sample includes a satellite RGB image and a corresponding vector-structure annotation. The vector graph represents a roof structure, where vertices correspond to corners, and edges correspond to roof components (Figure 1). Across the dataset, the average/maximum numbers of corners and edges are 12.6/93 and 14.2/101, respectively.

4.1.2. Sample Processing

RegionGraph constructs graph structures based on ConPNet heatmap regression outputs. We therefore generate ground-truth heatmaps for each geometric primitive as follows:

Sampled-point heatmap. We uniformly sample points along each annotated edge at a 10-pixel interval to form the sampled-point set. Each sampled point is rendered into the heatmap as a 2D Gaussian kernel centered at its pixel location, with standard deviation $σ = 2$, resulting in the sampled-point heatmap.
Corner heatmap. Corner heatmaps are generated using the same Gaussian rendering procedure applied to the annotated corner set.
Boundary heatmap. Annotated edges are dilated to 3-pixel-wide line segments and then smoothed with a Gaussian filter with $σ = 2$ to form the boundary heatmap.
Region heatmap. The closed region enclosed by annotated edges is filled with 1, and all other pixels are set to 0, producing the region heatmap.
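The sampled-point and corner heatmaps described above can be sketched with NumPy as follows. The function names are hypothetical, and two details are assumptions the text does not specify: sampling starts at the first endpoint of each edge, and overlapping Gaussians are combined with a pixel-wise maximum. The boundary and region heatmaps are not covered here.

```python
import numpy as np


def sample_edge(p, q, step=10.0):
    """Points along the segment p->q at a fixed pixel interval, starting at p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    length = np.linalg.norm(q - p)
    if length == 0:
        return p[None, :]
    n = int(length // step)
    ts = np.arange(n + 1) * step / length      # fractions of the segment length
    return p + ts[:, None] * (q - p)


def render_gaussians(points, size=256, sigma=2.0):
    """Render one 2D Gaussian per (x, y) point into a size x size heatmap."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size), dtype=float)
    for x, y in points:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)             # assumption: max, not sum
    return heat


# Hypothetical annotated edge from (20, 30) to (60, 30), in (x, y) pixels.
samples = sample_edge((20, 30), (60, 30))
heat = render_gaussians(samples)
```

Passing the annotated corner coordinates to `render_gaussians` instead of `samples` would give the corner heatmap under the same assumptions.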

4.2. Evaluation Metrics and Experimental Setup

4.2.1. Evaluation Metrics

We follow the evaluation protocol of Nauata et al. [1]. Metrics are computed for three primitive types (corner, edge, and region) using precision, recall, and F1.

Corner metrics. A predicted corner is counted as a true positive if it lies within 8 pixels of a ground-truth corner. Each ground-truth corner can match at most one prediction. The recall and precision are:

$Recall_{V} = \frac{TP_{V}}{TP_{V} + FN_{V}}, \quad Precision_{V} = \frac{TP_{V}}{TP_{V} + FP_{V}}.$

(18)

Edge metrics. An edge is counted as correctly reconstructed if and only if both of its endpoints are correctly reconstructed. The recall and precision are:

$Recall_{E} = \frac{TP_{E}}{TP_{E} + FN_{E}}, \quad Precision_{E} = \frac{TP_{E}}{TP_{E} + FP_{E}}.$

(19)
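The corner and edge criteria can be illustrated with a short sketch. It is not the official evaluation code: the greedy nearest-first matching, the inclusive 8-pixel radius, and the reading of "both endpoints correctly reconstructed" as "both endpoints matched to ground-truth corners that share a ground-truth edge" are assumptions, and all function names are hypothetical.

```python
from math import dist


def match_corners(pred, gt, radius=8.0):
    """Greedy one-to-one matching; returns {pred_index: gt_index}."""
    pairs = sorted(
        (dist(p, g), i, j)
        for i, p in enumerate(pred)
        for j, g in enumerate(gt)
        if dist(p, g) <= radius
    )
    match, used_gt = {}, set()
    for _, i, j in pairs:                  # closest candidate pairs first
        if i not in match and j not in used_gt:
            match[i] = j
            used_gt.add(j)
    return match


def count_edge_tp(pred_edges, gt_edges, match):
    """Count ground-truth edges recovered by predicted edges with matched endpoints."""
    gt_set = {frozenset(e) for e in gt_edges}
    hits = set()
    for a, b in pred_edges:
        if a in match and b in match:
            e = frozenset((match[a], match[b]))
            if e in gt_set:
                hits.add(e)
    return len(hits)


def precision_recall(tp, n_pred, n_gt):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall


# Hypothetical toy annotation and prediction.
gt_corners = [(10, 10), (50, 10), (50, 50)]
gt_edges = [(0, 1), (1, 2)]
pred_corners = [(12, 11), (49, 9), (90, 90)]
pred_edges = [(0, 1), (1, 2)]

match = match_corners(pred_corners, gt_corners)
corner_pr = precision_recall(len(match), len(pred_corners), len(gt_corners))
edge_tp = count_edge_tp(pred_edges, gt_edges, match)
edge_pr = precision_recall(edge_tp, len(pred_edges), len(gt_edges))
```

Here two of the three predicted corners fall within the radius, and only the first predicted edge has both endpoints matched, so the second edge counts as a false positive.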

Region metrics. A predicted region is considered correct if its IoU with a ground-truth roof region is at least 70%. The recall and precision are:

$Recall_{R} = \frac{TP_{R}}{TP_{R} + FN_{R}}, \quad Precision_{R} = \frac{TP_{R}}{TP_{R} + FP_{R}}.$

(20)
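A minimal sketch of the region criterion, assuming regions are available as boolean masks of equal shape. The helper names and the one-to-one assignment of predictions to ground-truth regions are assumptions for illustration only.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def count_region_tp(pred_masks, gt_masks, thresh=0.7):
    """Count predictions whose best unused ground-truth region has IoU >= thresh."""
    used, tp = set(), 0
    for p in pred_masks:
        best_j, best = None, thresh
        for j, g in enumerate(gt_masks):
            iou = mask_iou(p, g)
            if j not in used and iou >= best:
                best_j, best = j, iou
        if best_j is not None:
            used.add(best_j)
            tp += 1
    return tp


# Hypothetical 10 x 10 example: the ground truth covers 50 pixels.
gt = np.zeros((10, 10), dtype=bool)
gt[0:5, 0:10] = True
good = np.zeros((10, 10), dtype=bool)
good[0:5, 0:8] = True     # 40 of the 50 pixels, IoU = 0.8
poor = np.zeros((10, 10), dtype=bool)
poor[0:5, 0:6] = True     # 30 of the 50 pixels, IoU = 0.6
```

The first prediction passes the 70% threshold and the second does not, which shows how a region that is mostly right can still count as a miss.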

F1 score. We report F1 to jointly evaluate precision and recall:

$F1 = \frac{2 \times Recall \times Precision}{Recall + Precision}.$

(21)

We adopt the standard SpaceNet evaluation protocol to ensure fair comparison with prior work. While corner, edge, and region F1 scores provide a widely accepted measure of reconstruction accuracy, they do not fully capture higher-order topological correctness. More explicit topology-aware metrics, such as graph edit distance or component-level IoU, could provide additional insight and are left for future investigation.

4.2.2. Experimental Setup

All experiments are conducted on a server equipped with a 2.6 GHz CPU and an NVIDIA RTX 3090 GPU; training takes approximately 10 h to converge. We implement the models in Python 3.8 with PyTorch 1.9.0. Training uses the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$ and a weight decay of $1 \times 10^{-5}$. Models are trained for 700 epochs, and the learning rate is decayed by a factor of 0.1 during the last 100 epochs. Unless specified otherwise, we set the loss weights in Equation (7) to $λ_{1} = 0.5$ and $λ_{2} = 0.5$, and the graph-shrinkage threshold in Equation (16) to $T_{merge} = 0.5$. This threshold was selected through validation experiments over the range [0.3, 0.7], where region F1 peaked near 0.5; lower values caused over-merging and higher values led to under-merging. ConPNet consists of three Hourglass-style reconstruction branches (RRN, PRN, BRN) with shared feature width $N = 256$, followed by a lightweight fusion module. The overall network has on the order of $10^{7}$ trainable parameters (about 30 M), which is comparable to other multi-branch architectures for structural localization.
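For reference, the reported hyperparameters can be collected in a small configuration sketch. The constant and function names are hypothetical, and the schedule is read as a single step at epoch 600; the text does not state whether the decay is one step or gradual, so that reading is an assumption.

```python
# Values taken from the setup paragraph; names are illustrative only.
BASE_LR = 1e-4
WEIGHT_DECAY = 1e-5
EPOCHS = 700
DECAY_EPOCHS = 100      # the learning rate is reduced during the last 100 epochs
GAMMA = 0.1
LAMBDA_1 = 0.5          # loss weights of Equation (7)
LAMBDA_2 = 0.5
T_MERGE = 0.5           # graph-shrinkage threshold of Equation (16)


def learning_rate(epoch):
    """Learning rate for a 0-indexed epoch, assuming one step decay at epoch 600."""
    return BASE_LR * (GAMMA if epoch >= EPOCHS - DECAY_EPOCHS else 1.0)
```

In PyTorch, a step schedule of this kind is commonly expressed with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[600]` and `gamma=0.1`, although the source does not say which scheduler was used.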

Although experiments are conducted on 256 × 256 patches following the SpaceNet protocol, RegionGraph can be extended to large satellite imagery using a sliding-window or tiling strategy with overlap. ConPNet is fully convolutional, allowing arbitrary image sizes during inference, and graph contraction is applied locally on initialized graph components. Therefore, the method can be integrated into large-scene processing pipelines through standard tile-based aggregation.
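A sliding-window layout of the kind mentioned above might look like the following sketch. The 32-pixel overlap and the function names are assumptions; the text does not specify an overlap or how tile-level predictions would be aggregated.

```python
def tile_origins(length, tile=256, overlap=32):
    """Start offsets of tiles that cover [0, length) with at least `overlap` pixels shared."""
    if length <= tile:
        return [0]                      # a single tile; padding may be needed
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)        # last tile sits flush with the image border
    return starts


def tile_windows(height, width, tile=256, overlap=32):
    """All (top, left, bottom, right) windows for an image of the given size."""
    return [
        (y, x, y + tile, x + tile)
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


# Hypothetical 1000 x 480 scene.
windows = tile_windows(1000, 480)
```

Placing the last tile flush with the border avoids padding on large images, at the cost of a larger overlap for that final tile.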

4.3. Comparative Evaluation

To validate the effectiveness of RegionGraph, we compare it against seven representative baselines: PolyRNN++ [38], PPGNet [39], Hamaguchi [40], SDSC-UNet [41], L-CNN [42], Nauata [1], and ConvMPN [35].

Raster-based segmentation methods. Hamaguchi [40] and SDSC-UNet [41] produce raster masks and require raster-to-vector conversion for structural evaluation.

Vector reconstruction methods. The remaining five methods directly predict vector structures. PolyRNN++ [38] performs contour regression, PPGNet [39] and L-CNN [42] focus on wireframe detection, Nauata [1] is the benchmark for satellite building vectorization, and ConvMPN [35] also reconstructs building structure via a graph representation.

Table 2 reports precision and recall for corners, edges, and regions. RegionGraph achieves the best region performance among all methods. This is primarily because our region-based graph construction preserves more area-level structure by triangulating sampled points into region nodes, which improves region recall. In addition, graph shrinkage via node merging explicitly enforces component completeness, improving region precision.

Methods that represent structures only as corner/edge primitives often ignore global region relationships, leading to overlaps or inconsistencies between adjacent components and weaker region metrics. Segmentation-based methods provide binary masks but are less sensitive to precise boundaries, which also limits region performance. On corner and edge metrics, RegionGraph trails L-CNN [42] in recall and Nauata [1] in precision (Table 2). This gap may be attributed to heatmap regression and peak-based vector extraction, where discretization can introduce small localization errors compared to direct coordinate regression or detection.

Table 3 reports the F1 scores. RegionGraph achieves the best region F1 and the best average F1. Compared with Nauata [1], RegionGraph improves region F1 by 7.7 points (60.8 to 68.5) and average F1 by 4.6 points (64.2 to 68.8), indicating stronger overall structural consistency.
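As a worked check, the RegionGraph entries in Table 3 can be reproduced from the region precision and recall in Table 2 using Equation (21). The average appears to be the unweighted mean of the three per-primitive F1 scores; this is an inference from the reported numbers rather than something the text states.

```latex
\begin{aligned}
F1_{R} &= \frac{2 \times 65.4 \times 71.9}{65.4 + 71.9} = \frac{9404.52}{137.3} \approx 68.5, \\
F1_{avg} &= \frac{78.0 + 59.9 + 68.5}{3} = \frac{206.4}{3} = 68.8 .
\end{aligned}
```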

Figure 9 presents the qualitative comparisons. RegionGraph produces more complete region structures, with fewer dangling vertices and fewer intersecting edges. PolyRNN++ [38] typically captures only the outermost contour and can miss internal structure. Primitive-based methods often suffer from fragmentation, producing isolated corners and edges. Segmentation-derived vector contours can be ambiguous at shared boundaries, where adjacent regions are represented by duplicated edges, reducing compactness. The Nauata baseline [1] relies on handcrafted constraints and is more sensitive to Manhattan layouts; in non-Manhattan cases it can show noticeable angular bias. Overall, RegionGraph achieves strong region completeness with clearer structural relationships.

Figure 10 visualizes intermediate heatmaps and final reconstructions on representative sample images. In the graph construction stage, ConPNet predicts region (R), boundary (B), corner (CP), and sampled-point (WP) heatmaps. The region heatmap captures footprint completeness but is less sharp at boundaries, while boundary and corner heatmaps refine geometric details. The sampled-point heatmap integrates these cues to support graph initialization. In the optimization stage, graph shrinkage converts the noisy initial structure into a compact vector representation and can correct local errors using global consistency. For instance, shifted corners can be corrected, false-positive internal edges can be removed, and missing internal boundaries can be recovered.

To better illustrate the reconstructed structures, we manually create 3D models in SketchUp using selected RegionGraph results (Figure 11). The reconstructed floor plans remain accurate for complex composite buildings and ring structures with courtyards, providing a strong basis for downstream 3D reconstruction. Note that building height and roof type are inferred manually from shading.

While RegionGraph improves regional completeness and structural compactness, certain failure cases remain. These include over-merging in closely adjacent roof components, under-segmentation in highly fragmented structures, and occasional missing internal boundaries under weak boundary responses. These limitations reflect challenges in heatmap quality and graph initialization.

4.4. Ablation Study

We evaluate different module combinations to validate the RegionGraph design. Table 4 summarizes the ablation settings, and Table 2 and Table 3 report the quantitative results. Note that the ablation study is designed to evaluate representative configurations of RegionGraph rather than exhaustively testing all possible combinations. Because the graph optimization stage operates on the graph produced by ConPNet, the two stages are structurally coupled and cannot be fully disentangled without modifying the pipeline. Therefore, we adopt a progressive configuration strategy to assess the impact of the main components.

Point_sample. Use sampled points as the triangulation input; otherwise use corners.
Contract_tri. Enable triangular-region merging; otherwise remove this contraction step.
Contract_heat. Enable relational-edge-based merging; otherwise remove this contraction step.
R. Use region heatmap prior in ConPNet; otherwise remove RegionHead and Downsample_R.
CA. Use channel attention in the Fusion Network; otherwise replace it with simple channel summation.

Table 2 and Table 3 show that sampled-point initialization improves structural recall compared to corner-only initialization (Setting 1 vs. Setting 3), indicating that sampled points capture richer boundary and region cues. Graph optimization substantially improves all metrics (Setting 2 vs. Setting 5), demonstrating that node-merging-based shrinkage is critical for removing redundancy and enforcing component-level consistency. The comparison among Settings 3, 4, and 5 highlights that both contraction stages matter: triangular merging provides coarse simplification, while relational-edge-based merging further refines adjacency using global cues. Incorporating region priors improves heatmap quality (Setting 5 vs. Setting 6), and channel attention further strengthens multi-scale fusion (Setting 6 vs. Setting 7).

Figure 12 provides qualitative evidence consistent with the quantitative trends. Sample-point-based initialization yields more complete boundaries and regions. Triangular merging removes many redundant edges, and relational-edge-based merging further suppresses false positives and improves compactness. Adding region priors and channel attention consistently moves predictions closer to the manual annotations.

All experiments follow the official SpaceNet split with identical training protocols to ensure fair comparison. The observed improvements are consistent across multiple structural metrics.

5. Conclusions

We proposed RegionGraph for building structure extraction from top-down satellite imagery. RegionGraph consists of two stages: (i) ConPNet predicts sampled structural primitives via heatmap regression and initializes a region graph using triangulation; (ii) graph optimization formulates structure refinement as a graph shrinkage problem and produces compact vector representations via triangular merging and relational-edge-based merging.

We evaluate RegionGraph using standard metrics for corners, edges, and regions (precision, recall, and F1). Compared with classical wireframe extraction methods (PPGNet [39], L-CNN [42]) and representative remote-sensing baselines (PolyRNN++ [38], Hamaguchi [40], SDSC-UNet [41], Nauata [1], and ConvMPN [35]), RegionGraph achieves stronger region accuracy and better overall structural consistency. The qualitative results further show improved region completeness and clearer structural relationships. The ablation studies confirm the effectiveness of each component in the proposed pipeline.

While the proposed graph contraction strategy effectively improves region-level structural consistency, the current energy formulation primarily captures local completeness and adjacency relationships and does not explicitly model higher-order global topological constraints. Future work may explore integrating global connectivity regularization or learning-based relational refinement to enhance generalization to more complex building layouts and large-scale urban scenes.

In addition, the current evaluation relies on benchmark metrics and does not explicitly quantify graph-level topological consistency, which may be further explored using dedicated topology-aware measures. Although the threshold demonstrates stable behavior within the evaluated range on SpaceNet, adaptive threshold selection or cross-dataset validation may further improve robustness under varying building typologies and imaging conditions.

The empirical validation could be further strengthened by a more comprehensive combinatorial ablation analysis, statistical significance testing, broader comparison with recent transformer-based models, unified runtime benchmarking, and analysis of large-scene runtime, memory footprint, and real-time deployment; these will be considered in future work. We also plan to integrate shadow-based height estimation, which could enable a more automated 3D reconstruction pipeline.

Author Contributions

Conceptualization, K.C., B.L., and Q.S.; methodology, K.C., B.L. and Q.S.; software, L.L., C.F. and W.L.; validation, W.L. and K.C.; investigation, L.L., C.F., W.L., K.C., B.L., and Q.S.; writing—original draft preparation, L.L., C.F., W.L., K.C., B.L. and Q.S.; writing—review and editing, K.C. and Q.S.; supervision, K.C., B.L., and Q.S.; funding acquisition, B.L. and Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62571256 and the Youth Science and Technology Talent Promotion Project of Jiangsu Province under Grant JSTJ-2024-392.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from SpaceNet and are available at https://registry.opendata.aws/spacenet/ (accessed on 30 March 2026) under the CC BY-NC-SA 4.0 license.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Nauata, N.; Furukawa, Y. Vectorizing world buildings: Planar graph reconstruction by primitive detection and relationship inference. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 711–726.
2. Zhao, S.; Tu, K.; Ye, S.; Tang, H.; Hu, Y.; Xie, C. Land Use and Land Cover Classification Meets Deep Learning: A Review. Sensors 2023, 23, 8966.
3. Sharifi, A.; Khavarian-Garmsir, A.R.; Allam, Z.; Asadzadeh, A. Progress and prospects in planning: A bibliometric review of literature in Urban Studies and Regional and Urban Planning, 1956–2022. Prog. Plan. 2023, 173, 100740.
4. Hough, P.V. Method and Means for Recognizing Complex Patterns. U.S. Patent 3,069,654, 18 December 1962.
5. Zhao, W.; Persello, C.; Stein, A. Building instance segmentation and boundary regularization from high-resolution remote sensing images. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium; IEEE: New York, NY, USA, 2020; pp. 3916–3919.
6. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551.
7. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2014; pp. 1653–1660.
8. Xu, D.; Zhu, Y.; Choy, C.B.; Li, F. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 5410–5419.
9. Xu, Y.; Xu, W.; Cheung, D.; Tu, Z. Line segment detection using transformers without edges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 4257–4266.
10. Furukawa, Y.; Curless, B.; Seitz, S.M.; Szeliski, R. Manhattan-world stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2009; pp. 1422–1429.
11. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760.
12. Nguyen, T.; Reitmayr, G.; Schmalstieg, D. Structural modeling from depth images. IEEE Trans. Vis. Comput. Graph. 2015, 21, 1230–1240.
13. Zou, C.; Colburn, A.; Shan, Q.; Hoiem, D. Layoutnet: Reconstructing the 3D room layout from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 2051–2059.
14. Yang, S.T.; Wang, F.E.; Peng, C.H.; Wonka, P.; Sun, M.; Chu, H.K. Dula-net: A dual-projection network for estimating room layouts from a single rgb panorama. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2019; pp. 3363–3372.
15. Gimenez, L.; Hippolyte, J.L.; Robert, S.; Suard, F.; Zreik, K. Reconstruction of 3D building information models from 2D scanned plans. J. Build. Eng. 2015, 2, 24–35.
16. Zhang, Y. Optimisation of building detection in satellite images by combining multispectral classification and texture filtering. ISPRS J. Photogramm. Remote Sens. 1999, 54, 50–60.
17. Okorn, B.; Xiong, X.; Akinci, B.; Huber, D. Toward automated modeling of floor plans. In Proceedings of the Symposium on 3D Data Processing, Visualization and Transmission, Paris, France, 17–20 May 2010; Volume 2.
18. Cabral, R.; Furukawa, Y. Piecewise planar and compact floorplan reconstruction from images. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2014; pp. 628–635.
19. Delage, E.; Lee, H.; Ng, A.Y. A dynamic bayesian network model for autonomous 3D reconstruction from a single indoor image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2006; Volume 2, pp. 2418–2428.
20. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2015; pp. 3431–3440.
21. Girard, N.; Smirnov, D.; Solomon, J.; Tarabalka, Y. Polygonal Building Extraction by Frame Field Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 5891–5900.
22. He, J.; Cheng, Y.; Wang, W.; Ren, Z.; Zhang, C.; Zhang, W. A Lightweight Building Extraction Approach for Contour Recovery in Complex Urban Environments. Remote Sens. 2024, 16, 740.
23. Li, K.; Liu, R.; Cao, X.; Bai, X.; Zhou, F.; Meng, D.; Wang, Z. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2025; pp. 10545–10556.
24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 2961–2969.
25. Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. Solo: Segmenting objects by locations. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 649–665.
26. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 9157–9166.
27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 8759–8768.
28. Tian, Z.; Shen, C.; Chen, H. Conditional convolutions for instance segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 282–298.
29. Peng, S.; Jiang, W.; Pi, H.; Li, X.; Bao, H.; Zhou, X. Deep snake for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 8533–8542.
30. Liu, Z.; Liew, J.H.; Chen, X.; Feng, J. DANCE: A Deep Attentive Contour Model for Efficient Instance Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2021; pp. 345–354.
31. Wei, S.; Zhang, T.; Ji, S. A Concentric Loop Convolutional Neural Network for Manual Delineation-Level Building Boundary Segmentation from Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4407511.
32. Wang, L.; Wang, G.; Luo, X.; Wang, L.; Yu, W.; Zhang, Z.; Gao, H. Contour-based instance segmentation method of road scene. Sci. Rep. 2025, 15, 33692.
33. Xiao, X.; Wang, K.; Zhong, Z.; Qu, W.; Wu, W.; Cui, Z.; Su, Y.; Li, A.; Gong, J.; Li, D. A novel data-driven based high-precision building roof contour full-automatic extraction and structured 3D reconstruction method combining stereo images and LiDAR points. Int. J. Digit. Earth 2025, 18, 2484668.
34. Yao, W.; Li, C.; Xiong, M.; Dong, W.; Chen, H.; Xiao, X. ContourFormer: Real-Time Contour-Based End-to-End Instance Segmentation Transformer. arXiv 2025, arXiv:2501.17688.
35. Zhang, F.; Nauata, N.; Furukawa, Y. Conv-MPN: Convolutional Message Passing Neural Network for Structured Outdoor Architecture Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 2798–2807.
36. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 483–499.
37. Wu, W.; Qian, C.; Yang, S.; Wang, Q.; Cai, Y.; Zhou, Q. Look at Boundary: A Boundary-Aware Face Alignment Algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018.
38. Acuna, D.; Ling, H.; Kar, A.; Fidler, S. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 859–868.
39. Zhang, Z.; Li, Z.; Bi, N.; Wang, J.; Zhang, S. PPGNet: Learning Point-Pair Graph for Line Segment Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2019; pp. 7105–7114.
40. Hamaguchi, R.; Hikosaka, S. Building Detection from Satellite Imagery Using Ensemble of Size-Specific Detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: New York, NY, USA, 2018; pp. 187–191.
41. Zhang, R.; Zhang, Q.; Zhang, G. SDSC-UNet: Dual Skip Connection ViT-Based U-Shaped Model for Building Extraction. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6005005.
42. Zhou, Y.; Qi, H.; Ma, Y. End-to-End Wireframe Parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 962–971.

Figure 1. Example of extracting building structures from remote sensing images [1]. The right image shows the reconstructed structure with corners and edges highlighted.

Figure 2. Overall architecture of RegionGraph. The method consists of two stages: (1) graph construction, where ConPNet predicts structural primitive heatmaps (region, boundary, corner, and sampled points) from the input satellite image and initializes a region-based topological graph; (2) graph optimization, where node merging is applied to contract and refine the graph structure, producing the final vectorized building reconstruction.

Figure 3. Methods for constructing graph structures. (a) Sampled contour points predicted by ConPNet are first used to generate a triangulated partition of the building footprint. (b) Each triangular unit is treated as a graph node, and adjacency between neighboring triangles is encoded as graph edges. (c) This process converts low-level geometric primitives into a structured topological graph, which serves as the input to the subsequent graph contraction stage.

Figure 4. Overall architecture of the proposed ConPNet. ConPNet contains a structure extraction module composed of three reconstruction branches, the Region Reconstruction Network (RRN), Boundary Reconstruction Network (BRN), and Point Reconstruction Network (PRN), followed by a Fusion Network (FN) for sampled-point heatmap synthesis. The predicted region (R), boundary (B), and corner (CP) heatmaps provide structural cues that are integrated to produce the sampled-point heatmap (WP), which serves as the basis for graph initialization.

Figure 5. Detailed architecture of the proposed ConPNet.

Figure 6. Squeeze–Excitation module based on channel attention.

Figure 7. Graph structure representation of reconstruction results. (a) Example of a target building reconstruction composed of multiple connected structural components. (b) Corresponding transformation into a region-based graph representation. Each structural region is modeled as a graph node, and shared boundaries between regions are encoded as relational edges. The sets E and V illustrate how edges and nodes are organized in the graph structure, including the background node $v_{back}$. This representation forms the basis for the subsequent graph contraction process.

Figure 8. Graph shrinkage process based on node merging.

Figure 9. Visualization results of different methods on the SpaceNet corpus dataset.

Figure 10. Primitive heatmaps and reconstructed structures produced by RegionGraph on SpaceNet.

Figure 11. Partial 3D reconstruction results based on RegionGraph outputs. They are manually constructed from the 2D vector outputs of RegionGraph to illustrate potential downstream 3D reconstruction applications. The proposed method itself performs 2D structural reconstruction and does not directly generate 3D geometry.

Figure 12. Visualization of reconstruction results under different ablation settings.

Table 1. Parameterization of the base convolution module in ConPNet.

	$ConvLayer (C_{in}, C_{out})$	$DownSample (C_{in}, C_{out})$	$UpSample (C_{in}, C_{out})$	$ConvBlock (C_{in}, C_{out})$
Input	$H \times W \times C_{in}$	$H \times W \times C_{in}$	$H \times W \times C_{in}$	$H \times W \times C_{in}$
Structure	Conv $3 \times 3$, stride = 1	Conv $3 \times 3$, stride = 1	ConvT $3 \times 3$, stride = 2	$ConvLayer (C_{in}, \frac{C_{out}}{2})$
	BatchNorm	BatchNorm	BatchNorm	$ConvLayer (C_{in}, \frac{C_{out}}{4})$
	ReLU	ReLU	ReLU	$ConvLayer (C_{in}, \frac{C_{out}}{4})$
				Concatenate
Output	$H \times W \times C_{out}$	$\frac{H}{2} \times \frac{W}{2} \times C_{out}$	$2 H \times 2 W \times C_{out}$	$H \times W \times C_{out}$

Table 2. Comparison of corner/edge/region precision and recall on SpaceNet. The best results are shown in bold, and the second-best are underlined.

Method	Corner Precision (%)	Corner Recall (%)	Edge Precision (%)	Edge Recall (%)	Region Precision (%)	Region Recall (%)
PolyRNN++ [38]	49.6	43.7	19.5	15.2	39.8	13.7
PPGNet [39]	78.0	69.2	55.1	50.6	32.4	30.8
Hamaguchi [40]	58.3	57.8	25.4	22.3	51.0	36.7
SDSC-UNet [41]	42.5	70.6	25.6	35.6	42.1	42.7
L-CNN [42]	66.7	86.2	51.0	71.2	25.9	41.5
Nauata [1]	91.1	64.6	68.1	48.0	70.9	53.1
ConvMPN [35]	77.9	80.2	56.9	60.7	51.1	57.6
RegionGraph (Ours)	80.3	75.9	61.6	58.3	71.9	65.4

Table 3. Comparison of F1 scores on SpaceNet. The best results are shown in bold, and the second-best are underlined.

Method	F1-Corner (%)	F1-Edge (%)	F1-Region (%)	F1-Average (%)
PolyRNN++ [38]	46.4	17.1	20.4	28.0
PPGNet [39]	73.3	52.8	31.6	52.6
Hamaguchi [40]	58.0	23.8	42.7	41.5
SDSC-UNet [41]	53.1	29.8	42.4	41.8
L-CNN [42]	75.2	59.4	31.9	55.5
Nauata [1]	75.6	56.3	60.8	64.2
ConvMPN [35]	79.0	58.7	54.2	64.0
RegionGraph (Ours)	78.0	59.9	68.5	68.8

Table 4. Ablation settings for RegionGraph (✓ indicates that the component is included; / indicates that it is excluded).

Setting Name	${Point}_{sample}$	${Contract}_{tri}$	${Contract}_{heat}$	R	CA
Setting 1	/	✓	/	/	/
Setting 2	✓	/	/	/	/
Setting 3	✓	✓	/	/	/
Setting 4	✓	/	✓	/	/
Setting 5	✓	✓	✓	/	/
Setting 6	✓	✓	✓	✓	/
Setting 7	✓	✓	✓	✓	✓

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, L.; Fang, C.; Li, W.; Chen, K.; Li, B.; Sun, Q. RegionGraph: Region-Aware Graph-Based Building Reconstruction from Satellite Imagery. J. Imaging 2026, 12, 161. https://doi.org/10.3390/jimaging12040161

