Article

DualRecon: Building 3D Reconstruction from Dual-View Remote Sensing Images

by Ruizhe Shao, Hao Chen, Jun Li, Mengyu Ma * and Chun Du
Department of Cognitive Communication, College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3793; https://doi.org/10.3390/rs17233793
Submission received: 14 October 2025 / Revised: 12 November 2025 / Accepted: 19 November 2025 / Published: 22 November 2025

Highlights

What are the main findings?
  • DualRecon can detect buildings from two remote sensing images and estimate building heights based on disparity, enabling 3D building reconstruction.
  • DualRecon achieves state-of-the-art accuracy in dual-view building reconstruction.
What are the implications of the main findings?
  • It provides a reconstruction method for large-scale, time-sensitive applications in urban areas.
  • It provides a practical design recipe for building reconstruction from dual-view remote sensing images.

Abstract

Large-scale and rapid 3D reconstruction of urban areas holds significant practical value. Recently, methods that reconstruct buildings from off-nadir imagery have gained attention for their potential to meet the demand for large-scale, time-sensitive reconstruction applications. These methods typically estimate the building height and footprint position by extracting the building roof and the roof-to-footprint offset within a single off-nadir image. However, the reconstruction accuracy of these methods is primarily constrained by two issues: first, errors in single-view building detection, and second, inaccurate extraction of offsets, which is often a consequence of these detection errors as well as interference from shadow occlusion. To address these challenges, we propose DualRecon, a method for 3D building reconstruction from heterogeneous dual-view remote sensing imagery. In contrast to single-image detection methods, DualRecon achieves more accurate 3D information extraction for reconstruction by fusing and correlating building information across different views. This success can be attributed to three key advantages of DualRecon. First, DualRecon fuses the two input views and extracts building objects based on the fused image features, thereby improving the accuracy of building detection and localization. Second, compared to the roof-to-footprint offset, the disparity offset of the same rooftop between different views is less affected by interference from shadows and occlusions. Our method leverages this disparity offset to determine building height, which enhances the accuracy of height estimation. Third, we designed DualRecon with a three-branch architecture that is optimally tailored to the dual-view 3D information extraction task. Moreover, this paper introduces BuildingDual, the first large-scale dual-view 3D building reconstruction dataset. It comprises 3789 image pairs containing 288,787 building instances, where each instance is annotated with its respective roofs in both views, roof-to-footprint offset, footprint, and the disparity offset of the roof. Experiments on this dataset demonstrate that DualRecon achieves more accurate reconstruction results than existing methods when performing 3D building reconstruction from dual-view remote sensing imagery. Our data and code will be made publicly available.

1. Introduction

Three-dimensional (3D) data of urban buildings can provide an intuitive and comprehensive representation of urban morphology, while also serving as a fundamental data source for a wide range of critical applications such as urban planning [1,2,3], disaster relief [4], and skyline analysis [5]. Consequently, the construction of 3D models for urban buildings has long attracted extensive scholarly attention. Various approaches to 3D reconstruction have been proposed from different perspectives, including LiDAR-based methods [6,7,8,9,10], unmanned aerial vehicle (UAV) imagery-based methods [11,12], and satellite imagery-based methods [13]. Among these, satellite imagery–based 3D reconstruction methods exhibit distinct advantages in efficiency and cost-effectiveness due to the wide spatial coverage and high acquisition efficiency of satellite imagery. These merits render them particularly valuable in time-sensitive applications or scenarios with constrained data acquisition conditions.
Through an analysis of urban satellite imagery, researchers have developed a new paradigm of instance-aware 3D building reconstruction methods. These methods leverage the unique 3D shape priors of buildings, employing deep learning to reconstruct 3D models of urban areas by extracting buildings from satellite images and estimating their footprints and heights. By harnessing the image semantic analysis and understanding capabilities of deep neural networks, these approaches can perform reconstruction from just a single satellite image. This significantly reduces the required data acquisition costs compared to traditional multi-view-based satellite reconstruction methods. Consequently, they help bridge the gap left by traditional reconstruction methods in large-scale, data-scarce, or time-sensitive application scenarios.
Among these methods based on building instance analysis, a typical approach is 3D building reconstruction from a single off-nadir satellite image. Off-nadir remote sensing imagery refers to images captured by sensors at an oblique viewing angle, rather than directly overhead at the nadir position. Methods utilizing this type of imagery typically employ neural networks to detect building roofs and roof-to-footprint offsets from a single image. Subsequently, these outputs are used to derive building footprint masks, and building heights are estimated based on the measured offsets, thereby enabling the reconstruction of the building’s 3D model. This category of methods has attracted considerable attention [14,15,16], primarily because they capitalize on two inherent characteristics of off-nadir remote sensing imagery. On the one hand, a large proportion of remote sensing imagery is, by nature, off-nadir. On the other hand, the parallax effect present in off-nadir imagery introduces a measurable displacement between the rooftops and the bases of buildings, which serves as a valuable cue for estimating building heights. The introduction of these methods has addressed the challenges of high data acquisition difficulty and long processing times associated with traditional satellite imagery–based 3D reconstruction. However, this reliance on a single image for reconstruction also introduces its own set of limitations. First, the extraction of buildings from a single remote sensing image using neural networks may lead to false positives (misdetections) and false negatives (missed detections) [15]. Second, these methods depend heavily on the accurate estimation of roof-to-footprint offsets to determine both building height and footprint location. In practice, however, shadows and occlusions are prevalent in remote sensing imagery, and the interference they cause can lead to inaccuracies in both height and footprint estimations [15]. Consequently, in such scenarios, methods that rely on roof-to-footprint offsets can be less reliable than traditional, disparity-based approaches for height estimation.
To enhance the accuracy of 3D building reconstruction, this paper proposes a novel two-view off-nadir–based reconstruction method, DualRecon, which is shown in Figure 1. We posit that the problems inherent in existing single off-nadir image reconstruction methods stem from the fact that a single image provides insufficient information to support reconstruction, particularly in scenes with partial occlusion or shadows. Therefore, inspired by the principle of Multi-View Stereo (MVS), which performs 3D reconstruction from cross-view matching and disparity, we have designed DualRecon, a method that analyzes dual-view information for 3D building reconstruction. DualRecon leverages the semantic understanding capability of deep neural networks to extract and match building instances across two views, thereby establishing cross-view correspondences without relying on multiple highly homologous satellite images for traditional feature-point matching. As a result, it retains the advantage of low data acquisition requirements. Within DualRecon, we propose three strategies to leverage the dual-view input, addressing the key problems inherent in single-view off-nadir methods: errors in building object detection and the susceptibility of roof-to-footprint offset estimation to interference from shadows and occlusions. First, after receiving input images from different angles, DualRecon fuses this richer information and detects building objects based on the fused image features, thereby improving the accuracy of building detection and localization. Second, our method introduces the disparity of the same roof between different views as a basis for determining building height. This is because, compared to the roof-to-footprint offset, disparity is less affected by interference from shadows and occlusions, thus enhancing the accuracy of height estimation. Third, we further synthesize information from both views to extract more accurate building footprints by translating and fusing the network’s predicted roof masks.
In dual-view building reconstruction, the correlation between the dual-view input and the output 3D information is complex. To effectively handle this, a network architecture that aligns with the input-to-output information flow is required. Therefore, we design a pair of Siamese branches for single-view 3D information extraction, along with a fusion branch for cross-view information integration and association. The Siamese branches utilize an identical network structure and share parameters (weight-sharing). This design is highly beneficial: beyond making the model more parameter-efficient, it allows the branch to be trained concurrently on images from both views. This approach effectively enhances the utilization of supervisory signals, reduces the risk of negative transfer, and ultimately enhances both data efficiency during training and the robustness of the model.
The contributions of this paper can be summarized as follows:
1. We propose DualRecon, which, to the best of our knowledge, is the first method specifically designed for building 3D reconstruction from dual-view remote sensing imagery. DualRecon detects building instances based on two remote sensing images from different view angles, and then utilizes the roof disparity offset to estimate building height, thereby addressing the inaccurate building reconstruction in existing off-nadir-based methods caused by occlusion.
2. We construct a dual-view remote sensing dataset for building 3D reconstruction, named BuildingDual. In BuildingDual, building instances in both views are annotated, including their distinct roofs in the two views, roof-to-footprint offsets, footprints, and roof disparity offsets.
3. We conduct extensive experiments to demonstrate that, in the task of 3D building reconstruction from dual-view remote sensing imagery, DualRecon outperforms existing methods in reconstruction accuracy.

2. Related Work

2.1. Traditional Satellite Imagery–Based 3D Reconstruction

Traditional satellite imagery–based 3D reconstruction adopts well-established vision-based 3D reconstruction pipelines. Compared to methods based on other data sources, such as LiDAR [6,7,8], vision-based approaches benefit from more accessible input data and obviate the need for complex data registration across heterogeneous modalities [17]. For these reasons, this general pipeline is not only central to satellite image reconstruction but is also widely applied in other domains, such as 3D reconstruction from UAVs [11,12] and Simultaneous Localization and Mapping (SLAM) [18]. This reconstruction pipeline typically involves four main stages: (1) Structure-from-Motion (SfM) [19], which includes extracting and matching image features followed by triangulation to recover camera parameters and generate a sparse point cloud; (2) Multi-View Stereo (MVS) [20,21], used to generate depth maps or dense point clouds; (3) 3D mesh generation [22,23]; and (4) texture fusion [24,25], which integrates texture information into the mesh to produce a photorealistic 3D model.
However, there are two key distinctions between satellite imagery and general 3D reconstruction tasks. First, satellite images are typically supplied with Rational Polynomial Coefficients (RPCs), which provide information about the acquisition geometry [13,26]. These coefficients can serve as a strong initial guess for the camera parameters required by the SfM stage. Second, the camera model for satellite imagery may differ from those commonly assumed in standard 3D reconstruction, making it necessary to apply skew correction to optimize the camera model [13,26]. While a few studies have explored 3D reconstruction using generalized stereo pairs (that is, images from heterogeneous satellite sources) [27], this approach still faces significant challenges. These methods remain constrained by certain conditions, such as viewing angles [28], and continue to confront practical difficulties [13].

2.2. Deep Learning–Based 3D Reconstruction from Remote Sensing Imagery

With the advancement of deep learning techniques [29], the pattern recognition and semantic understanding capabilities [30] of deep neural networks have been leveraged to relax the stringent input data requirements traditionally needed for 3D reconstruction [31,32]. This has greatly enhanced the applicability and potential of satellite imagery–based 3D reconstruction. Some researchers, following a depth estimation paradigm, estimate building heights directly from a single image input, thereby generating 3D building information in the form of a Digital Surface Model (DSM). These methods are relatively simple and are primarily applicable to near-nadir satellite imagery. An example of such an approach is GeoPose [33]. Other methods, such as PSMNet [34] and HSM-Net [35], employ neural networks to perform a task analogous to traditional MVS. They estimate a disparity map from a pair of stereo images, which is then used to reconstruct a DSM. However, these approaches still impose strict requirements on the homogeneity of input imagery. Even on benchmark datasets like DFC2019 [36] and the 2021 aerial stereo benchmark [37], which were used in their original papers, these methods are known to have failure cases [38]. More recently, the rise of large-scale models has spurred significant investment of data and computational resources into the fields of depth estimation and deep learning-based 3D reconstruction. This has led to the emergence of new methods for depth estimation and sparse-view 3D reconstruction based on large models, such as Depth Anything [39] and VGGT [40]. These advancements are opening up new frontiers for 3D reconstruction powered by neural network-based depth estimation. However, for the specific task of 3D building reconstruction from remote sensing imagery, which is the focus of this paper, these existing depth estimation methods are still constrained by limitations in image resolution [41] and the achievable level of reconstruction detail [42].
Buildings are a class of geographic features with distinct 3D shapes [43]. Capitalizing on this, a significant body of research has explored how to improve reconstruction accuracy by incorporating inductive priors of building shapes [44,45]. For example, after reconstructing the building DSM using a traditional pipeline, RESDEPTH [44] refines the results based on the structural characteristics of buildings, thereby making the reconstructed buildings more accurate. Another major research direction involves 3D reconstruction through a detect-then-estimate paradigm. In this approach, buildings are first detected, and their heights are subsequently estimated. Two common cues are leveraged for height estimation: building shadows and the displacement of buildings in off-nadir imagery [14,46,47,48]. While some studies have focused on shadow-based height estimation [49,50], methods based on off-nadir displacement have a broader scope of application. This is because off-nadir images constitute a vast majority of remote sensing data and, crucially, do not depend on specific solar illumination angles. Consequently, 3D building reconstruction from off-nadir imagery has garnered significant attention from researchers in recent years, leading to the development of methods such as LoFT [14], MLS-BRN [15], and SingleRecon [16].

3. Methodology

The proposed dual-view building 3D reconstruction task takes as input two remote sensing images of the same area captured from different viewing angles: $I_A, I_B \in \mathbb{R}^{H \times W \times 3}$. The input images are required to be orthorectified to generate pseudo-orthophotos, in which the building footprints are spatially aligned. Such data are widely available in online map services and are generated via well-established production workflows. The goal of our method is to reconstruct 3D buildings by estimating their footprints and heights. To accomplish this, we design the DualRecon network, which takes the two images as input and outputs three key predictions for the buildings in the imagery: roof-to-footprint offset, disparity offset, and roof mask. The building footprint is obtained by shifting the roof mask according to the predicted roof-to-footprint offset, while the building height is calculated from the predicted roof disparity, leveraging the satellite’s viewing angle information.

3.1. DualRecon Network

This dual-view 3D information extraction task involves two distinct sub-tasks: single-view information extraction from each image, and cross-view information association and fusion. To address these requirements, we design a three-branch network, as illustrated in Figure 2.
The two single-view branches share the same architecture and are responsible for extracting building 3D information from each individual view. Within each single-view branch, the workflow is as follows: First, a backbone network extracts a feature map from the single-view image.
$$F_i = f_{A/B}(I_i), \quad \text{for } i \in \{A, B\}$$
where $f_{A/B}$ denotes the shared-weight backbone, for which we use ResNet-50 in this work. $F_i = \{f_1^i, f_2^i, f_3^i, f_4^i, f_5^i\}$, with $f_l^i \in \mathbb{R}^{256 \times H_l \times W_l}$ and $i \in \{A, B\}$, represents the multi-scale image features from the two views.
Then, guided by the positions of dual-view bounding boxes (which are obtained from the fusion branch), RoIAlign is used to extract region-specific features for each building instance from this feature map.
$$F_i^k = \mathrm{RoIAlign}(F_i, B^k), \quad \text{for } k \in \{1, 2, \ldots, N\},\ i \in \{A, B\}$$
where $N$ is the number of building objects in the image, $B^k = (x^k, y^k, w^k, h^k) \in \mathbb{R}^4$ is the bounding box of the $k$-th dual-view building, and $F_i^k$ is the image feature of the $k$-th building object from view $i$.
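To make this step concrete, the following PyTorch-style sketch shows how per-instance features can be cropped from a shared-weight backbone with RoIAlign. It is a simplified illustration under our own assumptions (a single final-stage feature map instead of the 256-channel multi-scale features used in DualRecon, and illustrative box coordinates); it is not the released implementation.

```python
# Minimal sketch of the Siamese single-view feature extraction (our illustration).
# A ResNet-50 backbone with shared weights is applied to both views, and RoIAlign
# crops per-instance features using the dual-view bounding boxes from the fusion branch.
import torch
import torchvision
from torchvision.ops import roi_align

backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # final feature map

def extract_instance_features(image_a, image_b, boxes):
    """image_a, image_b: (1, 3, H, W); boxes: (N, 4) dual-view boxes in (x1, y1, x2, y2)."""
    feats = {}
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index
    for name, img in (("A", image_a), ("B", image_b)):
        fmap = feature_extractor(img)                   # shared weights for both views
        scale = fmap.shape[-1] / img.shape[-1]          # image coords -> feature coords
        feats[name] = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=scale)
    return feats  # feats["A"][k], feats["B"][k]: features of the k-th building in each view

instance_feats = extract_instance_features(torch.rand(1, 3, 1024, 1024),
                                            torch.rand(1, 3, 1024, 1024),
                                            torch.tensor([[100., 120., 260., 300.]]))
```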
Existing building reconstruction methods based on single off-nadir images often suffer from inaccurate footprint estimations. This is because the roof-to-footprint offset may be corrupted by shadows and occlusions. To address this challenge, we propose a fusion strategy that leverages dual-view information. Specifically, we feed the instance features from the two views, $F_i^k$, into an offset head and a mask head to predict the roof-to-footprint offset $\delta_i^k \in \mathbb{R}^2$ and the roof activation map $A_i^k \in \mathbb{R}^{h_k \times w_k}$, respectively.
$$A_i^k = \mathrm{MaskHead}_{A/B}(F_i^k), \qquad \delta_i^k = \mathrm{OffsetHead}_{A/B}(F_i^k)$$
where $\mathrm{MaskHead}_{A/B}$ and $\mathrm{OffsetHead}_{A/B}$ denote the shared-weight mask head and offset head in the two single-view branches, respectively. By leveraging information extracted from the two single-view images, we first translate the roof response maps according to the estimated offsets, then fuse them by averaging, and finally apply a thresholding operation to obtain the building footprint $M_{\mathrm{Footprint}}^k \in \{0,1\}^{h_k \times w_k}$.
$$A_{\mathrm{Footprint}}^k = \frac{1}{2} \sum_{i=A,B} T_{\delta_i^k}\!\left(A_i^k\right)$$
$$M_{\mathrm{Footprint}}^k(x, y) = \begin{cases} 1 & \text{if } A_{\mathrm{Footprint}}^k(x, y) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
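The sketch below illustrates this translate–average–threshold fusion. The offset convention (pixel offsets in (dx, dy) order) and the use of scipy for the translation are our own assumptions for illustration.

```python
# Sketch of the dual-view footprint fusion described above: each roof activation
# map is shifted by its predicted roof-to-footprint offset, the two shifted maps
# are averaged, and the result is thresholded at 0.5.
import numpy as np
from scipy.ndimage import shift as nd_shift

def fuse_footprint(roof_act_a, roof_act_b, offset_a, offset_b, thr=0.5):
    """roof_act_*: (h, w) roof activation in [0, 1]; offset_*: (dx, dy) roof-to-footprint offset in pixels."""
    shifted = []
    for act, (dx, dy) in ((roof_act_a, offset_a), (roof_act_b, offset_b)):
        # scipy's shift expects (row, col) order, i.e. (dy, dx)
        shifted.append(nd_shift(act, (dy, dx), order=1, mode="constant", cval=0.0))
    fused = 0.5 * (shifted[0] + shifted[1])   # average the translated roof maps
    return (fused > thr).astype(np.uint8)     # binary footprint mask

footprint = fuse_footprint(np.random.rand(64, 64), np.random.rand(64, 64),
                           offset_a=(3.5, -8.0), offset_b=(-2.0, -6.5))
```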
As previously mentioned, the two single-view branches are designed as a pair of Siamese branches, utilizing an identical architecture with shared parameters (weight-sharing). This design offers two key advantages. First, it makes the model more parameter-efficient. Second, it allows the single shared-weight branch to be trained on both views simultaneously within each training step. This effectively enhances the utilization of supervisory signals for the feature extractor, leading to enhanced data efficiency and model robustness.
The Fusion Branch operates on the features from the two single-view branches and is responsible for extracting both fused and associative inter-view information. The process unfolds in several key steps: First, the multi-scale features from the two single-view branches are fed into a feature fusion neck module $\mathrm{FuseNeck}$. The core operation of $\mathrm{FuseNeck}$ involves a scale-wise fusion strategy: For each scale, the $\mathrm{FuseNeck}$ first concatenates the corresponding features from both views along the channel dimension. These concatenated features are then processed through a series of multi-scale feature fusion convolutional layers. Finally, the module outputs the fused multi-scale feature map $F_{A\&B} = [f_1^{A\&B}, f_2^{A\&B}, f_3^{A\&B}, f_4^{A\&B}, f_5^{A\&B}]$, where $f_l^{A\&B} \in \mathbb{R}^{256 \times H_l \times W_l}$.
$$F_{A\&B} = \mathrm{FuseNeck}(F_A, F_B) = \left\{ \mathrm{Conv}_l^{512 \to 256}\!\left(\mathrm{cat}\!\left(f_l^A, f_l^B\right)\right),\ l \in \{1, 2, \ldots, 5\} \right\}$$
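A minimal sketch of this scale-wise fusion is given below; the specific layer configuration (a single 3 × 3 convolution with batch normalization per level) is our assumption, since the text only specifies channel-wise concatenation followed by fusion convolutions.

```python
# Sketch of the scale-wise FuseNeck fusion: for each pyramid level, the two
# 256-channel view features are concatenated along the channel axis and reduced
# back to 256 channels by a convolution (layer details are our assumption).
import torch
import torch.nn as nn

class FuseNeck(nn.Module):
    def __init__(self, channels=256, num_levels=5):
        super().__init__()
        self.fuse_convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _ in range(num_levels)
        ])

    def forward(self, feats_a, feats_b):
        # feats_a, feats_b: lists of per-level features, each (B, 256, H_l, W_l)
        return [conv(torch.cat([fa, fb], dim=1))
                for conv, fa, fb in zip(self.fuse_convs, feats_a, feats_b)]

neck = FuseNeck()
fused = neck([torch.rand(1, 256, 2 ** (7 - l), 2 ** (7 - l)) for l in range(5)],
             [torch.rand(1, 256, 2 ** (7 - l), 2 ** (7 - l)) for l in range(5)])
```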
Next, this fused feature map is fed into an RPN head and a bounding box head to detect dual-view building bounding boxes.
$$P = \mathrm{RPNHead}(F_{A\&B}), \qquad \left\{ B^k,\ k \in \{1, 2, \ldots, N\} \right\} = \mathrm{BBoxHead}(P, F_{A\&B})$$
where $P$ denotes the set of proposals extracted from the image, and $B^k = (x^k, y^k, w^k, h^k) \in \mathbb{R}^4$ represents a dual-view bounding box.
As illustrated by the yellow rectangle in the upper part of Figure 2, a dual-view bounding box is designed to bound the same building instance across both views simultaneously. These detected boxes provide crucial positional guidance in each branch. Then, using these bounding boxes, instance-specific fused features for each building are cropped from the feature map. These features are then passed to a disparity offset head to predict $\delta_{\mathrm{dis}}^k \in \mathbb{R}^2$, the disparity offset of the building’s roofs between the two views.
$$F_{A\&B}^k = \mathrm{RoIAlign}(F_{A\&B}, B^k), \quad \text{for } k \in \{1, 2, \ldots, N\}$$
$$\delta_{\mathrm{dis}}^k = \mathrm{DisOffsetHead}(F_{A\&B}^k), \quad \text{for } k \in \{1, 2, \ldots, N\}$$
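As an illustration, a disparity offset head of this kind can be realized as a small regression head over the RoIAligned fused features; the layer sizes below are our assumptions rather than the exact DualRecon configuration.

```python
# Sketch of a disparity offset head (our illustrative design): a small MLP that
# regresses the 2D roof disparity between the two views from the RoIAligned
# fused instance features.
import torch
import torch.nn as nn

class DisOffsetHead(nn.Module):
    def __init__(self, in_channels=256, roi_size=7, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),                  # (dx, dy) disparity of the roof
        )

    def forward(self, fused_instance_feats):       # (N, 256, 7, 7)
        return self.mlp(fused_instance_feats)      # (N, 2)

head = DisOffsetHead()
disparity = head(torch.rand(12, 256, 7, 7))        # one 2D disparity per building
```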
Finally, the building height, $h$, is determined using the disparity offset. Recalling the fundamental equation for depth estimation in Multi-View Stereo (MVS): $Z = \frac{Bf}{d}$.
The height of the building is defined as the difference between the ground depth $Z_{\mathrm{ground}}$ (which is equivalent to the satellite altitude $H$) and the roof depth $Z_{\mathrm{roof}}$, which is formulated as:
$$h = Z_{\mathrm{ground}} - Z_{\mathrm{roof}} = H - \frac{Bf}{d + \delta_{\mathrm{dis}}} = \frac{Hd - Bf + H\delta_{\mathrm{dis}}}{d + \delta_{\mathrm{dis}}}$$
where $d$ is the disparity of the ground point corresponding to the building’s location after projection onto the rectified image plane, $B$ is the baseline length, which in our scenario is the distance between the optical centers of the satellite images, and $f$ is the focal length.
Considering that $d = \frac{Bf}{H}$ and $d \gg \delta_{\mathrm{dis}}$, the height $h$ can be approximated as:
$$h \approx \frac{H \cdot \frac{Bf}{H} - Bf + H\delta_{\mathrm{dis}}}{\frac{Bf}{H}} = \frac{H^2}{Bf}\,\delta_{\mathrm{dis}}$$
Thus, by substituting the satellite imaging parameters, the building height can be calculated from the disparity offset.
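As a worked example, the sketch below evaluates this height–disparity relation with hypothetical imaging parameters (not values from the paper or the BuildingDual dataset) to show how a pixel disparity offset maps to a metric height.

```python
# Worked example of the approximation above with hypothetical parameters:
# h ≈ (H^2 / (B * f)) * disparity, valid when the ground disparity d = Bf/H
# is much larger than the roof disparity offset.
def height_from_disparity(disparity_px, H, B, f):
    """H, B in meters; f in pixels; returns height in meters."""
    return (H ** 2) / (B * f) * disparity_px

# Hypothetical numbers: altitude H = 500 km, baseline B = 50 km, focal length 1e6 px.
h = height_from_disparity(disparity_px=12.0, H=500e3, B=50e3, f=1.0e6)
print(f"estimated building height ≈ {h:.1f} m")   # ≈ 60.0 m with these assumed values
```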

3.2. Loss Functions

During the training phase, DualRecon is optimized using a composite loss function that comprises four terms. Each term is responsible for supervising the learning of a specific prediction: the bounding box, the disparity offset, the two roof-to-footprint offsets, and the two roof masks.
$$L = \lambda_{\mathrm{BBox}} L_{\mathrm{BBox}} + \lambda_{\mathrm{dis}} L_{\mathrm{dis}} + \lambda_{\mathrm{RF}} L_{\mathrm{RF}} + \lambda_{\mathrm{Roof}} L_{\mathrm{Roof}}$$
where $\lambda_{\mathrm{BBox}}$, $\lambda_{\mathrm{RF}}$, $\lambda_{\mathrm{Roof}}$, and $\lambda_{\mathrm{dis}}$ denote the loss weights for the respective components. Following previous studies [14,15], the first three weights are set to $\lambda_{\mathrm{BBox}} = 1$, $\lambda_{\mathrm{RF}} = 16$, $\lambda_{\mathrm{Roof}} = 0.2$, while $\lambda_{\mathrm{dis}}$ is determined through experimental analysis (see Section 4.4.2) and set to 16 in this work.
$$L_{\mathrm{BBox}} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{smooth}_{L_1}\!\left(\hat{B}^k, B^k\right), \qquad L_{\mathrm{dis}} = \frac{1}{N} \sum_{k=1}^{N} \left\lVert \hat{\delta}_{\mathrm{dis}}^k - \delta_{\mathrm{dis}}^k \right\rVert_1,$$
$$L_{\mathrm{RF}} = \frac{1}{N} \sum_{i=A,B} \sum_{k=1}^{N} \left\lVert \hat{\delta}_i^k - \delta_i^k \right\rVert_1, \qquad L_{\mathrm{Roof}} = \frac{1}{N} \sum_{i=A,B} \sum_{k=1}^{N} \mathrm{CrossEntropy}\!\left(\hat{M}_i^k, M_i^k\right)$$
Here, $N$ is the number of positive objects. $\hat{B}^k$ and $B^k$ denote the predicted bounding box and its corresponding ground-truth value, respectively. $\hat{\delta}_{\mathrm{dis}}^k$ and $\delta_{\mathrm{dis}}^k$ denote the predicted disparity offset and its corresponding ground-truth value, respectively. Similarly, $\hat{\delta}_i^k$ and $\delta_i^k$ represent the predicted roof-to-footprint offset and its ground truth for the $i$-th view, while $\hat{M}_i^k$ and $M_i^k$ are the predicted roof mask and its ground truth for the $i$-th view.
As illustrated by the network architecture in Figure 2, the loss terms are applied to different branches. Specifically, $L_{\mathrm{BBox}}$ and $L_{\mathrm{dis}}$ are used to supervise the training of the Fusion Branch, while $L_{\mathrm{RF}}$ and $L_{\mathrm{Roof}}$ are used to constrain the learning of the single-view branches, which encompasses both image feature extraction and single-view information decoding.
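The sketch below shows how this composite loss can be assembled for one image pair, using the reported weights. The per-term reductions and the use of binary cross-entropy with logits for the roof masks are simplifying assumptions on our part.

```python
# Sketch of the composite DualRecon loss (our simplified assembly, not the released code).
import torch
import torch.nn.functional as F

def dualrecon_loss(pred, gt, w_bbox=1.0, w_dis=16.0, w_rf=16.0, w_roof=0.2):
    """pred/gt: dicts with 'bbox' (N,4), 'dis' (N,2), 'rf_A'/'rf_B' (N,2),
    and 'roof_A'/'roof_B' (N,h,w) mask logits / binary targets."""
    l_bbox = F.smooth_l1_loss(pred["bbox"], gt["bbox"])                                # bounding boxes
    l_dis = (pred["dis"] - gt["dis"]).abs().sum(dim=1).mean()                          # disparity offsets
    l_rf = sum((pred[f"rf_{v}"] - gt[f"rf_{v}"]).abs().sum(dim=1).mean() for v in "AB")  # roof-to-footprint offsets
    l_roof = sum(F.binary_cross_entropy_with_logits(pred[f"roof_{v}"], gt[f"roof_{v}"]) for v in "AB")  # roof masks
    return w_bbox * l_bbox + w_dis * l_dis + w_rf * l_rf + w_roof * l_roof

loss = dualrecon_loss(
    pred={"bbox": torch.rand(4, 4), "dis": torch.rand(4, 2),
          "rf_A": torch.rand(4, 2), "rf_B": torch.rand(4, 2),
          "roof_A": torch.randn(4, 28, 28), "roof_B": torch.randn(4, 28, 28)},
    gt={"bbox": torch.rand(4, 4), "dis": torch.rand(4, 2),
        "rf_A": torch.rand(4, 2), "rf_B": torch.rand(4, 2),
        "roof_A": torch.randint(0, 2, (4, 28, 28)).float(),
        "roof_B": torch.randint(0, 2, (4, 28, 28)).float()},
)
```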

4. Experiment

4.1. Dataset

To facilitate the study of dual-view 3D building reconstruction, we constructed a new dataset, named BuildingDual. This dataset was created by collecting data from various sources and covers six major and representative cities in China: Beijing, Shanghai, Xi’an, Chengdu, Harbin, and Jinan. This selection ensures comprehensive coverage of regional architectural characteristics across the country, as shown in Figure 3.
BuildingDual is partitioned into a training set containing 3489 pairs of dual-view remote sensing images, and a test set with 300 image pairs. These sets are annotated with 256,630 and 32,157 building instances, respectively. Each image in the dataset has a resolution of 1024 × 1024 pixels. These images have a spatial resolution of 0.59 m, which corresponds to Level 18 imagery in common web mapping services. The image pairs often exhibit significant variations in imaging conditions between the two views, as illustrated in Figure 4. All images have undergone both georeferencing and orthorectification. For each building instance, we provide a comprehensive set of annotations, including:
  • The dual-view bounding box;
  • The roof mask in each of the two views;
  • The building footprint mask;
  • The roof-to-footprint offset for each view;
  • The disparity offset of the roof between the views;
  • The ground-truth building height.
Based on the building footprint and height, a 3D model of the building can be reconstructed, as shown in Figure 4. This dataset will be made publicly available at https://shaoruizhe.github.io/DualRecon.github.io/ (accessed on 21 October 2025).
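For illustration, a single BuildingDual building instance can be thought of as a record of the following form; the field names and values are our own shorthand and hypothetical numbers, not the dataset's released schema.

```python
# Illustrative structure of one BuildingDual annotation record (hypothetical field
# names and values; consult the released dataset for the actual schema).
building_annotation = {
    "dual_view_bbox": [412.0, 233.0, 96.0, 140.0],          # (x, y, w, h) covering both views
    "roof_mask_A": "polygon or RLE for the roof in view A",
    "roof_mask_B": "polygon or RLE for the roof in view B",
    "footprint_mask": "polygon or RLE of the building footprint",
    "roof_to_footprint_offset_A": [3.5, -11.2],              # pixels, view A
    "roof_to_footprint_offset_B": [-2.1, -8.7],              # pixels, view B
    "roof_disparity_offset": [5.6, -2.5],                    # roof displacement between views
    "height_m": 27.4,                                        # ground-truth building height
}
```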

4.2. Implementation Details

We conducted training on a server equipped with an NVIDIA GeForce RTX 4090 GPU. The batch size was set to 2. We set the maximum number of training epochs to 500 and the initial learning rate to 0.02, with a decay rate of 0.1 applied at the 32nd and 44th epochs. We used SGD as the optimizer, with a weight decay of 0.0001 and momentum of 0.9. During testing in the 3D reconstruction pipeline, we used a confidence threshold of 0.3 for building instances.
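The optimizer and learning-rate schedule described above correspond to the following PyTorch configuration sketch; the model and training loop are placeholders.

```python
# Sketch of the training configuration: SGD (momentum 0.9, weight decay 1e-4),
# initial LR 0.02, decayed by 0.1 at epochs 32 and 44.
import torch
import torch.nn as nn

model = nn.Linear(2, 2)  # placeholder standing in for the DualRecon network
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[32, 44], gamma=0.1)

for epoch in range(500):
    # ... one training epoch over the BuildingDual training split (batch size 2) ...
    optimizer.step()      # stands in for the per-batch updates of a real epoch
    scheduler.step()      # LR drops by 0.1 after the 32nd and 44th epochs
```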

4.3. Comparative Experiments

To evaluate the performance of DualRecon, we selected six state-of-the-art (SOTA) methods as baselines. Among them, four are building-detection-based methods designed for off-nadir imagery:
  • MLS-BRN (CVPR 2024) [15]: A method specifically designed for 3D building reconstruction from off-nadir imagery.
  • LoFT (TPAMI 2022) [14]: A method developed for extracting 3D building information from off-nadir imagery.
  • ViTAE (TGRS 2022) [51]: A foundation model designed for the analysis of remote sensing data. In our experiments, we built upon its framework for building detection and added an offset head for offset estimation.
  • Cascade Mask R-CNN (CVPR 2018) [52]: Originally developed for multi-object detection in images. In our adaptation, an offset head is added for the estimation of roof-to-footprint offsets.
Since these baseline methods are inherently designed for 3D reconstruction from a single off-nadir image, while our work addresses reconstruction from a pair of views, we devised two distinct evaluation protocols to ensure a fair comparison:
  • Single-View Input: In the first protocol, we feed only one of the two images from a pair into the baseline model. This allows the models to operate in their native intended state.
  • IoU-Based Merging: The second protocol utilizes both images. We first apply the baseline model to each image independently to extract two separate sets of 3D building information. Then, we match the buildings from these two sets based on their footprint overlap. A building is considered a valid detection only if the Intersection over Union (IoU) of its footprints from the two views exceeds a predefined threshold of 0.3; otherwise, it is discarded. For the matched buildings, the outputs from both views are fused to produce the final 3D model. We posit that through this operation, the building extraction accuracy of an existing single-view method can be enhanced by fusing the extracted information from two views.
In our results, we refer to these two protocols as “single” and “IoU-merge”, respectively.
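The sketch below illustrates the IoU-merge protocol just described: detections extracted independently from the two views are matched by footprint IoU, kept only when the IoU exceeds 0.3, and fused. The specific fusion rule (intersecting footprints and averaging heights) is our simplifying assumption, not necessarily what the baselines use.

```python
# Minimal sketch of the "IoU-merge" evaluation protocol (our illustration).
import numpy as np

def mask_iou(m1, m2):
    """IoU of two binary footprint masks."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

def iou_merge(dets_a, dets_b, thr=0.3):
    """dets_*: lists of dicts with 'footprint' (binary HxW mask) and 'height' (float)."""
    merged = []
    for da in dets_a:
        if not dets_b:
            break
        best = max(dets_b, key=lambda db: mask_iou(da["footprint"], db["footprint"]))
        if mask_iou(da["footprint"], best["footprint"]) > thr:
            merged.append({
                # fusion rule is our assumption: intersect footprints, average heights
                "footprint": np.logical_and(da["footprint"], best["footprint"]),
                "height": 0.5 * (da["height"] + best["height"]),
            })
    return merged  # detections without a cross-view match are discarded
```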
In addition to the aforementioned methods, we also experimented with a traditional multi-view 3D reconstruction pipeline, SSR (ISPRS, 2021) [13]. However, it was found that this method failed to match a sufficient number of feature point pairs, rendering the reconstruction infeasible. Consequently, its reconstruction results are not reported. Nevertheless, we observe that large-model–based 3D reconstruction approaches, such as VGGT (CVPR 2025) [40], also have the potential to address the dual-view 3D reconstruction task studied in this paper, and thus warranted a comparative study. Notably, since VGGT’s native output is a point cloud, we employ Poisson Surface Reconstruction (PSR) to generate a mesh model. This post-processing step ensures a fair comparison by producing mesh outputs that are consistent with our method and the other baselines.
For the comparison against building detection-based methods, we evaluate the accuracy of footprint extraction using the F1-score. When calculating the F1-score, a prediction is considered a true positive if its Intersection over Union (IoU) with a ground-truth footprint is greater than 0.5. To assess the accuracy of the final 3D reconstruction results, we compute the Mean Absolute Error (MAE) in pixels for the building height. The final experimental results are reported in Table 1.
The experimental results in Table 1 support two key conclusions: First, for the task of building reconstruction in urban scenes from sparse remote sensing imagery, detection-based approaches demonstrate significant advantages. These methods reconstruct 3D buildings by estimating their footprints and heights, effectively leveraging building shape priors as an inductive bias. This strategy reduces the amount of information required for reconstruction and enhances reconstruction performance under conditions where the information provided by sparse imagery is limited. In contrast, large-model-based approaches still fall short in reconstruction accuracy for buildings, although they offer strong generalization capabilities for zero-shot 3D reconstruction across arbitrary scenes. Meanwhile, traditional 3D reconstruction pipelines struggle in this setting due to the low homogeneity of the input remote sensing images and the excessively sparse viewpoints, which make successful reconstruction infeasible. Second, compared with existing reconstruction methods based on a single off-nadir image, the proposed DualRecon demonstrates clear advantages in dual-view reconstruction tasks. Notably, after applying the IoU-merge operation, the accuracy of building footprint extraction improves significantly, particularly in the AP metric. This improvement indicates that the IoU-based dual-view verification effectively removes certain false positives present in single-view detection, thereby enhancing detection performance. DualRecon, in contrast, enhances building detection accuracy by leveraging dual-view feature fusion (as reflected in a higher Footprint F1) and achieves more accurate height estimation via the disparity offset (demonstrated by a lower Height MAE).
To provide a more intuitive presentation of the experimental results, we visualize part of the results in Figure 5. Due to space limitations, for several building-detection-based baselines, we only present the dual-view building information extraction and reconstruction results after the IoU-based merging operation.
As depicted in Figure 5, the building-detection-based methods exhibit a significant advantage in preserving the shape of buildings compared to VGGT. Even under these challenging conditions of poor input image homogeneity, VGGT is still able to perform pixel matching and point cloud reconstruction with relatively high accuracy. This demonstrates that deep learning-based methods, after large-scale training, acquire superior capabilities in image understanding and information extraction that far surpass those of traditional heuristic approaches, particularly in handling non-homologous inputs. VGGT’s zero-shot 3D reconstruction capability is highly impressive. However, the models reconstructed by VGGT still suffer from building shape inaccuracies; they tend to have overly smooth surfaces, lacking the sharp edges and corners characteristic of buildings. Furthermore, in scenarios involving large off-nadir angles and tall buildings, VGGT encounters partial roof pixel matching failures, leading to ghosting artifacts in the reconstructed models (see right panel of Figure 5, line 1).
Furthermore, compared to other building-detection-based methods, including LoFT, MLS-BRN, and ViTAE, the proposed DualRecon demonstrates superior accuracy in the dual-view reconstruction task. This superiority is primarily manifested in two aspects. First, it achieves higher accuracy in building object extraction. Existing methods are typically designed for single-view inputs, a design that is prone to misjudgments during the extraction process and results in False Positive detections. Although our experiments incorporated an IoU merge method to mitigate False Positives through cross-view validation, this approach introduced another issue: missed detections occurred when inaccurate building footprint detection led to insufficient overlap for the same building. In contrast, our method fuses features from both views for end-to-end building object extraction, thereby fully leveraging information from both perspectives to enhance detection accuracy. Second, it achieves higher accuracy in estimating the building offset, which in turn enhances the accuracy of both building height and footprint position estimation. As can be observed in the red dashed boxes in Figure 5 (lines 2–4), it is a common issue for existing single off-nadir image 3D reconstruction methods, including MLS-BRN and LoFT, to suffer from offset estimation errors and footprint position deviations due to shadows and occlusions. This error not only leads to insufficient accuracy in the reconstruction results but can also cause building objects to be erroneously filtered out during the IoU merge process due to inconsistent footprint alignment. In contrast, DualRecon estimates building height by calculating the disparity offset of the rooftop using images from different viewpoints. This approach effectively avoids the reconstruction errors caused by interference from occlusion in the estimation of the roof-to-footprint offset.

4.4. Ablation Study

4.4.1. The Network Architectures to Extract Single-View Information (Roof-to-Footprint Offset and Roof)

The input for our building 3D reconstruction task is a pair of images from different viewpoints, while the output includes building detection results along with footprint and height estimations used for the final reconstruction. This constitutes a relatively novel task formulation in the field of deep learning that has lacked sufficient investigation in prior studies. To demonstrate that the architecture of our designed DualRecon network is optimal, we have conducted an ablation study. In this study, we propose several alternative candidate network architectures and compare their performance against that of DualRecon.
The candidate architectures are illustrated in Figure 6a,b, while the proposed DualRecon architecture is shown in Figure 6c. Specifically, the network architecture in Figure 6a involves feeding the fused features into a dual-view offset head and a dual-view mask head. These dual-view heads are designed to simultaneously output the offset and roof mask for both respective views. The experiment with this candidate architecture is designed to investigate the optimal network structure for single-view information extraction. It involves an experimental comparison between the Siamese single-view branches designed in our DualRecon and an alternative architecture that first fuses the image features and then simultaneously extracts the information for both views from the fused feature. The alternative architecture in Figure 6b, on the other hand, is designed to directly infer the building footprint from the fused features. In contrast, DualRecon’s procedure for extracting the footprint involves first outputting the roof mask and then deriving the footprint by translating this mask according to the estimated offset. By comparing our method with candidate architecture b, we aim to investigate the relative merits of these two distinct approaches for footprint extraction.
Table 2 reports the ablation study results for the different single-view information (roof-to-footprint offset and roof) extraction approaches. The results demonstrate the advantage of our Siamese single-view extraction branches. Although using a fused feature that simultaneously considers both views is theoretically appealing for information fusion, such an architecture treats the single-view feature extraction as two separate tasks handled by different sets of network parameters, and also raises the risk of negative transfer. In contrast, our designed single-view branch establishes a more explicit input-output relationship. Crucially, through parameter sharing, it ensures that the extraction of single-view information from each view is handled by the same set of network parameters. This allows the single-view branch to leverage supervisory signals from both views during training, thereby strengthening its task-specific capability. Consequently, our designed single-view branch architecture achieved superior performance in the experiments.
The experiments on footprint extraction methods in Table 2 reveal an interesting finding. Although architecture (b) achieves a slightly higher F1-score in building footprint detection than our proposed DualRecon, its final reconstruction accuracy (e.g., in terms of Height MAE) is inferior. Our in-depth analysis of the experimental results explains the reason for this phenomenon: the F1-score we used, with an IoU threshold of 0.5, has a limited ability to evaluate the precision of footprint masks that are above this threshold. The fidelity of architecture (b)’s directly extracted footprints is often compromised by occlusions in off-nadir images. Although this effect is not reflected in the F1-score, it becomes evident in the accuracy of the resultant 3D building models. Furthermore, the fidelity of these building shapes has a significant impact on the visual quality of the reconstructed 3D models. Therefore, our footprint extraction strategy is to first extract the roof, then translate it to the footprint via the estimated offset, and finally fuse the two-view footprint hypotheses to produce the final footprint. The experiments confirm that the footprint extraction method designed for DualRecon is more conducive to obtaining accurate footprints.

4.4.2. Weight of the Disparity Offset Loss $\lambda_{\mathrm{dis}}$

As detailed in Section 3.2, our DualRecon network is trained using a composite loss function (Equation (13)). While optimal values for the weights $\lambda_{\mathrm{BBox}}$, $\lambda_{\mathrm{RF}}$, and $\lambda_{\mathrm{Roof}}$ have been studied in prior single-view building reconstruction research [14,15], this experiment specifically investigates the setting of the newly introduced disparity offset loss weight, $\lambda_{\mathrm{dis}}$. The performance of the network, trained with various weights, is presented in Table 3.
As can be seen from Table 3, an excessively large $\lambda_{\mathrm{dis}}$ leads to a decline in the model’s building detection accuracy (footprint F1), which consequently degrades the overall reconstruction accuracy. Conversely, an overly small $\lambda_{\mathrm{dis}}$ results in a drop in reconstruction accuracy, even though the building detection accuracy remains unaffected. Since the model’s disparity offset estimation is directly related to height estimation, we hypothesize that this is due to a decline in the performance of height estimation for the correctly detected buildings. Therefore, based on a comprehensive consideration of these trade-offs, we adopt $\lambda_{\mathrm{dis}} = 16$ in this paper.

4.5. Discussion

The inter-camera angle between the two input images is a key factor in traditional 3D reconstruction. To investigate its effect on DualRecon, we conducted a dedicated study focusing on the inter-camera angle. In this experiment, the test data was divided into three subsets based on the magnitude of the inter-camera angle: samples with an angle of less than 5°, samples with an angle between 5° and 10°, and samples with an angle greater than 10°. We employed both a single-view method and DualRecon to extract 3D building information (due to space constraints, we report only the strongest baseline from our comparative study, MLS-BRN, as the representative single-view approach). The extraction results are presented in Table 4 and illustrated in Figure 7.
From the experimental results reported in Table 4 and in Figure 7, DualRecon demonstrates advantages in both building detection and height estimation across different input angles. For building detection, when the inter-camera angle between the two images is small, DualRecon achieves slightly higher detection accuracy than the single-view method. This advantage is primarily attributed to its capability of fusing dual-view information. A clear example is provided in the leftmost sample of Figure 7, where building structures are partially occluded by vegetation: the single-view method fails to correctly detect the buildings (see red box), whereas DualRecon successfully reconstructs them by integrating information from both perspectives. As the inter-camera angle increases, the performance gap in detection accuracy between DualRecon and the single-view method widens. This is because larger angles introduce challenges for the single-view method, such as difficulties in cross-view building matching and interference from adjacent structures (see the red box in the rightmost sample of Figure 7), leading to a decline in its recall. In contrast, DualRecon, by utilizing fused features for building extraction, is more effective at leveraging cross-view information to identify buildings.
For building height estimation, we find that as the angle increases, DualRecon’s height estimates become progressively more accurate, highlighting the effectiveness of the disparity-offset-based height estimation mechanism. We also observe that the single-view method’s height estimation improves with larger angles. This phenomenon is attributed to the fact that image pairs with a larger inter-camera angle often have a larger off-nadir angle, which facilitates height inference from the roof–footprint offset in single-view methods. Nevertheless, even under these conditions, DualRecon, which relies on disparity offsets, maintains superior accuracy in height estimation.

5. Conclusions

In this paper, we have proposed a method for 3D building reconstruction based on dual-view remote sensing imagery. In particular, we designed the DualRecon network to extract various types of 3D building information from dual-view remote sensing imagery. We engineered DualRecon with a three-branch architecture to be optimally tailored for this dual-view 3D information extraction task. Centering on this task, we conducted experimental comparisons of our method against traditional satellite image reconstruction techniques as well as multiple deep learning-based approaches. The results demonstrate that DualRecon can comprehensively analyze images from both views, achieving more accurate 3D building reconstruction results than existing methods. The ablation study validates that our designed three-branch network architecture for DualRecon achieves optimal performance in the reconstruction process. A further contribution of this work is a novel dataset, BuildingDual, curated to enable robust validation of dual-view 3D building reconstruction methods. Given the substantial demand for rapid, large-scale reconstruction of urban buildings, we believe that the proposed reconstruction framework, which leverages heterogeneous multi-view remote sensing imagery, holds significant potential in practical applications.

Author Contributions

Conceptualization, R.S. and J.L.; methodology, R.S. and M.M.; software, R.S.; validation, M.M., C.D. and H.C.; data curation, H.C.; writing—original draft preparation, R.S.; writing—review and editing, M.M. and J.L.; supervision, H.C.; project administration, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 42471403, Grant No. 42101432, Grant No. 42101435, and Grant No. 62106276.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://shaoruizhe.github.io/DualRecon.github.io/ (accessed on 21 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Domingo, D.; van Vliet, J.; Hersperger, A.M. Long-Term Changes in 3D Urban form in Four Spanish Cities. Landsc. Urban Plan. 2023, 230, 104624. [Google Scholar] [CrossRef]
  2. Zi, W.; Li, J.; Chen, H.; Chen, L.; Du, C. Urbansegnet: An Urban Meshes Semantic Segmentation Network Using Diffusion Perceptron And Vertex Spatial attention. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103841. [Google Scholar] [CrossRef]
  3. Zi, W.; Chen, H.; Li, J.; Wu, J. Mambameshseg-Net: A Large-Scale Urban Mesh Semantic Segmentation Method Using A State Space Model with A Hybrid Scanning Strategy. Remote Sens. 2025, 17, 1653. [Google Scholar] [CrossRef]
  4. Cheng, M.-L.; Matsuoka, M.; Liu, W.; Yamazaki, F. Near-Real-Time Gradually Expanding 3D Land Surface Reconstruction in Disaster Areas By Sequential Drone Imagery. Autom. Constr. 2022, 135, 104105. [Google Scholar] [CrossRef]
  5. Liang, F.; Yang, B.; Dong, Z.; Huang, R.; Zang, Y.; Pan, Y. A Novel Skyline Context Descriptor for Rapid Localization of Terrestrial Laser Scans To Airborne Laser Scanning Point Clouds. ISPRS J. Photogramm. Remote Sens. 2020, 165, 120–132. [Google Scholar] [CrossRef]
  6. Bizjak, M.; Mongus, D.; Žalik, B.; Lukač, N. Novel Half-Spaces Based 3D Building Reconstruction Using Airborne LiDAR Data. Remote Sens. 2023, 15, 1269. [Google Scholar] [CrossRef]
  7. Chen, X.; Song, Z.; Zhou, J.; Xie, D.; Lu, J. Camera and LiDAR Fusion for Urban Scene Reconstruction and Novel View Synthesis Via Voxel-Based Neural Radiance Fields. Remote Sens. 2023, 15, 4628. [Google Scholar] [CrossRef]
  8. Yan, Y.; Wang, Z.; Xu, C.; Su, N. Geop-Net: Shape Reconstruction of Buildings from LiDAR Point Clouds. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6502005. [Google Scholar] [CrossRef]
  9. Fan, L.; Yang, Q.; Wang, H.; Deng, B. Robust Ground Moving Target Imaging Using Defocused Roi Data and Sparsity-Based Admm Autofocus Under Terahertz Video Sar. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16. [Google Scholar] [CrossRef]
  10. Fan, L.; Yang, Q.; Wang, H.; Qin, Y.; Deng, B. Sequential Ground Moving Target Imaging Based on Hybrid Visar-Isar Image formation in Terahertz Band. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 8738–8753. [Google Scholar] [CrossRef]
  11. Yu, D.; Wei, S.; Liu, J.; Ji, S. Advanced Approach for Automatic Reconstruction of 3D Buildings from Aerial Images. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 541–546. [Google Scholar] [CrossRef]
  12. Yang, B.; Ali, F.; Zhou, B.; Li, S.; Yu, Y.; Yang, T.; Liu, X.; Liang, Z.; Zhang, K. A Novel Approach of Efficient 3D Reconstruction for Real Scene Using Unmanned Aerial Vehicle Oblique Photogrammetry with Five Cameras. Comput. Electr. Eng. 2022, 99, 107804. [Google Scholar] [CrossRef]
  13. Bullinger, S.; Bodensteiner, C.; Arens, M. 3D Surface Reconstruction from Multi-Date Satellite Images. In Proceedings of the International Society for Photogrammetry and Remote Sensing (ISPRS Congress) 2021, Nice, France, 5–9 July 2021. [Google Scholar]
  14. Wang, J.; Meng, L.; Li, W.; Yang, W.; Yu, L.; Xia, G.-S. Learning To Extract Building Footprints from off-Nadir Aerial Images. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1294–1301. [Google Scholar] [CrossRef] [PubMed]
  15. Li, W.; Yang, H.; Hu, Z.; Zheng, J.; Xia, G.-S.; He, C. 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-Level Supervisions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27728–27737. [Google Scholar]
  16. Shao, R.; Wu, J.; Li, J.; Peng, S.; Chen, H.; Du, C. Singlerecon: Reconstructing Building 3D Models of Lod1 from A Single off-Nadir Remote Sensing Image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19588–19600. [Google Scholar] [CrossRef]
  17. Niu, Y.; Chen, H.; Li, J.; Du, C.; Wu, J.; Zhang, Y. Lgfaware-Meshing: Online Mesh Reconstruction from LiDAR Point Cloud with Awareness of Local Geometric Features. Geo-Spat. Inf. Sci. 2025, 1–19. [Google Scholar] [CrossRef]
  18. Elhashash, M.; Qin, R. Cross-View Slam Solver: Global Pose Estimation of Monocular Ground-Level Video Frames for 3D Reconstruction Using A Reference 3D Model from Satellite Images. ISPRS J. Photogramm. Remote Sens. 2022, 188, 62–74. [Google Scholar] [CrossRef]
  19. Cui, H.; Shen, S.; Gao, W.; Hu, Z. Efficient Large-Scale Structure from Motion By Fusing Auxiliary Imaging Information. IEEE Trans. Image Process. 2015, 24, 3561–3573. [Google Scholar] [CrossRef]
  20. Kuhn, A.; Hirschmüller, H.; Scharstein, D.; Mayer, H. A Tv Prior for High-Quality Scalable Multi-View Stereo Reconstruction. Int. J. Comput. Vis. 2016, 124, 2–17. [Google Scholar] [CrossRef]
  21. Xu, Q.; Kong, W.; Tao, W.; Pollefeys, M. Multi-Scale Geometric Consistency Guided and Planar Prior Assisted Multi-View Stereo. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4945–4963. [Google Scholar] [CrossRef]
  22. Kazhdan, M.; Bolitho, M.; Hoppe, H. Poisson Surface Reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, Sardinia, Italy, 26–28 June 2006. [Google Scholar]
  23. Cazals, F.; Giesen, J. Delaunay Triangulation Based Surface Reconstruction. In Effective Computational Geometry for Curves and Surfaces; Springer: Berlin/Heidelberg, Germany, 2006; pp. 231–276. [Google Scholar]
  24. Waechter, M.; Moehrle, N.; Goesele, M. Let There Be Color! Large-Scale Texturing of 3D Reconstructions. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 836–850. [Google Scholar]
  25. Yang, B.; Bao, C.; Zeng, J.; Bao, H.; Zhang, Y.; Cui, Z.; Zhang, G. Neumesh: Learning Disentangled Neural Mesh-Based Implicit Field for Geometry and Texture Editing. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 597–614. [Google Scholar]
  26. Zhao, L.; Wang, H.; Zhu, Y.; Song, M. A Review of 3D Reconstruction from High-Resolution Urban Satellite Images. Int. J. Remote Sens. 2023, 44, 713–748. [Google Scholar] [CrossRef]
  27. Jeong, J. Imaging Geometry and Positioning Accuracy of Dual Satellite Stereo Images: A Review. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2017, 4, 235–242. [Google Scholar] [CrossRef]
  28. Facciolo, G.; De Franchis, C.; Meinhardt-Llopis, E. Automatic 3D Reconstruction from Multi-Date Satellite Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 57–66. [Google Scholar]
  29. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  30. Gao, Z.; Sun, W.; Lu, Y.; Zhang, Y.; Song, W.; Zhang, Y.; Zhai, R. Joint Learning of Semantic Segmentation and Height Estimation for Remote Sensing Image Leveraging Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614015. [Google Scholar] [CrossRef]
  31. Buyukdemircioglu, M.; Kocaman, S.; Kada, M. Deep Learning for 3D Building Reconstruction: A Review. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, XLIII-B2-2022, 359–366. [Google Scholar] [CrossRef]
  32. Costa, C.J.; Tiwari, S.; Bhagat, K.; Verlekar, A.; Kumar, K.M.C.; Aswale, S. Three-Dimensional Reconstruction of Satellite Images Using Generative Adversarial Networks. In Proceedings of the 2021 International Conference on Technological Advancements and Innovations (ICTAI), Tashkent, Uzbekistan, 10–12 November 2021; pp. 121–126. [Google Scholar]
  33. Christie, G.; Abujder, R.R.R.M.; Foster, K.; Hagstrom, S.; Hager, G.D.; Brown, M.Z. Learning Geocentric Object Pose in Oblique Monocular Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 14512–14520. [Google Scholar]
  34. Chang, J.-R.; Chen, Y.-S. Pyramid Stereo Matching Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar]
  35. Yang, G.; Manela, J.; Happold, M.; Ramanan, D. Hierarchical Deep Stereo Matching on High-Resolution Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5515–5524. [Google Scholar]
  36. Le Saux, B.; Yokoya, N.; Hansch, R.; Brown, M.; Hager, G. 2019 Data Fusion Contest [Technical Committees]. IEEE Geosci. Remote Sens. Mag. 2019, 7, 103–105. [Google Scholar] [CrossRef]
  37. Wu, T.; Vallet, B.; Pierrot-Deseilligny, M.; Rupnik, E. A New Stereo Dense Matching Benchmark Dataset for Deep Learning. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2021, 43, 405–412. [Google Scholar] [CrossRef]
  38. Marí, R.; Ehret, T.; Facciolo, G. Disparity Estimation Networks for Aerial And High-Resolution Satellite Images: A Review. Image Process. Online 2022, 12, 501–526. [Google Scholar] [CrossRef]
  39. Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything: Unleashing The Power of Large-Scale Unlabeled Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle WA, USA, 17–21 June 2024; pp. 10371–10381. [Google Scholar]
  40. Wang, J.; Chen, M.; Karaev, N.; Vedaldi, A.; Rupprecht, C.; Novotny, D. Vggt: Visual Geometry Grounded Transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5294–5306. [Google Scholar]
  41. Günaydın, E.; Yakar, İ.; Bakırman, T.; Selbesoğlu, M.O. Evaluation of Depth Anything Models for Satellite-Derived Bathymetry. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2025, 48, 101–106. [Google Scholar] [CrossRef]
  42. Wu, X.; Landgraf, S.; Ulrich, M.; Qin, R. An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks. arXiv 2025, arXiv:2507.14798. [Google Scholar]
  43. Biljecki, F.; Ledoux, H.; Stoter, J. An Improved LOD Specification for 3D Building Models. Comput. Environ. Urban Syst. 2016, 59, 25–37. [Google Scholar] [CrossRef]
  44. Stucker, C.; Schindler, K. ResDepth: A Deep Residual Prior for 3D Reconstruction from High-Resolution Satellite Images. ISPRS J. Photogramm. Remote Sens. 2022, 183, 560–580. [Google Scholar] [CrossRef]
  45. Yu, D.; Ji, S.; Liu, J.; Wei, S. Automatic 3D Building Reconstruction from Multi-View Aerial Images with Deep Learning. ISPRS J. Photogramm. Remote Sens. 2021, 171, 155–170. [Google Scholar] [CrossRef]
  46. Li, W.; Meng, L.; Wang, J.; He, C.; Xia, G.-S.; Lin, D. 3D Building Reconstruction from Monocular Remote Sensing Images. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  47. Li, W.; Zhao, W.; Yu, J.; Zheng, J.; He, C.; Fu, H.; Lin, D. Joint Semantic–Geometric Learning for Polygonal Building Segmentation from High-Resolution Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2023, 201, 26–37. [Google Scholar] [CrossRef]
  48. Li, W.; Hu, Z.; Meng, L.; Wang, J.; Zheng, J.; Dong, R.; He, C.; Xia, G.-S.; Fu, H.; Lin, D. Weakly-Supervised 3D Building Reconstruction from Monocular Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5615315. [Google Scholar]
  49. Li, Z.; Ji, S.; Fan, D.; Yan, Z.; Wang, F.; Wang, R. Reconstruction of 3D Information of Buildings from Single-View Images Based on Shadow Information. ISPRS Int. J. Geo-Inf. 2024, 13, 62. [Google Scholar] [CrossRef]
  50. Ling, J.; Wang, Z.; Xu, F. ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 175–185. [Google Scholar]
  51. Wang, D.; Zhang, J.; Du, B.; Xia, G.-S.; Tao, D. An Empirical Study of Remote Sensing Pretraining. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5608020. [Google Scholar] [CrossRef]
  52. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
Figure 1. The proposed workflow for dual-view 3D building reconstruction.
Figure 2. Architecture of the proposed DualRecon network. The red and blue branches are two Siamese single-view extraction branches, one dedicated to each view, while the black branch is the cross-view association and fusion branch, responsible for integrating information from the two single-view branches. The figure visualizes the model’s output as a 3D building representation. The colored elements denote specific components: the red and blue arrows indicate the roof-to-footprint offsets in the two views, respectively; the yellow arrow represents the disparity offset of the roofs; the red and blue striped boxes are the roof masks for the respective views; and the green striped box is the building footprint mask.
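For readers who prefer code, the three-branch layout described in the Figure 2 caption can be sketched as follows. This is a minimal, schematic skeleton: the placeholder layers, feature sizes, and module names are assumptions for illustration and do not reproduce DualRecon’s actual implementation.

```python
# Schematic sketch of the three-branch layout in Figure 2: one weight-shared
# (Siamese) single-view branch applied to each view, plus a cross-view fusion
# branch. Layers and dimensions are placeholders, not DualRecon's actual design.
import torch
import torch.nn as nn

class ThreeBranchSketch(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # A single encoder instance reused for both views => shared (Siamese) weights.
        self.single_view_branch = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # The fusion branch consumes the concatenated per-view features.
        self.fusion_branch = nn.Sequential(
            nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, view1: torch.Tensor, view2: torch.Tensor):
        f1 = self.single_view_branch(view1)   # per-view cues (roofs, offsets)
        f2 = self.single_view_branch(view2)
        fused = self.fusion_branch(torch.cat([f1, f2], dim=1))  # cross-view association
        return f1, f2, fused
```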
Figure 3. Geographical coverage of the remote sensing data used to construct our BuildingDual dataset.
Figure 4. Visualization of our proposed BuildingDual dataset. (Top row) The original remote sensing images. (Middle row) The annotated 3D building information, including bounding boxes (red), roofs in two views (blue polygons), roof-to-footprint offsets (blue arrows), and building footprints (red polygons). (Bottom row) The corresponding 3D reconstruction results generated from the annotations.
Figure 5. A visual comparison of the intermediate processes and final 3D reconstruction results of different methods on dual-view satellite imagery. For the VGGT method, the visualization shows the point cloud output by VGGT and the mesh model reconstructed from this point cloud. For the building detection-based methods, including DualRecon, the two left columns of each sample show the 3D building information extraction process and the rightmost column shows the final 3D models. In these visualizations, the 3D building information is represented as follows: red rectangles denote the dual-view bounding boxes, red polygons the building footprints, blue polygons the building roofs, and blue arrows the roof-to-footprint offsets. For some methods, only the roof and offset are drawn in the extraction figures; this occurs when the footprints extracted from the two views have insufficient overlap, causing the cross-view matching to fail and preventing the generation of a final building footprint.
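The cross-view matching failure noted in the Figure 5 caption can be illustrated with a small sketch of IoU-based footprint matching. The shapely-based helper, the greedy pairing, and the 0.5 threshold are assumptions for illustration only; they are not necessarily the exact procedure or threshold used by the compared methods.

```python
# Illustrative IoU-based matching of footprints extracted from two views.
# Pairs with IoU below the (assumed) threshold stay unmatched, so no fused
# footprint is produced for them -- the failure case noted in Figure 5.
from shapely.geometry import Polygon

def polygon_iou(a: Polygon, b: Polygon) -> float:
    """Intersection-over-union of two footprint polygons."""
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

def match_footprints(view1, view2, iou_thresh=0.5):
    """Greedily pair footprints across views by best IoU above the threshold."""
    matches, used = [], set()
    for i, f1 in enumerate(view1):
        best_j, best_iou = None, iou_thresh
        for j, f2 in enumerate(view2):
            if j in used:
                continue
            iou = polygon_iou(f1, f2)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j, best_iou))
    return matches
```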
Figure 6. An illustration of different network architectures explored for building 3D information extraction.
Figure 7. A visual comparison of 3D building information extraction results across different inter-camera angles. The meaning of the colored elements in the figure is consistent with that in Figure 5.
Table 1. Quantitative comparison of our proposed DualRecon with other baseline methods on the dual-view 3D building reconstruction task. The best results are highlighted in bold. ↑ and ↓ indicate that higher and lower values are better, respectively. \ indicates that the metric is not applicable as the method is not building instance-based.
Method | Footprint AR ↑ | Footprint AP ↑ | Footprint F1 ↑ | Height MAE ↓
VGGT + PSR | \ | \ | \ | 4.670
LoFT (single) | 0.629 | 0.755 | 0.686 | 2.337
MLS-BRN (single) | 0.640 | 0.769 | 0.698 | 2.330
ViTAE (single) | 0.617 | 0.727 | 0.668 | 2.417
CMRCNN (single) | 0.547 | 0.646 | 0.592 | 2.861
LoFT (IoU-merge) | 0.614 | 0.787 | 0.690 | 2.279
MLS-BRN (IoU-merge) | 0.633 | 0.808 | 0.709 | 2.230
ViTAE (IoU-merge) | 0.592 | 0.765 | 0.668 | 2.348
CMRCNN (IoU-merge) | 0.494 | 0.750 | 0.596 | 2.769
DualRecon | 0.652 | 0.790 | 0.714 | 2.059
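For reference, the Footprint F1 column is consistent with the harmonic mean of Footprint AP and AR; for example, the DualRecon row gives 2 × 0.790 × 0.652 / (0.790 + 0.652) ≈ 0.714. (This relation is inferred from the reported values rather than stated explicitly in the table.)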
Table 2. Ablation study results for the network architecture of single-view information (offset/roof) extraction. The best results are highlighted in bold. ↑ and ↓ indicate that higher and lower values are better, respectively.
Architecture | Footprint AR ↑ | Footprint AP ↑ | Footprint F1 ↑ | Height MAE ↓
Architecture (a) | 0.639 | 0.768 | 0.698 | 2.568
Architecture (b) | 0.654 | 0.801 | 0.719 | 2.162
DualRecon (c) | 0.652 | 0.790 | 0.714 | 2.059
Table 3. Ablation study on the disparity offset loss weight ($\lambda_{dis}$). The best results are highlighted in bold. ↑ and ↓ indicate that higher and lower values are better, respectively.
$\lambda_{dis}$ | Footprint AR ↑ | Footprint AP ↑ | Footprint F1 ↑ | Height MAE ↓
32 | 0.646 | 0.784 | 0.708 | 2.063
24 | 0.652 | 0.778 | 0.710 | 2.063
16 | 0.652 | 0.790 | 0.714 | 2.059
8 | 0.651 | 0.785 | 0.711 | 2.062
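As an illustration of how the weight $\lambda_{dis}$ in Table 3 enters training, the sketch below assumes the disparity-offset term is an L1 regression loss added to the remaining detection losses with weight lambda_dis; the actual loss composition of DualRecon may differ, and all names here are illustrative.

```python
# Illustrative weighting of a disparity-offset regression term (cf. Table 3).
# Assumes an L1 loss on per-building roof disparity offsets; the remaining
# detection/mask losses are abstracted into a single precomputed tensor.
import torch
import torch.nn.functional as F

def total_loss(pred: dict, target: dict, lambda_dis: float = 16.0) -> torch.Tensor:
    loss_det = pred["det_loss"]                         # detection / mask losses (assumed precomputed)
    loss_dis = F.l1_loss(pred["disparity_offset"],      # (N, 2) predicted roof disparity offsets
                         target["disparity_offset"])
    return loss_det + lambda_dis * loss_dis

# Toy usage: with a unit disparity error, the weighted total is 1.2 + 16 * 1.0 = 17.2.
pred = {"det_loss": torch.tensor(1.2), "disparity_offset": torch.zeros(4, 2)}
target = {"disparity_offset": torch.ones(4, 2)}
print(total_loss(pred, target))
```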
Table 4. A comparison of the performance of various methods across different inter-camera angles.
Method | Inter-Camera Angle (°) | Footprint AR ↑ | Footprint AP ↑ | Footprint F1 ↑ | Height MAE ↓
MLS-BRN | <5 | 0.635 | 0.812 | 0.713 | 2.199
MLS-BRN | 5~10 | 0.633 | 0.796 | 0.705 | 2.139
MLS-BRN | >10 | 0.632 | 0.835 | 0.719 | 1.825
DualRecon | <5 | 0.649 | 0.803 | 0.718 | 2.095
DualRecon | 5~10 | 0.653 | 0.776 | 0.709 | 1.946
DualRecon | >10 | 0.654 | 0.809 | 0.723 | 1.655
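The trend in Table 4, where height MAE decreases as the inter-camera angle grows, matches a simple geometric intuition: the same disparity measurement error maps to a smaller height error when the two viewing directions are further apart. The sketch below illustrates this under a simplified same-azimuth geometry with assumed off-nadir angles, GSD, and a one-pixel disparity error; it is not the exact sensor geometry of the dataset.

```python
# Simplified same-azimuth illustration: the roof disparity between two off-nadir
# views is roughly d = h * |tan(theta1) - tan(theta2)| / GSD pixels, so a fixed
# disparity error eps maps to a height error of eps * GSD / |tan(theta1) - tan(theta2)|.
# All numbers below are assumed for illustration only.
import math

def height_error(eps_px: float, gsd_m: float, theta1_deg: float, theta2_deg: float) -> float:
    """Height error (m) induced by a disparity error of eps_px pixels."""
    baseline = abs(math.tan(math.radians(theta1_deg)) - math.tan(math.radians(theta2_deg)))
    return eps_px * gsd_m / baseline

print(height_error(1.0, 0.5, 20.0, 23.0))   # ~3 deg separation  -> about 8.3 m
print(height_error(1.0, 0.5, 20.0, 32.0))   # ~12 deg separation -> about 1.9 m
```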