Stereo Radargrammetry Using Deep Learning-Based Image Matching with Fine-Tuned Model on Synthetic Aperture Radar Images

Ito, Koichi; Sasayama, Tatsuya; Ito, Shintaro; Iwasa, Haruki; Aoki, Takafumi; Uemoto, Jyunpei

doi:10.3390/rs18101662



by Koichi Ito ¹,*, Tatsuya Sasayama ¹, Shintaro Ito ¹, Haruki Iwasa ¹, Takafumi Aoki ¹ and Jyunpei Uemoto ²

¹ Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan
² National Institute of Information and Communications Technology (NICT), Koganei 184-8795, Japan
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1662; https://doi.org/10.3390/rs18101662

Submission received: 24 March 2026 / Revised: 8 May 2026 / Accepted: 18 May 2026 / Published: 21 May 2026

(This article belongs to the Special Issue SAR Images Processing and Analysis (3rd Edition))


Highlights

What are the main findings?

Fine-tuning a Transformer-based model (RoMa) on a newly constructed SAR dataset significantly outperforms conventional image matching methods and pre-trained deep learning methods.
Establishing direct matching on slant-range images avoids the image quality degradation caused by traditional ground-range projection, achieving highly accurate and dense 3D elevation measurements.

What are the implications of the main findings?

The proposed framework successfully bridges the domain gap between optical and SAR images, enabling robust 3D elevation measurement in mountainous and forested terrains with large geometric modulations.
Eliminating the need for ground-range projection prevents the loss of high-frequency components, making it possible to generate dense and accurate Digital Surface Models (DSMs) while maintaining the original SAR image resolution.

Abstract

Stereo radargrammetry using Synthetic Aperture Radar (SAR) images is a powerful technique for all-weather 3D topographic measurement. However, conventional methods based on local template matching often struggle to establish accurate correspondences in mountainous or vegetated areas due to severe SAR-specific geometric modulations. In this paper, we propose a novel high-accuracy stereo radargrammetry framework by introducing RoMa, a robust Transformer-based deep learning model, for dense SAR image matching. Deep learning models pre-trained on optical images often suffer from a domain gap when applied to SAR images. To overcome this limitation, we develop an automated pipeline to construct a patch-based SAR image dataset using a reference Digital Surface Model (DSM) and an SAR projection model. By fine-tuning RoMa on this dataset, the model effectively adapts to the complex non-linear deformations of SAR images. Furthermore, unlike conventional methods, our approach establishes correspondences directly on the original slant-range images without requiring ground-range projection, thereby avoiding image quality degradation caused by pixel interpolation. Experimental results using airborne Pi-SAR2 images demonstrate that the fine-tuned RoMa significantly outperforms conventional methods, achieving an 82.86% matching accuracy at a 10-pixel threshold. In the 3D measurement evaluation, the proposed method achieves the lowest elevation mean error (−1.24 m) and the highest inlier ratio (74.1%), demonstrating its effectiveness in generating accurate, dense, and wide-area 3D point clouds even in challenging terrains.

Keywords:

SAR; radargrammetry; deep learning; image correspondence; 3D measurement

1. Introduction

Remote sensing is a fundamental technology that observes the Earth’s surface from a distance using sensors mounted on satellites or aircraft. Since it allows for the observation of wide areas in a short time and at low cost, it is widely used in various fields, including vegetation surveys, forest monitoring, and topographic measurement [1]. In particular, there are growing expectations for 3D measurement technologies using remote sensing as a means to quickly and safely assess damage and topographic changes in the event of large-scale natural disasters, such as earthquakes, volcanic eruptions, and torrential rains [2].

Sensors used for 3D measurement of the ground surface include optical cameras and Light Detection and Ranging (LiDAR); however, these are limited by weather conditions such as clouds and smoke, as well as the presence of sunlight. In contrast, Synthetic Aperture Radar (SAR) is an active sensor that transmits microwaves, capable of penetrating clouds and smoke, and enables observation day and night [3]. Since disasters occur regardless of weather conditions, establishing an all-weather 3D measurement method using SAR is important.

Major methods for 3D measurement using SAR include Interferometric SAR (InSAR) [4] and stereo radargrammetry [5]. Although InSAR can measure ground displacement and elevation with high accuracy using the phase difference between two SAR images, it suffers from ambiguities associated with phase unwrapping and geometric effects like shadow and layover, particularly in steep terrain. Furthermore, its measurement is limited to relative displacement. On the other hand, stereo radargrammetry is a method that measures absolute elevation based on the principle of stereo vision using two SAR amplitude images acquired from different viewpoints. Compared to InSAR, it is less affected by decorrelation and enables more robust measurement [6]. With the recent availability of high-resolution SAR data, the measurement accuracy has reached the meter order, demonstrating the potential of stereo radargrammetry for high-accuracy mapping [7].

The measurement accuracy of stereo radargrammetry depends heavily on the accuracy of the SAR projection model and the accuracy of correspondence matching between images. Many conventional radargrammetry methods required Ground Control Points (GCPs) or assumed parallel tracks. In contrast, Maruki et al. introduced the principle of stereo vision from computer vision and proposed a method for 3D measurement from SAR images acquired on arbitrary tracks without using GCPs [8]. However, their geometric model approximated the Earth’s surface as a plane, which caused errors in wide-area measurements or observations with large intersection angles. To address this problem, Insfran et al. redefined the SAR projection model based on a geocentric coordinate system considering the GRS80 ellipsoid model, reducing errors caused by geometric model imperfections [9].

However, even with a refined geometric model, the final 3D measurement accuracy will not improve without accurate correspondence matching between SAR images. Conventional methods [8,9] use Phase-Only Correlation (POC) [10] for image matching. POC is robust to illumination variations and noise and allows for sub-pixel estimation, but it approximates the deformation between images as a local translation. In SAR images, geometric image modulations such as foreshortening appear strongly in terrain with relief. Therefore, it is difficult to detect accurate correspondence points in steep mountainous areas or vegetated regions using POC based on a simple translation model.

In recent years, image matching methods using deep learning have made significant progress in the field of computer vision [11]. In particular, methods such as RoMa (Robust Dense Feature Matching) [12], which uses Vision Transformers [13], achieve high-density and high-accuracy image matching even for image pairs containing large geometric deformations by considering global context information. Applying these methods to SAR images has the potential to solve the problem of geometric image modulation, which is difficult to handle with conventional signal processing-based methods. Concurrently, the domain of SAR interpretation has witnessed significant advancements between 2023 and 2026, transitioning toward multidomain joint characterization frameworks [14,15]. Recent studies have demonstrated that integrating physically grounded constraints, such as General Polarimetric Correlation Patterns (GPCPs), into deep learning architectures effectively bridges the gap between electromagnetic scattering mechanisms and semantic recognition [16]. Inspired by these physics-driven approaches, we hypothesize that directly learning SAR-specific geometric distortions through a rigorously constructed dataset can effectively adapt optical-based Transformer models to the SAR domain. However, dense matching models such as RoMa are trained primarily on optical camera images, and applying them directly to SAR images, which have significantly different radiometric and geometric characteristics, does not yield sufficient performance. Furthermore, there is currently no public dataset specialized for learning SAR image matching.

To address these limitations, we propose a novel high-accuracy stereo radargrammetry framework. The main contributions of this paper are summarized as follows:

Dataset generation: We develop a fully automated pipeline to construct a large-scale, patch-based SAR image dataset. By back-calculating the true disparities using a reference DSM and a rigorous SAR projection model, we overcome the lack of training data in the SAR domain.
Model adaptation: We fine-tune RoMa, a Transformer-based dense matching model, on our SAR dataset. This explicitly adapts the network to capture complex, SAR-specific geometric modulations.
Slant-range matching: We establish a framework that performs matching directly on slant-range images. By eliminating the conventional ground-range projection step, our method preserves high-frequency components and avoids interpolation-induced image quality degradation.

Through a set of experiments, we demonstrate that the proposed method achieves higher accuracy of stereo radargrammetry than the POC-based method and another deep learning-based image correspondence method.

This article is a revised and expanded version of a paper entitled “Stereo Radargrammetry Using Deep Learning from Airborne SAR Images,” which was presented at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025), Brisbane, Australia, in August 2025 [17]. In this paper, we provide a detailed description of the proposed algorithm and the dataset construction process. Furthermore, we have expanded the comparative experiments to provide a more comprehensive discussion on the effectiveness of the proposed method.

2. Materials and Methods

In this section, we describe the proposed method for stereo radargrammetry. The proposed method integrates an SAR geometric model that accounts for the Earth ellipsoid with deep learning-based dense image matching. First, we explain the geometric model that serves as the foundation for the 3D measurement. Next, we describe the method for constructing the SAR image dataset, which is essential for training the deep learning model, as well as the fine-tuning process for the adopted image matching method, RoMa [12]. Finally, we present the overall flow of our 3D measurement pipeline that integrates these components.

2.1. Geometric Model of Stereo Radargrammetry

In this paper, we adopt the geometric model for 3D measurement based on the principle of stereo vision proposed by Maruki et al. [8]. Their method defines the relationship between the SAR image coordinate system $(u, v)$ and the 3D radar coordinate system $(X, Y, Z)$ based on the antenna position, and calculates 3D coordinates using geometric constraints obtained from two antenna positions. The definitions of the coordinate systems and their geometric relationship are illustrated in Figure 1.

First, the geometric relationship in a single SAR image is described using intrinsic parameters. The intrinsic parameters are defined as a set $(\alpha_u, \alpha_v, Z_0, D_{SL})$, consisting of the reciprocals of the image resolution in the azimuth and range directions, $\alpha_u$ and $\alpha_v$, the antenna altitude $Z_0$, and the distance from the antenna to the near range $D_{SL}$. Using these parameters, the relationship between a point $\mathbf{M} = (X, Y, Z)^\top$ in the radar coordinate system and its projected point $\mathbf{u} = (u, v)^\top$ on the SAR image is expressed as follows:

$$u = \alpha_u Y, \qquad v = \alpha_v \left( \sqrt{X^2 + (Z_0 - Z)^2} - D_{SL} \right). \tag{1}$$

Next, the positional relationship between the two antennas forming a stereo pair (reference image and source image) is described using extrinsic parameters. The extrinsic parameters are defined as a set $(\mathbf{R}, \mathbf{t})$, consisting of a $3 \times 3$ rotation matrix $\mathbf{R}$ and a $3 \times 1$ translation vector $\mathbf{t}$. The relationship between a point $\mathbf{M}$ in the radar coordinate system of the reference image and a point $\mathbf{M}'$ in that of the source image is given by the following rigid body transformation:

$$\mathbf{M}' = \mathbf{R} \mathbf{M} + \mathbf{t}. \tag{2}$$

In the initial work by Maruki et al. [8], the Earth’s surface was approximated as a plane when calculating the extrinsic parameters, which caused non-negligible errors in wide-area observations. Therefore, in this paper, following the method of Insfran et al. [9], we treat the Earth as a GRS80 ellipsoid model and apply extrinsic parameters $(\mathbf{R}, \mathbf{t})$ calculated based on the geocentric coordinate system. This eliminates systematic positional errors caused by the Earth’s curvature. Note that the extrinsic parameters $(\mathbf{R}, \mathbf{t})$ are accurately estimated using the onboard GNSS/IMU navigation data provided by the Pi-SAR2 system. Finally, if the correspondence points $\mathbf{u} = (u, v)^\top$ and $\mathbf{u}' = (u', v')^\top$ between two SAR images are known, the 3D coordinates of the ground surface $\mathbf{M} = (X, Y, Z)^\top$ can be uniquely determined using the following analytical solution derived from Equations (1) and (2):

$$\mathbf{M} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} \sqrt{\left(\dfrac{v}{\alpha_v} + D_{SL}\right)^2 - (Z_0 - Z)^2} \\ \dfrac{u}{\alpha_u} \\ \dfrac{-b - \sqrt{b^2 - 4ac}}{2a} \end{bmatrix}, \tag{3}$$

where the coefficients $a$, $b$, and $c$ and the intermediate coefficient $d$ are defined as follows:

$$\begin{aligned} a &= R_{21}^2 + R_{23}^2, \qquad b = -2\left(R_{23} d + R_{21}^2 Z_0\right), \\ c &= d^2 + R_{21}^2 Z_0^2 - R_{21}^2 \left(\frac{v}{\alpha_v} + D_{SL}\right)^2, \\ d &= \frac{u'}{\alpha_u'} - R_{22} \frac{u}{\alpha_u} - t_2, \end{aligned} \tag{4}$$

where $(R_{21}, R_{22}, R_{23})$ are the elements of the second row of the matrix $\mathbf{R}$, and $t_2$ is the second component of the vector $\mathbf{t}$. The primed symbols $(\alpha_u', \dots)$ denote the intrinsic parameters of the source image.
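To see where the quadratic comes from, note that the second row of Equation (2) together with $u' = \alpha_u' Y'$ gives $d = R_{21} X + R_{23} Z$; squaring $R_{21} X = d - R_{23} Z$ and substituting $X^2 = (v/\alpha_v + D_{SL})^2 - (Z_0 - Z)^2$ from Equation (1) yields $aZ^2 + bZ + c = 0$ with the coefficients of Equation (4). The Python sketch below illustrates Equations (1), (3), and (4) under stated assumptions: the function names `project` and `triangulate` and the argument layout are hypothetical and chosen only for illustration, and this is not the authors' implementation.

```python
import numpy as np

def project(M, alpha_u, alpha_v, Z0, D_SL):
    """Eq. (1): radar coordinates (X, Y, Z) -> SAR image coordinates (u, v)."""
    X, Y, Z = M
    u = alpha_u * Y
    v = alpha_v * (np.sqrt(X**2 + (Z0 - Z)**2) - D_SL)
    return u, v

def triangulate(u, v, u_src, ref_intrinsics, alpha_u_src, R, t):
    """Eqs. (3)-(4): recover (X, Y, Z) in the reference radar frame.

    u, v           : point in the reference image
    u_src          : azimuth coordinate u' of the matched point in the source image
    ref_intrinsics : (alpha_u, alpha_v, Z0, D_SL) of the reference image
    alpha_u_src    : alpha_u' of the source image
    R, t           : extrinsic parameters of Eq. (2)
    """
    alpha_u, alpha_v, Z0, D_SL = ref_intrinsics
    slant = v / alpha_v + D_SL              # slant range of the pixel
    R21, R22, R23 = R[1]                    # second row of R
    d = u_src / alpha_u_src - R22 * u / alpha_u - t[1]
    a = R21**2 + R23**2
    b = -2.0 * (R23 * d + R21**2 * Z0)
    c = d**2 + R21**2 * Z0**2 - R21**2 * slant**2
    Z = (-b - np.sqrt(b**2 - 4.0 * a * c)) / (2.0 * a)   # root used in Eq. (3)
    X = np.sqrt(slant**2 - (Z0 - Z)**2)
    Y = u / alpha_u
    return np.array([X, Y, Z])
```

A simple way to check such a sketch is a synthetic round trip: project a known point with Equation (1), transform it with Equation (2) to obtain $u'$, and confirm that `triangulate` returns the original point. Note that $a = 0$ when $R_{21} = R_{23} = 0$ (for example, $\mathbf{R}$ equal to the identity), in which case the quadratic degenerates and this sketch does not apply.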

This geometric model allows computer vision techniques to be applied to SAR images. However, the measurement accuracy using Equation (3) depends directly on the accuracy of the input correspondence points $(u, v)$ and $(u', v')$ between images. Conventional methods such as POC cannot cope with non-linear geometric distortions specific to SAR images, leading to a significant degradation in matching accuracy, particularly in mountainous areas and vegetated regions. Therefore, to realize high-accuracy 3D measurement, a new image matching method that is robust to these image modulations is essential.

2.2. Overview of the Proposed Method

As described in Section 2.1, the proposed method utilizes a 3D measurement framework based on the principle of stereo vision and a projection model that takes the Earth ellipsoid into account [9]. The major extensions and improvements of the proposed method over this conventional method [9] are the following three points:

1. Introducing RoMa [12], a deep learning model robust against complex geometric deformations, as the image correspondence method.
2. Developing an SAR image dataset construction method that automatically calculates ground truth disparities using a DSM and the projection model, enabling the application of deep learning models to SAR image processing.
3. Eliminating the ground projection process for SAR images, performing matching directly using slant-range images as input.

Regarding the third point in particular, POC [10], which is used in the conventional method, approximates the deformation between images as local translations. Therefore, it is necessary to convert slant-range images to ground-range images before matching to mitigate the deformation. However, this conversion involves pixel interpolation, which causes the loss of high-frequency components and leads to image quality degradation. In contrast, the proposed method employs RoMa [12], which is highly robust to complex non-linear deformations, making it possible to directly establish correspondences in slant-range images without requiring ground projection. This avoids image quality degradation caused by pixel interpolation and realizes high-accuracy 3D measurement while maintaining the original resolution of the SAR images.

2.3. Construction of SAR Image Dataset

To apply a deep learning model such as RoMa [12] to SAR images, a large-scale training dataset consisting of SAR image pairs and their corresponding pixel-wise ground truth disparities is required. However, there are no publicly available datasets specialized for SAR image matching. Therefore, in this paper, we propose a framework to automatically construct a training dataset by back-calculating true disparities using a DSM of the target area and the SAR projection model. The overview of the dataset construction is shown in Figure 2. The construction process consists of two stages: elevation map creation and patch extraction.

In the first step (i), an elevation map corresponding to each pixel of the SAR image is created using the DSM and metadata. Specifically, using the projection model in Equation (1), the 3D coordinates of the DSM are projected onto the SAR image coordinate system, and the absolute elevation $Z$ at each pixel $(u, v)$ is mapped. By filling the gaps in the point cloud through interpolation, dense elevation maps are obtained for both the reference and source images.

In the second step (ii), the SAR images and elevation maps are divided into patches of a size that can be input into the deep learning model (e.g., 560 × 560 pixels). The overview of the patch extraction is shown in Figure 3. First, the source image is geometrically aligned with the reference image (ii-1). Next, to suppress boundary effects during training, a common grid is placed (ii-2) so that the patches overlap by approximately one-third. The grid coordinates are then inversely transformed back to the original source image geometry (ii-3) based on the metadata. Finally, the corresponding patch pairs and elevation map pairs are extracted based on these grids (ii-4).
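The grid placement in step (ii-2) can be sketched as follows. This is a minimal illustration that assumes a regular grid with a stride of about two-thirds of the patch size; the helper names `patch_origins` and `patch_grid`, and the choice to add one final patch flush with the image border, are assumptions made for this example and are not taken from the paper.

```python
def patch_origins(length, patch=560, overlap=1.0 / 3.0):
    """Start offsets of overlapping patches along one image axis."""
    if length <= patch:
        return [0]
    # A one-third overlap corresponds to a stride of about two-thirds of a patch.
    stride = max(1, round(patch * (1.0 - overlap)))
    starts = list(range(0, length - patch + 1, stride))
    # Assumption: cover the remaining border with one patch aligned to the edge.
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts

def patch_grid(height, width, patch=560, overlap=1.0 / 3.0):
    """Top-left (row, col) corners of all patches on the common grid."""
    return [(r, c)
            for r in patch_origins(height, patch, overlap)
            for c in patch_origins(width, patch, overlap)]
```

In the paper's pipeline, the grid is defined on the aligned image pair and then mapped back to the source geometry (ii-3); that inverse transform depends on the metadata and is not reproduced here.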

Subsequently, the pixel-wise ground truth disparity is calculated using the extracted patch pairs and elevation map pairs. The ground truth disparity for an arbitrary point $\mathbf{u}_1 = (u_1, v_1)^\top$ on the reference image is geometrically derived through the following procedure. First, using the elevation value $Z_1$ obtained from the elevation map of the reference image and the intrinsic parameters $(\alpha_{u1}, \alpha_{v1}, Z_{01}, D_{SL1})$, the 3D point $\mathbf{M}_1 = (X_1, Y_1, Z_1)^\top$ in the radar coordinate system of the reference image corresponding to $\mathbf{u}_1$ is calculated by

$$\begin{cases} X_1 = \sqrt{\left(\dfrac{v_1}{\alpha_{v1}} + D_{SL1}\right)^2 - (Z_{01} - Z_1)^2} \\ Y_1 = \dfrac{u_1}{\alpha_{u1}} \end{cases}. \tag{5}$$

Next, using the extrinsic parameters $(\mathbf{R}, \mathbf{t})$, $\mathbf{M}_1$ is transformed into a point $\mathbf{M}_2 = (X_2, Y_2, Z_2)^\top$ in the radar coordinate system of the source image as follows:

$$\mathbf{M}_2 = \mathbf{R} \mathbf{M}_1 + \mathbf{t}. \tag{6}$$

Subsequently, using the intrinsic parameters $(\alpha_{u2}, \alpha_{v2}, Z_{02}, D_{SL2})$ of the source image, the coordinates $\mathbf{u}_2 = (u_2, v_2)^\top$ where $\mathbf{M}_2$ is projected onto the source image are calculated by

$$\begin{cases} u_2 = \alpha_{u2} Y_2 \\ v_2 = \alpha_{v2} \left( \sqrt{X_2^2 + (Z_{02} - Z_2)^2} - D_{SL2} \right) \end{cases}. \tag{7}$$

Through the above steps, the true corresponding point $\mathbf{u}_2$ on the source image for the point $\mathbf{u}_1$ on the reference image is obtained. By calculating the displacement between these two points, a dense disparity map, which serves as the ground truth data for training, can be obtained.
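The three steps of Equations (5) to (7) can be sketched in vectorized form as below. This is an illustrative NumPy sketch, not the authors' code: the function name `ground_truth_disparity`, the tuple layout of the intrinsic parameters, and the convention that the disparity is $\mathbf{u}_2 - \mathbf{u}_1$ are assumptions made for the example.

```python
import numpy as np

def ground_truth_disparity(u1, v1, Z1, ref, src, R, t):
    """Disparity (du, dv) = u_2 - u_1 for reference pixels with known elevation.

    u1, v1, Z1 : arrays of reference pixel coordinates and elevations
    ref, src   : (alpha_u, alpha_v, Z0, D_SL) of the reference / source image
    R, t       : extrinsic parameters of Eq. (6)
    """
    a_u1, a_v1, Z01, D_SL1 = ref
    a_u2, a_v2, Z02, D_SL2 = src
    u1, v1, Z1 = np.broadcast_arrays(
        *(np.asarray(x, dtype=float) for x in (u1, v1, Z1)))

    # Eq. (5): reference pixel + elevation -> 3D point in the reference frame.
    slant = v1 / a_v1 + D_SL1
    X1 = np.sqrt(slant**2 - (Z01 - Z1)**2)
    Y1 = u1 / a_u1
    M1 = np.stack([X1, Y1, Z1], axis=0)               # shape (3, ...)

    # Eq. (6): rigid transform into the source radar frame.
    t_col = np.asarray(t, dtype=float).reshape((3,) + (1,) * (M1.ndim - 1))
    M2 = np.tensordot(np.asarray(R, dtype=float), M1, axes=1) + t_col
    X2, Y2, Z2 = M2

    # Eq. (7): project into the source image.
    u2 = a_u2 * Y2
    v2 = a_v2 * (np.sqrt(X2**2 + (Z02 - Z2)**2) - D_SL2)
    return u2 - u1, v2 - v1
```

Pixels for which the square root in Equation (5) has a negative argument produce NaN in this sketch; in a full pipeline such pixels would presumably be marked invalid in the confidence mask.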

2.4. Fine-Tuning of RoMa for SAR Images

In this paper, we utilize RoMa [12] as the core algorithm for dense image matching in stereo radargrammetry. RoMa leverages multi-scale features: DINOv2 [18] provides deep semantic embeddings for robust global deformation estimation, and VGG-19 [19] extracts shallow features for fine-grained local correspondence refinement. This architecture is well suited to the severe geometric image modulations inherent in SAR images. However, to bridge the domain gap between natural optical images and SAR images, the pre-trained RoMa model must be fine-tuned using the patch-based SAR dataset constructed in Section 2.3. During fine-tuning, the patch pairs from the reference and source images are input into RoMa, which outputs the predicted disparity map $\hat{D}$ and the corresponding predicted confidence map $\hat{C}$. Let $D$ and $C$ denote the ground truth disparity map and the ground truth confidence map, respectively. The ground truth disparity $D$ is geometrically calculated from the elevation maps as described in Section 2.3. The ground truth confidence map $C$ acts as a binary mask to handle SAR-specific imaging challenges: pixels where the true disparity can be successfully calculated (i.e., valid regions without severe radar shadow or layover) are set to 1, while all other pixels are set to 0. The fine-tuning is driven by a combined loss function consisting of a regression loss $L_D$ for the disparity and a binary cross-entropy loss $L_C$ for the confidence map. The regression loss $L_D$ is based on the L2 norm of the pixel-wise disparity errors, masked by the ground truth confidence to effectively ignore invalid regions, and is defined as

$$L_D = \frac{1}{|I|} \sum_{i \in I} C_i \left\| \hat{D}_i - D_i \right\|_2, \tag{8}$$

where $I$ is the set of all pixels in the reference patch, and $\hat{D}_i$, $D_i$, and $C_i$ are the values of $\hat{D}$, $D$, and $C$ at pixel $i$, respectively. The classification loss $L_C$ evaluates the reliability of the predicted correspondences and is defined by the standard binary cross-entropy as

$$L_C = -\frac{1}{|I|} \sum_{i \in I} \left\{ C_i \log \hat{C}_i + (1 - C_i) \log\left(1 - \hat{C}_i\right) \right\}, \tag{9}$$

where $\hat{C}_i \in [0, 1]$ is the predicted confidence value at pixel $i$. The total loss function $L$ used to optimize the network is given by the weighted sum of these two losses as

$$L = L_D + \lambda L_C, \tag{10}$$

where $\lambda$ is a hyperparameter that balances the two loss terms. In our experiments, $\lambda$ is empirically set to 0.01 to stabilize the training process. By minimizing this total loss, the model learns to accurately estimate disparities specific to SAR geometric deformations while appropriately assessing its own matching uncertainty.
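For concreteness, Equations (8) to (10) can be written out numerically as in the sketch below. It uses NumPy rather than a deep learning framework, so it only illustrates the arithmetic of the loss and is not the training code; the function name `fine_tuning_loss`, the array shapes, and the small clipping constant `eps` (added to keep the logarithm finite) are assumptions introduced for this example.

```python
import numpy as np

def fine_tuning_loss(D_hat, C_hat, D, C, lam=0.01, eps=1e-7):
    """Total loss L = L_D + lam * L_C over one reference patch.

    D_hat, D : (H, W, 2) predicted / ground truth disparity maps
    C_hat    : (H, W) predicted confidence in [0, 1]
    C        : (H, W) binary ground truth confidence mask
    """
    # Eq. (8): masked L2 norm of the per-pixel disparity error,
    # averaged over all pixels of the patch (valid or not).
    err = np.linalg.norm(D_hat - D, axis=-1)
    L_D = np.mean(C * err)

    # Eq. (9): binary cross-entropy between predicted and true confidence.
    p = np.clip(C_hat, eps, 1.0 - eps)   # assumption: clip to avoid log(0)
    L_C = -np.mean(C * np.log(p) + (1.0 - C) * np.log(1.0 - p))

    # Eq. (10): weighted sum, with lam = 0.01 as stated in the text.
    return L_D + lam * L_C, L_D, L_C
```

Because the sum in Equation (8) is normalized by $|I|$ rather than by the number of valid pixels, patches with large invalid regions contribute a proportionally smaller regression loss in this formulation.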

2.5. Overall Pipeline of 3D Measurement

Once the RoMa model is fine-tuned, it is deployed for the actual 3D measurement task. The overall stereo radargrammetry pipeline using the proposed method is illustrated in Figure 4. First, similarly to the dataset construction phase, the input reference and source SAR images covering the target observation area are divided into overlapping patches based on their metadata to ensure efficient processing. The extracted patch pairs are then fed into the fine-tuned RoMa model. For each patch pair, RoMa infers the dense disparity map and its associated confidence map. Matches with confidence scores below a certain threshold are treated as outliers and discarded. Subsequently, using the reliable corresponding point pairs $(u, v)$ and $(u', v')$ obtained from the valid disparities, the 3D coordinates $(X, Y, Z)$ of the ground surface are analytically calculated using the SAR projection model and the rigorous Earth ellipsoid geometry defined in Equation (3). The 3D point clouds generated from individual patch pairs are then integrated and aligned to construct a comprehensive 3D point cloud of the entire target area. Finally, this unified 3D point cloud is interpolated and rasterized to generate the final elevation map. Note that elevations are not calculated for pixels in the reference image that lack valid correspondences due to low confidence, thereby preventing the generation of severe artifacts in the final 3D reconstruction.
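The confidence-based filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_reliable_matches` is hypothetical, the column index is taken as $u$ and the row index as $v$, the disparity channels are ordered $(\Delta u, \Delta v)$, and the threshold is left as a required argument because the text does not state its value.

```python
import numpy as np

def select_reliable_matches(disparity, confidence, threshold):
    """Turn dense maps into point pairs, dropping low-confidence pixels.

    disparity  : (H, W, 2) predicted disparity (du, dv) per reference pixel
    confidence : (H, W) predicted confidence
    threshold  : matches with confidence below this value are discarded
    Returns (ref, src), each of shape (N, 2) holding (u, v) coordinates.
    """
    H, W = confidence.shape
    v, u = np.mgrid[0:H, 0:W]            # assumption: rows index v, columns index u
    keep = confidence >= threshold
    ref = np.stack([u[keep], v[keep]], axis=-1).astype(float)
    src = ref + disparity[keep]          # corresponding points in the source patch
    return ref, src
```

The returned pairs are what Equation (3) takes as input; pixels that fail the test simply produce no 3D point, which matches the statement that elevations are not calculated for them.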

3. Results

This section presents a comprehensive evaluation of the proposed stereo radargrammetry framework. After detailing the experimental setup, we first evaluate the image correspondence matching performance to validate the effectiveness of fine-tuning RoMa on SAR images. Subsequently, we evaluate the final 3D measurement accuracy, demonstrating the overall superiority of our deep learning-based approach over conventional methods through both quantitative and qualitative analyses.

3.1. Experimental Setup

Before detailing the dataset and implementation, we outline the objectives of our experiments. The primary goal is to evaluate the effectiveness of the proposed stereo radargrammetry framework, which incorporates RoMa [12] for dense SAR image matching. To verify whether RoMa [12] is optimally suited for handling the severe geometric modulations inherent in SAR images, we compare its performance with another state-of-the-art deep learning-based dense matching method, Deep Kernelized Dense Geometric Matching (DKM) [20]. While DKM [20] has demonstrated strong performance in natural optical image matching, we hypothesize that RoMa’s Transformer-based architecture, which captures global context more robustly, is better equipped to handle the complex non-linear deformations specific to SAR images. Furthermore, we evaluate the final 3D reconstruction accuracy against the conventional POC-based method [9] to demonstrate the overall superiority of the proposed deep learning-based approach.

3.1.1. Study Area and Data

In this study, we utilized airborne SAR images acquired by Pi-SAR2, an airborne SAR system developed by the National Institute of Information and Communications Technology (NICT) [21]. The dataset was constructed using SAR images acquired around Mount Aso in Kumamoto Prefecture, Japan, on 17 April 2016 and 16 November 2017. The azimuth spatial resolution of the SAR images is approximately 0.3 m. To generate the ground truth elevation maps required for the dataset construction, we used the AW3D Digital Surface Model (DSM) [22], jointly provided by NTT DATA Corporation and the Remote Sensing Technology Center of Japan (RESTEC).

3.1.2. Dataset Preparation

The details of the constructed SAR image dataset are summarized in Table 1. To rigorously evaluate the generalization performance and prevent data leakage, the training, validation, and test sets were spatially divided into geographically disjoint regions prior to the patch extraction process. In our experiments, we evaluated the proposed method utilizing RoMa [12] and compared it with another state-of-the-art deep learning-based matching method, DKM [20]. Note that the total number of extracted patch pairs differs between DKM and RoMa because the input patch sizes required by their respective network architectures are different. Specifically, the patch size for RoMa was set to $560 \times 560$ pixels to ensure a sufficient receptive field for capturing the global context while conforming to GPU memory constraints, whereas DKM utilized $384 \times 512$ pixels as per its default configuration.

3.1.3. Implementation Details

For both DKM [20] and RoMa [12], we initialized the networks with weights pre-trained on the MegaDepth natural camera image dataset [23], provided by their respective authors. Given the significant domain gap between the optical and SAR images, we fine-tuned all parameters of the networks end-to-end rather than freezing specific layers. The fine-tuning and inference were performed on a machine equipped with a single NVIDIA GeForce RTX 4090 GPU. We utilized the AdamW optimizer [24] with a batch size of 2. The initial learning rates were set to $1 \times 10^{-6}$ for the encoder and $2 \times 10^{-5}$ for the remaining network components. The maximum number of training epochs was set to 50, but the models were trained until the evaluation metrics on the validation set stopped improving (i.e., early stopping was applied). All other training hyperparameters were kept identical to those in the original studies [12,20].

3.2. Matching Performance

Before evaluating the final 3D measurement accuracy, we evaluate the 2D image correspondence performance. Since the accuracy of stereo radargrammetry relies on the quality of the dense matching step, this subsection comprehensively evaluates the matching capabilities of the proposed method compared to the baselines. We conduct both quantitative analyses using pixel-level accuracy metrics and qualitative visual inspections of disparity error maps.

3.2.1. Evaluation Metrics and Test Patches

For quantitative evaluation, we assessed the accuracy of the image correspondence on the test dataset. Accuracy is defined as the percentage of matched pixels where the Euclidean distance between the predicted corresponding point and the ground truth point is less than or equal to a specific pixel threshold. In this experiment, we evaluated the accuracy at four thresholds: 1, 3, 5, and 10 pixels.
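The accuracy metric defined above can be computed as in the following sketch. The function name `matching_accuracy` and the `valid` mask argument (marking pixels that have a ground truth correspondence) are assumptions introduced for illustration; the thresholds are the four values stated in the text.

```python
import numpy as np

def matching_accuracy(pred, gt, valid, thresholds=(1, 3, 5, 10)):
    """Percentage of valid pixels whose match lies within each pixel threshold.

    pred, gt : (..., 2) predicted / ground truth corresponding points
    valid    : boolean array selecting pixels with a ground truth match
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    err = np.linalg.norm(diff, axis=-1)          # Euclidean distance per pixel
    err = err[np.asarray(valid, dtype=bool)]
    return {t: 100.0 * float(np.mean(err <= t)) for t in thresholds}
```

Because the thresholds are nested, the accuracy is non-decreasing from 1 to 10 pixels, which is why a method can lead at the 1-pixel threshold and still be overtaken at looser thresholds.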

For qualitative evaluation, we verified the disparity error maps for three image pairs from the test dataset, denoted as Patch 1, Patch 2, and Patch 3, as shown in Figure 5. All three patch pairs are located around the Mount Aso region, but they present different levels of matching difficulty:

Patch 1: The crossing angle of the flight paths is relatively small, and the terrain is generally flat. Textures such as agricultural boundaries are clearly visible, making image matching relatively straightforward.
Patch 2: The crossing angle is larger, and the area is located on a mountainside with steep terrain. The severe geometric image modulations make accurate correspondence much more difficult than in Patch 1.
Patch 3: In addition to a large crossing angle, the trees across the image appear to lean in different directions due to SAR-specific geometric modulations (e.g., layover). Thus, similarly to Patch 2, accurate matching is highly challenging.

Figure 5. Three patch pairs of SAR images used for the qualitative evaluation of matching performance.

As established in Section 3.1, we compared the proposed method with the POC-based method [9], DKM [20], and RoMa [12]. Models pre-trained only on the MegaDepth dataset are denoted with “(M)”, while models fine-tuned on our SAR dataset are denoted with “(S)”. Since POC cannot handle the resolution differences between the azimuth and range directions or the rotation between images, the slant-range images must be projected onto a plane parallel to the ground-range plane at the average elevation before matching. In contrast, all deep learning-based methods (DKM [20] and RoMa [12]) directly establish correspondences using the original slant-range images.

3.2.2. Quantitative Results

The quantitative results of the matching accuracy are presented in Table 2. As shown in the table, at the strictest threshold of 1 pixel, DKM (S) achieves the highest accuracy of 16.73%. However, as the tolerance threshold increases to 3, 5, and 10 pixels, the proposed RoMa (S) consistently demonstrates the best performance, reaching 82.86% at the 10-pixel threshold. By comparing the pre-trained (M) and fine-tuned (S) models, we observe a substantial improvement across all thresholds. For instance, the accuracy of RoMa (M) at 10 pixels is 60.30%, which significantly increases by over 22 percentage points after fine-tuning. These quantitative observations validate that fine-tuning with the constructed SAR image dataset substantially enhances the pixel-level correspondence accuracy.

3.2.3. Qualitative Results

Figure 6 shows the qualitative results visualized as error maps for Patches 1 to 3. The colormap in these figures represents the magnitude of the displacement error at each corresponding pixel. For the POC-based method, significant matching failures (indicated by dark red regions) are widespread even in the relatively flat terrain of Patch 1. In Patches 2 and 3, which contain complex topography, POC produces massive unmeasured regions (shown in gray) alongside large local errors. RoMa (M) appears to recover more corresponding points than POC, yet it still exhibits severe error clusters, particularly in the forested and steep regions of Patches 2 and 3. In stark contrast, the proposed RoMa (S) successfully establishes highly accurate correspondences across the majority of the image patches. The error maps for RoMa (S) are predominantly dark blue, indicating sub-meter to few-meter errors, even in areas with severe geometric modulations where both POC and RoMa (M) failed completely. While some extremely challenging localized areas (such as the base of trees, likely due to deep radar shadows) still present minor errors, the visual comparison confirms that the fine-tuned model yields the most robust and dense matching results.

3.3. 3D Measurement Accuracy

In this section, we evaluate the accuracy of the 3D elevation measurement achieved by the proposed method.

3.3.1. Evaluation Metrics and Test Areas

To evaluate the 3D measurement accuracy, we conducted experiments on three large-scale test regions, denoted as Area 1, Area 2, and Area 3, as shown in Figure 7. Unlike the localized patch-based evaluation presented in Section 3.2, these areas represent continuous SAR images without patch division. Evaluating on these extended areas allows us to verify the practical applicability of the proposed 3D point cloud integration pipeline for wide-area Digital Surface Model (DSM) generation.

Following the comparative setup used in the matching evaluation, we assessed the proposed method (RoMa (S)) against the conventional POC-based method [9], pre-trained models (DKM (M) and RoMa (M)), and the fine-tuned DKM (S). For the quantitative evaluation, the generated elevation maps for the three test areas were compared against the ground truth DSM. We calculated the mean error ($\mu$) and standard deviation ($\sigma$) of the elevation differences. Given the very large number of evaluated pixels across three distinct, large-scale geographic areas, $\sigma$ serves as a stable indicator of how consistently each method performs across diverse topographies. Additionally, we evaluated the inlier ratio, defined as the percentage of reconstructed 3D points with an absolute elevation error of less than 2 m. A cumulative error distribution graph was also plotted across various thresholds to provide a comprehensive evaluation of measurement accuracy. For the qualitative evaluation, we visually inspected the generated elevation maps and their corresponding error maps for the three test areas.
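The metrics described above can be sketched in a few lines. The following is a minimal illustration, assuming that unmeasured pixels are stored as NaN, that the error is defined as estimated minus reference elevation, and that the population standard deviation is used; none of these conventions is confirmed by the text, and the function names are hypothetical.

```python
import numpy as np

def elevation_metrics(est_dsm, ref_dsm, inlier_tol=2.0):
    """Return (mean error, standard deviation, inlier ratio in %) in meters."""
    diff = np.asarray(est_dsm, float) - np.asarray(ref_dsm, float)
    diff = diff[np.isfinite(diff)]          # drop unmeasured (NaN) pixels
    mu = float(diff.mean())
    sigma = float(diff.std())               # population std (ddof=0), an assumption
    inlier = 100.0 * float(np.mean(np.abs(diff) < inlier_tol))
    return mu, sigma, inlier

def cumulative_curve(est_dsm, ref_dsm, thresholds):
    """Percentage of measured points with absolute error below each threshold."""
    diff = np.asarray(est_dsm, float) - np.asarray(ref_dsm, float)
    abs_err = np.abs(diff[np.isfinite(diff)])
    return [100.0 * float(np.mean(abs_err < t)) for t in thresholds]

# Toy 2x2 elevation map (not from the paper) with one unmeasured pixel
est = np.array([[1.0, 3.0], [np.nan, -2.0]])
ref = np.zeros((2, 2))
mu, sigma, inlier = elevation_metrics(est, ref)
curve = cumulative_curve(est, ref, thresholds=[1.5, 2.5, 3.5])
```

Note that in this sketch the inlier ratio is computed over measured points only; whether unmeasured pixels count against the ratio is a design choice that changes the reported percentages.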

3.3.2. Quantitative Results

The quantitative 3D measurement results for each method across the three test areas are summarized in Table 3. Furthermore, Figure 8 illustrates the cumulative percentage of 3D points against various elevation error thresholds. As shown in Table 3, the POC-based method initially appears to have the smallest mean error ($\mu$). However, it exhibits a large standard deviation ($\sigma$), reaching 21.2 m in Area 1. This indicates that the small mean error results from large positive and negative errors canceling each other out, rather than from high precision. Additionally, the low inlier ratios of POC (e.g., only 21.1% in Area 1) highlight its inability to obtain valid corresponding points consistently. The pre-trained models, DKM (M) and RoMa (M), demonstrate both large mean errors and high standard deviations, resulting in poor 3D reconstructions. Although fine-tuning enables DKM (S) to reconstruct a wider area, its mean error does not show significant improvement compared to its pre-trained version. In contrast, the proposed RoMa (S) consistently achieves small mean errors, together with the lowest standard deviations in Areas 2 and 3; in Area 1, DKM (S) has a lower $\sigma$ (8.4 m versus 10.0 m) but a much larger mean error (−9.83 m versus −1.24 m). Furthermore, the cumulative error distribution graphs in Figure 8 illustrate that RoMa (S) possesses the fastest-rising curve among all methods, meaning it recovers the highest proportion of 3D points at tight error thresholds. It also demonstrates the highest inlier ratio in every area, reaching 74.1% in Area 3, confirming its capability to generate dense and accurate 3D measurements.
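The cancellation effect noted for POC can be illustrated with synthetic numbers. The sketch below uses made-up error samples, not data from Table 3, and the helper name `summarize` is hypothetical. It also reports the RMSE, which for the population standard deviation satisfies RMSE² = μ² + σ² and therefore does not reward cancellation.

```python
import numpy as np

def summarize(errors, tol=2.0):
    """Return (mean, std, RMSE, inlier ratio in %) of elevation errors in meters."""
    errors = np.asarray(errors, dtype=float)
    mu = float(errors.mean())
    sigma = float(errors.std())                     # population std (ddof=0)
    rmse = float(np.sqrt(np.mean(errors ** 2)))     # equals sqrt(mu**2 + sigma**2)
    inlier = 100.0 * float(np.mean(np.abs(errors) < tol))
    return mu, sigma, rmse, inlier

# Synthetic case A: large errors of opposite sign cancel in the mean
scattered = np.array([20.0, -20.0] * 50)
# Synthetic case B: a small but consistent bias
biased = np.full(100, -1.5)

stats_scattered = summarize(scattered)   # mean 0 m, std 20 m, no inliers
stats_biased = summarize(biased)         # mean -1.5 m, std 0 m, all inliers
```

Case A has the smaller mean error but is far less useful than case B, which is why the mean should be read together with the standard deviation and the inlier ratio.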

3.3.3. Qualitative Results

Figure 9, Figure 10 and Figure 11 compare the generated elevation maps and their corresponding error maps for Areas 1 to 3, respectively. The gray regions represent unmeasured pixels where 3D points could not be calculated due to matching failures or outlier filtering. Consistent with the quantitative metrics, the error maps for POC and the pre-trained models show highly sparse reconstructions characterized by massive unmeasured areas and large local deviations (visible as dark red or deep blue patches). The DKM (S) method reduces the unmeasured areas but still exhibits noticeable elevation artifacts. Conversely, the proposed RoMa (S) visually produces the most complete and smooth elevation maps, closely resembling the ground truth DSM. By effectively minimizing the unmeasured gaps and significantly suppressing the magnitude of elevation errors across continuous landscapes, the proposed framework demonstrates its practical viability for wide-area, high-fidelity stereo radargrammetry.

4. Discussion

In this section, we provide an in-depth analysis of the experimental results, focusing on the underlying matching mechanisms and the physical factors that influence the 3D measurement accuracy in challenging SAR environments.

4.1. Mechanism Analysis: Transformer- vs. CNN-Based Matching

The quantitative results in Table 2 reveal a notable performance trade-off between the proposed RoMa-based framework and the DKM baseline. At the strictest threshold of 1 pixel, DKM (S) slightly outperforms RoMa (S) (16.73% versus 15.93%) [12,20]. This behavior is attributed to DKM’s dense kernelized matching approach, which excels at exploiting fine-grained local textures for precise sub-pixel alignment [20]. In contrast, RoMa utilizes a Transformer-based architecture that leverages global semantic embeddings from DINOv2 [18]. While this emphasis on global context may result in a marginal loss of local precision at the 1-pixel level, it provides superior robustness as the error threshold increases. As demonstrated in the 3-, 5-, and 10-pixel evaluations, RoMa (S) consistently achieves the highest accuracy because it effectively captures the large-scale, non-linear geometric deformations inherent in SAR imagery, which often cause CNN-based local windows to lose track of the correct correspondence [12,25].

4.2. Impact of SAR-Specific Geometric Distortions

The performance of stereo radargrammetry is fundamentally constrained by SAR-specific distortions, including foreshortening, layover, and shadowing [16,25]. In our qualitative analysis (Figure 6 and Figure 9), matching failures were primarily concentrated in steep mountainous slopes and dense forested areas. Specifically, the tree lean caused by layover effects in Area 3 creates complex, non-linear disparities that vary with the tree height and radar incidence angle [26]. Conventional POC methods failed significantly in these regions due to their reliance on a simple local translation model [9,10]. The proposed fine-tuned RoMa mitigates these effects by integrating global contextual relationships; however, localized errors persist at tree boundaries where the scattering mechanism becomes highly complex and dispersive [16].

4.3. Reliability and Confidence-Aware Filtering

A key strength of the proposed framework is the integration of a confidence-based outlier removal process. Unlike traditional methods that may produce erroneous 3D points in invalid regions, our model assesses matching uncertainty through the predicted confidence map $\hat{C}$ [12]. Regions affected by deep radar shadow represent a total loss of backscatter information [25]. By setting the ground truth confidence to zero during training, the model learns to identify these information gaps. Consequently, the high inlier ratio (74.1% in Area 3) demonstrates that the framework effectively prevents the generation of reconstruction artifacts by discarding low-confidence matches, thereby ensuring the reliability of the generated DSM [9,17].
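A confidence-based filter of this kind can be sketched as below. The array layout (a dense warp of shape H × W × 2 holding source coordinates for every reference pixel), the function name `filter_by_confidence`, and the threshold of 0.5 are all assumptions made for illustration; this section does not state the threshold used in the experiments.

```python
import numpy as np

def filter_by_confidence(warp, confidence, threshold=0.5):
    """Keep only correspondences whose predicted confidence reaches the threshold.

    warp: (H, W, 2) array; warp[y, x] is the matched (x, y) in the source image.
    confidence: (H, W) array of predicted confidences in [0, 1].
    Returns reference points, source points, and the boolean keep mask.
    """
    keep = np.asarray(confidence) >= threshold
    rows, cols = np.nonzero(keep)
    ref_pts = np.stack([cols, rows], axis=1)   # (x, y) in the reference image
    src_pts = np.asarray(warp)[rows, cols]     # matched (x, y) in the source image
    return ref_pts, src_pts, keep

# Toy 2x2 example (not from the paper)
conf = np.array([[0.9, 0.1],
                 [0.4, 0.7]])
warp = np.arange(8, dtype=float).reshape(2, 2, 2)
ref_pts, src_pts, keep = filter_by_confidence(warp, conf)
```

Pixels rejected by the mask simply produce no 3D point, which corresponds to the gray unmeasured regions in the elevation and error maps.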

4.4. Error Propagation and System Limitations

While the proposed matching method significantly reduces the mean elevation error, the final measurement accuracy remains influenced by the extrinsic parameters $(R, t)$. Although the Pi-SAR2 navigation system provides sub-meter precision [21], any residual boresight misalignment can propagate into systematic elevation shifts. Our results indicate that matching error is the dominant source of uncertainty in complex terrains; however, integrating physical navigation constraints directly into the deep learning loss function remains a promising direction for future work to improve global consistency [26].

5. Conclusions

In this paper, we proposed a deep learning-based stereo radargrammetry framework to address the challenges of severe geometric image modulations inherent in SAR images. To overcome the domain gap between optical and SAR images, we developed a pipeline to construct a patch-based SAR image dataset using ground truth DSMs and a rigorous SAR projection model. Using this dataset, we fine-tuned RoMa [12], a robust Transformer-based image matching model, enabling it to accurately capture the complex non-linear deformations specific to SAR images. Furthermore, our approach eliminated the need for ground-range projection, allowing for direct correspondence matching on slant-range images to prevent image quality degradation caused by pixel interpolation.

Through comprehensive experiments using airborne Pi-SAR2 datasets, we demonstrated the clear superiority of the proposed method. In the image correspondence matching evaluation, the fine-tuned RoMa outperformed the conventional POC-based method [9] and the state-of-the-art DKM model [20]. For instance, our method achieved an 82.86% matching accuracy within a 10-pixel error threshold, significantly surpassing the baselines. Consequently, in the 3D measurement evaluation, our method combined small mean elevation errors with the highest inlier ratios in all three test areas. Specifically, the proposed framework reached an inlier ratio of 74.1% in Area 3 and kept the mean elevation error between $-1.24$ m and $-1.62$ m across the three areas (Table 3). It successfully generated dense and reliable elevation maps covering wide areas, even in challenging mountainous and forested terrains where conventional methods failed. Future work will focus on extending the proposed framework to spaceborne SAR datasets (e.g., TerraSAR-X and ALOS-2) to verify its applicability to satellite remote sensing. Additionally, we plan to evaluate and further improve the robustness of the model across a wider variety of global topographies and land covers.

Author Contributions

Conceptualization, K.I. and J.U.; methodology, K.I. and J.U.; software, T.S.; formal analysis, T.S., S.I. and H.I.; investigation, T.S. and H.I.; resources, J.U.; writing—original draft preparation, K.I. and T.S.; writing—review and editing, K.I., S.I. and J.U.; visualization, T.S.; supervision, K.I., T.A. and J.U.; project administration, K.I., T.A. and J.U.; funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI Grant Number JP25K03131.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The SAR image data analyzed in this study were provided by the National Institute of Information and Communications Technology (NICT) and are not publicly available due to third-party restrictions. The AW3D DSM data used for ground truth generation were provided by NTT DATA Corporation and the Remote Sensing Technology Center of Japan (RESTEC) and are subject to commercial licensing. The custom program code developed in this study is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

DKM	Deep Kernelized Dense Geometric Matching
DSM	Digital Surface Model
GCP	Ground Control Point
InSAR	Interferometric Synthetic Aperture Radar
LiDAR	Light Detection and Ranging
NICT	National Institute of Information and Communications Technology
POC	Phase-Only Correlation
RESTEC	Remote Sensing Technology Center of Japan
RoMa	Robust Dense Feature Matching
SAR	Synthetic Aperture Radar

References

1. Richards, J.A. Remote Sensing with Imaging Radar; Springer: Berlin/Heidelberg, Germany, 2009.
2. Kerle, N. Encyclopedia of Natural Hazards; Springer: Dordrecht, The Netherlands, 2013.
3. Lillesand, T.M.; Kiefer, R.W.; Chipman, J.W. Remote Sensing and Image Interpretation; John Wiley & Sons: New York, NY, USA, 2015.
4. Rosen, P.; Hensley, S.; Joughin, I.; Li, F.; Madsen, S.; Rodriguez, E.; Goldstein, R. Synthetic aperture radar interferometry. Proc. IEEE 2000, 88, 333–382.
5. Toutin, T.; Gray, L. State-of-the-art of elevation extraction from satellite SAR data. ISPRS J. Photogramm. Remote Sens. 2000, 55, 13–33.
6. Leberl, F.W. Radargrammetric Image Processing; Artech House: Norwood, MA, USA, 1990.
7. Raggam, H.; Gutjahr, K.; Perko, R.; Schardt, M. Assessment of the Stereo-Radargrammetric Mapping Potential of TerraSAR-X Multibeam Spotlight Data. IEEE Trans. Geosci. Remote Sens. 2010, 48, 971–977.
8. Maruki, D.; Sakai, S.; Ito, K.; Aoki, T.; Uemoto, J.; Uratsuka, S. Stereo radargrammetry using airborne SAR images without GCP. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3585–3589.
9. Insfran, K.; Ito, K.; Aoki, T. Accurate 3D measurement from two SAR images without prior knowledge of scene. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4814–4817.
10. Takita, K.; Muquit, M.; Aoki, T.; Higuchi, T. A sub-pixel correspondence search technique for computer vision applications. IEICE Trans. Fundam. 2004, E87-A, 1913–1923.
11. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; The MIT Press: Cambridge, MA, USA, 2016.
12. Edstedt, J.; Sun, Q.; Bökman, G.; Wadenbäck, M.; Felsberg, M. RoMa: Robust dense feature matching. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19790–19800.
13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 23–26 June 2021.
14. Li, H.L.; Chen, S.W. Polyhedral Corner Reflectors Multidomain Joint Characterization With Fully Polarimetric Radar. IEEE Trans. Antennas Propag. 2025, 73, 10679–10693.
15. Li, X.; Liu, L.; Wan, G.; Zheng, F.; Guo, S.; Sun, G.; Wang, Z.; Liu, X. Physics-Driven SAR Target Detection: A Review and Perspective. Remote Sens. 2026, 18, 200.
16. Li, H.L.; Chen, S.W. General Polarimetric Correlation Pattern: A Visualization and Characterization Tool for Target Joint-Domain Scattering Mechanisms Investigation. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5200417.
17. Sasayama, T.; Ito, S.; Ito, K.; Aoki, T. Stereo Radargrammetry Using Deep Learning from Airborne SAR Images. In Proceedings of the 2025 IEEE International Geoscience and Remote Sensing Symposium, Brisbane, Australia, 3–8 August 2025; pp. 7904–7908.
18. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. arXiv 2024, arXiv:2304.07193.
19. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556.
20. Edstedt, J.; Athanasiadis, I.; Wadenbäck, M.; Felsberg, M. DKM: Dense kernelized feature matching for geometry estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17765–17775.
21. Nadai, A.; Uratsuka, S.; Umehara, T.; Matsuoka, T.; Satake, M. Development of X-band airborne polarimetric and interferometric SAR with submeter spatial resolution. In Proceedings of the 2009 IEEE International Geoscience and Remote Sensing, Cape Town, South Africa, 12–17 July 2009; Volume 2, pp. 913–916.
22. Takaku, J.; Tadono, T.; Tsutsui, K.; Ichikawa, M. Validation of ‘AW3D’ global DSM generated from ALOS PRISM. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, III-4, 25–31.
23. Li, Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050.
24. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019; pp. 1–19.
25. Jannati, H.; Zoej, H.J.V.; Ghaderpour, E.; Mazzanti, P. Dense Matching with Low Computational Complexity for Disparity Estimation in the Radargrammetric Approach of SAR Intensity Images. Remote Sens. 2025, 17, 2693.
26. Dong, Y.; Li, Y.; Zhao, J.; Sun, Y.; Liao, M. Deep Learning for Radargrammetric DSM Generation: A StereoSAR Dataset and Multiscale Fusion Network. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5205115.

Figure 1. Geometry of the SAR projection model. (a) SAR image coordinate system. (b) 3D radar coordinate system indicating the flight direction (azimuth) and observation area. (c) Projection model in the range–elevation (X-Z) plane relating the antenna position and the ground point M.

Figure 2. Overview of the SAR image dataset construction. The process mainly consists of two steps: (i) creating dense elevation maps corresponding to the SAR images using a DSM and metadata, and (ii) extracting aligned patch pairs suitable for training deep learning models.

Figure 3. Overview of the patch extraction process. To ensure the exact same physical regions are cropped, the source image is temporarily aligned with the reference image (ii-1), and a common grid is placed (ii-2). The grid coordinates are then inversely transformed back to the original source image geometry (ii-3) before cropping the patch pairs (ii-4).

Figure 4. Overall pipeline of the proposed 3D measurement method. The input SAR images are divided into patches, and dense correspondences are estimated using the fine-tuned RoMa. After filtering out low-confidence matches, the 3D coordinates are calculated using the SAR projection model to generate the final 3D point cloud.

Figure 6. Qualitative comparison of disparity error maps across three test patch pairs, (a) Patch 1, (b) Patch 2, and (c) Patch 3, generated by the conventional POC method, RoMa (M), and the proposed method (RoMa (S)). For each patch pair, the upper and lower maps represent the errors for the reference and source patches, respectively. Color indicates the magnitude of the displacement error in meters and gray regions indicate pixels where no valid correspondences were obtained.

Figure 7. Three pairs of SAR images used for the evaluation of 3D measurement performance: (a) Area 1, (b) Area 2, and (c) Area 3.

Figure 8. Cumulative percentage of reconstructed 3D points plotted against various elevation error thresholds.

Figure 9. Qualitative comparison of 3D measurements for “Area 1” generated by the conventional POC method, RoMa (M), and the proposed method (RoMa (S)). For each area, the upper and lower maps represent the generated elevation map and the corresponding error map, respectively.

Figure 10. Qualitative comparison of 3D measurements for “Area 2” generated by the conventional POC method, RoMa (M), and the proposed method (RoMa (S)). For each area, the upper and lower maps represent the generated elevation map and the corresponding error map, respectively.

Figure 11. Qualitative comparison of 3D measurements for “Area 3” generated by the conventional POC method, RoMa (M), and the proposed method (RoMa (S)). For each area, the upper and lower maps represent the generated elevation map and the corresponding error map, respectively.

Table 1. Details of the constructed SAR image datasets for fine-tuning and evaluation.

Method	Patch Size	Training Pairs	Validation Pairs	Test Pairs	Total Pairs
DKM [20]	$384 \times 512$	3001	1019	372	4392
RoMa [12]	$560 \times 560$	1656	552	207	2415

Table 2. Quantitative evaluation of SAR image matching. The table reports the accuracy [%] ↑ at various pixel error thresholds. Bold values indicate the best performance.

Matching Method	1 pixel	3 pixels	5 pixels	10 pixels
POC [9]	0.05	4.71	14.54	38.97
DKM (M) [20]	0.30	2.46	9.50	47.12
DKM (S)	16.73	46.06	62.10	79.54
RoMa (M) [12]	0.08	1.73	11.73	60.30
Proposed (RoMa (S))	15.93	48.13	65.08	82.86

Table 3. Quantitative comparison of 3D reconstruction accuracy across the three test areas. The table reports the mean error [m] ↓ and standard deviation [m] ↓. Values in parentheses indicate the percentage of 3D points with an absolute error of less than 2 m [%] ↑. Bold values indicate the best performance.

Test Area	POC [9]	DKM (M) [20]	DKM (S)	RoMa (M) [12]	Proposed (RoMa (S))
Area 1	0.90 ± 21.2 (21.1)	−5.33 ± 40.0 (16.2)	−9.83 ± 8.4 (12.5)	14.40 ± 73.3 (16.2)	−1.24 ± 10.0 (42.2)
Area 2	−0.56 ± 9.9 (47.9)	−5.59 ± 19.9 (36.1)	−5.96 ± 7.0 (19.6)	2.15 ± 20.8 (56.5)	−1.62 ± 6.5 (60.5)
Area 3	−0.14 ± 12.1 (47.1)	−4.50 ± 32.6 (33.6)	−5.60 ± 5.6 (25.0)	2.37 ± 28.9 (66.5)	−1.56 ± 4.3 (74.1)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ito, K.; Sasayama, T.; Ito, S.; Iwasa, H.; Aoki, T.; Uemoto, J. Stereo Radargrammetry Using Deep Learning-Based Image Matching with Fine-Tuned Model on Synthetic Aperture Radar Images. Remote Sens. 2026, 18, 1662. https://doi.org/10.3390/rs18101662
