Article

Cross-Context Aggregation for Multi-View Urban Scene and Building Facade Matching

1 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
2 School of Automation, Southeast University, Nanjing 210096, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(11), 425; https://doi.org/10.3390/ijgi14110425
Submission received: 2 September 2025 / Revised: 21 October 2025 / Accepted: 28 October 2025 / Published: 31 October 2025

Abstract

Accurate and robust feature matching across multi-view urban imagery is fundamental for urban mapping, 3D reconstruction, and large-scale spatial alignment. Real-world urban scenes involve significant variations in viewpoint, illumination, and occlusion, as well as repetitive architectural patterns that make correspondence estimation challenging. To address these issues, we propose the Cross-Context Aggregation Matcher (CCAM), a detector-free framework that jointly leverages multi-scale local features, long-range contextual information, and geometric priors to produce spatially consistent matches. Specifically, CCAM integrates a multi-scale local enhancement branch with a parallel self- and cross-attention Transformer, enabling the model to preserve detailed local structures while maintaining a coherent global context. In addition, an independent positional encoding scheme is introduced to strengthen geometric reasoning in repetitive or low-texture regions. Extensive experiments demonstrate that CCAM outperforms state-of-the-art methods, achieving up to +31.8%, +19.1%, and +11.5% improvements in AUC@{5°, 10°, 20°} over detector-based approaches and 1.72% higher precision than the strongest detector-free counterpart. These results confirm that CCAM delivers reliable and spatially coherent matches, thereby facilitating downstream geospatial applications.

1. Introduction

Feature matching is a cornerstone of geospatial computer vision and supports a wide range of applications in geoinformation science, including multi-view 3D city reconstruction [1], urban mapping [2], geospatial localization [3], and urban change detection [4]. Multi-view urban scene and building facade imagery, obtained from ground-level cameras, oblique aerial platforms, or vehicle-mounted systems, captures the structural and visual diversity of urban environments. Accurate and robust correspondence estimation enables the alignment of heterogeneous imagery, supports photogrammetric workflows and integration with geographic information system databases, and facilitates downstream tasks such as digital twin construction, map updating, and large-scale spatial data alignment, which are critical for smart city development and sustainable urban management.
Achieving robust feature matching in urban environments remains a challenging problem. Repetitive architectural textures, textureless surfaces, dynamic occlusions, and large variations in viewpoint and illumination frequently cause feature detectors to produce unstable or ambiguous matches. These factors reduce feature repeatability and lead to geometric inconsistencies across multi-view images, which limits the effectiveness of conventional pipelines in real-world geospatial applications.
Traditional keypoint-based methods, such as SIFT [5] and SURF [6], detect a sparse set of distinctive keypoints and compute compact descriptors for each one. These methods are computationally efficient and exhibit strong invariance to scale and moderate viewpoint or illumination changes, making them well suited for large-scale or real-time tasks. However, their dependence on sparse keypoints restricts their performance in regions lacking stable corners or characterized by highly repetitive patterns, such as glass facades and uniform walls. To overcome this limitation, detector-free approaches have emerged as a promising alternative. Instead of relying on explicit keypoint detection, these methods generate dense feature representations and predict correspondences directly from image-level features. CNN-based models such as NCNet [7] and DRC-Net [8] perform dense matching through local convolutional features, while Transformer-based frameworks [9,10] leverage attention mechanisms to model long-range dependencies. Although these methods achieve higher match coverage and robustness under extreme viewpoint and illumination changes, they often involve higher computational costs and may lose fine-grained local texture information.
Recent studies have attempted to combine the advantages of convolutional and attention-based architectures. CNNs effectively capture local structural and texture details but have limited receptive fields, restricting their ability to model global scene contexts. Transformer-based models, in contrast, capture global dependencies but tend to have weaker spatial precision due to positional encoding degradation and reduced sensitivity to local features in deeper attention layers. Consequently, a unified framework that simultaneously preserves fine-grained textures, models global contexts, and maintains geometric consistency remains an open research problem.
To overcome these challenges, we propose the Cross-Context Aggregation Matcher (CCAM) for multi-view urban scene and building facade imagery. Here, the term “cross-context” denotes the aggregation of contextual representations across different spatial regions and scales within the same visible (RGB) image. Specifically, CCAM aggregates three complementary feature representations: (1) local texture features, obtained via a feature pyramid network (FPN) and enhanced by channel reorganization and multi-scale convolutional kernels to preserve fine-grained details across scales; (2) cross-region contextual representations, captured by a Transformer module that performs self-attention and cross-attention in parallel to model long-range dependencies among distant facade elements and scene regions; and (3) geometric structure features, derived from an independent positional encoding scheme that encodes relative spatial relationships to strengthen the geometric consistency. By fusing appearance, cross-region contexts, and geometry, CCAM produces robust and spatially coherent correspondences. As shown in Figure 1, the existing detector-free method LoFTR [9] often produces mismatches in repetitive or texture-sparse regions, highlighting a critical gap in contextual reasoning for correspondence establishment. Our CCAM is designed to bridge this gap by aggregating multi-scale contextual cues across both spatial and semantic dimensions, leading to more accurate and stable matches in such challenging areas.
In summary, the contributions of this work are as follows:
  • A detector-free framework that integrates multi-scale local representations, cross-region contextual aggregation, and geometric priors, enhancing the robustness and spatial consistency in urban feature matching;
  • A lightweight channel feature reorganization module that selectively emphasizes informative channels while suppressing redundant features;
  • A novel attention design that combines parallel self- and cross-attention with independent positional encoding to retain spatial information through deep attention layers;
  • Extensive experiments on challenging urban scenes that demonstrate the effectiveness and competitiveness of the proposed approach.
The rest of this paper is organized as follows. Section 2 provides a comprehensive review of the related literature. Section 3 details the proposed CCAM framework. Section 4 presents the experimental results and analysis. Finally, the conclusions are drawn in Section 5.

2. Related Work

2.1. Detector-Based Feature Matching

Detector-based methods have long served as a cornerstone in urban photogrammetry, supporting applications such as building facade reconstruction, urban change detection, and multi-view city modeling. Classical pipelines in this paradigm are typically divided into three stages: feature detection, description, and matching. Before the advent of deep learning, researchers achieved notable results using hand-crafted detectors and descriptors. Well-established features such as ORB [11] and SIFT [5] were frequently used due to their robustness to scale and orientation variations. However, their effectiveness declines for repetitive or low-texture surfaces.
Learning-based detectors improved this landscape. D2-Net [12] jointly detects and describes using convolutional feature maps, improving the robustness to repetitive patterns. SuperPoint [13] further advanced the field with a self-supervised framework that jointly learns interest point detection and descriptor extraction, pre-training on synthetic data and fine-tuning with real images. RDLNet [14] addresses hard sample optimization through a triplet-based mining strategy to identify challenging negative samples within mini-batches. It introduces batch margin regularization to explicitly constrain distances between extreme cases and employs orthogonal regularization on convolutional kernels to stabilize training and enrich the feature diversity. Xiao et al. proposed a novel real-time method for UAV-based agricultural photogrammetry [15], addressing low-overlap and repetitive texture challenges through adaptive initialization, robust tracking, and hybrid optimization. He et al. proposed an illumination-invariant multi-view matching framework to address cross-daylight and wide-baseline challenges in urban perception [16]. By leveraging semantic–geometric integration and topology-guided feature propagation, it achieves robust correspondences in complex street-view environments.
Transformer-based models such as SuperGlue [17] model feature matching as a partial assignment with attention-based graph learning, effectively handling occlusion from vegetation or vehicles. To alleviate the computational costs, ParaFormer [18] introduces a parallel attention mechanism that synchronously performs self- and cross-attention and integrates a wave-based position encoder to dynamically fuse descriptor and positional information.

2.2. Detector-Free Feature Matching

Detector-free approaches bypass explicit keypoint detection to compute dense matches, which is particularly beneficial in urban contexts where keypoint extraction is unreliable.
CNN-based detector-free methods. Early detector-free work used convolutional correlation and learned neighborhood consensus to produce dense matches. NCNet [7] introduced 4D convolutional correlation for exhaustive dense matching, suitable for aligning UAV and ground imagery. DRC-Net [8] proposes a dual-resolution architecture that establishes dense correspondences in a coarse-to-fine manner by extracting hierarchical feature maps with an FPN-like backbone. Here, a 4D correlation tensor generated from coarse features is refined through learnable neighborhood consensus, guiding fine-resolution matching while concentrating computation on high-confidence regions to reduce the memory overhead. These methods are effective in capturing local patch-level structures and can support coarse-to-fine refinement, but their limited effective receptive fields restrict the incorporation of long-range contexts, which is often necessary to disambiguate repetitive facade patterns or to reason about the broader scene geometry.
Transformer-based detector-free methods. Recent works have increasingly adopted Transformer-based architectures [19,20] to model long-range dependencies in feature matching. LoFTR [9] replaces detectors with Transformer layers to robustly match in low-texture and extreme viewpoint scenarios. ETQ-Matcher [10] was proposed as an efficient quadtree attention-guided Transformer for detector-free aerial–ground image matching. While ETQ-Matcher shares the general paradigm of Transformer-based matching with LoFTR [9], its design primarily targets cross-scale and cross-modality alignment between aerial and ground views. In contrast, our work addresses multi-view urban scenes and building facades, where repetitive textures, weak local patterns, and geometric consistency are the major challenges. Given the significant differences in application domains, a direct comparison with ETQ-Matcher would not provide a fair assessment. Therefore, LoFTR and other widely adopted detector-free frameworks were chosen as baselines in our evaluation. In addition, ASTR [21] ensures local consistency via a spot-guided attention mechanism and handles scale variations through adaptive grid scaling. The spot-guided module restricts feature aggregation to high-confidence regions, reducing interference from irrelevant areas, while the adaptive scaling dynamically adjusts the grid sizes using depth estimates from coarse matches. RoMa [22] enriches coarse semantic features via foundation models and then refines the matches through convolution for high localization accuracy. MSFormer [23] enhances feature matching by combining convolution-augmented Transformers with a multi-scale attention design and a neighborhood consensus mechanism, enabling more reliable correspondences. CSFormer [24] introduces a cross-scale feature extraction framework that integrates Transformer-based global representations with convolution-based local semantics. By reorganizing multi-scale channel features at the semantic perception level, CSFormer effectively enhanced the local distinctiveness and improved the correspondence reliability in both indoor and outdoor pose estimation benchmarks. To address the common issues of one-to-many correspondences and mismatches in multi-modal remote sensing image matching, MICM [25] employs multi-scale invariant feature extraction together with a feature consistency correlation strategy. These approaches demonstrate strong robustness to variations in viewpoint and illumination. Nevertheless, relying solely on Transformers can result in the loss of fine-grained architectural details. Moreover, positional encodings may degrade across deep attention layers, potentially reducing the spatial consistency.
Table 1 provides a side-by-side comparison of key detector-based and detector-free methods in urban imagery applications. Detector-based approaches are efficient when distinctive, repeatable keypoints can be detected on structural edges, rooflines, or building corners, but they struggle in textureless or repetitive regions and may fail under large viewpoint changes. Detector-free methods overcome the repeatability requirement by generating dense correspondences, which increases the coverage in low-texture regions and supports more complete 3D reconstructions. However, grid-based dense sampling may produce ambiguities in cases of large-scale variation, and many methods still underutilize geometric priors, which are critical in ensuring reliable correspondences in complex urban layouts.

3. Methodology

3.1. Framework

The overall architecture of the proposed CCAM for multi-view urban scene and building facade matching is shown in Figure 2. Given a pair of images acquired from different vantage points within an urban environment, the method first employs a frozen feature pyramid network (FPN) [26] backbone to extract multi-scale deep feature maps. The extracted features are then processed in two complementary branches. The first branch, the parallel attention Transformer, jointly captures intra-view and inter-view contextual dependencies by performing self-attention and cross-attention in parallel. An independent positional encoding is employed to incorporate geometric priors, which are particularly valuable in building facades characterized by strong spatial regularities, such as repetitive window arrangements and consistent vertical–horizontal alignments. The second branch, the local feature enhancement module, focuses on preserving fine-grained local details, which are crucial in establishing accurate correspondences between facade elements and in aligning small-scale urban structures. It comprises a channel feature reorganization submodule, which explicitly models inter-channel dependencies to highlight semantically relevant channels, and a cross-scale feature aggregation submodule, which integrates contextual cues from receptive fields of multiple sizes to improve the robustness against scale variations between viewpoints. The outputs of both branches are fused into a unified representation that is subsequently processed by a matching layer and a refinement module to yield the final correspondence estimates.

3.2. Backbone Feature Extraction

Motivated by the translation equivariance and locality of convolutional neural networks, we employ a standard CNN with an FPN backbone [26] to extract initial multi-scale features. The backbone is initialized with pre-trained weights and then frozen to reduce both the number of trainable parameters and the overall training time. Given two images $I^A$ and $I^B$, the FPN produces feature maps at multiple resolutions. We use $\tilde{F}^A$ and $\tilde{F}^B$ to denote the coarse features at $1/8$ of the original image size and $\hat{F}^A$ and $\hat{F}^B$ to denote the fine features at $1/2$ of the original image size.
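To make this stage concrete, the following is a minimal PyTorch sketch of a frozen ResNet-FPN front end that returns coarse (1/8) and fine (1/2) feature maps. The ResNet-18 depth, channel widths, and 256-dimensional FPN output are illustrative assumptions; the exact backbone configuration is not prescribed here.

```python
import torch
import torch.nn as nn
import torchvision
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork

class FrozenFPNBackbone(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")     # pre-trained weights
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu)  # 1/2 resolution, 64 ch
        self.layer1 = nn.Sequential(resnet.maxpool, resnet.layer1)        # 1/4 resolution, 64 ch
        self.layer2 = resnet.layer2                                       # 1/8 resolution, 128 ch
        self.fpn = FeaturePyramidNetwork([64, 64, 128], out_channels)
        for p in self.parameters():            # freeze: the backbone contributes no trainable parameters
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, img):
        c1 = self.stem(img)
        c2 = self.layer1(c1)
        c3 = self.layer2(c2)
        feats = self.fpn(OrderedDict([("fine", c1), ("mid", c2), ("coarse", c3)]))
        return feats["coarse"], feats["fine"]  # coarse (1/8) and fine (1/2) maps

coarse, fine = FrozenFPNBackbone()(torch.randn(1, 3, 480, 640))  # (1,256,60,80) and (1,256,240,320)
```

Because the backbone is frozen, only the subsequent attention and enhancement modules contribute trainable parameters, which is the efficiency argument made above.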

3.3. Self- and Cross-Attention for Feature Aggregation

Within the global feature representation module, self-attention and cross-attention mechanisms are integrated in parallel to promote the exchange of feature information both within an individual image and between different views of the same urban scene. As described in our previous conference paper [18], input features are first linearly projected to obtain queries (Q), keys (K), and values (V). In the self-attention branch, the standard attention is computed as
$$\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V,$$
where Q, K, and V are derived from the same image. In contrast, the cross-attention branch processes features from different images. Specifically, the attention scores are computed as
$$\mathrm{softmax}\!\left(\frac{Q_{B}K_{A}^{T}}{\sqrt{d}}\right)V_{A}$$
for image A and
$$\mathrm{softmax}\!\left(\frac{Q_{A}K_{B}^{T}}{\sqrt{d}}\right)V_{B}$$
for image B. To improve the model efficiency, an attention weight sharing strategy is employed by replacing $Q_{B}K_{A}^{T}$ with $\left(Q_{A}K_{B}^{T}\right)^{T}$, enabling the reuse of cross-attention computations based on $Q_A$, $V_A$, $K_B$, and $V_B$. Both attention branches operate concurrently, and their outputs are fused through a two-layer multilayer perceptron (MLP). This parallel attention framework not only enables effective feature integration via learnable fusion but also reduces the number of redundant parameters and computational overhead.
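The following sketch illustrates one such parallel attention unit in PyTorch: the cross-attention scores $Q_A K_B^T$ are computed once and their transpose is reused, mirroring the weight-sharing strategy described above. Single-head attention and the fusion MLP width are simplifications for illustration rather than the exact ParaFormer implementation.

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.scale = dim ** -0.5

    def forward(self, fa, fb):
        # fa, fb: (N_A, dim) and (N_B, dim) flattened feature maps of images A and B
        qa, ka, va = self.q(fa), self.k(fa), self.v(fa)
        qb, kb, vb = self.q(fb), self.k(fb), self.v(fb)
        # self-attention within each image
        self_a = torch.softmax(qa @ ka.transpose(-2, -1) * self.scale, dim=-1) @ va
        self_b = torch.softmax(qb @ kb.transpose(-2, -1) * self.scale, dim=-1) @ vb
        # cross-attention: Q_A K_B^T is computed once; its transpose stands in for Q_B K_A^T
        cross = qa @ kb.transpose(-2, -1) * self.scale
        cross_a = torch.softmax(cross, dim=-1) @ vb                    # message from B into A
        cross_b = torch.softmax(cross.transpose(-2, -1), dim=-1) @ va  # reused scores: message from A into B
        # parallel branches fused by a two-layer MLP
        return (self.fuse(torch.cat([self_a, cross_a], dim=-1)),
                self.fuse(torch.cat([self_b, cross_b], dim=-1)))
```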

3.4. Independent Positional Encoding

Transformer-based image matching algorithms typically embed spatial information into feature representations using a single absolute positional encoding. The original local features $\tilde{F}^A$ and $\tilde{F}^B$ are first flattened into one-dimensional vectors, after which their spatial locations are encoded based on the absolute pixel coordinates. A two-dimensional extension of the sinusoidal encoding scheme is employed to assign a unique positional descriptor to every pixel. The 2D extended sinusoidal format is defined as
$$PE_{x,y}^{n} = f(x,y)^{n} = \begin{cases} \sin(\omega_{k} \cdot x), & n = 4k,\\ \cos(\omega_{k} \cdot x), & n = 4k+1,\\ \sin(\omega_{k} \cdot y), & n = 4k+2,\\ \cos(\omega_{k} \cdot y), & n = 4k+3, \end{cases}$$
and
$$\omega_{k} = \frac{1}{10000^{2k/d}},$$
where k is an integer, n indexes the feature channels, and d is the total number of channels to which the positional encoding is applied. For each k, four values are generated, corresponding to the sine and cosine components of the x and y coordinates.
By adding the absolute positional embedding $PE$ to the flattened features $\tilde{F}^A$ and $\tilde{F}^B$, we obtain position-dependent features $\tilde{F}_P^A$ and $\tilde{F}_P^B$:
$$\tilde{F}_{P} = \mathrm{Flatten}(\tilde{F}) + PE,$$
where Flatten denotes the operation of reshaping the spatial feature map into a one-dimensional vector.
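A minimal implementation of this 2D extended sinusoidal encoding, assuming the channel dimension $d$ is divisible by 4, might look as follows; the grid size in the usage line is only an example.

```python
import torch

def sinusoidal_pe_2d(h, w, d):
    """2D extended sinusoidal positional encoding; d must be divisible by 4."""
    pe = torch.zeros(d, h, w)
    y = torch.arange(h, dtype=torch.float32).unsqueeze(1).expand(h, w)  # y coordinate per pixel
    x = torch.arange(w, dtype=torch.float32).unsqueeze(0).expand(h, w)  # x coordinate per pixel
    for k in range(d // 4):
        omega_k = 1.0 / (10000 ** (2 * k / d))   # omega_k = 1 / 10000^(2k/d)
        pe[4 * k + 0] = torch.sin(omega_k * x)
        pe[4 * k + 1] = torch.cos(omega_k * x)
        pe[4 * k + 2] = torch.sin(omega_k * y)
        pe[4 * k + 3] = torch.cos(omega_k * y)
    return pe.flatten(1).transpose(0, 1)         # (h*w, d): added to Flatten(F~)

pe = sinusoidal_pe_2d(60, 80, 256)               # e.g., the 1/8-resolution grid of a 480x640 image
```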
This position-dependent representation ensures that the features at each pixel location remain unique, enabling the model to distinguish between visually similar but spatially distinct regions. This property is particularly useful in urban imagery, where repetitive facade elements such as windows or balconies may appear identical in appearance but differ in position. However, in conventional attention mechanisms, positional cues tend to diminish after multiple layers, and embedding positions directly into feature vectors can entangle spatial and semantic information, thereby reducing the discriminative power.
To address this, we introduce an attention mechanism with independent positional encoding, as illustrated in Figure 3. Rather than merging spatial and appearance information at the input stage, the method computes similarity scores for features and positions separately and then sums them to obtain the attention score. This approach integrates positional information at every attention computation, reinforcing its contribution to the matching accuracy while preventing its dilution over successive layers. At the same time, the absolute positional component ensures that implicit relationships between features and positions are retained.
Let the inputs of the $l$-th attention layer be $f_i^{(l)}$ and $f_j^{(l)}$. For feature similarity, $f_i^{(l)}$ is projected to $K_f^{(l)}$ and $V_f^{(l)}$, and $f_j^{(l)}$ is projected to $Q_f^{(l)}$, yielding
$$\alpha_f^{(l)} = Q_f^{(l)} \left(K_f^{(l)}\right)^{T}.$$
For positional similarity, the embedding $PE$ is projected to $K_p^{(l)}$ and $Q_p^{(l)}$, producing
$$\alpha_p^{(l)} = Q_p^{(l)} \left(K_p^{(l)}\right)^{T}.$$
The final attention output is computed as
$$\mathrm{Attention}^{(l)} = \mathrm{Softmax}\!\left(\frac{\alpha_f^{(l)} + \alpha_p^{(l)}}{\sqrt{d}}\right) V_f^{(l)},$$
$$M^{(l)} = \mathrm{norm}\!\left(\mathrm{Linear}\!\left(\mathrm{Attention}^{(l)}\right)\right),$$
$$f_i^{(l+1)} = \mathrm{norm}\!\left(\mathrm{MLP}\!\left(\mathrm{concat}\!\left(f_i^{(l)}, M^{(l)}\right)\right)\right) + f_i^{(l)}.$$
In our implementation, one self-attention layer and one cross-attention layer form an attention aggregation unit, enabling the integration of both intra-image and inter-image information for matching. This aggregation unit is repeated $N_g = 4$ times. The independent positional encoding design enhances the ability of the model to capture spatial relationships between pixels, improving the robustness against repetitive patterns and viewpoint changes and ultimately increasing the matching accuracy in complex urban environments.
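The sketch below outlines one attention layer with independent positional encoding: the feature similarity $\alpha_f$ and the positional similarity $\alpha_p$ are computed with separate projections and summed before the softmax, following the equations above. The normalization placement, the MLP width, and the convention that the query stream is the one being updated are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IPEAttentionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qf, self.kf, self.vf = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.qp, self.kp = nn.Linear(dim, dim), nn.Linear(dim, dim)   # separate positional projections
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, f_q, f_kv, pe_q, pe_kv):
        # alpha_f: feature similarity; alpha_p: positional similarity, kept independent until here
        alpha_f = self.qf(f_q) @ self.kf(f_kv).transpose(-2, -1)
        alpha_p = self.qp(pe_q) @ self.kp(pe_kv).transpose(-2, -1)
        attn = torch.softmax((alpha_f + alpha_p) * self.scale, dim=-1) @ self.vf(f_kv)
        m = self.norm1(self.proj(attn))                                   # message M^(l)
        return self.norm2(self.mlp(torch.cat([f_q, m], dim=-1))) + f_q    # updated features

# self-attention: layer(f_a, f_a, pe, pe); cross-attention: layer(f_a, f_b, pe, pe)
```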

3.5. Local Feature Enhancement

CCAM captures rich local image features from two complementary perspectives by reorganizing channel features to emphasize informative content while suppressing redundancy and by incorporating parallel convolutions at multiple scales to extract hierarchical cues, thereby enabling fine-grained detail capture across varying receptive fields.
Channel Feature Reorganization. The channel reorganization at the semantic perception level draws its inspiration from SENet [27]. We employ a learnable channel attention mechanism to model inter-channel dependencies by assigning each channel a weight based on its semantic importance and using these weights to recalibrate the feature maps, thereby amplifying informative channels and suppressing less relevant ones. To better capture local spatial dependencies at the semantic perception level, the global squeeze operator in SENet is replaced with a local sliding window average pooling (SWAP) operation [28]. Given an input feature map $F \in \mathbb{R}^{H \times W \times C}$, SWAP generates a per-channel descriptor map $Z = \mathrm{SWAP}(F) \in \mathbb{R}^{H' \times W' \times C}$. Each element in $Z$ represents the mean value within a local spatial window of the corresponding channel, with a window size of $k = 16$ and a stride of $s = 8$. The spatial resolution of the descriptor map is computed as
$$H' = \frac{H - k}{s} + 1, \qquad W' = \frac{W - k}{s} + 1.$$
We then use an MLP followed by a Sigmoid function to remap the per-channel descriptor map $Z$, and the result is expanded to the input size to serve as channel weights that reorganize the input features. Finally, a $3 \times 3$ convolution is applied to the reweighted features for smoothing. These operations can be written as
$$\omega = \mathrm{Expand}\!\left(\mathrm{Sigmoid}\!\left(\mathrm{MLP}(Z)\right)\right), \qquad f_{channel} = \omega \odot f_i, \qquad f_{CA} = \mathrm{Conv}_3(f_{channel}),$$
where $\mathrm{Conv}_3$ denotes a $3 \times 3$ convolution, $\odot$ denotes element-wise multiplication, and $f_i$ denotes the input feature.
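A compact PyTorch sketch of this channel reorganization is given below; the SWAP operator is realized with average pooling (k = 16, s = 8), the channel MLP is implemented with 1 x 1 convolutions, and nearest-neighbor interpolation stands in for the Expand operation. The reduction ratio is an assumed hyperparameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelReorganization(nn.Module):
    def __init__(self, channels, k=16, s=8, reduction=4):
        super().__init__()
        self.swap = nn.AvgPool2d(kernel_size=k, stride=s)            # local squeeze: H'=(H-k)/s+1
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1))
        self.smooth = nn.Conv2d(channels, channels, 3, padding=1)    # 3x3 smoothing convolution

    def forward(self, f):                                            # f: (B, C, H, W)
        z = self.swap(f)                                             # per-channel descriptor map Z
        w = torch.sigmoid(self.mlp(z))                               # remapped channel weights
        w = F.interpolate(w, size=f.shape[-2:], mode="nearest")      # Expand to the input size
        return self.smooth(w * f)                                    # f_CA
```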
Cross-Scale Feature Aggregation. Cross-scale feature extraction is largely inspired by the human visual perception system. To fuse local feature information under different receptive fields, we apply parallel convolutions with gradually increasing kernel sizes; in our experiments, the kernel sizes are set as in [27]. A two-layer MLP then processes the aggregated cross-scale feature. The cross-scale interaction and feature aggregation operations are as follows:
$$f' = \mathrm{concat}\!\left(\mathrm{Conv}_3(f_i), \mathrm{Conv}_5(f_i), \mathrm{Conv}_7(f_i)\right), \qquad f_{CS} = \mathrm{MLP}\!\left(\mathrm{concat}(f_i, f')\right),$$
where $\mathrm{Conv}_n$ denotes a convolution with kernel size $n \times n$, and $f_i$ denotes the input feature.
After obtaining these features, we concatenate $f_{CA}$ and $f_{CS}$ to form the enhanced feature of one cross-scale local feature description. This operation can be represented as
$$f_{local} = \mathrm{Conv}_3\!\left(\mathrm{Norm}\!\left(\mathrm{concat}(f_{CA}, f_{CS})\right)\right),$$
where $\mathrm{Norm}$ denotes instance normalization, and $\mathrm{Conv}_3$ again serves as a smoothing convolution. To fully extract the hierarchical cues of the image and progressively enhance its local features and hierarchical representation, the cross-scale local feature description is applied $N_c$ times, as sketched below.
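The following sketch summarizes the cross-scale aggregation and its fusion with the channel branch; the 1 x 1-convolution MLP and the output channel width of the smoothing convolution are illustrative choices rather than the exact configuration.

```python
import torch
import torch.nn as nn

class CrossScaleAggregation(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.conv7 = nn.Conv2d(channels, channels, 7, padding=3)
        # two-layer MLP over channels, implemented with 1x1 convolutions
        self.mlp = nn.Sequential(nn.Conv2d(4 * channels, channels, 1), nn.ReLU(),
                                 nn.Conv2d(channels, channels, 1))

    def forward(self, f):                                  # f: (B, C, H, W)
        f_prime = torch.cat([self.conv3(f), self.conv5(f), self.conv7(f)], dim=1)
        return self.mlp(torch.cat([f, f_prime], dim=1))    # f_CS

class LocalFusion(nn.Module):
    """Fuses the channel branch output f_CA with the cross-scale output f_CS into f_local."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(2 * channels)
        self.smooth = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_ca, f_cs):
        return self.smooth(self.norm(torch.cat([f_ca, f_cs], dim=1)))   # f_local
```

In this sketch the local enhancement would be stacked $N_c$ times by feeding f_local back in as the next input, matching the repeated application described above.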

3.6. Integration and Description

As shown in Figure 2, we collect the cross-scale local feature and the global feature produced by the Transformer. The cross-scale local feature is first flattened into a 1D vector, matching the form of the global feature, before aggregation. The two features are then concatenated to provide a richer descriptor for image matching. This operation is
$$f = \mathrm{concat}\!\left(f_{global}, \mathrm{Flatten}(f_{local})\right),$$
where $f$ denotes the enhanced feature descriptors of the images, and $\mathrm{Flatten}$ denotes the operation of flattening the cross-scale local feature into a 1D vector.

3.7. Matching Layer

After feature aggregation and description, the enhanced feature descriptors of images $I^A$ and $I^B$ are represented by $F^A$ and $F^B$. The cost matrix of the optimal transport operation is computed from the inner product of $F^A$ and $F^B$, yielding the score matrix $S$:
$$S_{ij} = \left\langle F_i^A, F_j^B \right\rangle, \qquad \forall (i,j) \in A \times B.$$
The correspondences between images $I^A$ and $I^B$ are then solved by optimal transport with the score matrix $S$. The optimal partial assignment matrix $P$ can be efficiently computed using the Sinkhorn algorithm [29], which iteratively normalizes the score matrix along rows and columns. After $T$ iterations, we obtain the optimal partial assignment matrix $P \in [0,1]^{M \times N}$. Finally, matches whose confidence $P_{ij}$ falls below the matching threshold $\theta$ are eliminated, and the mutual nearest neighbor (MNN) criterion is applied to select the final matches $\mathcal{M}$, which helps filter out outliers. The match prediction is
$$\mathcal{M} = \left\{ (i,j) \;\middle|\; (i,j) \in \mathrm{MNN}(P),\; P_{ij} \geq \theta \right\}.$$
This matching strategy effectively leverages both global and local cues to establish geometrically and semantically consistent correspondences, which is essential for multi-view geospatial data alignment.
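A simplified sketch of this matching layer is shown below: the score matrix is built from inner products, a log-domain Sinkhorn loop approximates the optimal transport assignment, and mutual nearest neighbors above a threshold are retained. The temperature, iteration count, and threshold values are assumptions, and the handling of unmatched points (dustbin rows and columns) is omitted.

```python
import torch

def sinkhorn(scores, iters=100, eps=0.1):
    """Log-domain Sinkhorn normalization of exp(scores / eps) along rows and columns."""
    log_p = scores / eps
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)   # row normalization
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)   # column normalization
    return log_p.exp()

def match(feat_a, feat_b, theta=0.2):
    scores = feat_a @ feat_b.t()                 # S_ij = <F_i^A, F_j^B>
    p = sinkhorn(scores)                         # soft partial assignment matrix P
    # mutual nearest neighbor mask: (i, j) kept only if each is the other's best match
    mutual = (p.argmax(1)[:, None] == torch.arange(p.shape[1])) & \
             (p.argmax(0)[None, :] == torch.arange(p.shape[0])[:, None])
    keep = mutual & (p >= theta)                 # confidence threshold theta
    return keep.nonzero(as_tuple=False)          # tensor of (i, j) index pairs
```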

4. Experiments

4.1. Implementation Details

We implemented the proposed CCAM model using PyTorch 2.1.0. The network was trained for 30 epochs with the Adam optimizer, employing a cosine decay learning rate schedule and a 3-epoch linear warm-up. The initial learning rate was set to 0.008. All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3060 GPU (12 GB VRAM), 32 GB of system RAM, and an Intel Core i7-10700K CPU. We evaluated the performance of CCAM on three feature matching tasks: outdoor pose estimation, indoor pose estimation, and visual localization. The evaluation includes both quantitative and qualitative comparisons against several state-of-the-art methods on publicly available benchmark datasets. To ensure a fair comparison, all methods were trained and tested under identical hardware and experimental conditions.
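For reference, the optimization schedule described above can be expressed as the following sketch, assuming the scheduler is stepped once per epoch; the model and data pipeline are placeholders.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, WARMUP = 30, 3          # total epochs and linear warm-up length, as stated above

def make_optimizer_and_scheduler(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.008)   # initial learning rate 0.008
    def lr_lambda(epoch):
        if epoch < WARMUP:                                       # 3-epoch linear warm-up
            return (epoch + 1) / WARMUP
        progress = (epoch - WARMUP) / max(1, EPOCHS - WARMUP)
        return 0.5 * (1.0 + math.cos(math.pi * progress))        # cosine decay afterwards
    return optimizer, LambdaLR(optimizer, lr_lambda)
```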

4.2. Pose Estimation for Urban Scenes

Datasets. Camera pose estimation is a typical and important application of local feature matching. In this experiment, the proposed method is trained and evaluated on the MegaDepth dataset [30], which contains one million internet images of 196 different urban scenes and building facades. The dataset includes sparse reconstructions generated with COLMAP [31] and depth maps computed through multi-view stereo. MegaDepth poses various challenges, including diverse illumination conditions, changes in viewpoint, and occlusion within urban scenes. To guarantee a fair comparison, we adopt the same test sets and evaluation protocols as existing works [9]. Images are resized to 640 and 720 pixels during training and to 1024 pixels during testing.
Baseline. We take some detector-based methods and detector-free methods for comparison. In particular, the typical nearest neighbor matching method and the state-of-the-art SuperGlue [17] are taken for comparison in cooperation with SuperPoint [13], which is employed as a feature detector. Meanwhile, the state-of-the-art detector-free methods MSFormer [23], DRC-Net [8], and LoFTR [9] are also compared with the proposed method.
Metrics. Following previous works [9,17], we report the area under the curve (AUC) of the pose accuracy curve and the matching precision.
Specifically, for each image pair, we classify the estimated camera pose as correct if its pose error $\Delta T$ is below a given angular threshold $\tau \in [0^\circ, 180^\circ]$. We then compute the pose accuracy, i.e., the proportion of correctly estimated poses as a function of $\tau$, and integrate this curve up to a specified maximum threshold of {5°, 10°, 20°}. The resulting AUC value represents the normalized area under this cumulative accuracy curve.
The pose error $\Delta T$ is defined as the maximum of the rotation error $\Delta R$ and the translation error $\Delta t$:
$$\Delta t = \arccos\!\left(\frac{t_e \cdot t}{\lVert t_e \rVert_2 \, \lVert t \rVert_2}\right),$$
$$\Delta R = \arccos\!\left(\frac{\mathrm{tr}\!\left(R_e^{T} R\right) - 1}{2}\right),$$
$$\Delta T = \max\!\left(\Delta t, \Delta R\right),$$
where $t_e$ and $R_e$ denote the estimated translation vector and rotation matrix, $t$ and $R$ are the corresponding ground truth values, $\lVert \cdot \rVert_2$ is the Euclidean norm, and $\mathrm{tr}(\cdot)$ is the trace operator.
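The pose error and its AUC can be computed as in the following NumPy sketch, which clips the arccosine arguments for numerical safety and integrates the cumulative accuracy curve up to each threshold in the standard way used by prior matching work.

```python
import numpy as np

def pose_error(R_est, t_est, R_gt, t_gt):
    """Maximum of translation-direction and rotation angular errors, in degrees."""
    cos_t = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    dt = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))        # translation error
    cos_r = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    dr = np.degrees(np.arccos(np.clip(cos_r, -1.0, 1.0)))        # rotation error
    return max(dt, dr)                                           # Delta T

def pose_auc(errors, thresholds=(5, 10, 20)):
    """Normalized area under the cumulative pose-accuracy curve up to each threshold."""
    errors = np.sort(np.asarray(errors))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    aucs = []
    for tau in thresholds:
        mask = errors <= tau
        e = np.concatenate([[0.0], errors[mask], [tau]])
        r = np.concatenate([[0.0], recall[mask], [recall[mask][-1] if mask.any() else 0.0]])
        aucs.append(np.trapz(r, x=e) / tau)
    return aucs
```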
Matching precision measures the fraction of correct matches among all predicted matches. Formally, it is defined as
$$\mathrm{Precision} = \frac{\left| \mathcal{M}_{\mathrm{correct}} \right|}{\left| \mathcal{M}_{\mathrm{predicted}} \right|},$$
where $\mathcal{M}_{\mathrm{predicted}}$ is the set of all correspondence candidates, and $\mathcal{M}_{\mathrm{correct}}$ is the subset of predicted correspondences whose epipolar error falls below a predefined threshold of $1 \times 10^{-4}$.
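Assuming the epipolar error is the symmetric epipolar distance of normalized keypoint coordinates with respect to the essential matrix, as in common detector-free evaluations, the precision can be computed as in the following sketch.

```python
import numpy as np

def symmetric_epipolar_distance(pts0, pts1, E):
    """pts0, pts1: (N, 2) normalized image coordinates; E: (3, 3) essential matrix."""
    p0 = np.concatenate([pts0, np.ones((len(pts0), 1))], axis=1)   # homogeneous points, image 0
    p1 = np.concatenate([pts1, np.ones((len(pts1), 1))], axis=1)   # homogeneous points, image 1
    Ep0 = p0 @ E.T                                                 # epipolar lines in image 1
    Etp1 = p1 @ E                                                  # epipolar lines in image 0
    d = np.sum(p1 * Ep0, axis=1) ** 2                              # (p1^T E p0)^2
    return d * (1.0 / (Ep0[:, 0] ** 2 + Ep0[:, 1] ** 2) +
                1.0 / (Etp1[:, 0] ** 2 + Etp1[:, 1] ** 2))

def matching_precision(pts0, pts1, E, thr=1e-4):
    err = symmetric_epipolar_distance(pts0, pts1, E)
    return float(np.mean(err < thr))                               # |M_correct| / |M_predicted|
```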
Results. As reported in Table 2, the proposed method achieves the best performance on the MegaDepth benchmark. Compared with the detector-based SuperGlue [17], our method shows relative improvements of +31.8%, +19.1%, and +11.5% in the AUC at thresholds {5°, 10°, 20°}, respectively. Among the detector-free approaches, our method also outperforms MSFormer [23], DRC-Net [8], and LoFTR [9] across all metrics. In particular, the matching precision exhibits +1.72% and +17.3% relative gains over CSFormer [24] and LoFTR [9], respectively. These consistent improvements confirm the effectiveness of our proposed feature matching framework for accurate urban scene pose estimation.
Visualization. To further qualitatively validate the effectiveness of our method, we visualize the matching results under challenging viewpoint changes. These visual examples are consistent with the quantitative improvements reported in Table 2. Specifically, Figure 4 presents the qualitative comparisons between our proposed CCAM and two representative baselines, SuperGlue [17] and LoFTR [9], evaluated on the MegaDepth dataset [30]. In scenes where objects exhibit visually similar patterns, such as windows on building facades, our approach leverages spatial geometric information to achieve more accurate correspondences and effectively addresses the common issue of mismatches observed in existing methods. Moreover, CCAM demonstrates superior performance in weakly textured regions such as building facades by preserving fine-grained local texture features. In contrast, the detector-free method LoFTR tends to produce incorrect correspondences when similar patterns occur at different spatial locations and exhibits suboptimal performance in weakly textured areas. Notably, the detector-based method SuperGlue yields fewer matches due to the limited ability of its detector to identify a sufficient number of repeatable keypoints. Overall, these results show that CCAM not only inherits robustness to viewpoint changes and occlusion but also excels in challenging urban scenes characterized by repetitive patterns and weak textures.
Experiments on database dependency. In order to demonstrate the generality and robustness of our approach, the proposed CCAM is trained on the MegaDepth dataset and evaluated on the ScanNet [32] dataset. ScanNet is a large-scale dataset that contains 2.5 million views in more than 1500 scans, annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentations.
We take more existing works into comparison. For detector-based methods, nearest neighbor matching, GMS [33], the ratio test [5], PointCN [34], OANet [35], and SuperGlue [17] are reported for comparison with various feature detectors, including ORB [11], D2Net [12], ContextDesc [36], and SuperPoint [13]. For detector-free methods, the state-of-the-art CSFormer [24], DRC-Net [8], and LoFTR [9] are also compared with the proposed method. CSFormer [24] and LoFTR [9] are trained on the same hardware environment as our proposed method for a fair comparison. The AUC of the pose error at thresholds {5°, 10°, 20°} and the matching precision@5e-4 are reported in Table 3.
As shown in Table 3, the proposed method consistently outperforms all detector-based methods in terms of the pose estimation AUC across all thresholds. Although the precision@5e-4 of SuperPoint [13] + SuperGlue [17] is slightly higher, its AUC scores are clearly lower than those of our method. When compared with recent detector-free approaches, our method achieves superior results in both the AUC and precision. Specifically, it shows relative improvements of +6.34%, +3.89%, and +8.19% over LoFTR [9] at AUC@{5°, 10°, 20°}, respectively, and a notable +25.5% gain in matching precision. The experimental results demonstrate the strong generalizability of our method across datasets.

4.3. Visual Localization Under Cross-Daylight Conditions

Dataset. Visual localization under varying illumination, particularly across day and night conditions, is an important application of feature matching. To further assess the robustness of the proposed method, we conduct experiments on long-term visual localization using the large-scale Aachen Day-Night v1.1 benchmark [37], which contains challenging image pairs from the historic city center of Aachen. The benchmark comprises 824 daytime and 98 nighttime queries, and the primary difficulty lies in localizing the nighttime queries under extreme illumination changes. For mapping and localization, we adopt the HLoc toolbox [38]. Performance is reported in terms of the localization accuracy at multiple error thresholds.
Baselines. The baseline methods taken into comparison include both the detector-based SuperGlue [17] and the detector-free CSFormer [24] and LoFTR [9].
Metrics. The evaluation is conducted using the Visual Localization Benchmark, which relies on a pre-defined visual localization pipeline based on COLMAP [31]. The successfully localized images are counted within three error tolerances: (0.25 m, 2°), (0.5 m, 5°), and (5.0 m, 10°), where each pair specifies the maximum position error in meters and the maximum orientation error in degrees.
Results. The experimental results for visual localization are reported in Table 4, from which one can see that the best results are not as concentrated as in the other experiments. Nevertheless, the proposed method achieves the highest performance in the night sequences at all error thresholds and in the day sequences at the (0.5 m, 5°) and (5.0 m, 10°) tolerances, with SuperGlue [17] remaining slightly ahead for day queries at the strictest tolerance. Overall, the results show that the proposed method performs well in the visual localization of cross-daylight street views.
Visualization. We further provide representative visualizations on the Aachen Day-Night v1.1 benchmark to qualitatively assess the proposed CCAM. As shown in Figure 5, CCAM produces stable correspondences across challenging urban scenes with significant day–night illumination changes, repetitive facade patterns, and texture-sparse regions. These examples visually confirm the robustness and spatial consistency of the matches.

4.4. Discussion

4.4.1. Embedding Components

We perform ablation studies to investigate the individual contributions of three key embedding components: the parallel attention Transformer, the channel feature reorganization module, and the cross-scale feature aggregation module. As shown in Table 5, removing any of these components results in a noticeable decline in performance. Specifically, the absence of the channel feature reorganization module impairs the network’s ability to model inter-channel semantic dependencies; omitting the cross-scale feature aggregation module reduces the capacity to preserve fine-grained textures across different resolutions; and removing the parallel attention Transformer weakens the modeling of long-range contextual relationships. Integrating all three modules consistently leads to superior results, demonstrating that each component provides distinct and complementary benefits to the overall matching pipeline.

4.4.2. Positional Encoding

To validate the contribution of the independent positional encoding, we performed an ablation study on attention with and without independent positional encoding. From the results reported in Table 6, one can find that simultaneously applying the absolute positional encoding and independent positional encoding yields the best performance, as measured by both the AUC and precision, indicating that the proposed positional encoding method provides stronger position information to guide matching.

4.4.3. Extraction Depth of Cross-Scale Local Features

To analyze the influence of the extraction depth of cross-scale local features, we trained the proposed method with different $N_c$. As shown in Table 7, our model achieves the best performance regarding AUC@{10°, 20°} and precision@1e-4 when $N_c$ equals 4. Although the proposed method achieved the best result of 48.28% for AUC@5° when $N_c$ was equal to 3, its precision was reduced by 1.29% compared with $N_c$ equal to 4. To balance the pose errors and matching precision, we set $N_c$ to 4 in the experiments.

4.4.4. Efficiency Analysis

We evaluate the runtime performance of CCAM on the MegaDepth dataset and compare it with two detector-free feature matching methods, DRC-Net and LoFTR, using the same workstation described in Section 4.1. The average inference time per image pair was 2.05 s for DRC-Net, 1.71 s for LoFTR, and 1.92 s for CCAM. Among the three methods, DRC-Net exhibits the highest computational cost, primarily because of its reliance on 4D convolutional operations. CCAM incorporates cross-scale feature extraction, a channel feature reorganization module, and independent positional encoding, which moderately increase its computational complexity compared to LoFTR. Nevertheless, CCAM achieves a favorable balance between accuracy and inference efficiency.

5. Conclusions

This paper presents a cross-context aggregation framework for multi-view feature matching in urban scenes and building facades. The main findings can be summarized as follows: (1) integrating local and global contextual information enhances both the distinctiveness and stability of feature representations; (2) modeling relative spatial relationships improves the geometric consistency, particularly in repetitive or low-texture regions; (3) aggregating complementary visual and geometric cues is crucial in achieving robust and spatially coherent correspondences.
Overall, the proposed CCAM framework demonstrates strong potential for geospatial applications. Its ability to handle large viewpoint variations, repetitive facade structures, and texture-sparse surfaces can improve the accuracy and completeness of 3D city reconstruction and urban mapping. In localization and navigation systems, CCAM offers consistent feature correspondences that may enhance map-based positioning and loop closure detection in complex urban environments.
Future work will extend the framework by incorporating higher-level semantic and contextual reasoning to further improve the matching reliability in heterogeneous geospatial datasets.

Author Contributions

Conceptualization, Yaping Yan; methodology, Yaping Yan and Yuhang Zhou; software, Yaping Yan and Yuhang Zhou; validation, Yuhang Zhou; writing—original draft preparation, Yaping Yan; writing—review and editing, Yaping Yan; visualization, Yaping Yan and Yuhang Zhou; funding acquisition, Yaping Yan. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 62201142).

Data Availability Statement

The datasets used in this research are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Luo, H.; Zhang, J.; Liu, X.; Zhang, L.; Liu, J. Large-Scale 3D Reconstruction from Multi-View Imagery: A Comprehensive Review. Remote Sens. 2024, 16, 773. [Google Scholar]
  2. Huang, X.; Yang, J.; Li, J.; Wen, D. Urban functional zone mapping by integrating high spatial resolution nighttime light and daytime multi-view imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 403–415. [Google Scholar] [CrossRef]
  3. Zhang, X.; Qin, H.; Ma, L.; Yu, Y.; Ma, Y.; Hu, Y. Deep Feature Matching of Different-Modal Images for Visual Geo-Localization of AAVs. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 2784–2801. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Zhen, J.; Liu, T.; Yang, Y.; Cheng, Y. Adaptive Differentiation Siamese Fusion Network for Remote Sensing Change Detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
  5. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 2004, 60, 91–110. [Google Scholar] [CrossRef]
  6. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
  7. Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1658–1669. [Google Scholar]
  8. Li, X.; Han, K.; Li, S.; Prisacariu, V. Dual-Resolution Correspondence Networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 17346–17357. [Google Scholar]
  9. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8918–8927. [Google Scholar]
  10. Xu, C.; Wang, B.; Ye, Z.; Mei, L. ETQ-Matcher: Efficient Quadtree-Attention-Guided Transformer for Detector-Free Aerial–Ground Image Matching. Remote Sens. 2025, 17, 1300. [Google Scholar] [CrossRef]
  11. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  12. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A Trainable CNN for Joint Description and Detection of Local Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8084–8093. [Google Scholar]
  13. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 224–236. [Google Scholar]
  14. Zhang, J.; Jiao, L.; Ma, W.; Liu, F.; Liu, X.; Li, L.; Zhu, H. RDLNet: A Regularized Descriptor Learning Network. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5669–5681. [Google Scholar] [CrossRef] [PubMed]
  15. Xiao, X.; Qu, W.; Xia, G.S.; Xu, M.; Shao, Z.; Gong, J.; Li, D. A novel real-time matching and pose reconstruction method for low-overlap agricultural UAV images with repetitive textures. ISPRS J. Photogramm. Remote Sens. 2025, 226, 54–75. [Google Scholar] [CrossRef]
  16. He, H.; Xiong, W.; Zhou, F.; He, Z.; Zhang, T.; Sheng, Z. Topology-Aware Multi-View Street Scene Image Matching for Cross-Daylight Conditions Integrating Geometric Constraints and Semantic Consistency. ISPRS Int. J. Geo-Inf. 2025, 14, 212. [Google Scholar] [CrossRef]
  17. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching With Graph Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4937–4946. [Google Scholar]
  18. Lu, X.; Yan, Y.; Kang, B.; Du, S. ParaFormer: Parallel Attention Transformer for Efficient Feature Matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1853–1860. [Google Scholar]
  19. Cui, Z.; Tang, R.; Wei, J. UAV Image Stitching With Transformer and Small Grid Reformation. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  20. Wang, Z.; Shi, D.; Qiu, C.; Jin, S.; Li, T.; Shi, Y.; Liu, Z.; Qiao, Z. Sequence Matching for Image-Based UAV-to-Satellite Geolocalization. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  21. Yu, J.; Chang, J.; He, J.; Zhang, T.; Yu, J.; Wu, F. Adaptive Spot-Guided Transformer for Consistent Local Feature Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21898–21908. [Google Scholar]
  22. Edstedt, J.; Sun, Q.; Bökman, G.; Wadenbäck, M.; Felsberg, M. RoMa: Robust Dense Feature Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19790–19800. [Google Scholar]
  23. Li, D.; Yan, Y.; Liang, D.; Du, S. MSFORMER: Multi-Scale Transformer with Neighborhood Consensus for Feature Matching. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
  24. Zhou, Y.; Cheng, X.; Zhai, X.; Xue, L.; Du, S. CSFormer: Cross-Scale Transformer for Feature Matching. In Proceedings of the International Conference on Sensing, Measurement and Data Analytics in the Era of Artificial Intelligence, Xi’an, China, 2–4 November 2023; pp. 1–6. [Google Scholar]
  25. Zhang, Y.; Lan, C.; Zhang, H.; Ma, G.; Li, H. Multimodal Remote Sensing Image Matching via Learning Features and Attention Mechanism. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
  26. Lin, T.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  29. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2292–2300. [Google Scholar]
  30. Li, Z.; Snavely, N. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050. [Google Scholar]
  31. Schönberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
  32. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2432–2443. [Google Scholar]
  33. Bian, J.; Lin, W.Y.; Matsushita, Y.; Yeung, S.K.; Nguyen, T.D.; Cheng, M.M. GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2828–2837. [Google Scholar]
  34. Yi, K.M.; Trulls, E.; Ono, Y.; Lepetit, V.; Salzmann, M.; Fua, P. Learning to Find Good Correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2666–2674. [Google Scholar]
  35. Zhang, J.; Sun, D.; Luo, Z.; Yao, A.; Zhou, L.; Shen, T.; Chen, Y.; Liao, H.; Quan, L. Learning Two-View Correspondences and Geometry Using Order-Aware Network. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5844–5853. [Google Scholar]
  36. Luo, Z.; Shen, T.; Zhou, L.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. ContextDesc: Local Descriptor Augmentation With Cross-Modality Context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2522–2531. [Google Scholar]
  37. Sattler, T.; Maddern, W.; Toft, C.; Torii, A.; Hammarstrand, L.; Stenborg, E.; Safari, D.; Okutomi, M.; Pollefeys, M.; Sivic, J.; et al. Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8601–8610. [Google Scholar]
  38. Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12708–12717. [Google Scholar]
Figure 1. Illustration of the existing gaps and the proposed solution in feature matching for challenging urban scenes. From top to bottom: input image, results of LoFTR [9], and results of the proposed CCAM. Green lines indicate correspondences with an epipolar error below $1 \times 10^{-4}$. LoFTR fails to establish reliable matches in repetitive and low-texture areas (e.g., glass facades and concrete walls) due to its limited local aggregation. In contrast, our CCAM effectively captures information from local textures, cross-region contexts, and geometric structures, yielding more consistent and geometrically correct correspondences.
Figure 2. Overview of the proposed CCAM. The input image pair, potentially captured from heterogeneous viewpoints, is processed by a frozen FPN backbone to obtain multi-scale features. These are forwarded to (1) a parallel attention Transformer, capturing cross-region contextual and geometric priors via $N_g$ attention layers with independent positional encoding, and (2) a local feature enhancement branch, extracting fine-grained local patterns through channel reorganization and multi-receptive-field convolutions. The fused features are matched and refined to produce accurate correspondences for urban scene and building facade analysis.
Figure 3. Comparison of attention mechanisms using absolute (left) and independent (right) positional encodings.
Figure 4. Qualitative comparisons between CCAM and representative baselines SuperGlue [17] and LoFTR [9] on the MegaDepth dataset [30]. Correct correspondences are shown in green (epipolar error $\leq 1 \times 10^{-4}$) and incorrect correspondences in red (epipolar error $> 1 \times 10^{-4}$). The precision value displayed in the top-left corner provides an objective measure of the matching accuracy.
Figure 5. Representative qualitative results on the Aachen Day–Night v1.1 benchmark [37]. Green lines denote correct correspondences (epipolar error $\leq 1 \times 10^{-4}$), and red lines denote incorrect correspondences (epipolar error $> 1 \times 10^{-4}$). The precision metric shown in the top-left corner offers a concise quantitative evaluation of the matching accuracy. The proposed CCAM demonstrates robustness in challenging urban scenarios by correctly matching image pairs across extreme illumination changes, distinguishing repetitive facade patterns, and maintaining correspondences in low-texture areas such as walls and pavements.
Table 1. Comparison of existing detector-based and detector-free matching methods in urban imagery applications.

| Category | Method | Advantages | Limitations |
|---|---|---|---|
| Detector-based | SuperPoint [13] | Efficient inference, illumination robust | Limited robustness in dynamic scenes |
| | RDLNet [14] | Compact discriminative descriptor | Requires extensive training to learn hard samples |
| | SuperGlue [17] | Strong contextual reasoning, robust to occlusion and viewpoint changes | Poor performance in textureless regions |
| | ParaFormer [18] | Improved computational efficiency | Prolonged training convergence |
| Detector-free | DRC-Net [8] | Increased matching reliability | Limited receptive field |
| | LoFTR [9] | Robust to occlusion and illumination changes | Losing detailed information |
| | ASTR [21] | High accuracy under extreme viewpoint changes | Requires the camera's intrinsic parameters |
| | RoMa [22] | Robustness under extreme scenarios | False geometric matches |
Table 2. Pose estimation results on the MegaDepth dataset, reported in terms of AUC@{5°, 10°, 20°} and matching precision@1e-4. The best result is highlighted in bold.

| Category | Image Size | Method | AUC@5° (%) | AUC@10° (%) | AUC@20° (%) | Precision@1e-4 (%) |
|---|---|---|---|---|---|---|
| Detector-based | - | SuperPoint [13] + NN + mutual | 31.7 | 46.8 | 60.1 | - |
| | - | SuperPoint [13] + SuperGlue [17] | 36.54 | 54.33 | 69.42 | - |
| Detector-free | - | DRC-Net [8] | 27.01 | 42.96 | 58.31 | - |
| | 640 | MSFormer [23] | 45.35 | 62.24 | 74.96 | 92.85 |
| | 640 | LoFTR [9] | 36.76 | 53.04 | 66.17 | 89.16 |
| | 640 | CSFormer [24] | 46.84 | 62.57 | 74.81 | 93.65 |
| | 640 | Ours | 46.41 | 62.72 | 75.00 | 94.23 |
| | 720 | LoFTR [9] | 43.13 | 58.74 | 70.81 | 81.75 |
| | 720 | CSFormer [24] | 47.96 | 64.25 | 76.75 | 94.25 |
| | 720 | Ours | **48.15** | **64.70** | **77.36** | **95.87** |
Table 3. Cross-dataset evaluation on ScanNet. All models were trained on MegaDepth. Results are reported in terms of AUC@{5°, 10°, 20°} and matching precision@5e-4. Best results are highlighted in bold.

| Category | Method | AUC@5° (%) | AUC@10° (%) | AUC@20° (%) | Precision@5e-4 (%) |
|---|---|---|---|---|---|
| Detector-based | ORB [11] + GMS [33] | 5.21 | 13.65 | 25.36 | 72.0 |
| | D2Net [12] + NN | 5.25 | 14.53 | 27.96 | 46.7 |
| | ContextDesc [36] + Ratio Test [5] | 6.64 | 15.01 | 25.75 | 51.2 |
| | SuperPoint [13] + NN | 9.43 | 21.53 | 36.40 | 50.4 |
| | SuperPoint [13] + GMS [33] | 8.39 | 18.96 | 31.56 | 50.3 |
| | SuperPoint [13] + PointCN [34] | 11.40 | 25.47 | 41.41 | 71.8 |
| | SuperPoint [13] + OANet [35] | 11.76 | 26.90 | 43.85 | 74.0 |
| | SuperPoint [13] + SuperGlue [17] | 11.82 | 26.76 | 40.38 | **85.1** |
| Detector-free | DRC-Net [8] | 7.69 | 11.93 | 30.49 | - |
| | LoFTR [9] | 12.47 | 26.98 | 41.64 | 62.20 |
| | CSFormer [24] | 13.08 | 27.80 | 44.17 | 77.36 |
| | Ours | **13.26** | **28.03** | **45.05** | 78.08 |
Table 4. Visual localization results on the Aachen Day–Night v1.1 benchmark. All methods are evaluated under three error tolerances: (0.25 m, 2°), (0.5 m, 5°), and (5.0 m, 10°), where each pair denotes the maximum allowable position and orientation error, respectively. The reported values represent the percentage of successfully localized images within each threshold. Best results are highlighted in bold.

| Method | Day (0.25 m, 2°) | Day (0.5 m, 5°) | Day (5.0 m, 10°) | Night (0.25 m, 2°) | Night (0.5 m, 5°) | Night (5.0 m, 10°) |
|---|---|---|---|---|---|---|
| SuperGlue [17] | **87.4** | 93.4 | 97.1 | 66.0 | 80.6 | 91.5 |
| LoFTR [9] | 86.0 | 92.5 | 96.7 | 68.1 | 80.1 | 91.6 |
| CSFormer [24] | 86.2 | 93.9 | 97.6 | 72.3 | 89.5 | 97.4 |
| Ours | 86.5 | **94.3** | **97.9** | **72.6** | **90.1** | **98.1** |
Table 5. Ablation study on embedding components. The best result is highlighted in bold.

| Parallel Attention | Channel Reorganization | Cross-Scale Aggregation | AUC@5° (%) | AUC@10° (%) | AUC@20° (%) |
|---|---|---|---|---|---|
| | | | 27.89 | 43.56 | 59.03 |
| | | | 45.32 | 61.45 | 74.81 |
| | | | 47.06 | 63.11 | 76.27 |
| ✓ | ✓ | ✓ | **48.15** | **64.70** | **77.36** |
Table 6. Ablation study on attention with absolute positional encoding (APE) and independent positional encoding (IPE). The best result is highlighted in bold.

| APE | IPE | AUC@5° (%) | AUC@10° (%) | AUC@20° (%) | Precision@1e-4 (%) |
|---|---|---|---|---|---|
| | | 47.96 | 64.25 | 76.75 | 94.25 |
| | | 47.11 | 63.71 | 76.73 | 95.28 |
| ✓ | ✓ | **48.15** | **64.70** | **77.36** | **95.87** |
Table 7. Ablation study on $N_c$. Different $N_c$ are evaluated on the MegaDepth dataset. The best result is highlighted in bold.

| $N_c$ | AUC@5° (%) | AUC@10° (%) | AUC@20° (%) | Precision@1e-4 (%) |
|---|---|---|---|---|
| 1 | 45.17 | 61.43 | 74.20 | 92.50 |
| 2 | 47.22 | 63.89 | 76.11 | 92.63 |
| 3 | **48.28** | 64.46 | 76.74 | 94.64 |
| 4 | 48.15 | **64.70** | **77.36** | **95.87** |
