Article

MaLCA: Point Cloud Registration with Mamba-Enhanced Features and Local Correspondence Augmentation

1 Shanxi Key Laboratory of Machine Vision and Virtual Reality, North University of China, Taiyuan 030051, China
2 School of Computer Science and Technology, North University of China, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(5), 380; https://doi.org/10.3390/a19050380
Submission received: 9 April 2026 / Revised: 7 May 2026 / Accepted: 9 May 2026 / Published: 11 May 2026

Abstract

High-quality correspondences are critical to the accuracy and robustness of point cloud registration. Existing Transformer-based methods are fundamentally constrained by the quadratic computational complexity of self-attention, resulting in limited scalability. Moreover, conventional outlier removal paradigms operate by pruning initial correspondences, and thus fail catastrophically in low-overlap scenarios where initial inliers are inherently scarce. To address these challenges, we propose MaLCA, a point cloud registration method based on Mamba-enhanced features and local correspondence augmentation. We first adopt KPFCN as the backbone to extract multi-scale geometric features from raw point clouds. A Mamba selective state space model then replaces self-attention for global context modeling with linear complexity, while cross-attention is retained to facilitate inter-point-cloud feature interaction. Rather than following the conventional subtraction-based outlier removal paradigm, we introduce a prior-guided local rematching strategy combined with a fused neighbor matching mechanism that iteratively constructs dense, high-quality correspondences from sparse initial inliers, fundamentally overcoming the bottleneck of inlier scarcity in challenging scenes. Extensive experiments on the 3DMatch/3DLoMatch and 4DMatch/4DLoMatch benchmarks demonstrate that MaLCA achieves competitive registration performance across both rigid and deformable scenarios, with particular advantages in low-overlap cases.

1. Introduction

Point cloud registration is a fundamental task in 3D computer vision, aiming to estimate the spatial transformation between two partially overlapping point clouds and align them into a unified coordinate system, with broad applications in simultaneous localization and mapping (SLAM) [1], augmented reality [2], autonomous driving perception [3], and 3D scene reconstruction [4]. With the rapid advancement of deep learning, learning-based registration methods have surpassed traditional hand-crafted feature approaches in both accuracy and robustness, becoming the dominant research paradigm in this field. In recent years, Transformer-based feature enhancement frameworks have achieved remarkable progress by interleaving self-attention and cross-attention modules to model intra-point-cloud global dependencies and inter-frame semantic associations, continuously advancing the state of the art in registration accuracy. However, self-attention suffers from a fundamental efficiency bottleneck: its computational complexity grows quadratically with the number of input points [5], forcing existing methods to aggressively downsample point clouds at the cost of geometric detail. This limitation becomes particularly pronounced in challenging scenarios involving high-density point clouds or low overlap ratios.
Meanwhile, Mamba—a state space model (SSM)-based architecture—has attracted increasing attention as an efficient alternative to Transformers [6]. By introducing an input-adaptive selective state transition mechanism, Mamba achieves global context modeling with linear time complexity, substantially reducing computational cost while preserving long-range dependency awareness. Nevertheless, adapting Mamba to 3D point clouds presents two inherent challenges. First, the inherent unorderedness of point clouds conflicts with Mamba’s implicit requirement for structured sequential inputs; naive ordering strategies disrupt the spatial topology of the point cloud and introduce spurious order dependencies. Second, Mamba operates intrinsically within a single point cloud, and is therefore unable to perceive cross-frame geometric correspondences between two point clouds; yet, inter-point-cloud feature interaction is essential for establishing accurate correspondences in registration [7,8]. How to preserve Mamba’s linear efficiency advantage while effectively compensating for these limitations therefore constitutes a central challenge that must be addressed.
Regarding the outlier problem, existing methods predominantly follow a pruning paradigm: starting from an initial set of putative correspondences, erroneous matches are progressively filtered using geometric consistency constraints, and the surviving inliers are used to estimate the transformation parameters [9]. However, this paradigm has a fundamental limitation: in challenging scenarios characterized by low overlap or strong noise, the number of correct correspondences in the initial set is extremely limited, and purely subtractive filtering cannot fundamentally resolve the problem of inlier scarcity.
To address the above challenges, we propose MaLCA, a point cloud registration method based on Mamba-enhanced features and local correspondence augmentation. Mamba selective state space models replace the self-attention layers for global context modeling with linear complexity, while cross-attention is retained to enable inter-point-cloud feature interaction. To handle the unorderedness of point clouds, a spatially-aware serialization strategy based on Z-order space-filling curves [10] is adopted. Subsequently, a prior-guided local rematching strategy replaces the conventional outlier removal paradigm: using the spatial positions of correspondences from the previous iteration as priors, dense rematching is performed within local neighborhoods via a fused neighbor matching mechanism, progressively augmenting the set of high-quality inliers, and fundamentally overcoming the bottleneck of sparse initial inliers.
The main contributions of this paper are as follows:
  • We propose a Mamba–cross-attention hybrid feature enhancement module (MCA), in which Mamba replaces self-attention for global context modeling, combined with a Z-order space-filling curve serialization strategy, thereby improving registration efficiency.
  • We propose a local rematching strategy and a fused neighbor matching mechanism to replace conventional outlier removal. Through multi-round iterative augmentation of high-quality inliers, this approach effectively resolves the fundamental problem of sparse initial inliers in low-overlap scenarios.
  • Extensive experiments on the 3DMatch/3DLoMatch and 4DMatch/4DLoMatch benchmarks demonstrate that the proposed method achieves competitive registration performance in challenging scenarios such as low overlap.
The practical value of MaLCA is reflected in two aspects. In terms of efficiency, replacing self-attention with Mamba reduces GPU memory from 12,297 MB to 4915 MB at the 1/8 downsampling level, enabling processing of denser point clouds where Transformer-based methods run out of memory. In terms of registration quality, MaLCA achieves 92.62% RR and the highest FMR of 89.80% on 3DLoMatch among all compared methods, and 78.63% NFMR on 4DLoMatch, surpassing all baselines, confirming its advantage in low-overlap scenarios.
This paper is organized as follows: Section 2 reviews related work on correspondence extraction, Mamba-based point cloud analysis, and outlier removal. Section 3 details the proposed MaLCA framework. Section 4 presents experiments on both rigid and non-rigid benchmarks along with ablation studies. Section 5 concludes the paper.

2. Related Work

2.1. Correspondence Extraction

The quality of correspondence extraction directly determines the final registration accuracy in point cloud registration. Early methods relied on hand-crafted descriptors such as FPFH and SHOT [11,12], which exhibit limited performance under complex conditions involving noise and occlusion. With the advancement of deep learning, FCGF [13] introduced a fully convolutional sparse 3D network for efficient dense feature extraction; Predator [14] incorporated attention mechanisms to predict overlapping regions, demonstrating strong performance in low-overlap scenarios; and RoITr [15] embedded point-pair feature coordinates into the attention mechanism to achieve pose-invariant geometric description. Kernel point convolution backbones, represented by KPConv [16], have been widely adopted in multiple registration frameworks owing to their translation invariance and strong local geometric modeling capability. In the domain of Transformer-based feature enhancement, CoFiNet [17] integrated self-attention and cross-attention into a coarse-to-fine hierarchical framework, establishing a foundational registration paradigm. GeoTransformer [18] proposed geometry-enhanced self-attention to encode pairwise distances and triplet-wise angles, enabling transformation-invariant geometric feature modeling. Fu et al. [19] further improved feature interaction accuracy by adopting superpoint-focused attention and dual-spatial consistency matching at the coarse and fine stages, respectively. However, all of the above Transformer-based methods are subject to quadratic computational complexity. In contrast, the proposed method achieves a balance between computational efficiency and registration performance through a Mamba–cross-attention hybrid feature enhancement module.

2.2. Mamba for 3D Point Cloud Analysis

State space models (SSMs) have attracted considerable attention in recent years as efficient alternatives to Transformers for sequence modeling. S4 [20] demonstrated the ability to model long-range dependencies with linear complexity via structured state matrices. Mamba [7] further introduced a selective state space mechanism, making model parameters input-dependent to enable dynamic information selection, and combined this with a hardware-aware parallel algorithm to achieve linear inference complexity. Adapting Mamba to 3D point clouds faces two core challenges: the unorderedness of point clouds and the lack of explicit local geometric awareness. PointMamba [10] was the first to systematically explore the application of SSMs to point cloud analysis, serializing point clouds using space-filling curves and adopting a non-hierarchical Mamba encoder to surpass Transformer baselines on multiple tasks with linear complexity. Spectral-Informed Mamba [21] proposed a spectral-space traversal strategy combined with spectral graph analysis to improve token ordering, better preserving spatial proximity within the Mamba framework. Mamba3D [22] designed a local norm pooling block for explicit local geometric feature extraction and constructed a bidirectional SSM comprising a forward SSM and a feature-channel-reversed C-SSM to alleviate spurious order dependencies in unordered point clouds. MT-PCR [23] was the first to introduce Mamba into point cloud registration, applying it as an isolated encoder confined to the coarse matching stage. In contrast, we tightly couple Mamba and cross-attention in a layer-wise alternating structure, producing enhanced features that govern all downstream processing, and further employ a dual-direction scan with feature-channel reversal to reduce pseudo-order dependency.

2.3. Outlier Removal

Traditional geometry-based methods such as RANSAC [24] and its variants represent the most classical approach to outlier removal. TEASER and SC2-PCR [25,26] further improve robustness through truncated least-squares cost functions and second-order spatial compatibility, respectively. Learning-based methods such as PointDSC [27] embed spatial consistency into feature maps and estimate correspondence confidence via neural spectral matching. GraphSCNet [28] is specifically designed for non-rigid scenarios, refining correspondences using local spatial consistency constraints on deformation graphs, and represents the first learning-based outlier removal method tailored for non-rigid registration. All of the above methods follow a pruning paradigm, and consequently struggle to support robust registration when the initial inlier set is extremely scarce. To address this limitation, the proposed method introduces a local neighborhood rematching strategy that, after establishing initial correspondences, progressively generates high-quality and dense correspondences through iterative rematching.

3. Methodology

3.1. Problem Formulation

Given a source point cloud $P = \{ p_i \in \mathbb{R}^3 \mid i = 1, \ldots, N \}$ and a target point cloud $Q = \{ q_j \in \mathbb{R}^3 \mid j = 1, \ldots, M \}$, the goal of registration is to estimate a rigid transformation $T = \{ R, t \}$, where $R \in SO(3)$ denotes the rotation matrix and $t \in \mathbb{R}^3$ denotes the translation vector. The transformation can be obtained by solving the following minimization:
$$\min_{R, t} \sum_{(p_i, q_j) \in G} \left\| T(p_i) - q_j \right\|_2^2$$
where G denotes the set of correspondences. In non-rigid scenarios, the transformation function is generalized to a dense per-point warp field F , commonly referred to as scene flow.

3.2. Framework Overview

The overall framework is illustrated in Figure 1 and consists of three core processing stages. First, KPFCN [16] is employed as the backbone network to extract features from and downsample the input point clouds, yielding initial geometric features $F^P$ and $F^Q$. These features are then fed into a Mamba–cross-attention hybrid enhancement module (MCA) for deep feature refinement, producing more discriminative enhanced features $\hat{F}^P$ and $\hat{F}^Q$. A feature matching step is subsequently applied to compute pairwise similarity scores between the enhanced features, forming an initial correspondence set $G^0$. Given that initial correspondences typically contain a large proportion of outliers with only a limited number of inliers, a local rematching module is further introduced. Using the spatial positions of correspondences from the previous iteration as priors, this module performs rematching within local neighborhoods and applies geometric consistency correction to progressively augment the set of high-quality inliers. After multiple iterations, the final correspondence set is obtained.

3.3. Initial Feature Extraction

KPFCN employs kernel point convolution as its fundamental operator, performing convolution over point cloud neighborhoods via a set of deformable kernel points. It possesses inductive biases of translation equivariance and local receptive fields, enabling it to directly process irregular 3D point clouds without voxelization or projection preprocessing. The original KPFCN adopts a UNet-style encoder–decoder architecture with three levels of convolutional downsampling and corresponding upsampling decoders. In this work, we structurally prune the backbone by retaining the complete encoder and only the first decoder unit, removing all subsequent upsampling layers, as illustrated in Figure 2. This design preserves multi-scale feature fusion while controlling the density of the output point cloud, thereby reducing the sequence length fed into the subsequent MCA module and lowering computational overhead.
In the encoder stage, the point cloud is progressively downsampled across three levels, with the number of feature channels expanding as [64 → 128 → 256 → 512 → 1024]. At each resolution level, two types of residual blocks are alternately stacked: residual block A captures local neighborhood geometric structure through the KPConv operator, while residual block B reduces point cloud density while enlarging the receptive field, enabling the network to perceive geometric information at different scales. In the decoder stage, only the first upsampling unit is retained. After upsampling, a skip connection concatenates the features from the corresponding encoder level, and a linear convolutional layer compresses the feature dimension to 528, yielding initial features F P and F Q . It is worth noting that the features extracted by KPFCN possess strict translation invariance, which can lead to feature ambiguity in scenes with repetitive geometric structures or symmetries. The subsequent module addresses this limitation by introducing global context modeling.

3.4. Feature Enhancement

3.4.1. Mamba Encoder

The initial geometric features $F^P$ and $F^Q$ encode only local geometric information and lack global context awareness. Moreover, point cloud features are inherently unordered, rendering direct sequence modeling with Mamba infeasible. A naive random ordering would disrupt the spatial topology of the point cloud, causing geometrically adjacent points to become distant in the sequence. To this end, we adopt a serialization method based on Z-order space-filling curves (Morton codes) [10]. Each point $p_i = (x_i, y_i, z_i)$ is first normalized to $[0, 1]^3$ and quantized to integer grid indices $\hat{x}_i = \big\lfloor \min(\max(x_i, 0), 1) \cdot 2^B \big\rfloor$, and analogously for $\hat{y}_i$ and $\hat{z}_i$, where $B = 10$ is the bit depth per coordinate axis, yielding a $1024^3$ grid. The Morton code $M(p_i)$ is then computed by extracting each bit of the three quantized coordinates and placing them at interleaved positions in a $3B$-bit output integer: for each bit index $k \in \{0, 1, \ldots, B-1\}$, the $k$-th bits of $\hat{x}_i$, $\hat{y}_i$, $\hat{z}_i$ are placed at output bit positions $2 + 3k$, $1 + 3k$, and $3k$, respectively, formally:
$$M(p_i) = \bigvee_{k=0}^{B-1} \Big[ \big( (\hat{x}_i \gg k) \,\&\, 1 \big) \ll (2 + 3k) \Big] \;\big|\; \Big[ \big( (\hat{y}_i \gg k) \,\&\, 1 \big) \ll (1 + 3k) \Big] \;\big|\; \Big[ \big( (\hat{z}_i \gg k) \,\&\, 1 \big) \ll 3k \Big]$$
where $\gg$, $\ll$, $\&$, and $|$ denote bitwise right-shift, left-shift, AND, and OR operations, respectively. The resulting bit layout follows the pattern $\hat{x}_{B-1} \hat{y}_{B-1} \hat{z}_{B-1} \cdots \hat{x}_0 \hat{y}_0 \hat{z}_0$ from high to low bits. All $N$ points are then sorted in ascending order of their Morton codes, and the serialized feature sequence fed into the Mamba encoder is $Z = [F_{m(1)}, F_{m(2)}, \ldots, F_{m(N)}]$, where $m(k)$ denotes the index of the point with the $k$-th smallest Morton code. The key property that makes this serialization effective is spatial locality preservation: points residing within the same $2^k \times 2^k \times 2^k$ sub-cube share identical top $(3B - 3k)$ bits of their Morton codes, guaranteeing their contiguity in $Z$ at every resolution level $k$, such that geometrically neighboring points remain adjacent in the serialized sequence, and Mamba's recurrent scan receives a spatially coherent input rather than an arbitrary permutation.
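The serialization above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the Morton-code interleaving and sorting described here, not the authors' code; the function names are ours, and per-axis min-max normalization is assumed.

```python
import numpy as np

B = 10  # bits per axis, giving the 1024^3 grid used in the paper

def morton_codes(points):
    """Z-order (Morton) codes for an (N, 3) array of points: normalize
    to [0, 1]^3, quantize each axis to B bits, and interleave the bits
    as x_k y_k z_k ... x_0 y_0 z_0 (x in the highest position)."""
    pts = np.asarray(points, dtype=np.float64)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    pts = (pts - lo) / np.where(hi > lo, hi - lo, 1.0)   # normalize per axis
    q = np.minimum((np.clip(pts, 0.0, 1.0) * (1 << B)).astype(np.int64),
                   (1 << B) - 1)                          # quantized grid indices
    codes = np.zeros(len(pts), dtype=np.int64)
    for k in range(B):  # place bit k of x, y, z at positions 3k+2, 3k+1, 3k
        codes |= ((q[:, 0] >> k) & 1) << (2 + 3 * k)
        codes |= ((q[:, 1] >> k) & 1) << (1 + 3 * k)
        codes |= ((q[:, 2] >> k) & 1) << (3 * k)
    return codes

def z_order_serialize(points, features):
    """Return the permutation m(k) and the feature sequence Z sorted by
    ascending Morton code."""
    order = np.argsort(morton_codes(points), kind="stable")
    return order, np.asarray(features)[order]
```

Because non-overlapping bit positions are used, the bitwise ORs can equivalently be summed, and sorting by the resulting integer yields the Z-order traversal.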
After reordering by Morton codes, the serialized features are then passed through an encoder comprising $L$ stacked Mamba blocks ($L$ is a hyper-parameter detailed in Section 4.1), where each block consists of layer normalization (LN), depth-wise separable convolution (DWConv), and a selective state space model (SelectiveSSM). The forward propagation of the $l$-th block is defined as:
$$F_l' = \sigma\big( \mathrm{DWConv}( \mathrm{Linear}( \mathrm{LN}( F_{l-1} ) ) ) \big)$$
$$F_l = \mathrm{Linear}\big( \mathrm{SelectiveSSM}( F_l' ) \odot \sigma( \mathrm{Linear}( \mathrm{LN}( F_{l-1} ) ) ) \big) + F_{l-1}$$
where $\sigma$ denotes the SiLU activation function, $\odot$ denotes element-wise multiplication, and the residual connection ensures stable gradient propagation. To mitigate the directional bias introduced by the unidirectional scan of the Z-order curve, the original sequence is processed in the forward direction to obtain $Y_+$, while a reversed scan is applied to the feature-channel-flipped sequence to obtain $Y_-$. The two outputs are then fused as:
$$\tilde{F} = W_+ Y_+ + W_- Y_-$$
where $W_+$ and $W_-$ are learnable fusion weights. Unlike approaches that flip the spatial token order, the proposed method performs the reverse scan along the feature channel dimension, effectively avoiding the spurious order dependencies that would otherwise be introduced by positional flipping in unordered point clouds. After passing through the Mamba encoder, enhanced features $\tilde{F}^P$ and $\tilde{F}^Q$ are obtained.
It is worth noting that the Mamba scan mechanism itself requires no modification for non-rigid scenarios, as the Z-order serialization operates purely on point coordinates, and the selective state space model processes the resulting sequence in the same manner regardless of whether the underlying deformation is rigid or non-rigid.

3.4.2. Inter-Point-Cloud Feature Interaction

Mamba operates intrinsically within a single point cloud and is therefore unable to perceive potential correspondence regions between source and target point clouds. To overcome this limitation, we propose a Mamba–Cross-Attention hybrid enhancement module (MCA), which introduces a cross-attention layer following the Mamba encoding stage. The computation is illustrated in Figure 3. Given the intermediate features $\tilde{F}^P$ and $\tilde{F}^Q$ of the source and target point clouds, the queries, keys, and values for cross-attention are computed as:
$$q_i = W_q \tilde{F}^P_i, \quad k_j = W_k \tilde{F}^Q_j, \quad v_j = W_v \tilde{F}^Q_j$$
where $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ are learnable projection matrices, and $d$ denotes the feature channel dimension. The attention weights and the update rule for the source point features are given by:
$$a_{ij} = \mathrm{softmax}\!\left( \frac{q_i^\top k_j}{\sqrt{d}} \right)$$
$$\hat{F}^P_i \leftarrow \tilde{F}^P_i + \mathrm{MLP}\Big( \mathrm{cat}\Big( q_i, \textstyle\sum_j a_{ij} v_j \Big) \Big)$$
where $\mathrm{cat}(\cdot)$ denotes the concatenation of two feature vectors along the channel dimension. The target point cloud features $\hat{F}^Q$ are updated symmetrically in the same manner. The entire MCA module stacks $L$ layers, each consisting of a Mamba global encoding step followed by a cross-attention feature interaction step. The output enhanced features $\hat{F}^P$ and $\hat{F}^Q$ simultaneously possess intra-point-cloud global awareness and inter-point-cloud correspondence awareness, providing high-quality feature representations for subsequent correspondence establishment.
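The cross-attention update admits a compact NumPy sketch. This is illustrative only: the paper's MLP over the concatenated query and message is simplified here to the attended message itself, so the snippet shows the attention data flow and residual structure, not the full learned layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_update(F_p, F_q, W_q, W_k, W_v):
    """One direction of the MCA cross-attention: source points query the
    target features. MLP(cat(q, msg)) is simplified to the message itself,
    keeping only the residual structure."""
    Q = F_p @ W_q                      # (N, d) queries from the source
    K = F_q @ W_k                      # (M, d) keys from the target
    V = F_q @ W_v                      # (M, d) values from the target
    d = Q.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=1)   # (N, M) row-stochastic weights
    msg = A @ V                        # per-source aggregation of target features
    return F_p + msg                   # residual update of the source features
```

The target features are updated symmetrically by swapping the roles of `F_p` and `F_q`.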

3.5. Matching

3.5.1. Prior-Guided Local Rematching

Following the MCA module, the enhanced features $\hat{F}^P$ and $\hat{F}^Q$ are used to establish correspondences. Following Lepard [29], position-aware linear projections are applied to the enhanced features to compute a pairwise scoring matrix $S \in \mathbb{R}^{N \times M}$:
$$S_{i,j} = \frac{1}{\sqrt{d}} \left\langle \Theta(p_i)\, W^P \hat{F}^P_i, \; \Theta(q_j)\, W^Q \hat{F}^Q_j \right\rangle$$
where $W^P, W^Q \in \mathbb{R}^{d \times d}$ are learnable projection matrices, and $\Theta(\cdot)$ denotes the rotary position encoding. The scoring matrix is then normalized into a confidence matrix $C$ via a dual-softmax operation:
$$C_{i,j} = \mathrm{Softmax}(S_{i,:})_j \cdot \mathrm{Softmax}(S_{:,j})_i$$
This operation simultaneously applies normalization constraints along both row and column dimensions, enforcing bidirectional mutual exclusivity on high-confidence matches. Point pairs whose confidence exceeds a threshold $\theta_c$ and satisfy the mutual nearest neighbor criterion are selected to form the initial correspondence set $G^0 = \{ (p_i, q_j) \}$. Using the spatial positions of these correspondences as priors, rematching and consistency correction are iteratively performed within progressively shrinking local spherical neighborhoods to augment the set of high-quality inliers. The optimization process at iteration $t$ is defined as:
$$G^t = \Phi\big( G^{t-1}, P, Q, r_t \big), \quad t = 1, 2, \ldots, m$$
where $\Phi$ denotes the single-round correspondence augmentation operation, $r_t$ is the local search radius at iteration $t$, and $m$ is the total number of iterations. The search radius is progressively reduced as iterations proceed to improve precision. It is worth noting that, following GraphSCNet [28], non-rigid point clouds undergo deformation that grows with the distance between corresponding points; therefore, the initial search radius and the number of iterations are set more conservatively than in the rigid case to avoid crossing deformation boundaries and introducing erroneous correspondences. At the beginning of each iteration, the correspondences $G^{t-1}$ from the previous round are randomly sampled, and a spherical neighborhood search is performed to construct local region point cloud pairs:
$$P_i^t = \mathrm{Search}\big( G^{t-1}, r_t, p_i^{t-1} \big), \quad Q_i^t = \mathrm{Search}\big( G^{t-1}, r_t, q_j^{t-1} \big)$$
where $\mathrm{Search}(X, r, c)$ denotes a spherical neighborhood query centered at point $c$ with radius $r$ over point set $X$. This yields $n$ pairs of local regions $(P_i^t, Q_i^t)$. The iteration terminates when the search radius $r_t$ decreases to its minimum value. At each iteration, a $k$-nearest neighbor search is performed within the local neighborhood defined by $r_t$, with the number of neighbors decreasing monotonically alongside it, progressively contracting the search range until convergence. The neighbor counts are determined empirically, with details reported in Section 4.1.
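The construction of the initial set $G^0$ described earlier (dual-softmax normalization followed by thresholded mutual-nearest-neighbor selection) can be sketched as follows. This is an illustrative NumPy version with hypothetical function names and an arbitrary example threshold; the paper's $\theta_c$ is a tuned hyper-parameter.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax(S):
    """C[i, j] = Softmax(S[i, :])[j] * Softmax(S[:, j])[i]."""
    return softmax(S, axis=1) * softmax(S, axis=0)

def initial_correspondences(C, theta_c=0.2):
    """Pairs that are mutual nearest neighbors in C and whose
    confidence exceeds the threshold theta_c."""
    row_best = C.argmax(axis=1)   # best target index for each source point
    col_best = C.argmax(axis=0)   # best source index for each target point
    return [(i, int(j)) for i, j in enumerate(row_best)
            if col_best[j] == i and C[i, j] > theta_c]
```

The product of row- and column-wise softmaxes suppresses entries that dominate only one direction, which is what enforces the bidirectional exclusivity mentioned above.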
When performing feature matching within local regions, directly applying nearest neighbor matching (NN) tends to produce a large number of false correspondences owing to the high similarity of local feature distributions, whereas strict mutual nearest neighbor matching (MNN) may fail to find any correspondence at all. To address this issue, we propose a fused neighbor matching (FNM) mechanism, as illustrated in Figure 4. FNM jointly considers the matching matrices in both directions: the NN matching matrix $M_{NN}^{P \to Q}$ and the multi-nearest-neighbor matching matrix $M_{MNN}^{P \to Q}$ from source to target are computed via $k$-nearest neighbor search in feature space, along with their counterparts $M_{NN}^{Q \to P}$ and $M_{MNN}^{Q \to P}$ in the reverse direction. Specifically, for a point $p_s \in P_i^t$ and a point $q_t \in Q_i^t$, the entries of these matrices are formally defined as:
$$M_{NN}^{P \to Q}[s, t] = \mathbb{1}\big[ f(q_t) = N_1( f(p_s) ) \big], \quad M_{MNN}^{P \to Q}[s, t] = \mathbb{1}\big[ f(q_t) \in N_k( f(p_s) ) \big]$$
where $f(\cdot)$ denotes the MCA feature vector, $N_1(\cdot)$ and $N_k(\cdot)$ denote the nearest neighbor and $k$-nearest neighbors in feature space, respectively, and $\mathbb{1}[\cdot]$ is the indicator function. The reverse-direction matrices are defined symmetrically. Based on these matrices, $M_{FNM}$ is computed as:
$$M_{FNM} = \big( M_{NN}^{P \to Q} \odot M_{MNN}^{Q \to P} \big) \vee \big( M_{NN}^{Q \to P} \odot M_{MNN}^{P \to Q} \big)$$
where $\odot$ denotes the Hadamard product and $\vee$ denotes the element-wise logical OR; based on the matching matrix $M_{FNM}$, a correspondence subset $G_i^t$ is generated for each local neighborhood pair and passed to the next stage for consistency correction.
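A minimal sketch of the FNM combination on two small feature sets, assuming Euclidean feature distance for illustration (the function name is ours):

```python
import numpy as np

def fused_neighbor_matching(Fp, Fq, k=4):
    """Boolean FNM matrix for local feature sets Fp (n, d) and Fq (m, d):
    (NN_{P->Q} AND kNN_{Q->P}) OR (NN_{Q->P} AND kNN_{P->Q})."""
    D = np.linalg.norm(Fp[:, None, :] - Fq[None, :, :], axis=2)  # (n, m)
    n, m = D.shape
    nn_pq = np.zeros((n, m), dtype=bool)          # nearest target per source
    nn_pq[np.arange(n), D.argmin(axis=1)] = True
    nn_qp = np.zeros((n, m), dtype=bool)          # nearest source per target
    nn_qp[D.argmin(axis=0), np.arange(m)] = True
    knn_pq = np.zeros((n, m), dtype=bool)         # k nearest targets per source
    knn_pq[np.arange(n)[:, None], np.argsort(D, axis=1)[:, :k]] = True
    knn_qp = np.zeros((n, m), dtype=bool)         # k nearest sources per target
    knn_qp[np.argsort(D, axis=0)[:k, :], np.arange(m)[None, :]] = True
    return (nn_pq & knn_qp) | (nn_qp & knn_pq)
```

Each direction's strict NN vote is validated by the other direction's softer $k$-NN membership, which is looser than MNN but tighter than plain NN.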

3.5.2. Consistency Correction and Pose Estimation

To evaluate the quality of candidate correspondences generated within each local region, a comprehensive compatibility scoring matrix is constructed to measure the geometric compatibility between any two candidate correspondence pairs. For local candidate correspondences $G_i = (p_i, q_i)$ and $G_j = (p_j, q_j)$, a relatively reliable correspondence $G_k^{t-1} = (p_k, q_k)$ from the previous iteration is additionally introduced as an anchor for joint measurement. The comprehensive compatibility score is defined as:
$$c_+( G_i, G_j ) = \big( c_\theta( G_i, G_k^{t-1} ) \wedge c_\theta( G_k^{t-1}, G_j ) \big) \vee c_{\theta/2}( G_i, G_j )$$
where $c_\theta( G_i, G_j ) = \mathbb{1}\big[ \big| \| p_i - p_j \|_2 - \| q_i - q_j \|_2 \big| \leq \theta \big]$ is a pairwise distance consistency indicator function, $\wedge$ denotes the logical AND between two binary indicators, and $\vee$ denotes the logical OR operation. The term $c_{\theta/2}$ applies the same pairwise distance consistency check as $c_\theta$, but with a stricter halved tolerance threshold $\theta/2$. Since the anchor correspondence is not always fully reliable, this stricter direct constraint between $G_i$ and $G_j$ ensures that geometric compatibility can still be correctly assessed even when the prior anchor point is insufficiently trustworthy. From $c_+( G_i, G_j )$, the score matrix $M_i^S = [ c_+( G_i, G_j ) ]_{i \neq j}$ for the local correspondence set $G_i^t$ is computed, and a local region quality score is defined as:
$$S( G_i^t ) = \frac{ \| M_i^S \|_1 }{ a N }$$
where $\| \cdot \|_1$ denotes the $L_1$ norm of a matrix, $a \in (0, 1)$ is the threshold for the proportion of correct correspondences, and $N$ is the number of candidate correspondences in the local region. A local region is considered a valid correspondence region when this score exceeds the threshold.
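The anchor-mediated compatibility test can be sketched as follows, with correspondences represented as `(p, q)` point pairs; the function names are ours, and the threshold in the example is arbitrary.

```python
import numpy as np

def c_theta(Gi, Gj, theta):
    """Pairwise distance-consistency indicator: the source-side and
    target-side distances between two correspondences must agree."""
    (pi, qi), (pj, qj) = Gi, Gj
    return abs(np.linalg.norm(pi - pj) - np.linalg.norm(qi - qj)) <= theta

def c_plus(Gi, Gj, Gk, theta):
    """(c_theta(Gi, Gk) AND c_theta(Gk, Gj)) OR c_{theta/2}(Gi, Gj):
    anchor-mediated compatibility with a stricter direct fallback."""
    return (c_theta(Gi, Gk, theta) and c_theta(Gk, Gj, theta)) \
        or c_theta(Gi, Gj, theta / 2)
```

Under a rigid motion, inlier pairs preserve pairwise distances, so both the anchored path and the strict direct check succeed; an outlier breaks both.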
Within each valid correspondence region, a subset of precise correspondences $\hat{G}_i^t$ with high compatibility scores is identified to guide the refinement of correspondences across the entire local region. A local rigid transformation is then estimated from these correspondences via weighted SVD:
$$\big( \hat{R}_i^t, \hat{t}_i^t \big) = \mathop{\arg\min}_{R, t} \sum_{(p_j, q_j) \in \hat{G}_i^t} \big\| R p_j + t - q_j \big\|_2^2$$
Using the estimated local transformation $( \hat{R}_i^t, \hat{t}_i^t )$, each source point $p$ in the local source point cloud $P_i^t$ is mapped to the target side, and the nearest point $q$ in the target local point cloud $Q_i^t$ satisfying the distance constraint $\| \hat{R}_i^t p + \hat{t}_i^t - q \|_2 \leq \theta_d$ is searched, thereby updating the local correspondence set as $G_i^t \leftarrow G_i^t \cup \{ (p, q) \}$. The corrected results from all local regions are deduplicated and merged into the global set $G^t$ via a hash table, followed by a global screening step using second-order geometric consistency to complete one round of optimization. For rigid registration, global pose estimation is likewise performed via SVD, with the difference that it operates on the final global correspondence set $G^t$ rather than individual local neighborhoods. For non-rigid registration, the resulting correspondences are fed into GraphSCNet [28] for final smooth point cloud alignment.
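The closed-form SVD solve used for both the local and the global pose estimates is the standard weighted Kabsch/Procrustes procedure; a minimal NumPy version (uniform weights by default, the function name is ours):

```python
import numpy as np

def weighted_svd_transform(P, Q, w=None):
    """Least-squares rigid transform (R, t) with R @ P[i] + t ~ Q[i],
    solved in closed form via SVD (Kabsch), optionally weighted."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    w = np.ones(len(P)) if w is None else np.asarray(w, float)
    w = w / w.sum()
    mu_p, mu_q = w @ P, w @ Q                      # weighted centroids
    H = (P - mu_p).T @ ((Q - mu_q) * w[:, None])   # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t
```

In the local refinement step, the weights would come from the compatibility scores of $\hat{G}_i^t$; the determinant correction keeps $R \in SO(3)$.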

4. Experiments

4.1. Experimental Setup

The proposed method is evaluated on both rigid and non-rigid benchmark datasets. 3DMatch and 3DLoMatch are indoor rigid registration benchmarks constructed from RGB-D reconstructions, comprising 62 scenes in total. 3DMatch contains scan pairs with overlap ratios greater than 30%, while 3DLoMatch covers low-overlap scenarios with overlap ratios between 10% and 30%. 4DMatch and 4DLoMatch are built upon animation sequences from DeformingThings4D, comprising 1972 segments with dense deformation annotations, and are further divided into high-overlap and low-overlap subsets based on an overlap ratio threshold of 45%.
The proposed method is implemented in PyTorch 2.5.1 and trained on an NVIDIA RTX 3090 GPU. The AdamW optimizer is adopted with an initial learning rate of 1 × 10−4, a weight decay of 1 × 10−4, and a batch size of 1. The number of stacked layers in the hybrid enhancement module is set to L = 3. For the local rematching module, the number of iterations is set to T = 4 for rigid scenarios, with the number of k-nearest neighbors decreasing progressively from 100 to 4; for non-rigid scenarios, the number of iterations is set to T = 2, with the number of k-nearest neighbors decreasing progressively from 15 to 4.

4.2. Evaluation Metrics

Different evaluation metrics are adopted for the rigid and non-rigid benchmarks. On the 4DMatch and 4DLoMatch benchmarks, two core metrics are employed for quantitative evaluation. Inlier Ratio (IR) measures the proportion of predicted correspondences whose error falls below 0.04 m under the ground-truth warp function. Non-rigid Feature Matching Recall (NFMR) measures the proportion of ground-truth correspondences that are successfully recovered after sparse scene flow field interpolation and propagation.
To further validate the effectiveness of the correspondences generated by the proposed method, the correspondences are fed into GraphSCNet [28] for supplementary non-rigid registration experiments, evaluated using four metrics. The 3D End Point Error (EPE) computes the mean Euclidean norm of the 3D warp error vectors over all points. The 3D Accuracy Strict (AccS) measures the proportion of points whose relative error is less than 2.5%. The 3D Accuracy Relaxed (AccR) measures the proportion of points whose relative error is less than 5%. The Outlier Ratio (OR) measures the proportion of points whose relative error exceeds 30%.
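These four non-rigid metrics can be computed in a few lines; a sketch following the definitions above, assuming (as is common for these metrics) that the relative error is normalized by the ground-truth flow magnitude:

```python
import numpy as np

def nonrigid_metrics(pred_flow, gt_flow, eps=1e-12):
    """EPE / AccS / AccR / OR: end-point error plus relative-error
    thresholds of 2.5%, 5%, and 30%, normalized by the ground-truth
    flow magnitude (an assumption of this sketch)."""
    err = np.linalg.norm(pred_flow - gt_flow, axis=1)            # per-point error
    rel = err / np.maximum(np.linalg.norm(gt_flow, axis=1), eps)
    return {"EPE": err.mean(),
            "AccS": (rel < 0.025).mean(),
            "AccR": (rel < 0.05).mean(),
            "OR": (rel > 0.30).mean()}
```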
On the 3DMatch and 3DLoMatch benchmarks, five metrics are employed. Inlier Ratio (IR) measures the proportion of predicted correspondences whose error falls below 0.1 m under the ground-truth transformation. Feature Matching Recall (FMR) reports the fraction of point cloud pairs whose inlier ratio exceeds 5%, reflecting the quality of feature descriptors. Registration Recall (RR) evaluates the final registration performance, measuring the fraction of point cloud pairs that are successfully registered by the algorithm. Additionally, Relative Rotation Error (RRE) and Relative Translation Error (RTE) are reported to measure the geometric precision of the estimated transformations, computed over successfully registered pairs only.
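The correspondence-level rigid metrics above reduce to simple thresholded residual counts; an illustrative NumPy sketch using the thresholds stated in the text (0.1 m for IR, 5% for FMR; the function names are ours):

```python
import numpy as np

def inlier_ratio(src, tgt, R, t, tau=0.1):
    """IR: fraction of correspondences whose residual under the
    ground-truth transform (R, t) is below tau (0.1 m on 3DMatch)."""
    residual = np.linalg.norm(src @ R.T + t - tgt, axis=1)
    return (residual < tau).mean()

def feature_matching_recall(inlier_ratios, tau_fmr=0.05):
    """FMR: fraction of point cloud pairs whose IR exceeds 5%."""
    return (np.asarray(inlier_ratios) > tau_fmr).mean()
```

RR, RRE, and RTE additionally require comparing the estimated transform itself against the ground truth, and are therefore computed per pair rather than per correspondence.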

4.3. Non-Rigid Benchmark Evaluation

Table 1 presents a quantitative comparison of the proposed method against state-of-the-art approaches on the 4DMatch and 4DLoMatch benchmarks for non-rigid correspondence estimation. The compared methods include two scene flow-based approaches, PointPWC [30] and FLOT [31], as well as several deep feature learning-based methods, including Predator [14], Lepard [29], GeoTransformer [18], RoITr [15], and Diff-Reg [32].
On the high-overlap 4DMatch subset, the proposed method achieves the best performance on both IR and NFMR. On the low-overlap 4DLoMatch subset, the proposed method attains an NFMR of 78.63%, surpassing all compared methods. In low-overlap scenarios, the number of initial correct correspondences is extremely limited. The proposed prior-guided local search progressively expands the set of high-confidence inliers outward, fundamentally alleviating the problem of inlier scarcity. This advantage is particularly prominent on the recall-oriented NFMR metric. Although the IR of the proposed method (66.40%) is slightly lower than that of Diff-Reg (67.80%), the latter is based on a diffusion model with a multi-step denoising mechanism that enforces stricter precision control on individual correspondences. Nevertheless, the lower NFMR of Diff-Reg [32] indicates that the proposed augmentation strategy covers a broader range of ground-truth correspondences, which better aligns with the recall-oriented evaluation objective.
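The outward-expansion idea can be illustrated with a heavily simplified sketch. This is our illustration of the principle only, not the paper's exact LRS/FNM implementation: the function name, the brute-force neighbor search, and the plain mutual-nearest-neighbor check are all our own simplifications.

```python
import numpy as np

def expand_correspondences(src_xyz, tgt_xyz, src_feat, tgt_feat, seeds, k_schedule=(15, 4)):
    """Iteratively grow a sparse set of seed correspondences: around each current
    inlier (i, j), search only the spatial k-NN neighborhoods on both sides and
    keep mutual feature nearest neighbors as new correspondences."""
    corr = set(seeds)
    for k in k_schedule:  # neighborhood size shrinks across iterations
        new = set()
        for (i, j) in corr:
            # spatial k-NN around the current inlier in source and target
            ni = np.argsort(np.linalg.norm(src_xyz - src_xyz[i], axis=1))[:k]
            nj = np.argsort(np.linalg.norm(tgt_xyz - tgt_xyz[j], axis=1))[:k]
            # pairwise feature distances restricted to the local neighborhoods
            d = np.linalg.norm(src_feat[ni][:, None] - tgt_feat[nj][None], axis=2)
            fwd = d.argmin(axis=1)   # local source -> target nearest neighbor
            bwd = d.argmin(axis=0)   # local target -> source nearest neighbor
            for a, b in enumerate(fwd):
                if bwd[b] == a:      # mutual nearest neighbor check
                    new.add((int(ni[a]), int(nj[b])))
        corr |= new
    return sorted(corr)
```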
Table 2 summarizes the comparison of registration errors across different methods after GraphSCNet [28] post-processing. Although the proposed method does not surpass Diff-Reg [32] on AccR and Outlier on 4DMatch, it achieves superior results on 4DLoMatch, further confirming that the augmentation strategy of the proposed method yields more significant gains in challenging low-overlap scenarios. The generated correspondences provide better geometric coverage, substantially reducing the optimization difficulty for the post-processing algorithm and enabling it to converge more reliably to the correct deformation solution.
Furthermore, Table 3 presents a comparison of the registration performance obtained by combining the proposed method’s correspondences with various non-rigid registration backends. The results demonstrate that the combination of the proposed method with GraphSCNet [28] achieves the best performance across all evaluated metrics. This is because its graph-based spatial consistency mechanism is most capable of exploiting dense, high-quality correspondences, and is naturally complementary to the augmentation strategy of the proposed method, thereby achieving optimal end-to-end registration performance. These results validate both the high quality and broad applicability of the generated correspondences. Qualitative registration results are shown in Figure 5.

4.4. Rigid Benchmark Evaluation

Table 4 summarizes the comparison of the proposed method against state-of-the-art approaches on the rigid registration benchmarks. On the high-overlap 3DMatch subset, the proposed method achieves the best FMR (98.50%) and RR (92.62%), with a competitive IR (82.78%). On the low-overlap 3DLoMatch subset, it attains the highest FMR (89.80%) among all compared methods, and its IR (57.23%) remains strongly competitive, reflecting the effectiveness of the proposed feature descriptors under extremely low overlap and the advantage of the local rematching strategy in augmenting inliers from a sparse initial set. The RR (69.43%), however, lags behind some competing methods, primarily because the proposed method prioritizes increasing the number of inliers and maximizing recall coverage during correspondence augmentation. Since low-overlap scenarios place high demands on the absolute accuracy of pose estimation, solving for the transformation directly via SVD on the augmented correspondences, without a dedicated pose refinement step, leads to registration failures in certain boundary cases. This is a primary direction for future improvement.
Table 5 further reports RRE and RTE to complement the registration recall metric in Table 4. While some baseline methods rely on RANSAC for pose estimation and others adopt RANSAC-free estimators such as LGR or weighted SVD, our method directly solves for the transformation on the final augmented correspondences via SVD, yet still achieves competitive RRE and RTE on both benchmarks, validating that the local rematching strategy produces sufficiently high-quality correspondences for precise pose estimation.
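The RANSAC-free SVD solve referred to above is the classical Kabsch closed form. A minimal unweighted sketch (the paper applies it to the final augmented correspondence set):

```python
import numpy as np

def kabsch(P, Q):
    """Closed-form rigid transform (R, t) minimizing ||R @ p_i + t - q_i|| over
    correspondences P[i] <-> Q[i], both (N, 3), via SVD (Kabsch algorithm)."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)                    # 3x3 cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # correct for a possible reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```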

4.5. Ablation Studies

To validate the effectiveness of each core module, six ablation experiments are designed on both rigid and non-rigid datasets, with results summarized in Table 6 and Table 7. Experiment 1 removes the MCA module entirely, replacing Mamba-based global encoding with conventional self-attention; the comparable accuracy (IR: 86.30% vs. 87.70% on 4DMatch, FMR: 98.10% vs. 98.50% on 3DMatch) demonstrates that Mamba preserves modeling capability while reducing computational complexity from quadratic to linear. Experiment 2 removes the rematching stage, resulting in the most significant performance degradation (NFMR drops from 78.63% to 72.69% on 4DLoMatch, RR drops from 69.43% to 69.34% on 3DLoMatch), which confirms that local rematching is the core module for breaking through the bottleneck of insufficient initial inliers in low-overlap scenarios. Experiment 3 removes FNM, causing a decline across all metrics (NFMR: 75.47% vs. 78.63% on 4DLoMatch, RR: 68.26% vs. 69.43% on 3DLoMatch), validating the critical role of FNM in stably generating candidate correspondences when local feature discriminability is insufficient in small neighborhood regions.
To investigate the impact of different serialization strategies, Experiment 4 replaces Z-order serialization with XYZ-order, resulting in a performance drop compared to Experiment 6 (NFMR: 78.58% vs. 78.63% on 4DLoMatch, RR: 68.71% vs. 69.43% on 3DLoMatch), which indicates that axis-aligned sorting fails to provide Mamba with a geometrically coherent input, as it preserves spatial locality only along individual coordinate dimensions. Experiment 5 replaces Z-order serialization with Hilbert curve serialization, yielding slightly lower performance across all metrics (NFMR: 78.60% vs. 78.63% on 4DLoMatch, RR: 69.22% vs. 69.43% on 3DLoMatch); although the Hilbert curve also possesses locality-preserving properties, Z-order provides a stronger multi-scale locality guarantee through its Morton code structure, where points within the same sub-cube share identical high-order bits and are therefore guaranteed to be contiguous in the serialized sequence at every resolution level. Experiment 6 represents the full model, which achieves the best performance on all metrics across all datasets, demonstrating the necessity of the collaborative design of each module in the proposed method.
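The multi-scale locality property of Morton codes discussed above can be seen directly in the bit interleaving. The following is a generic sketch of Z-order encoding for quantized coordinates, independent of the paper's implementation:

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of non-negative integer coordinates (x, y, z) into a
    Morton (Z-order) code. Points in the same octree sub-cube share the same
    high-order code bits, so they stay contiguous when sorted, at every level."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x bit i -> code position 3i
        code |= ((y >> i) & 1) << (3 * i + 1)    # y bit i -> code position 3i + 1
        code |= ((z >> i) & 1) << (3 * i + 2)    # z bit i -> code position 3i + 2
    return code
```

Sorting points by the Morton code of their quantized coordinates yields the Z-order serialization; substituting a Hilbert-curve index for the interleaving gives the variant of Experiment 5.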
To further validate the efficiency of the proposed method, we report inference time and GPU memory consumption at varying downsampling ratios in Table 8. As point cloud density increases to 1/8, RoITr exhausts the available GPU memory, while MaLCA remains feasible; MaLCA (full) consumes substantially less memory than GeoTransformer (4915 MB vs. 12,297 MB), owing to the linear complexity of the Mamba encoder. Regarding the LRS overhead, comparing MaLCA (full) against MaLCA (w/o LRS) shows that LRS adds 163 ms and 355 MB on average, a modest cost that is outweighed by the memory savings from Mamba and the accuracy gains demonstrated in the ablation study, particularly in low-overlap scenarios. It is also worth noting that the memory growth from 133 MB (1/64) to 4915 MB (1/8) reflects the increasing dominance of the cross-attention cost as point cloud density rises; replacing it with a more efficient alternative remains a direction for future work.

5. Conclusions

In this work, we address two fundamental challenges in point cloud registration under low-overlap scenarios: computational efficiency bottlenecks and insufficient correspondence quality. We propose MaLCA, a point cloud registration method based on Mamba-enhanced features and local correspondence augmentation. By replacing the conventional self-attention mechanism with a Mamba selective state space model in conjunction with Z-order spatial serialization, the complexity of global context modeling is reduced to linear, while cross-attention is retained to enable efficient inter-point-cloud feature interaction. Subsequently, a prior-guided local rematching strategy combined with a fused neighbor matching mechanism replaces the traditional outlier removal paradigm, fundamentally resolving the problem of sparse initial inliers through iterative correspondence augmentation. Extensive experiments on both rigid and non-rigid benchmarks demonstrate that the proposed method achieves competitive registration performance, effectively balancing accuracy and efficiency.
Nevertheless, the proposed method still has certain limitations. When processing raw high-density point clouds, the method continues to rely on a preprocessing downsampling step, which to some extent restricts the preservation of fine geometric details. Future work will explore more efficient multi-resolution feature interaction mechanisms to enable feature enhancement and correspondence establishment directly at finer-grained point cloud levels, further improving the applicability of the method in high-precision scenarios.

Author Contributions

Methodology, Y.H., F.X., and L.Z.; software, Y.H., H.G., and J.G.; validation, L.K. and X.H.; writing—original draft, Y.H. and F.X.; writing—review and editing, F.X. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shanxi Province (No. 202203021212138), the National Natural Science Foundation of China (No. 62272426), the Foundation of Shanxi Key Laboratory of Machine Vision and Virtual Reality (No. 447-110103), and the Shanxi Province Science and Technology Major Special Plan “Reveal the List” project (No. 202201150401021).

Data Availability Statement

The datasets used in this study are publicly available. The 3DMatch/3DLoMatch benchmark is available at https://3dmatch.cs.princeton.edu/ (accessed on 1 June 2021), and the 4DMatch/4DLoMatch benchmark is constructed from the DeformingThings4D dataset, which is available at https://github.com/rabbityl/DeformingThings4D (accessed on 13 July 2020).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Tosi, F.; Zhang, Y.; Gong, Z.; Mattoccia, S.; Oswald, M.R.; Sandstrom, E.; Poggi, M. How Nerfs and 3D Gaussian Splatting Are Reshaping Slam: A Survey. IEEE Trans. Robot. 2026, 42, 1405–1427. [Google Scholar] [CrossRef]
  2. Barta, S.; Gurrea, R.; Flavián, C. Augmented Reality Experiences: Consumer-centered Augmented Reality Framework and Research Agenda. Psychol. Mark. 2025, 42, 634–650. [Google Scholar] [CrossRef]
  3. Gao, Y.; Piccinini, M.; Zhang, Y.; Wang, D.; Moller, K.; Brusnicki, R.; Zarrouki, B.; Gambi, A.; Totz, J.F.; Storms, K. Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis. IEEE Open J. Intell. Transp. Syst. 2026. [Google Scholar] [CrossRef]
  4. Chen, Y.; Gu, C.; Jiang, J.; Zhu, X.; Zhang, L. Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-Time Rendering. Int. J. Comput. Vis. 2026, 134, 83. [Google Scholar] [CrossRef]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  6. Sun, K.; Zhou, J.; Wang, M.; Wang, Z. S2 Mamba: An Efficient Mamba Accelerator with Word-Importance SSM Sparsity. IEEE Trans. Circuits Syst. I Regul. Pap. 2026, 73, 3424–3437. [Google Scholar] [CrossRef]
  7. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  8. Chou, Y.; Yao, M.; Wang, K.; Pan, Y.; Zhu, R.; Zhong, Y.; Qiao, Y.; Wu, J.; Xu, B.; Li, G. Metala: Unified Optimal Linear Approximation to Softmax Attention Map. Adv. Neural Inf. Process. Syst. 2024, 37, 71034–71067. [Google Scholar]
  9. Zhang, X.; Yang, J.; Zhang, S.; Zhang, Y. 3D Registration with Maximal Cliques. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17745–17754. [Google Scholar]
  10. Liang, D.; Zhou, X.; Xu, W.; Zhu, X.; Zou, Z.; Ye, X.; Tan, X.; Bai, X. Pointmamba: A Simple State Space Model for Point Cloud Analysis. Adv. Neural Inf. Process. Syst. 2024, 37, 32653–32677. [Google Scholar]
  11. Rusu, R.B.; Blodow, N.; Beetz, M. Fast Point Feature Histograms (FPFH) for 3D Registration. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation; IEEE: New York, NY, USA, 2009; pp. 3212–3217. [Google Scholar]
  12. Salti, S.; Tombari, F.; Di Stefano, L. SHOT: Unique Signatures of Histograms for Surface and Texture Description. Comput. Vis. Image Underst. 2014, 125, 251–264. [Google Scholar] [CrossRef]
  13. Choy, C.; Park, J.; Koltun, V. Fully Convolutional Geometric Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8958–8966. [Google Scholar]
  14. Huang, S.; Gojcic, Z.; Usvyatsov, M.; Wieser, A.; Schindler, K. Predator: Registration of 3d Point Clouds with Low Overlap. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4267–4276. [Google Scholar]
  15. Yu, H.; Qin, Z.; Hou, J.; Saleh, M.; Li, D.; Busam, B.; Ilic, S. Rotation-Invariant Transformer for Point Cloud Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5384–5393. [Google Scholar]
  16. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  17. Yu, H.; Li, F.; Saleh, M.; Busam, B.; Ilic, S. Cofinet: Reliable Coarse-to-Fine Correspondences for Robust Pointcloud Registration. Adv. Neural Inf. Process. Syst. 2021, 34, 23872–23884. [Google Scholar]
  18. Qin, Z.; Yu, H.; Wang, C.; Guo, Y.; Peng, Y.; Ilic, S.; Hu, D.; Xu, K. Geotransformer: Fast and Robust Point Cloud Registration with Geometric Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9806–9821. [Google Scholar] [CrossRef] [PubMed]
  19. Fu, K.; Yuan, M.; Wang, C.; Pang, W.; Chi, J.; Wang, M.; Gao, L. Dual Focus-Attention Transformer for Robust Point Cloud Registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 11769–11778. [Google Scholar]
  20. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2022, arXiv:2111.00396. [Google Scholar] [CrossRef]
  21. Bahri, A.; Yazdanpanah, M.; Noori, M.; Dastani, S.; Cheraghalikhani, M.; Hakim, G.A.V.; Osowiechi, D.; Beizaee, F.; Ben Ayed, I.; Desrosiers, C. Spectral Informed Mamba for Robust Point Cloud Processing. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 11799–11809. [Google Scholar]
  22. Han, X.; Tang, Y.; Wang, Z.; Li, X. MAMBA3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model. In Proceedings of the 32nd ACM International Conference on Multimedia; ACM: Melbourne, VIC, Australia, 2024; pp. 4995–5004. [Google Scholar]
  23. Liu, B.; Liu, A.; Chen, H.; Tao, H.; Cui, J.; Wang, Y.; Zhang, H. MT-PCR: Hybrid Mamba-Transformer Network with Spatial Serialization for Point Cloud Registration. arXiv 2026, arXiv:2506.13183. [Google Scholar]
  24. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  25. Yang, H.; Shi, J.; Carlone, L. Teaser: Fast and Certifiable Point Cloud Registration. IEEE Trans. Robot. 2020, 37, 314–333. [Google Scholar] [CrossRef]
  26. Chen, Z.; Sun, K.; Yang, F.; Tao, W. Sc2-Pcr: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13221–13231. [Google Scholar]
  27. Bai, X.; Luo, Z.; Zhou, L.; Chen, H.; Li, L.; Hu, Z.; Fu, H.; Tai, C.-L. Pointdsc: Robust Point Cloud Registration Using Deep Spatial Consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 15859–15869. [Google Scholar]
  28. Qin, Z.; Yu, H.; Wang, C.; Peng, Y.; Xu, K. Deep Graph-Based Spatial Consistency for Robust Non-Rigid Point Cloud Registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5394–5403. [Google Scholar]
  29. Li, Y.; Harada, T. Lepard: Learning Partial Point Cloud Matching in Rigid and Deformable Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5554–5564. [Google Scholar]
  30. Wu, W.; Wang, Z.Y.; Li, Z.; Liu, W.; Fuxin, L. PointPWC-Net: Cost Volume on Point Clouds for (Self-)Supervised Scene Flow Estimation. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12350, pp. 88–107. [Google Scholar]
  31. Puy, G.; Boulch, A.; Marlet, R. FLOT: Scene Flow on Point Clouds Guided by Optimal Transport. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12373, pp. 527–544. [Google Scholar]
  32. Wu, Q.; Jiang, H.; Luo, L.; Li, J.; Ding, Y.; Xie, J.; Yang, J. Diff-Reg v1: Diffusion Matching Model for Registration Problem. arXiv 2024, arXiv:2403.19919. [Google Scholar] [CrossRef]
  33. Donati, N.; Sharma, A.; Ovsjanikov, M. Deep Geometric Functional Maps: Robust Feature Learning for Shape Correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8592–8601. [Google Scholar]
  34. Huang, J.; Birdal, T.; Gojcic, Z.; Guibas, L.J.; Hu, S.-M. Multiway Non-Rigid Point Cloud Registration via Learned Functional Map Synchronization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2038–2053. [Google Scholar] [CrossRef] [PubMed]
  35. Newcombe, R.A.; Fox, D.; Seitz, S.M. Dynamicfusion: Reconstruction and Tracking of Non-Rigid Scenes in Real-Time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 343–352. [Google Scholar]
  36. Li, X.; Kaesemodel Pontes, J.; Lucey, S. Neural Scene Flow Prior. Adv. Neural Inf. Process. Syst. 2021, 34, 7838–7851. [Google Scholar]
  37. Park, K.; Sinha, U.; Barron, J.T.; Bouaziz, S.; Goldman, D.B.; Seitz, S.M.; Martin-Brualla, R. Nerfies: Deformable Neural Radiance Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 19–25 June 2021; pp. 5865–5874. [Google Scholar]
  38. Li, Y.; Harada, T. Non-Rigid Point Cloud Registration with Neural Deformation Pyramid. Adv. Neural Inf. Process. Syst. 2022, 35, 27757–27768. [Google Scholar]
  39. Chen, H.; Yan, P.; Xiang, S.; Tan, Y. Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 21698–21707. [Google Scholar]
Figure 1. Overall architecture of the proposed MaLCA framework. Input point clouds P and Q are processed by KPFCN for initial feature extraction, refined by the Mamba–Cross-Attention (MCA) module, and then matched to obtain initial correspondences. The local rematching strategy (LRS) iteratively augments high-quality inliers guided by spatial priors to produce the final correspondence sets.
Figure 2. Architecture of the pruned KPFCN backbone: (a) overall encoder–decoder structure; (b) internal structures of ResNetA (top) and ResNetB (bottom).
Figure 3. Illustration of the cross-attention computation. Keys and values are projected from the target features F̃_Q, and queries from the source features F̃_P. After dot-product attention, Softmax, MLP, and residual normalization, enhanced source features F̂_P are produced.
Figure 4. Illustration of the fused neighbor matching (FNM) mechanism: NN and MNN matching matrices computed in both directions are integrated to yield M_FNM.
Figure 5. Qualitative comparison on the 4DMatch benchmark. Green/red lines denote correct/incorrect correspondences, respectively; right panels show GraphSCNet [28] registration results. (a) Lepard [29]; (b) GeoTR [18]; (c) Ours.
Table 1. Comparison of non-rigid correspondence estimation on 4DMatch and 4DLoMatch Benchmarks.
| Methods | 4DMatch IR ↑ | NFMR ↑ | 4DLoMatch IR ↑ | NFMR ↑ |
|---|---|---|---|---|
| PointPWC [30] | 20.00 | 21.60 | 7.20 | 10.00 |
| FLOT [31] | 24.90 | 27.10 | 10.70 | 15.20 |
| Predator [14] | 60.40 | 56.40 | 27.50 | 32.10 |
| Lepard [29] | 82.64 | 83.60 | 55.55 | 66.63 |
| GeoTR [18] | 82.20 | 83.20 | 63.60 | 65.40 |
| RoITr [15] | 84.40 | 83.00 | 67.60 | 69.40 |
| Diff-Reg [32] | 86.41 | 88.40 | 67.80 | 76.23 |
| Ours | 87.70 | 91.08 | 66.40 | 78.63 |
Table 2. Comparison of different correspondence inputs with GraphSCNet [28] post-processing.
| Methods | 4DMatch EPE ↓ | AccS ↑ | AccR ↑ | Outlier ↓ | 4DLoMatch EPE ↓ | AccS ↑ | AccR ↑ | Outlier ↓ |
|---|---|---|---|---|---|---|---|---|
| FLOT | 0.133 | 37.66 | 27.15 | 40.49 | 0.210 | 2.73 | 13.08 | 42.51 |
| GeomFmaps [33] | 0.152 | 12.34 | 32.56 | 37.90 | 0.148 | 1.85 | 6.51 | 64.63 |
| Synorim-pw [34] | 0.099 | 22.91 | 49.86 | 26.01 | 0.170 | 10.55 | 30.17 | 31.12 |
| Lepard [29] + G [28] | 0.042 | 70.10 | 83.80 | 9.20 | 0.102 | 40.00 | 59.10 | 17.50 |
| GeoTR [18] + G [28] | 0.043 | 72.10 | 84.30 | 9.50 | 0.119 | 41.00 | 58.40 | 20.60 |
| RoITr [15] + G [28] | 0.056 | 59.60 | 80.50 | 12.50 | 0.118 | 32.30 | 56.70 | 20.50 |
| Diff-Reg [32] + G [28] | 0.041 | 73.20 | 85.80 | 8.30 | 0.095 | 43.80 | 62.90 | 15.50 |
| Ours + G [28] | 0.039 | 73.90 | 85.60 | 8.40 | 0.093 | 45.60 | 63.10 | 15.30 |
Table 3. Registration performance of our correspondences combined with different non-rigid registration methods.
| Methods | 4DMatch EPE ↓ | AccS ↑ | AccR ↑ | Outlier ↓ | 4DLoMatch EPE ↓ | AccS ↑ | AccR ↑ | Outlier ↓ |
|---|---|---|---|---|---|---|---|---|
| Ours + NICP [35] | 0.099 | 50.00 | 62.90 | 25.20 | 0.240 | 12.70 | 22.80 | 50.10 |
| Ours + NSFP [36] | 0.128 | 55.21 | 61.15 | 21.87 | 0.163 | 29.73 | 55.04 | 28.37 |
| Ours + Nerfies [37] | 0.132 | 56.67 | 67.37 | 19.48 | 0.157 | 34.48 | 59.83 | 22.61 |
| Ours + NDP [38] | 0.056 | 73.81 | 79.66 | 13.62 | 0.162 | 35.26 | 57.53 | 23.48 |
| Ours + G [28] | 0.039 | 73.90 | 85.60 | 8.40 | 0.093 | 45.60 | 63.10 | 15.30 |
Table 4. Comparison of rigid registration results on 3DMatch and 3DLoMatch benchmarks.
| Methods | 3DMatch IR ↑ | FMR ↑ | RR ↑ | 3DLoMatch IR ↑ | FMR ↑ | RR ↑ |
|---|---|---|---|---|---|---|
| Predator [14] | 58.00 | 96.70 | 91.80 | 26.70 | 78.60 | 62.40 |
| Lepard [29] | 57.61 | 97.95 | 93.90 | 27.83 | 84.22 | 70.63 |
| GeoTR [18] | 71.90 | 97.90 | 92.00 | 43.50 | 88.30 | 75.00 |
| RoITr [15] | 82.60 | 98.00 | 91.90 | 54.30 | 89.60 | 74.80 |
| DCATr [39] | 84.70 | 98.40 | 92.20 | 57.90 | 87.70 | 75.70 |
| Ours | 82.78 | 98.50 | 92.62 | 57.23 | 89.80 | 69.43 |
Table 5. Relative rotation error (RRE) and relative translation error (RTE) on 3DMatch and 3DLoMatch benchmarks.
| Methods | Estimator | 3DMatch RRE (°) ↓ | RTE (m) ↓ | 3DLoMatch RRE (°) ↓ | RTE (m) ↓ |
|---|---|---|---|---|---|
| FCGF [13] | RANSAC | 1.949 | 0.066 | 3.147 | 0.100 |
| Predator [14] | RANSAC | 2.029 | 0.064 | 3.048 | 0.093 |
| CoFiNet [17] | RANSAC | 2.002 | 0.064 | 3.271 | 0.090 |
| GeoTR [18] | RANSAC-free | 1.625 | 0.053 | 2.547 | 0.074 |
| DCATr [39] | RANSAC-free | 1.536 | 0.050 | 2.445 | 0.072 |
| Ours | RANSAC-free | 1.533 | 0.048 | 2.453 | 0.075 |
Table 6. Ablation study of key modules on 4DMatch and 4DLoMatch.
| No. | Configuration | 4DMatch IR ↑ | NFMR ↑ | 4DLoMatch IR ↑ | NFMR ↑ |
|---|---|---|---|---|---|
| 1 | w/o MCA (self-attention) | 86.30 | 90.14 | 65.81 | 77.92 |
| 2 | w/o LRS | 85.91 | 90.62 | 63.57 | 72.69 |
| 3 | w/o FNM | 86.95 | 90.77 | 65.83 | 75.47 |
| 4 | XYZ-order serialization | 86.24 | 90.63 | 66.16 | 78.58 |
| 5 | Hilbert serialization | 87.03 | 90.65 | 65.92 | 78.60 |
| 6 | Full model (Z-order) | 87.70 | 91.08 | 66.40 | 78.63 |
Table 7. Ablation study of key modules on 3DMatch and 3DLoMatch.
| No. | Configuration | 3DMatch IR ↑ | FMR ↑ | RR ↑ | 3DLoMatch IR ↑ | FMR ↑ | RR ↑ |
|---|---|---|---|---|---|---|---|
| 1 | w/o MCA (self-attention) | 84.97 | 98.10 | 92.47 | 57.81 | 88.30 | 72.36 |
| 2 | w/o LRS | 78.93 | 97.60 | 90.84 | 55.76 | 88.60 | 69.34 |
| 3 | w/o FNM | 81.35 | 98.10 | 91.88 | 56.29 | 88.90 | 68.26 |
| 4 | XYZ-order serialization | 81.14 | 98.30 | 92.43 | 57.15 | 89.20 | 68.71 |
| 5 | Hilbert serialization | 82.26 | 98.20 | 92.47 | 57.18 | 88.90 | 69.22 |
| 6 | Full model (Z-order) | 82.78 | 98.50 | 92.62 | 57.23 | 89.80 | 69.43 |
Table 8. Efficiency comparison at varying downsampling ratios on 3DMatch.
| Method | GPU Memory (MB) ↓ 1/8 | 1/16 | 1/32 | 1/64 | Avg. Time (ms) ↓ |
|---|---|---|---|---|---|
| RoITr [15] | OOM | 10,391 | 3005 | 938 | 1371 |
| GeoTR [18] | 12,297 | 2691 | 706 | 238 | 595 |
| Ours w/o LRS | 4007 | 952 | 284 | 104 | 474 |
| Ours (full) | 4915 | 1307 | 372 | 133 | 637 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huo, Y.; Zhang, L.; Guo, H.; Gong, J.; Kuang, L.; Han, X.; Xiong, F. MaLCA: Point Cloud Registration with Mamba-Enhanced Features and Local Correspondence Augmentation. Algorithms 2026, 19, 380. https://doi.org/10.3390/a19050380


