Bi-Directional Point Flow Estimation with Multi-Scale Attention for Deformable Lung CT Registration

Lee, Nahyuk; Lee, Taemin

doi:10.3390/app15095166



by Nahyuk Lee ¹ and Taemin Lee ²,*

¹ VUNO Inc., Seoul 06541, Republic of Korea
² Department of Electronic and AI System Engineering, Kangwon National University, Samcheok 25913, Republic of Korea
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 5166; https://doi.org/10.3390/app15095166

Submission received: 7 April 2025 / Revised: 30 April 2025 / Accepted: 2 May 2025 / Published: 6 May 2025

(This article belongs to the Topic Applied System on Biomedical Engineering, Healthcare and Sustainability 2024)


Abstract

Deformable lung CT registration plays a crucial role in clinical applications such as respiratory motion tracking, disease progression analysis, and radiotherapy planning. While voxel-based registration has traditionally dominated this domain, it often suffers from high computational costs and sensitivity to intensity variations. In this work, we propose a novel point-based deformable registration framework tailored to the unique challenges of lung CT alignment. Our approach combines geometric keypoint attention at coarse resolutions, which strengthens global correspondence, with attention-based refinement modules at finer scales that accurately model subtle anatomical deformations. Furthermore, we adopt a bi-directional training strategy that enforces forward and backward consistency through cycle supervision, promoting anatomically coherent transformations. We evaluate our method on the large-scale Lung250M benchmark and achieve state-of-the-art results, significantly surpassing existing voxel-based and point-based baselines in target registration error. These findings highlight the potential of sparse geometric modeling for complex respiratory motion and establish a strong foundation for future point-based deformable registration in thoracic imaging.

Keywords:

deformable lung CT registration; bi-directional flow estimation; multi-scale attention

1. Introduction

Deformable registration of lung CT scans plays a pivotal role in real-world clinical workflows, including respiratory motion compensation [1,2], longitudinal disease monitoring [3,4,5], and radiation therapy planning [6,7]. Unlike rigid registration methods [8,9,10,11] that assume static global transformations, deformable lung CT registration accounts for the complex non-linear anatomical changes induced by respiration, enabling precise spatial alignment across different breathing phases.

Traditionally, voxel-based registration has been the dominant paradigm for lung CT registration. These methods optimize spatial transformation fields using intensity-based similarity metrics, such as mutual information [12,13] or normalized cross-correlation [14]. While voxel-level registration provides dense deformation estimates, it suffers from two critical limitations: (1) high computational and memory costs due to the dense volumetric nature of CT images [15] and (2) sensitivity to intensity inconsistencies caused by artifacts, noise, and contrast variability [7,16].

Recent advances in geometric deep learning and 3D point cloud representations have enabled a new class of point-based deformable registration, offering computational efficiency and robustness to variations in intensity [17,18,19]. These methods typically leverage sparse surface-based representations and geometric feature extraction to estimate dense deformation fields. However, they have not addressed the unique challenges of lung CT registration, where large, non-rigid respiratory deformations and fine-scale anatomical details must be jointly handled.

Against this backdrop, the recent release of the Lung250M dataset [16], which provides high-resolution lung CT point clouds across different respiratory phases, has enabled foundational studies of point-based registration for lung CT. Unlike general registration tasks [17,18,19], lung CT registration requires precise modeling of both the global anatomical structure (e.g., inter-lobar coherence) and local correspondences across complex structures such as bronchial trees and vascular bifurcations. Clinically, reliable registration of these structures is essential for tracking disease progression in COPD, measuring treatment response in pulmonary fibrosis, and accumulating radiation dose across respiratory cycles. These demands call for a multi-scale, context-aware framework that simultaneously captures long-range consistency and fine-level alignment.

In this work, we propose a novel point-based deformable lung CT registration framework designed to combine coarse-to-fine flow refinements with bi-directional geometric consistency. Specifically, our approach introduces three key innovations:

Geometric keypoint attention at coarse resolutions to enhance global-structure-aware feature matching across widely separated regions;
Context-aware flow refinement at finer levels via attention-based feature interactions, enabling precise modeling of subtle local deformations;
A bi-directional registration strategy that jointly estimates the forward and backward flow fields, enforcing transformation consistency and improving the anatomical plausibility.

By incorporating these components into a unified architecture, the proposed method achieves accurate, stable, and anatomically consistent alignment of lung CT point clouds, significantly improving over the existing point-based baselines [17,20]. Moreover, our method’s use of vascular structures as the keypoints aligns with clinically interpretable anatomical landmarks, offering better transparency and usability than fully voxel-based models.

2. Related Work

2.1. Deformable Lung CT Registration

Deformable registration of lung CT scans is a fundamental task in medical imaging, supporting a wide range of clinical applications such as respiratory motion modeling for radiotherapy [6,21,22] and quantitative analyses of the disease progression in chronic obstructive pulmonary disease (COPD) [5,21]. By aligning inspiratory and expiratory scans, registration enables the computation of regional ventilation maps and volumetric changes over time [23]. These applications demand a high registration accuracy, especially in anatomically critical regions such as the lung lobes, bronchial trees, and vascular structures [24].

However, lung CT registration poses challenges that are distinct from those in other organs. The lungs undergo large, non-linear deformations between breathing phases, and sliding motion at the lung–pleura interface introduces discontinuities that smooth deformation models struggle to capture. Furthermore, CT intensity values vary with air content, which makes standard intensity-based metrics less reliable [25]. Techniques addressing these issues include biomechanically inspired models, such as mass- or volume-preserving registration [26,27], which impose constraints on local tissue conservation. In addition, the Demons algorithm [28], free-form B-spline deformation [29], and hybrid intensity–feature methods [30] jointly optimize an image similarity metric and the smoothness of the deformation field. Although these methods have been widely evaluated on benchmarks (e.g., EMPIRE10 [3]), their reliance on iterative optimization and voxel intensities limits both their scalability and their robustness under complex respiratory motion.

2.2. Voxel-Based Deformable Registration

Voxel-based deformable registration has traditionally relied on intensity-driven optimization frameworks, often augmented with anatomical priors. For example, Schmidt-Richberg et al. [31] jointly optimize fissure segmentation and registration to improve alignment near inter-lobar boundaries, while Ding et al. [24] propose lobe-wise registration strategies to better accommodate sliding motion between lobes. These methods incorporate domain knowledge to improve physiological plausibility but often depend on manual annotations and are sensitive to segmentation errors.

To enhance the robustness in regions affected by noise, low contrast, or modality variations, several studies have introduced hand-crafted descriptors into the voxel-based pipeline. The Modality Independent Neighborhood Descriptor (MIND) [32], for instance, encodes local self-similarity patterns and has been widely adopted as a replacement for the raw intensity in similarity metrics, particularly for multi-modal or artifact-prone registrations. More recently, deep-learning-based methods such as VoxelMorph [33] and DLIR [34] have been proposed as data-driven alternatives to iterative registration. These models typically adopt U-Net-style encoder–decoder architectures [35] to directly regress dense voxel-wise displacement fields from input image pairs. Trained in an unsupervised manner using a combination of similarity and smoothness losses, these networks replace conventional optimization with a single forward pass, enabling efficient inference.

Despite their advantages, voxel-based models remain computationally demanding and are inherently sensitive to intensity inconsistencies, noise, and modality shifts. These limitations have motivated the exploration of alternative data representations—such as point-based methods—that aim to improve efficiency and robustness, especially in the presence of large deformations and variations in appearance.

2.3. Point-Based Registration

Recent approaches have explored representing anatomical structures—such as airway centerlines or vascular trees—as sparse 3D point clouds [36,37], enabling alternative formulations of deformable registration beyond traditional voxel grids. This shift has made it possible to leverage advances in 3D scene flow estimation. Models like FlowNet3D [18] and PointPWC-Net [17] have demonstrated effective ways to estimate dense motion fields between point sets through hierarchical features and cost volume aggregation. However, prior point-based approaches have primarily focused on generic scene flow or rigid alignment tasks, with limited investigation into deformable registration in lung CT.

The Lung250M-4B benchmark [16] addresses this gap by providing paired inspiratory–expiratory CT scans with vessel point clouds and high-quality anatomical correspondences, enabling systematic evaluation of point-based registration under realistic respiratory deformation. Both optimization- and learning-based methods [17,20] achieve competitive performance at reduced inference cost; however, the inherent sparsity of point clouds makes deformation regularization difficult, particularly in anatomically homogeneous or low-feature regions. Our work addresses these limitations through a hierarchical architecture with attention-guided refinement and bi-directional consistency tailored to lung CT registration.

3. The Proposed Approach

In this section, we first formalize our bi-directional objective for deformable lung CT registration (Section 3.1), describe the multi-scale point feature extraction backbone (Section 3.2), elaborate on geometric keypoint attention (Section 3.3), detail the progressive refinement of the flow predictions with flow attention (Section 3.4), and finally outline our comprehensive bi-directional training objective (Section 3.5).

3.1. The Problem Setup

Given a pair of lung CT scans, deformable lung CT registration aims to estimate a dense, non-rigid transformation that accurately aligns the anatomical structures between a fixed scan (source) and a moving scan (target). Formally, let $P = \{p_i\}_{i=1}^{N}$ and $Q = \{q_j\}_{j=1}^{N}$ denote the 3D point clouds extracted from the source and target scans, respectively, where $N$ is the number of sampled points. Our objective is to find a transformation $F = \{f_i\}_{i=1}^{N}$ that maps each source point $p_i \in P$ to its corresponding target location $q_j \in Q$.

We parameterize this transformation as a displacement field, where each point-specific transformation function $f_i : \mathbb{R}^3 \to \mathbb{R}^3$ is defined as

$$ f_i(p_i) = p_i + v_i, \tag{1} $$

with $v_i \in \mathbb{R}^3$ being the displacement vector to predict. Given a set of ground-truth correspondences $\mathcal{C}$ between the source and target points (precomputed using external tools such as corrField [15]), the registration problem is formally expressed as

$$ F^{*} = \underset{F}{\arg\min} \sum_{(i,j) \in \mathcal{C}} \left\lVert f_i(p_i) - q_j \right\rVert_2^2 . \tag{2} $$

This optimization encourages the transformed source points $f_i(p_i)$ to closely match their ground-truth correspondences $q_j$, thereby achieving accurate anatomical alignment and maintaining spatial coherence.

In this work, we propose a learning-based approach that directly estimates the displacement field $F$ from the input lung CT point clouds. Inspired by the point-based scene flow framework [17], our architecture incorporates two key innovations. First, we introduce specialized attention mechanisms, namely geometric keypoint attention at the coarse scale and progressive flow attention at the fine scales, to better capture both global structural relationships and local deformation cues. Second, we employ a bi-directional flow estimation strategy that simultaneously predicts the forward flow ($F$, source-to-target) and the backward flow ($G = \{g_j\}_{j=1}^{N}$, target-to-source). Formally, the backward transformation is defined as

$$ g_j(q_j) = q_j + u_j, \tag{3} $$

where $u_j \in \mathbb{R}^3$ denotes the displacement vector for the target points. Analogously, the optimal backward flow is

$$ G^{*} = \underset{G}{\arg\min} \sum_{(i,j) \in \mathcal{C}} \left\lVert p_i - g_j(q_j) \right\rVert_2^2 . \tag{4} $$

This bi-directional formulation lets the network exploit the mutual consistency between the forward and backward transformations, resulting in more stable, accurate, and anatomically plausible registration outcomes. Figure 1 provides an overview of our network design, and we detail each module in the subsequent sections.
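As a minimal NumPy sketch of Eqs. (1)-(4), the snippet below scores given forward and backward displacements against a correspondence set. It assumes, hypothetically, that $\mathcal{C}$ is stored as a (K, 2) integer array of index pairs (i, j); it only evaluates the objectives and is not the authors' training code.

```python
import numpy as np

def forward_objective(P, V, Q, corr):
    """Eq. (2): sum over (i, j) in corr of ||(p_i + v_i) - q_j||^2."""
    i, j = corr[:, 0], corr[:, 1]
    warped = P[i] + V[i]                  # f_i(p_i) = p_i + v_i, Eq. (1)
    return float(np.sum((warped - Q[j]) ** 2))

def backward_objective(P, Q, U, corr):
    """Eq. (4): sum over (i, j) in corr of ||p_i - (q_j + u_j)||^2."""
    i, j = corr[:, 0], corr[:, 1]
    warped = Q[j] + U[j]                  # g_j(q_j) = q_j + u_j, Eq. (3)
    return float(np.sum((P[i] - warped) ** 2))

# Toy check: the target is the source translated by t, with identity correspondences.
rng = np.random.default_rng(0)
P = rng.normal(size=(5, 3))
t = np.array([1.0, -2.0, 0.5])
Q = P + t
corr = np.stack([np.arange(5), np.arange(5)], axis=1)   # hypothetical (K, 2) layout
V = np.tile(t, (5, 1))                    # exact forward displacements
U = -V                                    # exact backward displacements
fwd = forward_objective(P, V, Q, corr)
bwd = backward_objective(P, Q, U, corr)
```

When the displacements reproduce the translation exactly, both objectives vanish up to floating-point rounding, which is a quick sanity check for any implementation of Eqs. (2) and (4).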

3.2. Multi-Scale Feature Extraction

3.2.1. The PointConv Backbone

We extract hierarchical point features using PointConv layers [38], which extend conventional convolutions to irregular 3D point sets. Formally, given an input point cloud $X = \{x_i\}_{i=1}^{|X|}$ with associated per-point features $F_X \in \mathbb{R}^{|X| \times D_{emb}}$, each PointConv layer aggregates local features as follows:

$$ \mathrm{PointConv}(F_X; X)_{i} = \sum_{x_j \in \mathcal{N}(x_i)} S(x_j)\, W(x_j - x_i)\, (F_X)_j, \tag{5} $$

where $\mathcal{N}(x_i)$ denotes the local neighborhood of point $x_i$, $S(x_j)$ is an inverse density scale factor computed via kernel density estimation, and $W(\cdot)$ represents the learnable convolutional weights, which are conditioned on the relative spatial coordinates $x_j - x_i$. By stacking multiple PointConv blocks in a U-Net-style encoder–decoder [35], we generate a coarse-to-fine hierarchy of informative point feature embeddings.
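To make Eq. (5) concrete, here is a naive NumPy sketch of a single PointConv layer. It is deliberately simplified and is not the reference PointConv implementation: the neighborhood is a plain k-nearest-neighbor set, $S(x_j)$ is the reciprocal of a Gaussian kernel density estimate with a hypothetical bandwidth, and the weight function is a hypothetical one-layer network rather than the MLP used in PointConv [38]. The memory-efficient reformulation used by PointConv is also omitted.

```python
import numpy as np

def knn(query, points, k):
    """Indices of the k nearest `points` (Euclidean) for each query point."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def inverse_density(points, bandwidth=0.5):
    """S(x_j): reciprocal of a Gaussian kernel density estimate at each point."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    density = np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)
    return 1.0 / density

def pointconv(X, F, k, weight_net):
    """Naive Eq. (5): out_i = sum_{j in kNN(i)} S(x_j) * F_j @ W(x_j - x_i)."""
    nbrs = knn(X, X, k)                       # N(x_i); includes x_i itself
    S = inverse_density(X)
    c_out = weight_net(np.zeros(3)).shape[1]
    out = np.zeros((X.shape[0], c_out))
    for i in range(X.shape[0]):
        for j in nbrs[i]:
            W = weight_net(X[j] - X[i])       # weights depend only on the relative offset
            out[i] += S[j] * (F[j] @ W)
    return out

# Hypothetical weight network: one linear layer + tanh, reshaped to (C_in, C_out).
rng = np.random.default_rng(0)
C_in, C_out = 4, 8
A = rng.normal(scale=0.1, size=(3, C_in * C_out))
b = rng.normal(scale=0.1, size=C_in * C_out)

def weight_net(delta):
    return np.tanh(delta @ A + b).reshape(C_in, C_out)

X = rng.normal(size=(32, 3))
F = rng.normal(size=(32, C_in))
out = pointconv(X, F, k=8, weight_net=weight_net)   # shape (32, 8)
```

Because both the neighborhoods and the weights depend only on relative coordinates, translating the whole cloud leaves the output unchanged.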

3.2.2. The Coarse-to-Fine Pipeline

To efficiently handle large deformations while preserving anatomical completeness, we progressively downsample the input point clouds using farthest point sampling (FPS) [39], which distributes the sampled points broadly across the lung's geometry instead of concentrating them in local regions. Since the original point clouds were extracted primarily from critical anatomical structures such as bronchial trees and vascular networks, FPS preserves their essential structural coverage. This yields a hierarchical set of point clouds $\{X^{(l)}\}_{l=1}^{L}$, from the finest resolution ($l = 1$) to the coarsest resolution ($l = L$), where each point at the coarsest level represents a broader anatomical region of the lung, facilitating global alignment. This multi-scale representation is central to our attention-driven alignment strategy: global correspondences are first established at the coarse level and then refined at higher resolutions.
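The following NumPy sketch shows greedy farthest point sampling and a pyramid built with the downsampling ratios listed in Section 4.3.2 ($N$, $N/4$, $N/16$, $N/32$, $N/128$). Sampling each level from the previous one and starting from index 0 are assumptions made for illustration; implementations often start from a random point.

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: repeatedly select the point farthest from the selected set."""
    n = points.shape[0]
    selected = np.empty(m, dtype=int)
    selected[0] = start
    dist = np.full(n, np.inf)                 # squared distance to nearest selected point
    for s in range(1, m):
        diff = points - points[selected[s - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[s] = int(np.argmax(dist))
    return selected

def build_hierarchy(points, ratios=(1, 4, 16, 32, 128)):
    """Levels l = 1..L; level l keeps N // ratios[l - 1] points, sampled from level l - 1."""
    n = points.shape[0]
    levels = [points]
    for r in ratios[1:]:
        idx = farthest_point_sampling(levels[-1], n // r)
        levels.append(levels[-1][idx])
    return levels

rng = np.random.default_rng(0)
cloud = rng.normal(size=(2048, 3))
sizes = [lvl.shape[0] for lvl in build_hierarchy(cloud)]   # [2048, 512, 128, 64, 16]
```

On points spaced evenly along a line, FPS first picks one end, then the opposite end, then the midpoint, which illustrates why it spreads samples over the whole structure.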

3.3. Geometric Keypoint Attention

The geometric keypoint attention module captures global structural correspondences prior to finer-scale refinement. At this stage, capturing the long-range geometric structure is essential for reducing correspondence ambiguity under large respiratory motion, which motivates our use of a geometry-aware attention formulation. Consider the coarse-level feature embeddings $F_P^{(L)} \in \mathbb{R}^{|P| \times D_{L\text{-}emb}}$. The attention-enhanced output features $H_P^{(L)}$ are computed as an attention-weighted aggregation:

$$ h_{P_i}^{(L)} = \sum_{j=1}^{|P|} \alpha_{i,j} \left( f_{P_j}^{(L)} W^{V} \right), \tag{6} $$

where the attention weights $\alpha_{i,j}$ combine feature similarity with pairwise geometric relations among points:

$$ \alpha_{i,j} = \operatorname{Softmax}_j \left( \frac{ \left( f_{P_i}^{(L)} W^{Q} \right) \left( f_{P_j}^{(L)} W^{K} + r_{i,j} W^{R} \right)^{\top} }{ \sqrt{d_t} } \right). \tag{7} $$

Here, $r_{i,j}$ is the geometric structure embedding proposed in GeoTransformer [40], $d_t$ is the feature dimension used for scaling, and $W^{Q}, W^{K}, W^{V}, W^{R} \in \mathbb{R}^{D_{L\text{-}emb} \times D_{L\text{-}emb}}$ are learnable projection matrices for queries, keys, values, and geometric embeddings, respectively. Analogous computations are performed for the target cloud features $F_Q^{(L)}$.
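A minimal NumPy sketch of the geometric self-attention in Eqs. (6)-(7) follows. As a simplification, $r_{i,j}$ here is only a sinusoidal embedding of the pairwise distance; GeoTransformer [40] additionally embeds triplet angles. The bandwidth `sigma`, the weight scales, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_embedding(points, d, sigma=1.0):
    """Sinusoidal embedding of pairwise distances, shape (n, n, d); d must be even."""
    rho = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1) / sigma
    freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
    angles = rho[..., None] * freqs              # (n, n, d / 2)
    emb = np.empty(rho.shape + (d,))
    emb[..., 0::2] = np.sin(angles)
    emb[..., 1::2] = np.cos(angles)
    return emb

def geometric_self_attention(points, F, Wq, Wk, Wv, Wr):
    """Eqs. (6)-(7): logits_ij = q_i . (k_j + r_ij Wr) / sqrt(d); h_i = sum_j alpha_ij v_j."""
    d = F.shape[1]
    q, k, v = F @ Wq, F @ Wk, F @ Wv
    r = distance_embedding(points, d) @ Wr       # (n, n, d)
    logits = (q @ k.T + np.einsum("id,ijd->ij", q, r)) / np.sqrt(d)
    alpha = softmax(logits, axis=1)              # normalize over j
    return alpha @ v, alpha

rng = np.random.default_rng(0)
n, d = 16, 32
pts = rng.normal(size=(n, 3))
F = rng.normal(size=(n, d))
Wq, Wk, Wv, Wr = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))
H, alpha = geometric_self_attention(pts, F, Wq, Wk, Wv, Wr)
```

Because the embedding depends only on pairwise distances, rotating or translating the coarse points leaves the attention weights unchanged, so the geometric term is insensitive to the global pose of the point cloud.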

A similar attention mechanism is adopted for cross-attention, which explicitly models inter-cloud interactions by exchanging information between the source and target point clouds. Compared with self-attention, the key difference is that the query $f_{P_i}^{(L)} W^{Q}$ is computed from one point cloud (e.g., $P$), while the key and value, $f_{Q_j}^{(L)} W^{K}$ and $f_{Q_j}^{(L)} W^{V}$, are taken from the other (e.g., $Q$). This change of direction allows the network to attend across point clouds and capture alignment-relevant correspondences at the coarsest level.

To robustly capture both intra-cloud structure and inter-cloud correspondence cues, the self- and cross-attention modules are alternated for $N_c$ iterations, resulting in enhanced features $\hat{F}_P^{(L)} \in \mathbb{R}^{|P^{(L)}| \times D_{L\text{-}emb}}$ and $\hat{F}_Q^{(L)} \in \mathbb{R}^{|Q^{(L)}| \times D_{L\text{-}emb}}$. This interleaved design allows the network to iteratively refine the coarse-level representations by jointly reasoning about global geometry and alignment consistency, which is particularly beneficial for reducing correspondence ambiguity under large deformations.
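The interleaving of self- and cross-attention can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: single-head attention without the geometric term, residual updates, no normalization or feed-forward layers, and weights shared across iterations and between the two clouds. None of these choices is confirmed by the text.

```python
import numpy as np

def attention(x_query, x_context, Wq, Wk, Wv):
    """Single-head scaled dot-product attention; keys/values come from x_context."""
    q, k, v = x_query @ Wq, x_context @ Wk, x_context @ Wv
    logits = q @ k.T / np.sqrt(q.shape[1])
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)
    return a @ v

def interleaved_attention(FP, FQ, self_w, cross_w, n_iters=3):
    """Alternate self- and cross-attention n_iters times (N_c = 3 in Section 4.3.2)."""
    for _ in range(n_iters):
        FP = FP + attention(FP, FP, *self_w)          # intra-cloud context, source
        FQ = FQ + attention(FQ, FQ, *self_w)          # intra-cloud context, target
        FP, FQ = (FP + attention(FP, FQ, *cross_w),   # source queries target
                  FQ + attention(FQ, FP, *cross_w))   # target queries source
    return FP, FQ

rng = np.random.default_rng(0)
D = 16
self_w = [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)]
cross_w = [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)]
FP, FQ = rng.normal(size=(10, D)), rng.normal(size=(7, D))
out_P, out_Q = interleaved_attention(FP, FQ, self_w, cross_w)
```

The cross-attention update uses the features from before the update for both clouds, so the procedure is symmetric in $P$ and $Q$: swapping the inputs swaps the outputs, and the two clouds may have different sizes.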

3.4. Progressive Flow Refinement

At each decoder level $l$, we progressively refine the flow estimate by leveraging the coarser prediction from level $l + 1$, as shown in Figure 2. For clarity and conciseness, the following subsections describe only the forward flow $F$ (source-to-target); the backward flow $G$ (target-to-source) is computed analogously using the same architecture and refinement process.

3.4.1. Warping and Cost Volume Construction

The estimated flow $v^{(l+1)}$ at the coarser level $l + 1$ is propagated to the finer level through inverse-distance-weighted interpolation, producing an initial flow estimate $\tilde{v}^{(l)}$ at resolution $l$. We use the upsampled flow $\tilde{v}^{(l)}$ to warp the target points,

$$ \hat{Q}^{(l)} = \left\{ q_i^{(l)} + \tilde{v}_i^{(l)} \;\middle|\; q_i^{(l)} \in Q^{(l)} \right\}, $$

bringing the target points into approximate alignment with the source. This reduces large residual displacements and stabilizes the subsequent refinement steps.
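Inverse-distance-weighted upsampling of the flow takes only a few lines of NumPy. The neighbor count k = 3, the weight power, and the small constant `eps` are assumptions; the text does not state them.

```python
import numpy as np

def idw_upsample(fine_pts, coarse_pts, coarse_flow, k=3, power=1.0, eps=1e-8):
    """Give each fine point a weighted mean of the flows of its k nearest coarse points.

    Weights are proportional to 1 / distance**power (k and power are assumptions).
    """
    d = np.linalg.norm(fine_pts[:, None, :] - coarse_pts[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]                        # (n_fine, k)
    w = 1.0 / (np.take_along_axis(d, nn, axis=1) ** power + eps)
    w /= w.sum(axis=1, keepdims=True)                        # weights sum to 1
    return (w[..., None] * coarse_flow[nn]).sum(axis=1)      # (n_fine, 3)

rng = np.random.default_rng(0)
coarse = rng.normal(size=(20, 3))
fine = np.vstack([coarse[:5], rng.normal(size=(45, 3))])     # first 5 coincide with coarse points
flow = rng.normal(size=(20, 3))
v_tilde = idw_upsample(fine, coarse, flow)
```

Because the weights are normalized, a spatially constant coarse flow is reproduced exactly, and fine points that coincide with coarse points essentially inherit their flow.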

Next, we construct a cost volume that encodes local geometric and feature-level differences. For each source point $p_i^{(l)} \in P^{(l)}$, we first identify its $k$ nearest neighbors among the warped target points $\hat{Q}^{(l)}$. Then, using the warped target features $F_{\hat{Q}}^{(l)}$, we compute residual feature embeddings between the source features $F_P^{(l)}$ and the features of the corresponding warped target neighbors. These residual embeddings are subsequently aggregated via PointConv layers [38], yielding the cost volume features $C^{(l)}$, which encode both geometric proximity and local appearance discrepancies to guide flow refinement.
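A simplified cost volume along these lines is sketched below in NumPy. For each source point, it gathers the k nearest warped target points and concatenates their relative positions with the feature residuals; mean pooling stands in for the learned PointConv aggregation, so this illustrates the data flow only, not the learned operator. The function name and k = 8 are hypothetical.

```python
import numpy as np

def cost_volume(src_pts, src_feat, tgt_pts_warped, tgt_feat, k=8):
    """Simplified kNN cost volume (mean pooling replaces the learned PointConv)."""
    d2 = ((src_pts[:, None, :] - tgt_pts_warped[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]                          # (n_src, k)
    rel_pos = tgt_pts_warped[nn] - src_pts[:, None, :]          # (n_src, k, 3)
    feat_res = tgt_feat[nn] - src_feat[:, None, :]              # (n_src, k, C)
    per_neighbor = np.concatenate([rel_pos, feat_res], axis=-1) # (n_src, k, 3 + C)
    return per_neighbor.mean(axis=1)                            # (n_src, 3 + C)

rng = np.random.default_rng(0)
P, FP = rng.normal(size=(30, 3)), rng.normal(size=(30, 5))
Q_hat, FQ = rng.normal(size=(40, 3)), rng.normal(size=(40, 5))
C = cost_volume(P, FP, Q_hat, FQ)            # shape (30, 8)
```

If the warped target coincided exactly with the source in both position and features, every entry would be zero, which is the behavior a cost volume should have once the clouds are aligned.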

3.4.2. The Flow Estimator

The cost volume features $C^{(l)}$ and source features $F_P^{(l)}$ are concatenated to form the refinement input:

$$ Z_P^{(l)} = \left[ C^{(l)}, F_P^{(l)} \right] \in \mathbb{R}^{|P^{(l)}| \times D_{l\text{-}ref}} . $$

To enhance flow estimation accuracy, we introduce a contextual attention module composed of alternating self-attention and cross-attention layers. At this stage, the goal is to refine the localized residual flow patterns based on the learned contextual properties. Specifically, we first apply standard multi-head self-attention (MHSA) within the source point cloud to capture its internal structural context:

$$ \left( Z_P^{(l)} \right)' = \mathrm{MHSA}\left( Z_P^{(l)} \right), \tag{8} $$

where MHSA is defined as

$$ \mathrm{MHSA}(X) = \mathrm{Concat}\left[ \mathrm{head}_1, \dots, \mathrm{head}_h \right] W^{O}, \tag{9} $$

$$ \mathrm{head}_m = \operatorname{Softmax}\left( \frac{ (X W_m^{Q}) (X W_m^{K})^{\top} }{ \sqrt{d_k} } \right) X W_m^{V}, \tag{10} $$

with $W_m^{Q}, W_m^{K}, W_m^{V}, W^{O}$ as learnable linear projection matrices and $h$ as the number of attention heads.
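Eqs. (9)-(10) translate directly into NumPy, as in the sketch below, which uses h = 4 heads to match Section 4.3.2. The per-head projection shapes (h, D, d_k) and the random weights are illustrative assumptions; dropout, normalization, and residual connections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(X, Wq, Wk, Wv, Wo):
    """Eqs. (9)-(10). Wq, Wk, Wv: (h, D, d_k) per-head projections; Wo: (h * d_k, D)."""
    heads = []
    for Wq_m, Wk_m, Wv_m in zip(Wq, Wk, Wv):
        q, k, v = X @ Wq_m, X @ Wk_m, X @ Wv_m
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)   # (n, n), rows sum to 1
        heads.append(attn @ v)                                   # head_m: (n, d_k)
    return np.concatenate(heads, axis=1) @ Wo                    # (n, D)

rng = np.random.default_rng(0)
n, D, h = 12, 64, 4
d_k = D // h
Wq, Wk, Wv = (rng.normal(scale=D ** -0.5, size=(h, D, d_k)) for _ in range(3))
Wo = rng.normal(scale=D ** -0.5, size=(h * d_k, D))
X = rng.normal(size=(n, D))
Z = mhsa(X, Wq, Wk, Wv, Wo)
```

Self-attention of this form has no positional term, so permuting the input points permutes the output rows in the same way; any geometric information must therefore enter through the features, such as the cost volume in $Z_P^{(l)}$.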

Subsequently, to explicitly incorporate inter-cloud correspondence cues, we perform a cross-attention operation between the self-attended source features $(Z_P^{(l)})'$ and the corresponding warped target features $F_{\hat{Q}}^{(l)} \in \mathbb{R}^{|\hat{Q}^{(l)}| \times D_{L\text{-}emb}}$. This step aggregates relevant information from the warped target points to refine the source features:

$$ \left( \hat{Z}_P^{(l)} \right)_i = \sum_{j=1}^{|\hat{Q}^{(l)}|} \beta_{i,j} \left( \left( F_{\hat{Q}}^{(l)} \right)_j W^{V} \right), \tag{11} $$

$$ \beta_{i,j} = \operatorname{Softmax}_j \left( \frac{ \left( \left( Z_P^{(l)} \right)'_i W^{Q} \right) \left( \left( F_{\hat{Q}}^{(l)} \right)_j W^{K} \right)^{\top} }{ \sqrt{d_k} } \right), \tag{12} $$

where $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{D_{l\text{-}ref} \times D_{l\text{-}ref}}$ are the query, key, and value projection matrices, respectively.

By alternating the self-attention (intra-cloud) and cross-attention (inter-cloud) layers for $N_f$ iterations, we obtain refined embeddings $\hat{Z}_P^{(l)}$ that jointly leverage the internal point cloud context and the external correspondence information.

Finally, these enriched embeddings $\hat{Z}_P^{(l)}$ are fed into a residual flow prediction network (an MLP) to estimate the residual displacements $\Delta v^{(l)}$:

$$ \Delta v^{(l)} = \mathrm{MLP}\left( \hat{Z}_P^{(l)} \right). \tag{13} $$

The refined flow at level $l$ is then obtained by adding the residual displacement to the initial estimate:

$$ v^{(l)} = \tilde{v}^{(l)} + \Delta v^{(l)}, \tag{14} $$

which is applied recursively at each decoder level until the finest resolution is reached, producing the final flow prediction $v^{(1)}$.

Note that the same refinement procedure is applied in the reverse direction to obtain the backward flows $\{u^{(l)}\}_{l=1}^{L}$.
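The recursion in Eqs. (13)-(14) can be summarized by the following NumPy sketch. The learned estimator is replaced by a hypothetical callable `predict_residual`, the coarsest level is assumed to start from a zero flow, and flows are upsampled with inverse-distance weighting (k = 3, weights 1/d), which is an assumed interpolation setting.

```python
import numpy as np

def idw_upsample(fine_pts, coarse_pts, coarse_flow, k=3, eps=1e-8):
    """Inverse-distance-weighted flow interpolation (weights 1 / d over k neighbors)."""
    d = np.linalg.norm(fine_pts[:, None, :] - coarse_pts[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]
    w = 1.0 / (np.take_along_axis(d, nn, axis=1) + eps)
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * coarse_flow[nn]).sum(axis=1)

def coarse_to_fine(pyramid, predict_residual):
    """Apply Eq. (14) from level L down to level 1.

    pyramid[l - 1] holds the source points of level l. predict_residual(l, pts, v_init)
    is a hypothetical stand-in for the attention-based estimator of Eq. (13).
    """
    L = len(pyramid)
    coarsest = pyramid[L - 1]
    v = predict_residual(L, coarsest, np.zeros_like(coarsest))   # assumed zero initial flow
    for l in range(L - 1, 0, -1):
        fine = pyramid[l - 1]
        v_init = idw_upsample(fine, pyramid[l], v)               # tilde v^(l)
        v = v_init + predict_residual(l, fine, v_init)           # v^(l) = tilde v^(l) + delta v^(l)
    return v                                                     # v^(1)

# Toy run: only the coarsest level predicts a (constant) flow; finer residuals are zero.
rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))
pyramid = [pts[:n] for n in (64, 32, 16, 8, 4)]
t = np.array([0.5, -1.0, 2.0])

def toy_residual(l, p, v_init):
    return np.tile(t, (p.shape[0], 1)) if l == len(pyramid) else np.zeros_like(p)

v1 = coarse_to_fine(pyramid, toy_residual)
```

In the toy run, a global translation estimated at the coarsest level propagates unchanged to the finest level, while a real estimator would add non-zero residuals at each level to capture local deformation.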

3.4.3. Post-Processing for Dense Deformation Estimation

Although our model estimates displacements only at sparse vessel-derived points, a dense deformation field across the entire lung volume is still required. To obtain it, we employ Thin Plate Spline (TPS) interpolation, a classical method in landmark-based registration. Given the predicted displacements at a sparse set of source points, TPS interpolates the transformation to the entire 3D space, so that a displacement vector can be computed for every voxel in the volume. This design is motivated by the anatomical coupling between the vascular structures and lung parenchyma deformation, which allows the clinically relevant motion captured at the vessels to be propagated across the lung.
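A compact NumPy sketch of TPS interpolation in 3D follows. It assumes the 3D biharmonic kernel U(r) = r with an affine term, a common choice for volumetric TPS; the text does not specify the kernel, regularization, or solver, so these are illustrative. SciPy's `scipy.interpolate.RBFInterpolator` offers related kernels (for example `'thin_plate_spline'` and `'linear'`) if a library implementation is preferred.

```python
import numpy as np

def tps_fit(ctrl, disp, reg=0.0):
    """Fit a 3D TPS with kernel U(r) = r plus an affine term; returns (w, a)."""
    n = ctrl.shape[0]
    K = np.linalg.norm(ctrl[:, None, :] - ctrl[None, :, :], axis=-1) + reg * np.eye(n)
    Pm = np.hstack([np.ones((n, 1)), ctrl])          # (n, 4) affine basis [1, x, y, z]
    A = np.zeros((n + 4, n + 4))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, Pm, Pm.T
    rhs = np.vstack([disp, np.zeros((4, 3))])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n:]                          # kernel weights (n, 3), affine (4, 3)

def tps_eval(query, ctrl, w, a):
    """Displacement at arbitrary query points (e.g., every voxel center)."""
    U = np.linalg.norm(query[:, None, :] - ctrl[None, :, :], axis=-1)   # (m, n)
    return U @ w + np.hstack([np.ones((query.shape[0], 1)), query]) @ a

rng = np.random.default_rng(0)
ctrl = rng.normal(size=(30, 3))          # sparse points with predicted displacements
disp = rng.normal(scale=0.1, size=(30, 3))
w, a = tps_fit(ctrl, disp)
dense = tps_eval(rng.normal(size=(1000, 3)), ctrl, w, a)
```

With `reg=0`, the spline passes exactly through the control displacements and reproduces any affine displacement field everywhere. The dense solve scales cubically with the number of control points, so for tens of thousands of vessel points, subsampling the control points or using a positive `reg` (which trades exact interpolation for smoothness) is a practical option.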

3.5. The Training Objective

Our training objective combines supervised bi-directional flow losses with an unsupervised cycle consistency loss to promote accurate, coherent, and invertible transformations. By supervising both the forward and backward displacement predictions at multiple resolutions, and additionally enforcing consistency between the two directions, the network is guided toward robust and anatomically plausible lung registration.

3.5.1. Bi-Directional Multi-Scale Flow Loss

At each resolution level $l$, we supervise the predicted displacement fields by comparing them against their corresponding ground-truth flows. Specifically, we define a bi-directional loss that penalizes deviations from the ground truth in both the forward (source-to-target) and backward (target-to-source) directions:

$$ \mathcal{L}_{flow}^{(l)} = \sum_{p_i \in P^{(l)}} \left\lVert \hat{f}_i^{(l)}(p_i) - f_i^{(l)}(p_i) \right\rVert_2 + \sum_{q_i \in Q^{(l)}} \left\lVert \hat{g}_i^{(l)}(q_i) - g_i^{(l)}(q_i) \right\rVert_2, \tag{15} $$

where $\hat{f}^{(l)}$ and $\hat{g}^{(l)}$ denote the predicted forward and backward displacement functions, respectively, and $f^{(l)}$ and $g^{(l)}$ are their ground-truth counterparts. This symmetric formulation provides balanced supervision, encouraging consistent flow estimation in both directions.
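Since $f(p) = p + v$, each term of Eq. (15) reduces to the norm of a displacement difference. A minimal NumPy sketch for a single level, with hypothetical argument names:

```python
import numpy as np

def flow_loss(v_pred, v_gt, u_pred, u_gt):
    """Eq. (15) at one level: summed (unsquared) L2 errors in both directions.

    Because f(p) = p + v, the term ||f_hat(p) - f(p)|| equals ||v_hat - v||.
    """
    fwd = np.linalg.norm(v_pred - v_gt, axis=1).sum()
    bwd = np.linalg.norm(u_pred - u_gt, axis=1).sum()
    return float(fwd + bwd)

v_gt = np.zeros((2, 3))
v_pred = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 1.0]])
u_gt = np.zeros((1, 3))
u_pred = np.array([[0.0, 0.0, 2.0]])
print(flow_loss(v_pred, v_gt, u_pred, u_gt))  # 8.0  (5 + 1 + 2)
```

Using the unsquared norm, as Eq. (15) does, makes the penalty grow linearly with the error, so a few points with large errors influence the loss less than they would under a squared penalty.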

3.5.2. The Cycle Consistency Loss

To further enhance the geometric plausibility and reversibility of the predicted transformations, we introduce an unsupervised cycle consistency loss. Let $\hat{F}^{(l)} = \{\hat{f}_i^{(l)}\}_{i=1}^{|P^{(l)}|}$ and $\hat{G}^{(l)} = \{\hat{g}_i^{(l)}\}_{i=1}^{|Q^{(l)}|}$ be the sets of predicted forward and backward displacement functions, respectively. Additionally, let $T_{\hat{F}}^{(l)}(\cdot)$ and $T_{\hat{G}}^{(l)}(\cdot)$ denote the transformation functions that deform a point cloud by applying the corresponding per-point displacements; for instance,

$$ T_{\hat{F}}^{(l)}\left(P^{(l)}\right) = \left\{ \hat{f}_i^{(l)}\left(p_i^{(l)}\right) \right\}_{i} = \left\{ p_i^{(l)} + \hat{v}_i^{(l)} \right\}_{i} . $$

The cycle consistency loss penalizes deviations after the forward and backward transformations are applied sequentially, promoting reversible mappings between point clouds:

$$ \mathcal{L}_{cycle}^{(l)} = \mathrm{CD}\left( T_{\hat{G}}^{(l)} \circ T_{\hat{F}}^{(l)}\left(P^{(l)}\right), P^{(l)} \right) + \mathrm{CD}\left( T_{\hat{F}}^{(l)} \circ T_{\hat{G}}^{(l)}\left(Q^{(l)}\right), Q^{(l)} \right), \tag{16} $$

where the Chamfer Distance (CD) between two point clouds $S_1$ and $S_2$ is defined as

$$ \mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2 . \tag{17} $$

This unsupervised regularization term promotes geometric coherence and encourages mutually consistent bi-directional flows.
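The sketch below implements Eq. (17) and Eq. (16) in NumPy. Applying $T_{\hat{G}}$ to already-warped source points requires evaluating the backward flow at locations where it is not defined; here this is done with a nearest-neighbor lookup, which is an assumption, since the text does not specify the interpolation.

```python
import numpy as np

def chamfer(S1, S2):
    """Eq. (17): symmetric mean squared nearest-neighbor distance."""
    d2 = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def apply_flow_at(points, flow_pts, flow):
    """Move `points` by the displacement of their nearest neighbor in `flow_pts`.

    Assumption: nearest-neighbor lookup evaluates a flow defined on one cloud at
    arbitrary (warped) locations; the text does not specify this step.
    """
    d2 = ((points[:, None, :] - flow_pts[None, :, :]) ** 2).sum(-1)
    return points + flow[d2.argmin(axis=1)]

def cycle_loss(P, V, Q, U):
    """Eq. (16): CD(T_G(T_F(P)), P) + CD(T_F(T_G(Q)), Q)."""
    P_cycle = apply_flow_at(P + V, Q, U)     # forward, then backward
    Q_cycle = apply_flow_at(Q + U, P, V)     # backward, then forward
    return chamfer(P_cycle, P) + chamfer(Q_cycle, Q)

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
t = np.array([2.0, 0.0, -1.0])
Q = P + t
V = np.tile(t, (50, 1))                      # forward flow: exact translation
U = -V                                       # backward flow: its inverse
loss_consistent = cycle_loss(P, V, Q, U)                      # close to 0
loss_inconsistent = cycle_loss(P, V, Q, np.zeros_like(U))     # positive
```

Because the loss compares point sets rather than individual correspondences, it needs no ground truth, which is why it can serve as the unsupervised regularizer in Eq. (18).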

3.5.3. The Final Training Objective

Our complete multi-scale training objective integrates these two losses across all resolution levels:

$$ \mathcal{L} = \sum_{l=1}^{L} \left( \mathcal{L}_{flow}^{(l)} + \lambda \cdot \mathcal{L}_{cycle}^{(l)} \right), \tag{18} $$

where the hyperparameter $\lambda$ controls the influence of the cycle consistency regularization and is empirically set to 0.1.

4. Experiments

4.1. Dataset

Our experiments use the Lung250M dataset, the point cloud portion of Lung250M-4B [16] (i.e., without its full-resolution voxel data, the "4B" part), which consists of approximately 250 million points extracted from 248 lung CT scans of 124 patients. Lung250M integrates scans from EMPIRE10 [3], TCIA-NSCLC [4,41,42,43,44], TCIA-Ventilation [4,6], L2R-LungCT [7], TCIA-NLST [4,45], and DIR-LAB COPDgene [1]. Each source dataset has distinct imaging protocols, scanner settings, and pathology distributions, which increases the diversity and complexity of the lung deformations. The detailed composition of the dataset is summarized in Table 1.

4.2. Evaluation Metrics

To quantitatively assess registration performance, we follow the evaluation protocol of the Lung250M-4B benchmark [16] and adopt the Target Registration Error (TRE) as our primary metric. In addition to the mean TRE, we report the 25th, 50th, and 75th percentile TRE values to provide a more comprehensive view of registration accuracy across different anatomical regions and difficulty levels. We further evaluate alignment quality using the Dice Similarity Coefficient (DSC) between the warped source mask and the target mask. A higher DSC indicates better anatomical overlap, while a lower TRE indicates more accurate landmark alignment.
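For reference, here is a NumPy sketch of the reported statistics: per-landmark TRE with its mean and 25th/50th/75th percentiles, and DSC for binary masks. It assumes landmark coordinates are already in millimeters and that masks are arrays of equal shape; the benchmark's official evaluation code may differ in details.

```python
import numpy as np

def tre_stats(warped_landmarks, target_landmarks):
    """Per-landmark Euclidean error (input units, e.g. mm): mean and quartiles."""
    err = np.linalg.norm(warped_landmarks - target_landmarks, axis=1)
    p25, p50, p75 = np.percentile(err, [25, 50, 75])
    return {"mean": err.mean(), "p25": p25, "p50": p50, "p75": p75}

def dice(mask_a, mask_b):
    """DSC = 2 * |intersection| / (|A| + |B|) for binary masks of equal shape."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

warped = np.array([[1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0], [4.0, 0, 0]])
stats = tre_stats(warped, np.zeros((4, 3)))     # errors 1, 2, 3, 4 mm
dsc = dice(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```

Reporting percentiles alongside the mean shows whether an improvement comes from uniformly better alignment or mainly from fewer large failures.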

4.3. Implementation Details

4.3.1. Training Details

We implement our method in PyTorch [46] and conduct all experiments on a machine equipped with an Intel Xeon Gold 5218R CPU and a single NVIDIA Tesla V100 GPU with 32 GB of memory. The model is trained with the Adam optimizer [47] at an initial learning rate of 0.001, which a multi-step scheduler decays by a factor of 0.1 at epochs 1200 and 1400. We train for 1500 epochs with a batch size of 4. For data augmentation, we apply random rigid transformations and Gaussian noise to improve generalization. Ground-truth correspondences are obtained using corrField [15].
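The schedule and augmentation described above can be sketched as follows. In PyTorch, the schedule corresponds to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[1200, 1400], gamma=0.1)`; the pure-Python function below reproduces its per-epoch values. The augmentation parameters (rotation range, shift range, noise standard deviation) are hypothetical because the text does not report them, and whether source and target share one transform is not specified.

```python
import numpy as np

def learning_rate(epoch, base_lr=1e-3, milestones=(1200, 1400), gamma=0.1):
    """Multi-step decay: multiply by gamma once for every milestone already reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

def random_rigid_with_noise(points, rng, max_angle_deg=10.0, max_shift=5.0, noise_std=0.5):
    """Random rotation (Rodrigues formula) + translation, then isotropic Gaussian noise.

    All three ranges are hypothetical; the text does not give them.
    """
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    Kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])          # cross-product matrix of the axis
    R = np.eye(3) + np.sin(theta) * Kx + (1.0 - np.cos(theta)) * (Kx @ Kx)
    t = rng.uniform(-max_shift, max_shift, size=3)
    noisy = points @ R.T + t + rng.normal(scale=noise_std, size=points.shape)
    return noisy, R

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
augmented, R = random_rigid_with_noise(cloud, rng)
lrs = [learning_rate(e) for e in (0, 1199, 1200, 1400)]   # about 1e-3, 1e-3, 1e-4, 1e-5
```

A rigid transform preserves all pairwise distances, so it changes the pose of the lungs without altering their shape, while the added noise perturbs point positions slightly.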

4.3.2. Model Hyperparameters

The network is built as a five-level hierarchical encoder–decoder ($L = 5$), where the number of points decreases across levels as $N \to \frac{N}{4} \to \frac{N}{16} \to \frac{N}{32} \to \frac{N}{128}$ ($l = 1 \to 5$). The corresponding feature dimensions at each level are set to $[64, 128, 256, 512, 256]$. All attention modules use 4-head multi-head attention and are applied iteratively with $N_c = N_f = 3$ steps for geometric keypoint attention and flow attention, respectively.

4.4. The Experimental Results

We evaluate our method on the COPD test set and compare it with several point-based baselines, including optimization-based [20] and learning-based [17] approaches. As shown in Table 2, our method achieves the best performance across all evaluation metrics, including the mean TRE and its 25th, 50th, and 75th percentiles. Notably, compared with the previous state-of-the-art method, PPWC [17], our model reduces the mean TRE from 2.85 mm to 2.24 mm and the median (50th percentile) TRE from 2.33 mm to 1.79 mm. Our method also achieves a DSC of 91.4%, indicating a high degree of anatomical overlap after registration; we report DSC only for our method because no public DSC evaluations are available for the other baselines.

Beyond the quantitative gains, the qualitative results (Figure 3) show that our method produces smoother and more anatomically consistent deformations than PPWC [17]. In particular, our model better preserves alignment in both vascular and boundary regions, capturing complex respiratory motion more effectively.

4.5. Ablation Studies

To assess the contribution of each model component, training objective, and key hyperparameter, we perform extensive ablation studies.

4.5.1. The Model Components

Table 3 shows the effect of each architectural component. Starting from a base model without any attention modules or bi-directional refinement (Table 3 (a)), we first add geometric keypoint attention at the coarsest level (Table 3 (b)). This yields a minor improvement in mean TRE (from 2.85 mm to 2.83 mm), suggesting that global alignment provides a more stable initialization. In contrast, adding the flow attention module alone (Table 3 (c)) yields a larger gain (mean TRE of 2.67 mm), highlighting the importance of context-aware fine-level refinement. When both attention modules are used jointly (Table 3 (d)), we observe further improvements across all metrics, indicating that the two modules complement each other by handling large and small deformations at different scales. Finally, adding bi-directional refinement on top of both attention modules yields the best performance, with a mean TRE of 2.24 mm and the lowest errors at all evaluated percentiles (25%, 50%, and 75%).

4.5.2. The Loss Function

In Table 4, we analyze the impact of the training objectives. Using only the multi-scale flow loss (Table 4 (a)) provides a strong supervised baseline. In contrast, training with the cycle consistency loss alone (Table 4 (b)) fails to converge, underscoring the necessity of ground-truth supervision. When both losses are combined, as in our full model, the cycle consistency term acts as an effective regularizer, lowering the mean TRE from 2.32 mm to 2.24 mm and yielding more consistent and anatomically coherent transformations.
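The idea behind the cycle consistency term can be illustrated with a small NumPy sketch. Under our own assumptions, not the exact formulation of Section 3.5.2, the forward flow `v` is defined on source points `P`, the backward flow `u` on target points `Q`, and `u` is resampled at the warped locations `P + v` by inverse-distance weighting over the `k` nearest target points. A perfectly consistent pair then satisfies v(p) + u(p + v(p)) = 0 for every source point p. The helper names and the interpolation scheme are hypothetical.

```python
import numpy as np


def idw_interpolate(query, points, values, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation of per-point 3D vectors.

    query: (M, 3) locations; points: (N, 3) locations carrying values (N, 3).
    """
    query, points, values = (np.asarray(a, dtype=float) for a in (query, points, values))
    d = np.linalg.norm(query[:, None, :] - points[None, :, :], axis=-1)  # (M, N) distances
    k = min(k, points.shape[0])
    idx = np.argsort(d, axis=1)[:, :k]              # indices of the k nearest points
    dk = np.take_along_axis(d, idx, axis=1)          # (M, k) distances to them
    w = 1.0 / (dk + eps)                             # eps avoids division by zero
    w /= w.sum(axis=1, keepdims=True)                # normalise weights per query
    return (w[..., None] * values[idx]).sum(axis=1)  # (M, 3) interpolated vectors


def cycle_consistency_loss(P, v, Q, u, k=3):
    """Mean norm of the forward-then-backward round-trip residual.

    P, v: source points and forward flow (P -> Q).
    Q, u: target points and backward flow (Q -> P).
    """
    P, v = np.asarray(P, dtype=float), np.asarray(v, dtype=float)
    u_at_warped = idw_interpolate(P + v, Q, u, k=k)
    return float(np.linalg.norm(v + u_at_warped, axis=1).mean())


# Toy usage: a pure translation whose backward flow exactly undoes the forward flow.
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
t = np.array([1.0, 2.0, 2.0])
v = np.tile(t, (50, 1))
print(cycle_consistency_loss(P, v, P + t, -v))  # close to 0
```

The dense pairwise-distance matrix keeps the sketch short but costs O(MN) memory; a k-d tree or a GPU k-nearest-neighbour search would be the practical choice for full-size lung point clouds.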

4.5.3. The Weight Factor for Cycle Consistency

To investigate the effect of the cycle consistency loss weight $\lambda$, we performed a grid search over $\lambda \in \{0.01, 0.05, 0.1, 0.5\}$, balancing it against the supervised multi-scale flow loss. As summarized in Table 5, a very small value (e.g., Table 5 (a)) had a minimal regularizing impact, while overly large values (e.g., Table 5 (c)) over-smoothed the deformation field, raising the TRE to 2.57 mm and reducing the DSC to 90.0%. The optimal value, $\lambda = 0.1$, provided the best trade-off between enforcing inverse consistency and maintaining local flexibility, and yielded the best TRE and DSC.
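For reference, the weighting examined in this grid search can be written as the following objective. Here $\mathcal{L}_{\text{flow}}$ and $\mathcal{L}_{\text{cyc}}$ are our shorthand for the bi-directional multi-scale flow loss and the cycle consistency loss of Section 3.5; the symbols are not taken from that section, and any per-level weights inside $\mathcal{L}_{\text{flow}}$ are left implicit.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{flow}} + \lambda \, \mathcal{L}_{\text{cyc}},
\qquad \lambda \in \{0.01,\ 0.05,\ 0.1,\ 0.5\},
\qquad \lambda^{\ast} = 0.1 \ \text{(selected in Table 5)}
```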

4.5.4. The Number of Attention Heads

We conducted an ablation study on the number of attention heads $N_{h}$ used in both the geometric keypoint and flow attention modules. As shown in Table 6, increasing the number of heads lowered the TRE up to $N_{h} = 4$, whereas $N_{h} = 8$ brought no further gain (2.25 mm vs. 2.24 mm). Since $N_{h} = 4$ achieved the best balance between accuracy and stability, producing the lowest TRE and the highest DSC, we used $N_{h} = 4$ in our final model.
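To make the role of $N_{h}$ concrete, the sketch below implements generic multi-head scaled dot-product self-attention over per-point features in NumPy. It is a textbook formulation, not the geometric keypoint or flow attention modules of this paper, which are described in Section 3 and not reproduced here; the projection matrices are random placeholders, and the feature width of 64 in the usage lines is an assumption.

```python
import numpy as np


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-head self-attention over N point features.

    x: (N, C) features; w_q, w_k, w_v, w_o: (C, C) projection matrices.
    C must be divisible by num_heads; each head attends in C // num_heads dims.
    Returns the (N, C) output and the (num_heads, N, N) attention weights.
    """
    n, c = x.shape
    if c % num_heads:
        raise ValueError("feature width must be divisible by num_heads")
    d = c // num_heads

    def split(t):  # (N, C) -> (num_heads, N, d)
        return t.reshape(n, num_heads, d).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)     # (heads, N, N) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)        # subtract row max for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over keys
    out = (attn @ v).transpose(1, 0, 2).reshape(n, c)   # concatenate the heads
    return out @ w_o, attn


# Usage: 128 points with 64-dim features and N_h = 4 heads of 16 dims each.
rng = np.random.default_rng(0)
C = 64
x = rng.normal(size=(128, C))
weights = [rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(4)]
out, attn = multi_head_self_attention(x, *weights, num_heads=4)
print(out.shape, attn.shape)  # (128, 64) (4, 128, 128)
```

With a fixed feature width, each head works in C / N_h dimensions, so raising N_h trades per-head capacity for more distinct attention patterns. This is one common intuition for why gains can level off, as seen beyond N_h = 4 in Table 6, although the table alone does not establish the cause.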

Table 3. Ablation study on model components.

Ref.	Keypoint Attn.	Flow Attn.	Bi-Dir. Refine	TRE ↓	25%	50%	75%	DSC ↑
(a)				2.85	1.52	2.33	3.54	89.9
(b)	✓			2.83	1.54	2.28	3.32	89.9
(c)		✓		2.67	1.38	2.27	3.40	90.1
(d)	✓	✓		2.54	1.29	2.10	3.12	90.2
Ours	✓	✓	✓	2.24	1.16	1.79	2.77	91.4

Table 4. Ablation study on loss function.

Ref.	Multi-Scale Flow	Consistency	TRE ↓	25%	50%	75%	DSC ↑
(a)	✓		2.32	1.25	2.01	2.98	90.9
(b)		✓	Fail to converge
Ours	✓	✓	2.24	1.16	1.79	2.77	91.4

Table 5. Analysis of the weight factor $\lambda$.


Ref.	$λ$	TRE ↓	DSC ↑
(a)	0.01	2.35	90.4
(b)	0.05	2.26	91.2
Ours	0.1	2.24	91.4
(c)	0.5	2.57	90.0

Table 6. Analysis of the number of attention heads $N_{h}$.


Ref.	$N_{h}$	TRE ↓	DSC ↑
(a)	1	2.34	91.1
(b)	2	2.31	90.8
Ours	4	2.24	91.4
(c)	8	2.25	91.0

Overall, these ablation studies demonstrate that each proposed component—ranging from the attention modules and bi-directional training objectives to the key hyperparameter configurations—contributes meaningfully to the improved accuracy, stability, and robustness of our method.

4.6. Analysis of Generalizability

To evaluate the generalization capability of our framework, we conducted cross-dataset experiments using external CT datasets that differed from those used during training. Specifically, the model was trained on the Empire10 [3] and TCIA-NLST [4,45] datasets and then directly tested—without any fine-tuning—on three unseen datasets, L2R-LungCT [7], TCIA-Ventilation [6], and DIR-LAB COPDgene [1], which represent diverse imaging protocols, scanner vendors, and population characteristics.

As shown in Table 7, our method substantially reduces the mean TRE relative to the initial misalignment on all three external datasets, with corresponding increases in the DSC. These results suggest that the model generalizes to unseen anatomical variations and imaging domains despite being trained on only two source datasets. This robustness to domain shift supports the clinical applicability of our point-based framework, as it maintains reliable performance without retraining or tuning for each target dataset.
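The size of these gains can be read off Table 7 directly. The short script below re-uses only the published mean TRE values and computes the relative reduction with respect to the initial misalignment for each external dataset.

```python
# Mean TRE (mm) before and after registration, copied from Table 7.
table7 = {
    "L2R-LungCT": (26.04, 4.53),
    "TCIA-Ventilation": (15.82, 4.33),
    "DIR-LAB COPDgene": (16.25, 3.53),
}


def relative_reduction(initial, result):
    """Fractional reduction of the mean TRE relative to the initial misalignment."""
    return (initial - result) / initial


for name, (initial, result) in table7.items():
    print(f"{name}: {100 * relative_reduction(initial, result):.1f}% lower mean TRE")
```

This prints reductions of about 82.6% (L2R-LungCT), 72.6% (TCIA-Ventilation), and 78.3% (DIR-LAB COPDgene).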

4.7. Analysis of Model Efficiency

In addition to evaluating the registration accuracy, we assessed the computational efficiency of our method in comparison with that of a representative voxel-based approach, VoxelMorph++ [48]. Table 8 summarizes the training memory consumption, inference memory usage, floating-point operations (FLOPs), and inference time per scan pair for both methods.

Our point-cloud-based model shows substantial efficiency advantages. Training memory usage drops from 5.70 GB to 2.78 GB, and inference memory usage from 4.97 GB to 0.82 GB. In terms of computational complexity, our method requires roughly 5.6 times fewer FLOPs (56.89 G vs. 318.72 G). Furthermore, the inference time per scan pair shortens from 266.88 ms to 74.55 ms, a roughly 3.6-fold speed-up.
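The fold-changes implied by Table 8 can be checked with a few lines of Python; the script uses only the published values.

```python
# Values from Table 8 as (VoxelMorph++, ours).
table8 = {
    "train_mem_gb": (5.70, 2.78),
    "infer_mem_gb": (4.97, 0.82),
    "gflops": (318.72, 56.89),
    "time_ms": (266.88, 74.55),
}

# Ratio > 1 means the point-based model needs proportionally less of that resource.
ratios = {metric: baseline / ours for metric, (baseline, ours) in table8.items()}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}x lower")
```

This gives about 2.05× less training memory, 6.06× less inference memory, 5.60× fewer FLOPs, and a 3.58× shorter inference time per scan pair.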

These results indicate that, by leveraging the sparsity and flexibility of point cloud representations, our method is considerably more computationally efficient than the voxel-based VoxelMorph++ baseline while maintaining competitive registration performance. This efficiency is particularly advantageous for potential clinical applications where fast, memory-efficient processing is critical.

5. Discussion

Our results demonstrate that the proposed point-based deformable registration framework achieves state-of-the-art performance on the Lung250M benchmark [16]. By incorporating geometric keypoint attention at coarse resolutions, contextual flow refinement at finer scales, and bi-directional learning objectives, our method achieves a lower TRE on every reported statistic than the prior optimization-based and learning-based approaches. These findings support our core hypothesis that accurate lung CT registration requires joint modeling of long-range structural consistency and localized deformation patterns, particularly given the large respiratory motion and anatomical variability inherent in this domain.

Unlike prior point-based methods, which were primarily developed for rigid registration or generic scene flow tasks [17,18], our approach is, to our knowledge, the first designed specifically for deformable lung CT alignment using sparse point clouds, building on the recent Lung250M dataset [16]. While voxel-based learning methods have achieved notable success in brain MRI and abdominal CT registration, their application to thoracic imaging remains limited by their computational cost and sensitivity to intensity variations. Our method addresses these limitations by leveraging sparse geometric inputs and multi-scale feature interactions, offering a more efficient and robust alternative.

Moreover, the use of bi-directional training with cycle consistency constraints across all resolution levels contributes to the deformation stability by enforcing multi-scale coherence in the flow predictions. This regularization, in combination with a coarse-to-fine refinement architecture, enhances the robustness to noisy or imperfect training correspondences. The empirical consistency of our model’s performance across datasets further suggests resilience to residual noise in the pseudo-ground-truth labels without requiring additional denoising mechanisms.

From a clinical standpoint, the proposed framework offers practical benefits. Its efficiency enables rapid processing of large-scale datasets, and its anatomical interpretability, grounded in vascular keypoints, aligns with radiologists' mental models of thoracic anatomy. These properties suggest it could be useful in workflows such as longitudinal disease monitoring and dose accumulation analyses in radiotherapy, where consistent and reliable registration is critical.

Despite these advantages, our framework also has limitations. As it operates on vessel-derived point clouds, the registration accuracy in regions without clear geometric landmarks—such as homogeneous parenchymal areas—is less directly evaluated. In addition, our training supervision relies on the pseudo-ground-truth correspondences generated by corrField [15]. Although these correspondences have demonstrated high reliability, with reported mean TREs of 1.45 mm (test) and 2.67 mm (validation) against manual landmark annotations [16], a certain level of noise is inevitable. Our model mitigates the potential supervision noise through bi-directional cycle consistency regularization, but an explicit analysis of the robustness to correspondence noise remains an area for future investigation.

Future work could explore hybrid architectures that fuse the voxel-level appearance with the point-based geometry, potentially enhancing the deformation estimations in texture-rich but landmark-sparse regions. Incorporating semantic priors such as airway labels or fissure segmentations may improve the regional interpretability and clinical usability further. Additionally, injecting controlled perturbations into the training correspondences would help quantify the model’s sensitivity and reinforce its reliability. Expanding this framework to handle full 4D CT sequences or to quantify the registration uncertainty also presents promising future directions.
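A perturbation study of the kind suggested above could start from something as simple as the sketch below, which adds isotropic Gaussian noise of a chosen standard deviation (in mm) to pseudo-ground-truth displacement vectors and optionally corrupts a fraction of them with large offsets to mimic gross mismatches. The function name, the noise model, and the parameter values are hypothetical; real corrField [15] errors are unlikely to be purely isotropic Gaussian.

```python
import numpy as np


def perturb_correspondences(displacements, sigma_mm, outlier_frac=0.0,
                            outlier_scale_mm=10.0, seed=None):
    """Return a noisy copy of (N, 3) pseudo-ground-truth displacement vectors.

    sigma_mm: standard deviation of isotropic Gaussian noise added to every vector.
    outlier_frac: fraction of vectors that instead receive a large random offset
    (std outlier_scale_mm), a hypothetical model of gross mismatches.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(displacements, dtype=float)
    noisy = d + rng.normal(scale=sigma_mm, size=d.shape)
    n_out = int(round(outlier_frac * len(d)))
    if n_out:
        idx = rng.choice(len(d), size=n_out, replace=False)  # distinct vectors to corrupt
        noisy[idx] = d[idx] + rng.normal(scale=outlier_scale_mm, size=(n_out, 3))
    return noisy


# Usage: build progressively noisier supervision targets for a sensitivity sweep.
clean = np.zeros((1000, 3))
for sigma in (0.0, 0.5, 1.0, 2.0):
    noisy = perturb_correspondences(clean, sigma, seed=0)
    print(sigma, round(float(np.linalg.norm(noisy - clean, axis=1).mean()), 2))
```

Training one model per noise level and reporting the resulting TRE would then quantify how quickly accuracy degrades as supervision quality drops.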

In conclusion, we present a novel attention-based, bi-directional point cloud registration framework specifically designed for deformable lung CT alignment. Through architectural innovations and rigorous evaluation on a challenging large-scale dataset, we show that point-based learning, when tailored to the characteristics of respiratory deformation, can deliver accurate, efficient, and anatomically consistent registration. Because its deformation estimates are anchored to clinically meaningful vascular structures, the framework is also interpretable, making it a strong candidate for deployment in medical image analysis workflows. We believe this work is an important step toward more robust geometric modeling in thoracic imaging.

Author Contributions

N.L. and T.L. participated in all phases and contributed equally to this work. Their specific contributions are as follows. N.L.: methodology, software, formal analysis, investigation, resources, data curation, writing (original draft preparation), and visualization. T.L.: writing (review and editing), supervision, and project administration. Both authors contributed to the conceptualization and validation of the study. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare (MOHW), Republic of Korea (No. HI22C041600).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Lung250M-4B dataset [16] used in this study is publicly available and can be accessed at https://github.com/multimodallearning/Lung250M-4B, accessed on 3 February 2025.

Acknowledgments

We gratefully acknowledge the insightful feedback and support provided by the Lung Vision AI team at VUNO Inc.

Conflicts of Interest

Author Nahyuk Lee was employed by the company VUNO Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CT	Computed Tomography
COPD	Chronic Obstructive Pulmonary Disease
MLP	Multi-Layer Perceptron
MHSA	Multi-Head Self-Attention
PPWC	Point-PWC
CD	Chamfer Distance
TRE	Target Registration Error
DSC	Dice Similarity Coefficient
TPS	Thin Plate Spline

References

1. Castillo, R.; Castillo, E.; Fuentes, D.; Ahmad, M.; Wood, A.M.; Ludwig, M.S.; Guerrero, T. A reference dataset for deformable image registration spatial accuracy evaluation using the COPDgene study archive. Phys. Med. Biol. 2013, 58, 2861.
2. Liang, X.; Lin, S.; Liu, F.; Schreiber, D.; Yip, M. ORRN: An ODE-based recursive registration network for deformable respiratory motion estimation with lung 4DCT images. IEEE Trans. Biomed. Eng. 2023, 70, 3265–3276.
3. Murphy, K.; Van Ginneken, B.; Reinhardt, J.M.; Kabus, S.; Ding, K.; Deng, X.; Cao, K.; Du, K.; Christensen, G.E.; Garcia, V.; et al. Evaluation of registration methods on thoracic CT: The EMPIRE10 challenge. IEEE Trans. Med. Imaging 2011, 30, 1901–1920.
4. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. J. Digit. Imaging 2013, 26, 1045–1057.
5. Yasuda, N.; Iwasawa, T.; Baba, T.; Misumi, T.; Cheng, S.; Kato, S.; Utsunomiya, D.; Ogura, T. Evaluation of Progressive Architectural Distortion in Idiopathic Pulmonary Fibrosis Using Deformable Registration of Sequential CT Images. Diagnostics 2024, 14, 1650.
6. Eslick, E.; Kipritidis, J.; Gradinscak, D.; Stevens, M.; Bailey, D.; Harris, B.; Booth, J.; Keall, P. CT Ventilation as a functional imaging modality for lung cancer radiotherapy (CT-vs-PET-Ventilation-Imaging) (Version 1) [Data set]. Cancer Imaging Arch. 2022.
7. Hering, A.; Hansen, L.; Mok, T.C.; Chung, A.C.; Siebert, H.; Häger, S.; Lange, A.; Kuckertz, S.; Heldmann, S.; Shao, W.; et al. Learn2Reg: Comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning. IEEE Trans. Med. Imaging 2022, 42, 697–712.
8. Zitova, B.; Flusser, J. Image registration methods: A survey. Image Vis. Comput. 2003, 21, 977–1000.
9. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; IEEE: New York, NY, USA, 1999; Volume 2, pp. 1150–1157.
10. Wang, Y.; Solomon, J.M. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3523–3532.
11. Qin, Z.; Yu, H.; Wang, C.; Guo, Y.; Peng, Y.; Ilic, S.; Hu, D.; Xu, K. GeoTransformer: Fast and robust point cloud registration with geometric transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9806–9821.
12. Viola, P.; Wells, W.M., III. Alignment by maximization of mutual information. Int. J. Comput. Vis. 1997, 24, 137–154.
13. Wells, W.M., III; Viola, P.; Atsumi, H.; Nakajima, S.; Kikinis, R. Multi-modal volume registration by maximization of mutual information. Med. Image Anal. 1996, 1, 35–51.
14. Lewis, J.P. Fast normalized cross-correlation. In Proceedings of the Vision Interface, Quebec City, QC, Canada, 16–19 May 1995; Volume 10, pp. 120–123.
15. Heinrich, M.P.; Handels, H.; Simpson, I.J. Estimating large lung motion in COPD patients by symmetric regularised correspondence fields. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part II 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 338–345.
16. Falta, F.; Großbröhmer, C.; Hering, A.; Bigalke, A.; Heinrich, M. Lung250M-4B: A combined 3D dataset for CT- and point cloud-based intra-patient lung registration. Adv. Neural Inf. Process. Syst. 2023, 36, 54819–54832.
17. Wu, W.; Wang, Z.Y.; Li, Z.; Liu, W.; Fuxin, L. PointPWC-Net: Cost volume on point clouds for (self-)supervised scene flow estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 88–107.
18. Liu, X.; Qi, C.R.; Guibas, L.J. FlowNet3D: Learning scene flow in 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 529–537.
19. Gu, X.; Wang, Y.; Wu, C.; Lee, Y.J.; Wang, P. HPLFlowNet: Hierarchical permutohedral lattice FlowNet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3254–3263.
20. Myronenko, A.; Song, X. Point set registration: Coherent point drift. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 2262–2275.
21. Galbán, C.J.; Han, M.K.; Boes, J.L.; Chughtai, K.A.; Meyer, C.R.; Johnson, T.D.; Galbán, S.; Rehemtulla, A.; Kazerooni, E.A.; Martinez, F.J.; et al. Computed tomography–based biomarker provides unique signature for diagnosis of COPD phenotypes and disease progression. Nat. Med. 2012, 18, 1711–1715.
22. Weiss, E.; Wijesooriya, K.; Dill, S.V.; Keall, P.J. Tumor and normal tissue motion in the thorax during respiration: Analysis of volumetric and positional variations using 4D CT. Int. J. Radiat. Oncol. Biol. Phys. 2007, 67, 296–307.
23. Heinrich, M.P.; Jenkinson, M.; Brady, M.; Schnabel, J.A. MRF-based deformable registration and ventilation estimation of lung CT. IEEE Trans. Med. Imaging 2013, 32, 1239–1248.
24. Ding, K.; Yin, Y.; Cao, K.; Christensen, G.E.; Lin, C.L.; Hoffman, E.A.; Reinhardt, J.M. Evaluation of lobar biomechanics during respiration using image registration. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 739–746.
25. Nakao, M.; Tokuno, J.; Chen-Yoshikawa, T.; Date, H.; Matsuda, T. Surface deformation analysis of collapsed lungs using model-based shape matching. Int. J. Comput. Assist. Radiol. Surg. 2019, 14, 1763–1774.
26. Gorbunova, V.; Sporring, J.; Lo, P.; Loeve, M.; Tiddens, H.A.; Nielsen, M.; Dirksen, A.; de Bruijne, M. Mass preserving image registration for lung CT. Med. Image Anal. 2012, 16, 786–795.
27. Yin, Y.; Hoffman, E.A.; Lin, C.L. Mass preserving nonrigid registration of CT lung images using cubic B-spline. Med. Phys. 2009, 36, 4213–4222.
28. Thirion, J.P. Non-Rigid Matching using Demons. In Proceedings of the Computer Vision and Pattern Recognition CVPR’96, San Francisco, CA, USA, 18–20 June 1996.
29. Rueckert, D.; Sonoda, L.I.; Hayes, C.; Hill, D.L.; Leach, M.O.; Hawkes, D.J. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Trans. Med. Imaging 1999, 18, 712–721.
30. Gao, S.; Zhang, Y.; Yang, J.; Wang, C.H.; Zhang, L.; Court, L.E.; Dong, L. A Hybrid Algorithm to Address Ambiguities in Deformable Image Registration for Radiation Therapy. Int. J. Med. Phys. 2012, 1, 50–59.
31. Schmidt-Richberg, A.; Ehrhardt, J.; Werner, R.; Handels, H. Lung registration with improved fissure alignment by integration of pulmonary lobe segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2012: 15th International Conference, Nice, France, 1–5 October 2012; Proceedings, Part II 15. Springer: Berlin/Heidelberg, Germany, 2012; pp. 74–81.
32. Heinrich, M.P.; Jenkinson, M.; Bhushan, M.; Matin, T.; Gleeson, F.V.; Brady, M.; Schnabel, J.A. MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Med. Image Anal. 2012, 16, 1423–1435.
33. Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. VoxelMorph: A learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 2019, 38, 1788–1800.
34. De Vos, B.D.; Berendsen, F.F.; Viergever, M.A.; Sokooti, H.; Staring, M.; Išgum, I. A deep learning framework for unsupervised affine and deformable image registration. Med. Image Anal. 2019, 52, 128–143.
35. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
36. Liu, J.; Akin, O.; Tian, Y. Rethinking pulmonary nodule detection in multi-view 3D CT point cloud representation. In Proceedings of the Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, 27 September 2021; Proceedings 12. Springer: Berlin/Heidelberg, Germany, 2021; pp. 80–90.
37. Jia, J.; Yu, B.; Mody, P.; Ninaber, M.K.; Schouffoer, A.A.; de Vries-Bouwstra, J.K.; Kroft, L.J.; Staring, M.; Stoel, B.C. Using 3D point cloud and graph-based neural networks to improve the estimation of pulmonary function tests from chest CT. Comput. Biol. Med. 2024, 182, 109192.
38. Wu, W.; Qi, Z.; Fuxin, L. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9621–9630.
39. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114.
40. Qin, Z.; Yu, H.; Wang, C.; Guo, Y.; Peng, Y.; Xu, K. Geometric transformer for fast and robust point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11143–11152.
41. Roman, N.O.; Shepherd, W.; Mukhopadhyay, N.; Hugo, G.D.; Weiss, E. Interfractional positional variability of fiducial markers and primary tumors in locally advanced non-small-cell lung cancer during audiovisual biofeedback radiotherapy. Int. J. Radiat. Oncol. Biol. Phys. 2012, 83, 1566–1572.
42. Balik, S.; Weiss, E.; Jan, N.; Roman, N.; Sleeman, W.C.; Fatyga, M.; Christensen, G.E.; Zhang, C.; Murphy, M.J.; Lu, J.; et al. Evaluation of 4-dimensional computed tomography to 4-dimensional cone-beam computed tomography deformable image registration for lung cancer adaptive radiation therapy. Int. J. Radiat. Oncol. Biol. Phys. 2013, 86, 372–379.
43. Hugo, G.D.; Weiss, E.; Sleeman, W.C.; Balik, S.; Keall, P.J.; Lu, J.; Williamson, J.F. A longitudinal four-dimensional computed tomography and cone beam computed tomography dataset for image-guided radiation therapy research in lung cancer. Med. Phys. 2017, 44, 762–771.
44. Hugo, G.D.; Weiss, E.; Sleeman, W.C.; Balik, S.; Keall, P.J.; Lu, J.; Williamson, J.F. Data from 4D Lung Imaging of NSCLC Patients (Version 2) [Data set]. Cancer Imaging Arch. 2016.
45. National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N. Engl. J. Med. 2011, 365, 395–409.
46. Paszke, A. PyTorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
47. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the ICLR, San Diego, CA, USA, 7–9 May 2015.
48. Heinrich, M.P.; Hansen, L. VoxelMorph++: Going beyond the cranial vault with keypoint supervision and multi-channel instance optimisation. In Proceedings of the International Workshop on Biomedical Image Registration, Munich, Germany, 10–12 July 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 85–95.

Figure 1. Overall architecture. We highlight three main components of our framework: (1) multi-scale feature extraction, (2) geometric keypoint attention, and (3) progressive flow refinement. Given an input pair of point clouds $P$ and $Q$, our model outputs the final corresponding displacements $v^{(1)}$ and $u^{(1)}$ at the coarsest level $l = 1$. The colors of the displacement vectors represent their angular orientation, facilitating the visual interpretation of flow directionality. Detailed descriptions of each module are provided in Section 3.


Figure 2. An overview of the progressive flow refinement at level l. The module consists of two main steps: (1) warping and cost volume construction and (2) the l-th flow estimator with attention-driven refinement. See Section 3.4 for a detailed description of each module.

Figure 3. A qualitative comparison of the registration results. From left to right: the initial misalignment, the PPWC baseline [17], and our proposed method. To highlight subtle differences, we additionally provide zoom-in views of key anatomical regions, allowing a clearer visual comparison of the deformation quality.

Table 1. Detailed composition of the training, validation, and testing split of the Lung250M-4B dataset.

Split	Sub-Dataset	Index	# Pair
Training	Empire10 [3]	0–1, 3–7, 9–11	97
	TCIA-NLST [4,45]	12–33
	TCIA-NSCLC [4,41,42,43,44]	34–53
	L2R-LungCT [7]	57–83
	TCIA-Ventilation [4,6]	84–93, 95–96, 98–103
Validation	Empire10 [3]	2, 8	17
	L2R-LungCT [7]	54–56
	TCIA-Ventilation [4,6]	94, 97
	TCIA-NLST [4,45]	114–123
Testing	DIR-LAB COPDgene [1]	104–113	10

Table 2. Quantitative results on the COPD test cases of point-based methods, reported as the mean TRE with 25/50/75% percentiles (in mm) and the DSC (in %). A lower TRE and a higher DSC indicate a better performance. The performance metrics for the other methods are sourced from the Lung250M-4B [16] paper.

Method	TRE ↓	25%	50%	75%	DSC ↑
initial	16.25	10.14	15.94	21.76	75.8
VoxelMorph_{w/o IO} [33]	6.53	3.38	5.82	8.50	-
VoxelMorph++_{w/o IO} [48]	4.47	2.41	3.74	5.69	-
CPD [20]	3.13	1.51	2.28	3.58	-
CPD w/ labels [20]	2.59	1.36	2.01	3.16	-
PPWC_sup [17]	2.85	1.52	2.33	3.54	-
PPWC_syn [17]	2.73	1.52	2.28	3.45	-
Ours	2.24	1.16	1.79	2.77	91.4

Table 7. Generalizability experiment results on various datasets. The model was trained on the Empire10 [3] and TCIA-NLST [4,45] datasets and evaluated on L2R-LungCT [7], TCIA-Ventilation [4,6], and DIR-LAB COPDgene [1]. We refer the readers to Table 1 for further dataset details.

Dataset	Stage	TRE ↓	25%	50%	75%	DSC ↑
L2R-LungCT [7]	initial	26.04	21.59	25.90	31.05	65.2
L2R-LungCT [7]	result	4.53	2.63	3.79	5.73	84.8
TCIA-Ventilation [4,6]	initial	15.82	11.17	16.03	20.69	71.4
TCIA-Ventilation [4,6]	result	4.33	2.44	3.26	4.50	88.8
DIR-LAB COPDgene [1]	initial	16.25	10.14	15.94	21.76	75.8
DIR-LAB COPDgene [1]	result	3.53	1.89	2.92	4.59	89.3

Table 8. Efficiency comparison between VoxelMorph++ [48] and ours. Lower is better.

Method	Representation	Train Mem. (GB)	Inference Mem. (GB)	FLOPs (G)	Inference Time (ms)
VoxelMorph++	Voxel	5.70	4.97	318.72	266.88
Ours	Point cloud	2.78	0.82	56.89	74.55

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
