3.1. Point Feature Extraction Network
The first step in the point correspondence model is to obtain a feature representation for every point in the source point cloud $PC_t$ and the target point cloud $PC_{t+1}$. These features underpin the construction of soft correspondence points and the calculation of the initial estimated flows. A deep network transforms the original 3D coordinates of each point into a high-dimensional feature vector, which captures the geometric information of the point cloud more effectively. The feature extraction network therefore takes $PC_t$ and $PC_{t+1}$ as input and outputs the corresponding feature matrix for each cloud, in which every point is described by a feature vector of fixed dimensionality.
However, experiments using SCOOP [9] indicate that incorrect soft correspondences frequently occur when local regions contain similar structures. The root cause lies in the inaccurate feature representations of the points within these structures. This inaccuracy leads to insufficient differentiation between points residing in similar local areas. Consequently, during soft correspondence construction, points from nearby similar structures are often erroneously selected. To address this limitation, this paper proposes an improved point feature extraction network. This enhanced network specifically strengthens the model’s ability to learn local structural information within point clouds [34,35]. It mitigates the problem of ambiguous feature representation in geometrically similar regions.
The point feature extraction network in SCOOP [5] builds upon the PointNet++ [8] architecture. This network incorporates three SetConv layers. Each layer performs feature extraction while progressively expanding the dimensionality of the feature vectors. Specifically, the SetConv layer operates through three sequential stages: 2D convolution, instance normalization, and Leaky ReLU activation. Subsequently, it applies max pooling to aggregate information within local neighborhoods. This process ensures the final point features capture local geometric structure information effectively. However, this paper observes that the current network design underutilizes the feature information extracted during each processing stage. It relies solely on the highest-level features from the final stage for neighborhood feature aggregation. Consequently, the lower-level features generated in the first two stages remain insufficiently leveraged, despite their inherent utility for capturing nuanced local structural information within the point cloud.
To address this limitation, the paper introduces enhancements to the SetConv layers. By more comprehensively leveraging features extracted at every stage, the network improves its capacity to represent local structures in unstructured, unordered point clouds. This improved variant is termed the SetConv++ layer.
Figure 2 illustrates its structure.
Unlike the SetConv layer, the SetConv++ layer concatenates the three levels of features obtained after each processing stage along the feature dimension. This combined output is termed the concatenated multi-level feature. Subsequently, this concatenated feature undergoes convolution and instance normalization to reduce dimensionality, resulting in the fused multi-level feature. The fused multi-level feature then combines with the third-level feature through element-wise addition.
Following this operation, max pooling is applied within each local neighborhood to extract the most significant feature. This significant feature is then concatenated with the original concatenated multi-level feature along the feature dimension. The combined result is again reduced in dimensionality via convolution and instance normalization. Finally, max pooling is performed once more within the local neighborhood.
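The data flow of the SetConv++ layer described above can be sketched for a single local neighborhood as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`linear`, `max_pool`, `setconv_pp`), the random weights standing in for learned ones, the channel sizes, and the omission of instance normalization are all choices made here for brevity.

```python
import random

def linear(rows, weight):
    """1x1 convolution: apply the same linear map (weight[out][in]) to every neighbor row."""
    return [[sum(x * w for x, w in zip(row, w_out)) for w_out in weight] for row in rows]

def leaky_relu(rows, slope=0.1):
    return [[x if x > 0 else slope * x for x in row] for row in rows]

def max_pool(rows):
    """Channel-wise maximum over all neighbors of one local neighborhood."""
    return [max(channel) for channel in zip(*rows)]

def random_weight(rng, n_out, n_in):
    """Random stand-in for a learned weight matrix (assumption for this sketch)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def setconv_pp(neighborhood, dims, rng):
    """neighborhood: one row of input features per neighbor; dims: channels of the three stages."""
    levels, x = [], neighborhood
    for d in dims:  # three stages (instance normalization omitted here)
        x = leaky_relu(linear(x, random_weight(rng, d, len(x[0]))))
        levels.append(x)
    f1, f2, f3 = levels
    multi = [a + b + c for a, b, c in zip(f1, f2, f3)]    # concatenated multi-level feature
    fused = linear(multi, random_weight(rng, dims[2], len(multi[0])))  # fused multi-level feature
    summed = [[p + q for p, q in zip(rf, r3)] for rf, r3 in zip(fused, f3)]
    salient = max_pool(summed)                            # most significant feature
    widened = [salient + row for row in multi]            # concatenate with the multi-level feature
    reduced = linear(widened, random_weight(rng, dims[2], len(widened[0])))
    return max_pool(reduced)                              # final pooling over the neighborhood

rng = random.Random(0)
neighborhood = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]
feature = setconv_pp(neighborhood, (8, 8, 16), rng)
```

Here a neighborhood of four neighbors with 3D inputs yields one 16-dimensional feature for the center point; the output width of the second reduction is assumed to equal the third-level width.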
This improvement aims to better leverage local contextual information while emphasizing the most salient neighborhood features. Consequently, the network learns local point cloud structures more effectively. Consistent with SCOOP [5], the current feature vector is concatenated with the coordinate difference vectors between the neighboring points and the center point, and the result serves as input to the next SetConv++ layer. This design integrates local spatial relationships and thereby improves the capture of local structure. The overall feature extraction network proposed in this paper comprises three SetConv++ layers, with $k_n$ denoting the number of neighborhood points.
3.2. Construct Soft Correspondence
In this paper, we construct soft correspondences for each point in the source point cloud from the feature matrices of the two point clouds. This process follows the method described in [5], which solves an optimal transport problem through Sinkhorn iterations with entropy regularization. The strength of the regularization is controlled by a learnable entropy regularization parameter. Its final converged value of ≈ 0.15 keeps the entropy of the correspondence matrix in a moderately softened state, balancing matching accuracy and computational stability. Since the transport cost depends on inter-point similarity, we compute the cosine similarity matrix between the features to represent the similarity between points in $PC_t$ and $PC_{t+1}$, as shown in Equation (1):
where the two feature vectors are those of the i-th point in $PC_t$ and the j-th point in $PC_{t+1}$, respectively.
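The cosine similarity matrix can be illustrated with a short dependency-free sketch. The function name `cosine_similarity_matrix` and the toy feature vectors are assumptions introduced here for illustration; only the definition of cosine similarity itself is taken as given.

```python
import math

def cosine_similarity_matrix(feats_src, feats_tgt):
    """Entry [i][j] is the cosine similarity between source feature i and target feature j."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]
    src = [unit(v) for v in feats_src]
    tgt = [unit(v) for v in feats_tgt]
    return [[sum(a * b for a, b in zip(u, w)) for w in tgt] for u in src]

# Toy 2D features for two source points and three target points.
feats_src = [[1.0, 0.0], [1.0, 1.0]]
feats_tgt = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
sim = cosine_similarity_matrix(feats_src, feats_tgt)
```

Because the features are normalized first, the result depends only on the direction of each feature vector, not on its magnitude.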
By constructing and solving this optimal transport problem [36], we obtain the optimal transport matrix. For each source point, we then select the top $k_t$ points from $PC_{t+1}$ with the highest transported mass according to this matrix. The weights for these $k_t$ points are calculated as shown in Equation (2):
where i indexes a point in $PC_t$, and j and n range over the set of indices of the selected $k_t$ points in $PC_{t+1}$.
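A minimal sketch of entropy-regularized Sinkhorn iterations followed by the top-$k_t$ weight normalization is given below. Several details are assumptions made for this illustration rather than statements about the paper: the cost is taken as one minus the cosine similarity, both marginals are uniform, the regularization value is fixed at 0.15 instead of being learned, and the names `sinkhorn` and `topk_weights` are hypothetical.

```python
import math

def sinkhorn(cost, eps=0.15, n_iters=50):
    """Entropy-regularized optimal transport with uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def topk_weights(transport_row, k):
    """Keep the k target points with the largest transported mass and renormalize."""
    order = sorted(range(len(transport_row)), key=lambda j: transport_row[j], reverse=True)
    selected = order[:k]
    total = sum(transport_row[j] for j in selected)
    return {j: transport_row[j] / total for j in selected}

sim = [[1.0, 0.0, -1.0], [0.7, 0.7, -0.7]]        # toy similarity matrix
cost = [[1.0 - s for s in row] for row in sim]    # assumed cost: 1 - similarity
transport = sinkhorn(cost)
weights = topk_weights(transport[0], k=2)         # weights for the first source point
```

The returned dictionary maps each selected target index to its normalized weight, so the weights of one source point always sum to one.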
Using these calculated weights, the soft correspondence point of the i-th point in $PC_t$ at time $t+1$ is determined, as shown in Equation (3). The coordinate difference between this soft correspondence point and the source point serves as the initial estimation of the scene flow at that point:
where the weights are those computed in Equation (2) for the $k_t$ selected points.
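The construction of a soft correspondence point and the resulting initial flow can be sketched as follows. The sketch assumes that the soft correspondence point is the weighted sum of the selected target points, which is how the weights are used in SCOOP; the function names and the toy coordinates are illustrative only.

```python
def soft_correspondence(weights, target_points):
    """Weighted sum of the selected target points; weights maps target index -> weight."""
    return [sum(w * target_points[j][d] for j, w in weights.items()) for d in range(3)]

def initial_flow(source_point, weights, target_points):
    """Initial flow estimate: soft correspondence point minus the source point."""
    corr = soft_correspondence(weights, target_points)
    return [c - s for c, s in zip(corr, source_point)]

target = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [9.0, 9.0, 9.0]]
# The source point at the origin is matched to target points 0 and 1 with equal weight.
flow = initial_flow([0.0, 0.0, 0.0], {0: 0.5, 1: 0.5}, target)
```

In this toy case the soft correspondence lies midway between the two selected target points, so the estimated flow points two units along the x-axis, and the unselected third target point has no effect.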
3.3. Self-Supervised Loss Term Design
Since ground truth data for point cloud scene flow is unavailable, effective training of the point correspondence model requires self-supervised loss terms. SCOOP [5] proposes three such terms: a chamfer distance loss, a confidence loss, and a flow smoothness loss. This paper acknowledges that the flow smoothness loss from [5] helps maintain local consistency of the estimated flow. However, it also propagates incorrect estimations. Therefore, building on that term, this paper proposes a novel flow smoothness loss. While preserving the role of the original term in promoting local consistency, the new term also suppresses the spread of incorrect estimations. Consequently, this paper selects three self-supervised loss terms for training the point correspondence model: the chamfer distance loss and confidence loss from SCOOP [5], combined with the newly proposed flow smoothness loss term.
In this paper, the set of soft correspondences for the source point cloud $PC_t$ at time $t+1$ forms the estimated target point cloud. The purpose of the chamfer distance loss term is to maximize the overlap between this estimated point cloud and the actual target point cloud $PC_{t+1}$. Compared to similar chamfer losses in prior works [6,37], SCOOP [5] introduces the concept of matching confidence, which indicates the reliability of the generated soft correspondences. This approach provides valuable insight for the subsequent improvement of the flow smoothness loss proposed in this work. The chamfer distance loss used here follows the same construction as described in SCOOP [5]. Specifically, the confidence for an estimated point is calculated using the cosine similarity matrix, as shown in Equation (4). Introducing this confidence into the chamfer distance loss term increases the supervision weight on high-confidence estimations, as shown in Equation (5).
where m is the number of three-dimensional points contained in the source point cloud $PC_t$ and the target point cloud $PC_{t+1}$.
Additionally, building on [5], this paper constructs a confidence loss term to penalize low-confidence flow estimations, as shown in Equation (6).
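The interplay of matching confidence, the confidence-weighted chamfer distance, and the confidence loss can be sketched as below. The exact formulas are assumptions modeled on the SCOOP construction: confidence as the weight-averaged similarity of the selected target points, squared Euclidean distances, a backward chamfer term weighted by the confidence of the nearest estimated point, and a confidence loss equal to the mean of one minus the confidence. All function names are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def confidence(weights, sim_row):
    """Weight-averaged cosine similarity of the selected target points (assumed form)."""
    return sum(w * sim_row[j] for j, w in weights.items())

def confidence_loss(conf):
    """Penalizes low-confidence estimates: mean of (1 - confidence)."""
    return sum(1.0 - c for c in conf) / len(conf)

def weighted_chamfer(est_points, target_points, conf):
    """Chamfer distance in which each term is scaled by a matching confidence."""
    forward = sum(c * min(sq_dist(p, q) for q in target_points)
                  for p, c in zip(est_points, conf)) / len(est_points)
    backward = 0.0
    for q in target_points:
        # Nearest estimated point to q; its confidence weights the backward term.
        i = min(range(len(est_points)), key=lambda k: sq_dist(est_points[k], q))
        backward += conf[i] * sq_dist(est_points[i], q)
    return forward + backward / len(target_points)

conf_example = confidence({0: 0.5, 1: 0.5}, [1.0, 0.5, 0.0])
loss_example = weighted_chamfer([[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [0.5])
```

Scaling each distance by the confidence means that a poorly matched point contributes less to the chamfer term, which is why a separate confidence loss is needed to keep the confidences from collapsing toward zero.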
The purpose of the flow smoothness loss term is to enhance the local consistency of the estimated point cloud scene flow. This ensures that neighboring points within the source point cloud exhibit similar motion patterns. Consequently, the surface of the resulting estimated point cloud becomes as smooth as possible. In SCOOP [5], the flow smoothness loss term is constructed following this principle, as shown in Equation (7).
where the neighborhood of each point consists of a fixed number of its nearest neighbors, identified by the set of their indices.
While constructing the flow smoothness loss term as defined in Equation (7) enhances the local consistency of the point cloud scene flow, this paper contends that it fails to account for the influence of erroneous estimations, so incorrect estimations propagate. In particular, the existing method assigns equal influence to every neighboring point on the center point’s flow estimate, which gives erroneous neighborhood estimations an excessive negative impact. To address this limitation, this paper improves the original flow smoothness loss term from [5]. Inspired by the integration of confidence into the chamfer distance loss in SCOOP [5], this work introduces confidence-aware weights for neighborhood influences. Specifically, high-confidence flow estimates exert greater influence on the center point, whereas low-confidence estimates have reduced impact. Similarly, the center point’s own confidence determines its contribution to the total loss: higher confidence increases the contribution, and lower confidence decreases it. Consistency with reliable high-confidence flows is therefore enhanced, while the detrimental influence of potentially erroneous low-confidence flows is mitigated. Consequently, the proposed loss term, given in Equation (8), better maintains flow consistency and suppresses error propagation.
where the weight of each center point is obtained by applying softmax to the global confidence vector, which holds the confidence of every point. In contrast, the weight of each neighbor is obtained by applying softmax to the local neighborhood confidence vector, which is formed by selecting the entries of the global confidence vector that correspond to the indices of that neighborhood.
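The confidence-aware weighting can be sketched as follows. This is one possible reading of the description, not the paper's exact Equation (8): the use of squared flow differences and the way the two softmax weights are multiplied are assumptions, and the name `weighted_smoothness` is hypothetical.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def softmax(values):
    peak = max(values)                        # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_smoothness(flows, conf, neighbors):
    """neighbors[i] lists the indices of the nearest neighbors of point i."""
    center_w = softmax(conf)                  # global softmax: weight of each center point
    loss = 0.0
    for i, idx in enumerate(neighbors):
        local_w = softmax([conf[l] for l in idx])   # local softmax over the neighborhood
        diff = sum(w * sq_dist(flows[i], flows[l]) for w, l in zip(local_w, idx))
        loss += center_w[i] * diff
    return loss

flows = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]   # point 2 is an outlier
neighbors = [[1, 2], [0, 2], [0, 1]]
loss_equal = weighted_smoothness(flows, [0.9, 0.9, 0.9], neighbors)
loss_aware = weighted_smoothness(flows, [0.9, 0.9, 0.1], neighbors)
```

When the outlier carries a low confidence, both its pull on its neighbors and its own contribution to the total shrink, so `loss_aware` is smaller than `loss_equal`, the value obtained when every point is trusted equally.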
In summary, the overall training loss combines the chamfer distance loss, the confidence loss, and the proposed flow smoothness loss, as shown in Equation (9).
where the coefficients of the three loss terms are hyperparameters.
3.4. Design of the Refinement Module
Instead of directly refining the initial flow estimate with residual flow refinement as in SCOOP [5], this paper first employs a confidence-based random walk algorithm. While SCOOP’s refinement module aims to improve flow accuracy, experiments in [5] reveal that errors persist when target point clouds lack corresponding local regions from the source. These errors occur because selected soft correspondence points have no true counterparts, leading to fundamentally incorrect initial flow estimates in affected regions. Crucially, SCOOP’s residual refinement cannot correct such errors.
To address this limitation, the proposed random walk refinement further adjusts high-confidence flow estimates and reconstructs low-confidence estimates from neighboring high-confidence flows. This process enhances local flow consistency and specifically targets the case in which regions missing from $PC_{t+1}$ (relative to $PC_t$) cause incorrect soft correspondences and erroneous flow estimations.
The random walk refinement module first constructs a graph on the source point cloud $PC_t$, whose node set consists of all points in $PC_t$ and whose edge set connects them. The node set is divided into two parts: a high-confidence set containing all points whose estimated flow confidence is greater than or equal to a threshold, and a low-confidence set containing the remaining points, whose confidence is below that threshold. The estimated flow at each point in the high-confidence set is considered influential and only requires fine-tuning. In contrast, the estimated flow at each point in the low-confidence set is deemed non-influential and needs to be reconstructed.
The distance influence matrix is defined over the points within the graph. All elements on its main diagonal are explicitly set to zero, and each off-diagonal element, which relates the i-th and j-th nodes, is calculated by Equation (10):
where the two position vectors denote the spatial coordinates of the i-th and j-th nodes in the graph, respectively. The hyperparameter ω controls the rate at which influence diminishes over distance: higher values of ω result in weaker distance-based attenuation, while lower values yield a stronger decrease in influence as the distance between nodes increases. In the limits, when ω approaches 0, the random walk degenerates into isolated point reconstruction, and low-confidence points rely only on themselves. When ω approaches positive infinity, the weights of all points become equal, resulting in global over-smoothing that destroys motion boundaries.
The subsequent step computes the transition probability matrix, which takes the estimated flow confidence at each point into account. Specifically, the influence entries of points whose confidence exceeds a predefined threshold are retained, whereas those of points whose confidence falls below the threshold are set to zero. Finally, the retained (valid) entries are normalized. The normalized result defines the transition probability matrix, as shown in Equation (11).
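The two steps above, the distance influence matrix and the confidence-masked transition probabilities, can be sketched as follows. The Gaussian-style kernel `exp(-d^2 / omega^2)` is an assumed form chosen only because it reproduces the limiting behavior described for ω; the row-wise normalization and the function names are likewise assumptions of this sketch.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def distance_influence(points, omega):
    """Assumed kernel: influence decays with squared distance; the diagonal is zero."""
    n = len(points)
    return [[0.0 if i == j else math.exp(-sq_dist(points[i], points[j]) / omega ** 2)
             for j in range(n)] for i in range(n)]

def transition_matrix(influence, conf, threshold):
    """Zero the influence of low-confidence points, then normalize each row."""
    rows = []
    for row in influence:
        kept = [d if c >= threshold else 0.0 for d, c in zip(row, conf)]
        total = sum(kept)
        rows.append([d / total if total > 0 else 0.0 for d in kept])
    return rows

points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
conf = [0.9, 0.2, 0.8]                      # point 1 has a low-confidence flow
influence = distance_influence(points, omega=1.0)
transition = transition_matrix(influence, conf, threshold=0.5)
```

In the example, the column of the low-confidence point is zero in every row, so its flow never propagates to other points, while it still receives flow from its two high-confidence neighbors in equal shares.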
The fine-tuning and reconstruction parts are performed in the same manner as in [32]. The estimated flow at each point in the high-confidence set after a given number of fine-tuning iterations is computed as shown in Equation (12).
where the iteration starts from the initial estimated flow, and a scalar parameter adjusts the strength of the random walk adjustment. The iterated flow gradually converges as the number of iterations increases. In the limit of infinitely many iterations, the converged estimated flow can be computed directly, as shown in Equation (13).
The estimated flow at each point in the low-confidence set is then reconstructed from the converged fine-tuned flow. The reconstruction process is shown in Equation (14).
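The fine-tuning iteration and the reconstruction step can be sketched as below. The update rule is an assumption in the style of a random walk with restart: each iteration mixes the propagated flow with the initial flow using a strength parameter, here called `gamma`. The paper defers the exact form to [32], so the rule, the parameter name, and the function names are illustrative only.

```python
def propagate(matrix, flows):
    """Multiply a transition matrix (rows x n) with n flow vectors of length 3."""
    return [[sum(m * f[d] for m, f in zip(row, flows)) for d in range(3)] for row in matrix]

def fine_tune(p_high, flows_init, gamma=0.5, n_iters=200):
    """Assumed update: flow <- gamma * P * flow + (1 - gamma) * initial flow."""
    flows = [list(f) for f in flows_init]
    for _ in range(n_iters):
        walked = propagate(p_high, flows)
        flows = [[gamma * w + (1.0 - gamma) * f0 for w, f0 in zip(wr, fr)]
                 for wr, fr in zip(walked, flows_init)]
    return flows

def reconstruct(p_low_to_high, flows_high):
    """Low-confidence flows are rebuilt purely from the fine-tuned high-confidence flows."""
    return propagate(p_low_to_high, flows_high)

p_high = [[0.0, 1.0], [1.0, 0.0]]            # two high-confidence points see each other
flows_init = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
tuned = fine_tune(p_high, flows_init)
rebuilt = reconstruct([[0.5, 0.5]], tuned)   # one low-confidence point between them
```

With `gamma = 0.5` the iteration contracts geometrically, so 200 iterations are far more than needed to reach the fixed point of the update, which plays the role of the infinite-iteration limit; here the two flows are pulled toward each other while their mean is preserved.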
By using the fine-tuned estimated flow and the reconstructed estimated flow, the complete estimated flow after random walk refinement can be obtained. Unlike in [32], where all flows are directly replaced based on reliability, this paper introduces two confidence thresholds (α and β), with β > α. Points are divided into three parts based on their flow estimation confidence.
Points whose estimated flow confidence lies below α form the low-confidence part; for these points, the estimated flow is directly replaced with the corresponding part of the reconstructed flow. Points whose confidence lies between α and β form the intermediate part; for these points, the estimated flow is replaced by a mixture of the corresponding parts of the fine-tuned and reconstructed flows. The mixing ratio is chosen such that the lower the confidence of the original estimated flow, the higher the proportion of the reconstructed flow; conversely, the higher the confidence, the lower that proportion. Points whose confidence lies above β form the high-confidence part; for these points, the estimated flow is directly replaced with the corresponding part of the fine-tuned flow.
The final result of the random walk refinement is shown in Equation (15).
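The three-way combination can be sketched for a single point as follows. The linear mixing ratio between α and β is an assumption made for this illustration, since the text only requires that the share of the reconstructed flow grows as confidence falls; the treatment of the interval boundaries and the name `blend_flow` are also hypothetical.

```python
def blend_flow(conf, fine_tuned, reconstructed, alpha, beta):
    """Combine the two refined flows of one point according to its confidence."""
    if conf < alpha:                  # low confidence: trust the reconstruction only
        return list(reconstructed)
    if conf >= beta:                  # high confidence: keep the fine-tuned flow
        return list(fine_tuned)
    # Intermediate confidence: assumed linear mix, more reconstruction at lower confidence.
    ratio = (conf - alpha) / (beta - alpha)
    return [ratio * a + (1.0 - ratio) * b for a, b in zip(fine_tuned, reconstructed)]

fine = [1.0, 0.0, 0.0]
recon = [0.0, 1.0, 0.0]
low = blend_flow(0.10, fine, recon, alpha=0.25, beta=0.75)
mid = blend_flow(0.50, fine, recon, alpha=0.25, beta=0.75)
high = blend_flow(0.90, fine, recon, alpha=0.25, beta=0.75)
```

A linear ratio keeps the blended flow continuous at both thresholds, so a small change in confidence near α or β does not cause a jump in the refined flow.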
The selection of the confidence thresholds α and β (where β > α) is critical for balancing the contributions of the fine-tuned and reconstructed flows in the final output. To assess their sensitivity to different data distributions and environmental conditions, we conducted ablation studies across diverse datasets. As shown in Section 4.3 (Ablation Study), the optimal values of α and β identified on our primary validation split are robust: they consistently delivered high performance not only across different splits of the same dataset but also on datasets with significantly different environmental characteristics and point densities. Although the optimal values are derived empirically, this robustness suggests that they capture a general balance between flow confidence and the need for neighbor-based reconstruction in dynamic point cloud scenes.
After the random walk refinement, the residual flow refinement used in SCOOP [5] is applied. Residual flow refinement allows the estimated target point cloud, formed by applying the estimated scene flow, to better preserve the geometric structure of the source point cloud $PC_t$. It also brings the estimated point cloud closer to the surface of the target point cloud $PC_{t+1}$. The implementation adds a residual flow component on top of the random-walk-refined flow for fine-tuning. This residual is optimized under the supervision of the chamfer distance loss term of Equation (5) and the flow smoothness loss term, which ensures that the estimated result aligns more closely with the target point cloud. The final estimated point cloud scene flow is shown in Equation (16).