1. Introduction
In recent years, high-speed railway and urban metro networks have become indispensable modes of public transportation. As both ridership and route coverage have expanded markedly [1], operational safety and reliability requirements have intensified correspondingly. Consequently, systematic condition monitoring and fault diagnosis of key train components are now essential for ensuring uninterrupted service. Machine-vision-based diagnostic methods offer significant gains in inspection efficiency, accuracy, and scalability to address these challenges. However, a critical assumption underpinning many existing train inspection systems [2,3] is that the train's velocity is perfectly matched with the line-scan camera's acquisition rate. While real-time adjustment of the acquisition rate can mitigate this mismatch, residual distortions or misalignments invariably arise from systematic errors in the velocity measurement apparatus [4,5]. Applying diagnostic algorithms to such distorted images can easily generate false alarms. This underscores the necessity of accurate image registration as a prerequisite for reliable fault detection.
Train-inspection systems typically employ either area-array cameras or line-scan cameras. These systems integrate a variety of component technologies, such as stereo matching [6,7], point-cloud processing [8,9,10], 3D shape measurement [11,12,13], and high-precision image-registration-based anomaly detection [14,15]. Although area-array cameras can capture an entire image in a single exposure, their limited field of view and the extreme aspect ratio of train carriages necessitate post-capture stitching of multiple sub-images to obtain a complete image. Variations in illumination and lens-edge distortion frequently impede precise alignment between adjacent images [16], posing substantial stitching challenges. By contrast, line-scan cameras acquire imagery one column at a time and exhibit minimal edge distortion, rendering them particularly well suited for high-resolution imaging of elongated subjects such as trains [17,18,19].
Figure 1a illustrates a representative line-scan carriage-imaging system capable of acquiring 360° views of passing trains for comprehensive exterior inspection. Registering two line-scan images of a train poses several significant challenges. First, the imagery includes both highly repetitive textures (e.g., grilles) and nearly texture-less regions (e.g., roof panels and side skirts); dust accumulation further alters local appearance. These factors impair robust keypoint detection and increase the likelihood of mismatches. Second, spatially varying illumination and occasional specular reflections from metallic surfaces introduce additional appearance variations that complicate descriptor matching. Finally, line-scan images are extremely large: their heights are typically fixed at 1024 pixels, while their widths in our dataset range from 8192 to more than 32,760 pixels. Such large image sizes impose heavy computational and memory demands on registration algorithms.
Image registration is the process of aligning a pair of images via an appropriate transformation. Zitová and Flusser [20] categorized classical image registration techniques into two broad classes: intensity-based and feature-based methods.
As a specialized subset, existing registration approaches for line-scan images can likewise be divided into two types: intensity-based methods [21] and feature-based methods [1,22,23,24,25]. Intensity-based methods typically rely on template matching for alignment. For example, Song et al. [21] integrate classical template matching with the Enhanced Correlation Coefficient (ECC) algorithm. However, because these methods depend solely on intensity information, they often become stuck in local optima when applied to repetitive train-surface textures.
Feature-based methods are more widely used in practice. The time-scale normalization (TSN) algorithm by Lu et al. [22] aligns line-scan images by extracting SIFT [26] keypoints and estimating multiple affine transforms. Yet TSN accuracy degrades in low-texture regions where features are sparse, and the use of separate affine blocks can introduce cumulative misalignment at block boundaries. Chang et al. [23] address this feature scarcity by proposing Omnidirectional Scale Correlation Normalization (OSCN), which refines both keypoint detection and matching. Despite these enhancements, OSCN still fails in highly repetitive regions, leading to local registration breakdowns. Liu et al. [24] partition the full-train image into sub-blocks based on vehicle-body markers and optimize the stretch ratio of each block using mean-squared error (MSE). Although effective when such markers are present, this method incurs substantial computational overhead when processing large line-scan images, which undermines its real-time performance.
Deep-learning approaches have recently emerged to overcome these limitations. Fu et al. [1] apply a SuperPoint [27] sliding-window feature extractor, enforce geometric consistency via RANSAC [28] and a cubic-polynomial filter, and finally fuse correspondences using weighted radial basis function (WRBF) interpolation to register entire line-scan images. Although this pipeline accelerates overall alignment, its geometric parameters must be retuned for different locomotive types. Chang et al. [25] further improve robustness by encoding multi-scale features with a VGG-style backbone [29] and matching them with SuperGlue [30], then fitting a quintic polynomial to correct horizontal distortions. While this method achieves high accuracy, the quintic-fitting step incurs substantial computational cost, resulting in long per-image processing times. Feature-based registration methods remain the mainstream. However, in regions with repetitive patterns, sparse feature matching often yields a high rate of incorrect correspondences, significantly degrading registration accuracy.
All of the aforementioned methods adopt a traditional multi-stage, pipeline-style architecture. Even when deep-learning modules are integrated into each stage, it remains difficult to achieve real-time performance. Critically, these existing algorithms derive registration transformations through multiple independent processing steps (e.g., feature extraction, matching, and transformation estimation). This multi-stage structure inevitably leads to the accumulation of errors, further limiting overall registration effectiveness.
Through our analysis of existing line-scan image registration methods, we recognize that, owing to the intrinsic imaging mechanism of line-scan cameras, horizontal alignment must be performed on entire pixel columns (hereafter referred to as line-pixels) rather than on individual pixels. Consequently, the primary challenge lies in estimating the shift of each line-pixel. Thus, we reformulate the problem as estimating line-disparity (defined as the horizontal offset between corresponding line-pixels) on epipolar-rectified stereo pairs, analogous to stereo disparity but specific to line-scan data. To train and evaluate our approach, we also introduce a physics-informed framework for synthesizing labeled line-scan image datasets directly from vehicle velocity profiles. Our key contributions are:
Mathematical modeling of line-scan imaging. We equate the line-scan camera's acquisition rate to a velocity and derive a closed-form model that links train-velocity fluctuations to the resulting compressive and tensile deformations in line-scan images.
Simulation dataset generation. We introduce a radial basis function (RBF) method for synthesizing a labeled line-scan image dataset suitable for network training (a brief sketch of both of these steps follows this list).
Reformulation of the line-scan image registration framework and performance breakthrough. By casting horizontal registration as a line-disparity estimation task and designing an efficient cost volume, our method substantially outperforms existing algorithms in both accuracy and speed, particularly in regions with weak or repetitive textures, while maintaining low memory consumption. To ensure that the estimated disparities and their induced transformations adhere to realistic train-motion characteristics, we further introduce a dual-smoothness loss function.
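To make the first two contributions concrete, the following minimal sketch shows one plausible realization: integrated velocity error is converted into a per-column displacement (the closed-form model of Section 3 links the two), and an RBF interpolant turns sparse control-point shifts into a dense warp with a known ground-truth line-disparity. All function names, control-point counts, and shift magnitudes here are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def displacement_from_velocity(v_profile, v0):
    # Integrated velocity error in units of nominal column spacing:
    # columns captured while the train moves faster than v0 sample the
    # surface farther apart, compressing the image (and vice versa).
    return np.cumsum(v_profile / v0 - 1.0)

def synthesize_pair(image, n_ctrl=8, max_shift=40.0, seed=0):
    # Sparse control-point shifts stand in for accumulated velocity error;
    # a thin-plate-spline RBF smooths them into a dense per-column shift.
    rng = np.random.default_rng(seed)
    H, W = image.shape
    ctrl_x = np.linspace(0, W - 1, n_ctrl)[:, None]
    ctrl_d = rng.uniform(-max_shift, max_shift, n_ctrl)
    disp = RBFInterpolator(ctrl_x, ctrl_d, kernel="thin_plate_spline")(
        np.arange(W, dtype=float)[:, None])
    cols = np.arange(W, dtype=float)
    # Resample every row at the shifted column positions (edge values are
    # clamped at the image boundary, a simplification).
    warped = np.stack([np.interp(cols + disp, cols, row.astype(float))
                       for row in image])
    return warped, disp  # distorted image + ground-truth line-disparity
```

Because the same displacement field serves as the training label, each synthesized pair comes with dense supervision at no extra cost.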
The remainder of this paper is organized as follows. Section 2 presents an overview of the proposed method. Section 3 details the construction of the simulated line-scan image dataset. Section 4 describes the line-disparity estimation-based registration algorithm. Section 5 reports experimental results, and Section 6 concludes the paper.
5. Experiments and Evaluation
To thoroughly assess the performance of the proposed line-scan image registration algorithm (Line-Stereo RegNet), we perform both qualitative and quantitative evaluations across multiple vehicle-type datasets, comparing against current mainstream methods, including TSN, OSCN, and the method proposed in [1] (hereafter referred to as WRBF for clarity). Additionally, comprehensive ablation studies are conducted to validate the contributions of the feature reorganization module, followed by a discussion of the effectiveness of the dual-smoothness loss and the simulation dataset. To ensure experimental consistency and reproducibility, all evaluations are carried out in a standardized hardware and software environment: an Intel i9-12900K CPU @ 3.9 GHz, an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM), and 64 GB of RAM. The environment includes Windows 11, Python 3.8, CUDA 11.3, and PyTorch 1.10.0.
5.1. Dataset and Experiment Configuration
To evaluate the generalization capability of the proposed method, we collected 4569 real-world line-scan image pairs from nine installation positions across diverse high-speed rail (CR400BF, CRH380A, CRH2A, CR200) and subway trains (GZ18, SH6, WX4, NN4). Images were acquired under varying weather and illumination conditions at train velocities of 20–25 km/h (Figure 1). All images have a fixed height of 1024 pixels, with widths varying by train model and camera position, yielding aspect ratios of ~7:1 to 32:1. Using the method described in Section 3, we further generated a simulated dataset comprising 9138 image pairs based on the source images. This dataset is divided into training and validation subsets at an approximate ratio of 5:1, resulting in 7615 pairs for training and 1523 pairs for validation. The original 4569 real-collected image pairs are used as a test set. To accelerate convergence and improve final accuracy, we adopt a two-stage training strategy. In the pretraining stage, input images are downscaled to 256 × 256 pixels. Training employs a search radius of R = 64, an initial learning rate of 1.25 × 10⁻⁴ with a multiplicative decay factor of 10⁻⁵ per iteration, and a batch size of 4. After 10,000 steps of training, a coarse disparity-aware pretrained model is obtained. In the fine-tuning stage, images are resized to a high-resolution format of 256 rows × 16,384 pixels. The learning rate is reduced to 1.25 × 10⁻⁵, and the batch size is decreased to 2, while all other hyperparameters remain unchanged. The pretrained model initializes the network weights, and fine-tuning proceeds for an additional 160,000 iterations on the same dataset. Both stages utilize the Adam optimizer.
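The two-stage schedule above can be summarized in code as follows. This is a hedged sketch: `LineStereoRegNet`, the dataloaders, and the per-step multiplicative decay rule are assumptions, and a simple L1 disparity loss stands in for the full training objective.

```python
import torch

def run_stage(model, loader, lr, num_steps, decay=1e-5, device="cuda"):
    # One training stage: Adam with a per-iteration multiplicative decay
    # (a placeholder for the schedule described above). The loader is
    # assumed to yield (reference, moving, ground-truth disparity) triples
    # indefinitely.
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=1.0 - decay)
    for _, (ref, mov, gt_disp) in zip(range(num_steps), loader):
        pred = model(ref.to(device), mov.to(device))     # predicted line-disparity
        loss = (pred - gt_disp.to(device)).abs().mean()  # stand-in for full loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()

# Stage 1 (pretraining): 256 x 256 inputs, lr 1.25e-4, batch 4, 10k steps
# run_stage(net, pretrain_loader, lr=1.25e-4, num_steps=10_000)
# Stage 2 (fine-tuning): 256 x 16384 inputs, lr 1.25e-5, batch 2, 160k steps
# run_stage(net, finetune_loader, lr=1.25e-5, num_steps=160_000)
```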
5.2. Ablation Study
This section presents an ablation study to validate the efficacy of the primary components within our proposed matching-cost-volume construction methodology. The study systematically evaluates the contributions of self-attention, cross-attention, and positional encoding, following the experimental protocols detailed in Section 5.
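As context for the ablations that follow, below is a minimal sketch of a correlation-style cost volume over a ±R column search window. Boundary handling via circular shift is a simplification, and where the attention-refined features enter the construction may differ from the paper's exact design.

```python
import torch

def line_cost_volume(f_ref, f_mov, radius=64):
    # f_ref, f_mov: (B, C, W) per-column ("line-pixel") descriptors.
    # cost[:, i, x] = similarity between column x of the reference and
    # column x + d of the moving image, for d in [-radius, radius].
    B, C, W = f_ref.shape
    cost = f_ref.new_empty(B, 2 * radius + 1, W)
    for i, d in enumerate(range(-radius, radius + 1)):
        shifted = torch.roll(f_mov, shifts=-d, dims=2)  # column x -> x + d
        cost[:, i] = (f_ref * shifted).sum(dim=1) / C ** 0.5
    return cost  # (B, 2R+1, W), fed to the disparity estimator
```

Because matching is performed per column rather than per pixel, the volume grows with the width W and the search radius R only, which is one reason the memory footprint stays small on ultra-wide inputs.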
The quantitative results of this study, as shown in Table 1, first underscore the fundamental importance of the attention framework. When both self-attention and cross-attention mechanisms were omitted, the model's performance decreased substantially, with the mean SSIM score dropping from 0.8812 to 0.6037. This degradation was accompanied by a marked increase in result variance, as indicated by the standard deviation. Concurrently, the average memory usage was reduced from 3.319 GB to 2.545 GB. These results confirm that the integrated attention mechanisms are critical for achieving high-quality registration, despite their computational cost. Moreover, they demonstrate that our algorithm's overall memory footprint remains remarkably low even when processing ultra-large, high-aspect-ratio images, highlighting its strong practical applicability.
Further analysis of the individual components reveals that cross-attention is the most critical element for the matching task. Its removal resulted in the most significant performance decline among all single-component ablations, reducing the SSIM score to 0.8036. This finding indicates that the direct establishment of inter-image feature correspondences via cross-attention is the principal contributor to the model’s accuracy.
The roles of self-attention and positional encoding were also found to be significant. The exclusion of the self-attention module led to a notable reduction in performance, with the SSIM score decreasing to 0.8377 and memory usage slightly declining to 3.118 GB. This suggests that the self-attention mechanism effectively enhances feature representations by capturing long-range intra-image dependencies, which is particularly valuable for large-aspect-ratio line-scan imagery. Similarly, removing positional encoding caused a comparable performance drop, with the SSIM score falling to 0.8357. The “w/o pos” variant unexpectedly consumed slightly more memory than the full model (3.351 GB vs. 3.319 GB). This suggests that positional encoding provides a useful prior that guides attention toward relevant regions. Without this spatial guidance, attention becomes more diffuse during backpropagation, which may reduce the effectiveness of underlying framework optimizations and result in a small increase in memory usage. This demonstrates that positional information is crucial for providing the spatial priors necessary to disambiguate features, especially in regions with repetitive or weak textures.
In summary, the ablation study confirms that the superior performance of our method stems from the synergistic integration of its core components. The architecture effectively fuses local features with global context: cross-attention serves as the primary matching engine, self-attention functions as a feature-enhancement module, and positional encoding supplies essential spatial constraints. This combination enables robust and precise registration of challenging line-scan images, characterized by textural ambiguities and large spatial extents, while operating within a defined computational budget. It also highlights our algorithm's significant low-memory advantage on ultra-large image inputs, underscoring its excellent value for real-world deployment.
5.3. Comparative Experiments
To comprehensively assess the registration performance of various algorithms under different levels of texture complexity (texture-less, repetitive textures, and rich texture details), we systematically selected three representative line-scan camera positions from nine installation points across eight train models: T1 (train top), L3 (train left side), and B3 (train bottom). From these camera views, 50 pairs of real-world images were randomly sampled and used for the final registration performance comparison, complemented by visual assessment of the registration results for a comprehensive judgment.
As the source codes for the baseline methods (TSN, OSCN, and WRBF) are not publicly available, and our dataset differs from those used in previous works, we reimplemented these methods following the descriptions in their original papers. The results reported below are the outputs of these reproductions evaluated on our test dataset.
5.3.1. Qualitative Assessment of Image Registration
Figure 6 provides a qualitative comparison of Line-Stereo RegNet against three baselines (TSN, OSCN, WRBF) on real-world line-scan imagery. Registration accuracy is visualized by overlaying the registered red channel onto the reference green channel: yellow indicates proper alignment, while distinct red or green regions signify misalignment. Line-Stereo RegNet demonstrates robust and superior performance across most imaging conditions. In scenarios with severe nonlinear distortions (Column 1), conventional feature-point methods like TSN and OSCN exhibit significant limitations due to their reliance on piecewise affine transformations, leading to cumulative error and suboptimal solutions. Both WRBF and our method are better at modeling complex global nonlinear deformations, with Line-Stereo RegNet showing superior accuracy, particularly in wheelset regions.
Our Feature Reorganization Mechanism directly addresses distinct texture challenges. (1) For repetitive patterns (e.g., intake grilles, Column 2), y-axis self-attention aggregates global vertical context, capturing the unique vertical arrangement of a column. This disambiguates identical horizontal features (slats), preventing cycle-skipping and misalignment. (2) For large low-texture regions (e.g., smooth roof panels, Column 3), cross-attention functions as a context propagator, diffusing distinct features from boundaries into the textureless interior. This enables robust correspondence establishment even without prominent local keypoints. This general advantage is supported by integrated positional embeddings, which capture long-range dependencies for robust matching under significant displacement.
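A compact sketch of how these two attention passes could be arranged is shown below: self-attention over the vertical axis of each column aggregates global vertical context, and cross-attention then exchanges information between the two views' column descriptors. The dimensions, head counts, mean-pooling to column descriptors, and omission of positional encodings are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class FeatureReorg(nn.Module):
    # Illustrative feature reorganization: y-axis self-attention followed by
    # inter-view cross-attention on column descriptors.
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _y_self(self, f):
        # f: (B, C, H, W) -> tokens along H for every column, so identical
        # horizontal features are disambiguated by their vertical context.
        B, C, H, W = f.shape
        t = f.permute(0, 3, 2, 1).reshape(B * W, H, C)
        t = t + self.self_attn(t, t, t, need_weights=False)[0]
        return t.reshape(B, W, H, C).permute(0, 3, 2, 1)

    def _cross(self, fa, fb):
        # Pool each column to one descriptor, then let view A's columns
        # attend to view B's, diffusing context into texture-less spans.
        a = fa.mean(dim=2).permute(0, 2, 1)  # (B, W, C)
        b = fb.mean(dim=2).permute(0, 2, 1)
        return a + self.cross_attn(a, b, b, need_weights=False)[0]

    def forward(self, f_ref, f_mov):
        f_ref, f_mov = self._y_self(f_ref), self._y_self(f_mov)
        return self._cross(f_ref, f_mov), self._cross(f_mov, f_ref)
```

The resulting (B, W, C) descriptors can be transposed to feed a cost volume such as the one sketched in Section 5.2.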
Despite its general superiority, the method exhibits limitations in extreme scenarios (Row 3, Figure 6). (1) Large disparity variations (Figure 6g,h): registration fails when displacement exceeds 512 pixels. This boundary is a hard limit set by the network's architecture (8× downsampling factor × 64 search radius). (2) Severe low texture (Figure 6i): matching ambiguity persists due to the absolute lack of discriminative features. It is notable, however, that even in these extreme cases, our approach preserves more structural details than existing methods, which are prone to catastrophic artifacts.
5.3.2. Quantitative Assessment of Image Registration
Line-scan image registration for high-speed rail and subway systems faces distinct challenges that call for separate evaluation. High-speed trains usually travel at over 250 km/h, where strong vibrations, nonlinear geometric distortions, motion blur, and rapidly shifting outdoor illumination severely degrade registration performance, especially in texture-poor regions. By contrast, subway trains run through enclosed tunnels under steady artificial lighting, with minimal vibration and more varied, distinctive surface textures, which greatly reduces blur and lighting artifacts.
We evaluate our registration algorithms using two independent test sets. As illustrated in Figure 7, we used images captured at three fixed camera positions on four high-speed train models. In each subfigure, the horizontal axis represents 50 complete image pairs; the overlaid line and bar charts, distinguished by contrasting colors, depict the registration SSIM and per-pair processing time of each algorithm, respectively, with values referenced against the left and right vertical axes. The results show that TSN and OSCN demonstrated comparable registration accuracy, though both exhibited fundamental limitations inherent to sparse feature-based matching paradigms. These methods consistently suffered from matching ambiguities in regions characterized by repetitive textural patterns (a prevalent condition in high-speed train exteriors; see the train left-side image sequence), owing to their reliance on local feature-point correspondences without adequate global context integration. WRBF achieved superior registration accuracy through its sliding-window matching strategy, which effectively addressed some nonlinear deformation challenges. However, this approach incurred substantial computational overhead, particularly for texture-rich (train bottom image sequence) or ultra-wide-format images, resulting in processing times greater than those of the proposed method. The proposed Line-Stereo RegNet achieved both higher precision and significantly reduced processing latency. Statistical analysis revealed that Line-Stereo RegNet reduces the average registration error by 5.8% and requires only one-fourth of the average processing time compared with WRBF. These improvements demonstrate a theoretically and practically superior balance between registration fidelity and operational efficiency, which is particularly valuable for real-time high-speed rail inspection systems.
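For reference, per-pair scores of the kind plotted in Figures 7 and 8 can be computed with standard tooling along the following lines; `score_pair` and its arguments are our names, not part of any compared method's API.

```python
import time
import numpy as np
from skimage.metrics import structural_similarity as ssim

def score_pair(register_fn, moving, reference):
    # register_fn: any of the compared methods, returning the warped image.
    t0 = time.perf_counter()
    warped = register_fn(moving, reference)
    elapsed = time.perf_counter() - t0          # per-pair processing time (s)
    quality = ssim(np.clip(warped, 0, 255).astype(np.uint8),
                   reference.astype(np.uint8),
                   data_range=255)              # registration SSIM
    return quality, elapsed
```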
In subway train image experiments (see Figure 8), characterized by high grayscale consistency, minimal vibration interference, and reduced illumination variation, the registration accuracy of traditional feature-based methods (TSN and OSCN) improved substantially. However, their computational performance remained constrained by the inherent overhead of multi-stage processing pipelines, including separate feature extraction, matching, and transformation estimation phases. WRBF further enhanced registration accuracy under these stable conditions through its refined spatial transformation model, though its computational efficiency remained suboptimal due to iterative refinement requirements. Line-Stereo RegNet outperforms the other methods across almost all evaluation metrics, achieving higher registration accuracy. This advantage arises from its end-to-end architecture, which effectively eliminates error accumulation between traditionally separate processing stages. In low-texture, grayscale-uniform underground environments, Line-Stereo RegNet demonstrates exceptional robustness, maintaining precise alignment even on large homogeneous surfaces that typically challenge conventional approaches. In challenging train-bottom scenes, where many state-of-the-art pipelines experience a significant increase in processing time, our method adds only minimal computational overhead. This efficiency results from training on a highly diverse dataset, which enables rapid convergence and equips the model with the resilience to handle complex real-world situations with almost no extra runtime. The comprehensive experimental analysis across distinct operational environments and vehicle types confirms that the proposed algorithm achieves superior registration accuracy and computational performance across diverse imaging conditions. Its architectural advantages become particularly pronounced in the challenging high-speed rail scenarios, where it effectively addresses the combined challenges of repetitive textures, vibration-induced geometric distortions, and dynamic lighting variations through its integrated attention mechanisms and positional encoding framework.
5.4. Discussion
While Section 3 introduces a synthetic dataset generation workflow incorporating both horizontal shifts and vertical distortions, our network architecture remains specifically optimized for horizontal registration. To address this directional specialization, Section 4 proposes a dual-smoothness loss that regularizes line-disparity predictions and transformation mappings. This section therefore examines the impact of both design choices.
5.4.1. Impact of Vertical Shift in Dataset on Registration Performance
While the primary focus of this research centers on horizontal registration challenges, the inclusion of vertical shifts in the training data proves equally critical for developing robust real-world applications. To systematically evaluate this factor, we generated two synthetically distorted datasets following the methodology detailed in Section 3: one incorporating only horizontal displacements and another encompassing both horizontal and vertical distortions. Both datasets underwent identical training protocols and were evaluated under strictly controlled testing conditions to ensure a fair comparison. The quantitative results presented in Table 2 demonstrate that models trained on datasets incorporating vertical displacements achieve superior performance, exhibiting higher average structural similarity indices and significantly reduced standard deviations compared to their horizontally constrained counterparts. This improvement indicates that explicit consideration of vertical geometric variations during training substantially enhances the model's ability to handle real-world perturbations caused by factors such as camera height fluctuations and vehicle-induced vibrations. The enhanced robustness observed in these experiments underscores the importance of comprehensive data augmentation strategies that account for multi-directional geometric variations, ultimately contributing to improved generalization and operational reliability in practical deployment scenarios.
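A vertical variant of the synthetic data can be obtained by composing the horizontal warp with a smooth per-column vertical displacement; the helper below is purely illustrative (the sinusoidal profile, amplitude, and period are our assumptions, chosen to mimic camera-height fluctuation and vibration).

```python
import numpy as np

def add_vertical_shift(image, amp=3.0, period=2000.0):
    # Smooth sinusoidal vertical displacement per column; a real profile
    # could instead be drawn from the RBF machinery used horizontally.
    H, W = image.shape
    dy = amp * np.sin(2.0 * np.pi * np.arange(W) / period)
    rows = np.arange(H, dtype=float)
    out = np.empty_like(image, dtype=float)
    for x in range(W):
        out[:, x] = np.interp(rows + dy[x], rows, image[:, x].astype(float))
    return out
```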
5.4.2. Impact of Dual Smoothness Loss on Registration Performance
To isolate and quantify the impact of the dual-smoothness loss on registration accuracy, we fine-tuned two otherwise-identical models, one with and one without the dual-smoothness term, for 160,000 iterations from the same pretrained weights. Both models achieved nearly equivalent training losses (3.20 without the dual-smoothness term vs. 3.26 with it), ensuring that any observed differences in test-time performance arise from the regularizer itself rather than from disparities in convergence. As shown in Table 3, evaluation on the dataset described in Section 5.2 reveals that incorporating dual smoothness yields a significant increase in mean SSIM and a marked reduction in inter-sample performance variance. This improvement is attributable to the loss's simultaneous enforcement of spatial coherence in both the disparity and optical-flow domains: it suppresses spurious local fluctuations while preserving salient structural boundaries. The dual regularization is especially beneficial in regions that traditionally challenge registration, namely repetitive-texture areas prone to ambiguous matches and low-texture zones lacking distinctive features. By balancing the competing objectives of geometric fidelity and boundary preservation, the dual-smoothness loss not only elevates overall registration accuracy but also enhances robustness across a diverse array of imaging scenarios.
Further qualitative analysis (Figure 9) highlights the crucial role of the dual-smoothness loss and the complementary interaction between its first- and second-order terms. The second-order term maintains motion continuity and prevents disparity jumps in repetitive-texture regions, while the first-order term suppresses noise and avoids global drift in low-texture areas [37]. Together, they preserve structural consistency and numerical stability, enabling our framework to achieve state-of-the-art performance.
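The interplay just described can be captured by penalizing both first- and second-order differences of the per-column disparity. The sketch below uses an L1 penalty with equal weights as assumptions; the paper's loss additionally regularizes the induced transformation mapping.

```python
import torch

def dual_smoothness_loss(disp, w1=1.0, w2=1.0):
    # disp: (B, W) per-column line-disparity
    d1 = disp[:, 1:] - disp[:, :-1]  # first-order: damps noise and global drift
    d2 = d1[:, 1:] - d1[:, :-1]      # second-order: keeps motion continuity
    return w1 * d1.abs().mean() + w2 * d2.abs().mean()
```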
6. Conclusions
This study presents a robust framework for line-scan image registration that effectively addresses geometric distortions induced by velocity fluctuations between moving trains and line-scan cameras. By reformulating the horizontal registration problem as a line-disparity estimation task on epipolar-rectified stereo pairs, we develop an end-to-end deep learning network that seamlessly integrates physics-based simulation, attention-driven feature reorganization, and iterative disparity refinement. Our approach overcomes critical challenges in repetitive and texture-deficient regions through global contextual modeling while efficiently processing ultra-large line-scan images (up to 32,760 pixels wide) with remarkably low memory consumption (approximately 3.3 GB). Comprehensive validation across diverse real-world datasets from eight train models demonstrates that our method consistently outperforms state-of-the-art techniques, reducing the average registration error by 5.8% while requiring only one-fourth of the processing time on average. The integration of velocity-profile-based synthetic data generation and dual-smoothness regularization significantly enhances robustness against real-world perturbations, including camera vibrations and illumination variations. This work establishes a scalable solution for high-precision train inspection systems, with demonstrated applicability to other line-scan imaging domains requiring geometric fidelity under dynamic motion conditions. The framework's balance of accuracy, efficiency, and practical deployability represents a significant advancement toward reliable automated fault diagnosis in transportation infrastructure inspection.
Looking ahead, our future work will focus on three main directions. First, we will further advance 2D high-precision registration by exploring cascaded architectures and extending the framework to multi-camera fusion. Second, we plan to enhance robustness to extreme motion-induced distortions through hierarchical coarse-to-fine models capable of handling larger disparity ranges. Finally, we aim to optimize the network for real-time deployment on embedded edge devices and broaden the applicability of our framework to a wider range of high-speed line-scan inspection scenarios.