Article

Self-Supervised Depth and Ego-Motion Learning from Multi-Frame Thermal Images with Motion Enhancement

1 School of Automation, Nanjing University of Science and Technology, Nanjing 210094, China
2 School of Energy and Power Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 11890; https://doi.org/10.3390/app152211890
Submission received: 8 October 2025 / Revised: 31 October 2025 / Accepted: 6 November 2025 / Published: 8 November 2025
(This article belongs to the Special Issue Application of Artificial Intelligence in Image Processing)

Abstract

Thermal cameras are known for their ability to overcome lighting constraints and provide reliable thermal radiation images. This capability facilitates methods for depth and ego-motion estimation, enabling efficient learning of poses and scene structures under all-day conditions. However, existing studies on depth prediction for thermal images are limited. In practical applications, thermal cameras capture sequential frames. Unfortunately, the potential of this multi-frame aspect is underutilized by previous methods, limiting the depth prediction accuracy achievable on thermal videos. To leverage the multi-frame advantages of thermal videos and to improve the accuracy of monocular depth estimation from thermal images, we propose a framework for self-supervised depth and ego-motion learning from multi-frame thermal images. We construct a multi-view stereo (MVS) cost volume from temporally adjacent thermal frames. The construction process is adjusted based on the estimated pose, which serves as a motion hint. To stabilize the motion hint and improve pose estimation accuracy, we design a motion enhancement module that utilizes self-generated poses for additional supervisory signals. Additionally, we introduce RGB images in the training phase to form a multi-spectral loss, thereby augmenting the performance of the thermal model. Experiments on a public dataset demonstrate that the proposed method accurately estimates depth and ego-motion across varying light conditions, surpassing the performance of the self-supervised baseline.

1. Introduction

Understanding depth within images and the ego-motion of cameras is crucial for 3D vision perception. Its significance extends to practical applications like autonomous driving [1], object detection [2], and mixed reality [3]. Although sensors like LiDAR can directly measure object distances, their high expense and sparse depth motivate the investigation of visual self-supervised methods for their versatility and applicability. Self-supervised learning of depth and pose, extensively researched [4,5,6], demands neither expensive hardware nor intricate labels. It adeptly generates dense depth solely from visual data. Nevertheless, most studies concentrate on estimating depth in RGB images. The sensitivity of RGB image quality to illumination conditions results in information loss and heightened noise [7], especially in dim environments like nighttime or tunnels. These challenges, inherent in the real world, compromise the performance and generalization capabilities of RGB-based methods in low-light conditions.
Self-supervised learning from thermal images emerges as a potential remedy to close the performance gap. In contrast to RGB cameras, thermal cameras function independently of extra light sources and can seamlessly operate in complete darkness. Thermal cameras detect the Long-Wave Infrared (LWIR) radiation emitted from objects directly and then convert the raw radiation data into a temperature distribution represented in a visible image for human perception [8]. LWIR radiation’s ability to penetrate certain transparent media enables thermal cameras to operate effectively in adverse weather, like rain and fog. In real-world scenarios, a thermal camera provides more than one frame at test time, for example when mounted on a vehicle for inspection tasks. However, existing methods for self-supervised depth and ego-motion learning [9,10,11] fall short of fully utilizing the potential of the multiple frames provided by thermal cameras.
To utilize the benefits of the multi-frame property and optimize the network predictions from thermal videos, we propose a self-supervised depth and ego-motion estimation framework based on multi-view stereo (MVS) for thermal videos. A cost volume is constructed by considering temporally adjacent frames from thermal videos with various hypothesized depth values. Then, the cost volume is decoded to generate an MVS depth. However, in large-scale outdoor scenes with slow camera movement, MVS methods face challenges. Owing to minimal disparities and insufficient triangulation [12] in such situations, the network acquires limited new information, potentially resulting in geometry ambiguity in the predicted depth results.
In response to this problem, we adopt a strategy that adaptively adjusts the range of depth candidate sampling guided by motion hints. Depending on the poses generated by the network, the MVS depth will be shrunk to the single-frame depth when the camera is static. To bolster motion hint stability and refine pose estimation accuracy, we design a motion enhancement module based on epipolar geometry constraints. This module utilizes self-generated supervisory signals to aid in network training. Moreover, to address the lower resolution and unclear structures and edges in thermal images, we integrate a depth warping operation into the proposed framework. It combines the thermal camera with a visible camera. This integration facilitates the combination of thermal loss and photometric loss into a multi-spectral loss, providing rich self-supervisory signals. The experimental results on the public dataset demonstrate that our method is capable of predicting depth and ego-motion under low-light conditions, exhibiting performance superior to state-of-the-art baselines, as shown in Figure 1.
In summary, we make the following contributions:
  • We propose a self-supervised multi-frame depth and ego-motion learning framework for monocular thermal videos. It successfully leverages the temporal advantages of multiple thermal frames by constructing the multi-view stereo. The accuracy of prediction results is further improved through the integration of motion hints and multi-spectral properties.
  • We design a motion enhancement module that utilizes self-generated motion constraints to complement self-supervised signals. It enhances the stability of motion hints within the framework, consequently improving the accuracy of pose estimation.
  • We employ an efficient objective function that combines photometric loss, thermal loss, and motion loss to address the challenges in real-world scenarios, including unavoidable low-light conditions.

2. Related Work

2.1. Self-Supervised Depth and Ego-Motion Learning

The goal of depth and ego-motion learning [13,14,15] is to simultaneously estimate the depth of each pixel in an image and the camera poses between image sequences. Self-supervised methods avoid the use of challenging-to-obtain ground-truth by using the view synthesis mechanism. SfM-Learner, proposed by Zhou et al. [4], stands as a pioneering work in this field. It utilizes the generated single-frame depth and poses to project source frame images onto the target frame, synthesizing an image. Subsequently, it calculates the photometric loss between the original and synthesized images for training. Some self-supervised methods [5,16,17,18] build upon this approach to improve depth prediction performance. Extending the work of SfM-Learner, Bian et al. [17] introduce a scale-consistency loss to address scale-ambiguity in each frame. Bian et al. [18] integrate estimation networks with the conventional visual odometry ORB-SLAM2 [19], achieving heightened accuracy. EDS-Depth [20] enhances self-supervised depth estimation in dynamic scenes using frame interpolation and pseudo-labels derived from optical flow. To improve depth estimation with dynamic objects or a static camera, Feng et al. [21] separate static from temporal features via a dual-path encoder and synthesize full-resolution depth using a learnable offset field.
Recent works [22,23,24,25] introduce multi-view stereo (MVS) into depth learning. These methods can generate dense geometry structures of scenes using multiple frames. Assuming static scenes, Watson et al. [24] introduce MVS into self-supervised depth learning. The method requires neither ground-truth nor any pretrained networks. IAFMVS [26] enhances iterative MVS by leveraging deformable convolutions for adaptive feature extraction and attention for feature matching, specifically targeting texture-less regions. Cheng et al. [22] improve the robustness of depth estimation systems under inaccurate camera poses by adaptively fusing high-confidence single-view and multi-view depth information. Building on an MVS foundation, Wang et al. [27] use a cross-attention mechanism to fuse geometric constraints from multi-frame optical flow, enabling recurrent refinement of scene detail. Compared to single-frame depth estimation methods, MVS-based methods can effectively utilize information from various perspectives, exhibiting more robust geometry structures and reduced sensitivity to occlusion. MVS methods therefore hold significant research potential in self-supervised depth prediction, offering broad applicability across diverse scenarios.

2.2. Depth and Ego-Motion Learning with Thermal Images

In contrast to the abundant methods for depth and pose estimation from RGB images, the research on estimation from thermal images is limited. RGB cameras are susceptible to illumination conditions, whereas thermal cameras offer image capture under broader conditions, including nighttime and adverse weather environments. This provides a distinct advantage to thermal images in 3D vision perception. Some works [28,29] employ additional motion and ranging sensors, such as IMU and radar, to assist thermal cameras in ego-motion estimation. However, these methods primarily concentrate on localization rather than environmental perception.
Recently, only a limited number of works have focused on depth learning from thermal images. Kim et al. [10] design a system composed of two RGB cameras and one thermal camera. The system utilizes the spatial relationship between the stereo cameras to predict the depth of the current frame. However, this method requires a complex geometry configuration between the cameras and does not estimate the camera poses. Lu et al. [9] propose a similar method to Kim et al. [10], incorporating two RGB cameras and two thermal cameras. They initialize the disparity map with RGB images and design a specialized network module to translate a thermal image from RGB images. In their final step, triangulation using the translated thermal images and original thermal images provides depth ground-truth. Nevertheless, their system still does not estimate ego-motion. Shin et al. [11] propose the first self-supervised method for simultaneously estimating ego-motion and depth from thermal images based on the SfM-Learner framework. They train pose and depth networks by reconstructing thermal images. However, networks from these methods utilize the single-frame image to predict depth, insufficiently exploiting the temporal and multi-frame advantages of thermal videos, thereby limiting improvements in estimation accuracy.

3. Method

3.1. Method Overview

Our framework is designed to jointly train neural networks for estimating depth and camera pose from monocular thermal videos. For this goal, we utilize a self-generated multi-spectral supervision signal and a motion supervision signal from a motion enhancement module to optimize the network.
To refine estimation accuracy, we construct a cost volume between temporally adjacent thermal frames, enabling the direct utilization of geometry constraints from MVS. This approach exploits the strengths of multi-frame thermal images. Additionally, the motion enhancement module, as designed, aims to improve pose estimation accuracy and maintain the stability of cost volume construction. Since the proposed framework employs self-supervised learning, no ground-truth is required at training time.
The main architecture of the proposed framework is illustrated in Figure 2. For a given thermal video input, (a) the PoseNet is employed initially to estimate the camera poses between thermal frames. These poses are combined with those from the motion enhancement module to compute the motion loss. (b) Subsequently, we utilize the DepthNet to predict the single-frame depth of the current thermal image. (c) We construct a cost volume from the multiple thermal frames. During this process, the predicted single-frame depth serves as a geometry prior, and the predicted poses act as a motion hint to assist the construction of the cost volume. Upon decoding the cost volume, the MVS depth and confidence map are generated. A more accurate depth is obtained by fusing the single-frame depth with the MVS depth.

3.2. Single-Frame Self-Supervised Depth Learning

As illustrated in Figure 2a,b, single-frame self-supervised depth learning relies on the view synthesis mechanism [4] and reconstruction loss. The current thermal frame $I_t^T$ can be reconstructed from the previous frame $I_{t-1}^T$. This mechanism is based on the warping operation:
$$p_{t-1} = K^T P_{t-1 \rightarrow t}^T D_{Mono,t}^T(p_t) (K^T)^{-1} p_t,$$
in which $D_{Mono,t}^T(p_t)$ is the generated depth of $I_t^T$ from DepthNet, $P_{t-1 \rightarrow t}^T$ is the estimated pose between $I_t^T$ and $I_{t-1}^T$, and $K^T$ is the intrinsic matrix of the thermal camera. The reconstructed frame $\hat{I}_t^T$ is then synthesized by a bi-linear sampling operation. Utilizing $\hat{I}_t^T$ and $I_t^T$, we can compute the multi-spectral reconstruction loss, which is described in detail in the subsequent section.
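The view synthesis step can be sketched with standard differentiable warping. The following PyTorch snippet is illustrative only; the tensor layout, the 4 × 4 homogeneous pose convention, and the helper name are assumptions rather than details taken from the released code.

import torch
import torch.nn.functional as F

def warp_previous_frame(img_prev, depth_t, T_t_to_prev, K):
    """Synthesize the current frame from the previous one (illustrative sketch).

    img_prev:     (B, C, H, W) previous thermal frame I_{t-1}
    depth_t:      (B, 1, H, W) depth of the current frame
    T_t_to_prev:  (B, 4, 4) relative pose taking frame-t points into frame t-1
    K:            (B, 3, 3) thermal camera intrinsics
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Homogeneous pixel grid of the current frame.
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, -1, -1)  # (B, 3, H*W)

    # Back-project to 3D, transform into the previous camera, and re-project.
    cam = torch.linalg.inv(K) @ pix * depth_t.view(B, 1, -1)                   # (B, 3, H*W)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)    # (B, 4, H*W)
    cam_prev = (T_t_to_prev @ cam_h)[:, :3]                                    # (B, 3, H*W)
    pix_prev = K @ cam_prev
    pix_prev = pix_prev[:, :2] / pix_prev[:, 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample I_{t-1}.
    gx = 2.0 * pix_prev[:, 0] / (W - 1) - 1.0
    gy = 2.0 * pix_prev[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_prev, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)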

3.3. Multi-Frame Depth Learning from Thermal Images with Motion Hint

Our multi-frame pipeline, detailed in Figure 3, employs a shared MVS encoder to process adjacent thermal frames $I_t^T$ and $I_{t-1}^T$. The encoder begins with a convolutional stem, followed by downsampling blocks using 5 × 5 and 3 × 3 convolutions to create a feature pyramid [30]. These features are then warped into a cost volume, with the depth sampling range adaptively guided by a motion hint derived from prior depth and pose estimates. A U-Net-like MVS decoder [31], built upon 3 × 3 × 3 3D convolutional layers, processes this volume. The decoder leverages skip connections to fuse multi-scale information and uses a final softmax layer to compute a depth probability map. Finally, a lightweight confidence network processes this probability map to generate a weighting map, which fuses the geometrically robust MVS depth with the single-frame depth to produce the refined output.
Following the previous works [23,24,25], our MVS depth estimation part constructs a cost volume by warping the encoded source frame to the target frame. The cost volume is derived from the measurement of geometric similarity between two thermal frames. It signifies the likelihood of the correct depth for each pixel among a set of different depth values. The decoded cost volume then generates the highest-activated depth value for each pixel. The MVS-based method mitigates geometry ambiguity in single-frame depth and enhances prediction accuracy.

3.3.1. Cost Volume Construction

The MVS encoder (Figure 2c) extracts features of the current thermal frame $I_t^T$ and the previous frame $I_{t-1}^T$ to produce feature maps denoted as $F_{t+i} \in \mathbb{R}^{H/4 \times W/4 \times C}$, $i \in \{-1, 0\}$. Then, the previous feature map is warped to the current feature map:
$$p_{t-1,j} = K^T P_{t-1 \rightarrow t}^T D_j(p_t) (K^T)^{-1} p_t,$$
with $p_t$ being the pixels on the current feature map, $D_j(p_t)$ standing for the $j$-th hypothesized depth value, and $p_{t-1,j}$ corresponding to the pixel on the previous frame’s feature map. Through the warping, we construct the feature volume $Z_{t-1} \in \mathbb{R}^{H/4 \times W/4 \times C \times N_D}$, in which $N_D$ represents the number of depth values.
The cost volume between the two feature maps is constructed by computing visual similarity:
$$q_i = \langle z_i, f_i \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product operation, $z_i \in \mathbb{R}^{C \times N_D}$ is the feature of the $i$-th pixel in the feature volume $Z_{t-1}$, and $f_i \in \mathbb{R}^{C \times 1}$ denotes the feature of the $i$-th pixel in the feature map $F_t$. Then, all pixel features form the cost volume $Q_t \in \mathbb{R}^{H/4 \times W/4 \times G \times N_D}$.
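Conceptually, one cost-volume slice is produced per depth hypothesis by warping the previous feature map and correlating it with the current one. The sketch below is a simplified, single-group (G = 1) illustration; the warp_fn callable and its signature are assumptions, and the actual implementation may use group-wise correlation and batched plane-sweep warping.

import torch

def build_cost_volume(feat_t, feat_prev, warp_fn, depth_values):
    """Dot-product cost volume over hypothesized depths (illustrative sketch).

    feat_t, feat_prev: (B, C, H/4, W/4) encoded current / previous features
    warp_fn:           callable warping feat_prev into frame t for one constant
                       depth plane (e.g., the grid-sample warp sketched above)
    depth_values:      iterable of N_D scalar depth hypotheses d_j
    """
    costs = []
    B, C, H, W = feat_t.shape
    for d in depth_values:
        depth_plane = torch.full((B, 1, H, W), float(d), device=feat_t.device)
        warped = warp_fn(feat_prev, depth_plane)          # one slice of the feature volume
        costs.append((warped * feat_t).sum(dim=1))        # inner product over channels
    return torch.stack(costs, dim=1)                      # (B, N_D, H/4, W/4)

For example, warp_fn could be a partial application of warp_previous_frame from Section 3.2 with the estimated pose and the intrinsics rescaled to the 1/4-resolution feature maps.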

3.3.2. Depth Sampling with Motion Hint

Within the construction of the cost volume, the candidate depth set is formed by various depth values. However, the observed outdoor scenes are not static, and specifying a fixed range of depth values with predetermined maximum and minimum values would result in redundant depth candidates. To mitigate this, we leverage single-frame depth as prior information. Subsequently, we employ motion hints to adjust the adaptive range of the depth set, guiding the depth sampling process.
As the thermal camera moves, the velocity of the camera significantly influences the thermal frame disparity [12]. Higher velocities lead to more abundant triangulation for multi-view geometry, resulting in larger disparities. Conversely, lower velocities reduce triangulation, yielding smaller disparities and hindering the MVS network from acquiring new information. Leveraging this knowledge, we utilize the estimated ego-motion as the motion hint to adaptively adjust the depth range:
$$d_{\min} = \left(1 - \gamma \left\| \mathrm{trans}\!\left(P_{t-1 \rightarrow t}^T\right) \right\|_2\right) D_{Mono,t}^T, \qquad d_{\max} = \left(1 + \gamma \left\| \mathrm{trans}\!\left(P_{t-1 \rightarrow t}^T\right) \right\|_2\right) D_{Mono,t}^T,$$
where $\|\mathrm{trans}(P_{t-1 \rightarrow t}^T)\|_2$ denotes the camera’s velocity, and $\mathrm{trans}(\cdot)$ indicates the translational part of the camera’s pose. $\gamma$ is a hyperparameter determined by the camera’s frame rate. To maintain training stability, $\gamma \|\mathrm{trans}(P_{t-1 \rightarrow t}^T)\|_2$ is clamped to the range $(0, 1)$. We set $\gamma = 0.15$. For depth sampling, an inverse sampling strategy [23] is employed to ensure that sampling results are closer to $D_{Mono,t}^T$ when the camera is static or moving at a low velocity:
$$d_j = \left( d_{\max}^{-1} + \frac{j}{N_D - 1} \left( d_{\min}^{-1} - d_{\max}^{-1} \right) \right)^{-1}.$$
Following the sampling process, the cost volume is fed into the MVS decoder to produce the depth probability map $M_{prob} \in \mathbb{R}^{H/4 \times W/4 \times N_D}$. The MVS depth $D_{MVS,t}^T$ for the current thermal image is then computed from $M_{prob}$ by the local-max function [32]:
$$D_{MVS}(p_t) = \frac{\sum_{j=J-\epsilon}^{J+\epsilon} D_j(p_t)\, M_{prob}(j, p_t)}{\sum_{i=J-\epsilon}^{J+\epsilon} M_{prob}(i, p_t)}.$$
Here, $J$ represents the index of the maximum value within the per-pixel probability vector $M_{prob}(\cdot, p_t)$. The parameter $\epsilon$ is the index radius along the depth dimension, which we set to 1. The resulting low-resolution depth map $D_{MVS}(p_t)$ is then upsampled to its original resolution.
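The adaptive range, the inverse-depth sampling, and the local-max readout can be sketched as follows in PyTorch. This is a minimal illustration under assumed tensor shapes (depth maps as (B, 1, H, W), poses as 4 × 4 matrices, probabilities as (B, N_D, H, W)); it is not the authors’ implementation.

import torch

def motion_hint_depth_range(depth_mono, pose_t_to_prev, gamma=0.15):
    """Adaptive per-pixel depth range guided by the motion hint."""
    velocity = pose_t_to_prev[:, :3, 3].norm(dim=1)                   # ||trans(P)||_2, shape (B,)
    scale = (gamma * velocity).clamp(1e-3, 1.0 - 1e-3).view(-1, 1, 1, 1)
    return (1.0 - scale) * depth_mono, (1.0 + scale) * depth_mono     # d_min, d_max

def inverse_depth_samples(d_min, d_max, num_depths):
    """Inverse-depth sampling of the N_D candidates between d_min and d_max."""
    j = torch.arange(num_depths, device=d_min.device).view(1, -1, 1, 1)
    inv = 1.0 / d_max + (j / (num_depths - 1)) * (1.0 / d_min - 1.0 / d_max)
    return 1.0 / inv                                                  # (B, N_D, H, W)

def local_max_depth(prob, depth_samples, radius=1):
    """Expected depth over a small window around the per-pixel argmax (local-max)."""
    idx = prob.argmax(dim=1, keepdim=True)                            # (B, 1, H, W)
    offsets = torch.arange(-radius, radius + 1, device=prob.device).view(1, -1, 1, 1)
    win = (idx + offsets).clamp(0, prob.shape[1] - 1)                 # window of depth indices
    p = prob.gather(1, win)
    d = depth_samples.gather(1, win)
    return (p * d).sum(dim=1, keepdim=True) / p.sum(dim=1, keepdim=True).clamp(min=1e-6)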

3.3.3. Depth Fusing

To further enhance the accuracy of MVS depth and leverage the strengths of single-frame depth, we rectify regions with lower confidence [33] in the MVS depth using the results from the single-view depth. The confidence map $M_{conf}$ is generated by computing the information entropy of the decoded depth probability map $M_{prob}$:
$$M_{conf} = \theta_{conf}\!\left( \sum_{j=0}^{N_D - 1} M_{prob,j} * \log M_{prob,j} \right),$$
here, $*$ denotes element-wise multiplication, and $\theta_{conf}$ represents the confidence network. The final fused depth is calculated using the following formula:
$$D_{Fused,t}^T = \left(1 - M_{conf}\right) D_{Mono,t}^T + M_{conf}\, D_{MVS,t}^T.$$
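A compact sketch of this fusion step is given below. The three-layer confidence network with a Sigmoid output follows the description in Section 4.3, but the channel widths, kernel sizes, and the sign convention of the entropy term are assumptions; all inputs are assumed to be upsampled to the same resolution beforehand.

import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    """Entropy-based confidence estimation and depth fusion (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.conf_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, prob, depth_mono, depth_mvs):
        # Per-pixel entropy of the depth probability volume (B, N_D, H, W).
        p = prob.clamp(min=1e-6)
        entropy = -(p * p.log()).sum(dim=1, keepdim=True)
        conf = self.conf_net(entropy)                        # M_conf in [0, 1]
        return (1.0 - conf) * depth_mono + conf * depth_mvs  # D_Fused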

3.4. Motion Enhancement Using Epipolar Geometry Constraint

Visual localization systems [19,34,35] often rely on matching sparse stable features like keypoints and lines to achieve robust localization. These systems then employ well-established geometry tools to compute relative camera poses. Namely, the relative ego-motion between two frames can be calculated using the epipolar geometry constraint, and iterative optimization diminishes the estimation error, providing effective guidance for training networks. Therefore, to guarantee stability for the motion hint-driven method and enhance pose estimation accuracy, we design a motion enhancement module, as illustrated in Figure 4. Through offline processing, these geometry elements serve as prior information and self-supervised signals integrated into the neural network training.

3.4.1. Epipolar Geometry Constraint

As the scenes observed by the thermal frames $I_{t-1}^T$ and $I_t^T$ are similar, we extract keypoints from the thermal frames and obtain a set of feature matches for keypoint pairs by associating their SIFT feature descriptors [36]. According to the epipolar geometry constraint, the relationship between keypoint pairs and the relative camera pose is as follows:
$$p_t^\top (K^T)^{-\top} E\, (K^T)^{-1} p_{t-1} = 0, \qquad p_t = H p_{t-1},$$
with $p_{t-1}$ and $p_t$ denoting the homogeneous coordinates of a pair of keypoints between two frames. $H$ is the homography matrix, and $E$ is the essential matrix. The pose $[R_{t-1 \rightarrow t} \,|\, t_{t-1 \rightarrow t}]$ between the two frames can be determined using the formula $E = [t_{t-1 \rightarrow t}]_\times R_{t-1 \rightarrow t}$, where $[\cdot]_\times$ represents the skew-symmetric matrix of a three-dimensional vector.

3.4.2. Matrix Scoring and Selection

We calculate the homography matrix and the essential matrix separately and select the one that best fits the current frame situation. With a RANSAC scheme, $E$ and $H$ are computed using the eight-point algorithm [37] and the DLT algorithm [38], respectively. At each iteration, the score for the solution is calculated using the following formula:
$$s = \sum_i \left( e_{t-1 \rightarrow t}^2\!\left(p_{t,i}, p_{t-1,i}\right) + e_{t \rightarrow t-1}^2\!\left(p_{t,i}, p_{t-1,i}\right) \right),$$
where $e^2$ represents the symmetric transfer error [38], calculated through mutual projection between the two frames, and $p_{t,i}$ and $p_{t-1,i}$ are the $i$-th keypoint pair. Finally, we preserve the $E$ and $H$ associated with the highest scores.
In situations where the scene is planar or exhibits low parallax, explaining it with the essential matrix may lead to incorrect results in ego-motion recovery, and the homography matrix should be chosen for pose retrieval. Conversely, in scenes with sufficient parallax, the homography matrix can only solve for the pose using keypoint pairs lying on planes, whereas the essential matrix yields more accurate results. Thus, matrix selection is determined by the following:
$$s_{ratio} = \frac{s_H}{s_H + s_E},$$
with $s_{ratio}$ being the ratio of the score from the homography matrix. If $s_{ratio} > 0.45$, the scene involves either low parallax or an extensive planar surface, and it is processed using the homography matrix. Otherwise, the essential matrix is selected.

3.4.3. Motion Recovery

After the matrix selection, to guarantee the validity of the pose solving, we examine the motion hypotheses. In the case of the homography matrix, we perform hypothesis verification [39] on the eight solutions. Solutions where keypoints fall behind the camera are discarded, and from the remaining solutions, those with significant parallax and minimal reprojection errors are selected. For the essential matrix, the four solutions are examined using Singular Value Decomposition (SVD), and erroneous solutions are removed using a similar approach to that used for the homography matrix. The pose between the two frames is then determined.
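For concreteness, the offline module can be sketched with OpenCV primitives. In the sketch below, K is an assumed 3 × 3 pixel-unit intrinsics array, the robust scoring is simplified to a sum of inverted symmetric-transfer/Sampson errors over RANSAC inliers, and the hypothesis verification step is omitted; these simplifications, the thresholds, and the function name are assumptions rather than the authors’ implementation.

import cv2
import numpy as np

def estimate_relative_pose(img_prev, img_cur, K, ratio_thresh=0.45):
    # SIFT keypoints and Lowe-ratio matching between the two thermal frames.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_prev, None)
    kp2, des2 = sift.detectAndCompute(img_cur, None)
    knn = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    p_prev = np.float32([kp1[m.queryIdx].pt for m in good])
    p_cur = np.float32([kp2[m.trainIdx].pt for m in good])

    # Fit both models with RANSAC.
    H, mask_h = cv2.findHomography(p_prev, p_cur, cv2.RANSAC, 3.0)
    E, mask_e = cv2.findEssentialMat(p_prev, p_cur, K, cv2.RANSAC, 0.999, 1.0)

    # Symmetric transfer error of the homography on its inliers.
    inl_h = mask_h.ravel().astype(bool)
    fwd = cv2.perspectiveTransform(p_prev.reshape(-1, 1, 2), H).reshape(-1, 2)
    bwd = cv2.perspectiveTransform(p_cur.reshape(-1, 1, 2), np.linalg.inv(H)).reshape(-1, 2)
    err_h = np.sum((fwd - p_cur) ** 2, 1) + np.sum((bwd - p_prev) ** 2, 1)
    s_h = np.sum(1.0 / (1.0 + err_h[inl_h]))

    # Sampson (epipolar) error of the essential matrix on its inliers.
    inl_e = mask_e.ravel().astype(bool)
    F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)
    x1 = np.hstack([p_prev, np.ones((len(p_prev), 1), np.float32)])
    x2 = np.hstack([p_cur, np.ones((len(p_cur), 1), np.float32)])
    l1, l2 = (F @ x1.T).T, (F.T @ x2.T).T
    err_e = np.sum(x2 * l1, 1) ** 2 / (np.sum(l1[:, :2] ** 2, 1) + np.sum(l2[:, :2] ** 2, 1))
    s_e = np.sum(1.0 / (1.0 + err_e[inl_e]))

    # Model selection and motion recovery.
    if s_h / (s_h + s_e) > ratio_thresh:            # planar or low-parallax scene: use H
        _, Rs, Ts, _ = cv2.decomposeHomographyMat(H, K)
        R, t = Rs[0], Ts[0]                          # hypothesis checks omitted in this sketch
    else:                                            # general scene: use E
        _, R, t, _ = cv2.recoverPose(E, p_prev, p_cur, K)
    return R, t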

3.4.4. Motion Loss

The poses generated by the motion enhancement module are converted from the transformation matrix to the 6-DoF representation. We incorporate the pose, computed based on the epipolar geometry constraint, into the motion loss function for optimization:
$$L_{motion} = w_r \left\| \hat{r}_{t-1 \rightarrow t} - r_{t-1 \rightarrow t} \right\|_2 + w_t \left\| \hat{t}_{t-1 \rightarrow t} - t_{t-1 \rightarrow t} \right\|_2,$$
where $\hat{r}_{t-1 \rightarrow t}, \hat{t}_{t-1 \rightarrow t}$ and $r_{t-1 \rightarrow t}, t_{t-1 \rightarrow t}$ are the 6-DoF relative poses from the motion enhancement module and PoseNet, respectively. $w_r$ and $w_t$ denote the weights corresponding to rotation and translation. Notably, normalization of the translation is applied for scale consistency. We empirically set $w_r = 10$ and $w_t = 1$.
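A minimal sketch of this loss, assuming batched axis-angle rotation vectors and 3-vector translations, is given below; normalizing both translations to unit length is our reading of the scale-consistency note above, not a detail taken from the released code.

import torch

def motion_loss(r_hat, t_hat, r_pred, t_pred, w_r=10.0, w_t=1.0):
    """Motion loss between the motion enhancement module's pose (r_hat, t_hat)
    and PoseNet's pose (r_pred, t_pred); rotations are (B, 3) axis-angle vectors."""
    t_hat_n = t_hat / t_hat.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    t_pred_n = t_pred / t_pred.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    return (w_r * (r_hat - r_pred).norm(dim=-1).mean()
            + w_t * (t_hat_n - t_pred_n).norm(dim=-1).mean())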

3.5. Multi-Spectral Loss

In the training phase, conducting view synthesis on thermal and RGB images allows us to calculate thermal loss and photometric loss, respectively. This provides the network with sufficient multi-spectral self-supervision signals. In the testing phase, our framework solely requires thermal frames to estimate depth. To calculate loss terms from visible images, a depth warping operation is employed to establish the relationship between the two cameras, allowing the depth warping from the thermal camera coordinate system to the RGB camera coordinate system.

3.5.1. Thermal Reconstruction Loss

After the networks generate depths at three levels (single-frame depth $D_{Mono,t}^T$, MVS depth $D_{MVS,t}^T$, and fused depth $D_{Fused,t}^T$), the thermal image $\hat{I}_t^T$ is synthesized using the estimated pose $P_{t-1 \rightarrow t}^T$ and the previous thermal frame $I_{t-1}^T$, as described in Section 3.2. Utilizing both the original thermal image $I_t^T$ and the synthesized image $\hat{I}_t^T$, we compute the thermal reconstruction loss:
$$L_{rec}^T = \min\, te\!\left(I_t^T, \hat{I}_t^T\right),$$
in which $te$ is a thermal error and is defined as a combination of an SSIM loss and an L1 loss:
$$te\!\left(I_t^T, \hat{I}_t^T\right) = \frac{w_T}{2}\left(1 - SSIM\!\left(I_t^T, \hat{I}_t^T\right)\right) + \left(1 - w_T\right)\left\| I_t^T - \hat{I}_t^T \right\|_1,$$
here, $SSIM(\cdot, \cdot)$ represents the Structural Similarity Index (SSIM) loss [40], and $w_T$ is the weight of the SSIM loss. The SSIM loss remains effective for training due to the diverse structures and shapes of thermal sources in the scenes. Empirically, we set $w_T = 0.15$.
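The thermal error term can be sketched as follows; the 3 × 3 average-pooling SSIM is the formulation commonly used in self-supervised depth code (an assumption here), and the per-pixel minimum over multiple source frames is omitted for brevity.

import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Mean SSIM computed with 3x3 average pooling, clamped to [0, 1]."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1).mean()

def thermal_error(I_t, I_hat, w_T=0.15):
    """te(.): weighted SSIM term plus L1 term, as defined above."""
    return 0.5 * w_T * (1.0 - ssim(I_t, I_hat)) + (1.0 - w_T) * (I_t - I_hat).abs().mean()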

3.5.2. Depth Warping

The photometric loss derived from RGB images effectively supplements self-supervision signals, mitigating inherent deficiencies in thermal images, such as weak textures and blurred boundaries. However, the photometric loss typically demands an additional RGB network to process visible images. To address this challenge, a depth warping operation is employed to enable the depth warping from the thermal camera to the RGB camera, eliminating the requirement for an extra neural network to infer visible images. Initially, exploiting the rigid geometry association of the two cameras and the generated depth maps, we calculate the thermal–visible rigid flow between the thermal and visible frames:
$$S_{T \rightarrow V}(p_t) = K^V P_{T \rightarrow V} D_t^T(p_t) (K^V)^{-1} p_t - p_t,$$
where $D_t^T$ is the depth of the current thermal frame, and $p_t$ represents the pixels in the current thermal frame. $K^V$ is the intrinsic matrix of the visible camera, and $P_{T \rightarrow V}$ is the extrinsic matrix between the two cameras.
Subsequently, with the flow reversal operation and the differentiable sampling, the depth of the current visible frame is obtained as follows:
$$D_t^V = f_{sample}\!\left( D_t^T, f_{reverse}\!\left( S_{T \rightarrow V} \right) \right),$$
where $f_{reverse}(\cdot)$ is the flow reversal layer [41], which is designed to reverse the computed optical flow of pixels, and $f_{sample}(\cdot, \cdot)$ denotes the spatial sampling function [42], utilized for calculating the visible frame depth, which is transferred with the pixel offset of the optical flow. As both operations are differentiable, they allow backward propagation during training.
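The flow-reversal and sampling steps can be sketched as below. The nearest-neighbor scattering is a simplified stand-in for the flow reversal layer of [41] (collisions are resolved by the last write), and the (x, y) channel ordering of the flow tensor is an assumption.

import torch
import torch.nn.functional as F

def reverse_flow_nearest(flow_tv):
    """Approximate f_reverse by scattering -flow to each pixel's landing position."""
    B, _, H, W = flow_tv.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(flow_tv.device)        # (2, H, W)
    target = (grid.unsqueeze(0) + flow_tv).round().long()                  # landing pixels
    target[:, 0].clamp_(0, W - 1)
    target[:, 1].clamp_(0, H - 1)
    reversed_flow = torch.zeros_like(flow_tv)
    for b in range(B):
        tx, ty = target[b, 0].reshape(-1), target[b, 1].reshape(-1)
        reversed_flow[b, :, ty, tx] = -flow_tv[b].reshape(2, -1)           # last write wins
    return reversed_flow

def warp_depth_to_visible(depth_t, flow_tv):
    """D_t^V = f_sample(D_t^T, f_reverse(S_{T->V})) via bilinear grid sampling."""
    B, _, H, W = depth_t.shape
    rev = reverse_flow_nearest(flow_tv)
    ys, xs = torch.meshgrid(torch.arange(H, device=depth_t.device),
                            torch.arange(W, device=depth_t.device), indexing="ij")
    gx = 2.0 * (xs + rev[:, 0]) / (W - 1) - 1.0
    gy = 2.0 * (ys + rev[:, 1]) / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                                   # (B, H, W, 2)
    return F.grid_sample(depth_t, grid, mode="bilinear", align_corners=True)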

3.5.3. Photometric Reconstruction Loss

The depth $D_t^V$, computed through the depth warping module, can be utilized for synthesizing visible images $\hat{I}_t^V$. Following [5], the photometric reconstruction loss for visible images is expressed as
$$L_{rec}^V = \min\, pe\!\left(I_t^V, \hat{I}_t^V\right),$$
$$pe\!\left(I_t^V, \hat{I}_t^V\right) = \frac{w_V}{2}\left(1 - SSIM\!\left(I_t^V, \hat{I}_t^V\right)\right) + \left(1 - w_V\right)\left\| I_t^V - \hat{I}_t^V \right\|_1,$$
where $pe$ is the photometric error, and $w_V$ is the SSIM loss weight for visible images, set as $w_V = 0.85$.
Then, the multi-spectral loss $L_{rec}^{MS}$ is formed by the thermal loss and the photometric loss:
$$L_{rec}^{MS} = \lambda_T L_{rec}^T + \lambda_V L_{rec}^V.$$

3.5.4. Total Loss Function

Our framework is trained in a self-supervised manner, with the total loss formulated as
$$L_{total} = L_{rec}^{MS} + \lambda_M L_{motion} + \lambda_S \left( L_S^T + L_S^V \right),$$
where $L_S$ represents the edge-aware smoothness loss [43], which is commonly incorporated in existing works to deal with discontinuities in the depth map:
$$L_S = \frac{1}{N} \sum_{i} \left| \nabla D_t(p_i) \right| \cdot e^{-\left| \nabla I_t(p_i) \right|},$$
where $N$ is the number of pixels in an image and $\nabla$ represents the 2D differential operator. $L_S$ is applied to both the RGB part ($V$) and the thermal part ($T$). $L_{rec}^{MS}$ signifies the multi-spectral reconstruction loss, composed of the thermal loss and the photometric loss. $L_{motion}$ denotes the motion loss term from the motion enhancement module. $\lambda_S$ and $\lambda_M$ are weights assigned to the smoothness loss and the motion loss.
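As a concrete reference, the smoothness term and the total objective can be sketched as below; the weight defaults follow the values reported in Section 4.3, while mean-normalized disparity and multi-scale aggregation, if used, are omitted here.

import torch

def edge_aware_smoothness(depth, img):
    """Edge-aware smoothness loss L_S (illustrative sketch)."""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    i_dx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

def total_loss(l_rec_T, l_rec_V, l_motion, l_smooth_T, l_smooth_V,
               lam_T=1.0, lam_V=0.1, lam_S=0.001, lam_M=1.0):
    """L_total = L_rec^MS + lambda_M * L_motion + lambda_S * (L_S^T + L_S^V)."""
    return (lam_T * l_rec_T + lam_V * l_rec_V
            + lam_M * l_motion + lam_S * (l_smooth_T + l_smooth_V))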

4. Experiments

4.1. Datasets

To validate the effectiveness of our framework, we conduct training and testing using the publicly available VIVID dataset [44]. This dataset is highly suitable for our task as it comprises multimodal data streams with temporal properties, containing raw thermal images, RGB images, sensor extrinsic calibration results, camera poses (serving as ego-motion ground-truth), and LiDAR point clouds (serving as depth ground-truth). It encompasses challenging scenes from both daytime and nighttime, various lighting conditions, and scenarios with different motion conditions. For our experiments, we utilize 6958 samples for training and 1058 samples for testing, and all images are processed at a 320 × 256 resolution. As our method is self-supervised, the ground-truth data (e.g., LiDAR depth and camera poses) is not required during training. The training phase solely utilizes the thermal images, the corresponding RGB images, and the sensor calibration parameters to compute the loss functions. During inference, our model is fully independent and requires only the monocular thermal frames as input. The ground-truth data is used exclusively for the quantitative evaluation of our method’s performance during the testing phase. Hence, the VIVID dataset provides satisfactory and comprehensive resources for training and evaluating the proposed framework.

4.2. Thermal Image Preprocessing

To meet the training requirements, we perform a multi-step preprocessing pipeline. Typically, the 8-bit thermal images produced by the thermal camera undergo an automatic frame-wise min–max rescaling; as a result, they lack temporal consistency, which makes them unsuitable for self-supervised training on consecutive image sequences. On the other hand, the 14-bit raw thermal images are impractical for direct depth and ego-motion estimation: their expansive temperature measurement range compresses a substantial amount of valuable scene information into a narrow portion of the value range.
Therefore, to ensure temporal consistency while enhancing structural details, we employ a comprehensive pipeline. The resized 14-bit raw data ( I T , r a w ) is first clamped to the predefined temperature hyperparameters { τ min , τ max } and normalized to the [0, 1] range. This normalized image, which often suffers from low contrast, then undergoes a mild contrast adjustment. We apply Contrast Limited Adaptive Histogram Equalization (CLAHE), which boosts local features while avoiding the introduction of excessive noise. Finally, this single-channel image is remapped using a jet colormap to 3-channels and standardized. This full pipeline can be formulated as a single operation:
$$I^T = f_{std}\!\left( f_{map}\!\left( f_{adj}\!\left( \frac{f_{clamp}\!\left(I^{T,raw}, \tau_{\max}, \tau_{\min}\right) - \tau_{\min}}{\tau_{\max} - \tau_{\min}} \right) \right) \right),$$
where f c l a m p ( · ) is the clamping function, f a d j ( · ) is the contrast adjustment function, f m a p ( · ) is the colormap function, and f s t d ( · ) is the standardization. Figure 5 illustrates the visualization of the results of this preprocessing. The subsequent ablation study demonstrates the effectiveness of this processing in improving the performance of the networks.
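A sketch of this pipeline using OpenCV is given below; the CLAHE clip limit and tile size, and the assumption that the 14-bit raw counts have already been converted to degrees Celsius, are illustrative choices not specified in the text.

import cv2
import numpy as np

def preprocess_thermal(raw14, t_min=0.0, t_max=30.0):
    """Clamp, normalize, CLAHE-adjust, colormap, and standardize a raw thermal frame
    (illustrative sketch; raw values are assumed to already be in degrees Celsius)."""
    # Clamp to the temperature window and normalize to [0, 1].
    norm = np.clip(raw14.astype(np.float32), t_min, t_max)
    norm = (norm - t_min) / (t_max - t_min)

    # Mild local contrast adjustment with CLAHE on an 8-bit copy.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    adj = clahe.apply((norm * 255.0).astype(np.uint8))

    # Remap to a 3-channel jet colormap, then standardize per channel.
    mapped = cv2.applyColorMap(adj, cv2.COLORMAP_JET).astype(np.float32) / 255.0
    return (mapped - mapped.mean(axis=(0, 1))) / (mapped.std(axis=(0, 1)) + 1e-6)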

4.3. Implementation Details

Training Configuration

For the single-frame depth prediction network (DepthNet) and PoseNet, we utilize the architectures from Monodepth2 [16] and SfM-Learner [4] with the ResNet-18 [45] encoder, respectively. The confidence network is constructed with a three-layer 2D CNN and a Sigmoid layer. The ResNet blocks are pretrained on ImageNet, and the other network layers are trained from scratch. Our method was implemented using the PyTorch (version 2.2.2) deep learning library [46]. All networks are trained on a single NVIDIA RTX 3080Ti GPU using the ADAM optimizer [47].
For data augmentation, we employed an offline strategy. Multiple augmented sequences were generated to expand the original dataset by applying varied parameters for horizontal flipping, scaling, and cropping. We shuffle and randomly sample from this entire original and augmented pool. During training, the single-frame depth prediction network and pose estimation network receive three consecutive thermal frames $I_{t-1}^T, I_t^T, I_{t+1}^T$, and the MVS depth network takes two frames $I_t^T, I_{t-1}^T$. To balance the loss, we assign weights $\lambda_T, \lambda_V, \lambda_S, \lambda_M$ as $1.0, 0.1, 0.001, 1.0$, respectively. The temperature parameters $\tau_{\min}, \tau_{\max}$ in the thermal image preprocessing are set as $0.0, 30.0$ (in degrees Celsius). The batch size is set to 4, and the learning rate is 0.0001. A total of 300 epochs are conducted for training.

4.4. Depth Prediction Results

In evaluating the performance of the proposed method, we compare it with state-of-the-art self-supervised depth estimation methods. The work of Shin et al. [11] is set as the baseline. Additionally, the methods employing monocular RGB supervision are included for comparison. The evaluation uses the metrics proposed by Eigen et al. [14], which are widely employed in modern depth estimation methods for RGB images, to measure the performance. The metrics include the error metrics (smaller is better) and the accuracy rate metrics (larger is better):
$$\text{Abs Rel}: \; \frac{1}{N} \sum_{i=1}^{N} \frac{\left| D(p_i) - D_{gt}(p_i) \right|}{D_{gt}(p_i)}$$
$$\text{Sq Rel}: \; \frac{1}{N} \sum_{i=1}^{N} \frac{\left\| D(p_i) - D_{gt}(p_i) \right\|^2}{D_{gt}(p_i)}$$
$$\text{RMSE}: \; \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left\| D(p_i) - D_{gt}(p_i) \right\|^2 }$$
$$\text{RMSE log}: \; \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left\| \log D(p_i) - \log D_{gt}(p_i) \right\|^2 }$$
$$\text{Accuracy}: \; \% \text{ of } D(p_i) \text{ s.t. } \max\!\left( \frac{D(p_i)}{D_{gt}(p_i)}, \frac{D_{gt}(p_i)}{D(p_i)} \right) = \delta < thr$$
where $D(p_i)$ and $D_{gt}(p_i)$ are the estimated and ground-truth depth values, respectively, and $thr$ is a threshold set to $1.25$, $1.25^2$, and $1.25^3$.
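These metrics can be computed directly; the NumPy sketch below is illustrative, and any depth capping or median scaling used in the evaluation protocol is assumed to be applied before this call.

import numpy as np

def depth_metrics(pred, gt):
    """Standard Eigen depth metrics over valid ground-truth pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    delta = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(delta < thr) for thr in (1.25, 1.25 ** 2, 1.25 ** 3)]
    return abs_rel, sq_rel, rmse, rmse_log, acc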
Our framework’s performance in monocular depth estimation was rigorously evaluated on the VIVID dataset. The experiments, summarized in Table 1 and Figure 6 and Figure 7, conclusively demonstrate the superiority of our approach, which arises from the synergistic integration of multi-frame geometric constraints and multi-spectral supervision.
As shown in Table 1, our method consistently outperforms baseline self-supervised techniques across all scenarios and metrics. The key to this performance lies in our departure from traditional single-frame estimation. Unlike the baseline [11], which predicts depth from a single image and thus fails to exploit temporal data, our framework is built upon an MVS cost volume. By constructing this volume from temporally adjacent thermal frames, our model effectively leverages multi-view geometry to resolve ambiguities inherent in a single perspective, leading to more robust and accurate geometric structures. This architectural advantage is the primary driver behind the significant 19.8% reduction in absolute relative error (Abs. Rel.) compared to the baseline.
The results particularly highlight our method’s ability to maintain performance across extreme variations in illumination, especially when compared to RGB-based approaches. As predicted, RGB-based methods suffer a catastrophic performance degradation at night, underscoring their unreliability for continuous operation. While thermal imaging is inherently advantageous in these conditions, our method proves more stable than even the thermal-based baseline. From day to night, our model’s Abs. Rel. error increases by a mere 2.55%, whereas the baseline degrades by 8.72%. This enhanced stability can be attributed to the multi-spectral loss employed during training. By using a depth warping operation to incorporate a photometric reconstruction loss from registered RGB images, our thermal network learns richer, more generalized features that better capture object boundaries and fine-grained textures—details often blurred or subtle in thermal data alone. This robust feature representation, learned via multi-spectral supervision, translates directly to more consistent and accurate predictions when presented with only thermal images during inference.
To visually corroborate the quantitative results from Table 1, Figure 6 provides a qualitative comparison. The visualization highlights a key limitation of relying on standard cameras: in the challenging nighttime scene (bottom row), the RGB-based method (e) fails completely, producing incoherent geometry. In stark contrast, both our method (c) and the thermal baseline (d) robustly generate plausible depth maps that align with the ground-truth structure (f), underscoring the advantage of using thermal imagery for all-day operation.
A more granular analysis in Figure 7 reveals the improvements our method offers over the thermal baseline. This figure provides a detailed side-by-side comparison with the baseline, including error maps. The baseline frequently produces noticeable geometric distortions and artifacts. These are common failure modes for single-frame methods. In contrast, our results are geometrically coherent and artifact-free, a direct benefit of the MVS cost volume that aggregates spatial information over time to form a more complete and consistent 3D understanding of the scene.
Furthermore, the error maps (e, f) provide a granular view of our method’s precision. Our error map (e) is predominantly blue (low error), whereas the baseline’s (f) shows significant concentrations of red and orange (high error), especially around object edges. This visual evidence confirms that our framework, which combines the strengths of MVS with a final depth fusing step, achieves a superior level of per-pixel accuracy and produces more reliable depth maps for the downstream tasks.

4.5. Pose Estimation Results

In addition to depth, accurate ego-motion estimation is critical for autonomous navigation. The results presented in Table 2 demonstrate that our framework surpasses the baseline method in self-supervised pose estimation from thermal video.
As detailed in Table 2, our method achieves the lowest Absolute Trajectory Error (ATE) [48] across all the daytime and nighttime sequences. The most significant contributor to this success is the proposed motion enhancement module. This module provides a geometrically grounded supervisory signal by leveraging epipolar geometry constraints to compute a highly accurate pose offline. This extra pose supervision signal is then used to formulate a motion loss that guides the training of our PoseNet. This additional supervision stabilizes the learning process and forces the network to adhere to geometric principles, resulting in a 24.94% improvement in ATE over the baseline. The module also ensures the stability of the motion hint used for MVS depth sampling, creating a virtuous cycle where better poses lead to better depth and vice versa.
Moreover, similar to our depth results, our pose estimation network maintains exceptional performance from day to night, with its ATE degrading by only 1.51%. Compared to RGB-supervised methods, our method exhibits robustness to changes in illumination. In well-lit daytime scenes, RGB-based self-supervised methods perform well, but, as the scenes transition to poorly lit nighttime conditions, their errors increase; for instance, the performance of the method proposed by Bian et al. [18] decreases by 74.43%.

4.6. Ablation Study

In the evaluation experiments, our method reveals superior performance on ego-motion and depth estimation compared to the other approaches. However, the individual contributions and improvements made by each component of the proposed framework remain unclear. Therefore, we conduct ablation experiments on the VIVID dataset to analyze and evaluate the contributions of each component and loss term within the framework. The results, presented in Table 3 and Table 4, systematically deconstruct our framework and quantify the importance of each element.

4.6.1. Effect of Each Component

This study, summarized in Table 3, assesses the impact of each core architectural module by selectively toggling it on or off within the proposed framework, showing the respective importance of each component for depth prediction.
“Ours (full)” represents the results using all the components, and the other rows correspond to the results with the removal of each respective component. The results indicate that “Ours (full)” achieves the best performance, while removing different components leads to varying degrees of performance decline, with increased prediction errors and decreased accuracy. For “Ours (w/o depth warping)”, we deactivate the depth transfer module, so the network receives no supervisory signals from visible images. In the case of “Ours (w/o motion enhancement)”, the motion enhancement module is turned off, omitting the utilization of the epipolar geometry constraint for pose generation. In the “w/o motion hint” experiment, the depth candidate range is fixed at half of the single-frame depth value instead of being adapted by the motion hint. As for “Ours (w/o remapping)”, training is conducted using the original 14-bit thermal images. For “Ours (w/o depth fusing)”, the model solely outputs MVS depth without performing depth map fusion.
The underlying rationale for the motion enhancement module is to provide a supervisory signal with a different failure mode than the primary reconstruction-based supervision. The dense reconstruction signal can be ambiguous in photometrically poor regions, where an incorrect pose might still yield a low error. Our module provides an additional signal that relies on sparse structural features rather than dense texture content. To validate the quality of this mechanism, Figure 8 visually demonstrates its core feature-matching process. The module successfully extracts a dense and geometrically consistent set of feature correspondences, confirming that it provides a reliable geometric constraint. The benefit of adding this geometrically rigid constraint is then shown in Table 3. The “Ours (w/o motion enhancement)” experiment, which relies solely on the potentially ambiguous dense signal, shows a corresponding drop in pose estimation accuracy.

4.6.2. Effect of Each Loss Term

The ablation study is conducted on different loss terms by setting them to zero. The depth prediction evaluation results for the experiments are presented in Table 4. “Ours(full)” employs all the loss terms and achieves the best results. The thermal reconstruction loss is found to be crucial and foundational to our framework as removing it results in non-convergent training. Although removing the smoothness loss enhances the visibility of the structural details in the depth map, it results in an overall increase in errors. Similarly, removing the photometric reconstruction loss or the motion loss leads to increased errors and decreased accuracy in depth prediction.

5. Discussion

While our framework demonstrates robust performance, we acknowledge several avenues for future work. Our method does not explicitly model dynamic objects, a common challenge for self-supervised methods relying on view synthesis. Thus, incorporating dynamic-aware modules remains a valuable direction. We also acknowledge that the multi-spectral loss requires RGB data during the training phase. Although this is a training-only optimization (our model is fully independent at inference and our ablation study shows that it remains effective without this signal), this reliance does reduce data collection flexibility. A promising future direction is to enhance this flexibility by applying domain adaptation techniques to reduce this RGB dependence. In terms of deployment, while the MVS cost volume adds computational load, our full model achieves 43 FPS (23 ms per frame), comfortably meeting real-time requirements. We also note that the motion enhancement module is a training-only component and does not impact runtime. For more resource-constrained hardware, model compression and knowledge distillation remain promising avenues for optimization. Finally, while validated on a comprehensive benchmark, extending the framework’s applicability presents a clear direction for future work, such as evaluating its generalization to broader domains, including diverse scenarios and fusion with sensors like LiDAR and IMU.

6. Conclusions

In this paper, we introduce a self-supervised learning framework for depth and pose estimation from multi-frame thermal images. Firstly, the proposed framework constructs an MVS cost volume from multiple thermal frames. The poses generated by the network adaptively guide and adjust the construction process, directing MVS depth towards single-frame depth in scenarios with slow or static camera movement. Secondly, to bolster the stability of this mechanism, a motion enhancement module based on epipolar geometry constraints provides self-generated supervision signals to optimize the training process. Finally, a depth warping operation links visible and thermal cameras, establishing a loss term based on the multi-spectral property to aid in training the thermal network.
The experimental evaluations on the public dataset showcase that the proposed method demonstrates improved accuracy in depth and pose estimation compared to state-of-the-art methods. It exhibits effectiveness and accuracy even under challenging light conditions. We expect further improvements in the results by incorporating additional geometry features and employing multi-sensor fusion methods to boost the accuracy and robustness of depth and pose predictions.

Author Contributions

Conceptualization, R.Y., G.M., and J.G.; methodology, R.Y. and L.X.; investigation, R.Y. and L.X.; resources, G.M. and J.G.; supervision, G.M. and J.G.; writing—original draft, R.Y.; writing—review and editing, G.M. and J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Agricultural Science and Technology Independent Innovation Fund of Jiangsu Province (Grant No. CX(24)1023) and the Key Research and Development Program of Jiangsu Province (Industrial Foresight and Key Core Technologies) (Grant No. BE2021016).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://visibilitydataset.github.io/, accessed on 8 October 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, C.; Lee, D. EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles. Appl. Sci. 2025, 15, 9130. [Google Scholar] [CrossRef]
  2. Shi, P.; Dong, X.; Ge, R.; Liu, Z.; Yang, A. Dp-M3D: Monocular 3D object detection algorithm with depth perception capability. Knowl.-Based Syst. 2025, 318, 113539. [Google Scholar] [CrossRef]
  3. Yang, W.J.; Tsai, H.; Chan, D.Y. High-Precision Depth Estimation Networks Using Low-Resolution Depth and RGB Image Sensors for Low-Cost Mixed Reality Glasses. Appl. Sci. 2025, 15, 6169. [Google Scholar] [CrossRef]
  4. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858. [Google Scholar] [CrossRef]
  5. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar] [CrossRef]
  6. Lin, H.; Peng, S.; Chen, J.; Peng, S.; Sun, J.; Liu, M.; Bao, H.; Feng, J.; Zhou, X.; Kang, B. Prompting depth anything for 4k resolution accurate metric depth estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 17070–17080. [Google Scholar]
  7. Aditya, N.; Dhruval, P. Thermal voyager: A comparative study of rgb and thermal cameras for night-time autonomous navigation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 14116–14122. [Google Scholar] [CrossRef]
  8. Nawaz, M.; Khan, S.; Daud, M.; Asim, M.; Anwar, G.A.; Shahid, A.R.; Ho, H.P.A.; Chan, T.; Pak Kong, D.; Yuan, W. Improving Autonomous Vehicle Cognitive Robustness in Extreme Weather With Deep Learning and Thermal Camera Fusion. IEEE Open J. Vehic. Tech. 2025, 6, 426–441. [Google Scholar] [CrossRef]
  9. Lu, Y.; Lu, G. An alternative of lidar in nighttime: Unsupervised depth estimation based on single thermal image. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; pp. 3833–3843. [Google Scholar] [CrossRef]
  10. Kim, N.; Choi, Y.; Hwang, S.; Kweon, I.S. Multispectral transfer network: Unsupervised depth estimation for all-day vision. In Proceedings of the Conference AAAI, New Orleans, LA, USA, 2–7 February 2018; pp. 6983–6991. [Google Scholar] [CrossRef]
  11. Shin, U.; Lee, K.; Lee, S.; Kweon, I.S. Self-supervised depth and ego-motion estimation for monocular thermal video using multi-spectral consistency loss. IEEE Robot. Autom. Lett. 2022, 7, 1103–1110. [Google Scholar] [CrossRef]
  12. Schönberger, J.L.; Zheng, E.; Frahm, J.M.; Pollefeys, M. Pixelwise view selection for unstructured multi-view stereo. In Proceedings of the European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 501–518. [Google Scholar] [CrossRef]
  13. Chen, W.; Fu, Z.; Yang, D.; Deng, J. Single-image depth perception in the wild. Adv. Neural Inf. Process. Syst. 2016, 29, 730–738. [Google Scholar]
  14. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 2014, 27, 2366–2374. [Google Scholar]
  15. Dong, Q.; Zhou, Z.; Qiu, X.; Zhang, L. A Survey on Self-Supervised Monocular Depth Estimation Based on Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 15622–15642. [Google Scholar] [CrossRef]
  16. Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2485–2494. [Google Scholar] [CrossRef]
  17. Bian, J.; Li, Z.; Wang, N.; Zhan, H.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth and ego-motion learning from monocular video. Adv. Neural Inf. Process. Syst. 2019, 33, 35–45. [Google Scholar]
  18. Bian, J.; Zhan, H.; Wang, N.; Li, Z.; Zhang, L.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis. 2021, 129, 2548–2564. [Google Scholar] [CrossRef]
  19. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  20. Yu, S.; Wu, M.; Lam, S.K.; Wang, C.; Wang, R. EDS-Depth: Enhancing Self-Supervised Monocular Depth Estimation in Dynamic Scenes. IEEE Trans. Intell. Transp. Syst. 2025, 26, 5585–5597. [Google Scholar] [CrossRef]
  21. Feng, C.; Zhang, C.; Chen, Z.; Hu, W.; Lu, K.; Ge, L. Self-Supervised Monocular Depth Estimation With Dual-Path Encoders and Offset Field Interpolation. IEEE Trans. Image Process. 2025, 34, 939–954. [Google Scholar] [CrossRef] [PubMed]
  22. Cheng, J.; Yin, W.; Wang, K.; Chen, X.; Wang, S.; Yang, X. Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 10138–10147. [Google Scholar] [CrossRef]
  23. Wang, X.; Zhu, Z.; Huang, G.; Chi, X.; Ye, Y.; Chen, Z.; Wang, X. Crafting monocular cues and velocity guidance for self-supervised multi-frame depth learning. In Proceedings of the Conference AAAI, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2689–2697. [Google Scholar] [CrossRef]
  24. Watson, J.; Mac Aodha, O.; Prisacariu, V.; Brostow, G.; Firman, M. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 1164–1174. [Google Scholar] [CrossRef]
  25. Bae, G.; Budvytis, I.; Cipolla, R. Multi-view depth estimation by fusing single-view depth probability with multi-view geometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 2842–2851. [Google Scholar] [CrossRef]
  26. Zhao, G.; Wei, H.; He, H. IAFMVS: Iterative Depth Estimation with Adaptive Features for Multi-View Stereo. Neurocomputing 2025, 629, 129682. [Google Scholar] [CrossRef]
  27. Wang, L.; Liang, Q.; Che, Y.; Wang, L.; Wang, G. IFDepth: Iterative fusion network for multi-frame self-supervised monocular depth estimation. Knowl.-Based Syst. 2025, 318, 113467. [Google Scholar] [CrossRef]
  28. Wang, Y.; Chen, H.; Liu, Y.; Zhang, S. Edge-Based Monocular Thermal-Inertial Odometry in Visually Degraded Environments. IEEE Robot. Autom. Lett. 2023, 8, 2078–2085. [Google Scholar] [CrossRef]
  29. Doer, C.; Trommer, G.F. Radar visual inertial odometry and radar thermal inertial odometry: Robust navigation even in challenging visual conditions. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 331–338. [Google Scholar] [CrossRef]
  30. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
  31. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Volume 9351, pp. 234–241. [Google Scholar]
  32. Wang, F.; Galliani, S.; Vogel, C.; Pollefeys, M. IterMVS: Iterative probability estimation for efficient multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 8606–8615. [Google Scholar] [CrossRef]
  33. Zhang, J.; Li, S.; Luo, Z.; Fang, T.; Yao, Y. Vis-mvsnet: Visibility-aware multi-view stereo network. Int. J. Comput. Vision 2023, 131, 199–214. [Google Scholar] [CrossRef]
  34. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  35. He, M.; Zhu, C.; Huang, Q.; Ren, B.; Liu, J. A review of monocular visual odometry. Vis. Comput. 2020, 36, 1053–1065. [Google Scholar] [CrossRef]
  36. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  37. Zhang, Z. Eight-Point Algorithm. In Computer Vision: A Reference Guide; Ikeuchi, K., Ed.; Springer: Cham, Switzerland, 2021; pp. 370–371. [Google Scholar] [CrossRef]
  38. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar] [CrossRef]
  39. Faugeras, O.D.; Lustman, F. Motion and structure from motion in a piecewise planar environment. Int. J. Pattern Recognit. Artif. Intell. 1988, 2, 485–508. [Google Scholar] [CrossRef]
  40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  41. Xu, X.; Siyao, L.; Sun, W.; Yin, Q.; Yang, M.H. Quadratic video interpolation. Adv. Neural Inf. Process. Syst. 2019, 32, 1647–1656. [Google Scholar]
  42. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025. [Google Scholar]
  43. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar] [CrossRef]
  44. Lee, A.J.; Cho, Y.; Shin, Y.S.; Kim, A.; Myung, H. ViViD++: Vision for visibility dataset. IEEE Robot. Autom. Lett. 2022, 7, 6282–6289. [Google Scholar] [CrossRef]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  46. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  47. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  48. Zhang, Z.; Scaramuzza, D. A tutorial on quantitative trajectory evaluation for visual (-inertial) odometry. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7244–7251. [Google Scholar] [CrossRef]
Figure 1. (a) The overview of our framework. (b) The RGB–Thermal image pair at the test time. (c) The depth and error map from the baseline model [11], with smaller errors represented in blue and larger errors in red. (d) The depth and error map from our method, showing superior accuracy.
Figure 2. The main architecture of the proposed framework. (a) PoseNet estimates the ego-motion between thermal frames, and the pose generated by the motion enhancement module is integrated to compute the motion loss. (b) DepthNet predicts the depth of the single thermal frame, followed by depth warping. (c) The encoded thermal feature maps are used to construct a cost volume with the assistance of motion hints. The cost volume is then decoded into the MVS depth and confidence map. Each generated depth map can be used to calculate multi-spectral losses. The confidence map is instrumental in fusing the single-frame depth and the MVS depth, resulting in a more refined fused depth.
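As a concrete illustration of the confidence-guided fusion described in Figure 2c, the minimal sketch below blends the single-frame and MVS depth maps with a per-pixel weight. The linear blending rule and the tensor names are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def fuse_depths(single_depth: torch.Tensor,
                mvs_depth: torch.Tensor,
                confidence: torch.Tensor) -> torch.Tensor:
    """Blend single-frame and multi-frame (MVS) depth maps.

    All tensors are (B, 1, H, W); `confidence` lies in [0, 1] and is produced
    by the MVS decoder. Where the cost volume is reliable the MVS depth
    dominates; elsewhere the single-frame prediction is kept.
    """
    confidence = confidence.clamp(0.0, 1.0)
    return confidence * mvs_depth + (1.0 - confidence) * single_depth
```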
Figure 3. Detailed architecture of the multi-frame depth estimation part. It consists of three main components: (1) a shared MVS encoder for feature extraction; (2) a module that constructs a cost volume using motion hint-based sampling and feature warping; and (3) a 3D Convolutional Neural Network (CNN)-based MVS decoder that infers depth and a confidence map for the final fusion step.
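The motion hint-based cost volume in Figure 3 can be pictured as a plane-sweep construction: source features are warped into the reference view at several depth hypotheses using the pose estimated by PoseNet, and a matching cost is accumulated per hypothesis. The sketch below assumes a dot-product matching cost and externally supplied depth hypotheses; the paper's exact sampling around the motion hint and its cost metric may differ.

```python
import torch
import torch.nn.functional as F

def build_cost_volume(ref_feat, src_feat, K, K_inv, T_src_ref, depth_hyps):
    """Plane-sweep cost volume between a reference and one source feature map.

    ref_feat, src_feat: (B, C, H, W) encoded thermal features.
    K, K_inv:           (B, 3, 3) intrinsics scaled to the feature resolution.
    T_src_ref:          (B, 4, 4) pose from the reference to the source frame
                        (the "motion hint" from PoseNet).
    depth_hyps:         iterable of D candidate depths (assumed given).
    Returns a (B, D, H, W) volume of feature-matching costs.
    """
    B, C, H, W = ref_feat.shape
    device = ref_feat.device

    # Pixel grid of the reference view in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, 3, H * W)

    R = T_src_ref[:, :3, :3]
    t = T_src_ref[:, :3, 3:]

    costs = []
    for d in depth_hyps:
        # Back-project reference pixels at depth d, move them into the source
        # frame, and re-project them with the source intrinsics.
        cam_pts = K_inv @ pix * d                       # (B, 3, H*W)
        src_pts = K @ (R @ cam_pts + t)                 # (B, 3, H*W)
        uv = src_pts[:, :2] / src_pts[:, 2:].clamp(min=1e-6)

        # Normalise pixel coordinates to [-1, 1] for grid_sample.
        u = 2.0 * uv[:, 0] / (W - 1) - 1.0
        v = 2.0 * uv[:, 1] / (H - 1) - 1.0
        grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)

        warped = F.grid_sample(src_feat, grid, align_corners=True,
                               padding_mode="zeros")
        # Negative feature correlation as a simple matching cost.
        costs.append(-(warped * ref_feat).mean(dim=1))  # (B, H, W)

    return torch.stack(costs, dim=1)                    # (B, D, H, W)
```

A 3D CNN decoder, as in Figure 3, would then regress the MVS depth and the confidence map from this (B, D, H, W) volume.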
Figure 4. In the upper branch, the motion enhancement module constructs epipolar geometry constraints between matched keypoints in thermal frames and computes the essential matrix (E) and the homography matrix (H) using their respective algorithms. The more suitable matrix is selected according to the scene structure, and the camera pose is then computed. The lower branch utilizes PoseNet to generate poses between the thermal frames. Finally, the two poses are used to calculate the motion loss.
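In practice, the upper branch of Figure 4 can be realized with standard two-view geometry routines [37,38,39]. The sketch below uses OpenCV to estimate both models from matched keypoints and to recover a pose from the more appropriate one. The model-selection heuristic (inlier counting) is a simplified stand-in for the scene-structure test mentioned in the caption, and the recovered translation is only defined up to scale.

```python
import cv2
import numpy as np

def self_generated_pose(pts_ref, pts_src, K):
    """Recover a relative camera pose from matched thermal keypoints.

    pts_ref, pts_src: (N, 2) pixel coordinates of matched keypoints.
    K:                (3, 3) camera intrinsic matrix.
    """
    pts_ref = np.asarray(pts_ref, dtype=np.float64)
    pts_src = np.asarray(pts_src, dtype=np.float64)

    # Essential matrix for general (non-planar) scenes.
    E, mask_e = cv2.findEssentialMat(pts_ref, pts_src, K,
                                     method=cv2.RANSAC,
                                     prob=0.999, threshold=1.0)
    # Homography for (near-)planar or rotation-dominant scenes.
    H, mask_h = cv2.findHomography(pts_ref, pts_src, cv2.RANSAC, 3.0)

    # Pick the model that explains more correspondences: a simple proxy for
    # the scene-structure test described in the caption.
    use_E = mask_e is not None and (mask_h is None or
                                    mask_e.sum() >= mask_h.sum())
    if use_E:
        _, R, t, _ = cv2.recoverPose(E, pts_ref, pts_src, K, mask=mask_e)
    else:
        # Decompose H; a complete system would disambiguate the candidate
        # solutions with cheirality and reprojection checks.
        _, Rs, ts, _ = cv2.decomposeHomographyMat(H, K)
        R, t = Rs[0], ts[0]
    return R, t  # t is only defined up to scale
```

The recovered pose can then be compared with the PoseNet prediction, for instance through a rotation and translation discrepancy, to form the motion loss; the precise distance used by the authors is not detailed in this part of the article.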
Figure 5. Preprocessing of thermal images. Details are compressed in the raw thermal image, so clamping, adjustment, and remapping operations are performed based on the temperature range to make efficient use of the thermal image information.
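A minimal sketch of the clamp-adjust-remap preprocessing in Figure 5 is given below. The percentile-based clamping range and the linear remapping to 8-bit values are illustrative choices, not necessarily the exact procedure applied to the dataset.

```python
import numpy as np

def preprocess_thermal(raw, low_pct=1.0, high_pct=99.0):
    """Remap a raw 14/16-bit thermal frame to well-spread 8-bit values.

    The clamping range is taken from percentiles of the frame's radiometric
    values so that the useful temperature range occupies the full output span.
    """
    raw = raw.astype(np.float32)
    lo, hi = np.percentile(raw, [low_pct, high_pct])
    clamped = np.clip(raw, lo, hi)                  # clamp to the useful range
    remapped = (clamped - lo) / max(hi - lo, 1e-6)  # rescale to [0, 1]
    return (remapped * 255.0).astype(np.uint8)
```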
Figure 6. Qualitative comparison of depth prediction under varying illumination. The top row shows a daytime scene, while the bottom row depicts a challenging nighttime scene. Note the severe degradation of the RGB-based depth (e) at night, whereas our thermal-based approach (c) remains robust.
Figure 7. Qualitative comparison results of depth prediction on the VIVID dataset. We present (a) RGB scene images, (b) thermal images, (c) depth maps generated by our method, (d) depth maps generated by the baseline, (e) error maps from our method, and (f) error maps from the baseline. The visualized error maps are derived from the absolute relative errors between the predicted depth and the ground-truth provided by LiDAR, where cool colors (blue) indicate smaller errors and warm colors (red) represent larger errors [11].
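The error maps in Figure 7 are absolute relative errors evaluated against the LiDAR ground truth. A minimal sketch of how such a map could be computed and color-coded is shown below; the invalid-pixel masking and the color-scale limits are assumptions made for illustration.

```python
import numpy as np

def abs_rel_error_map(pred, gt):
    """Per-pixel absolute relative error, evaluated only where the sparse
    LiDAR ground truth is available (gt > 0)."""
    valid = gt > 0
    err = np.zeros_like(gt, dtype=np.float32)
    err[valid] = np.abs(pred[valid] - gt[valid]) / gt[valid]
    return err, valid

# Illustrative color coding (blue = small error, red = large error):
# import matplotlib.pyplot as plt
# err, valid = abs_rel_error_map(pred, gt)
# plt.imshow(np.where(valid, err, np.nan), cmap="jet", vmin=0.0, vmax=0.5)
# plt.colorbar(); plt.show()
```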
Figure 8. Qualitative visualization of the feature-matching process in our motion enhancement module. Robust feature correspondences (inliers shown as connecting lines) are successfully extracted from two consecutive thermal frames. These matches provide the geometric constraints for the self-generated supervisory signal.
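The correspondences visualized in Figure 8 can be obtained with a classical detector-descriptor pipeline. Since the article cites SIFT [36], the sketch below uses SIFT with Lowe's ratio test; the detector settings and thresholds are illustrative assumptions.

```python
import cv2

def match_thermal_frames(img0, img1, max_matches=200):
    """Detect and match keypoints between two preprocessed 8-bit thermal
    frames; the matches can feed the pose recovery step of Figure 4."""
    sift = cv2.SIFT_create(nfeatures=2000)
    kp0, des0 = sift.detectAndCompute(img0, None)
    kp1, des1 = sift.detectAndCompute(img1, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des0, des1, k=2)

    # Lowe's ratio test keeps only distinctive correspondences.
    good = []
    for pair in knn:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    good = sorted(good, key=lambda m: m.distance)[:max_matches]

    pts0 = [kp0[m.queryIdx].pt for m in good]
    pts1 = [kp1[m.trainIdx].pt for m in good]
    return pts0, pts1, good
```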
Table 1. Quantitative comparison between our method and other self-supervised methods for depth prediction on the VIVID dataset. The evaluation encompasses scenes from daytime and nighttime, testing each method’s performance under varying light conditions. The “Input” column specifies the types of images fed into the network. “V” and “T” represent visible RGB images and thermal images, respectively. The table highlights results with the minimum errors and the highest accuracy in bold.

| Scene | Methods | Input | Abs. Rel. | Sq. Rel. | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|---|
| Outdoor day1 | SfMLearner [4] | V | 0.200 | 1.549 | 6.394 | 0.248 | 0.684 | 0.922 | 0.975 |
| Outdoor day1 | Monodepth2 [5] | V | 0.436 | 6.318 | 9.967 | 0.454 | 0.491 | 0.719 | 0.840 |
| Outdoor day1 | Bian et al. [17] | V | 0.136 | 1.464 | 5.638 | 0.179 | 0.844 | 0.977 | 0.991 |
| Outdoor day1 | Watson et al. [24] | V | 0.166 | 1.472 | 6.358 | 0.220 | 0.754 | 0.945 | 0.986 |
| Outdoor day1 | Shin et al. [11] (T) | T | 0.143 | 0.766 | 4.547 | 0.180 | 0.790 | 0.979 | 0.996 |
| Outdoor day1 | Shin et al. [11] (MS) | T | 0.142 | 0.792 | 4.730 | 0.181 | 0.788 | 0.980 | 0.996 |
| Outdoor day1 | Ours | T | 0.115 | 0.683 | 4.310 | 0.159 | 0.869 | 0.980 | 0.996 |
| Outdoor day2 | SfMLearner [4] | V | 0.192 | 1.652 | 6.511 | 0.235 | 0.702 | 0.941 | 0.978 |
| Outdoor day2 | Monodepth2 [5] | V | 0.435 | 6.915 | 9.576 | 0.438 | 0.591 | 0.752 | 0.843 |
| Outdoor day2 | Bian et al. [17] | V | 0.131 | 1.376 | 5.899 | 0.185 | 0.822 | 0.976 | 0.991 |
| Outdoor day2 | Watson et al. [24] | V | 0.197 | 2.281 | 7.707 | 0.241 | 0.695 | 0.922 | 0.982 |
| Outdoor day2 | Shin et al. [11] (T) | T | 0.148 | 0.933 | 4.736 | 0.188 | 0.802 | 0.970 | 0.991 |
| Outdoor day2 | Shin et al. [11] (MS) | T | 0.145 | 0.917 | 4.784 | 0.187 | 0.804 | 0.971 | 0.991 |
| Outdoor day2 | Ours | T | 0.122 | 0.957 | 4.787 | 0.170 | 0.855 | 0.968 | 0.991 |
| Outdoor night1 | SfMLearner [4] | V | 0.429 | 4.584 | 8.624 | 0.445 | 0.468 | 0.698 | 0.834 |
| Outdoor night1 | Monodepth2 [5] | V | 0.704 | 11.75 | 12.53 | 0.636 | 0.362 | 0.559 | 0.701 |
| Outdoor night1 | Bian et al. [17] | V | 0.520 | 6.413 | 10.40 | 0.516 | 0.381 | 0.596 | 0.755 |
| Outdoor night1 | Watson et al. [24] | V | 0.469 | 5.690 | 10.96 | 0.466 | 0.332 | 0.612 | 0.839 |
| Outdoor night1 | Shin et al. [11] (T) | T | 0.158 | 0.844 | 4.634 | 0.192 | 0.754 | 0.977 | 0.996 |
| Outdoor night1 | Shin et al. [11] (MS) | T | 0.156 | 0.856 | 4.813 | 0.192 | 0.752 | 0.976 | 0.996 |
| Outdoor night1 | Ours | T | 0.119 | 0.747 | 4.189 | 0.162 | 0.850 | 0.979 | 0.996 |
| Outdoor night2 | SfMLearner [4] | V | 0.373 | 4.215 | 8.294 | 0.396 | 0.548 | 0.773 | 0.879 |
| Outdoor night2 | Monodepth2 [5] | V | 0.602 | 10.84 | 11.72 | 0.562 | 0.477 | 0.650 | 0.759 |
| Outdoor night2 | Bian et al. [17] | V | 0.464 | 6.376 | 9.887 | 0.472 | 0.511 | 0.685 | 0.807 |
| Outdoor night2 | Watson et al. [24] | V | 0.416 | 4.860 | 10.16 | 0.428 | 0.383 | 0.692 | 0.873 |
| Outdoor night2 | Shin et al. [11] (T) | T | 0.159 | 1.084 | 5.115 | 0.204 | 0.772 | 0.957 | 0.989 |
| Outdoor night2 | Shin et al. [11] (MS) | T | 0.156 | 1.049 | 5.166 | 0.202 | 0.775 | 0.957 | 0.989 |
| Outdoor night2 | Ours | T | 0.124 | 0.899 | 4.694 | 0.179 | 0.829 | 0.963 | 0.990 |

Error columns (Abs. Rel., Sq. Rel., RMSE, RMSE log): lower is better. Accuracy columns (δ < 1.25, δ < 1.25², δ < 1.25³): higher is better.
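For reference, the error and accuracy columns in Table 1 follow the standard monocular depth evaluation protocol. A minimal sketch of their computation over pixels with valid LiDAR ground truth is shown below; any median scaling or depth capping applied in the actual evaluation is omitted here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics as reported in Table 1, computed over pixels
    with valid ground truth (gt > 0)."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))

    # Threshold accuracies delta < 1.25, 1.25^2, 1.25^3.
    ratio = np.maximum(pred / gt, gt / pred)
    a1 = np.mean(ratio < 1.25)
    a2 = np.mean(ratio < 1.25 ** 2)
    a3 = np.mean(ratio < 1.25 ** 3)
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```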
Table 2. Quantitative comparison of our method to other self-supervised methods for pose estimation. The compared methods include trained RGB-based self-supervised methods, and the scenes encompass both daytime and nighttime situations. Smaller errors in the table indicate better performance. Results with the minimum errors are highlighted in bold.

| Methods | Input | Outdoor Day1 (Mean / Std.) | Outdoor Day2 (Mean / Std.) | Outdoor Night1 (Mean / Std.) | Outdoor Night2 (Mean / Std.) |
|---|---|---|---|---|---|
| SfMLearner [4] | V | 0.0774 / 0.0407 | 0.0867 / 0.0414 | 0.0657 / 0.0342 | 0.0579 / 0.0279 |
| Monodepth2 [5] | V | 0.0525 / 0.0305 | 0.0544 / 0.0266 | 0.0552 / 0.0302 | 0.0545 / 0.0277 |
| Bian et al. [17] | V | 0.0503 / 0.0255 | 0.0514 / 0.0285 | 0.0886 / 0.0439 | 0.0888 / 0.0390 |
| Shin et al. [11] (T) | T | 0.0751 / 0.0371 | 0.0784 / 0.0391 | 0.0744 / 0.0417 | 0.0793 / 0.0402 |
| Shin et al. [11] (MS) | T | 0.0541 / 0.0307 | 0.0643 / 0.0365 | 0.0590 / 0.0315 | 0.0604 / 0.0315 |
| Ours | T | 0.0429 / 0.0251 | 0.0450 / 0.0248 | 0.0464 / 0.0288 | 0.0442 / 0.0252 |
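Table 2 reports the mean and standard deviation of pose errors per sequence. The caption does not spell out the exact error definition, so the sketch below illustrates one common choice in the spirit of the trajectory-evaluation tutorial [48]: the translational residual of consecutive relative motions; the reported numbers may be computed differently.

```python
import numpy as np

def relative_pose_errors(T_est, T_gt):
    """Translational residual of consecutive relative motions between an
    estimated and a ground-truth trajectory (lists of 4x4 pose matrices).
    Returns the mean and standard deviation over all frame pairs."""
    errs = []
    for i in range(len(T_est) - 1):
        # Relative motion between consecutive frames for both trajectories.
        d_est = np.linalg.inv(T_est[i]) @ T_est[i + 1]
        d_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + 1]
        # Residual motion after removing the ground-truth displacement.
        d_err = np.linalg.inv(d_gt) @ d_est
        errs.append(np.linalg.norm(d_err[:3, 3]))
    errs = np.asarray(errs)
    return errs.mean(), errs.std()
```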
Table 3. Ablation study on various components. The best results are marked in bold.

| Methods | Abs. Rel. | RMSE | δ < 1.25 |
|---|---|---|---|
| Ours (full) | 0.127 | 4.305 | 0.831 |
| Ours (w/o depth warping) | 0.144 | 4.729 | 0.815 |
| Ours (w/o motion enhancement) | 0.138 | 4.763 | 0.824 |
| Ours (w/o motion hint) | 0.140 | 4.798 | 0.816 |
| Ours (w/o remapping) | 0.140 | 4.832 | 0.813 |
| Ours (w/o depth fusing) | 0.136 | 4.478 | 0.816 |
Table 4. Ablation study on various loss terms. The experiment labeled “w/o RGB loss” indicates setting the reconstruction loss based on photometric consistency to zero. The best results are marked in bold.

| Methods | Abs. Rel. | RMSE | δ < 1.25 |
|---|---|---|---|
| Ours (full) | 0.127 | 4.305 | 0.831 |
| Ours (w/o thermal loss) | – | – | – |
| Ours (w/o motion loss) | 0.138 | 4.763 | 0.824 |
| Ours (w/o smoothness loss) | 0.140 | 4.742 | 0.799 |
| Ours (w/o RGB loss) | 0.144 | 4.729 | 0.815 |
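The ablations in Table 4 correspond to zeroing individual terms of the total training objective. The schematic below shows how such a weighted combination could look; the term names and the weights are placeholders for illustration, not the values used in the paper.

```python
def total_loss(l_thermal, l_rgb, l_motion, l_smooth,
               w_thermal=1.0, w_rgb=1.0, w_motion=1.0, w_smooth=1e-3):
    """Schematic composition of the loss terms ablated in Table 4: thermal
    reconstruction, RGB (multi-spectral) photometric, motion, and smoothness
    losses. Setting a weight to zero reproduces the corresponding ablation."""
    return (w_thermal * l_thermal + w_rgb * l_rgb
            + w_motion * l_motion + w_smooth * l_smooth)
```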
