Article

Monocular Depth Estimation from a Single Infrared Image

Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Korea
* Author to whom correspondence should be addressed.
Electronics 2022, 11(11), 1729; https://doi.org/10.3390/electronics11111729
Submission received: 29 April 2022 / Revised: 25 May 2022 / Accepted: 27 May 2022 / Published: 30 May 2022

Abstract
Thermal infrared imaging is attracting much attention due to its robustness against illumination variation. However, because of the spectral difference between thermal infrared images and RGB images, existing research on self-supervised monocular depth estimation has performance limitations. Therefore, in this study, we propose a novel Self-Guided Framework using a Pseudolabel predicted from RGB images. Our proposed framework, which addresses the problem of the appearance matching loss in the existing framework, transfers the high accuracy of the Pseudolabel to the thermal depth estimation network by comparing both low- and high-level pixels. Furthermore, we propose a Patch-NetVLAD Loss, which strengthens local detail and global context information in the depth map from thermal infrared imaging by comparing locally global patch-level descriptors. Finally, we introduce an Image Matching Loss to estimate a more accurate depth map in the thermal depth network by enhancing the performance of the Pseudolabel. We demonstrate that the proposed framework yields significant performance improvements when applied to various depth networks on the KAIST Multispectral Dataset.

1. Introduction

In recent years, monocular depth estimation has become a crucial component in the computer vision and robotics fields, with applications such as autonomous driving [1] and augmented reality [2,3], due to its potential as a substitute for LiDAR. Many studies have shown large performance improvements in monocular depth estimation via supervised learning [4,5] using LiDAR as ground truth. However, data collection for supervised learning is difficult because of the high cost of LiDAR, and precise, dense depth maps cannot be obtained because LiDAR provides only sparse depth information. To solve this problem, researchers have proposed self-supervised methods that do not require ground-truth depth information during training. Many researchers have proposed various self-supervised monocular depth estimation methods [6,7,8,9,10], and the performance gap between supervised and self-supervised learning has been narrowing. However, these methodologies do not guarantee performance in low-illumination situations such as nighttime, because they use RGB images as inputs. Additionally, because of the intrinsic limits of RGB sensors, external environmental changes such as rain or cloudiness are problematic.
As a realistic alternative, a long-wave infrared (thermal) imaging camera that is resistant to various environmental changes can be used. A thermal camera records an object's emitted radiation, which, unlike RGB images, is largely unaffected by changes in the external environment. Therefore, thermal cameras have increasingly been used in various tasks such as object detection [11,12], semantic segmentation [13], place recognition [14], and self-supervised depth estimation [15]. Kim et al. [15] showed that thermal images can be incorporated into RGB-based depth estimation methods because RGB and thermal images share global context information. However, substituting thermal images for the RGB inputs of RGB-based depth estimation remains a challenge due to spectral differences. In the existing depth estimation methodology [15] that uses a thermal image as input, an appearance matching loss is calculated using RGB images and a depth map predicted from the thermal image, as shown in Figure 1a. This learning framework is weakened by intrinsic limitations of thermal images such as low contrast and blurry edges.
In this paper, we propose a novel Self-Guided Framework for self-supervised monocular depth estimation based on thermal images to solve the problems of the existing learning framework. While the existing methodology exhibits decreased performance due to the difference between the thermal image and the RGB image, our Self-Guided Framework, which is our main contribution, improves thermal image-based depth estimation performance by using a Pseudolabel. The Pseudolabel is the depth map estimated from the high-contrast RGB image. We observe that an image quality evaluation metric transfers the accuracy of the Pseudolabel, which is a dense depth map, to a thermal image-based depth estimation model better than the SIlog loss [16] used in supervised learning [4,5]. Therefore, we construct a Self-Guided Loss as an image quality evaluation metric to compare the Pseudolabel with a predicted depth map from the thermal image. The image quality evaluation metric in the Self-Guided Loss consists of the structural similarity index measure [17], the L1 distance, and a perceptual loss [18] to compare all the low- and high-level information of the depth maps. In addition, we add our proposed Patch-NetVLAD Loss, our second contribution, to enhance the local detail and global context information of the depth predicted from the thermal image. This loss uses Patch-NetVLAD [19] to estimate local and global descriptors for both depth maps and compares all descriptors between the Pseudolabel and the depth map predicted from the thermal image, strengthening local detail and global context information. Our third contribution is an Image Matching Loss that enhances the performance of the Pseudolabel used as ground truth in the Self-Guided Framework and enables more accurate thermal depth estimation. We show the qualitative and quantitative performance of our proposed Self-Guided Framework and its generality and validity using the KAIST Multispectral Dataset [11].

2. Related Work

2.1. Self-Supervised Depth Estimation

Beyond various hand-crafted approaches, Eigen et al. [16] proposed one of the earliest CNN-based methods for depth estimation from a single image. Subsequent works [5,20,21] extended this baseline model to produce high-quality depth maps. While these supervised learning-based methods show tremendous performance in depth prediction, they are critically limited in scalability due to the difficulty of collecting large-scale image and depth pairs. Therefore, self-supervised learning, which relies on relatively low-cost data such as synchronized stereo pairs [6,7] or monocular video [8,22,23,24,25], has become mainstream for monocular depth estimation.
Zhou et al. [23] proposed a framework for self-supervised training in the purely monocular setting, where a depth network and a pose network are learned simultaneously from unlabeled monocular videos. Inspired by Zhou et al. [23], Godard et al. [8] proposed a minimum reprojection loss and an automasking loss, which handle occlusions and pixels that violate the camera-motion assumption. Various methods have built on Godard et al. [8] by changing the depth network [22,25,26,27].
Han et al. [25] proposed a novel architecture that sequentially refines the predicted depth map and gradually generates a high-quality depth map via multistack CNN structures. Zhou et al. [26] developed a novel depth estimation network that uses semantic information in the down- and up-sampling procedures. Yan et al. [27] introduced a novel channelwise attention-based depth network employing two channelwise attention modules.
The above methodologies are trained by calculating an appearance matching loss using only SSIM [17] and the L1 distance. Unlike these two terms, which compare low-level pixel values [18], methods that compare synthesized and real images at a high level have been proposed in the image translation field [18,28]. Here, we propose an Image Matching Loss, which enhances semantic information in the estimated depth map by comparing high-level features. Additionally, by adding our proposed Patch-NetVLAD Loss to the Image Matching Loss, local detail and global context information are reinforced by comparing the local and global descriptors of the two images.

2.2. Thermal Infrared Camera Vision

Due to the intrinsic limits of RGB cameras, such as their weakness in low-illumination situations, thermal imaging cameras are utilized for robust recognition in all-day conditions. Thermal images combined with deep learning are attracting attention in various applications such as object detection [11,12], segmentation [13,29], image enhancement [30], person re-identification [31], and visual localization [14]. However, there are few learning-based approaches [15] to estimating a depth map from thermal images. Kim et al. [15] proposed a novel multitask framework that generates a pixelwise depth image in an unsupervised manner and exploits geometric priors and chromaticity clues.
The training process of the existing self-supervised monocular depth estimation framework [15] with thermal infrared images as input, shown in Figure 1a, can be summarized as follows: (1) A monocular depth model estimates the disparity map $D_T$ from the left thermal infrared image $T_L$. (2) The reconstructed left RGB image $\hat{R}_L$ is generated by warping the right RGB image $R_R$ to the left RGB image $R_L$ via the above prediction and geometric constraints. This process can be formulated as follows:
$$\hat{R}_L = P(R_R, K_R, K_L, E, D_T)$$
where $P$ indicates the projection, $K_R$ and $K_L$ refer to the intrinsic parameters of the right and left images, and $E$ refers to the extrinsic parameters between the stereo images. (3) Then, the appearance matching loss $L_{ap}$ between the reconstructed left RGB image $\hat{R}_L$ and the left RGB image $R_L$ is the main objective function that encourages the model to learn depth information. This learning method makes it possible to train a self-supervised monocular depth estimation network using a thermal image that is not captured in stereo. However, calculating the appearance matching loss $L_{ap}$ between RGB images using $D_T$ limits the learning of the depth estimation model due to the spectral difference between RGB and thermal images.
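For clarity, a minimal PyTorch sketch of this right-to-left reconstruction is given below. It assumes rectified stereo images, so the warp reduces to a horizontal shift by the predicted disparity (in pixels); the function name and the sign convention of the shift are illustrative, not taken from [15].

```python
import torch
import torch.nn.functional as F

def reconstruct_left_from_right(right_rgb, disparity):
    """Warp the right RGB image into the left view using a predicted disparity map.

    Assumes rectified stereo, so warping is a horizontal shift of `disparity`
    pixels (the sign depends on the camera setup).
    right_rgb: (B, 3, H, W) tensor; disparity: (B, 1, H, W) tensor in pixels.
    """
    b, _, h, w = right_rgb.shape
    device = right_rgb.device

    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    xs = xs.float().expand(b, -1, -1)   # (B, H, W)
    ys = ys.float().expand(b, -1, -1)

    # Shift the x-coordinates along the horizontal epipolar lines.
    xs_shifted = xs - disparity.squeeze(1)

    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * xs_shifted / (w - 1) - 1.0
    grid_y = 2.0 * ys / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)

    # Bilinear sampling yields the reconstructed left image.
    return F.grid_sample(right_rgb, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```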

3. Materials and Methods

3.1. Self-Guided Framework

To overcome the aforementioned problems, we propose a Self-Guided Framework that significantly increases the quality of the depth predicted from the thermal image. Our Self-Guided Framework, as illustrated in Figure 1b, learns the depth network $N_R$ using only RGB stereo images and thereby creates a highly accurate depth map $D_R$. Then, our proposed Self-Guided Loss $L_{SG}$ is calculated using the depth map $D_R$ from the RGB image as a Pseudolabel to learn the depth network $N_T$, which takes the left thermal image $T_L$ as input. The existing supervised depth estimation methods [4,5,16] employ the SIlog loss to compare depth maps; however, this loss cannot consider the relationship between object information and pixels in dense depth maps.
To overcome this problem, our Self-Guided Loss $L_{SG}$ is constructed from the loss commonly used in image translation [32,33] (a combination of SSIM [17] and L1 distance, $L_{SL}$) together with our proposed Patch-NetVLAD Loss. The equation is as follows:
$$L_{SG}(D_R, D_T) = M \cdot L_{SL}(D_R, D_T) + L_{VGG}(D_R, D_T) + \beta L_{PV}(D_R, D_T)$$
where $\beta$ is a static value to rescale the loss, set to 10, and $M$ is the automask calculated in the appearance matching loss $L_{ap}$.
A combination of SSIM and L1 distance
The structural similarity index measure (SSIM) and L1 distance loss $L_{SL}$, which are the most traditional losses in image translation tasks, are used to compare the Pseudolabel $D_R$ with the depth map $D_T$ from the thermal image. The equation is as follows:
$$\mathrm{SSIM}(D_R, D_T) = l(D_R, D_T) \cdot cs(D_R, D_T)$$
$$l(D_R, D_T) = \frac{2\mu_R \mu_T + c_1}{\mu_R^2 + \mu_T^2 + c_1}, \qquad cs(D_R, D_T) = \frac{2\delta_R \delta_T + c_2}{\delta_R^2 + \delta_T^2 + c_2}$$
$$L_{SL}(D_R, D_T) = \alpha \frac{1 - \mathrm{SSIM}(D_R, D_T)}{2} + (1 - \alpha)\,|D_R - D_T|$$
where $\mu_x$ is the mean value of $D_x$, $\delta_x^2$ is the variance of $D_x$, and we set $c_1 = 0.01^2$, $c_2 = 0.03^2$, and $\alpha = 0.85$. The SSIM [17] combines two comparison measurements, luminance ($l$) and contrast-structure ($cs$), to measure the similarity between the two images, and the L1 distance directly compares the two image values. This loss causes the predicted depth map $D_T$ from the thermal image to emulate these elements of the Pseudolabel.
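A minimal PyTorch sketch of $L_{SL}$ is given below. It follows the common monodepth-style SSIM implementation with 3×3 average pooling and uses the covariance in the contrast-structure term, as standard SSIM implementations do; the window size is an assumption, since it is not stated in the text.

```python
import torch
import torch.nn.functional as F

def ssim_l1_loss(pred, target, alpha=0.85, c1=0.01 ** 2, c2=0.03 ** 2):
    """L_SL: weighted sum of (1 - SSIM)/2 and the L1 distance between two depth maps.

    pred, target: (B, 1, H, W) tensors on a comparable scale.
    """
    mu_x = F.avg_pool2d(pred, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)

    var_x = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    cov_xy = F.avg_pool2d(pred * target, 3, 1, 1) - mu_x * mu_y

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    ssim = torch.clamp(luminance * contrast_structure, 0, 1)

    return alpha * ((1 - ssim) / 2).mean() + (1 - alpha) * (pred - target).abs().mean()
```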
Perceptual Loss We use perceptual loss [18] to measure the high-level perceptual and semantic differences between the Pseudolabel $D_R$ and the predicted depth map $D_T$ from the thermal image. This loss compares the features obtained by feeding the two depth maps into a VGG network [34] trained for image classification, and the equation is as follows:
$$F_T = N_{VGG}(D_T), \qquad F_R = N_{VGG}(D_R)$$
$$L_{VGG}(D_R, D_T) = |F_R - F_T|$$
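A possible implementation of this perceptual term is sketched below, assuming a frozen, ImageNet-pretrained VGG-16 from torchvision (≥ 0.13 weights API) and a single intermediate feature level; the chosen layer is an assumption, as the text does not specify which VGG features are compared.

```python
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """L_VGG: L1 distance between VGG features of two images or depth maps."""

    def __init__(self, layer_index=16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features[:layer_index]
        for p in self.features.parameters():
            p.requires_grad = False  # VGG is used only as a fixed feature extractor

    def forward(self, x, y):
        # Single-channel depth maps are repeated to 3 channels for VGG; RGB passes through.
        if x.shape[1] == 1:
            x, y = x.repeat(1, 3, 1, 1), y.repeat(1, 3, 1, 1)
        return (self.features(x) - self.features(y)).abs().mean()
```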

3.2. Patch-NetVLAD Loss

By combining the advantages of the local and global descriptor approaches and extracting patch-level features from NetVLAD residuals, Patch-NetVLAD [19] demonstrates significant performance improvement in the image matching task. Inspired by Patch-NetVLAD, we propose a novel Patch-NetVLAD Loss, through which $D_T$ learns both the local and global descriptors of the Pseudolabel, as shown in Figure 2. Global descriptors $V^G$ and local descriptors $V^L$ are extracted from each depth map (the Pseudolabel and $D_T$) using the NetVLAD [35] network $N_{NV}$ pretrained on the Pittsburgh dataset [36]. Please refer to [19] for a detailed description of extracting Patch-NetVLAD vectors. The Patch-NetVLAD Loss is calculated from the descriptors predicted for each depth map, and the equation is as follows:
$$V_T^G, V_T^L = N_{NV}(D_T), \qquad V_R^G, V_R^L = N_{NV}(D_R)$$
$$L_{PV}(D_R, D_T) = \frac{1}{n_s n_p} \sum_{s=1}^{n_s} \sum_{p=1}^{n_p} \left| V_{R,sp}^{L} - V_{T,sp}^{L} \right| + \left| V_R^G - V_T^G \right|$$
where $n_s$ is the number of patch sizes, and $n_p$ indicates the total number of patches. Similar to the perceptual loss, the comparison of global descriptors improves the recognition and semantic information of $D_T$. In addition, the local semantic information and details in $D_T$ are improved by the comparison of local descriptors, which are obtained from patches.
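The sketch below shows how such a descriptor comparison could be assembled once global and patch-level descriptors are available. The `desc_net` interface, which returns one global descriptor of shape (B, D) and a list of patch-descriptor tensors of shape (B, n_p, D), one per patch size, is purely illustrative; the actual Patch-NetVLAD [19] code exposes its descriptors differently.

```python
def patch_netvlad_loss(desc_net, pred_depth, pseudo_label):
    """L_PV: L1 comparison of global and patch-level descriptors of two depth maps."""
    g_pred, patches_pred = desc_net(pred_depth)
    g_label, patches_label = desc_net(pseudo_label)

    # Average the patch-level L1 distances over patch sizes and patches.
    patch_term = 0.0
    for p_pred, p_label in zip(patches_pred, patches_label):
        patch_term = patch_term + (p_pred - p_label).abs().sum(dim=-1).mean()
    patch_term = patch_term / len(patches_pred)

    # Global descriptor comparison.
    global_term = (g_pred - g_label).abs().sum(dim=-1).mean()
    return patch_term + global_term
```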

3.3. Image Matching Loss

For our proposed Self-Guided Framework using a Pseudolabel, the performance of the Pseudolabel determines the performance of $D_T$. Thus, we propose an Image Matching Loss $L_{IM}$ to improve the performance of the Pseudolabel. Since the SSIM relies only on low-level differences between pixels [18], the Image Matching Loss consists of a perceptual loss and a Patch-NetVLAD Loss that compute high-level similarity [18]. Through the perceptual loss and the global-descriptor comparison of the Patch-NetVLAD Loss, the semantic information of the Pseudolabel is further strengthened. Additionally, the local-descriptor term of the Patch-NetVLAD Loss improves the accuracy of local detail areas. The perceptual loss $L_{VGG}^{IM}$ and Patch-NetVLAD Loss $L_{PV}^{IM}$ in the Image Matching Loss are computed in the same way as in Section 3.1 and Section 3.2, respectively, with the inputs changed from the depth maps to RGB images. The above can be formulated as follows:
$$F_R = N_{VGG}(R_L), \qquad F_{\hat{R}} = N_{VGG}(\hat{R}_L)$$
$$L_{VGG}^{IM}(R_L, \hat{R}_L) = |F_R - F_{\hat{R}}|$$
$$V_R^G, V_R^L = N_{NV}(R_L), \qquad V_{\hat{R}}^G, V_{\hat{R}}^L = N_{NV}(\hat{R}_L)$$
$$L_{PV}^{IM}(R_L, \hat{R}_L) = \frac{1}{n_s n_p} \sum_{s=1}^{n_s} \sum_{p=1}^{n_p} \left| V_{\hat{R},sp}^{L} - V_{R,sp}^{L} \right| + \left| V_{\hat{R}}^G - V_R^G \right|$$
$$L_{IM}(R_L, \hat{R}_L) = L_{VGG}^{IM}(R_L, \hat{R}_L) + L_{PV}^{IM}(R_L, \hat{R}_L)$$
where $n_s$ is the number of patch sizes, and $n_p$ indicates the total number of patches.
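Reusing the modules from the sketches in Sections 3.1 and 3.2, the Image Matching Loss can then be assembled as follows; the function signature is illustrative.

```python
def image_matching_loss(vgg_loss, desc_net, left_rgb, recon_left):
    """L_IM: the perceptual and Patch-NetVLAD terms applied to the left RGB image
    and its reconstruction instead of to depth maps."""
    return (vgg_loss(left_rgb, recon_left)
            + patch_netvlad_loss(desc_net, left_rgb, recon_left))
```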

3.4. Training Loss

Appearance Matching Loss We use a combination of an L1 distance and a structural similarity (SSIM) [17] term to increase the pixel-level similarity between the left RGB image $R_L$ and the synthesized image $\hat{R}_L$:
$$L_{app}(R_L, \hat{R}_L) = \alpha \frac{1 - \mathrm{SSIM}(R_L, \hat{R}_L)}{2} + (1 - \alpha)\,|R_L - \hat{R}_L|$$
This appearance matching term is a typical robust objective for self-supervision; however, parallax errors in the scene lead to out-of-view and occluded pixels, which have an undesirable effect on learning. The per-pixel minimum reprojection loss is used to handle these out-of-view and occluded pixels, as in monodepth2 [8]. It alleviates these problems by calculating the minimum loss per pixel for the left RGB image $R_L$, so that each pixel is supervised from a synthesized target image in which it is neither out of view nor occluded.
$$L_{app}(R_L, R_R) = \min L_{app}(R_L, \hat{R}_L)$$
We also apply the automasking of static pixels suggested in monodepth2 [8]. Because static pixels yield a small appearance matching loss and can create infinite-depth holes when there is no ego-motion between frames, we use automasking to ignore them. We find the pixels for which $L_{app}(R_L, R_R)$ is higher than $L_{app}(R_L, \hat{R}_L)$ to produce a mask:
$$M = \left[ \min L_{app}(R_L, R_R) > \min L_{app}(R_L, \hat{R}_L) \right]$$
The final appearance matching loss for self-supervised loss is as follows:
$$L_{ap} = L_{app}(R_L, R_R) \cdot M$$
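A compact sketch of this masked appearance loss is given below, following the monodepth2-style formulation described above; the 3×3 average-pooled per-pixel SSIM and the final mean reduction are standard choices rather than details stated in the text.

```python
import torch
import torch.nn.functional as F

def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM map between two images (3x3 average-pooled statistics)."""
    mu_a, mu_b = F.avg_pool2d(a, 3, 1, 1), F.avg_pool2d(b, 3, 1, 1)
    var_a = F.avg_pool2d(a * a, 3, 1, 1) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, 3, 1, 1) - mu_b ** 2
    cov = F.avg_pool2d(a * b, 3, 1, 1) - mu_a * mu_b
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return torch.clamp(ssim, 0, 1).mean(1, keepdim=True)

def masked_appearance_loss(left_rgb, recon_left, right_rgb, alpha=0.85):
    """Appearance matching with monodepth2-style automasking of static pixels."""
    def photometric(a, b):
        l1 = (a - b).abs().mean(1, keepdim=True)
        return alpha * (1 - ssim_map(a, b)) / 2 + (1 - alpha) * l1

    loss_recon = photometric(left_rgb, recon_left)     # uses the warped right image
    loss_identity = photometric(left_rgb, right_rgb)   # un-warped right image

    # Automask M: keep only pixels where warping improves over the raw right image.
    mask = (loss_identity > loss_recon).float()
    return (loss_recon * mask).mean()
```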
Disparity Smoothness Loss As suggested in monodepth [6], the disparity smoothness loss penalizes depth discontinuities in textureless, low-image-gradient regions. We apply this smoothness term to our constraints:
$$L_{ds} = |\delta_x D_R|\, e^{-|\delta_x R_L|} + |\delta_y D_R|\, e^{-|\delta_y R_L|}$$
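A standard monodepth-style implementation of this edge-aware smoothness term is sketched below; whether the disparity is mean-normalized beforehand is not specified in the text and is omitted here.

```python
import torch

def disparity_smoothness_loss(disp, image):
    """Edge-aware smoothness: penalize disparity gradients except where the image
    itself has strong gradients. disp: (B, 1, H, W); image: (B, 3, H, W)."""
    grad_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    grad_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()

    grad_img_x = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_img_y = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)

    return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
           (grad_disp_y * torch.exp(-grad_img_y)).mean()
```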
The total loss to learn our proposed Self-Guided Framework is as follows:
$$L_{SGF} = L_{SG}(D_R, D_T) + L_{IM}(R_L, \hat{R}_L) + L_{ap}(R_L, R_R) + 0.001\, L_{ds}$$
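Putting the pieces together, the total objective can be composed as in the sketch below; the four loss callables correspond to the earlier sketches, and their names are illustrative.

```python
def total_loss(d_rgb, d_thermal, left_rgb, recon_left, right_rgb,
               self_guided, image_matching, appearance, smoothness):
    """L_SGF: sum of the Self-Guided, Image Matching, appearance, and smoothness terms."""
    return (self_guided(d_rgb, d_thermal)
            + image_matching(left_rgb, recon_left)
            + appearance(left_rgb, recon_left, right_rgb)
            + 0.001 * smoothness(d_rgb, left_rgb))
```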

4. Experiments and Results

4.1. Dataset

We conducted experiments with our proposed method on the KAIST Multispectral Stereo Dataset [15]. This dataset provides calibrated RGB stereo pairs, thermal images co-aligned with the left RGB view, and 3D measurements. Therefore, it is widely used in visual perception tasks such as depth estimation, color estimation, and visual localization. The KAIST Multispectral Dataset also focuses on real-world driving conditions, such as a university campus, residential areas, urban areas, and suburbs during the day and night. The dataset consists of 7383 images per camera, of which 4534 are used for the training set and 1583 for the test set.

4.2. Implementation Details

We implemented our proposed framework in PyTorch, with all models trained on a single Titan 3090 GPU. We used the AdamW optimizer [37] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and we initialized our depth networks with weights pretrained for KITTI [38] depth estimation. In addition, our proposed model was trained for 20 epochs with a batch size of 4, and the learning rates of the depth networks were set to $1 \times 10^{-4}$.
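For concreteness, the stated optimizer settings correspond to the following PyTorch setup; the small network here is a hypothetical placeholder, not one of the depth architectures used in the paper.

```python
import torch
import torch.nn as nn

# Placeholder network; only the optimizer settings reflect the stated configuration.
depth_net = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(32, 1, 3, padding=1))

optimizer = torch.optim.AdamW(depth_net.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Training then runs for 20 epochs with a batch size of 4 (data loading omitted).
```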

4.3. Evaluation Metrics

The ground-truth depth image was used for evaluation. Following the standard evaluation protocol [8], our model was evaluated with seven metrics for quantitative comparison: "Abs Rel", "Sqr Rel", "RMSE", "RMSE log", "$\delta < 1.25$", "$\delta < 1.25^2$", and "$\delta < 1.25^3$". The maximum range of the evaluation was 50 m.
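The seven metrics can be computed as in the sketch below; the minimum-depth clamp used to avoid taking the logarithm of zero is a common convention and an assumption here.

```python
import torch

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=50.0):
    """Standard monocular depth metrics, evaluated on valid pixels up to 50 m."""
    valid = (gt > min_depth) & (gt <= max_depth)
    pred = pred[valid].clamp(min=min_depth, max=max_depth)
    gt = gt[valid]

    abs_rel = ((pred - gt).abs() / gt).mean()
    sq_rel = ((pred - gt) ** 2 / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    rmse_log = torch.sqrt(((pred.log() - gt.log()) ** 2).mean())

    ratio = torch.max(pred / gt, gt / pred)
    d1 = (ratio < 1.25).float().mean()
    d2 = (ratio < 1.25 ** 2).float().mean()
    d3 = (ratio < 1.25 ** 3).float().mean()
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```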

4.4. Ablation Study

Effect of Self-Guided Loss We show the effect of each constraint in the proposed Self-Guided Loss in Table 1 to demonstrate its validity. To achieve this, we set the baseline as calculating the appearance loss using the depth map obtained from the thermal image. For all experimental results, DIFFNet [26] was used as the depth network. As shown in Table 1, combining the SSIM and L1 distance term $L_{SL}$ to train a thermal depth network with Pseudolabels results in a considerable performance gain over the baseline framework and is significantly better than the SIlog loss, which compares the depth values for each pixel. This demonstrates that utilizing Pseudolabels improves the accuracy of the thermal depth network and that an image evaluation metric, rather than a per-pixel depth value comparison, should be used to compare dense depth maps. The framework with perceptual loss $L_{VGG}$ also shows a performance improvement. This result demonstrates that the perceptual loss enhances performance by reinforcing high-level perceptual and semantic information via the Pseudolabel. Additionally, when our proposed Patch-NetVLAD Loss is used, the performance is the highest among all comparisons utilizing thermal infrared images as input, and the performance gap with the results using RGB images is the smallest. With these results, we argue that our contributions encourage the model to estimate depth from thermal infrared images with abundant detail and global semantic information.
Effect of Image Matching Loss We demonstrate that our proposed Image Matching Loss improves the performance of the Pseudolabel and provides a more accurate ground truth for thermal depth estimation, as shown in Table 2. All experiments in Table 2 apply the Self-Guided Loss $L_{SG}$ and use DIFFNet [26] as the depth network. In Table 2, adding the perceptual loss $L_{VGG}^{IM}$ strengthens the semantic and perceptual information of the Pseudolabel and improves the results. With the Patch-NetVLAD Loss $L_{PV}^{IM}$, the combination of the proposed constraints consistently boosts the accuracy of monocular depth estimation using the thermal infrared image. This indicates that the Patch-NetVLAD Loss can encourage the model to estimate abundant local detail and semantic information.

4.5. Depth Estimation Performance

To evaluate the generality of the proposed model, we conducted experiments with different depth networks: methods that show strong performance [25,26,27] on DDAD [10] and KITTI [38], as well as the most basic self-supervised depth estimation method [8]; the results are shown in Table 3. When our approach is applied to monodepth2, the most fundamental self-supervised monocular depth estimation method, a significant improvement is seen in all evaluation metrics. Furthermore, both CADepth [27], which uses an attention mechanism, and GBNet [25], with its progressive configuration of depth networks, improve the performance of thermal depth estimation. According to these results, our contributions generalize to different types of depth networks. Additionally, qualitative results are shown in Figure 3. Compared with the existing methodologies without our framework, our proposed methodology recovers abundant local detail and semantic information and estimates accurate depth information as a whole. As shown in Figure 4, when we apply our methodology at nighttime, more accurate and sharper results can be seen compared with other methodologies.

5. Conclusions

Here, we propose a novel Self-Guided Framework to estimate a depth map from thermal images. This framework transfers more accurate depth information to the thermal depth estimation network via a Pseudolabel generated from RGB images. To transfer the accuracy of the Pseudolabel, which is a dense depth map, to the thermal depth estimation network, we use a combination of SSIM+L1 distance and perceptual loss to compare both low-level and high-level pixels, strengthening semantic and perceptual information. We also propose a novel Patch-NetVLAD Loss, through which the depth map from the thermal image learns both the local and global descriptors of the Pseudolabel. In addition, we propose an Image Matching Loss that increases the performance of the Pseudolabel by improving its local detail and semantic information. As a result, the thermal depth network estimates an accurate depth map by learning from a more accurate ground-truth depth map. We demonstrate the generality of our proposed Self-Guided Framework, which works with various depth networks and provides more accurate depth estimation, even at nighttime. Our suggested approaches open new avenues for thermal image-based depth estimation.

Author Contributions

Conceptualization, Y.C.; methodology, Y.C. and D.H.; software, D.H.; validation, D.H.; formal analysis, D.H.; investigation, D.H.; resources, Y.C.; data curation, D.H.; writing—original draft preparation, D.H.; writing—review and editing, D.H. and Y.C.; visualization, D.H.; supervision, Y.C.; project administration, Y.C.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2020M3F6A1109603, NRF-2020R1F1A1076987).

Data Availability Statement

The datasets generated for this study are accessible upon request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. Nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
  2. Ibáñez, M.B.; Delgado-Kloos, C. Augmented reality for STEM learning: A systematic review. Comput. Educ. 2018, 123, 109–123.
  3. Bastug, E.; Bennis, M.; Médard, M.; Debbah, M. Toward interconnected virtual reality: Opportunities, challenges, and enablers. IEEE Commun. Mag. 2017, 55, 110–117.
  4. Bhat, S.F.; Alhashim, I.; Wonka, P. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021.
  5. Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2019, arXiv:1907.10326.
  6. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  7. Gonzalez, J.L.; Kim, M. PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation With Neural Positional Encoding and Distilled Matting Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021.
  8. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  9. Song, X.; Li, W.; Zhou, D.; Dai, Y.; Fang, J.; Li, H.; Zhang, L. MLDA-Net: Multi-level dual attention-based network for self-supervised monocular depth estimation. IEEE Trans. Image Process. 2021, 30, 4691–4705.
  10. Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
  11. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
  12. Kim, J.; Kim, H.; Kim, T.; Kim, N.; Choi, Y. MLPD: Multi-Label Pedestrian Detector in Multispectral Domain. IEEE Robot. Autom. Lett. 2021, 6, 7846–7853.
  13. Kim, Y.H.; Shin, U.; Park, J.; Kweon, I.S. MS-UDA: Multi-spectral unsupervised domain adaptation for thermal image semantic segmentation. IEEE Robot. Autom. Lett. 2021, 6, 6497–6504.
  14. Han, D.; Hwang, Y.; Kim, N.; Choi, Y. Multispectral Domain Invariant Image for Retrieval-based Place Recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Virtual, 31 May–31 August 2020.
  15. Kim, N.; Choi, Y.; Hwang, S.; Kweon, I.S. Multispectral transfer network: Unsupervised depth estimation for all-day vision. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018.
  16. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. arXiv 2014, arXiv:1406.2283.
  17. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  18. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 694–711.
  19. Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021.
  20. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
  21. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
  22. Johnston, A.; Carneiro, G. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 31 May–31 August 2020.
  23. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  24. Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  25. Han, D.; Choi, Y. GBNet: Gradient Boosting Network for Monocular Depth Estimation. In Proceedings of the International Conference on Control, Automation and Systems (ICCAS), Jeju, Korea, 12–15 October 2021.
  26. Zhou, H.; Greenwood, D.; Taylor, S. Self-Supervised Monocular Depth Estimation with Internal Feature Fusion. In Proceedings of the British Machine Vision Conference (BMVC), Online, 22–25 November 2021.
  27. Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-Wise Attention-Based Network for Self-Supervised Monocular Depth Estimation. In Proceedings of the British Machine Vision Conference (BMVC), Online, 22–25 November 2021.
  28. Dosovitskiy, A.; Brox, T. Generating images with perceptual similarity metrics based on deep networks. arXiv 2016, arXiv:1602.02644.
  29. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017.
  30. Choi, Y.; Kim, N.; Hwang, S.; Kweon, I.S. Thermal image enhancement using convolutional neural network. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 9–14 October 2016.
  31. Ye, M.; Lan, X.; Li, J.; Yuen, P. Hierarchical discriminative learning for visible thermal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018.
  32. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  33. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  35. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  36. Torii, A.; Sivic, J.; Pajdla, T.; Okutomi, M. Visual place recognition with repetitive structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013.
  37. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
  38. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
Figure 1. The overview of the proposed framework. (a) Baseline Framework [15]. (b) Our proposed Self-Guided Framework.
Figure 2. The configuration of our proposed Patch-NetVLAD Loss.
Figure 3. Qualitative results on the KAIST Multispectral Dataset [15]. GT indicates the ground truth of the depth map.
Figure 4. Qualitative Results for Nighttime in the KAIST Multispectral Dataset [15]. (R) and (T) indicate that the RGB image and thermal image are used as inputs, respectively.
Table 1. Ablation Study of Self-Guided Loss. T and R denote the thermal image and RGB image, respectively. For all reported metrics, lower is better. Check marks indicate which loss terms are applied.

| SIlog | $L_{SL}$ | $L_{VGG}$ | $L_{PV}$ | Input | Abs. Rel | Sqr. Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|---|
| - | - | - | - | T | 0.098 | 0.525 | 3.439 | 0.135 |
| ✓ | - | - | - | T | 0.099 | 0.512 | 3.454 | 0.136 |
| - | ✓ | - | - | T | 0.096 | 0.482 | 3.336 | 0.133 |
| - | ✓ | ✓ | - | T | 0.092 | 0.465 | 3.361 | 0.132 |
| - | ✓ | ✓ | ✓ | T | 0.089 | 0.452 | 3.299 | 0.130 |
| - | - | - | - | R | 0.081 | 0.373 | 2.966 | 0.120 |
Table 2. Ablation Study of Image Matching Loss. T and R denote the thermal image and RGB image, respectively. For all reported metrics, lower is better. Check marks indicate which loss terms are applied.

| $L_{AP}$ | $L_{VGG}^{IM}$ | $L_{PV}^{IM}$ | Input | Abs. Rel | Sqr. Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|
| ✓ | - | - | T | 0.089 | 0.452 | 3.299 | 0.130 |
| ✓ | ✓ | - | T | 0.089 | 0.438 | 3.262 | 0.128 |
| ✓ | ✓ | ✓ | T | 0.086 | 0.431 | 3.261 | 0.127 |
| ✓ | - | - | R | 0.081 | 0.373 | 2.966 | 0.120 |
| ✓ | ✓ | - | R | 0.081 | 0.365 | 2.889 | 0.116 |
| ✓ | ✓ | ✓ | R | 0.079 | 0.350 | 2.877 | 0.116 |
Table 3. Quantitative results according to the depth network. SGF is our proposed Self-Guided Framework; a check mark indicates that it is applied. For Abs. Rel, Sqr. Rel, RMSE, and RMSE log, lower is better; for the $\delta$ metrics, higher is better.

| Method | SGF | Abs. Rel | Sqr. Rel | RMSE | RMSE log | $\delta < 1.25$ | $\delta < 1.25^2$ | $\delta < 1.25^3$ |
|---|---|---|---|---|---|---|---|---|
| monodepth2 [8] | - | 0.107 | 0.577 | 3.619 | 0.148 | 0.885 | 0.975 | 0.994 |
| monodepth2 [8] | ✓ | 0.098 | 0.515 | 3.529 | 0.145 | 0.888 | 0.977 | 0.994 |
| CADepth [27] | - | 0.109 | 0.720 | 4.277 | 0.201 | 0.882 | 0.970 | 0.989 |
| CADepth [27] | ✓ | 0.105 | 0.581 | 3.782 | 0.153 | 0.883 | 0.973 | 0.993 |
| GBNet [25] | - | 0.095 | 0.483 | 3.430 | 0.135 | 0.911 | 0.980 | 0.995 |
| GBNet [25] | ✓ | 0.091 | 0.442 | 3.256 | 0.128 | 0.916 | 0.984 | 0.996 |
| DIFFNet [26] | - | 0.098 | 0.525 | 3.439 | 0.135 | 0.909 | 0.980 | 0.994 |
| DIFFNet [26] | ✓ | 0.086 | 0.435 | 3.284 | 0.129 | 0.912 | 0.983 | 0.995 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
