Article

Enhancing Infrared Optical Flow Network Computation through RGB-IR Cross-Modal Image Generation

School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou 350108, China
*
Author to whom correspondence should be addressed.
Sensors 2024, 24(5), 1615; https://doi.org/10.3390/s24051615
Submission received: 13 November 2023 / Revised: 22 January 2024 / Accepted: 22 February 2024 / Published: 1 March 2024
(This article belongs to the Section Sensing and Imaging)

Abstract
Because capturing real optical flow is difficult, no existing work has captured real optical flow for infrared (IR) images or produced an optical flow dataset based on IR images, which confines research on and applications of deep learning-based optical flow computation to the RGB image domain. In this paper, we therefore propose a method for producing an optical flow dataset of IR images. We use an RGB-IR cross-modal image transformation network to convert existing RGB optical flow datasets in a principled way. The RGB-IR cross-modal transformation is implemented on the basis of an improved Pix2Pix, and in the experiments the network is validated and evaluated on the RGB-IR aligned bimodal dataset M3FD. RGB-IR cross-modal transformation is then applied to the existing RGB optical flow dataset KITTI, and the optical flow computation network is trained on the IR images generated by the transformation. Finally, the results of the optical flow computation network before and after this training are analyzed on the RGB-IR aligned bimodal data.

1. Introduction

The optical flow algorithm leverages pixel changes across a sequence of images in the time domain to analyze and understand the motion of objects within the observed imaging plane. This technique has significant research implications for various fields, including visual navigation and motion inference [1]. The theoretical foundation of the optical flow algorithm is the brightness constancy assumption, which states that the brightness values of the pixels in two consecutive images involved in the optical flow computation remain constant over a short period of time. However, this assumption limits the general applicability of the optical flow method, as it presupposes a specific type of scene and may not hold in all situations. In recent years, some researchers have paid more attention to the use of optical flow algorithms in complex scenes, such as low-light environments; most of this work amounts to richer preprocessing of RGB images to obtain more semantic information [2]. The influence of environmental factors on the imaging of ordinary RGB cameras has also kept much optical flow research from practical application, but the emergence of infrared (IR) technology has, to some extent, been the key to breaking this deadlock. IR imaging is not affected by external light: an IR thermal camera directly captures the radiation emitted by objects to generate human-recognizable images with strong penetrating power [3]. Consequently, exploring optical flow computation specifically for IR images can significantly expand the practical applications of optical flow algorithms.
Image optical flow computation methods are mainly categorized into traditional algorithms (Lucas–Kanade [4], Horn–Schunck [5], etc.) and deep learning-based algorithms (PWC-Net [6], VCN [7], RAFT [8], etc.). Among them, deep learning-based optical flow algorithms have been shown to achieve higher accuracy on RGB images. Deep learning algorithms can extract richer image features by using stacked convolutional modules [9,10], which benefits optical flow estimation. In contrast, current experiments [11,12,13] on optical flow computation for IR images still mainly use traditional algorithms, or cannot independently use and validate deep learning optical flow algorithms [14]. The reason is that fitting the network parameters of deep learning algorithms requires an optical flow dataset with ground-truth flow, which is costly to produce. Most optical flow datasets (FlyingChairs [15], etc.) are generated with synthetic techniques such as Photoshop compositing and are not derived from real scene acquisition. Only one dataset, KITTI [16], was created by combining RGB cameras with multiple sensors to capture real scene information. Even in the RGB domain, existing optical flow datasets are extremely limited, and creating an optical flow dataset in the IR domain would undoubtedly be an even larger undertaking. This directly hampers the application of deep learning-based optical flow computation to IR images.
When a scene-specific dataset is unavailable, the first consideration for training an optical flow network is often an unsupervised approach [17]. This approach does not require ground-truth optical flow and offers greater flexibility for realistic, complex scenarios. Although promising, its overall accuracy is still inferior to that of supervised optical flow networks [18].
Considering the previously mentioned challenges and practical constraints, we propose a new approach to solving the parameter-fitting problem of optical flow computation networks in real IR scenes. As shown in Figure 1, we utilize a deep learning-based image transformation algorithm to generate IR images across modalities from existing RGB optical flow datasets. Each sample in an existing optical flow dataset consists of two RGB images and one optical flow map. As shown in Figure 2, we use the IR images obtained by converting the two RGB images, together with the optical flow map, to train the optical flow computation network. This improves the network's optical flow accuracy in IR scenes.
The main contributions of this work can be summarized as follows:
(1)
Achieving cross-modal image transformation between RGB and IR. Drawing upon research on style transfer and image-to-image networks, this paper proposes a redesigned RGB-IR cross-modal image transformation network.
(2)
Realization of fine-tuning of optical flow computation networks for IR scenes. In this paper, all the RGB images in the KITTI optical flow dataset are converted to IR images. The converted IR images are then used to train the optical flow computation network, enhancing its performance specifically for IR scenes.

2. Related Work

Currently, there are still relatively few optical flow studies based on the IR domain, especially with deep learning algorithms. One of the most important reasons is the lack of optical flow datasets and the difficulty of producing them. However, the lack of IR datasets is a problem not only in optical flow but also in IR person re-identification, from which some research tools are worth borrowing. As described in AlignGAN [19], RGB-IR cross-modal image transformation can be performed with a GAN. Simply put, this method uses an algorithm to convert real RGB images into fake IR images and exploits high-quality IR image generation to extend the dataset.
Our rationale is the same. The slight difference is that the experiments in AlignGAN center on RGB-IR cross-modal pedestrian re-identification, with experiments focused on the SYSU-MM01 dataset [20], whereas our experiments focus on optical flow computation. There is currently one available real RGB optical flow dataset, KITTI, which is dominated by real street scenes. In the optical flow computation experiments, KITTI is the labeled dataset, but in the RGB-IR cross-modal transformation experiments, KITTI is the unlabeled dataset. Two kinds of operations are commonly performed on unlabeled datasets. One is to train an unsupervised network such as CycleGAN [21] to process the images. However, this method tends to be too random, and the generated images are often unsatisfactory or even severely distorted. The other is to supervise training with a similar labeled dataset and then rely on the strong generalization ability of deep learning to transform the unlabeled dataset. This method is easier to train and produces better images because it has an exact convergence target, but its performance on unlabeled datasets is more of a test of the network's own generalization ability.
After extensive observation, we found that the M3FD [22] dataset provided by the Dalian University of Technology has many similarities with the KITTI dataset in terms of scenes and is based on a wider range of real street scenes. M3FD is an aligned RGB-IR bimodal dataset. Therefore, we propose to use M3FD for supervised training of the image transformation network to obtain more accurate, higher-quality modal transformations.
Supervised image transformation can be traced back to the classical Pix2Pix model [23]. That work reports that the U-Net [24] structure of the model is more favorable for image generation. In addition, a pixel consistency loss can approximate the generation of low-frequency image information, while high-frequency information is optimized with a generative adversarial loss (Patch GAN as the discriminator).

3. Proposed Method

In this paper, the research comprises both RGB-IR cross-modal image transformation and the fine-tuning of optical flow computation networks for IR scenes.
The RGB and IR images at moments t and t + 1 are shown in Figure 3, with the RGB and IR views aligned. By definition, optical flow describes the motion of objects in the imaging plane. Aligned RGB and IR images share the same imaged objects, ignoring special cases such as occlusion. As shown in Figure 3, the displacement of the car in the IR image is consistent with that in the RGB image. The motion of the imaged objects is therefore consistent across the two modalities, so the aligned RGB and IR images can be regarded as corresponding to the same real optical flow.
Therefore, it is feasible for us to convert RGB images from existing optical flow datasets into IR images and then train the optical flow computation network using the generated IR images and real optical flow maps of RGB.
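The following minimal PyTorch sketch illustrates this dataset-construction idea; the names rgb2ir_generator, frame1_rgb, frame2_rgb, and flow_gt are hypothetical placeholders, not identifiers from the released code.

```python
import torch

def build_ir_flow_sample(rgb2ir_generator, frame1_rgb, frame2_rgb, flow_gt):
    """Convert both RGB frames of an optical flow sample to the IR domain.

    Because the RGB and IR views are assumed to be pixel-aligned, the original
    ground-truth flow map can be reused unchanged for the generated IR pair.
    """
    with torch.no_grad():
        ir1 = rgb2ir_generator(frame1_rgb)   # fake IR frame at time t
        ir2 = rgb2ir_generator(frame2_rgb)   # fake IR frame at time t + 1
    # (ir1, ir2, flow_gt) then forms one supervised training sample for the
    # optical flow network (RAFT or FastFlowNet).
    return ir1, ir2, flow_gt
```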

3.1. RGB-IR Cross-Modal Transformation

As shown in Figure 4, the image transformation network in this paper adopts a Pix2Pix-style supervised training architecture, and we use the labeled dataset M3FD for supervised training. Supervised training enables the network to learn the relationship between the RGB and IR domains of the input images; this helps the network better understand the data and realize RGB-to-IR transformation. The network first encodes deep features of the RGB image using residual blocks with SE (squeeze-and-excitation) [25] modules and then recovers and reconstructs the image using transposed convolutions. There is shared information between the RGB and IR domains, but information barriers also exist. The SE module helps the network attend more to the information shared with the IR domain when encoding the RGB image, and it makes image generation more adaptive during decoding.
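A minimal sketch of the SE channel-attention block referred to above is given below, following the original squeeze-and-excitation design [25]; the reduction ratio of 16 is an assumption rather than a value reported in this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (simplified sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # re-weight channels
```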
In the design of the loss function, this paper uses pixel consistency, gradient consistency, and generative adversarial losses as joint constraints. The pixel consistency loss $L_{pixel}$ uses the L1 loss, which is insensitive to outliers and promotes smoother generated results.
$L_{pixel} = \lVert I_{real} - I_{fake} \rVert_1$
$L_{pixel}$ facilitates the generation of most of the low-frequency information in the image but is weak on the high-frequency part. Therefore, the gradient consistency loss $L_{grad}$ and the generative adversarial loss $L_{gan}$ are further used to enhance the recovery of high-frequency image information. The gradient consistency loss uses the Sobel operator to compute the gradients of the generated IR image and the real IR image separately and then compares the difference between the two.
$L_{grad} = \lVert \mathrm{Sobel}(I_{real}) - \mathrm{Sobel}(I_{fake}) \rVert_1$
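A possible PyTorch implementation of these two reconstruction terms is sketched below, assuming the images are (N, C, H, W) tensors; per-channel Sobel filtering is one reasonable reading of the gradient term.

```python
import torch
import torch.nn.functional as F

# Fixed Sobel kernels for the gradient-consistency term.
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def sobel_gradients(img: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradients of an (N, C, H, W) image."""
    c = img.shape[1]
    gx = F.conv2d(img, _SOBEL_X.to(img).repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, _SOBEL_Y.to(img).repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.cat([gx, gy], dim=1)

def pixel_loss(ir_real, ir_fake):
    return F.l1_loss(ir_fake, ir_real)                                    # L_pixel

def gradient_loss(ir_real, ir_fake):
    return F.l1_loss(sobel_gradients(ir_fake), sobel_gradients(ir_real))  # L_grad
```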
For $L_{gan}$, we use a Patch GAN to discriminate between the generated IR image and the real IR image, constraining the network to generate IR images that are closer to the real ones. The image transformation network and the Patch GAN are trained in alternating stages. In the Patch GAN training phase, the parameters of the image transformation network are not updated: the Patch GAN takes the generated IR image and the real IR image as input and is constrained to judge the generated IR image as "False" and the real IR image as "True". In the image transformation network training phase, the Patch GAN parameters are not updated: the Patch GAN takes the generated IR image as input and is asked to judge it as "True", which adjusts the parameters of the image transformation network. The "True"/"False" status of an image is abstracted in the mathematical model as "1"/"0": the more realistic the input image, the closer the Patch GAN output is to 1, and vice versa. The structure of the Patch GAN is shown in Figure 5, where an H × W image is compressed into a 32 × 32 feature map using multi-level convolution with residual blocks. Each point on the feature map corresponds to a region of the original image.
$L_{discriminator} = \lVert 1 - \mathrm{Discriminator}(I_{real}) \rVert + \lVert \mathrm{Discriminator}(I_{fake}) \rVert$
$L_{gan} = \lVert 1 - \mathrm{Discriminator}(I_{fake}) \rVert$
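The staged adversarial training described above could be organized as in the following sketch, which alternates a Patch GAN update with a generator update. The loss weight lambda_pix is a placeholder (the paper does not report its weighting), and the L1-to-target form mirrors the reconstructed equations.

```python
import torch
import torch.nn.functional as F

def train_step(generator, patch_disc, g_opt, d_opt, rgb, ir_real, lambda_pix=100.0):
    """One alternating update: Patch GAN phase, then generator phase."""
    # --- Patch GAN phase: the image transformation network is frozen ---
    with torch.no_grad():
        ir_fake = generator(rgb)
    d_real = patch_disc(ir_real)                        # patch-wise scores, target "True" (1)
    d_fake = patch_disc(ir_fake)                        # target "False" (0)
    d_loss = F.l1_loss(d_real, torch.ones_like(d_real)) + \
             F.l1_loss(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- Generator phase: the Patch GAN is frozen ---
    ir_fake = generator(rgb)
    d_fake = patch_disc(ir_fake)
    adv = F.l1_loss(d_fake, torch.ones_like(d_fake))    # push generated patches toward "True"
    pix = F.l1_loss(ir_fake, ir_real)                   # L_pixel; the Sobel term L_grad
                                                        # would be added here analogously
    g_loss = adv + lambda_pix * pix
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```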
On both the image transformation network and Patch GAN, we use the residual block, whose specific structure is shown in Figure 6. We used EqualConv2d and FusedLeakyReLU to construct the residual block. EqualConv2d keeps the learning rate relatively consistent between different layers of the network, which helps it to better maintain gradient stability during parameter tuning and mitigate problems such as gradient explosion or gradient vanishing. FusedLeakyReLU combines the activation function and the normalization step to reduce the memory footprint and improve the training efficiency of the network.
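The residual block could be sketched as follows; EqualConv2d is re-implemented here in simplified form (runtime weight scaling by 1/sqrt(fan-in)), and FusedLeakyReLU is approximated by a LeakyReLU followed by its usual sqrt(2) gain, so this is an approximation of the layers named above rather than the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EqualConv2d(nn.Module):
    """Convolution with equalized learning rate (runtime weight scaling)."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.scale = 1.0 / math.sqrt(in_ch * kernel_size * kernel_size)
        self.padding = padding

    def forward(self, x):
        return F.conv2d(x, self.weight * self.scale, self.bias, padding=self.padding)

class ResBlock(nn.Module):
    """Residual block built from equalized convolutions and LeakyReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = EqualConv2d(channels, channels, 3, padding=1)
        self.conv2 = EqualConv2d(channels, channels, 3, padding=1)

    def _act(self, x):
        # stand-in for FusedLeakyReLU: activation followed by a sqrt(2) gain
        return F.leaky_relu(x, 0.2) * math.sqrt(2)

    def forward(self, x):
        out = self._act(self.conv1(x))
        out = self.conv2(out)
        return (out + x) / math.sqrt(2)   # scaled skip connection
```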
In addition, for the task of transforming unlabeled RGB images, we fine-tune the network parameters using an edge consistency loss $L_{edge}$ and a perceptual consistency loss $L_{percept}$. First, we use the labeled RGB-IR aligned bimodal dataset to obtain an RGB-IR image transformation network via supervised training. In theory, the well-trained network should generalize to unlabeled RGB images and generate realistic IR images. In practice, the network underfits, to a certain degree, "things" it has never seen before. Although it can map most pixels from the RGB to the IR domain, the remaining pixels that fail to map tend to produce random noise in the generated image, which distorts it. We therefore examine aligned RGB-IR images in an attempt to fine-tune the network by capturing information shared between the RGB and IR domains. In ChipGAN [26], the authors used an edge consistency loss to maintain structural similarity before and after image transformation, which inspired our work. Figure 7 shows the multi-level edge features of aligned RGB and IR images. After extensive analysis, we found that the edge maps of the RGB and IR images share structural features that are more similar than the original images themselves. Therefore, we extract multi-level edges of the images using the pre-trained HED network and then construct the edge similarity loss $L_{edge}$.
$L_{edge} = \lVert \mathrm{HED}(I_{rgb}) - \mathrm{HED}(I_{ir\_fake}) \rVert$
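A sketch of the edge consistency term is shown below; hed_model stands for a frozen, pre-trained HED network that is assumed to return a list of side-output edge maps.

```python
import torch
import torch.nn.functional as F

def edge_consistency_loss(hed_model, rgb, ir_fake):
    """L_edge: compare multi-level HED edge maps of the source RGB image and
    the generated IR image."""
    with torch.no_grad():
        edges_rgb = hed_model(rgb)           # list of edge maps, no gradient needed
    edges_fake = hed_model(ir_fake)          # gradients flow back into the generator
    loss = 0.0
    for e_rgb, e_fake in zip(edges_rgb, edges_fake):
        loss = loss + F.l1_loss(e_fake, e_rgb)
    return loss / len(edges_rgb)
```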
As shown in Figure 8, since the edge features of the RGB and IR images are still not exactly aligned, we use the pre-trained Swin-T [27] model to construct a perceptual consistency loss $L_{percept}$ for image generation. We fine-tune the pre-trained Swin-T on the aligned RGB-IR dataset, as shown in Figure 9, using a triplet loss $L_{percept}$ to constrain Swin-T to extract features that are as well aligned as possible.
$L_{percept} = \mathrm{relu}\left( \lVert \mathrm{Swin}(I_{rgb1}) - \mathrm{Swin}(I_{ir}) \rVert - \lVert \mathrm{Swin}(I_{rgb2}) - \mathrm{Swin}(I_{ir}) \rVert + margin \right)$
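The triplet-style perceptual term could be implemented as sketched below, assuming swin_encoder maps an image to a feature vector, rgb_pos ($I_{rgb1}$) is the RGB image aligned with the IR anchor, rgb_neg ($I_{rgb2}$) is a mismatched RGB image, and the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_perceptual_loss(swin_encoder, rgb_pos, rgb_neg, ir_anchor, margin=1.0):
    """L_percept used to fine-tune Swin-T on the aligned RGB-IR pairs."""
    f_pos = swin_encoder(rgb_pos)                 # features of the aligned RGB image
    f_neg = swin_encoder(rgb_neg)                 # features of a mismatched RGB image
    f_ir = swin_encoder(ir_anchor)                # features of the IR anchor
    d_pos = torch.norm(f_pos - f_ir, dim=1)       # distance to the aligned pair
    d_neg = torch.norm(f_neg - f_ir, dim=1)       # distance to the mismatched pair
    return F.relu(d_pos - d_neg + margin).mean()
```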

3.2. Optical Flow Computation

For the RGB image optical flow computation, RAFT (recurrent all-pairs field transforms) has been proven to be the most accurate algorithm, while FastFlowNet [28] is the leading lightweight optical flow computation network. The parameter size of FastFlowNet is only 1.37 M, and its low parameter count and computational complexity make it possible to further apply the optical flow network in mobile devices. Therefore, this paper discusses the computation of IR images using both RAFT and FastFlowNet.

3.2.1. RAFT

As shown in Figure 10, RAFT consists of three main modules: contextual feature extraction, visual similarity computation, and an iterative update module.
Among them, the contextual feature extraction module and the visual similarity computation module are jointly involved in computing the visual similarity matrix. The shared-weight feature extraction module extracts feature information from both frames simultaneously, forming a pair of feature maps in which each channel corresponds to a specific feature. The inner product of these feature map pairs quantifies the visual similarity between the two frames, with values closer to 1 indicating greater feature similarity. If the feature maps involved in the visual similarity computation have size $H \times W$, the computed inner product has size $H \times W \times H \times W$ (a four-dimensional tensor). Because multi-scale similarity features are more sensitive to abrupt changes in motion, RAFT pools the last two dimensions of this tensor with pooling kernels of size 1, 2, 4, and 8 to construct a multi-scale visual similarity pyramid. Then, using the latest optical flow estimate from the iterative update module (initialized to 0), a neighborhood of radius r is looked up at the corresponding position in this pyramid to construct a visual similarity matrix of size $H \times W \times (2r+1) \times (2r+1)$.
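The all-pairs similarity volume and its pooled pyramid can be sketched as follows; this is a simplified reading of RAFT's correlation step, where repeated stride-2 average pooling realizes the 1, 2, 4, 8 scales.

```python
import torch
import torch.nn.functional as F

def correlation_pyramid(feat1, feat2, num_levels=4):
    """All-pairs visual similarity volume and its pooled pyramid.

    feat1, feat2: (N, C, H, W) feature maps from the shared encoder. The full
    volume has shape (N, H*W, H*W); its last two dimensions are repeatedly
    average-pooled to obtain the multi-scale pyramid.
    """
    n, c, h, w = feat1.shape
    f1 = feat1.view(n, c, h * w)
    f2 = feat2.view(n, c, h * w)
    corr = torch.einsum('nci,ncj->nij', f1, f2) / c ** 0.5   # normalized inner products
    corr = corr.reshape(n * h * w, 1, h, w)                  # last two dims treated as an image
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid
```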
The iterative update module in RAFT employs a convolutional gated recurrent unit (Conv GRU), which is distinct from conventional GRU in that all fully connected modules are replaced by convolutional modules. Additionally, the parameters in the module are shared across each iteration of the Conv GRU, which significantly reduces the number of parameters required. This design enables RAFT to efficiently compute the optical flow results from coarse to fine iterations.
The gating logic of the Conv GRU is divided into four parts:
$z = \mathrm{sigmoid}(\mathrm{Conv}([x_t, h_{t-1}]))$ (5)
$r = \mathrm{sigmoid}(\mathrm{Conv}([x_t, h_{t-1}]))$ (6)
$\tilde{h} = \tanh(\mathrm{Conv}([x_t, r \odot h_{t-1}]))$ (7)
$h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}$ (8)
Equation (5) defines the update gate z. $x_t$ is the current input, containing the latest optical flow estimate, the visual similarity matrix, and the contextual feature information (obtained from a separate contextual feature extraction module). $h_{t-1}$ is the state passed down from the previous iteration. The sigmoid activation maps the data to the range (0, 1), which determines the retention weight of the state. Similarly, the gate r controls the reset proportion (Equation (6)). In Equation (7), the reset $h_{t-1}$ is concatenated with $x_t$ and passed through one convolution, and the tanh activation squashes the data into the range (−1, 1) to obtain the candidate state $\tilde{h}$. In Equation (8), according to the value of z, the current state $h_t$ selectively retains $h_{t-1}$ and incorporates the candidate $\tilde{h}$ to update the memory.
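A minimal ConvGRU cell following Equations (5)–(8) as written above is sketched below; 3 × 3 convolutions replace the fully connected layers, and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell implementing Equations (5)-(8)."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.conv_z = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.conv_r = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.conv_h = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.conv_z(hx))                                # update gate, Eq. (5)
        r = torch.sigmoid(self.conv_r(hx))                                # reset gate, Eq. (6)
        h_tilde = torch.tanh(self.conv_h(torch.cat([r * h, x], dim=1)))   # candidate, Eq. (7)
        return z * h + (1 - z) * h_tilde                                  # new state, Eq. (8)
```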

3.2.2. FastFlowNet

As shown in Figure 11, FastFlowNet follows the classical optical flow network PWC-Net pyramid multi-scale computational architecture, with extensive model optimization and parameter pruning in feature extraction, feature correlation computations, and optical flow decoding computations.
In the feature extraction process, a head enhanced pooling pyramid (HEPP) is introduced. It retains the convolutional extraction of PWC-Net for high-level features, but for low-level features, parameter-free pooling replaces convolution. This design choice removes two layers of convolutional parameters and their computational cost while still maintaining reasonable accuracy. The coarsest optical flow, flow5, is computed from the smallest feature layer, layer5; the features of layer4 are then warped according to flow5 to align the two frames, the residual optical flow of that layer (flow4) is computed for layer4, and so on, iteratively.
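The warping of layer-4 features by the up-sampled coarse flow can be sketched with grid_sample, as below; this is a generic backward-warping routine, not FastFlowNet's exact code.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a feature map (N, C, H, W) with an optical flow field (N, 2, H, W)."""
    n, _, h, w = feat.shape
    # pixel coordinate grid (requires PyTorch >= 1.10 for the indexing argument)
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing='ij')
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow   # absolute sample positions
    # normalize to [-1, 1] for grid_sample (x over width, y over height)
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)                      # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear', align_corners=True)
```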
The optical flow of each layer is obtained by decoding, for each point of the first image at that layer, a feature correlation cost volume built from feature vector dot products with points of the second image within a specific search radius; this cost volume is then decoded by stacked convolutional or pooling layers. Many experiments have shown that enlarging the search radius improves the accuracy of subsequent flow computation, but blindly increasing it leads to a significant increase in computational complexity. To address this issue, FastFlowNet introduces a center dense dilated correlation, a densely expanded search scheme that resembles dilated convolution: only the two outermost rings of the search region are sampled with cross-interval (strided) sampling, while the remaining rings are fully sampled. This constructs a much smaller cost volume and computational load than a full search of radius 4, while achieving a wider correlation perception range without adding excessive parameters or computation.
The pyramid extraction method of HEPP is a layer-by-layer down-sampling extraction method. In contrast, the optical flow needs to be continuously up-sampled in the iterative optical flow back computation from the bottom to the top layer. The design of the up-sampler directly impacts the accuracy of the final optical flow [29]. While complex up-samplers can yield highly accurate up-sampling results, they often overlook considerations related to model parameters and computational complexity. To address this challenge, FastFlowNet introduces a shuffle block decoder based on the ShuffleNet [30,31] lightweight convolutional network. This design ensures high-accuracy optical flow computation while significantly optimizing computational costs. By leveraging the efficiency of ShuffleNet, FastFlowNet strikes a balance between accuracy and computational complexity in the up-sampling process.

4. Experimental Results

4.1. Experimental Results of RGB-IR Cross-Modal Transition

In this paper, the RGB-IR cross-modal transformation experiments are realized based on the RGB-IR bimodal dataset M3FD. First, M3FD is divided into a training set and test set in a ratio of 2:1. In our experiments, the generation results of our RGB-IR cross-modal image transformation network are compared with those of real IR images, Pix2Pix, FDIT [32], and AlignGAN2 [33] (unnamed in the original article, recent results from the original AlignGAN team).
The results of the RGB-IR cross-modal image transformation experiments on the training set are shown in Figure 12 for each algorithm. Visually, the algorithms based on supervised training (ours and Pix2Pix) achieve better convergence on the training set. The unsupervised FDIT and AlignGAN2, on the other hand, clearly do not converge to comparable results within the same training period, especially FDIT. FDIT is based on the idea of separating the "content" and "style" of an image and then exchanging the "style". This idea is theoretically feasible, but in practical tests it proves difficult to achieve a clean separation between the "content" and "style" of an image, which leaves residual "style" in the "content". AlignGAN2, which is essentially an unsupervised algorithm based on CycleGAN, generates images with strong randomness and is conservative in image transformation; on close examination, its generation results still retain many elements of the original image.
The generated results of each algorithm (after training) on the test set are shown in Figure 13. Visually, the results of each algorithm on the test set degrade compared to the training results. The two supervised networks with better training results still have room for improvement when facing the generalization problem. In particular, although Pix2Pix appears to transform most pixels correctly, its outputs are rich in noise, making the images look less smooth and less well contoured. In contrast, ours performs perceptual and edge fine-tuning on the unlabeled dataset, and the generated results are visually better and closer to the real IR images.
Since the unlabeled dataset KITTI consists of daytime scenes, we mainly used the daytime portion of M3FD when training the RGB-IR cross-modal image transformation network, in order to make the labeled training data as similar as possible to the unlabeled data to be generalized. Nevertheless, we also examined the dark night scene portion of M3FD. The training results for the dark night scenes are shown in Figure 14; the supervised networks can still converge well for image generation in dark night scenes.
However, the results on the test set are not satisfactory. As shown in Figure 15, the visual difference between the RGB and IR images is greater for nighttime scenes than for daytime scenes because of bright lights or shadow occlusion. IR images of night scenes tend to contain many "things" that are not present in the RGB images, which may require additional algorithmic components, such as predicting these "things" before pixel transformation, to handle such prediction problems.
However, the RGB-IR cross-modal transformation capability demonstrated by the algorithm in daytime scenes is sufficient for us to process the KITTI dataset. The generation results of the algorithm in this paper on the KITTI dataset are shown in Figure 16. The KITTI dataset contains a total of 194 pairs of RGB images and 194 corresponding optical flow maps. We reviewed the converted IR images one-by-one and found that they visually approximated the real IR images.
On the basis of the visual comparison, this paper further compares the image generation results of the different networks using three evaluation metrics, namely peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and root mean square error (RMSE), with the real IR images as the benchmark. Since the optical flow dataset KITTI is not an RGB-IR bimodal dataset and its RGB images have no corresponding real IR images, the evaluation experiments are conducted only on the training and test sets of the M3FD split. Among the three metrics (Table 1), larger PSNR and SSIM values indicate that the quality and similarity of the generated IR images are closer to the real IR images, while smaller RMSE values indicate a smaller per-pixel error between the generated and real IR images. The metric results largely agree with the visual impression: FDIT and AlignGAN2, based on unsupervised training, perform poorly on all three metrics, whereas Pix2Pix and ours, based on supervised training, maintain a high level on the training set and perform significantly better than the unsupervised methods on the test set.
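For reference, the three metrics could be computed as in the following scikit-image sketch, assuming 8-bit grayscale images and RMSE on intensities normalized to [0, 1], which matches the magnitude of the values in Table 1 (the authors' exact settings are not stated).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(ir_real, ir_fake):
    """PSNR, SSIM, and RMSE between a real and a generated IR image.

    Both inputs are assumed to be uint8 grayscale arrays of identical size.
    """
    psnr = peak_signal_noise_ratio(ir_real, ir_fake, data_range=255)
    ssim = structural_similarity(ir_real, ir_fake, data_range=255)
    diff = ir_real.astype(np.float64) / 255.0 - ir_fake.astype(np.float64) / 255.0
    rmse = float(np.sqrt(np.mean(diff ** 2)))    # RMSE on [0, 1] intensities (assumption)
    return psnr, ssim, rmse
```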

4.2. Experimental Results of Optical Flow Computation

The training of RAFT and FastFlowNet is based on the RGB optical flow dataset KITTI (RGB KITTI) and on IR KITTI, which is generated by converting KITTI with the image transformation network. To verify the positive contribution of IR KITTI to the parameter fitting of the optical flow computation network in IR scenes, RAFT (or FastFlowNet) was trained once with RGB KITTI alone and once with RGB KITTI + IR KITTI, yielding two optical flow computation networks, labeled the RGB Network and the IR Network. Both networks then compute optical flow on real IR images, and the differences between their results are compared.
The conventional practice for verifying the computational accuracy of a model is to conduct validation experiments on a dataset containing ground truth. However, as mentioned above, no existing optical flow dataset has been produced by capturing IR image optical flow. Given this reality, we propose a novel evaluation method. Briefly, we exploit the alignment of the RGB and IR images in the M3FD dataset to verify the accuracy of IR image optical flow computation: we evaluate the results by comparing how closely the optical flow computed on the IR image approximates the flow computed on the corresponding RGB image. The RGB optical flow used as the benchmark is obtained via supervised training on several public optical flow datasets and is fine-tuned and verified on the real optical flow dataset; it therefore has a strong confidence level and can serve as a sound evaluation reference.
In addition, we use a reference color wheel (Figure 17) to visualize the optical flow results. The direction of the optical flow is encoded by color, while its magnitude is encoded by the pixel brightness (pixel value) of the corresponding point in the visualized image.
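A common HSV color-wheel visualization of this kind can be sketched as follows; the exact wheel and normalization used for Figure 17 may differ.

```python
import numpy as np
import cv2

def flow_to_color(flow):
    """Visualize a flow field (H, W, 2), float32: hue encodes direction,
    brightness encodes magnitude."""
    u, v = flow[..., 0], flow[..., 1]
    magnitude, angle = cv2.cartToPolar(u, v, angleInDegrees=True)
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (angle / 2).astype(np.uint8)      # OpenCV hue range is [0, 180)
    hsv[..., 1] = 255                               # full saturation
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```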
The results of the IR image optical flow computation experiments are shown in Figure 18 and Figure 19. Based on the alignment of the RGB and IR images (M3FD), the images shown in (b) of Figure 18 and Figure 19 are taken as the benchmarks for comparing the IR image optical flow results. The images shown in (d) are the optical flow results computed on the IR images by the network trained only on RGB images, and the images shown in (e) are the results computed by the network further trained with the generated IR images. The images in (d) and (e) are each compared with benchmark (b). The visualized flow from RAFT and FastFlowNet in (d) is less accurate than that in (e) in terms of flow direction. The results in (e) are also more accurate in terms of flow magnitude, and the intensity and distribution of the colors in the visualization are closer to those of benchmark (b). Applying an optical flow network trained only on RGB images directly to IR images does not maintain consistency and accuracy with the optical flow computed on RGB images; the parameters of such a network are clearly overfitted to the RGB domain. Further training with IR KITTI shifts the network parameters toward the IR domain, so the results of the IR-trained network are more consistent with the benchmark in the visualization.
Furthermore, following the standard evaluation criterion in the optical flow field, we calculated the endpoint error (EPE), with respect to the benchmark, of the network's optical flow results on the IR images before and after training with the generated IR images (Table 2); the EPE values correspond to the visualizations in (d) and (e), each measured against (b). The EPE of RAFT is 64.8766 (without IR KITTI) and 49.3069 (with IR KITTI), and the EPE of FastFlowNet is 8.6948 (without IR KITTI) and 7.5395 (with IR KITTI). Between RAFT and FastFlowNet, the difference in results is clearly due to the networks having different complexities and different image-processing capabilities. The more important observation, however, is that within the same network, the IR images generated by the RGB-IR cross-modal network substantially improve the accuracy of the network's optical flow computation on IR images, yielding smaller errors against the benchmark.
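The EPE against the RGB-flow benchmark reduces to the mean Euclidean distance between the two flow fields, as in this short sketch:

```python
import numpy as np

def endpoint_error(flow_pred, flow_ref):
    """Mean endpoint error (EPE) between a predicted flow field and the
    reference flow, both of shape (H, W, 2). Here the reference is the
    optical flow computed on the aligned RGB image, as in Table 2."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_ref, axis=-1)))
```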

5. Discussion

Given the great success of deep learning in optical flow computation for RGB images since 2015, it is natural to ask whether it can be applied to images of other modalities, such as IR images. Empirically, because of the inherent feature differences between RGB and IR images, the model inevitably needs to be fine-tuned for IR scenes when it is transferred to compute IR image optical flow.
Considering production cost, and to solve the problem of the IR optical flow dataset required for fine-tuning, this paper proposes using an RGB-IR cross-modal transformation network to extend the existing RGB optical flow dataset to IR scenes. We use the extended IR data to fine-tune the parameters of several optical flow computation networks, with clear results: the fine-tuned networks achieve smaller errors when computing IR image optical flow. It is worth noting that there is still an essential gap between the synthetically generated IR images and real IR images. Producing real IR optical flow datasets would provide better training and testing benchmarks for deep learning-based IR optical flow computation, which would be a valuable contribution to scientific research.

6. Conclusions

In this paper, we propose the use of an RGB-IR cross-modal image transformation network to extend an existing RGB optical flow dataset to IR scenes. We then use the generated IR images to fine-tune the parameters of optical flow computation networks to improve their optical flow accuracy on IR images. In addition, because real infrared optical flow datasets are lacking, this paper exploits the alignment of the RGB-IR bimodal dataset M3FD to evaluate the optical flow results of the networks on IR images. We tested the optical flow results of the networks on real IR images before and after adding the generated IR images for training; the EPE of the latter was reduced by 24% (RAFT) and 13.29% (FastFlowNet).
Although the theory and experiments in this paper have yielded impressive results, some shortcomings remain. The foremost remaining problem is the lack of real IR optical flow datasets. Where conditions and capabilities allow, collecting and producing IR optical flow datasets would further improve the training and validation of optical flow computation networks in IR scenes. Furthermore, the RGB-IR cross-modal transformation network in this paper requires more research and refinement to generate more realistic IR images, which would also further improve the optical flow computation capability of the network.

Author Contributions

Concept design, F.H. and W.H.; experiment, W.H.; writing—original draft preparation, F.H. and W.H.; writing—review and editing, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fuzhou University (2019T009, GXRC-18066); Department of Education, Fujian Province (JAT190005).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The RGB-IR cross-modal transition algorithm codes are available online at: https://github.com/ReggieBird/RGB2IR (accessed on 20 February 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fortun, D.; Bouthemy, P.; Kervrann, C. Optical flow modeling and computation: A survey. Comput. Vis. Image Underst. 2015, 134, 1–21. [Google Scholar] [CrossRef]
  2. Zheng, Y.; Zhang, M.; Lu, F. Optical flow in the dark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6749–6757. [Google Scholar]
  3. Kastberger, G.; Stachl, R. Infrared imaging technology and biological applications. Behav. Res. Methods Instrum. Comput. 2003, 35, 429–439. [Google Scholar] [CrossRef] [PubMed]
  4. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the IJCAI’81: 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679. [Google Scholar]
  5. Horn, B.K.; Schunck, B. Determining Optical Flow (Artificial Intelligence Laboratory); Massachusetts Institute of Technology: Cambridge, MA, USA, 1981; Volume 17, pp. 185–203. [Google Scholar]
  6. Sun, D.; Yang, X.; Liu, M.-Y.; Kautz, J. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943. [Google Scholar]
  7. Yang, G.; Ramanan, D. Volumetric correspondence networks for optical flow. In Proceedings of the NIPS’19: 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 794–805. [Google Scholar]
  8. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Cham, Switzerland, 2020; pp. 402–419. [Google Scholar]
  9. Zhai, M.; Xiang, X.; Lv, N.; Kong, X. Optical flow and scene flow estimation: A survey. Pattern Recognit. 2021, 114, 107861. [Google Scholar] [CrossRef]
  10. Shah, S.T.H.; Xiang, X. Traditional and modern strategies for optical flow: An investigation. SN Appl. Sci. 2021, 3, 289. [Google Scholar] [CrossRef]
  11. Xin, J.; Cao, X.; Xiao, H.; Liu, T.; Liu, R.; Xin, Y. Infrared Small Target Detection Based on Multiscale Kurtosis Map Fusion and Optical Flow Method. Sensors 2023, 23, 1660. [Google Scholar] [CrossRef] [PubMed]
  12. Jiménez-Pinto, J.; Torres-Torriti, M. Optical Flow and Driver’s Kinematics Analysis for State of Alert Sensing. Sensors 2013, 13, 4225–4257. [Google Scholar] [CrossRef] [PubMed]
  13. Shao, Y.; Li, W.; Chu, H.; Chang, Z.; Zhang, X.; Zhan, H. A multitask cascading cnn with multiscale infrared optical flow feature fusion-based abnormal crowd behavior monitoring uav. Sensors 2020, 20, 5550. [Google Scholar] [CrossRef]
  14. Guerrero-Rodriguez, J.-M.; Cifredo-Chacon, M.-A.; Cobos Sánchez, C.; Perez-Peña, F. Exploiting the PIR Sensor Analog Behavior as Thermoreceptor: Movement Direction Classification Based on Spiking Neurons. Sensors 2023, 23, 5816. [Google Scholar] [CrossRef] [PubMed]
  15. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; Van Der Smagt, P.; Cremers, D.; Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
  16. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070. [Google Scholar]
  17. Yin, X.-L.; Liang, D.-X.; Wang, L.; Xu, J.; Han, D.; Li, K.; Yang, Z.-Y.; Xing, J.-H.; Dong, J.-Z.; Ma, Z.-Y. Optical flow estimation of coronary angiography sequences based on semi-supervised learning. Comput. Biol. Med. 2022, 146, 105663. [Google Scholar] [CrossRef] [PubMed]
  18. Jonschkowski, R.; Stone, A.; Barron, J.T.; Gordon, A.; Konolige, K.; Angelova, A. What matters in unsupervised optical flow. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Cham, Switzerland, 2020; pp. 557–572. [Google Scholar]
  19. Wang, G.; Zhang, T.; Cheng, J.; Liu, S.; Yang, Y.; Hou, Z. RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3623–3632. [Google Scholar]
  20. Wu, A.; Zheng, W.-S.; Yu, H.-X.; Gong, S.; Lai, J. RGB-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5380–5389. [Google Scholar]
  21. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  22. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  23. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  24. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  26. He, B.; Gao, F.; Ma, D.; Shi, B.; Duan, L.-Y. Chipgan: A generative adversarial network for chinese ink wash painting style transfer. In Proceedings of the 26th ACM International Conference on Multimedia, Torino, Italy, 22–26 October 2018; pp. 1172–1180. [Google Scholar]
  27. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  28. Kong, L.; Shen, C.; Yang, J. Fastflownet: A lightweight network for fast optical flow estimation. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 10310–10316. [Google Scholar]
  29. Eldesokey, A.; Felsberg, M. Normalized convolution upsampling for refined optical flow estimation. arXiv 2021, arXiv:2102.06979. [Google Scholar]
  30. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  31. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  32. Cai, M.; Zhang, H.; Huang, H.; Geng, Q.; Li, Y.; Huang, G. Frequency domain image translation: More photo-realistic, better identity-preserving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 13930–13940. [Google Scholar]
  33. Wang, G.-A.; Zhang, T.; Yang, Y.; Cheng, J.; Chang, J.; Liang, X.; Hou, Z.-G. Cross-modality paired-images generation for RGB-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12144–12151. [Google Scholar]
Figure 1. Image (from KITTI) transformed from RGB domain to IR domain.
Figure 2. Generated IR images to train the optical flow computation network.
Figure 3. Aligned RGB-IR image optical flow is also aligned.
Figure 4. RGB-IR cross-modal transformation of an image.
Figure 5. Patch GAN architecture.
Figure 6. Residual block architecture.
Figure 7. Comparison of RGB-IR bimodal image results for edge detection. (a) RGB and IR images to be detected; (b–f) detection results at different scales, respectively; (g) detection results of fusing different scales.
Figure 8. Fine-tuning of unlabeled images.
Figure 9. Feature alignment fine-tuning using Swin-T.
Figure 10. Pipeline of RAFT [8].
Figure 11. Pipeline of FastFlowNet [28].
Figure 12. Comparison of RGB-converted IR image results of different networks on the training set: (a) RGB images; (b) real IR images; (c) Pix2Pix; (d) FDIT; (e) AlignGAN2; (f) ours.
Figure 13. Comparison of RGB-converted IR image results of different networks on the test set: (a) RGB images; (b) real IR images; (c) Pix2Pix; (d) FDIT; (e) AlignGAN2; (f) ours.
Figure 14. Comparison of RGB-converted IR image results of different networks on the training set: (a) RGB images; (b) real IR images; (c) Pix2Pix; (d) FDIT; (e) AlignGAN2; (f) ours.
Figure 15. Comparison of RGB-converted IR image results of different networks on the test set: (a) RGB images; (b) real IR images; (c) Pix2Pix; (d) FDIT; (e) AlignGAN2; (f) ours.
Figure 16. RGB-IR cross-modal transformation (KITTI, unlabeled): (a1–a3) RGB images; (b1–b3) generated IR images.
Figure 17. Benchmark of optical flow visualization. Different colors indicate different directions, and the intensity of each color represents the magnitude of the optical flow.
Figure 18. Computational results of optical flow computation networks (RAFT) trained on aligned RGB-IR with different training sets. (a) RGB image; (b) RGB Network calculates RGB image optical flow result; (c) IR image; (d) RGB Network calculates IR image optical flow result; (e) IR Network calculates IR image optical flow result.
Figure 19. Computational results of optical flow computation networks (FastFlowNet) trained on aligned RGB-IR using different training sets. (a) RGB image; (b) RGB Network calculates RGB image optical flow result; (c) IR image; (d) RGB Network calculates IR image optical flow result; (e) IR Network calculates IR image optical flow result.
Table 1. Comparison of RGB-converted IR image results of different networks on the training set, test set, and KITTI dataset.

| Method | PSNR (Training Set) | SSIM (Training Set) | RMSE (Training Set) | PSNR (Test Set) | SSIM (Test Set) | RMSE (Test Set) |
|---|---|---|---|---|---|---|
| Pix2Pix | 38.6729 | 0.9772 | 0.0118 | 21.6372 | 0.7048 | 0.1003 |
| FDIT | 9.0525 | 0.3992 | 0.3658 | 9.5175 | 0.4217 | 0.3495 |
| AlignGAN2 | 16.9659 | 0.6568 | 0.1457 | 14.5287 | 0.6030 | 0.1918 |
| Ours | 40.0976 | 0.9797 | 0.0102 | 20.9572 | 0.6915 | 0.1057 |
Table 2. Comparison of the optical flow computation results of the network for real IR images before and after training on hybrid generated IR images (based on the RGB-IR image alignment, assuming that the RGB-IR optical flow results are also aligned and using the optical flow computation results of the RGB image as a benchmark to calculate EPE, respectively).

| Network | RAFT (EPE) | FastFlowNet (EPE) |
|---|---|---|
| RGB Network (without IR KITTI) | 64.8766 | 8.6948 |
| IR Network (with IR KITTI) | 49.3069 | 7.5395 |
| EPE reduction | 24% | 13.29% |