Converting Optical Videos to Infrared Videos Using Attention GAN and Its Impact on Target Detection and Classification Performance

Abstract: To apply powerful deep-learning-based algorithms for object detection and classification in infrared videos, it is necessary to have more training data in order to build high-performance models. However, in many surveillance applications, one can have a lot more optical videos than infrared videos. This lack of IR video datasets can be mitigated if optical-to-infrared video conversion is possible. In this paper, we present a new approach for converting optical videos to infrared videos using deep learning. The basic idea is to focus on target areas using an attention generative adversarial network (attention GAN), which will preserve the fidelity of target areas. The approach does not require paired images. The performance of the proposed attention GAN has been demonstrated using objective and subjective evaluations. Most importantly, the impact of attention GAN has been demonstrated in improved target detection and classification performance using real-infrared videos.


Introduction
There are two groups of target detection algorithms for infrared videos. One group contains conventional algorithms that utilize supervised machine-learning algorithms. For instance, there are some conventional target tracking methods [1,2]. The second group of target detection and classification schemes uses deep-learning algorithms such as You Only Look Once (YOLO) for larger objects in short-range optical and infrared videos [3–15]. Training videos are required in these algorithms. Among those deep-learning algorithms, it is worth mentioning that some of them [3,4] use compressive measurements directly for target detection and classification. This means that no reconstruction of compressive measurements is needed, and hence, fast target detection and classification can be achieved. The algorithms in [5–14] require target locations to be known. All of the aforementioned applications require a lot of videos for training.
In practical applications, we may have a lot of optical videos but only a handful of infrared videos. Consequently, the performance of machine-learning algorithms for surveillance and reconnaissance operations is seriously affected. Since optical videos are abundant in the public domain, the objective of this research is to determine if one can convert optical videos to infrared videos so that the performance of the machine-learning algorithms using IR videos for surveillance and reconnaissance can be improved. In particular, we focus on applying recent developments in generative adversarial networks (GANs) for converting optical videos to mid-wave infrared (MWIR) videos. We developed a customized attention GAN, which performed better than state-of-the-art methods [16–18]. Moreover, we compared the three GAN-based models using actual Defense Systems Information Analysis Center (DSIAC) videos [19]. We observed that when combining the converted videos with the actual MWIR videos for training, we were able to improve the intersection over union (IoU) score of the target detection from 40.86% (without augmentation) to 61.2% (with augmentation) for the 2000 m-range videos. In contrast, the classification performance using ResNet was not as good as expected. We believe the root cause is the small target size in those videos. To mitigate this target size issue, we investigated the use of super-resolution videos to enhance the resolution of the target areas. We then observed quite significant improvements in ResNet classification performance.
Our contributions are as follows. First, we propose a new attention-based GAN to synthesize infrared videos from optical videos. Our approach does not require paired images. We were able to improve on cycle GAN [16], dual GAN [17], and CUTGAN [18]. Second, using many DSIAC videos, we demonstrated that target detection performance using YOLO can be significantly improved with data augmentation. Third, we demonstrated that the combination of data augmentation and video super-resolution can achieve good target classification performance using ResNet. Figure 1 shows the overall framework of our work, and our paper is organized as follows. Section 2 summarizes the related work. Section 3 describes our proposed model for optical-to-infrared video conversion. Section 4 summarizes the experimental results of converting optical images to infrared images. Both objective and subjective results are presented. Section 5 includes results where we incorporated the synthetic-infrared videos into the training of target detection and classification deep-learning models. In Section 6, we summarize the target classification results using a combination of video super-resolution and attention GAN. Section 7 includes some discussions on a future research direction. Finally, some remarks are included in Section 8.
Remote Sens. 2021, 13, 3257
Figure 1. Framework highlighting the main parts of our paper. (a) Framework for converting optical videos to infrared videos using our proposed attention GAN; (b) baseline (left) and proposed framework for target detection and classification (training data were augmented using converted infrared videos in our system); (c) baseline classification and proposed classification system (training data augmented using converted IR videos) with the incorporation of video super-resolution (VSR).
Several GAN-based models have been proposed for image-to-image translation [16,20–23]. For example, Isola et al. proposed Pix2Pix GAN for image-to-image translation between two domains, but it needs paired datasets for training [20]. After that, several GAN-based models were proposed to mitigate this limitation, including disco GAN [23], dual GAN [17], and cycle GAN [16]. Later, the attention mechanism was introduced to GANs for image conversion. In [24], the authors used ResNet-18 as a teacher network to train the discriminator of the GAN, where the teacher taught the discriminator where to focus on the generated image. In [25], researchers proposed a model with attention GAN for image-to-image translation. SAGAN, introduced in [26], used the self-attention mechanism for generating fake images.

Image Conversion between Visible and IR Domains
In the past few years, a few researchers have performed image translation between the visible and IR domains, including near-infrared (NIR) to visible [27–30], MWIR to grey-scale [31], LWIR to RGB [32,33], and visible to IR [34]. General GAN networks such as Pix2Pix GAN [19,33,35] were also customized for RGB-to-IR image generation and for generating infrared textures from visible images [36]. Moreover, in [37], the authors used a conditional GAN to generate the NIR spectral band from an RGB image, using a paired dataset for this conversion. In addition, cycle GAN [16,38–40] was also used for visible-to-IR image translation.

Video Super-Resolution
Video super-resolution (VSR) aims to enhance video resolution and improve subsequent processing performance. VSR is inherently more challenging than single image super-resolution (SISR) because it must also exploit relevant information in the temporal domain. Frame concatenation is the vanilla approach to retain temporal information for VSR [41,42]. Kappeler et al. [43] proposed a CNN-based VSR method in which they used a handcrafted optical flow method [44] for super-resolution. Later, Liu et al. [45] introduced a temporal aggregation method to address the dynamic motion problem. However, this method still requires concatenation of input frames, which negatively affects global optimization. Recurrent neural networks (RNNs) have already become promising for video captioning [46] and video summarization [47]. Huang et al. [48] utilized a bidirectional recurrent CNN for VSR, and further improvement was achieved by adding a motion compensation module and a convLSTM layer [49]. Sajjadi et al. [50] developed an improved VSR model by using a many-to-many RNN, which used the previous high-resolution estimates to improve the estimation for the next frame.

Architecture of the Proposed Model
Our proposed model is based on the architecture of cycle GAN, and Figure 2 shows the architecture of our model for visible-to-IR image conversion. There are two generators (G and F) and two discriminators (D_X and D_Y) in the model. Figure 3 shows the architecture of the generator, which uses nine residual blocks along with convolution layers. Figure 4 shows the structure of the discriminator, which is a patch-based discriminator introduced in [51] that we modified by following [25]. In [24], the authors used ResNet-18 [52] as a teacher network to generate attention maps to teach the discriminators where to focus. Inspired by [24], we use ResNet-18 as a teacher network in our model to teach the generators where to focus.
There are two types of attention GAN models in the literature for image-to-image translation: self-attention-based GAN models [25,26] and teacher-attention-based GAN models [24]. A self-attention mechanism uses the interactions among inputs to identify where the model should focus to produce output. Teacher-attention methods utilize a well-trained model to generate an attention map that indicates where to focus. The authors of [24] used the ResNet-18 model trained with the ImageNet dataset to generate an attention map to facilitate medical image augmentation. In our dataset, the objects of interest (different types of military vehicles) are typically very small in images, since the images were taken from a distance. An attention map generated by a self-attention mechanism will be distracted by other unrelated parts of the images. We utilized the well-trained ResNet-18 model and finetuned it with our dataset to classify the different types of military vehicles, which forces ResNet-18 to focus on the vehicles in the images. Our proposed model then used the finetuned ResNet-18 model as a teacher to generate an attention map for image-to-image translation.

Objective Function
The proposed model has three loss functions: GAN loss, cycle-consistency loss, and attention loss. In our model, there are two generators, G and F. Given the two domains, visible and IR, G maps from visible to IR and F maps from IR to visible. Here, x is an image from the visible domain and y is an image from the IR domain. G(x) denotes an IR image generated from a visible image, and F(y) represents a visible image generated from an IR image. We have two discriminators, D_X and D_Y, where D_X discriminates x from F(y) and D_Y discriminates y from G(x).
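A GAN loss is defined as [21]:
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y}\left[\log D_Y(y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D_Y(G(x))\right)\right]$$
A cycle-consistency loss is defined over F(G(x)) and G(F(y)) as,
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y}\left[\lVert G(F(y)) - y \rVert_1\right]$$
In our model, an attention loss is defined between the attention map (generated by ResNet-18) of the input image and the output image of the generator as,
$$\mathcal{L}_{att}(G, F) = \mathbb{E}_{x}\left[\lVert AM(x) - AM(G(x)) \rVert_1\right] + \mathbb{E}_{y}\left[\lVert AM(y) - AM(F(y)) \rVert_1\right]$$
The total loss of our model with hyperparameters α, β, and γ is defined:
$$\mathcal{L}(G, F, D_X, D_Y) = \alpha\left[\mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X)\right] + \beta\,\mathcal{L}_{cyc}(G, F) + \gamma\,\mathcal{L}_{att}(G, F)$$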
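As a rough numerical illustration of how the three loss terms could be combined, the sketch below assumes mean absolute (L1-style) distances for the cycle-consistency and attention terms and a plain weighted sum for the total loss. The function names, the default weight values, and the choice of distance are assumptions made for illustration only; they are not taken from the paper or from any released implementation.

```python
import numpy as np


def l1_distance(a, b):
    """Mean absolute difference between two arrays of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))


def gan_loss(d_real, d_fake, eps=1e-12):
    """Standard GAN objective: E[log D(real)] + E[log(1 - D(fake))].

    d_real and d_fake are discriminator outputs in [0, 1].
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), 0.0, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))


def cycle_loss(x, x_rec, y, y_rec):
    """Cycle consistency: x_rec stands for F(G(x)) and y_rec for G(F(y))."""
    return l1_distance(x_rec, x) + l1_distance(y_rec, y)


def attention_loss(am_input, am_output):
    """Distance between the teacher's attention maps of input and output."""
    return l1_distance(am_input, am_output)


def total_loss(gan, cycle, attention, alpha=1.0, beta=10.0, gamma=1.0):
    """Weighted sum of the three terms; the default weights are hypothetical."""
    return alpha * gan + beta * cycle + gamma * attention
```

In an actual training loop these quantities would be computed on tensors with automatic differentiation; the NumPy version only shows how the terms relate to each other.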

DSIAC Data
We selected five vehicles in the DSIAC videos for detection and classification. There are optical and mid-wave infrared (MWIR) videos collected at distances ranging from 1000 m to 5000 m with 500 m increments. The five types of vehicles are shown in Figure 5. These videos are challenging for several reasons. First, the target sizes are small due to long distances. This is quite different from some benchmark datasets such as MOT Challenge [53], where the range is short and the targets are big. Second, the target orientations change drastically. Third, the illumination varies across videos. Fourth, the cameras move in some videos.
In this research, we focus mostly on MWIR nighttime videos because MWIR is more effective for surveillance during the night.
Here, we briefly highlight the background for optical and MWIR videos. The optical and MWIR videos have very different characteristics. Optical imagers operate at wavelengths between 0.4 and 0.8 microns, and MWIR imagers operate at wavelengths between 3 and 5 microns. Optical cameras require external illumination, whereas MWIR cameras do not because they are sensitive to heat radiation from objects. Consequently, target shadows, illumination, and hot air turbulence can affect the target detection performance in optical videos. MWIR imagery is dominated by the thermal component at night, and hence, it is a much better surveillance tool than visible imagers at night. Moreover, atmospheric obscurants cause much less scattering in the MWIR bands than in the optical band. As a result, MWIR cameras are tolerant of heat turbulence, smoke, dust, and fog.
We used DSIAC videos in our research for optical-to-MWIR nighttime video conversion, detection, and classification. The DSIAC dataset has five different types of vehicles: BMP2, BTR70, BRDM2, ZSU23-4, and T72. Optical and MWIR videos were taken at 1000 m, 1500 m, and 2000 m distances. The video frame rate is 7 frames/second. The frame sizes of optical videos and MWIR videos are 640 × 480 and 640 × 512, respectively. Each optical video has 1875 frames, and each MWIR video has 1800 frames. Each pixel is represented by 8 bits. Figures 6 and 7 show the frames of the videos in our dataset. Some MWIR videos in Figure 7 are very dark, and it is difficult to visualize the video contents. Later on, we will apply contrast enhancement techniques to enhance the video quality.

Training
We trained our proposed model with the videos taken from a distance of 1500 m and applied the trained model to generate MWIR videos from optical videos taken from 1000 m and 2000 m. The training was performed with unpaired frames of optical and MWIR videos of BTR70 and ZSU23-4 at 1500 m. Figure 8 shows some unpaired frames used for training. In total, we used 3600 unpaired frames for each domain in the training dataset. During training, we randomly cropped 256 × 256 patches, but full images were used during testing. Following [16], we used a batch size of 1 during training and an image buffer size of 50. The Adam optimizer [54] was used during training. We used the PyTorch framework for implementation, and all experiments were conducted on an NVIDIA GPU.
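The two data-handling details mentioned above, random 256 × 256 cropping and an image buffer of size 50, can be sketched as follows. The sketch assumes the buffer behaves like the history pool commonly used in cycle GAN implementations (with probability 0.5 a stored image is returned to the discriminator and replaced by the new one); the class and function names are hypothetical and are not taken from the authors' code.

```python
import random


class ImageBuffer:
    """History of generated images shown to the discriminator."""

    def __init__(self, size=50, seed=None):
        self.size = size
        self.images = []
        self.rng = random.Random(seed)

    def query(self, image):
        # Fill the buffer first; new images pass straight through.
        if len(self.images) < self.size:
            self.images.append(image)
            return image
        # Once full: half of the time swap the new image with a stored one.
        if self.rng.random() < 0.5:
            idx = self.rng.randrange(self.size)
            old = self.images[idx]
            self.images[idx] = image
            return old
        return image


def random_crop_box(height, width, patch=256, rng=random):
    """Return (top, left, bottom, right) of a random patch inside the frame."""
    top = rng.randint(0, height - patch)
    left = rng.randint(0, width - patch)
    return top, left, top + patch, left + patch
```

For a 640 × 480 optical frame, `random_crop_box(480, 640)` returns a box whose top-left corner lies anywhere that keeps the 256 × 256 patch inside the frame.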

Evaluation Metrics for Assessing the Conversion Performance
The inception score (IS) [55] is one of the widely used metrics for evaluating the quality and diversity of images generated by GANs. IS considers the entropy of the probability distribution that is generated by the pre-trained inception v3 model [56] on the generated images. A higher inception score indicates better quality of the generated images.
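To make the definition concrete, the sketch below computes the inception score from a list of class-probability vectors using its usual form, the exponential of the average KL divergence between each per-image distribution and the marginal distribution. The probability vectors here are hypothetical inputs; in practice they would come from the pre-trained inception v3 model, which is not included in this sketch.

```python
import math


def inception_score(probs, eps=1e-12):
    """probs: list of per-image class-probability vectors (each sums to 1)."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution over all generated images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # KL divergence between this image's prediction and the marginal.
        kl_sum += sum(
            pj * (math.log(pj + eps) - math.log(marginal[j] + eps))
            for j, pj in enumerate(p)
        )
    return math.exp(kl_sum / n)
```

When every image receives the same prediction the score is 1, and when the predictions are confident and spread evenly over k classes the score approaches k.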

Frechet Inception Distance (FID)
Frechet inception distance (FID) [57] was specially developed for evaluating the performance of a GAN. The FID score indicates the similarity between two collections of images. Consistency between FID score and human judgement has made the FID score a good indicator of the generated image's quality. Statistics of the real and fake images are considered for obtaining the FID score. When calculating FID, the Wasserstein-2 distance between the features of real and synthetic images is calculated. The inception model [56] generates the feature representations of the images for calculating FID. FID performs well in terms of robustness and discriminability. A lower FID score denotes more similarity between the two data distributions.
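Under the usual assumption that the features of each image set follow a Gaussian distribution, the Wasserstein-2 distance mentioned above has the closed form ||μ_r − μ_f||² + Tr(Σ_r + Σ_f − 2(Σ_r Σ_f)^{1/2}). The sketch below evaluates this expression with NumPy on arbitrary feature arrays; the feature arrays are hypothetical stand-ins for inception features, and the function name is an illustrative choice.

```python
import numpy as np


def frechet_distance(feat_real, feat_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_images, feature_dim) with feature_dim >= 2.
    """
    feat_real = np.asarray(feat_real, dtype=float)
    feat_fake = np.asarray(feat_fake, dtype=float)
    mu_r = feat_real.mean(axis=0)
    mu_f = feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    # Tr((cov_r cov_f)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_f, which are real and non-negative for
    # positive semi-definite covariances (clipping removes round-off).
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_r - mu_f) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_f) - 2.0 * tr_sqrt)
```

Two identical feature sets give a distance of approximately zero, and shifting one set by a constant vector adds the squared length of that vector.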

Kernel Inception Distance (KID)
Similar to FID, kernel inception distance (KID) [58] also indicates the quality of the images generated by a GAN relative to the real images. The KID score is the squared maximum mean discrepancy (MMD) between the inception representations of the real and fake images. The inception model is used to obtain those feature representations of the images. KID scores are consistent with human judgements when evaluating the quality of the synthetic images. A lower KID score denotes higher quality of the synthetic images generated by the GAN.
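A minimal sketch of the MMD computation behind KID is given below. It assumes the cubic polynomial kernel k(a, b) = (a·b/d + 1)³ that is commonly used for KID and the unbiased estimator of the squared MMD on a single pair of feature sets; published KID values are usually averaged over several subsets, which is omitted here. The feature arrays and function names are hypothetical.

```python
import numpy as np


def poly_kernel(a, b):
    """Cubic polynomial kernel (a.b / d + 1)^3 between rows of a and b."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3


def mmd2_unbiased(feat_real, feat_fake):
    """Unbiased estimate of the squared MMD between two feature sets."""
    x = np.asarray(feat_real, dtype=float)
    y = np.asarray(feat_fake, dtype=float)
    m, n = len(x), len(y)
    k_xx = poly_kernel(x, x)
    k_yy = poly_kernel(y, y)
    k_xy = poly_kernel(x, y)
    # Exclude the diagonal (self-similarity) terms within each set.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())
```

Because the estimator is unbiased, it can be slightly negative when the two sets come from the same distribution.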

Attention Maps
Figure 9 shows two representative attention maps generated by the teacher network (ResNet-18) during training of our attention GAN. It can be seen that the corresponding vehicle areas in the attention maps are brighter than other areas. This means that more emphasis will be placed in the vehicle areas during the training process. Consequently, the attention GAN will generate more accurate results near the vehicle areas than cycle GAN.

Figure 9. Representative attention maps generated by the teacher network (ResNet-18). Panels: input image (x); attention map (AM(x)); output image (G(x)); attention map of output image (AM(G(x))).

Qualitative Comparison
We compared our method with Cycle GAN [16], Dual GAN [17], and CUTGAN [18], which are state-of-the-art methods for unsupervised image-to-image translation. Both Cycle GAN and Dual GAN have two generators and two discriminators. On the other hand, CUTGAN uses one generator and one discriminator. They use unpaired datasets for training. All models were trained with the same dataset. Figure 10 shows results for visible-to-MWIR translation by different models. It is observed that results by Cycle GAN, Dual GAN, and CUTGAN contain visible artifacts, and the fine details of objects are not preserved. On the other hand, results by our model have much better visual quality, and the vehicles have been correctly translated to the IR domain. It should be noted that although the target areas are consistent, there are some artifacts in the background. We applied two post-processing steps (contrast enhancement and Gaussian filter) to the results, and Figure 11 shows the processed results.

Table 1 shows the IS, FID, and KID for different models. We can see that the proposed model outperformed Cycle GAN, Dual GAN, and CUTGAN in terms of IS. For FID and KID, the proposed model also won over the competing methods in most of the cases. Tables 2 and 3 ...
Figure 11. Generated MWIR images after post-processing.

Impact of Converted Videos on Target Detection and Classification Performance
A surveillance mission can be divided into two phases. The first phase is the training of the algorithms. We first need to build target detection and classification models. In our approach, we propose to use YOLO for target detection and ResNet for classification. To train both YOLO and ResNet, we create a training database by utilizing both real-infrared and synthetic-infrared videos. The synthetic-infrared videos are converted using our attention GAN. The second phase is the operational phase. We feed the testing IR videos from various ranges into YOLO for target detection. The target locations are fed into ResNet for classification. To enhance the classification performance, we propose to apply a VSR algorithm to increase the resolution of the input testing videos before feeding them into ResNet. It turns out that VSR does improve the overall performance of the target classification.

YOLO for Target Detection
Some conventional object trackers, such as those mentioned in [1], require initial bounding boxes to be placed manually on the objects in the first frame. This is a serious limitation because it involves human intervention. In contrast, YOLO and faster R-CNN do not require bounding boxes to be placed on objects in the first frame. Once trained, YOLO and faster R-CNN can detect objects in any frame. The YOLO tracker [59] is fast and demonstrates similar performance to the faster R-CNN [60]. The input image is resized to 448 × 448. There are 24 convolutional layers and two fully connected layers. The output is 7 × 7 × 30. We have used YOLOv2 because it is more accurate than YOLO version 1. The training of YOLO is quite simple. Images with ground-truth target locations are needed. The bounding box for each vehicle was manually determined using tools in MATLAB. For YOLO, the last layer of the deep-learning model was retrained. We did not change any of the activation functions. YOLO took approximately 2000 epochs to train.
YOLO also comes with a built-in classification module. However, based on our earlier evaluations, the classification accuracy using YOLO's built-in module is not good compared to ResNet [52].
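The 7 × 7 × 30 output size quoted above follows from the original YOLO layout, in which the image is divided into an S × S grid and each cell predicts B boxes (five numbers each: x, y, w, h, confidence) plus C class probabilities. The short sketch below reproduces that arithmetic under the original setting of S = 7, B = 2, and C = 20; these values describe the original YOLO configuration and are an assumption with respect to the retrained five-vehicle model in this paper. The function names are hypothetical.

```python
def yolo_v1_output_shape(grid=7, boxes_per_cell=2, num_classes=20):
    """Output tensor shape S x S x (B * 5 + C) of the original YOLO layout."""
    return (grid, grid, boxes_per_cell * 5 + num_classes)


def responsible_cell(cx, cy, image_size=448, grid=7):
    """Grid cell (row, col) containing a box center given in pixels."""
    cell = image_size / grid
    row = min(grid - 1, int(cy // cell))
    col = min(grid - 1, int(cx // cell))
    return row, col
```

With a 448 × 448 input each grid cell covers 64 × 64 pixels, so a target center at (100, 300) falls in row 4, column 1.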

ResNet for Target Classification
As mentioned earlier, YOLO's built-in classifier did not perform well, which is probably because we have limited training data. Moreover, we think that although YOLO is good for object detection, its built-in classifier is probably more suitable for inter-class (humans, vehicles, bikes, etc.) discrimination and not good for intra-class (e.g., BTR70 vs. BMP2) discrimination. The ResNet-18 model is an 18-layer convolutional neural network (CNN) that has the advantage of avoiding performance saturation and/or degradation when training deeper layers, which is a common problem among other CNN architectures. The ResNet-18 model avoids the performance saturation by implementing an identity shortcut connection, which skips one or more layers and learns the residual mapping of the layer rather than the original mapping.
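For reference, the identity shortcut described above is usually written as follows in the residual-learning literature. The symbols x (block input), y (block output), H (desired mapping), and F (residual mapping with weights W_i) are notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
% Desired mapping H(x) is re-expressed through a residual F(x) = H(x) - x,
% so the stacked layers only need to learn F, and the shortcut adds x back.
\begin{align}
  \mathcal{F}(x) &= \mathcal{H}(x) - x, \\
  y &= \mathcal{F}\bigl(x, \{W_i\}\bigr) + x .
\end{align}
```

When the optimal mapping is close to the identity, the layers only need to drive F toward zero, which is one common explanation for why deeper residual networks avoid the degradation problem.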
It is necessary to explain the relationship between YOLO and ResNet. YOLO was used to determine where, in each frame, the vehicles were located. YOLO generated bounding boxes for those vehicles, and the data were used to crop the vehicles from the image. The cropped vehicles would be fed into the ResNet-18 for classification, and classification results were generated. To be more specific, ResNet-18 is used directly after the bounding box information is obtained from YOLO.
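The hand-off from detection to classification described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the (x, y, w, h) box convention with a top-left origin, and the `detector` and `classifier` callables are all assumptions.

```python
def crop_box(frame, box):
    """Crop one bounding box from a single-channel frame.

    frame: 2-D list of pixel rows.
    box:   (x, y, w, h) with (x, y) the top-left corner in pixels (assumed convention).
    The box is clipped to the frame; None is returned if nothing remains.
    """
    x, y, w, h = box
    height = len(frame)
    width = len(frame[0]) if height else 0
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    if x1 <= x0 or y1 <= y0:
        return None
    return [row[x0:x1] for row in frame[y0:y1]]


def detect_then_classify(frame, detector, classifier):
    """Run the detector, crop each detected box, and classify each crop.

    detector(frame) -> list of boxes; classifier(crop) -> label.
    Both callables are hypothetical stand-ins for YOLO and ResNet-18.
    """
    results = []
    for box in detector(frame):
        crop = crop_box(frame, box)
        if crop is not None:
            results.append((box, classifier(crop)))
    return results
```

Keeping the two stages behind separate callables mirrors the text: the classifier only ever sees the cropped target, so either stage can be swapped without touching the other.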
Training of ResNet requires target patches. The targets are cropped from training videos. Mirror images are then created. We then perform data augmentation using scaling (larger and smaller), rotation (every 45 degrees), and illumination (brighter and dimmer) to create more training data. For each cropped target, we are able to create a dataset with 64 more images. For ResNet, the last layer of the deep-learning model was retrained. The ResNet model was trained until the validation score reached a steady-state value.
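The text does not state how the individual augmentations are combined. One combination that happens to be consistent with the count of 64 is mirror (2) × rotation every 45 degrees (8) × scale (2) × illumination (2). The sketch below only enumerates such a plan; the breakdown, the names, and the parameter labels are assumptions, and no image processing is performed.

```python
from itertools import product

# Assumed augmentation axes; only the operation types come from the text.
MIRROR = (False, True)
ROTATIONS_DEG = tuple(range(0, 360, 45))  # every 45 degrees gives 8 angles
SCALES = ("larger", "smaller")
ILLUMINATION = ("brighter", "dimmer")


def augmentation_plan():
    """Enumerate every combination of the assumed augmentation settings."""
    return [
        {"mirror": m, "rotation_deg": r, "scale": s, "illumination": i}
        for m, r, s, i in product(MIRROR, ROTATIONS_DEG, SCALES, ILLUMINATION)
    ]
```

Under these assumptions the plan contains 2 × 8 × 2 × 2 = 64 entries per cropped target.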

Performance Metrics for Assessing Target Detection and Classification Performance
The six performance metrics used to quantify the detection performance are: center location error (CLE) [1], distance precision at 10 pixels (DP@10) [1], estimates in ground truth (EinGT) [15], intersection over union (IoU) [15], average precision (AP), and percentage of frames with detection (% det.) [15]. These metrics have been widely used by researchers in the past. We briefly summarize CLE, DP, EinGT, IoU, and % det. below:
Center location error (CLE): It is the error between the center of the detected bounding box and the center of the ground-truth bounding box. Smaller means better. CLE is calculated by measuring the distance between the ground-truth center location $(C_{x,gt}, C_{y,gt})$ and the detected center location $(C_{x,est}, C_{y,est})$. Mathematically, CLE is given by
$$\mathrm{CLE} = \sqrt{\left(C_{x,est} - C_{x,gt}\right)^2 + \left(C_{y,est} - C_{y,gt}\right)^2}$$
Distance precision (DP): It is the percentage of frames where the centroids of the detected bounding boxes are within 10 pixels of the centroids of the ground-truth bounding boxes. Close to 1 or 100% indicates good results.
Estimates in ground truth (EinGT): It is the percentage of frames where the centroids of the detected bounding boxes are inside the ground-truth bounding boxes. It depends on the size of the bounding box and is simply a less strict version of the DP metric. Close to 1 or 100% indicates good results.
Intersection over union (IoU): It is the ratio of the intersection area to the union area of the estimated and ground-truth bounding boxes.
Percentage of frames with detection (% det.): This is the percentage of frames that have a detection. It is between 0 and 100%.
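As a minimal sketch of how these detection metrics could be computed for single-target frames, the Python functions below take boxes as (x, y, w, h) with (x, y) the top-left corner. The function names, the box convention, and the choice to average CLE and IoU and to normalize DP and EinGT over the frames that have a detection are assumptions made for illustration; the text does not specify them. AP is not included.

```python
import math


def center(box):
    """Center of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def center_location_error(est, gt):
    """Euclidean distance between the detected and ground-truth box centers."""
    (cx_e, cy_e), (cx_g, cy_g) = center(est), center(gt)
    return math.hypot(cx_e - cx_g, cy_e - cy_g)


def center_in_ground_truth(est, gt):
    """True if the detected box center lies inside the ground-truth box."""
    cx, cy = center(est)
    x, y, w, h = gt
    return x <= cx <= x + w and y <= cy <= y + h


def intersection_over_union(est, gt):
    """Intersection area divided by union area of two boxes."""
    ax, ay, aw, ah = est
    bx, by, bw, bh = gt
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def summarize(detections, ground_truth, dp_threshold=10.0):
    """Aggregate per-frame metrics; detections[i] is a box, or None if nothing was detected."""
    n_frames = len(ground_truth)
    pairs = [(d, g) for d, g in zip(detections, ground_truth) if d is not None]
    n_det = len(pairs)
    summary = {"pct_det": 100.0 * n_det / n_frames if n_frames else 0.0}
    if n_det == 0:
        summary.update({"CLE": None, "DP": None, "EinGT": None, "IoU": None})
        return summary
    cles = [center_location_error(d, g) for d, g in pairs]
    summary["CLE"] = sum(cles) / n_det
    summary["DP"] = sum(c <= dp_threshold for c in cles) / n_det
    summary["EinGT"] = sum(center_in_ground_truth(d, g) for d, g in pairs) / n_det
    summary["IoU"] = sum(intersection_over_union(d, g) for d, g in pairs) / n_det
    return summary
```

For example, a detection at (0, 0, 10, 10) against a ground-truth box at (3, 4, 10, 10) has centers (5, 5) and (8, 9), so its CLE is 5 pixels and it counts toward both DP@10 and EinGT.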
We used confusion matrices for evaluating vehicle classification performance using ResNet. From the confusion matrix, we can also evaluate overall accuracy (OA), average accuracy (AA), and the kappa coefficient.
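The three classification scores can be derived from a confusion matrix as sketched below. The sketch assumes that rows are true classes and columns are predicted classes, and it uses the standard Cohen's kappa definition; the function name and the row/column convention are assumptions, since the text does not state them.

```python
def classification_scores(cm):
    """Return (OA, AA, kappa) for a square confusion matrix.

    cm[i][j] = number of samples of true class i predicted as class j
    (assumed convention).
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(k)]
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    # Overall accuracy: fraction of all samples on the diagonal.
    oa = sum(diag) / n

    # Average accuracy: mean of per-class accuracies, skipping empty classes.
    per_class = [diag[i] / row_tot[i] for i in range(k) if row_tot[i] > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa: agreement beyond what the row/column marginals give by chance.
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    kappa = (oa - pe) / (1.0 - pe) if pe < 1.0 else 0.0
    return oa, aa, kappa
```

OA and AA differ whenever the classes are unbalanced, which is why both are reported: OA is dominated by the larger classes, whereas AA weights every class equally.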

Training and Testing Procedures
In the training, we used 1500 m original nighttime MWIR videos and attention GAN (aGAN)-converted videos from 1000 m, 1500 m, and 2000 m optical videos. Altogether, there are 20 videos to train the YOLO and ResNet models. In testing, we used 1000 m, 1500 m, and 2000 m videos.

Baseline Results Using Only 1500 m Infrared Videos for Training
Tables 4 and 5 summarize the baseline YOLO detection and ResNet classification results, respectively. Here, baseline means that the YOLO and ResNet models were trained using only the 1500 m infrared videos without any data augmentation using our attention GAN. These baseline metrics serve as the reference against which the results obtained with attention-GAN-converted videos are compared. Test results are reported at three distances: 1000 m, 1500 m, and 2000 m. Please note that the 1500 m testing results are only used as a reference, as the training data also used 1500 m videos. There is an obvious deterioration in accuracy as the vehicle distance moves away from 1500 m, the distance the model was trained on.
From Tables 4 and 5, each metric worsens as the test distance moves away from the trained 1500 m distance. This trend is seen across both detection and classification statistics, and the overall degradation in accuracy is quite severe. For example, for detection, the AP value, which measures the amount of overlap between the ground-truth and detected bounding boxes, halves with each increase of 500 m. Although only one tested distance is closer than 1500 m, the model seems to perform better when the target is closer than the trained distance than when it is farther away.
Table 4. Baseline YOLO detection results using only 1500 m infrared videos for training. The metrics are named as follows: center location error (CLE), distance precision (DP), estimates in ground truth (EinGT), intersection over union (IoU), average precision (AP), and detection percentage (% det.).

Results with Attention GAN Augmented Data
The focus here is on evaluating the target detection and classification models trained with the augmented data converted by attention GAN. The training data include the 1500 m infrared videos and the infrared videos converted by attention GAN from the 1000 m, 1500 m, and 2000 m optical videos. In the baseline models, we only used the 1500 m MWIR videos. We focused on testing infrared videos at 1000 m, 1500 m, and 2000 m because the target size is too small at longer ranges. Table 6 shows the YOLO detection metrics for each distance, while Table 7 shows the ResNet classification metrics and confusion matrices for each distance. Partly due to an anomaly in the BRDM2 CLE metric, there is a decrease in accuracy for 2000 m. However, most other metrics are at least slightly improved. The largest overall improvement occurs at the 2000 m distance, and the metric that improves the most is the detection percentage. The classification results do not show many differences. A possible reason is that the converted videos contain shape information about the vehicles, which helped the detection performance, but the conversion was imperfect and did not preserve the detailed textures of the vehicles. Therefore, the vehicle classification performance remains similar.
Here, we would like to compare the baseline results in Section 5.4.1 and the attention GAN results in this section. We focus only on the 2000 m case.
We first compare the YOLO detection results. From Tables 4 and 6, we can see that data augmentation using attention GAN improved the baseline YOLO performance in almost every metric.
In contrast, the ResNet results with attention GAN in Table 7 do not improve over the baseline ResNet results in Table 5. This is mainly because the attention-GAN-converted videos lack some detailed textures for the targets, and those additional synthetic videos in the training data actually interfered with the original videos. As a result, the ResNet model trained with attention GAN augmented data did not perform as well as the baseline ResNet.

Enhancement of Target Classification Using Super-Resolution Videos
As noted at the end of Section 5, the videos converted using attention GAN did not improve the ResNet classification performance in long-range videos. We think the reason is the small target size in the long-range videos. Because ResNet normalizes input images to a standard size of 448 × 448 and the DSIAC videos are 640 × 480, the target area becomes even smaller. The study in this section focuses on the use of video super-resolution (VSR) algorithms to enlarge the target area. For this reason, we only focus on the target area inside the bounding boxes, and we assume that YOLO has already detected the target. We would now like to see if we can improve the classification performance using super-resolution videos.

Vehicle Classification Architecture with Video Super-Resolution
For this investigation, we first cropped only the vehicle portion from each video frame. Then, we used a pre-trained video super-resolution (VSR) model to enhance the resolution of these cropped vehicle sub-images by up to 4×. This pre-trained model takes seven frames as input to predict the high-resolution center frame. We applied this pre-trained VSR model to our 2000 m, 1500 m, and 1000 m cropped vehicle datasets to obtain 4× higher-resolution vehicle video frames. Figure 12 shows the IR object classification block diagram.
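To illustrate the seven-frame input, the sketch below selects the frame indices of a window centered on a given target frame. The function name and the boundary policy (repeating the first or last frame when the window runs past the clip) are assumptions; the text does not say how the start and end of a video are handled.

```python
def centered_window_indices(t, num_frames, window=7):
    """Return the indices of a window of `window` frames centered on frame t.

    Indices outside the clip are clamped to the first or last frame
    (assumed boundary policy, not taken from the paper).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that a center frame exists")
    if not 0 <= t < num_frames:
        raise IndexError("target frame index out of range")
    half = window // 2
    return [min(max(i, 0), num_frames - 1) for i in range(t - half, t + half + 1)]
```

With this policy every frame of a clip receives a full seven-frame input, so the super-resolved crop sequence has the same length as the original.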

Video Super-Resolution Algorithm
For VSR, we used the recurrent back-projection network (RBPN) model developed by Haris et al. [61]. This model combines spatial and temporal information from consecutive video frames using a recurrent encoder-decoder, resulting in high-resolution frame generation that compares favorably with other state-of-the-art VSR methods. Video frames were enhanced by a factor of four using the VSR model. These enhanced frames were then fed to the pre-trained ResNet-18 as input for classification. For this project, we only used the 1500 m dataset to finetune the ResNet-18 model for classification. Then, we applied the trained ResNet model to classify the vehicles in the 1000 m and 2000 m cropped vehicle datasets with and without enhanced resolution, histogram matching, and image stretching. For finetuning, we set the learning rate to 0.001, the number of training epochs to 300, and the optimizer to stochastic gradient descent (SGD). Figure 13 shows an overview of the VSR model [61]. Here, $I$ denotes a low-resolution (LR) video frame. The model takes the LR frames $\{I_{t-n}, \ldots, I_{t-2}, I_{t-1}, I_t\}$ as input, where $I_t$ is the target frame. The goal of the VSR model is to produce $SR_t$, the high-resolution version of $I_t$. The network has two paths. In the horizontal blue-line flow, the model extracts features from the target frame. In the vertical red-line flow, the model computes residual features from pairs of the target frame and its neighbor frames, together with the precomputed dense motion flow maps $(F_{t-1}, F_{t-2})$. At each projection step, the model observes the details missing from the target frame and extracts the residual features from each neighbor frame to recover them. More details of this VSR model can be found in [61].

Results
It should be noted that the 1500 m optical videos were converted to MWIR videos by attention GAN. The converted infrared videos were then used to train the ResNet. In our experiments here, we did not convert the 1000 m and 2000 m optical videos to infrared because the ResNet classification results in Section 5 showed that the converted videos did not help ResNet. As explained earlier, the likely reason is that the converted 1000 m and 2000 m videos interfered with the actual IR videos during the training process and thereby degraded the ResNet classification. The experiments in this section are summarized in Figure 14.
To test the above framework, we used two separate datasets in the DSIAC database. Table 8 summarizes the videos used in our experiments. There are MWIR daytime and MWIR nighttime videos, with five videos in each case. Each video has 1800 frames. The classification results are summarized in Table 9. There are two separate studies.
• Testing on MWIR daytime videos.
There are four sub-cases in each range: (a) without both VSR and data augmentation (by attention GAN); (b) with VSR and without data augmentation; (c) without VSR and with data augmentation; (d) with both VSR and data augmentation. We can see that the classification results with VSR (MWIR Day) are improved considerably for both 1000 m and 2000 m videos regardless of data augmentation. In some cases, the improvements are over 30%. The 2000 m video results with both VSR and data augmentation are also improved by 11% as compared to the case in which no data augmentation is used.
• Testing on MWIR nighttime videos.
Results on MWIR nighttime videos are mixed, as shown in Table 9. In two out of the four cases with VSR, we see slightly improved performance, while the 1000 m case with data augmentation and the 2000 m case without data augmentation showed degraded performance. The converted data have low quality, and more research is needed in this direction.

Discussion
We have explored a data augmentation method to mitigate data scarcity in the IR domain for deep-network training by converting widely available labelled visible videos to the IR domain. Our method outperformed state-of-the-art methods for generating IR images from visible images. In addition, we have demonstrated that the converted IR images increased the detection and classification accuracies in the IR domain. Furthermore, we have shown that video super-resolution can be an effective way to improve target classification in video. There are possible alternatives for mitigating the data-hungry nature of deep learning, such as transfer learning, in which datasets from a similar domain are used to pre-train a deep model, and the domain-specific dataset is then used to finetune the pre-trained model to improve the performance. In future work, we will investigate which method is more effective for object detection in IR videos.

Conclusions
In this paper, we presented a new approach to convert optical videos to infrared videos. Our proposed attention GAN model generates more stable IR images and better vehicle shapes in the IR domain than the cycle GAN. We also observed that attention GAN helps the YOLO detection performance. In particular, the average precision of the target detection improved from 41% (without augmentation) to 62% (with augmentation) for the 2000 m videos. However, the converted videos did not help the ResNet classification performance. We then investigated the use of a video super-resolution technique to enhance the ResNet classification performance and observed some positive impacts. However, more research is still needed in this area. One future direction is to develop an integrated framework for target detection and classification that combines VSR, attention GAN, and more recent target detectors and classifiers.