Image Region Prediction from Thermal Videos Based on Image Prediction Generative Adversarial Network

Abstract: Various studies have been conducted on object detection, tracking, and action recognition based on thermal images. However, errors occur during object detection, tracking, and action recognition when a moving object leaves the field of view (FOV) of a camera and part of the object becomes invisible, and no studies have examined this issue so far. Therefore, this article proposes a method for widening the FOV of the current image by predicting image regions outside the FOV of the camera using the current image and previous sequential images. In the proposed method, the original one-channel thermal image is converted into a three-channel thermal image to perform image prediction using an image prediction generative adversarial network (IPGAN). In image prediction and object detection experiments using the marathon sub-dataset of the Boston University-thermal infrared video (BU-TIV) benchmark open dataset, we confirmed that the proposed method shows higher accuracies of image prediction (structural similarity index measure (SSIM) of 0.9839) and object detection (F1 score (F1) of 0.882, accuracy (ACC) of 0.983, and intersection over union (IoU) of 0.791) than state-of-the-art methods.


Introduction
Various studies have been conducted on object detection [1][2][3][4], tracking [5][6][7][8][9], and action recognition [10][11][12] using camera-based video surveillance systems, in addition to depth, ego-motion, and optical flow estimation [13]. However, when a walking or running object leaves the field of view (FOV) of the camera, part of the object's body becomes invisible, which leads to failures in human detection and tracking and thus induces errors in action recognition. No studies have considered this issue so far. To solve this problem, this study conducted an experiment, for the first time, on predicting the region outside the FOV that is not included in the current image (t), as shown in the image (t') illustrated in Figure 1, to restore the part of the object's body that is invisible. The proposed method widens the FOV of the current image using the current image, previous sequential images, and an image prediction generative adversarial network (IPGAN)-based method. Furthermore, the original one-channel thermal image is converted into a three-channel thermal image to be used as an input to the IPGAN. In this study, various experiments were conducted using the marathon sub-dataset of the Boston University-thermal infrared video (BU-TIV) benchmark open dataset [14]. Existing studies related to the proposed method are explained in Section 2.

Related Works
The following studies attempted to predict the next image based on previous sequential images. In studies [15][16][17][18][19], image prediction methods were proposed for creating a future frame using a current frame and previous sequential frames. In [15], image prediction was performed using an encoder and decoder model based on long short-term memory (LSTM) and a 3D convolution layer. In [16], image prediction was performed using PhyDNet based on LSTM and the newly suggested PhyCell. In [17], image prediction was performed using LSTM and a convolutional neural network (CNN). In [18], image prediction was performed using an encoder and decoder model. In [19], image prediction was performed using a stochastic variational video prediction (SV2P) method.
Instead of predicting the current image based on previous sequential images, image inpainting methods were proposed in [20][21][22][23][24], where deleted information is restored from a current image. In [20], image inpainting was performed using a fine deep-generative-model-based approach with a novel coherent semantic attention (CSA) layer. In [21], image inpainting was performed based on gated convolution and SN-PatchGAN. In [22], image inpainting was performed based on a parallel extended-decoder path for semantic inpainting network (PEPSI). In [23], image inpainting was performed using a context encoder method based on a channel-wise fully connected layer. In [24], image inpainting was performed using edge prediction and image completion based on the predicted edge map.
Furthermore, the following review and survey studies have been conducted. In a review paper [25], the datasets created between 2004 and 2019 that were used in image prediction were compared with the image prediction models created between 2014 and 2020. In a survey paper [26], papers and datasets based on image prediction were described. In another review paper [27], sequential-based, CNN-based, and generative adversarial network (GAN)-based image inpainting methods and the datasets used in image inpainting were described.
As explained, studies have been extensively conducted on image inpainting and on predicting the next image based on previous sequential images. However, no study has examined an image prediction method for generating an image region outside the FOV, which is proposed in this article. In addition, no previous study on image prediction or image inpainting adopted thermal images. The remainder of the paper is organized as follows. A detailed explanation of the proposed method is provided in Section 3. The experimental results and analysis are provided in Section 4. Finally, the discussion and the conclusion are presented in Sections 5 and 6.

Summary of existing methods using visible-light images (Methods / Advantages / Disadvantages):

Future image prediction. Methods: encoder-decoder model [15,18], PhyDNet [16], CNN + LSTM [17], SV2P [19], and reviews and surveys [25,26]. Advantages: high performance of future image prediction based on a current frame and previous frames. Disadvantages: do not consider image prediction outside the FOV; do not use thermal images of low resolution and low image quality.

Image inpainting. Methods: CSA layer [20], gated convolution + SN-PatchGAN [21], PEPSI [22], context encoder [23], edge prediction and image completion [24], and review [27]. Advantages: high performance of image inpainting based on a current frame. Disadvantages: the predicted image outside the FOV has a size limit.

Overall Procedure of Proposed Method
In this section, the method proposed in this study is described in detail. The proposed method performs image region prediction based on sequential three-channel thermal images using preprocessing, the IPGAN, and postprocessing. In Sections 3.2-3.5, preprocessing, the IPGAN architecture, postprocessing, and the dataset for image prediction, respectively, are described in detail. Figure 2 shows the overall flowchart of the proposed method. The length of the sequential input is 20 frames (t − 0, t − 1, . . ., t − 19), the size of each image is 85 × 170 pixels, and the size of the output image is 105 × 170 pixels. Specifically, the output image is created by combining a generated image region (an image outside the FOV) and the current image (an image inside the FOV).
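At the shape level, this composition can be sketched as follows (a hypothetical illustration, not the authors' code; which side of the current image the generated region is attached to is an assumption):

```python
import numpy as np

# Hypothetical sketch: a generated 20 x 170 out-of-FOV region is
# concatenated with the 85 x 170 current image (t - 0) to form the
# 105 x 170 output described in the text.
def compose_output(current_image: np.ndarray, predicted_region: np.ndarray) -> np.ndarray:
    assert current_image.shape == (85, 170, 3)
    assert predicted_region.shape == (20, 170, 3)
    return np.concatenate([current_image, predicted_region], axis=0)

out = compose_output(np.zeros((85, 170, 3)), np.zeros((20, 170, 3)))
print(out.shape)  # (105, 170, 3)
```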



Preprocessing
The preprocessing step is described in detail in this subsection. For thermal images captured with a thermal camera, a one-channel thermal image is converted into a three-channel thermal image using a colormap function. The jet colormap array [29] is used for performing the color conversion. The jet colormap array is a mapping function that expresses heat in the most appropriate colors compared with other colormaps. It maps a one-channel image into a three-channel image for the 256 pixel values from 0 to 255. For example, the hottest part of a one-channel image has a pixel value of 255 (white), whereas the coldest part has a pixel value of 0 (black). Correspondingly, the pixel value of the hottest part of a three-channel (red, green, blue) image is [255,0,0] (red), whereas that of the coldest part is [0,0,255] (blue). A color conversion example is shown in Figure 3. A one-channel thermal image is converted into a three-channel thermal image because several studies have shown that performing object detection, recognition, and classification using color visible light images results in a better performance than using grayscale visible light images [30][31][32]. Furthermore, to make the input and output sizes of the IPGAN structure identical, the region being predicted (the black area of 85 × 170 pixels) in the input image is created through zero padding, thus changing the size of the input image from 85 × 170 pixels to 170 × 170 pixels.
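A minimal sketch of this preprocessing step (the piecewise-linear jet approximation below is an illustration, not the exact colormap table the authors used):

```python
import numpy as np

def jet_colormap(gray: np.ndarray) -> np.ndarray:
    """Approximate jet mapping of 8-bit values: coldest -> blue, hottest -> red."""
    x = gray.astype(np.float64) / 255.0
    r = np.clip(np.minimum(4 * x - 1.5, -4 * x + 4.5), 0, 1)
    g = np.clip(np.minimum(4 * x - 0.5, -4 * x + 3.5), 0, 1)
    b = np.clip(np.minimum(4 * x + 1.5, -4 * x + 2.5), 0, 1)
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

def preprocess(thermal: np.ndarray) -> np.ndarray:
    """Color-convert an 85 x 170 thermal image and zero-pad it to 170 x 170."""
    color = jet_colormap(thermal)
    padded = np.zeros((170, 170, 3), dtype=np.uint8)
    padded[:85, :, :] = color  # remaining rows stay zero: the region to predict
    return padded

out = preprocess(np.full((85, 170), 255, dtype=np.uint8))
```

In practice a library routine such as OpenCV's `cv2.applyColorMap` with `COLORMAP_JET` would typically replace the hand-written mapping.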

Proposed IPGAN Model
The three-channel image (170 × 170 pixels) obtained through preprocessing, as shown in Figure 3, is used as an input to the IPGAN proposed in this study. The structure of the IPGAN is illustrated in Figure 4. The generator shown in Figure 4 includes a concatenate layer (L1), convolution blocks (L2 and L7), residual blocks (L3-L5 and L8-L11), and convolution layers (L12 and L13), in that order. The discriminator includes convolution blocks (L1-L6) and a fully connected layer (L7), in that order.
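The key property of the residual blocks in the generator is that they preserve spatial resolution while refining features. The numpy sketch below illustrates one such block for a single channel (a hypothetical illustration; the kernel sizes, channel counts, and activation placement of the actual IPGAN layers may differ):

```python
import numpy as np

def conv2d_same(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Single-channel 3 x 3 convolution with zero padding (stride 1)."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def residual_block(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """ReLU(conv(x)) + x: the skip connection keeps the input resolution."""
    return np.maximum(conv2d_same(x, kernel), 0.0) + x

x = np.random.rand(8, 8)
y = residual_block(x, np.random.rand(3, 3))
```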


Postprocessing
During postprocessing, the final output is acquired from the RGB output image obtained using the IPGAN, as shown in Figure 5. The region predicted in the output image obtained using the IPGAN is cropped as illustrated in Figure 5. The cropped region is combined with the original three-channel image (t − 0) to acquire the final output. The reasons for the smaller predicted region and the poor prediction of the remaining region are explained in Section 4.2 (ablation study) based on the experimental results.
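This crop-and-combine step can be sketched as follows (the exact row offsets of the crop are assumptions made for illustration; the paper states only that the cropped region is 20 × 170 pixels):

```python
import numpy as np

# Hypothetical postprocessing sketch: crop the 20 x 170 predicted region
# from the 170 x 170 IPGAN output, then attach it to the 85 x 170 current
# image (t - 0) to produce the 105 x 170 final output.
def postprocess(ipgan_output: np.ndarray, current_image: np.ndarray) -> np.ndarray:
    assert ipgan_output.shape == (170, 170, 3)
    assert current_image.shape == (85, 170, 3)
    predicted_region = ipgan_output[85:105, :, :]  # assumed 20 x 170 crop window
    return np.concatenate([current_image, predicted_region], axis=0)
```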


Dataset and Experimental Setup
The experiment in this study was conducted using the marathon sub-dataset [14] of the BU-TIV benchmark open thermal dataset. The task of the marathon dataset is multi-object tracking. The dataset includes various objects, namely, pedestrians, cars, motorcycles, bicycles, etc. The dataset consists of four videos (image sequences) with different sizes. The total number of images used in this experiment is 6552. Moreover, the size of an image in the marathon sub-dataset is 1024 × 512 × 1, and the pixel depth is 16 bits. The pixel values range between 3000 and 7000 units of uncalibrated temperature [14]. Images in the dataset are provided in portable network graphics (PNG) format. The four sequences are provided with annotations for object detection. The dataset was collected using FLIR SC800 cameras (FLIR Systems, Inc., Wilsonville, OR, USA) [14]. In this study, we cropped all images into 170 × 170 × 1 and converted the image depth into 8 bits.
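One way to perform the 16-bit to 8-bit depth conversion is min-max scaling over the stated 3000-7000 raw range (a sketch under that assumption; the paper does not specify the exact scaling used):

```python
import numpy as np

def to_8bit(raw: np.ndarray, lo: int = 3000, hi: int = 7000) -> np.ndarray:
    """Map 16-bit thermal values in [lo, hi] linearly onto 0..255."""
    scaled = (raw.astype(np.float64) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)
```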
The experiment was conducted with two-fold cross validation. In other words, half of the total data were used for training and the other half for testing, and the average of the two testing accuracies (obtained by repeating the same process after swapping the training and testing data) was taken as the final accuracy. In this study, the region was cropped with respect to the road on which people are running (the region of interest (ROI) in the red dashed box in Figure 6) in the original image. Ground-truth images (green dashed box) and input images (images with zero padding) were generated by cropping the ROI images into images of size 170 × 170. The process of creating the dataset used in this study is shown in Figure 6.
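The two-fold protocol can be sketched as follows (`evaluate` is a hypothetical function returning the testing accuracy for a given train/test split):

```python
# Two-fold cross validation: train on one half, test on the other,
# swap the halves, and report the average of the two testing accuracies.
def two_fold_accuracy(data, evaluate):
    half = len(data) // 2
    fold_a, fold_b = data[:half], data[half:]
    acc1 = evaluate(train=fold_a, test=fold_b)  # train on A, test on B
    acc2 = evaluate(train=fold_b, test=fold_a)  # swap and repeat
    return (acc1 + acc2) / 2                    # final reported accuracy

avg = two_fold_accuracy(list(range(10)), lambda train, test: len(test) / 10)
```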
The training and testing of the algorithm proposed in this study were conducted using a desktop computer equipped with an Intel Core i7-6700 CPU @ 3.40 GHz (Intel Corp., Santa Clara, CA, USA) and an Nvidia GeForce GTX TITAN X graphics processing unit (GPU) card.

Training
The IPGAN structure proposed in this study was trained as follows. The batch size, number of training iterations, and learning rate of the IPGAN were set to 1, 800,000, and 0.0001, respectively. Furthermore, for both the generator and discriminator losses, we used the binary cross-entropy loss, and the adaptive moment estimation (Adam) optimizer [36] was used as the optimizer. Twenty sequential images of size 170 × 170 pixels were used in all the methods for both training and testing. Figure 7 shows the training loss curves of the IPGAN by iteration. In Table 7, detailed information on the hyperparameter tuning is presented. The remaining hyperparameters were set to the default values of the Keras API [35].
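The binary cross-entropy used for both losses can be written generically as below (an illustrative sketch, not the authors' training code; the labeling convention of real = 1 and generated = 0 is the standard GAN setup):

```python
import numpy as np

def bce(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy between labels and predicted probabilities."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Discriminator: real images labeled 1, generated images labeled 0.
d_real = bce(np.ones(4), np.array([0.9, 0.8, 0.95, 0.85]))
# Generator: tries to make the discriminator output 1 on generated images,
# so its loss is high while the discriminator still rejects the fakes.
g_loss = bce(np.ones(4), np.array([0.1, 0.2, 0.15, 0.05]))
```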


Testing (Ablation Study)
In this section, the results of ablation studies for the proposed method are presented. The experiments were conducted using the same dataset and two types of GAN structures. For measuring the image prediction accuracy, the image region cropped from the resulting image (Figure 5) was compared with the ground-truth region based on similarity. The accuracy was measured using the three metrics shown in Equations (1)-(3).

MSE represents the mean squared error [37] in Equation (1). W and H represent the image width and height, respectively, in Equation (1). Furthermore, in Equations (1) and (3), O and T represent the output image and target image (ground-truth image), respectively. PSNR represents the peak signal-to-noise ratio [38] in Equation (2). In Equation (3), the structural similarity index measure (SSIM) [39] is presented, in which μT and σT represent the mean and standard deviation of the pixel values of the ground-truth image, respectively, and μO and σO represent the mean and standard deviation of the pixel values of the output image, respectively. σOT represents the covariance of the two images. R1 and R2 are small positive constants that keep the denominator from becoming zero.
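A minimal implementation consistent with these definitions (this SSIM uses global image statistics rather than the windowed variant common in libraries, and the R1, R2 values are illustrative assumptions):

```python
import numpy as np

def mse(o: np.ndarray, t: np.ndarray) -> float:
    """Mean squared error between output O and target T (Equation (1))."""
    return float(np.mean((o.astype(np.float64) - t.astype(np.float64)) ** 2))

def psnr(o: np.ndarray, t: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (Equation (2))."""
    return float(10 * np.log10(peak ** 2 / mse(o, t)))

def ssim(o: np.ndarray, t: np.ndarray, r1: float = 6.5025, r2: float = 58.5225) -> float:
    """Global-statistics SSIM (Equation (3)); r1, r2 avoid a zero denominator."""
    o = o.astype(np.float64)
    t = t.astype(np.float64)
    mu_o, mu_t = o.mean(), t.mean()
    var_o, var_t = o.var(), t.var()
    cov_ot = ((o - mu_o) * (t - mu_t)).mean()
    return float(((2 * mu_o * mu_t + r1) * (2 * cov_ot + r2))
                 / ((mu_o ** 2 + mu_t ** 2 + r1) * (var_o + var_t + r2)))
```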
In this section, seven different experiments were conducted. In Figures 8-10, the It image, the tth image of the 20 sequential images, is shown as the input image (far left). In Figure 8a, the target image (ground-truth (GT) image) (right image) is the subsequent image of the It image (left image), where GT and It do not include the same region. Specifically, the entire GT image, which does not include any region of It, was predicted in Method 1. However, the results obtained using this method varied significantly from the ground-truth image, as shown by the output image (middle image) in Figure 8a. Accordingly, only the region R (a zero-padded black area of 30 × 170 pixels) of the image was predicted, as shown in Figure 8b (Method 2). This method aims to predict the spatial information R, which is not included in It, when It is included in GT. However, gray noise is generated within the predicted region R in the output image obtained using this method.
For improving the accuracy, unlike Methods 1 and 2 in Figure 8, which used input, output, and ground-truth images of 80 × 170 pixels, Methods 3 and 4 in Figure 9 set the size of the input, output, and ground-truth images to 170 × 170 pixels in order to use spatial information that is wider in the horizontal direction. In addition, the experiment was conducted by setting the predicted region R to be larger for Method 4 in Figure 9b (in Methods 3 and 4, the sizes of R were 17 × 170 pixels). However, the gray noise generated in R' became larger in Figure 9b. As the width of R increased, the gray noise also became larger in this experiment. Therefore, the region could only be predicted between the red and yellow lines of R' in Figure 9b, and it was difficult to predict the region to the left of the yellow line. Therefore, as shown in Figure 5, the predicted region was cropped to a fixed size (20 × 170 pixels).

Moreover, the experiment shown in Figure 10a was conducted by padding with the average value of It to examine the effects of zero padding. In Method 6, the padding was performed using an empty background, as in the input image shown in Figure 10b. The empty background was selected manually from the marathon thermal images in order to examine the effects of zero padding. In Figure 10, the size of the input, output, and ground-truth images was also set to 170 × 170 pixels in order to use wider spatial information in the horizontal direction. However, the result obtained through zero padding (Method 4), shown in Figure 9b, demonstrated the best performance among the results thus far. Finally, as shown in Figure 10c, the experiment was conducted using the converted three-channel color image (Method 7), and the accuracy was compared. A comparison of all the experimental results is presented in Table 8. The results of using a one-channel image (Method 4) and a three-channel color image (Method 7) are compared in the images in Figure 11.

As shown in Figures 8-11 and Table 8, Method 7 exhibited the best image prediction performance. In Table 8, Method 4 exhibited a better performance than Method 7 in terms of PSNR; however, it has been reported that the PSNR is a poor measure for evaluating difference and similarity in human visual image quality [40,41]. SSIM better evaluates similarity in image quality [39]. Thus, Method 7 demonstrated the highest accuracy. Figure 12 shows examples of the output images obtained using the proposed method.

For inspecting the efficiency of the proposed method, the results of detecting humans in the original input and ground-truth images were compared with the results of detecting humans in the image predicted using the proposed method. Mask R-CNN [42] was used for conducting the experiment on human detection. Figure 13 shows the results of detecting humans using Mask R-CNN as mask images.

As shown in Figure 13, the result of human detection in the ground-truth image is similar to the result of human detection in the image predicted by the IPGAN with a three-channel color image as input. Furthermore, the detection result from the predicted image is closer to the detection result from the ground-truth image than that from the original input image.

Additionally, the detection (detection 1) accuracy was measured between the results obtained with the original input images and the results obtained with the ground-truth images. The detection (detection 2) accuracy was also measured between the results obtained with the images predicted using our method and the results obtained with the ground-truth images. These detection results (detection 1 and detection 2) are compared in Table 9. To this end, the true positive rate (TPR) (#TP/(#TP + #FN)) and positive predictive value (PPV) (#TP/(#TP + #FP)) [43], as well as the accuracy (ACC) [43], F1 score (F1) [44], and intersection over union (IoU) [43], which are expressed in Equations (4)-(6), respectively, were used to measure the accuracy for a comparison. Here, TP, FP, FN, and TN denote true positive, false positive, false negative, and true negative, respectively. Positive and negative in this experiment indicate the pixels detected in the ground-truth image (white pixels in Figure 13) and those not detected (black pixels in Figure 13), respectively. More specifically, TP refers to the case when positive pixels are detected correctly, whereas TN refers to the case when negative pixels are correctly not detected. FP refers to the case when negative pixels are incorrectly detected as positive, whereas FN refers to the case when positive pixels are incorrectly detected as negative. Here, "#" denotes "the number of."

Figure 13. Examples of detection results before and after image prediction. In (a-d), from left to right: the original input images, results with the original input images, ground-truth images, results with the ground-truth images, images predicted using our method, and results with the predicted images, respectively.

As shown in Table 9, detection 2 was more accurate than detection 1, which indicates that using the image predicted with our method produced detection results closer to the results of using the ground-truth image than using the original input image.
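These pixel-level metrics follow directly from the TP/FP/FN/TN counts (a direct transcription of the stated formulas; how Equations (4)-(6) group them in the paper may differ):

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Pixel-level detection metrics computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)                   # true positive rate (recall)
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    acc = (tp + tn) / (tp + fp + fn + tn)  # accuracy (ACC)
    f1 = 2 * ppv * tpr / (ppv + tpr)       # F1 score
    iou = tp / (tp + fp + fn)              # intersection over union
    return {"TPR": tpr, "PPV": ppv, "ACC": acc, "F1": f1, "IoU": iou}
```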

Comparisons of Proposed Method with the State-of-the-Art Methods
In this section, the results of comparing the proposed method with the state-of-the-art methods are presented. To measure the image prediction accuracy, the entire image (including R' and I't as in Figure 8b) obtained using the proposed method was compared with the ground truth in terms of similarity. In Table 10, the existing image prediction method [15] and inpainting methods [20,22,24] are compared with the IPGAN-based image region prediction method proposed in this study. In Figure 14, the result images obtained using all the methods are compared. For a fair performance evaluation, the methods that originally used a single image [20,22,24] were modified to use sequential images as inputs, as in our method; the input layer of these methods [20,22,24] was changed to layers 0 and 1 of Table 2, as in the proposed method. Moreover, a three-channel color image was used as the input and output of all the methods, as in our method, for a fair comparison. As shown in Table 10 and Figure 14, the proposed method exhibited better performance than the state-of-the-art methods.
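As a concrete illustration, the PSNR measure reported in Table 10 can be computed directly from the mean squared error between a predicted image and its ground truth (a minimal sketch for 8-bit images; SSIM requires windowed local statistics, e.g. `skimage.metrics.structural_similarity`, and is omitted here):

```python
import numpy as np

def psnr(predicted, ground_truth, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two 8-bit images."""
    diff = predicted.astype(np.float64) - ground_truth.astype(np.float64)
    mse = np.mean(diff ** 2)            # mean squared error over all pixels
    if mse == 0.0:
        return float("inf")             # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR and SSIM both indicate a predicted image closer to the ground truth, which is the sense in which Table 10 ranks the methods.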
For the next experiment, all the methods were compared using Mask R-CNN-based human detection. Table 11 and Figure 15 show the accuracy of the detection results as well as the output images. The experiment showed that the proposed method achieved the best performance.

Processing Time
In Table 12, the processing time of each sub-part of the proposed method (Figure 2) is presented. The processing time was measured in the environment described in Section 3.5. As shown in Table 12, the processing time of the Mask R-CNN is longer than that of the other sub-parts. The frame rate of the proposed prediction method is about 23.4 frames per second (1000/(9.97 + 32.8 + 0.01)), and the total frame rate, including both image prediction and detection, is about 10.6 frames per second (1000/94). Thus, the processing time of the proposed method for performing both image prediction and object detection is sufficiently short.
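The frame-rate arithmetic above amounts to inverting the summed per-image latencies (the millisecond values are those quoted in the text; treating them as the sub-parts of Table 12 is our reading):

```python
# Per-image latencies (ms) quoted in the text for the prediction pipeline.
prediction_ms = 9.97 + 32.8 + 0.01      # sub-parts of image prediction
total_ms = 94.0                          # prediction plus Mask R-CNN detection

# Frames per second = 1000 ms / latency per frame.
prediction_fps = 1000.0 / prediction_ms  # about 23.4 fps
total_fps = 1000.0 / total_ms            # about 10.6 fps
```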

Discussion
In this study, a method was proposed for predicting the image outside the FOV of a camera. The proposed method aims to accurately detect humans who are leaving the FOV of a camera, thereby reducing the object detection errors caused by part of a human body being invisible in the input image. As shown in the result images of Figures 13-15, the invisible body parts of humans leaving the FOV of a camera became visible in the images produced by the proposed image prediction method. Therefore, it is confirmed that the proposed method can effectively predict the missing parts of a human body and increase the accuracy of human detection.
However, it was also confirmed that the size of the region that can be predicted is limited when the images outside the FOV of a camera are predicted, as shown in Figure 9b. In addition, gray noise is generated in the R region (Figure 8b). As the width of R increased, the gray noise also became larger in this experiment, which consequently limited the size of the predicted region. Therefore, our method is suitable for applications in which a region of limited size is predicted for human detection in thermal videos.

Conclusions
In this study, a method was proposed for predicting the image outside the FOV of a camera using a one-channel thermal image converted into a three-channel thermal image as the input of the IPGAN. Various ablation studies based on different image sizes and image channels were conducted and compared. The method based on a three-channel thermal image showed a higher SSIM value (0.9535) than the one-channel thermal image-based methods. Moreover, it was confirmed that the image prediction method increased the accuracy of object detection, as shown in Table 9: TPR = 0.82, PPV = 0.81, F1 score = 0.815, ACC = 0.941, and IoU = 0.713 were increased to TPR = 0.901, PPV = 0.864, F1 score = 0.882, ACC = 0.983, and IoU = 0.791. In addition, the proposed method was compared with the state-of-the-art methods, and our method showed higher PSNR (23.243) and SSIM (0.9839) values than the state-of-the-art methods, as shown in Table 10. Our method was also compared with the state-of-the-art methods in terms of human detection, and the proposed method achieved TPR = 0.901, PPV = 0.864, F1 score = 0.882, ACC = 0.983, and IoU = 0.791, which were higher than those of the state-of-the-art methods, as shown in Table 11.
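The one-to-three-channel conversion mentioned above can be sketched by, for example, replicating the single thermal channel (a hedged illustration only; the paper's actual conversion scheme is described in its methods section and may differ from simple replication):

```python
import numpy as np

def to_three_channel(thermal):
    """Expand a (H, W) one-channel thermal frame to (H, W, 3) by
    channel replication (one plausible scheme; an assumption here)."""
    return np.repeat(thermal[:, :, np.newaxis], 3, axis=2)
```

Such a conversion lets a one-channel thermal frame be fed to networks whose input layers expect three-channel images.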
In future work, methods for predicting a wider region will be studied. Furthermore, by expanding the scope of this study, an image prediction method in which the front viewing angle of a vehicle's visible-light camera is extended in the horizontal direction will be investigated.

Figure 1. Example of thermal image prediction.

Figure 2. Overall flowchart of the proposed method.

Figure 4. Example of the structure of the proposed IPGAN.

Figure 5. Example of the postprocessing.

Figure 7. Training loss curves of the GAN.

Figure 8. Examples of result images obtained using Methods 1 and 2. From left to right, the input, output, and ground-truth images, respectively, obtained using (a) Method 1 and (b) Method 2. The size of the input, output, and ground-truth images is 80 × 170 pixels.

Figure 9. Examples of result images obtained using Methods 3 and 4. From left to right, the input, output, and ground-truth images, respectively, obtained using (a) Method 3 and (b) Method 4.

Figure 10. Examples of result images obtained using Methods 5-7. From left to right, the input, output, and ground-truth images, respectively, obtained using (a) Method 5, (b) Method 6, and (c) Method 7.

Figure 11. Examples of result images obtained using Methods 4 and 7. From left to right, the input, output, and ground-truth images, respectively, obtained using (a) Method 4 and (b) Method 7.

Figure 12. Examples of result images obtained using the proposed method. In (a-d), from left to right, the original, ground-truth, and predicted (output) images, respectively.

Figure 13. Examples of detection results before and after image prediction. In (a-d), from left to right, the original input images, results with original input images, ground-truth images, results with ground-truth images, images predicted using our method, and results with predicted images, respectively.

Figure 14. Comparisons of the original images, ground-truth images, and prediction results obtained using the state-of-the-art methods and our method: (a) original images; (b) ground-truth images. Images predicted using: (c) Haziq et al.'s method; (d) Liu et al.'s method; (e) Shin et al.'s method; (f) Nazeri et al.'s method; (g) the proposed method.

Figure 15. Comparisons of detection results using the original images, ground-truth images, and the predicted images obtained using the state-of-the-art methods and our method. (a) Original images. Detection results using the (b) original images, (c) ground-truth images, (d) images predicted using Haziq et al.'s method, (e) images predicted using Liu et al.'s method, (f) images predicted using Shin et al.'s method, (g) images predicted using Nazeri et al.'s method, and (h) images predicted using our method.
Table 1 presents a summary of the comparisons between the present and previous studies. This study is novel in the following four ways compared with the previous works:

Table 1. Comparison between the present and previous studies.

Table 2. Description of the generator of the proposed IPGAN.

Table 3. Description of a convolution block of the generator.

Table 4. Description of a residual block of the generator.

Table 5. Description of the discriminator of the proposed IPGAN.

Table 6. Description of a convolution block of the discriminator.

Table 7. Detailed information of hyperparameter tuning.

Table 8. Comparison of various region prediction methods.

Table 9. Comparisons of object detection accuracies by detections 1 and 2.

Table 10. Comparison of the image prediction methods.

Table 11. Comparisons of object detection accuracies obtained using our method with those of the state-of-the-art methods based on Mask R-CNN.

Table 12. Processing time of the proposed method per image (unit: ms).