Detection of Small Moving Objects in Long Range Infrared Videos from a Change Detection Perspective

Abstract: Detection of small moving objects in long range infrared (IR) videos is challenging due to background clutter, air turbulence, and small target size. In this paper, we present two unsupervised, modular, and flexible frameworks to detect small moving targets. The key idea was inspired by change detection (CD) algorithms, where frame differences can help detect motion. Our frameworks consist of change detection, small target detection, and some post-processing algorithms such as image denoising and dilation. Extensive experiments using actual long range mid-wave infrared (MWIR) videos with target distances of 3500 m and beyond demonstrated that one approach, using Local Intensity Gradient (LIG) only once in the workflow, performed better than the other, which used LIG in two places, in a 3500 m video, but slightly worse in 4000 m and 5000 m videos. Moreover, we also investigated the use of synthetic bands for target detection and observed promising results for 4000 m and 5000 m videos. Finally, a comparative study with two conventional methods demonstrated that our proposed scheme has comparable performance.


Introduction
In long range surveillance, targets may occupy around 10 pixels or even fewer; these are known as small targets. Small target detection is difficult in long range infrared videos due to the small target size and environmental factors. Small target detection for infrared images has been a commonly explored problem in recent years [1][2][3][4][5][6]. Chen et al. [1] proposed to detect small IR targets by using a local contrast measure (LCM), which is time-consuming and sometimes enhances both targets and clutter. To improve the performance of LCM, Wei et al. [2] introduced a multiscale patch-based contrast measure (MPCM). Gao et al. [3] developed an infrared patch-image (IPI) model to convert small target detection to an optimization problem. Zhang et al. [4] improved the performance of the IPI via non-convex rank approximation minimization (NRAM). Zhang et al. [5] proposed to detect small IR targets based on local intensity and gradient (LIG) properties, which has good performance and relatively low computational complexity. Recently, Chen et al. [6] proposed a new real-time approach for detecting small targets against a sky background.
It should be noted that the aforementioned papers detect targets frame by frame. Parallel to the above small target detection activities, there are some conventional target tracking methods [7,8] for videos. In general, target detection in videos can yield better results because target motion can be exploited. For instance, the paper [7] combines single frame detection with a track fusion algorithm to yield improved target detection in infrared videos. Furthermore, various target detection and classification schemes for optical and infrared videos have been proposed in the literature. Some of them [9][10][11][13][33] used You Only Look Once (YOLO) for target detection. Although the YOLO performance is reasonable for short ranges up to 2000 m in some videos, the performance dropped considerably at long ranges where the targets are very small. This is because some deep learning algorithms, such as YOLO, use texture information to help the detection. YOLO is therefore not very effective for long range videos in which the targets are too small to have any discernible texture. Some of these new algorithms incorporated compressive measurements directly for detection and classification. Real-time issues have also been discussed [33]. In a recent paper [34], optical flow techniques were applied to small target detection in long range infrared videos. Detection results using actual videos at ranges up to 5000 m were promising.
In this research, we focus on small moving target detection in long range infrared videos where the ranges are 3500 m and beyond. In the literature, we have not seen target detection studies for such long ranges before except papers written by us [7][9][10][11][34]. We propose two approaches based on change detection (CD) techniques for target detection in videos containing moving targets. We call these two approaches the standard and alternate approaches. There are several steps in the standard approach. First, we propose to apply change detection techniques to generate a residual image between two frames separated by 15 frames. The number "15" is a design parameter that worked well in our experiments. For other datasets, a different number may be needed. Although direct subtraction between two frames can be used here, we compared three well-known change detection methods known as covariance equalization (CE) [35], chronochrome (CC) [36], and anomalous change detection (ACD) [37] and found that those change detection methods performed better than direct subtraction. Second, a denoising step using a diffusion filter is used to reduce some false positives. Third, an image dilation step is performed afterwards to enlarge the detected object. Fourth, the Local Intensity Gradient (LIG) detector [5] is applied to the residual image to detect the targets. It was discovered that this step plays a dominant role in small target detection. Finally, another dilation is performed to further enhance the target detection performance. In the alternate approach, the LIG and change detection modules are swapped. Extensive experiments using three long range infrared videos demonstrated that the performance of the standard approach is better than the alternate approach.
In addition to the above studies, we also investigated the use of Extended Morphological Attribute Profile (EMAP) [38][39][40][41][42] and local contrast enhancement (LCE) [43] to synthesize multiple bands out of the single infrared image. The motivation for this is that, in our recent change detection applications [44,45], we noticed remarkable improvement in change detection and target detection performance when EMAP was used. For LCE, it was observed by Xia et al. [43] that target detection was also improved. The additional synthetic bands from EMAP and LCE yielded comparable or better results than the original images for the 4000 m and 5000 m videos.
Our contributions are summarized as follows:

• We present two new target detection frameworks from a change detection perspective for small moving targets.
• The two new schemes are unsupervised approaches as compared to the deep learning approaches in the literature. This means the proposed approaches require no training data and hence are more practical.
• We demonstrated the efficacy of the proposed approaches using actual long range and low quality MWIR videos from 3500 m to 5000 m.
• We investigated the use of synthetic bands for target detection. The performance is promising as we have comparable or better detection results for 4000 m and 5000 m videos.
• We compared our approaches with two conventional approaches (frame by frame and optical flow) and obtained comparable or better performance.
The remainder of the paper is as follows. Section 2 describes the motivation and the proposed approaches. Section 3 summarizes the experimental results, including comparative studies. Finally, some concluding remarks are presented in Section 4.
Figure 1 contains three frame differences with different separations from a 3500 m distance video, which is one of the daytime videos in the DSIAC dataset [46]. It can be seen that, when two frames are separated by 15 or more frames, it is possible to see some motion differences. This motivates us to pursue object detection using frame difference. However, as one can see in later sections, a direct subtraction without the help of other processing modules can have a lot of false positives.

Figure 1. Direct subtraction results for frame differences of 1, 15, and 30 frames (columns: image a, image b, difference image). The frame separation needs to be large enough in order to detect moving objects.
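The effect of the frame separation can be illustrated with a small sketch. It assumes NumPy and a toy clip invented for illustration (a bright 2 × 2 target that moves one pixel every five frames); the function name `frame_difference` and the toy data are hypothetical and are not taken from the DSIAC videos.

```python
import numpy as np

def frame_difference(frames, index, gap=15):
    """Absolute difference between frame `index` and frame `index + gap`."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    pre = frames[index].astype(np.float64)
    post = frames[index + gap].astype(np.float64)
    return np.abs(post - pre)

# Toy clip (assumption): a 2x2 target that moves one pixel to the right
# every 5 frames over a completely static background.
frames = np.zeros((31, 16, 32), dtype=np.uint8)
for t in range(31):
    col = 2 + t // 5
    frames[t, 8:10, col:col + 2] = 200

diff_1 = frame_difference(frames, 0, gap=1)    # target has not moved yet
diff_15 = frame_difference(frames, 0, gap=15)  # target has moved 3 pixels
```

With a gap of one frame the slow target leaves no trace in the difference image, whereas a gap of 15 frames shows both its old and new positions. Real frames would additionally contain noise and clutter, which is why direct subtraction alone produces many false positives.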

Proposed Unsupervised Target Detection Approaches Using Change Detection
From Figure 1, the difference maps usually contain a lot of noise for a number of different reasons and the accurately detected change is still very dim. So, we tried using more sophisticated change detection algorithms. The three algorithms we tried are Covariance Equalization (CE) [35], Chronochrome (CC) [36], and Anomalous Change Detection (ACD) [37]. It should be noted that using change detection between two frames will also result in two detections. If these frames are far enough apart, there would be two vehicles present on the change detection map. In our experiments, a 15-frame gap between image pairs seems to create a reasonable balance where the vehicles overlap on the change detection map so there will only be one detection and the vehicles also create a reasonably sized detection due to the amount of separation that can occur in 15 frames. A smaller gap in frames has more overlap guaranteeing only one detection but creating a smaller detection as well. A larger gap in frames has a larger detection but increases the odds that there will be two changes detected rather than one.
The most effective workflow for tracking a moving target using change detection is to only use one frame every 15 frames and perform the following steps for each pair. The workflow of the standard approach is also illustrated in Figure 2 and the key steps are summarized below:

1. Perform change detection using a CD algorithm between two frames.
2. Apply denoising to reduce the amount of noise in the change map.
3. Perform dilation to increase the size and intensity of detected changes.
4. Use LIG to detect anomalies in each change detection map.
5. Perform dilation again to make the detected change more visible.

In Figure 3, we also show an alternative approach to target detection. The difference between the two approaches is the location of the LIG module. In the alternative approach, the LIG is applied to the two individual frames first. We will compare these two approaches in the experiments. In the following paragraphs, we will briefly summarize the details of each module.
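The ordering of the modules in the two workflows can be sketched as follows. This is only a structural illustration: the stage names in the `stages` dictionary are hypothetical, each stage is a stand-in that logs its name, and the ordering of the remaining steps in the alternative workflow is an assumption (the text only states that LIG is moved in front of change detection and applied to both frames).

```python
def standard_workflow(pre, post, stages):
    """CD -> denoise -> dilate -> LIG -> dilate, following the numbered steps."""
    x = stages["change_detect"](pre, post)
    x = stages["denoise"](x)
    x = stages["dilate_small"](x)
    x = stages["lig"](x)
    return stages["dilate_large"](x)

def alternative_workflow(pre, post, stages):
    """LIG on each frame first, then CD; later steps are assumed unchanged."""
    x = stages["change_detect"](stages["lig"](pre), stages["lig"](post))
    x = stages["denoise"](x)
    x = stages["dilate_small"](x)
    return stages["dilate_large"](x)

def make_logging_stage(name, log):
    """Hypothetical stand-in module: records its name, returns its first input."""
    def stage(*images):
        log.append(name)
        return images[0]
    return stage

call_log = []
stage_names = ["change_detect", "denoise", "dilate_small", "lig", "dilate_large"]
stages = {name: make_logging_stage(name, call_log) for name in stage_names}

standard_workflow("pre", "post", stages)
standard_order = list(call_log)

call_log.clear()
alternative_workflow("pre", "post", stages)
alternative_order = list(call_log)
```

Keeping every module behind a plain callable reflects the modular design described above: any of the three CD algorithms, or direct subtraction, can be swapped in without touching the rest of the chain.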

Change Detection
We have applied three change detection algorithms in our experiments.

Covariance Equalization (CE)
Suppose $I(T_1)$ is the reference (R) image and $I(T_2)$ is the test image (T). The algorithm is as follows [35]:

1. Compute mean and covariance of R and T as $m_R$, $C_R$, $m_T$, $C_T$.
3. Do transformation.

where Q is the covariance of $[P_R - P_T]$. The changes will be reflected in the residuals.
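One common way to realize covariance equalization is to whiten both images with the inverse square roots of their covariances and subtract the results. The sketch below assumes that form, so the eigen-decomposition and the $C^{-1/2}$ transformation are assumptions rather than the exact equations of [35]. It also assumes NumPy, images of shape (height, width, bands), and a hypothetical function name; a single-band frame can be passed as `frame[..., None]`.

```python
import numpy as np

def _inv_sqrt(cov, eps=1e-10):
    """Inverse square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(cov)
    vals = np.maximum(vals, eps)          # guard against zero eigenvalues
    return (vecs / np.sqrt(vals)) @ vecs.T

def covariance_equalization(ref, test):
    """Whiten R and T separately and return the residual magnitude per pixel."""
    h, w, b = ref.shape
    R = ref.reshape(-1, b).astype(np.float64)
    T = test.reshape(-1, b).astype(np.float64)
    m_R, m_T = R.mean(axis=0), T.mean(axis=0)
    C_R = np.atleast_2d(np.cov(R, rowvar=False))
    C_T = np.atleast_2d(np.cov(T, rowvar=False))
    P_R = (R - m_R) @ _inv_sqrt(C_R)      # assumed transformation of R
    P_T = (T - m_T) @ _inv_sqrt(C_T)      # assumed transformation of T
    residual = P_R - P_T
    return np.linalg.norm(residual, axis=1).reshape(h, w)

# Toy example (assumption): a global gain/offset change plus one real change.
rng = np.random.default_rng(0)
ref_img = rng.normal(size=(20, 20, 2))
test_img = 3.0 * ref_img + 5.0
test_img[4, 7] += 30.0
ce_map = covariance_equalization(ref_img, test_img)
```

Because both images are normalized by their own statistics, a global gain and offset change between the two frames is largely cancelled, and the residual is dominated by the pixel that actually changed.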

Chronochrome (CC)
Suppose $I(T_1)$ is the reference (R) image and a later image $I(T_2)$ is the test image (T). The algorithm is as follows [36]:

1. Compute mean and covariance of R and T as $m_R$, $C_R$, $m_T$, $C_T$.
2. Compute cross-covariance between R and T as $C_{TR}$.
3. Do transformation.

where Q is the covariance of $[P_R - P_T]$. The change detection results between $P_R$ and $P_T$ can be seen in $\varepsilon$.
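The chronochrome steps can be sketched as a linear prediction of the test image from the reference image. The sketch assumes the usual form $P_R = C_{TR} C_R^{-1}(R - m_R) + m_T$ with $P_T = T$, and it assumes that the final score normalizes the residual by Q; both choices, along with the function name and the toy data, are assumptions rather than the exact equations of [36].

```python
import numpy as np

def chronochrome(ref, test, eps=1e-10):
    """Predict T from R linearly and score each pixel by its normalized residual."""
    h, w, b = ref.shape
    R = ref.reshape(-1, b).astype(np.float64)
    T = test.reshape(-1, b).astype(np.float64)
    n = R.shape[0]
    m_R, m_T = R.mean(axis=0), T.mean(axis=0)
    Rc, Tc = R - m_R, T - m_T
    C_R = Rc.T @ Rc / (n - 1) + eps * np.eye(b)
    C_TR = Tc.T @ Rc / (n - 1)                     # cross-covariance of T and R
    # Row-vector form of C_TR C_R^{-1} (R - m_R) + m_T.
    P_R = Rc @ np.linalg.solve(C_R, C_TR.T) + m_T
    residual = P_R - T                              # P_T is taken to be T itself
    Q = residual.T @ residual / (n - 1) + eps * np.eye(b)
    # Assumed scoring: residual^T Q^{-1} residual for every pixel.
    score = np.einsum("ij,ij->i", residual @ np.linalg.inv(Q), residual)
    return score.reshape(h, w)

# Toy example (assumption): gain/offset change, mild noise, one real change.
rng = np.random.default_rng(1)
ref_img = rng.normal(size=(20, 20, 2))
test_img = 3.0 * ref_img + 5.0 + 0.1 * rng.normal(size=ref_img.shape)
test_img[4, 7] += 30.0
cc_map = chronochrome(ref_img, test_img)
```

Unlike covariance equalization, this formulation uses the cross-covariance, so it relies on the two frames being spatially registered.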

Anomalous Change Detection (ACD)
ACD is an anomalous change detection method developed by Los Alamos National Laboratory [37]. It is based on an anomalous change detection framework applied to the Gaussian model. Suppose x and y are mean-subtracted pixel vectors in two images (R and T) at the same pixel location. We denote the covariances of R and T as $C_R$ and $C_T$, and the cross-covariance between R and T as $C_{TR}$. The change value at the pixel location of x and y is then computed using Equation (6), and the change map is obtained by applying Equation (6) to all pixels in R and T. In these equations, subscript R corresponds to the reference image, subscript T corresponds to the test image, and Q is computed according to Equation (7). Different from the Chronochrome (CC) and Covariance Equalization (CE) techniques, in ACD, the decision boundaries that separate normal changes from abnormal ones are hyperbolic.
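A minimal sketch of a Gaussian anomalous change detector is given below. It assumes the widely used quadratic form in which the score of the stacked vector $[x; y]$ is its joint Mahalanobis distance minus the two marginal Mahalanobis distances, which is what produces hyperbolic decision boundaries. The block layout of Q, the function name, and the toy data are assumptions and may differ from Equations (6) and (7).

```python
import numpy as np

def anomalous_change_detection(ref, test, eps=1e-10):
    """Score = joint Mahalanobis distance minus the two marginal distances."""
    h, w, b = ref.shape
    X = ref.reshape(-1, b).astype(np.float64)
    Y = test.reshape(-1, b).astype(np.float64)
    X = X - X.mean(axis=0)                  # mean-subtracted reference pixels
    Y = Y - Y.mean(axis=0)                  # mean-subtracted test pixels
    Z = np.hstack([X, Y])                   # stacked vectors [x; y]
    n = Z.shape[0]
    C = Z.T @ Z / (n - 1) + eps * np.eye(2 * b)   # joint covariance
    C_R, C_T = C[:b, :b], C[b:, b:]
    # Assumed Q: inverse joint covariance minus block-diagonal marginal inverses.
    Q = np.linalg.inv(C)
    Q[:b, :b] -= np.linalg.inv(C_R)
    Q[b:, b:] -= np.linalg.inv(C_T)
    score = np.einsum("ij,jk,ik->i", Z, Q, Z)
    return score.reshape(h, w)

# Toy example (assumption): gain/offset change, mild noise, one real change.
rng = np.random.default_rng(2)
ref_img = rng.normal(size=(20, 20, 2))
test_img = 3.0 * ref_img + 5.0 + 0.1 * rng.normal(size=ref_img.shape)
test_img[4, 7] += 30.0
acd_map = anomalous_change_detection(ref_img, test_img)
```

Subtracting the marginal terms means that a pixel which is unusual in both frames in a consistent way scores low, while a pixel whose pairing of values is unusual scores high.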

Denoising
The denoising step is important in reducing speckle noise that could be detected as a change between two frames. During this step, Matlab's imdiffusefilt function is used [47]. This function applies anisotropic diffusion filtering to denoise the change map.
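For readers without MATLAB, the idea behind anisotropic diffusion can be sketched with a basic Perona-Malik scheme. This is an illustration under assumptions: it uses NumPy, a 4-neighbor stencil, an exponential conduction function, and hypothetical parameter values (`n_iter`, `kappa`, `step`); it does not reproduce the defaults or internals of imdiffusefilt.

```python
import numpy as np

def anisotropic_diffusion(image, n_iter=5, kappa=0.1, step=0.2):
    """Basic Perona-Malik diffusion with a 4-neighbor stencil."""
    img = image.astype(np.float64).copy()
    for _ in range(n_iter):
        padded = np.pad(img, 1, mode="edge")
        # Differences to the north, south, west and east neighbors.
        d_n = padded[:-2, 1:-1] - img
        d_s = padded[2:, 1:-1] - img
        d_w = padded[1:-1, :-2] - img
        d_e = padded[1:-1, 2:] - img
        # Conduction is close to 1 for small differences (noise) and close
        # to 0 for large differences (edges), so edges are preserved.
        flux = sum(np.exp(-(d / kappa) ** 2) * d for d in (d_n, d_s, d_w, d_e))
        img = img + step * flux
    return img

# Toy example (assumption): a noisy step edge.
rng = np.random.default_rng(3)
clean = np.zeros((32, 32))
clean[:, 16:] = 1.0
noisy = clean + 0.02 * rng.normal(size=clean.shape)
smoothed = anisotropic_diffusion(noisy)
```

The step size is kept at or below 0.25 so that the explicit update stays stable for a 4-neighbor stencil.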

Dilation
For dilation, we used MATLAB's imdilate function with a disk of size 2 pixels as the structuring element. This made the results from change detection much clearer. Figure 4 shows an example of the improvement.
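Grayscale dilation replaces each pixel with the maximum over a neighborhood defined by the structuring element. The sketch below assumes NumPy and hypothetical helper names (`disk_footprint`, `dilate`); the disk is an exact Euclidean disk of radius 2, which may differ slightly from the approximated disk that MATLAB's `strel` produces.

```python
import numpy as np

def disk_footprint(radius):
    """Boolean mask of a Euclidean disk with the given radius in pixels."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x * x + y * y) <= radius * radius

def dilate(image, footprint):
    """Grayscale dilation: maximum of the image over the footprint."""
    fh, fw = footprint.shape
    ph, pw = fh // 2, fw // 2
    low = image.min()
    padded = np.pad(image, ((ph, fh - 1 - ph), (pw, fw - 1 - pw)),
                    mode="constant", constant_values=low)
    out = np.full(image.shape, low, dtype=image.dtype)
    for dy in range(fh):
        for dx in range(fw):
            if footprint[dy, dx]:
                # Shifted view of the image for this footprint offset.
                window = padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
                out = np.maximum(out, window)
    return out

# Toy example (assumption): a single bright pixel in a 21x21 change map.
change_map = np.zeros((21, 21))
change_map[10, 10] = 1.0
dilated_disk = dilate(change_map, disk_footprint(2))
dilated_square = dilate(change_map, np.ones((10, 10), dtype=bool))
```

The same helper accepts any boolean footprint, so the 10 × 10 square used in the dilation after LIG can be passed in the same way as the disk.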

LIG for Target Detection
Since the detection results of YOLO at the longer ranges (3500 m and above) were not as high as we would have liked, we also investigated a traditional unsupervised small target detection method to see how it would perform on the long range videos. The algorithm of choice for this study was a local intensity gradient (LIG) based target detector [5], specifically designed for infrared images. The LIG is relatively faster than other algorithms and is very robust to background clutter. Figure 5 highlights the architecture of the LIG [5]. The algorithm scans through the input image using a sliding window, whose size depends on the input image resolution. For each window, the local intensity and gradient values are computed separately. Then, those values are multiplied to form an intensity-gradient (IG) map. An adaptive threshold is then used to segment the IG map, and the binarized image reveals the target. A major advantage of these traditional/unsupervised algorithms is that they require no training, so there is no need to worry about customizing training data, which is the case with YOLO. A disadvantage of the LIG algorithm is that it is still quite slow in absolute terms, taking roughly 70 s per frame.
There are two adjustments we made to the LIG algorithm to make it more suitable for the DSIAC infrared dataset. First, we adjusted the way in which the adaptive threshold T is calculated. One method to calculate T is to use the mean value of all non-zero pixels [5]. For our dataset, this calculation produced a very small value due to the overwhelming number of very low non-zero pixels. The left image in Figure 6 highlights the significant role that the threshold plays for this algorithm. Second, we implemented ways of speeding up the algorithm, such as incorporating multithreading within the script and converting it to a faster interpreted language than MATLAB. We were able to speed up the computation by close to three times.
For the example in Figure 6, the mean value was 0.008. Using this threshold value for binarization, we observe that roughly half the non-zero pixels would be considered as detections, as seen in the left-hand image of Figure 6. This originally resulted in hundreds of false positives in the frames. So, instead of using the mean of non-zero pixels in the LIG processed frame, we use the mean of the top 0.01% of pixels. A higher threshold is essential for eliminating false positives, as can be seen in the image on the right of Figure 6.
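The two threshold rules can be compared on a synthetic intensity-gradient map. The sketch assumes NumPy; the function names and the synthetic map (exponentially distributed background values with a scale of 0.008 plus a 3 × 3 target) are invented for illustration and are not measurements from the DSIAC data.

```python
import numpy as np

def mean_nonzero_threshold(ig_map):
    """Threshold T as the mean of all non-zero pixels."""
    nonzero = ig_map[ig_map > 0]
    return nonzero.mean() if nonzero.size else 0.0

def top_fraction_threshold(ig_map, fraction=1e-4):
    """Threshold T as the mean of the top `fraction` of pixels (0.01% = 1e-4)."""
    values = np.sort(ig_map.ravel())[::-1]
    k = max(1, int(round(fraction * values.size)))
    return values[:k].mean()

# Synthetic IG map (assumption): many weak responses and one strong target.
rng = np.random.default_rng(4)
ig_map = rng.exponential(0.008, size=(512, 640))
ig_map[100:103, 200:203] = 1.0

t_mean = mean_nonzero_threshold(ig_map)
t_top = top_fraction_threshold(ig_map)
n_detected_mean = int((ig_map > t_mean).sum())
n_detected_top = int((ig_map > t_top).sum())
```

On this toy map the mean-based threshold flags a large share of the background, while the top-fraction threshold keeps only the target pixels, which mirrors the behavior described for Figure 6.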
After running change detection using the CC method, the visual change maps appeared to be correct in most cases but there were a couple of pairs with a lot of noise. The LIG detection was able to clean up the noise in the pairs that performed poorly. Figure 7 below is an example of what these noisy frames looked like.
Photonics 2021, 8, x FOR PEER REVIEW 7

Dilation Again after LIG
After LIG, dilation is again performed using a 10 × 10 pixel square as the parameter. Figure 8 shows an example of how this improves the visual result.

Generation of Synthetic Bands
In the past, researchers have used EMAP to enhance change detection performance [38][39][40][41][42]. It was observed that EMAP can generate synthetic bands and improve the overall performance. As such, we considered using EMAP to improve the small target detection performance in infrared videos. EMAP allows us to convert a single band into a multispectral image made up of synthetic bands. In another study, researchers also found that some synthetic bands using the LCE can help the target detection performance [43]. We implemented the LCE algorithm.
This section describes our attempts to expand this investigation.

EMAP
Mathematically, given an input grayscale image $f$ and a sequence of threshold levels $\{Th_1, Th_2, \ldots, Th_n\}$, the attribute profile (AP) of $f$ is obtained by applying a sequence of thinning and thickening attribute transformations to every pixel in $f$.
The EMAP of $f$ is then acquired by stacking two or more APs while using any feature reduction technique on multispectral/hyperspectral data, such as purely geometric attributes (e.g., area, length of the perimeter, image moments, shape factors), or textural attributes (e.g., range, standard deviation, entropy) [38][39][40][41].
In this paper, the "area (a)" and "length of the diagonal of the bounding box (d)" attributes of EMAP [42] were used. For the area attribute of EMAP, the two thresholds used by the morphological attribute filters were set to 10 and 15. For the length attribute of EMAP, the thresholds were set to 50, 100, and 500. The above thresholds were chosen based on experience, because we observed them to yield consistent results in our experiments. With this parameter setting, EMAP creates 11 synthetic bands for a given single band image. One of the bands comes from the original image.

LCE
LCE stands for Local Contrast Element [43]. This method creates synthetic bands by placing a window around each pixel and finding the pixels most similar to the center. This method was used to create a varying number of bands. Figure 9 is a diagram explaining the logic in creating these synthetic bands.

Videos
Our research objective is to perform target detection in long range and low quality MWIR videos. There are no such datasets in the public domain except the DSIAC videos [46]. There are optical and MWIR videos in the DSIAC datasets. The optical and MWIR videos have very different characteristics. Optical imagers have a wavelength range between 0.4 and 0.8 microns and MWIR imagers have a wavelength range between 3 and 5 microns. Optical cameras require external illumination whereas MWIR cameras do not, because MWIR cameras are sensitive to heat radiation from objects. Consequently, target shadows, illumination, and hot air turbulence can affect the target detection performance in optical videos. MWIR imagery is dominated by the thermal component at night and hence it is a much better surveillance tool than visible imagers at night. Moreover, atmospheric obscurants cause much less scattering in the MWIR bands than in the optical band. As a result, MWIR cameras are tolerant of heat turbulence, smoke, dust, and fog.
In this paper, we focused on the mid-wave infrared (MWIR) videos collected at distances ranging from 1000 m to 5000 m in 500 m increments. Each video has 1800 frames. The video frame rate is 7 frames/second and the frame size is 640 × 512. Each pixel is represented by 8 bits. These videos are challenging for several reasons. First, the target sizes are small due to the long distances between the target and the camera. This is quite different from some benchmark datasets such as the MOT Challenge [48], where the range from target to camera is short and the targets are big. Second, the target orientations also change drastically because the vehicles travel in a circle. Third, the illumination differs between videos because of changes in cloud cover and time of day. Fourth, the cameras also move in some videos.

Performance Metrics
A correct detection or true positive (TP) occurs if the binarized detection is within a certain distance threshold of the centroid of the ground truth bounding box. Otherwise, the detected object is regarded as a false positive (FP). If a frame does not have a TP, then a missed detection (MD) occurs. Based on the correct detection and false positive counts, we can further generate precision, recall, and F1 metrics. The precision (P), recall (R), and F1 are defined as

$$P = \frac{TP}{TP + FP} \tag{8}$$
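The counting rule and the three metrics can be sketched as follows. The sketch assumes that recall treats missed detections as false negatives, R = TP / (TP + MD), and that F1 is the harmonic mean of P and R; these are the standard definitions and are assumptions as far as this text is concerned. The distance threshold `max_dist`, the box format, and the function names are hypothetical.

```python
import math

def is_true_positive(detection_xy, gt_box, max_dist):
    """True if the detection lies within max_dist of the ground truth centroid.

    gt_box is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    x0, y0, x1, y1 = gt_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return math.hypot(detection_xy[0] - cx, detection_xy[1] - cy) <= max_dist

def precision_recall_f1(tp, fp, md):
    """Precision, recall and F1 from TP, FP and missed detection (MD) counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + md) if tp + md else 0.0      # MD used as false negatives
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, md=20)
hit = is_true_positive((12, 11), (8, 8, 14, 14), max_dist=5)
miss = is_true_positive((30, 30), (8, 8, 14, 14), max_dist=5)
```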

Experiments to Demonstrate the Proposed Frameworks
In this section, we will include some experimental results to illustrate the importance of some critical modules. Figure 10 shows a few frames in the 3500 m video (daytime) even though the frames look very dark.

Baseline Performance Using Direct Subtraction
Although the results in Figure 1 appear to show that the moving target can be detected using direct subtraction, there are actually many false positives in different places. To quantify the performance of direct subtraction, we performed several experiments using 300 frame pairs from the 3500 m video.
Here, we briefly describe how the 300 frame pairs were generated. A pre-image was selected every five frames, and the corresponding post-image was the 15th frame after the pre-image.
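The pairing rule can be sketched in a few lines of Python. This is only an illustration: zero-based frame indices, starting at the first frame, and the function name `make_frame_pairs` are assumptions, while the step of 5, the separation of 15, the 300 pairs, and the 1800 frames per video come from the text.

```python
def make_frame_pairs(num_frames, step=5, separation=15, max_pairs=300):
    """Return (pre_index, post_index) pairs of frame indices.

    A pre-image is picked every `step` frames, and the post-image is the
    frame `separation` frames later. Indices are zero-based (assumption).
    """
    pairs = []
    pre = 0
    while pre + separation < num_frames and len(pairs) < max_pairs:
        pairs.append((pre, pre + separation))
        pre += step
    return pairs


pairs = make_frame_pairs(num_frames=1800)
first_pair = pairs[0]   # (0, 15)
last_pair = pairs[-1]   # (1495, 1510)
```

With these settings, the 300 pairs span frames 0 to 1510, so they fit comfortably within an 1800-frame video.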

The first experiment performed direct subtraction without any other processing steps in the workflows. Table 1 summarizes the results. The false positives greatly outnumber the true positives. In the second experiment, we used direct subtraction as the change detection module in the standard workflow shown in Figure 2; the rest of the workflow remained the same. The detection results are shown in Table 2. It can be seen that direct subtraction worked quite well in this setting. In the third experiment, we excluded the LIG module from the standard workflow. The results are shown in Table 3. The performance dropped quite significantly, which means that the LIG module plays an important role in the workflow.
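To make the baseline concrete, the following is a minimal sketch of direct subtraction on a frame pair, written in plain Python for self-containment (an array library would normally be used). The use of the absolute difference, the fixed `threshold`, and the function names are assumptions for illustration; the paper does not specify how the difference image was binarized in the first experiment.

```python
def direct_subtraction(pre, post):
    """Absolute per-pixel difference of two equally sized grayscale frames.

    Frames are lists of rows of 8-bit intensities (0 to 255).
    """
    return [
        [abs(b - a) for a, b in zip(row_pre, row_post)]
        for row_pre, row_post in zip(pre, post)
    ]


def binarize(diff, threshold):
    """Mark pixels whose change exceeds a fixed threshold (assumed value)."""
    return [[1 if value > threshold else 0 for value in row] for row in diff]


# Toy 2 x 3 frames: one strongly changed pixel (a moving target) and one
# moderately changed pixel (for example, clutter or turbulence).
pre_frame = [[10, 10, 10],
             [10, 10, 10]]
post_frame = [[10, 200, 10],
              [12, 10, 40]]

diff = direct_subtraction(pre_frame, post_frame)
mask = binarize(diff, threshold=20)
```

In this toy case both the target pixel and the clutter pixel exceed the threshold, which mirrors the observation that direct subtraction alone produces many false positives.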

Importance of LIG in the Full Standard and Alternative Workflows
From Section 3.3.1, we observed that LIG played a very important role in object detection when simple direct subtraction was used for change detection. It is also important to demonstrate the importance of LIG in the full workflows containing more sophisticated change detection algorithms. When there is no LIG, the two workflows are actually the same. In the change detection module, we compared three change detection algorithms. We performed an experiment using 300 frame pairs from the 3500 m video. Table 4 shows the detection results without using LIG. It can be seen that there is more than one detection per frame and that there are many false positives. This is similar to Table 3, where direct subtraction was performed. Comparing Tables 3 and 4, we can observe the following. First, ACD has the fewest false positives in this case. Second, all three change detection methods performed better (fewer FP) than direct subtraction.
Table 5 summarizes the results with LIG in the standard workflow and the alternative workflow, respectively. We can see that the standard approach performed much better than the alternative approach, which simply has more false positives. We think that one possible reason for the better results of the standard workflow is the location of the LIG module. It should be noted that LIG contains an adaptive thresholding step. In the standard workflow, this thresholding is done at a later stage, whereas in the alternative workflow LIG is applied at an early stage. We believe that, since thresholding is a hard decision step, a wrong decision in the thresholding may cause additional wrong decisions in the subsequent steps. Hence, it is better to delay the thresholding until a later stage of the workflow.

Detection Results for 4000 m and 5000 m Videos
Here, we summarize additional experiments on the 4000 m and 5000 m videos using the full standard and alternative workflows. Figure 11 shows a few frames from the raw 4000 m video. The targets are quite small and it is hard to visually identify any potential targets in the scene. Table 6 summarizes the detection results using the standard and alternative workflows. It can be seen that the standard approach has far fewer false positives in two out of three cases. Moreover, in the 4000 m video, CC and CE performed slightly better than ACD.
Figure 12 shows a few frames from the 5000 m video. The target is even smaller than at the other ranges. Table 7 summarizes the detection results using the standard and alternative workflows. One can observe that there are more false positives and missed detections. This is understandable as the target size is so small. Moreover, the standard workflow performed better than the alternative workflow in two out of three cases. However, the alternative workflow has better results in the CC case.

Additional Investigations Using EMAP and LCE
Here, we summarize and compare the detection results using synthetic and original bands. The videos range from 3500 m to 5000 m.

Results for the 3500 m Video
As shown in Table 8, the results obtained with the original single band images are significantly stronger than those obtained with the synthetic bands, especially when compared against EMAP. In every case, the single band approach is stronger at 3500 m.
Table 8. Detection results of the standard approach with 15 frame separation comparing the single band approach to the multi band synthetic approach. The target is at a distance of 3500 m. LCE5 has 5 bands. EMAP has 11 bands. Bold numbers indicate the best performing method in each column.

Results for the 4000 m Video
The results below were obtained using the standard approach. As shown in Table 9, the detection results at 4000 m were very good for all cases. We also observe that the EMAP results with ACD and CE are comparable to those using the original video.
Table 9. Detection results of the standard approach with 15 frame separation using the original single band frames. The target is at a distance of 4000 m. LCE5 has 5 bands. EMAP has 11 bands. Bold numbers indicate the best performing method in each column.

Results for the 5000 m Video
As shown in Table 10, the single band approach had 10 false detections for all three CD methods. EMAP performed slightly worse, especially with the CC change detection method. The LCE5 approach with ACD and CE shows improvement over the single band case. In an attempt to improve the EMAP results, we tried modifying the vector value and the attribute value, but no changes to those values yielded any significant improvement.
The detection figures, up to Figure 15, show some detection results for the three long ranges. Since the detection involves two frames, we denote the current frame as the reference frame. The other frame is 15 frames before the reference frame. The ground truth target location, correct detection locations, and false positive locations are overlaid on the reference frame. A green box highlights a true detection, a blue box highlights the ground truth bounding box, and a red box highlights a false positive.

Computational Times
As can be seen in Table 11, the most time-consuming module in the standard approach is the LIG module, which takes about 70 s per frame. The alternative approach contains two LIG modules and hence takes about 140 s per frame pair. Since the bottleneck is LIG, there are several potential ways to speed up LIG that can be pursued in the future. First, since LIG is a local approach that performs object detection window by window, one feasible option is to use a graphics processing unit (GPU). A typical GPU has several thousand processors. If done properly, each processor can handle a small window, and a significant speed up can be achieved. Second, one can analyze the LIG algorithm closely and optimize its implementation. Third, if LIG needs to be implemented in hardware, a field programmable gate array (FPGA) can be utilized. Based on our understanding, FPGAs can also execute parallel processing tasks.
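A rough back-of-the-envelope estimate shows what these timings imply for an experiment of the size used in this paper. The sketch below assumes that only the LIG time matters (the other modules in Table 11 are ignored) and that the standard workflow runs LIG once per frame pair and the alternative workflow twice; the function name `lig_hours` is hypothetical.

```python
def lig_hours(num_pairs, seconds_per_lig_call, lig_calls_per_pair):
    """Total LIG time in hours for a given number of frame pairs."""
    return num_pairs * seconds_per_lig_call * lig_calls_per_pair / 3600.0


# 300 frame pairs, about 70 s per LIG call (figures quoted in the text).
standard_hours = lig_hours(300, 70, 1)     # about 5.8 hours
alternative_hours = lig_hours(300, 70, 2)  # about 11.7 hours
```

Under these assumptions, LIG alone accounts for several hours of processing per 300 frame pairs, which is why the speed-up options above focus on that module.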

Here, we compare the performance of the proposed algorithm (the standard workflow with the CC change detection method) with two conventional algorithms. One is based on frame by frame detection [7] and the other is based on optical flow [34]. Details can be found in [7,34]. Table 12 summarizes the detection metrics for the 3500 m, 4000 m, and 5000 m videos. One can see that the proposed method has comparable or better detection results than the two other methods.

Conclusions
In this paper, we presented two approaches for small moving target detection in long range infrared videos. Both approaches are unsupervised, modular, and flexible frameworks. The frameworks were motivated by change detection algorithms in remote sensing. It was observed that change detection algorithms performed better than direct subtraction. Another observation is that the standard approach performed better than the alternative approach in most cases. The most influential module is the LIG detection module, which can detect small targets quite effectively. We also experimented with synthetic band generation algorithms and saw some positive impacts at longer ranges, namely in the 4000 m and 5000 m videos.
One limitation of the current approaches is that the computational time is too long. Fast implementation using GPU and FPGA will be explored in the near future. It is also noted that there are some recent advances in target detection in remote sensing images [49][50][51] that may have great potential in infrared images/videos.