Global Motion-Aware Robust Visual Object Tracking for Electro Optical Targeting Systems

Although recently developed trackers have shown excellent performance even when tracking fast moving and shape changing objects with variable scale and orientation, trackers for electro-optical targeting systems (EOTS) still suffer from abrupt scene changes due to frequent and fast camera motions caused by pan-tilt motor control or dynamic distortions in field environments. Conventional context aware (CA) and deep learning based trackers have been studied to tackle these problems, but they neither fully overcome them nor avoid a considerable computational burden. In this paper, a global motion aware method is proposed to address the fast camera motion issue. The proposed method consists of two modules: (i) a motion detection module based on the change in the image entropy value and (ii) a background tracking module that tracks a set of features in consecutive images to find correspondences between them and estimate the global camera movement. A series of experiments was conducted on thermal infrared images, and the results show that the proposed method can significantly improve the robustness of all trackers with minimal computational overhead. We show that the proposed method can be easily integrated into any visual tracking framework and can be applied to improve the performance of EOTS applications.


Introduction
Visual tracking is one of the core problems in computer vision. The main task of short term visual tracking is to localize the target in consecutive frames of a video. Recently, visual tracking has received much attention from researchers, resulting in significant improvements of the tracking algorithms. These improvements are reflected in the large number of tracking benchmarks [1][2][3][4][5][6][7]. One subfield of visual tracking is thermal infrared (TIR) tracking, which is less developed than RGB based short term tracking. In this paper, we focus on hybrid RGB + TIR images, which are common in image sensors for electro-optical targeting systems (EOTS). As a matter of fact, many defense and security applications such as helicopters and armored vehicles are integrated with EOTS sub-systems. These systems have stabilizing inner gimbals to achieve high object detection and tracking performance at night and in dynamic motion environments.
Thermal infrared images have outstanding advantages over standard RGB based imaging systems, i.e., the lack of light and reflections is not very problematic in TIR, while on the other hand, there is less color information, which can be useful for robust tracking. These properties give TIR images great application potential, especially in surveillance and object tracking missions. A typical tracking scenario in TIR is the tracking of an object that is far away and therefore small in the image. Another property is that TIR cameras are static most of the time, but when they are used to search for and continuously observe a target at long range and high magnification, the scene moves fast and suddenly, which causes significant camera movement and blurred images. The TIR sensor receives a relatively small amount of photon energy compared to the RGB sensor; therefore, its integration time for accumulating photon energy and producing clear images is longer than the integration time of the RGB sensor in the readout integrated circuit (ROIC). TIR images are therefore easily blurred in the event of severe camera movements due to the longer integration time compared to general imaging sensors. As a result, trackers on TIR images often fail due to the image blur caused by fast camera motion. Most of the existing trackers do not explicitly address fast camera motion, and they use a fixed size search region. As shown in Figure 1a, if the target leaves the search region due to significant camera motion, the tracker fails to keep tracking it continuously. This is more problematic for TIR images, especially in real EOTS applications, since motion blur occurs more often than in RGB tracking due to the sensitivity of the TIR detector and its severe operating environment. The main contribution of the paper is a new method for global camera motion estimation that can be incorporated into most of the existing tracking algorithms.
Figure 1b shows the conceptual idea of the proposed method. We extend four conventional fast correlation filter based trackers with the proposed method, and we show the performance boost in terms of robustness for all compared methods, with a negligible complexity overhead. All methods are tested on the recent Visual Object Tracking Challenge 2019 (VOT2019) RGB + TIR datasets, which are not only the most challenging infrared tracking datasets, but also the published infrared datasets most similar to real EOTS applications [8]. Additionally, we perform extensive experiments on the PTB-TIR and RGB-T234 benchmarks [9,10]. In particular, the RGB-T234 object tracking benchmark contains 234 thermal infrared sequences in total, 89 of which involve fast camera motion, in contrast to PTB-TIR, which has 17 sequences with moderate camera motion. We also apply the proposed tracking method to a real EOTS under fast camera motion conditions for validation. The results show that the proposed method can improve the robustness of an EOTS and yields improvements in the intersection over union (IOU) and center error compared to conventional tracking methods [11,12].

Related Works
Correlation filter trackers achieved state-of-the-art results in visual tracking tasks in recent years. In the following, we review some popular tracking methods based on correlation filters and describe how the field developed in the last few years. The discriminative correlation filter (DCF) methodology was introduced by Hester and Casasent [13]. It was successfully applied to the localization task in visual tracking by Bolme et al. [14], who introduced the minimum output sum of squared error (MOSSE) correlation filter and achieved state-of-the-art results at the remarkable tracking speed of a few hundred frames per second. The method was improved by introducing kernels [15] and the multi-channel formulation of DCFs [16]. Danelljan et al. [17] proposed a scale estimation method formulated as a one-dimensional correlation filter over the scale dimension. The advancements and performance improvements of these methods were not only methodological; they also incorporated more complex features such as color names [18], HoG [19], and even CNN based features [3]. Liu proposed a correlation filter based ensemble tracker with multi-layer convolutional features for TIR tracking and found that features from the convolution layers were more effective than those from a fully connected layer for thermal infrared tracking [20]. Li et al. proposed a TIR tracker via a hierarchical spatially aware Siamese CNN to obtain both spatial and semantic features of the TIR object; the tracker was designed as a Siamese CNN that combined multiple hierarchical convolutional layers [21]. Li et al. proposed a target aware deep tracking framework integrating a Siamese CNN and target aware features [22]. Liu proposed a tracker that performed a prediction using a given threshold by providing a template update method as a score function between a candidate group and a template [23,24].
This tracker had an effective observation module, which could deal with occasional large appearance variation or severe occlusions. These state-of-the-art trackers have successfully introduced various advanced methods to overcome the limitations of conventional trackers using TIR and RGB images. However, with respect to the application of an embedded tracking module in EOTS, these methods require great computing power with parallel processors, and their processing time, at roughly 3∼10 fps, is not fast enough.
According to the development of deep learning methods, in order to improve tracking accuracy and robustness under the conditions of dynamic objects or camera motions, Danelljan et al. [25] proposed a method of applying a deep RGB feature and a deep motion feature. However, operational conditions with abrupt camera motion caused images so severely blurred that the object and the background could not be distinguished; as a result, the motion feature map could expand and saturate up to the global search area, and finally, the coarse localization of the target could not be estimated. Risse et al. [26] proposed a compensation method to find the localization of the target in sequence frames using the RANSAC approach. This method had a limitation with respect to correcting the position of the target in motion blur situations under fast camera motion cases where the target and background are not distinguished. Zhu et al. [27] proposed a framework that used the distractor aware approach to reduce the response of the background rather than the object so that the response score of the object became apparent. However, the semantic negative pairs had a limitation in that they could not find the distractor when fast motion blur occurred. In addition, these methods used pre-trained deep learning models; although their accuracy and robustness were outstanding compared to general DCF based trackers, both could be drastically reduced depending on the quality of the pre-training datasets, as well as variation in new test environments. Furthermore, they were difficult to apply to real-time EOTS applications since the frame rate was too slow for conventional compact embedded systems. The most useful and high performance object tracking methods for real-time EOTS applications are advanced DCF baseline trackers such as discriminative scale space tracking (DSST) [28].
The following authors made advancements in the size of the search region in correlation filters. Danelljan et al. [29] proposed a method that introduced a penalty function that penalized large filter values far away from the target region. Such a filter can be larger while preventing the background from having a large effect. A similar issue was addressed by Galoogahi et al. [30], who formulated constrained learning within the target region only, and by Lukezic et al. [31], who used a binary mask obtained as color segmentation to constrain the filter. Mueller et al. [5] presented a method, the context aware (CA) correlation filter tracking, to train the correlation filter robustly, by taking into account negative samples. On the other hand, Bertinetto et al. [32] proposed a method (STAPLE) to improve the localization of the correlation filter by combining correlation response and color segmentation in the localization step. All described existing correlation filters localized the target within a search region of a limited size, which was defined by the size of the filter. The search region was centered on the target position from the previous frame. This limitation could cause tracking failure, which could not be recovered in the event of significant camera motion. In a review paper, Li et al. [33] showed that most of the trackers are not guaranteed to function properly with respect to fast camera motions, and in particular, representative DCF trackers, including DSST, showed the most degradation of performance in motion blur (MB) conditions. In this work, we address the fast camera movement issue with a separate background tracking method, which reduces the impact of large displacements between the consecutive frames. We use a motion blur detection method when the camera moves fast, and the global motion is found from the time duration of motion blur frames. 
For motion blur detection and feature tracking, the blur detection method should be reliable and accurate [34]. Ben-Ezra et al. proposed a motion blur detection method using a hybrid camera system and a point spread function (PSF) calculation to detect motion blur and improve the accuracy of the object tracking algorithm [35]. Cho et al. proposed a fast deblurring method that produces a deblurring result from a single image at 0.2∼5 fps depending on the image size [36]. In contrast, our method is general and can be incorporated into any tracking method, including DCF trackers. It can especially contribute to improving the robustness of fast DCF based trackers without much additional computational complexity.

Methods
EOTS applications with thermal imaging sensors mounted on aircraft, drones, or battle tanks operate under conditions of static background images because of the long surveillance distance. However, they must perform observation missions under conditions of complex target movement, as well as fast camera motions. Although their background is simpler than that of general near-field cameras, many conventional trackers fail to track the target robustly because the images are significantly blurred. We are motivated to detect this blurring phenomenon using the entropy value and to find the fast camera motion vector effectively to improve the performance of trackers. In this work, we propose a global camera motion estimation framework consisting of the gradient of the entropy sensor (GES) and background tracking (BT), as shown in Figure 2. The GES component (Section 3.1) detects camera motion based on the change of the gradient of the image entropy, and it triggers the BT component (Section 3.2), which estimates the global fast camera movement.

Gradient of the Entropy Sensor
In tracking methods using TIR images, tracking failure often happens when the images are blurred and the shape of the objects becomes unclear. The image signal processor (ISP) of the thermal image sensor includes a process of setting the integration time for accumulating photon energy due to the low energy level of TIR detectors. Because of this integration time, image blur occurs when the camera moves fast. In general, the image blur of a thermal image can be detected by low values of the entropy [37]. We used the phenomenon of motion blur as a sensor signal for fast camera motion detection because thermal imaging sensors require a long integration time and motion blur therefore occurs easily. In general, the motion blur of images is measured quantitatively using the point spread function (PSF) and the discrete entropy (DE) [35]. The PSF yields the level and direction of motion blur as a vector, but requires complex two-dimensional convolution operations. On the other hand, the entropy yields a quantitative value only for the level of motion blur of the image, and its calculation is fast due to a simple one-dimensional summation. However, depending on the complexity of the background scenes, the absolute value of the entropy shows large variation. This variation affects the accuracy of blur detection and is problematic. Therefore, we propose the gradient of the entropy sensor (GES) method for fast camera motion detection without the variation problem. In summary, GES is a value that quantitatively detects the change in image blur caused by fast camera motion, and it serves as a sensor signal that detects the timing of fast camera motions in image sequences in real time. Entropy was introduced in information theory by Claude Elwood Shannon in 1948 as a metric for the uncertainty of information [38].
It has been primarily used in data analysis and communication systems to calculate the minimum number of bits required for lossless compression of information. Denoting A as a finite set of M_A possible states, i.e., A = {a_j}, j = {1, ..., M_A}, the Shannon entropy H(A) is defined as:

H(A) = −Σ_{j=1}^{M_A} p(a_j) log₂ p(a_j), (1)

where p(a_j) denotes the probability of the state a_j. In the field of computer vision, the entropy is also used to analyze how well image quality is maintained or improved [39,40]. To obtain the entropy E_I of an image I, the probability p(a_j) is replaced by pdf(I_i), calculated as:

pdf(I_i) = |I_i| / n, (2)

where I_i represents a specific intensity level in the image and |I_i| is the number of pixels with this intensity. The number of possible intensity values in the image is denoted as L; e.g., in the case of an eight bit grayscale image, L = 256. n is the total number of pixels in the image I. The image entropy E_I is finally calculated as:

E_I = −Σ_{i=1}^{L} pdf(I_i) log₂ pdf(I_i), (3)

and it can be used to detect the image blurring caused by camera motion. However, since different sequences contain different scenes and therefore different E_I levels, it is difficult to detect camera motion robustly by E_I alone. Therefore, we calculate the temporal derivative ΔE_I of the image entropy E_I as:

ΔE_I = E_I^t − E_I^{t−1}, (4)

where the current and previous time steps are denoted as t and t − 1, respectively, and E_I^t and E_I^{t−1} are the image entropy at the current and previous time steps. As examples of using the variation of entropy for motion blur detection, Jiadong et al. proposed a noise robust motion compensation method using parametric minimum entropy optimization [34], and Shuigen et al. proposed a gradient magnitude distribution based no-reference image blur assessment method [41]. Considering visual tracking using two modalities, e.g., thermal infrared (IR) and color (RGB) images, the temporal derivatives of the image entropy are denoted as ΔE_IR and ΔE_RGB, respectively.
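As an illustration, the image entropy of Equation (3) and its temporal gradient (Equation (4)) can be computed in a few lines; this is a minimal NumPy sketch, and the function names are ours:

```python
import numpy as np

def image_entropy(img, levels=256):
    """Shannon entropy E_I of a grayscale image (Equation (3))."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    pdf = hist / img.size                # pdf(I_i) = |I_i| / n  (Equation (2))
    pdf = pdf[pdf > 0]                   # skip empty bins to avoid log2(0)
    return float(-np.sum(pdf * np.log2(pdf)))

def entropy_gradient(e_curr, e_prev):
    """Temporal derivative dE_I = E_I^t - E_I^(t-1) (Equation (4))."""
    return e_curr - e_prev

# A uniform (fully blurred) frame has zero entropy; a two-level
# checkerboard frame carries exactly one bit of entropy.
uniform = np.full((8, 8), 128)
checker = np.indices((8, 8)).sum(axis=0) % 2 * 255
```

A sharp, textured frame has a high E_I; when motion blur sets in, E_I drops and the gradient ΔE_I becomes strongly negative, which is the signal GES exploits.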
The two modalities are captured using different sensors, which have different optical properties and sensitivities; therefore, they are considered separately to detect camera motion:

GES(t) = 1 if (ΔE_IR < α_IR ∨ ΔE_RGB > α_RGB) ∧ (R(t) < τ_R), and GES(t) = 0 otherwise, (5)

where R(t) is the correlation response of the tracker, described below. The main influential sensitivities of the motion blur are the integration time and the frame rate. The integration time improves the quality of the thermal image and generates blurring [42,43]. GES(t) represents the trigger for camera motion in frame t, and α_IR and α_RGB are sensor specific thresholds, which are adjusted and optimized according to the sensitivity of the sensors. In more detail, the two parameters are initialized in relation to the integration time and the frame rate, as shown in Equation (6). The integration time is also referred to as the exposure time or the readout time of the charged photons. However, the motion blur also changes depending on the optics, the ROIC, and the image transmission and compression processes. Therefore, the final α_IR and α_RGB values should be fine tuned and verified from these initial values through an ablation test based on the actual output images:

α_IR ∝ −T_IR,  α_RGB ∝ T_RGB, (6)
where T_IR is initialized to the integration time of the IR sensor and T_RGB is initialized to the frame rate. For an uncooled bolometer, a long-wavelength infrared (LWIR) detector, the integration time is approximately 600 µs, so the threshold is set to α_IR = −0.06. For an RGB sensor with a frame rate of 50 Hz (20 ms), the threshold is set to α_RGB = 0.01. We observed that the correlation response can also be a good indicator of when the image is significantly blurred. The correlation response of the tracker is denoted as R(t), and the correlation response threshold is τ_R = 0.2. Figure 3 shows how the GES is calculated on a few consecutive frames during which camera motion occurs.
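A minimal sketch of how the GES trigger might be implemented with the thresholds above; the exact logic for combining the three sub-signals is our assumption based on the description in the text, not the authors' exact formula:

```python
# Sensor-specific thresholds from the text; the combination logic below
# is an assumption sketched from the description.
ALPHA_IR = -0.06   # from T_IR ~= 600 us (uncooled LWIR bolometer)
ALPHA_RGB = 0.01   # from a 50 Hz (20 ms) RGB frame rate
TAU_R = 0.2        # correlation response threshold

def ges_trigger(delta_e_ir, delta_e_rgb, response):
    """Return True when fast camera motion (motion blur) is detected.

    delta_e_ir, delta_e_rgb: temporal entropy gradients of the IR / RGB frames.
    response: peak correlation response R(t) of the tracker (low when blurred).
    """
    entropy_event = (delta_e_ir < ALPHA_IR) or (delta_e_rgb > ALPHA_RGB)
    blurred = response < TAU_R           # R(t) rejects outlier entropy events
    return entropy_event and blurred
```

When the trigger fires, the background tracking (BT) module of Section 3.2 is invoked to estimate the global camera displacement.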
GES is a motion blur detector derived from the three sub-sensor signals ΔE_IR, ΔE_RGB, and R(t), as shown in Equation (5). An ablation study was performed to verify the accuracy of the parameter tuning and the motion blur detection for the GES optimization, as shown in Figure 4. The GES_ALL signal is the final detected GES value, which is compared to the ground truth of fast camera motion, GT_FM, and indicates the timing at which fast camera motion occurs. GES_ALL−ΔE_IR denotes the result without the ΔE_IR signal; GES_ALL−ΔE_RGB denotes the result without the ΔE_RGB signal; and GES_ALL−RES shows that R(t) supports removing outlier GES signals. This ablation study shows that the most accurate performance is achieved when the three sub-signals collaborate. Figure 3. Example of the camera movement detection (camera movement happens between frames #64 and #66, bordered in green). During that period, ΔE_RGB and ΔE_IR are increased, and R is significantly lower, which triggers the camera motion event.

Background Tracking
During camera motion, the target position in the image can change significantly, which can cause the target to leave the fixed size search region and makes it impossible for the tracker to localize it. To address this issue, we propose a method to estimate the global camera movement and to use it to move the position of the search region of the tracker. The background tracking (BT) framework estimates the translation of the global camera motion from two consecutive frames f_{t−1} and f_t when the GES trigger (Section 3.1) detects camera motion. Figure 5 shows the comparison of the proposed tracking method and the conventional tracking method, which does not estimate global camera motion. The BT framework consists of two modules: feature detection and feature tracking.
Feature detection: In the first step of the BT process, features are detected on the two consecutive frames. The cross-correlation method can be used as a conventional approach to detect features on two consecutive image frames and to track the background. However, when the cross-correlation method is used to detect the background motion of a blurred scene, the convolution of the image has to be computed several times over the entire range, which is very slow. The proposed tracker is required to be fast so that it can be applied to the DCF based trackers of the EOTS in real time. Hence, we employed and modified a feature detector with fast speed and accurate feature point detection performance on blurry images. Dan et al. compared four feature detection methods, Shi-Tomasi, Harris-Stephens-Plessey, SUSAN, and FAST [44]; the Shi-Tomasi method outperformed the other methods in detecting feature points and showed the fastest speed. Fenghui et al. showed that the Shi-Tomasi method was a more reliable and faster feature detector than other methods for moving IR camera applications [45]. Therefore, for efficient feature detection on blurred images, we used the Shi and Tomasi corner detection method [46,47], which is an extension of the Harris corner detection method [48]. First, the sum of squared differences (SSD) is calculated using the sliding window approach. The SSD for a shift (Δx, Δy) of a window W in the frame f is defined as:

SSD(Δx, Δy) = Σ_{(x,y)∈W} [f(x + Δx, y + Δy) − f(x, y)]², (7)

where the summation runs over the window W within which the SSD is calculated. Equation (7) can be linearly approximated using:

f(x + Δx, y + Δy) ≈ f(x, y) + f_x(x, y)Δx + f_y(x, y)Δy, (8)

where f_x(x, y) and f_y(x, y) are the partial derivatives of the image in the x and y directions. The derivation can be further expanded into the following form:

SSD(Δx, Δy) ≈ Σ_{(x,y)∈W} [f_x(x, y)Δx + f_y(x, y)Δy]², (9)

and finally written in the matrix form:

SSD(Δx, Δy) ≈ [Δx, Δy] H [Δx, Δy]^T, where H = Σ_{(x,y)∈W} [[f_x², f_x f_y], [f_x f_y, f_y²]].

A point is accepted as a corner feature when its score exceeds the quality threshold:

min(λ1, λ2) > q, (10)

where λ1 and λ2 represent the eigenvalues of the matrix H.
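The Shi-Tomasi corner score min(λ1, λ2) over the gradient matrix H can be sketched with plain NumPy; this is an illustrative implementation, not the optimized detector used in the system:

```python
import numpy as np

def shi_tomasi_score(patch):
    """Shi-Tomasi corner score min(lambda1, lambda2) of the 2x2 matrix H
    built from the image gradients f_x, f_y over a patch."""
    fy, fx = np.gradient(patch.astype(float))   # partial derivatives f_y, f_x
    h = np.array([[np.sum(fx * fx), np.sum(fx * fy)],
                  [np.sum(fx * fy), np.sum(fy * fy)]])
    eigvals = np.linalg.eigvalsh(h)             # ascending: lambda1 <= lambda2
    return eigvals[0]                           # minimum eigenvalue

# A corner (two edges meeting) scores high; a flat patch scores zero.
corner = np.zeros((9, 9)); corner[4:, 4:] = 255.0
flat = np.zeros((9, 9))
```

An edge-only patch has gradient energy in one direction, so one eigenvalue of H is near zero and the score stays low; only true corners (both eigenvalues large) pass the threshold q.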
In the conventional Shi-Tomasi detection algorithm, q is a pre-defined constant serving as a quality threshold for detecting feature points. We observed that it is difficult to determine a single q for IR images since they are often poorly textured. Therefore, we propose an adaptive method that selects q in each frame separately, defining an adaptive threshold q_A as a function of the image entropy and a scale factor (Equation (11)). Note that q_A is the adaptive counterpart of the constant q in (10) and is denoted differently just for clarity. E_IR denotes the image entropy value (3) of the IR image. The scale factor (SF) can be adjusted according to image quality performance parameters, including the minimum resolvable temperature difference (MRTD) and the modulation transfer function (MTF).
We tested several SF values and selected SF = 2 × 10^4; see Figure 6 for the results. The SF value was set such that q_A = 0.01 if E_IR = 6, while q_A = 0.02 if E_IR = 7. q_A = 0.01 corresponds to extracting about 1% of the candidate feature points, while q_A = 0.02 reduces the extracted points to about 0.5%. Figure 7 shows two scenarios in which fast camera motion occurred. A stable number of detected feature points assures high accuracy of the background tracking. The variation in the number of detected feature points in Figure 7a,d is larger than that of the proposed detector in Figure 7c,f. As shown in Figure 6b, when different feature point detectors were applied to the baseline tracker, the results provide a quantitative comparison of the precision and success rate of the trackers. The conventional Shi-Tomasi feature detector was evaluated with a low quality threshold (constant q = 0.01) and a high quality threshold (constant q = 0.05), while the proposed tracker used the adaptive q_A with the SF = 2 × 10^4 based feature detector. The proposed tracker showed the best performance on both precision and success plots compared to the original feature detection method. The optimal SF value was inversely proportional to the MRTD and MTF performance of the IR camera and was tuned through experiments.

Kanade-Lucas-Tomasi (KLT) feature matching [49]: Global camera motion between two consecutive frames f_{t−1} and f_t is denoted as d = [d_x, d_y]^T. Feature points in frame t, denoted as u_t = [u_t^x, u_t^y]^T, can be expressed using the feature points u_{t−1} from frame t − 1 as:

u_t = u_{t−1} + d. (12)

The KLT method estimates the final camera motion d by iteratively minimizing the dissimilarity between the feature points detected in frame t − 1 and their displaced counterparts in frame t.
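Once KLT has produced matched feature points, the global translation d can be estimated and used to shift the tracker's search region. The sketch below is our simplification: a robust median over already-matched point pairs stands in for the full iterative KLT estimate:

```python
import numpy as np

def estimate_global_motion(pts_prev, pts_curr):
    """Global translation d = [dx, dy] between two frames, taken as the
    median displacement of matched feature points (robust to outliers).
    In the paper the matches come from KLT; here they are given."""
    return np.median(pts_curr - pts_prev, axis=0)

def shift_search_region(center, d, frame_shape):
    """Move the tracker's search-region center by d, clamped to the image."""
    h, w = frame_shape
    cx = min(max(center[0] + d[0], 0), w - 1)
    cy = min(max(center[1] + d[1], 0), h - 1)
    return (cx, cy)
```

Shifting the search-region center by d before the DCF localization step is what allows the fixed size search region to "follow" the camera, so the target stays inside it even after an abrupt pan or tilt.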

Experimental Results
In this section, we present the experimental results of the proposed method for the estimation of global camera motion. We extend four existing DCF based trackers, described in Section 4.1, with the proposed global camera motion estimation module (GESBT) framework. All tested trackers were modified so that GESBT was applied before their target localization process. The evaluation methodology of the experimental validation is described in Section 4.2. The quantitative results on the VOT2019 RGB-TIR datasets are described in Section 4.3. The qualitative results on the VOT2019 RGB-TIR datasets, comparing the performance of the proposed tracker with others through visualization, are presented in Section 4.4. The quantitative and qualitative results on real EOTS datasets are shown in Section 4.5.

Baseline Trackers
In the experimental comparison, we modified the four existing trackers, described in the following, with the proposed GESBT method. (i) MOSSE_CA [5] is a visual tracking method, which extends the well known MOSSE [14]. It uses simple (grayscale) features only and achieves remarkable robustness.
(ii) STAPLE [32] is a combination of the HoG based DSST [28] tracker and a color segmentation method [50]. (iii) STAPLE is further extended with the context aware framework [5] and denoted as STAPLE_CA. (iv) DFPReco [3] is an extension of the ECO tracker [51], adding a part based formulation to the holistic tracker. The main features and release dates of these baseline trackers are shown in Table 1.

Evaluation Methodology
The VOT2019 RGB-TIR dataset (RGB and thermal infrared), consisting of 60 tracking sequences, was used in all of the experiments. Since the sequences contain several camera motion events, the dataset is well suited to demonstrate the performance boost of the proposed method. The standard VOT short term reset based methodology [4] was used to evaluate the tested trackers. A tracker was initialized at the beginning of the sequence, and the overlap of the predicted region with the ground truth was calculated in each frame. When the overlap dropped to zero, the tracker was considered to have failed, and it was re-initialized in the following frames. The VOT methodology measures tracking performance using two basic measures: accuracy, which represents the average overlap, and robustness, which is measured as the average number of failures. The expected average overlap (EAO) is a combination of accuracy and robustness calculated on an average short term tracking sequence. Furthermore, we used the one pass evaluation (OPE) method for extensive experimental validation [6] to demonstrate the effectiveness of the proposed method. All trackers were run on the same workstation (a single Intel CPU i7-7700 3.6 GHz, 32 GB RAM) using MATLAB.
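For illustration, the reset-based measures can be sketched from a sequence of per-frame overlaps. This is a simplification of the VOT protocol; the real toolkit also skips a number of burn-in frames after each re-initialization:

```python
def reset_based_measures(overlaps):
    """Simplified VOT-style evaluation from per-frame IOU overlaps.

    robustness: number of failures (frames where the overlap drops to zero).
    accuracy:   mean overlap over the remaining (non-failed) frames.
    (The real VOT toolkit also excludes burn-in frames after each reset.)
    """
    failures = sum(1 for o in overlaps if o == 0.0)
    valid = [o for o in overlaps if o > 0.0]
    accuracy = sum(valid) / len(valid) if valid else 0.0
    return accuracy, failures
```

A tracker extended with GESBT should mainly improve the failure count (robustness), since the search-region shift prevents target loss after abrupt camera motion.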

Quantitative Results
The accuracy and robustness of the baseline and modified trackers are shown on the AR graph in Figure 8a and the EAO plot in Figure 8c. The proposed method (GESBT) improved the robustness of all trackers. Table 2 shows the improvements of all methods in terms of robustness. STAPLE_GESBT improved the failure rate of the baseline version by 3.92%, reducing approximately four failures on the whole dataset. The failure rates of MOSSE_CA and DFPReco were improved by GESBT by 2.92% and 2.87%, respectively.
The VOT2019 RGB-TIR dataset had per-frame annotations of the visual attributes, and one of them was camera motion. We compared the trackers under this attribute only and show the results on the AR plot in Figure 8b. The proposed GESBT improved the tracking robustness even further under this attribute, which was an expected result.
Additionally, the PTB-TIR and RGB-T234 object tracking benchmarks contain various thermal infrared sequences [9,10]. Specifically, RGB-T234 has 89 sequences containing fast camera motion; in contrast, PTB-TIR has 17 sequences with slow camera motions. In order to verify the effectiveness of the proposed method, we carried out extensive experimental validation using the 234 RGB-T sequences and the 89 sequences with fast camera motion, as shown in Figure 9.

Speed analysis: The recently published CA method claimed a speed two to six times faster than the compared target-adaptive counterpart (AT) method, running at about half the speed (50%) of its baseline [5]. Measurements are presented in Figure 10. The proposed method did not significantly reduce the speed compared to the CA method. The average speed of all methods is presented in Table 2. It shows that applying CA to the baseline STAPLE reduced the speed by 29%, while applying GESBT reduced the speed by only 8%. For a more detailed speed analysis, we calculated the speed in frames per second (FPS) for three randomly selected sequences from VOT2019 RGB-TIR for all four trackers and their modifications.

Figure 11 shows qualitative results for the baseline trackers and the GESBT based extensions of these methods under significant camera motion. In the first two rows, all baseline trackers lost the target in the frame after the camera motion occurred, whereas the trackers with the proposed GESBT were able to keep tracking the target successfully even after the camera motion. The third row shows that, although the baseline trackers were still tracking the target, the GESBT based methods tracked it more accurately.

Results of the EOTS Applications
In this section, we compare the performance of the proposed method with conventional trackers when applied to an EOTS product mounted on an aircraft, where actual fast and complex camera motion occurred. There were two main reasons for fast camera motions in an aircraft EOTS. First, fast camera motion occurred when the dynamic disturbance of the aircraft or vehicle exceeded the camera stabilization performance. In fact, during drastic aircraft maneuvers or vehicle disturbances on uneven roads, tracking performance reached its limits, a restriction recorded in the user operating instructions. Second, fast camera motion occurred when the camera motor was controlled by the joystick command. The operator tended to keep the target approximately in the center of the image; therefore, significant camera motion could happen when the camera position was adjusted. Such events often resulted in a tracking failure.
We used the experimental environment shown in Figure 12 to acquire the video sequences, which include fast motion profiles of the EOTS product. The acquired video sequences include camera motor movement in the azimuth and elevation directions and disturbances in the roll, pitch, and yaw directions applied by the six degrees of freedom (6-DOF) motion simulator. Figure 13 shows examples from the videos obtained by a hard mounted EOTS product in the experimental environment. Figure 13a-d includes the fast motions generated by the movement of the target and the camera motor control. The solid red line in the graph denotes the Euclidean (L2) distance from the center of the image to the center of the target. Figure 13e-h includes the target motion and the movement generated by the motion simulator. This movement occurred in the pitch, yaw, and roll directions; the vertical (heave), horizontal (sway), and straight (surge) linear movements were not observed because they were canceled by the parallax at a long distance of more than 200 m. Figure 14 shows the experimental results for the EOTS_parking and EOTS_6DOF_up_down datasets. The sequential images qualitatively show the continuous object tracking results of the two different trackers. The white bounding boxes represent the ground truth, the yellow boxes the results of the proposed tracker, and the other colors the conventional trackers. In Figure 14a,b, the solid red lines on the graphs show the camera motions, and the solid green lines show the intersection over union (IOU) measurement index of the proposed tracker, compared with the conventional tracker shown by the solid magenta lines [52]. The IOU measurement index function Φ(·) measures the overlap between the region predicted by a tracker and the ground truth region.
As shown in Figure 14, the IOU φ_t is calculated between the ground truth region R_t^G and the tracker region R_t^T; φ_t is measured over N frames and is expressed as Φ(Λ^G, Λ^T). In Figure 14a,b, the dotted lines denote the center error, calculated as the L2 distance between the center position of the ground truth and the center of the target region predicted by the trackers. The dotted green line is the proposed tracker, and the dotted magenta line is a conventional tracker (MOSSE_CA). All experimental results using the other trackers also showed improved robustness in tracking objects continuously. The proposed method showed the highest performance improvement ratio when applied with MOSSE_CA compared to the others.
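The overlap function Φ(·) is the standard intersection over union; a minimal implementation for axis-aligned boxes in (x, y, w, h) format might look like:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extents along each axis (zero if the boxes are disjoint).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IOU of 1, disjoint boxes give 0, and partial overlaps fall in between, which is what the solid green and magenta curves in Figure 14 plot per frame.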

Conclusions
In this study, we proposed a global motion aware method that can be applied to improve the performance of visual object tracking algorithms in real-time applications. The method consists of a camera motion detection module based on the gradient of the entropy sensor and a background tracking module based on feature tracking. The global camera motion is estimated and used in the target localization step. Compared with the existing CA method, the robustness of the proposed method increased especially when camera motion occurred, while the additional computational complexity was very low. We expect that this method will motivate researchers to study the limitations of trackers in thermal IR based electro-optical systems operated in real field environments. Future work includes the incorporation of deep CNN features for the estimation of the global motion and formulating the problem as an end-to-end training task.

Conflicts of Interest:
The authors declare no conflict of interest.