Moving People Tracking and False Track Removing with Infrared Thermal Imaging by a Multirotor

Abstract: Infrared (IR) thermal imaging can detect the warm temperature of the human body regardless of the light conditions; thus, small drones equipped with an IR thermal camera can be utilized to recognize human activity for smart surveillance, road safety, and search and rescue missions. However, the unpredictable motion of the drone poses more challenges than a fixed camera. This paper addresses the detection and tracking of people through IR thermal video captured by a multirotor. For object detection, each frame is first registered with a reference frame to compensate for its coordinates. Then, the objects in each frame are segmented through k-means clustering and morphological operations. Falsely detected objects are removed considering the actual size and shape of the object. The centroid of the segmented area is taken as the measured position for target tracking. The track is initialized with two-point differencing initialization, and the target states are continuously estimated by the interacting multiple model (IMM) filter. The nearest neighbor association rule assigns the measurement to the track. Tracks that move slower than the minimum speed are terminated by the proposed criteria. In the experiments, three videos were captured with a long-wave IR band thermal imaging camera mounted on a multirotor. The first and second videos captured eight pedestrians on a pavement and three hikers on a mountain, respectively, on winter nights. The third video captured two walking people against a complex background on a windy summer day.
The image characteristics vary between videos depending on the climate and surrounding objects, but the proposed scheme shows robust performance in all cases: the average root mean squared errors in position and velocity are 0.08 m and 0.53 m/s for the first video, 0.06 m and 0.58 m/s for the second video, and 0.18 m and 1.84 m/s for the third video. The proposed method reduces the number of false tracks from 10 to 1 in the third video.


Introduction
Multirotor drones are widely used in many applications [1]. A multirotor can hover at a fixed position or fly as programmed while capturing video from a distance. Such video capture is cost-effective and does not require highly trained personnel.
A thermal imaging camera produces an image by detecting infrared (IR) radiation emitted by objects [2,3]. Because the thermal imaging camera uses the objects' temperature instead of their visible properties, no illumination is required, and consistent imaging day and night is possible. Thermal imaging also penetrates many visible obscurants, such as smoke, dust, haze, and light foliage [4]. The long-wavelength (LW) IR band (8-14 μm) has less atmospheric attenuation, and most of the radiation emitted by the human body falls within this band [5]. This allows an LWIR thermal imaging camera to detect human activity day and night. A multirotor equipped with a thermal imaging camera can be used for search and rescue missions in hazardous areas as well as for security and surveillance [6,7]. It is also useful for many applications such as wildlife monitoring and agricultural and industrial inspection [8,9]. However, compared to visible-light images, the image resolution is lower, and thermal images do not provide texture and color information. The image quality can vary with the climate and surrounding objects. Moreover, the unpredictable motion of the platform poses more challenges than a fixed camera [10]. Thus, suitable intelligent image processing is required to overcome the shortcomings of aerial thermal imaging while retaining its advantages.
There is an increasing number of studies on people detection using thermal images captured by a drone [11-15], although there have been no studies on people tracking with the same configuration. The temperature difference was estimated using the spatial gray-level co-occurrence matrix [11], but humans were captured at a relatively short distance. In [12], a two-stage hot-spot detection approach was proposed to recognize a person with a moving thermal camera, but the dataset was not obtained from a drone. In [13], an autonomous unmanned helicopter platform was used to detect humans with thermal and color imagery. Human and fire detection was studied with optical and thermal sensors in high-altitude unmanned aerial vehicle images [14]. However, in [13] and [14], thermal and visual images were used together for detection. A multi-level method was applied to thermal images obtained by a multirotor for people segmentation [15].
Non-human object tracking from drones using an LWIR band thermal camera can be found in [16,17]. In [16], a boat is captured and tracked with the Kalman filter and a constant-velocity motion model. A colored-noise measurement model was adopted to track a small vessel [17]. However, in [16] and [17], a fixed-wing drone capable of maintaining stable flight was used at sea, which is a high-detection, low-false-alarm environment.
Human detection and tracking using stationary thermal imaging cameras have been studied in [18-22]. A contour-based background subtraction method to extract foreground objects was presented in [18]. A local adaptive thresholding method performs pedestrian detection in [19]. In [20], humans and animals were detected in difficult weather conditions using YOLO. People were detected and tracked from an aerial thermal view based on the particle filter [21]. The Kalman filter with multi-level segmentation was adopted to track people in thermal images [22]. Various targets in a thermal image database were tracked by a weighted correlation filter [23].
Multiple targets are simultaneously tracked by estimating their kinematic states, such as position, velocity, and acceleration [24]. The interacting multiple model (IMM) estimator using multiple Kalman filters has been developed [25] and successfully applied to track multiple highly maneuvering targets [26]. Recently, multiple moving vehicles were successfully tracked by a flying multirotor with a visual camera [27]. Data association, which assigns measurements to tracks, is also an important task for tracking multiple targets in a cluttered environment. The nearest neighbor (NN) measurement-track association is the most computationally efficient and has been successfully applied to precision target tracking [27-29].
This paper addresses moving people detection and tracking with IR thermal video captured by a multirotor. First, considering the unstable motion of the drone, global matching is performed to compensate for the coordinate system of each frame [30]; then, each frame of the video is segmented through k-means clustering [31,32] and morphological operations. Incorrectly segmented areas are eliminated using object size and shape information based on squareness and rectangularity [33,34]. In each frame, the centroid of the segmented area is taken as the measured position and input to the subsequent tracking stage.
The tracking is performed by the interacting multiple model (IMM) filter to estimate the kinematic state of the target. The track is initialized by two-point differencing initialization following maximum speed gating. For measurement-to-track association, speed and position gating are applied sequentially to exclude measurements outside the validation region. Then, NN association assigns the closest valid measurement to each track. A track is terminated if either of two criteria is satisfied: one is a maximum number of updates without a valid measurement, and the other is a minimum target speed. Even a stationary object can establish tracks because of the drone's turbulence, so the minimum target speed is used to eliminate false tracks caused by stationary objects in heavy false alarm environments. Finally, a validity test checks the continuity of each track [27]. Figure 1 shows a block diagram of detecting and tracking people with thermal video captured by a multirotor.
In the experiments, a drone hovering at a fixed position captured three IR thermal videos. The thermal camera operates in the LWIR band, which is suitable for detecting warm objects such as the human body. The first and second videos (Videos 1 and 2) were captured on a winter night from altitudes of 30 m and 45 m, respectively. A total of eight people walk or run for 40 s in Video 1, and three hikers walk, stand, or sit in the mountains for 30 s in Video 2; in Video 2, they were partially covered with leaves. The third video (Video 3) was captured on a windy summer day from an altitude of 100 m; two people walk against a complex background for 50 s. The average detection rates are about 91.4%, 91.8%, and 79.8% for Videos 1-3, respectively. The false alarm rates are 1.08, 0.28, and 6.36 per frame for Videos 1-3, respectively.
The average position and velocity root mean square errors (RMSEs) are 0.077 m and 0.528 m/s for Video 1, 0.06 m and 0.582 m/s for Video 2, and 0.177 m and 1.838 m/s for Video 3. Three segmented tracks are generated on one target in Video 3, but the number of false tracks is reduced from 10 to 1 by the proposed termination scheme. A two-stage scheme of detection and tracking has been proposed and successfully applied to various thermal videos. To the best of our knowledge, this is the first study to track people with a thermal imaging camera mounted on a small drone.
The rest of the paper is organized as follows. Object detection based on k-means clustering is presented in Section 2. Target tracking with the IMM filter and the track termination criteria is described in Section 3. Section 4 demonstrates the experimental results. The conclusion follows in Section 5.

People Detection in Thermal Images
This section describes object detection in IR thermal images. Human detection in IR images consists of coordinate compensation between the reference frame and the other frames, k-means clustering, morphological operations, and false alarm removal based on the size and shape of the target. The coordinates are corrected to compensate for the unstable motion of the platform. The global matching between two frames is performed by minimizing the sum of absolute differences (SAD):

$$(\Delta x_k^*, \Delta y_k^*) = \arg\min_{\Delta x, \Delta y} \sum_{x=1}^{S_x} \sum_{y=1}^{S_y} \left| I_1(x, y) - I_k(x + \Delta x, y + \Delta y) \right|, \quad k = 2, \ldots, N_K, \tag{1}$$

where $I_1$ and $I_k$ are the first and the k-th frame, $S_x$ and $S_y$ are the image sizes in the x and y directions, respectively, and $N_K$ is the total number of frames. The coordinates of frame k are then translated by the minimizing offset $[\Delta x_k^*\ \Delta y_k^*]^T$. Then, k-means clustering groups the pixels of each frame into multiple clusters by minimizing the cost function

$$J = \sum_{j=1}^{N_c} \sum_{I(x,y) \in C_j} \left| I(x, y) - \mu_j \right|^2, \tag{2}$$

where $N_c$ is the number of clusters, $C_j$ is the pixel set of the j-th cluster, and $\mu_j$ is the mean of the pixel intensities in the j-th cluster. The pixels in the cluster with the largest mean are labeled as object (white) areas, producing a binary image. The morphological closing operation (dilation followed by erosion) is applied to the binary image: dilation connects fragmented areas of one object, and erosion removes very small clutter and recovers the dilated boundary. Finally, each object area is tested for size and two shape properties. One is the ratio between the minor and major axes of the basic rectangle, which measures squareness, and the other is the ratio between the object size and the basic rectangle size, which measures rectangularity [33,34]. The basic rectangle is defined as the smallest rectangle that includes the object. Therefore, four parameters are utilized to eliminate false alarms: the maximum (θmax) and minimum (θmin) size of the basic rectangle in the imaging plane, the minimum squareness, and the minimum rectangularity. Figure 2 illustrates object imaging and a basic rectangle with its major and minor axes.
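As an illustration of the SAD-based global matching, the following sketch exhaustively searches integer shifts; the function name, the search-radius parameter, and the exhaustive search strategy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sad_offset(ref, frame, max_shift):
    """Estimate the (dx, dy) translation registering `frame` to `ref`
    by minimizing the mean absolute difference over integer shifts up
    to max_shift pixels (illustrative sketch; the mean over the
    overlap region avoids biasing toward large shifts)."""
    h, w = ref.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # compare ref[y, x] with frame[y - dy, x - dx] on the overlap
            r = ref[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            f = frame[max(0, -dy):h + min(0, -dy),
                      max(0, -dx):w + min(0, -dx)]
            sad = np.abs(r.astype(float) - f.astype(float)).mean()
            if sad < best_sad:
                best, best_sad = (dx, dy), sad
    return best
```

In practice the search radius would be chosen from the expected magnitude of the drone's drift between frames.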
The selection process for the number of clusters and the four parameters will be described in Section 4.
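The clustering, thresholding, and size/shape tests above can be sketched in Python with NumPy only. The deterministic quantile initialization and the helper names are illustrative assumptions, and the size limits here are in pixels, whereas the paper's θmax and θmin are metric sizes projected onto the imaging plane.

```python
import numpy as np

def segment_hottest(frame, n_clusters, n_iter=30):
    """1-D k-means over pixel intensities; returns a binary mask of
    the cluster with the largest mean (the hottest region)."""
    pixels = frame.ravel().astype(float)
    # deterministic quantile initialization (an illustrative choice)
    means = np.quantile(pixels, np.linspace(0, 1, n_clusters))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(pixels[:, None] - means[None, :]), axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                means[j] = pixels[labels == j].mean()
    return (labels == np.argmax(means)).reshape(frame.shape)

def shape_ok(mask, theta_min, theta_max, sq_min, rect_min):
    """Size/shape false-alarm test on a single-object binary mask using
    the basic (bounding) rectangle: squareness = minor/major axis ratio,
    rectangularity = object area / rectangle area."""
    ys, xs = np.nonzero(mask)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    area = h * w                      # basic rectangle size
    squareness = min(h, w) / max(h, w)
    rectangularity = mask.sum() / area
    return (theta_min <= area <= theta_max
            and squareness >= sq_min and rectangularity >= rect_min)
```

A candidate region failing either shape test is discarded as a false alarm before its centroid is passed to the tracker.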

Multiple Target Tracking
A block diagram of multiple target tracking is shown in Figure 3. Each step of the block diagram is described in the following subsections.

System Modeling
The kinematic state of a target is assumed to follow nearly constant velocity (NCV) motion, driven by Gaussian process noise. The discrete state equation for multiple targets is

$$\mathbf{x}^t(k+1) = F \mathbf{x}^t(k) + \mathbf{v}(k), \quad t = 1, \ldots, N_T(k),$$

$$\mathbf{x}^t(k) = \left[ x^t(k)\ \ v_x^t(k)\ \ y^t(k)\ \ v_y^t(k) \right]^T, \quad F = \begin{bmatrix} 1 & \Delta & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \Delta \\ 0 & 0 & 0 & 1 \end{bmatrix},$$

where $x^t(k)$ and $y^t(k)$ are the positions in the x and y directions, respectively; $v_x^t(k)$ and $v_y^t(k)$ are the velocities in the x and y directions, respectively; T denotes the matrix transpose; $N_T(k)$ is the number of targets at frame k; Δ is the sampling time; and $\mathbf{v}(k)$ is the process noise vector, Gaussian white noise with a diagonal covariance matrix. The covariance of the process noise is set differently for each mode of the IMM filter as $Q_j$, $j = 1, \ldots, M$, where M is the number of modes of the IMM filter. The measurement vector for target t consists of the positions in the x and y directions. The measurement equation is

$$\mathbf{z}^t(k) = H \mathbf{x}^t(k) + \mathbf{w}(k), \quad H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},$$

where $\mathbf{w}(k)$ is the measurement noise vector, Gaussian white noise with covariance matrix R.
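A minimal sketch of the NCV model matrices, assuming the state ordering [x, vx, y, vy]; the diagonal process noise with a scalar per-mode intensity q mirrors the paper's "diagonal covariance" statement but is otherwise an illustrative choice.

```python
import numpy as np

def ncv_matrices(dt, q):
    """Build the NCV state-transition F, measurement H, and diagonal
    process-noise covariance Q for state [x, vx, y, vy]."""
    F = np.array([[1, dt, 0, 0],
                  [0, 1,  0, 0],
                  [0, 0,  1, dt],
                  [0, 0,  0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0]], dtype=float)   # positions only
    Q = q * np.eye(4)                           # per-mode noise intensity q
    return F, H, Q
```

Each IMM mode would call this with its own q (e.g., a small q for near-constant motion and a larger q for maneuvers).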

Two-Point Differencing Initialization with Maximum Speed Gating
Two-point differencing initialization has been applied to target tracking by a drone [27-29]. The initial state and covariance of target t are obtained, respectively, as

$$\hat{\mathbf{x}}^t(k_t) = \left[ x_m(k_t),\ \frac{x_m(k_t) - x_m(k_t - 1)}{\Delta},\ y_m(k_t),\ \frac{y_m(k_t) - y_m(k_t - 1)}{\Delta} \right]^T,$$

$$P^t(k_t) = \begin{bmatrix} \sigma^2 & \sigma^2/\Delta & 0 & 0 \\ \sigma^2/\Delta & 2\sigma^2/\Delta^2 & 0 & 0 \\ 0 & 0 & \sigma^2 & \sigma^2/\Delta \\ 0 & 0 & \sigma^2/\Delta & 2\sigma^2/\Delta^2 \end{bmatrix},$$

where $k_t$ is the frame at which target t is initialized ($2 \le k_t \le N_K$), and $\sigma^2$ is the measurement noise variance. The state is confirmed as the initial state of a track if the following speed gating is satisfied:

$$\sqrt{\hat{v}_x^2(k_t) + \hat{v}_y^2(k_t)} \le V_{\max},$$

where $V_{\max}$ is the maximum speed of the targets. Initialization is performed only for measurements that are not associated with an existing track.
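The initialization with speed gating can be sketched as follows, under the standard two-point differencing covariance form; the function name and tuple-based measurement interface are illustrative assumptions.

```python
import numpy as np

def two_point_init(z_now, z_prev, dt, sigma2, v_max):
    """Two-point differencing track initialization with maximum-speed
    gating; sigma2 is the measurement noise variance. Returns the
    initial (state, covariance) or None if the implied speed
    exceeds v_max (the gate rejects the candidate track)."""
    vx = (z_now[0] - z_prev[0]) / dt
    vy = (z_now[1] - z_prev[1]) / dt
    if np.hypot(vx, vy) > v_max:
        return None
    x0 = np.array([z_now[0], vx, z_now[1], vy])
    # standard two-point differencing covariance block per axis
    block = sigma2 * np.array([[1, 1 / dt], [1 / dt, 2 / dt**2]])
    P0 = np.zeros((4, 4))
    P0[:2, :2] = block
    P0[2:, 2:] = block
    return x0, P0
```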

Multi-Mode Interaction
The states and covariances of all modes at the previous frame are mixed to generate the mode-initial state and covariance of target t for mode j at the current frame k:

$$\hat{\mathbf{x}}_{0j}^t(k-1) = \sum_{i=1}^{M} \mu_{i|j}^t(k-1)\, \hat{\mathbf{x}}_i^t(k-1),$$

$$P_{0j}^t(k-1) = \sum_{i=1}^{M} \mu_{i|j}^t(k-1) \left[ P_i^t(k-1) + \left( \hat{\mathbf{x}}_i^t(k-1) - \hat{\mathbf{x}}_{0j}^t(k-1) \right) \left( \hat{\mathbf{x}}_i^t(k-1) - \hat{\mathbf{x}}_{0j}^t(k-1) \right)^T \right],$$

with mixing probabilities

$$\mu_{i|j}^t(k-1) = \frac{p_{ij}\, \mu_i^t(k-1)}{\bar{c}_j}, \quad \bar{c}_j = \sum_{i=1}^{M} p_{ij}\, \mu_i^t(k-1),$$

where $\hat{\mathbf{x}}_i^t(k-1)$ and $P_i^t(k-1)$ are, respectively, the state and covariance of target t for mode i at frame k-1, $\mu_i^t(k-1)$ is the mode probability of target t for mode i at frame k-1, and $p_{ij}$ is the mode transition probability from mode i to mode j. When the track is initialized at frame k-1, the two-point differencing state and covariance are used as the mode-initial state and covariance, and each mode probability is set to 1/M.
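The standard IMM mixing step can be sketched compactly; the list-based interface and function name are illustrative assumptions.

```python
import numpy as np

def imm_mix(states, covs, mu, p):
    """IMM mixing step: states[i], covs[i] are the mode-i estimate and
    covariance at the previous frame, mu[i] the mode probability, and
    p[i, j] the mode transition probability. Returns the mode-initial
    states/covariances and the predicted mode probabilities c_bar."""
    M = len(states)
    c_bar = mu @ p                              # predicted mode probabilities
    mix = (p * mu[:, None]) / c_bar[None, :]    # mixing probabilities mu_{i|j}
    x0 = [sum(mix[i, j] * states[i] for i in range(M)) for j in range(M)]
    P0 = []
    for j in range(M):
        Pj = np.zeros_like(covs[0])
        for i in range(M):
            d = (states[i] - x0[j])[:, None]    # spread-of-means term
            Pj += mix[i, j] * (covs[i] + d @ d.T)
        P0.append(Pj)
    return x0, P0, c_bar
```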

Mode Matched Kalman Filtering
A Kalman filter is run for each mode. The state and covariance predictions of target t for mode j at frame k are computed as

$$\hat{\mathbf{x}}_j^t(k|k-1) = F \hat{\mathbf{x}}_{0j}^t(k-1), \quad P_j^t(k|k-1) = F P_{0j}^t(k-1) F^T + Q_j.$$

The residual covariance $S_j^t(k)$ and the filter gain $W_j^t(k)$ of target t for mode j are, respectively, obtained as

$$S_j^t(k) = H P_j^t(k|k-1) H^T + R, \quad W_j^t(k) = P_j^t(k|k-1) H^T \left[ S_j^t(k) \right]^{-1}.$$
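The mode-matched prediction step maps directly to a few matrix products; this is a generic sketch of the textbook equations, not the authors' code.

```python
import numpy as np

def kf_predict(x0, P0, F, Q, H, R):
    """Mode-matched Kalman prediction: returns the predicted state and
    covariance, the residual covariance S, and the filter gain W."""
    x_pred = F @ x0
    P_pred = F @ P0 @ F.T + Q
    S = H @ P_pred @ H.T + R                    # residual covariance
    W = P_pred @ H.T @ np.linalg.inv(S)         # filter gain
    return x_pred, P_pred, S, W
```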

Measurement-Track Association
Measurement-to-track association is the process of assigning measurements to established tracks. Measurement gating is performed by a chi-square hypothesis test assuming Gaussian measurement residuals [24]. All measurements within the validation region are considered candidates for mode j and target t at frame k:

$$\boldsymbol{\nu}_{mj}^t(k) = \mathbf{z}_m(k) - H \hat{\mathbf{x}}_j^t(k|k-1), \quad \left[ \boldsymbol{\nu}_{mj}^t(k) \right]^T \left[ S_j^t(k) \right]^{-1} \boldsymbol{\nu}_{mj}^t(k) \le \gamma, \quad m = 1, \ldots, N_M(k),$$

where $\mathbf{z}_m(k)$ is the m-th measurement vector at frame k, γ is the gating size for measurement association, and $N_M(k)$ is the number of measurements at frame k. The NN association rule assigns to track t the measurement with index

$$m_j^t(k) = \arg\min_{m} \left[ \boldsymbol{\nu}_{mj}^t(k) \right]^T \left[ S_j^t(k) \right]^{-1} \boldsymbol{\nu}_{mj}^t(k),$$

where the minimum is taken over the valid measurements for mode j and target t at frame k. Any remaining measurements that are not associated with a track proceed to the initialization stage of Section 3.2.
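Gating and NN association reduce to a single pass over the measurements; the gate value would come from the chi-square distribution with 2 degrees of freedom (e.g., about 9.21 for a 99% validation region). The function name and interface are illustrative assumptions.

```python
import numpy as np

def nn_associate(z_pred, S, measurements, gate):
    """Chi-square gating followed by nearest-neighbor association.
    Returns the index of the closest valid measurement (minimum
    normalized squared distance within the gate), or None."""
    S_inv = np.linalg.inv(S)
    best, best_d2 = None, np.inf
    for m, z in enumerate(measurements):
        nu = np.asarray(z, float) - z_pred      # measurement residual
        d2 = nu @ S_inv @ nu                    # normalized squared distance
        if d2 <= gate and d2 < best_d2:
            best, best_d2 = m, d2
    return best
```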

State Estimate and Covariance Update
The state and covariance of target t for mode j are updated as

$$\hat{\mathbf{x}}_j^t(k) = \hat{\mathbf{x}}_j^t(k|k-1) + W_j^t(k)\, \boldsymbol{\nu}_j^t(k), \quad P_j^t(k) = P_j^t(k|k-1) - W_j^t(k)\, S_j^t(k) \left[ W_j^t(k) \right]^T,$$

where $\boldsymbol{\nu}_j^t(k)$ is the residual of the associated measurement. If no measurement exists in the validation region, they simply become the predictions of the state and covariance:

$$\hat{\mathbf{x}}_j^t(k) = \hat{\mathbf{x}}_j^t(k|k-1), \quad P_j^t(k) = P_j^t(k|k-1).$$

The mode probability is updated as

$$\mu_j^t(k) = \frac{\Lambda_j^t(k)\, \bar{c}_j}{\sum_{i=1}^{M} \Lambda_i^t(k)\, \bar{c}_i}, \quad \Lambda_j^t(k) = \mathcal{N}\!\left( \boldsymbol{\nu}_j^t(k);\ \mathbf{0},\ S_j^t(k) \right),$$

where $\mathcal{N}$ denotes the Gaussian probability density function. If no measurement exists in the validation region, the mode probability becomes the predicted mode probability $\bar{c}_j$. Finally, the state vector and covariance matrix of each target are updated as

$$\hat{\mathbf{x}}^t(k) = \sum_{j=1}^{M} \mu_j^t(k)\, \hat{\mathbf{x}}_j^t(k), \quad P^t(k) = \sum_{j=1}^{M} \mu_j^t(k) \left[ P_j^t(k) + \left( \hat{\mathbf{x}}_j^t(k) - \hat{\mathbf{x}}^t(k) \right) \left( \hat{\mathbf{x}}_j^t(k) - \hat{\mathbf{x}}^t(k) \right)^T \right].$$

These procedures, from the mode interaction through the state update, repeat at every frame until the track is terminated. The track termination criteria and track validity testing are described in the next subsection.
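The update and mode-combination steps can be sketched as below; the handling of a missed detection (keeping the predictions) follows the text, while the function names are illustrative assumptions.

```python
import numpy as np

def kf_update(x_pred, P_pred, W, S, nu):
    """Kalman update with residual nu; with no valid measurement
    (nu is None) the predictions are kept unchanged."""
    if nu is None:
        return x_pred, P_pred
    x = x_pred + W @ nu
    P = P_pred - W @ S @ W.T
    return x, P

def imm_combine(states, covs, mu):
    """Combine mode-conditioned estimates into the output estimate,
    adding the spread-of-means term to the covariance."""
    x = sum(m * s for m, s in zip(mu, states))
    P = np.zeros_like(covs[0])
    for m, s, C in zip(mu, states, covs):
        d = (s - x)[:, None]
        P += m * (C + d @ d.T)
    return x, P
```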

Track Termination and Validity Testing
Two criteria for track termination are proposed in this paper. One is the number of consecutive updates with no valid measurement: if the search for a measurement fails for a certain number of consecutive frames, the track is terminated. The other is the target's minimum speed: if the track moves slower than the minimum speed, it is considered to have been generated by false detections, since false detections of a stationary object can form a spurious trajectory due to the unstable camera position. Finally, all tracks are tested for validity in terms of the track life length, defined as the number of frames from the initial frame to the last frame updated by a measurement, inclusive [26,27].
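The two termination criteria and the track-life validity test amount to simple predicates; the helper names and the use of an average-speed estimate for the speed criterion are illustrative assumptions.

```python
def should_terminate(consecutive_misses, max_misses, avg_speed, v_min):
    """Terminate when either criterion holds: too many consecutive
    updates without a valid measurement, or an estimated average speed
    below the minimum target speed (likely a false track formed by a
    stationary warm object under camera jitter)."""
    return consecutive_misses >= max_misses or avg_speed < v_min

def track_life(first_frame, last_updated_frame):
    """Track life length: frames from initialization to the last
    measurement-updated frame, inclusive."""
    return last_updated_frame - first_frame + 1
```

A track is then kept as valid only if its track life meets the minimum length (10 frames in the experiments).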

Results
Experimental results for all videos are detailed below through the video description, parameter settings, and people detection and tracking with the proposed scheme.

Video Description
Three thermal videos (Videos 1-3) were captured by an IR thermal camera, a FLIR Vue Pro R 640 (f = 19 mm, FOV = 32° × 26°), mounted on a DJI Inspire 2 drone. The spectral band of the thermal camera is 7.5-13.5 μm [35]. The image resolution is 640 × 512 pixels, the pixel pitch is 17 μm, and the frame rate is 30 fps. The drone hovered at a fixed position with the camera facing straight down at the ground, or slightly tilted in the mountains. The altitude of the drone is 30, 45, and 100 m for Videos 1-3, respectively. Videos 1 and 2 were captured on winter nights over a flat pavement and in the mountains, respectively, and Video 3 was captured against the complex background of a parking lot during summer daytime. In Video 1, a total of eight walking or running people appeared and disappeared over 40 s. In Videos 2 and 3, three hikers and two walkers were captured for 30 and 50 s, respectively. For efficient image processing, every fifth frame was processed in Video 1, and every third frame in Videos 2 and 3. The details of the videos are described in Table 1. Figure 4 shows the 50th, 90th, and 150th frames of Video 1; Figure 5 shows the 1st, 151st, and 301st frames of Video 2; and Figure 6 shows the 6th, 280th, and 406th frames of Video 3. The image characteristics vary greatly between the videos depending on climatic conditions and surrounding objects.

People Detection
The k-means clustering, morphological filtering, and false alarm removal were performed sequentially. The coordinate compensation was applied only to Video 3 to compensate for the unstable motion of the drone.

Parameter Set-Up
Since k-means clustering depends on the number of clusters, it is important to choose an appropriate number. The data set can be evaluated to find the elbow point where the cost function begins to flatten [32]. However, only the cluster with the largest mean matters for detecting people in thermal images, since the region of interest is the object with the highest temperature. Therefore, the minimum intensity of the largest-mean cluster was obtained for sample frames as a function of the number of clusters; this minimum value is the effective thresholding intensity. The smallest number of clusters for which the threshold remains constant was chosen. Figure 7 shows the thresholds for varying cluster numbers; the cluster number was set to six for Videos 1 and 3, and ten for Video 2. The parameters θmax and θmin reflect the size of a true object: they are the maximum and minimum sizes of the basic rectangle in Figure 2 projected onto the imaging plane. The maximum and minimum sizes of the basic rectangle were set to 1 m² and 0.25 m², respectively, for Videos 1 and 2, and 1.5 m² and 0.5 m² for Video 3; θmax and θmin are calculated accordingly, as in Table 2. The minimum squareness and minimum rectangularity were chosen heuristically to produce better results. The parameters for object detection are presented in Table 2. Figure 8b shows the object areas after the morphological operations are applied to Figure 8a, Figure 8c shows the object areas after false alarm removal in Figure 8b, and Figure 8d shows the area of the detected object with the centroid marked by a red circle. The detection results of the three videos are summarized in Table 3. The detection rates for Videos 1, 2, and 3 are 91.4%, 91.8%, and 79.8%, respectively, and the numbers of false alarms per frame are 1.08, 0.28, and 6.36, respectively. The detection performance was degraded in Video 3; the false alarms were generated by warm, stationary non-human objects.
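The cluster-count selection described above (pick the smallest count after which the threshold, i.e., the minimum intensity of the largest-mean cluster, stays flat) can be sketched as follows; the deterministic quantile initialization, the tolerance parameter, and the helper names are illustrative assumptions.

```python
import numpy as np

def threshold_for(pixels, k, n_iter=30):
    """Run 1-D k-means with k clusters and return the minimum intensity
    of the largest-mean cluster, i.e., the effective threshold."""
    means = np.quantile(pixels, np.linspace(0, 1, k))  # deterministic init
    labels = np.zeros(len(pixels), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(pixels[:, None] - means[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                means[j] = pixels[labels == j].mean()
    return pixels[labels == np.argmax(means)].min()

def pick_cluster_count(pixels, counts, tol):
    """Smallest cluster count after which the threshold stays flat
    (within tol of the next candidate's threshold)."""
    th = [threshold_for(pixels, k) for k in counts]
    for i in range(len(counts) - 1):
        if abs(th[i] - th[i + 1]) <= tol:
            return counts[i]
    return counts[-1]
```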
Certain objects (a streetlight in Video 1 and a manhole cover in Video 3) continuously generated false alarms in most frames. All the centroids of the object areas, including false alarms, are shown in Figures 11-13. Figure 13a,b shows the centroids of the objects of Video 3 before and after the coordinate compensation, respectively; the coordinates of the objects are translated by the offset $[\Delta x_k^*\ \Delta y_k^*]^T$ obtained from the SAD minimization of (1). The coordinate translation reduced the fluctuation of the false alarms, as shown in Figure 13b. Note that all centroids, including false alarms, are input to the next target tracking stage as measurements.

Parameter Set-Up
The parameters for target tracking are designed as in Table 4. The sampling time is 0.167 s for Video 1 and 0.1 s for Videos 2 and 3, since the effective frame rates are 6 and 10 fps, respectively. A single-mode IMM filter, equivalent to the Kalman filter, is adopted for Videos 1 and 2; its single process noise variance is set to 10 m/s². Two different process noise variances are set for the two-mode IMM filter adopted for Video 3: one mode is set low, at 5 m/s², and the other is set large, at 10 m/s², to handle maneuvering targets and the unstable camera motion. The maximum target speed for track initialization and measurement association is set to 10 m/s, and the minimum target speed for termination is set to 0.5 m/s; that is, the system aims to track targets moving between 0.5 and 10 m/s, except for Video 2. For Video 2, the minimum target speed was set to 0 m/s, considering that missing hikers can move and stop intermittently. The track termination criterion is set to a maximum of 10 consecutive searches with no valid measurement for Videos 1 and 2, and 15 for Video 3. The minimum track life length for a valid track is set to 10 frames; thus, all tracks shorter than 10 frames (approximately 1.7 s for Video 1 and 1 s for Videos 2 and 3) are removed as false tracks.

Tracking Results of Video 1
A total of eight valid tracks are generated for the eight targets, as shown in Figure 14a. Figure 14b shows the ground truth of the target positions. The ground truth is manually obtained from each frame of the video; the ground-truth velocity is obtained as the difference in position between consecutive frames divided by the sampling time. The tracking performance is evaluated in terms of the number of track segments (NTS) and the total track life (TTL) [26], as well as the position and velocity RMSEs. The target ID of a track is defined as the target with the largest number of measurements in the track.
The TTL is defined as the ratio of the summed lengths of the track segments sharing the same target ID to the target life length. Table 5 shows the NTS and TTL for each target and the position and velocity RMSEs for Video 1. The NTS is 1 for each target because there are no redundant, missing, or segmented tracks among the valid tracks; no false track was validated either. The average TTL is 99%, showing the robustness of the tracking performance. The average position RMSE is about 0.077 m, which is equivalent to 2.85 pixels, and the average velocity RMSE is 0.528 m/s. This result shows that the manually obtained centroid of the object is very close to the position estimate calculated by the target tracker. Two supplementary multimedia files (MP4 format) tracking the people in Video 1 are available online. One displays the position estimates as they are produced, together with the valid track numbers (Supplementary Material Video S1); the other shows the trajectories of the valid tracks, with the invalid (false) trajectories removed as the track termination criteria are satisfied (Supplementary Material Video S2). The blue circles in the MP4 files are the position estimates of the valid tracks, and the black circles represent the false tracks. The numbers represent the valid track numbers in the order they were created.

Tracking Results of Video 2
A total of four valid tracks are generated for the three targets, as shown in Figure 15a. Figure 15b shows the ground truth of the target positions. One false track is validated in the upper-right corner, as shown in Figure 15a. Table 6 shows the tracking performance for Video 2. The NTS for a false target is equivalent to the number of false tracks. The average TTL is 100%, the average position RMSE is about 0.06 m, which is equivalent to 1.5 pixels, and the average velocity RMSE is 0.582 m/s.

Tracking Results of Video 3
A total of 14 valid tracks are generated for the two targets without the coordinate compensation, as shown in Figure 16a; 10 of the 14 valid tracks are false tracks. The number of false tracks is reduced to one by the coordinate compensation, as shown in Figure 16b. Figure 17a,b shows the ground truths without and with the coordinate compensation, respectively. The first target was well tracked with one valid track, but three track segments were generated for the second target due to missed detections caused by the background. Tables 7 and 8 show the tracking results with and without the coordinate compensation, respectively. The NTS and TTL remain the same for the true targets, but the NTS for the false targets is reduced from 10 to 1. The average position RMSE decreased from 0.2 m to 0.177 m, and the average velocity RMSE decreased from 1.877 m/s to 1.838 m/s after the coordinate compensation. The results are displayed in the same manner as in the previous section.

Discussion
The thermal image quality varies with atmospheric conditions and surrounding objects. In Video 1, the streetlight continuously generated false alarms in almost every frame, but no false track was validated because it was a stationary object moving slower than the minimum target speed. In Video 2, which simulates missing hikers who can move and stop intermittently, the minimum target speed was not used to remove false tracks, yet only one false track was generated. In Video 3, the ambient temperature was similar to that of a human, resulting in a lower detection rate and a higher false alarm rate. These false alarms generated many false tracks due to the strong wind and the high altitude of the drone, but the two proposed strategies significantly reduced the number of false tracks: one is the coordinate compensation, and the other is the track termination criterion based on a minimum target speed. It has also been shown that the coordinate compensation improves the tracking accuracy.

Conclusions
In this paper, an IR thermal camera mounted on a drone captured multiple moving people in winter and summer. The coordinates of each frame were compensated, and the moving objects were detected based on k-means clustering, morphological operations, and false alarm removal. The targets were tracked with two-point differencing initialization, the Kalman or IMM filter, and NN association. The target's minimum speed is tested, and very slow tracks are removed as false tracks. Robust performance was obtained for all video clips, which were captured in very different environments and climate conditions. The proposed detection and tracking method is useful for smart security and surveillance and for search and rescue (SAR) missions in hazardous areas. Extending the method to tracking animals in the wild remains for future study.