Moving Vehicle Tracking with a Moving Drone Based on Track Association

Abstract: Drones play an important role in security and surveillance. However, due to their limited computing power and energy resources, more efficient systems are required for surveillance tasks. In this paper, we address the detection and tracking of moving vehicles with a small drone. A moving object detection scheme has been developed based on frame registration and subtraction followed by morphological filtering and false alarm removal. The center position of each detected object area is input to the tracker as a measurement. The Kalman filter estimates the position and velocity of the target based on the measurement nearest to the state prediction. We propose a new data association scheme for multiple measurements on a single target. This track association method consists of hypothesis testing between two tracks and track fusion through track selection and termination. We reduce redundant tracks on the same target and maintain the track with the least estimation error. In the experiments, drones flying at an altitude of 150 m captured two videos in an urban environment. There are a total of 9 and 23 moving vehicles in the two videos; the detection rates are 92% and 89%, respectively. The number of valid tracks is significantly reduced from 13 to 10 and from 56 to 26 in the first and second videos, respectively. In the first video, the average position RMSE of two merged tracks is improved by 83.6% when only the fused states are considered. In the second video, the average position and velocity RMSEs are 1.21 m and 1.97 m/s, showing the robustness of the proposed system.


Introduction
In recent years, the use of small unmanned aerial vehicles (UAVs), or drones, has been increasing in various applications [1]. Aerial video surveillance is of particular interest among these applications [2]. Multirotor drones can hover or fly as programmed while capturing video from a distance [3]. This approach is cost-effective and does not require highly trained personnel. However, the limited computational power of a small drone is an important factor that must be considered.
When the drone moves, the coordinates change as the surveillance coverage shifts. Therefore, frame registration is required to generate fixed coordinates. The registration process matches multiple images from varying scenes to the same coordinates [4]. Coarse-to-fine image registration was proposed in [5]. Frames from different sensors were registered via modified dynamic time warping in [6]. Telemetry-assisted registration of frames from a drone was studied in [7]. In [8], the drone velocities were estimated using frame registration based on the sum of absolute differences (SAD).
Studies on target tracking with a small drone have been conducted with various methods. They can be categorized as visual, non-visual, or combined trackers. Visual trackers on a small drone mainly utilize video streams. One approach is tracking with deep learning. In [9], various deep learning-based trackers were compared with a camera motion model. Testing results with UAV-based real video showed that small objects, large numbers of targets, and camera motion degraded tracking performance even when using high-end CPUs [10].
to the ground. In another case, moving object detection is performed on the drone immediately after video capture, and then the center positions of the extracted object areas are transmitted to the ground control station to perform multiple target tracking. In the latter case, it is possible to associate the measurements of multiple drones with the proposed association method. Since the proposed scheme does not use intensity information, which requires heavy computation, there is no need to store or transmit a high-resolution video stream. During moving object detection, frame registration is performed between two consecutive frames to compensate for the drone's movement [8]. The current frame is compensated for and subtracted from the previous frame, and thresholding is performed to generate a binary image of the moving objects. Two morphological filters (erosion and dilation) are applied sequentially to the binary image. The erosion filter removes very small clutter, and the dilation filter restores the boundaries of the object areas and connects fragmented areas of one object. Finally, target blobs smaller than the assumed target size are removed. The center location of each target area becomes a measurement for the next tracking stage.
The tracking is performed by the Kalman filter to estimate the kinematic state of the target following the moving object detection. A nearly constant velocity (NCV) motion model is used as the discrete time kinematic model for the target. The NCV model has been successfully applied to the aerial early warning system to track 120 aerial targets with high maneuvers up to 6 g [24] and other applications [34][35][36].
The track is initialized by the two-point differential initialization following the maximum speed gating process. The initial state of the track can affect tracking performance severely. In the proposed scheme, no additional process for the initialization is required.
For measurement-to-track association, a position gating process excludes measurements outside the validation region. Then, the NN association assigns valid measurements to tracks. However, multiple tracks can be generated if multiple measurements on a single target are detected [27]. In this paper, we propose a new association scheme, track-track association, in order to maintain a single track for each target. In the proposed method, first, an association test between two tracks performs hypothesis testing using the chi-square statistic, which is the statistical distance between the current state estimates of the two tracks. Second, the track with the smallest determinant of the state covariance matrix is maintained and the other is terminated. It is noted that the covariance measures the expected value of the squared error between the true state and the unbiased estimate. Finally, the last update of the selected track is replaced by the fused estimate.
We also set the criteria for a valid track and the termination of the track. A track is confirmed as a valid track if it continues longer than the minimum track life, and a track is terminated after searching for the measurements for a certain number of allowed frames.
In the experiments, the drones captured two videos while flying at a height of 150 m in an urban environment. The drone camera pointed directly downwards. In the two videos, there are a total of 9 and 23 moving vehicles (cars, buses, motorcycles, and bicycles), respectively. With the proposed association, the number of valid tracks was reduced from 13 to 10 in the first video and from 56 to 26 in the second video. In the first video (Video 1), considering only the fused states of the two targets, the sum of the position RMSE decreased from 5.2 m to 0.85 m, a reduction of about 83.6%. In the second video (Video 2), the average position and velocity RMSEs are 1.21 m and 1.97 m/s, respectively, showing the accuracy of the proposed system. Figure 2 shows a block diagram of the moving vehicle detection and multiple target tracking.
The contributions of this paper are listed as follows: (1) We propose a new data association scheme. The proposed track-track association is utilized with the NN measurement-track association. The track association is based on the least covariance matrix, while the NN association is based on the least statistical distance. Therefore, the proposed data association is easy to implement and provides fast computing. The experimental results show that the proposed method is very effective when multiple measurements are detected on a single target in consecutive frames. (2) The moving object detection-multiple target tracking scheme is studied with a moving drone. In previous studies, the video was captured with a drone hovering at a fixed position. However, as the drone flies, the coverage of surveillance expands by utilizing the drone as a dynamic sensor. A moving drone has a wider range of surveillance at higher altitudes and at higher speeds, which degrades image resolution and quality. Through the experiments, the proposed scheme was applied to small-size objects in slow-frame-rate videos (10 fps). (3) We propose a new configuration of drone surveillance, as shown in Figure 1. Small drones have limited computational resources in CPU, memory, battery, and bandwidth. We can build a more efficient surveillance system where storage or transmission of high-resolution video streams is not essential.

The remainder of the paper is organized as follows: object detection is discussed in Section 2. Section 3 demonstrates multiple target tracking. Section 4 presents experimental results, and the conclusion follows in Section 5.

Moving Object Detection
This section briefly describes the moving object detection with a moving drone studied in [8]. Moving object detection consists of frame registration and subtraction followed by thresholding, morphological operations, and removal of false alarms. Two consecutive gray-scaled images I_k and I_k−1 are registered using the SAD between them. The SAD at the k-th frame is obtained as

SAD_k(p_x, p_y) = Σ_{i=1..M} Σ_{j=1..N} | I_k(i, j) − I_{k−1}(i − p_x, j − p_y) |,  k = 2, …, K,  (1)

where M and N are the image sizes in the x and y directions, respectively, and K is the total number of frames. The displacement vectors p_x and p_y are obtained in the x and y directions, respectively, by minimizing the SAD as

(p̂_x(k), p̂_y(k)) = argmin_(p_x, p_y) SAD_k(p_x, p_y).  (2)

The subtraction and thresholding generate a binary image after the coordinate compensation as

B_k(i, j) = 1 if | I_k(i + p̂_x(k), j + p̂_y(k)) − I_{k−1}(i, j) | > θ_T, and 0 otherwise,  (3)

where θ_T is a threshold value set to 85 and 30 for Videos 1 and 2, respectively, in the experiments. Two morphological operations, erosion and dilation, are sequentially applied to the binary image. The structuring elements for erosion and dilation are set at [1]_2×2 and [1]_20×20, respectively. Finally, assuming the true size of the object is known, false alarms are removed as

O_i is discarded if |O_i| < θ_s,  (4)

where O_i is the i-th object area and θ_s is the minimum object size, set to 400 and 100 for Videos 1 and 2, respectively, in the experiments.
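The detection pipeline above can be sketched in Python; this is a minimal illustration with NumPy and SciPy, not the authors' implementation. The function names and the exhaustive-search displacement window (`max_disp`) are our own assumptions; the 2 × 2 and 20 × 20 structuring elements follow the values reported above, and the default θ_T = 30 and θ_s = 100 are the Video 2 settings.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation, label

def register_sad(prev, curr, max_disp=8):
    """Find the displacement (px, py) minimizing the SAD between two frames.

    Exhaustive search over a small window; a sketch of Equations (1)-(2)."""
    best, best_p = None, (0, 0)
    for px in range(-max_disp, max_disp + 1):
        for py in range(-max_disp, max_disp + 1):
            # Align the current frame back onto the previous frame's coordinates.
            shifted = np.roll(np.roll(curr, -px, axis=1), -py, axis=0)
            sad = np.abs(shifted.astype(int) - prev.astype(int)).sum()
            if best is None or sad < best:
                best, best_p = sad, (px, py)
    return best_p

def detect_moving(prev, curr, theta_t=30, theta_s=100):
    """Registration, subtraction, thresholding, morphology, size filtering.

    Returns the center positions of the detected object areas."""
    px, py = register_sad(prev, curr)
    aligned = np.roll(np.roll(curr, -px, axis=1), -py, axis=0)
    binary = np.abs(aligned.astype(int) - prev.astype(int)) > theta_t
    binary = binary_erosion(binary, np.ones((2, 2)))    # remove small clutter
    binary = binary_dilation(binary, np.ones((20, 20))) # restore/connect blobs
    labels, n = label(binary)
    centers = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if ys.size >= theta_s:  # false-alarm removal by minimum object size
            centers.append((xs.mean(), ys.mean()))
    return centers
```

Note that `np.roll` wraps around at the frame borders; a real implementation would crop to the overlapping region instead.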

Multiple Target Tracking
A block diagram of multiple target-tracking with the new association scheme is shown in Figure 3. Each step of the block diagram is described in the following subsections.


System Modeling
The kinematic state of a target is assumed to follow a nearly constant velocity (NCV) motion. The process noise, which follows a Gaussian distribution, models the uncertainty of the target's kinematic state. The discrete state equation for multiple targets is as follows:

x^t(k + 1) = F x^t(k) + v(k),  t = 1, …, N_T,

where x^t(k) = [x_t(k) v_tx(k) y_t(k) v_ty(k)]^T is the state vector of target t at frame k; x_t(k) and y_t(k) are positions in the x and y directions, respectively; v_tx(k) and v_ty(k) are velocities in the x and y directions, respectively; N_T is the number of targets; F = [1 ∆ 0 0; 0 1 0 0; 0 0 1 ∆; 0 0 0 1] (rows separated by semicolons), where ∆ is the sampling time; and v(k) is a process noise vector, which is Gaussian white noise with covariance matrix Q. The measurement vector for target t consists of the positions in the x and y directions. The measurement equation is as follows:

z^t(k) = H x^t(k) + w(k),  H = [1 0 0 0; 0 0 1 0],

where w(k) is a measurement noise vector, which is Gaussian white noise with the covariance matrix R = diag(r_x², r_y²).
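For concreteness, the NCV model matrices can be written down directly; this is a minimal sketch in which the numeric values of ∆, r_x, and r_y are illustrative only, not the paper's settings.

```python
import numpy as np

delta = 0.1  # sampling time (s)

# State transition for one axis [position, velocity]; block-diagonal for x and y.
F1 = np.array([[1.0, delta],
               [0.0, 1.0]])
F = np.kron(np.eye(2), F1)          # state ordering: [x, vx, y, vy]

# Measurement matrix: positions only.
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

# Measurement noise covariance R = diag(rx^2, ry^2).
rx = ry = 0.5                       # assumed standard deviations (m)
R = np.diag([rx**2, ry**2])

# One propagation step of a noiseless target moving at 10 m/s in x.
x = np.array([0.0, 10.0, 0.0, 0.0])
x_next = F @ x                      # -> [1.0, 10.0, 0.0, 0.0]
```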

Two-Point Initialization
Two-point initialization has been applied to target tracking by a drone [25][26][27]. The initial state of each target is calculated by two-point differencing following a maximum-speed gating. The initial state vector and covariance matrix for target t are, respectively,

x̂^t(2|2) = [ z_x(2), (z_x(2) − z_x(1))/∆, z_y(2), (z_y(2) − z_y(1))/∆ ]^T,

P^t(2|2) = [r_x² r_x²/∆ 0 0; r_x²/∆ 2r_x²/∆² 0 0; 0 0 r_y² r_y²/∆; 0 0 r_y²/∆ 2r_y²/∆²],

where z_x(k) and z_y(k) are the position measurements at frame k. The state is confirmed as the initial state of the track if the following speed gating is satisfied:

√( [(z_x(2) − z_x(1))/∆]² + [(z_y(2) − z_y(1))/∆]² ) ≤ v_max,

where v_max is the maximum speed of a target.
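The initialization above can be sketched as follows; the helper below is hypothetical and follows standard two-point differencing, so the exact covariance form used in the paper may differ in detail.

```python
import numpy as np

def two_point_init(z1, z2, delta, r2, v_max):
    """Two-point differencing; returns (state, covariance) or None if the
    maximum-speed gate fails.  z1, z2: position measurements (x, y);
    r2: measurement noise variance (assumed equal in x and y)."""
    vx = (z2[0] - z1[0]) / delta
    vy = (z2[1] - z1[1]) / delta
    if np.hypot(vx, vy) > v_max:         # speed gating
        return None
    x0 = np.array([z2[0], vx, z2[1], vy])
    # Standard two-point covariance for one axis, repeated for x and y.
    P1 = np.array([[r2, r2 / delta],
                   [r2 / delta, 2 * r2 / delta**2]])
    P0 = np.kron(np.eye(2), P1)
    return x0, P0
```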

Prediction and Filter Gain
The state and covariance predictions are iteratively computed as

x̂^t(k|k − 1) = F x̂^t(k − 1|k − 1),  P^t(k|k − 1) = F P^t(k − 1|k − 1) F^T + Q,

where x̂^t(k|k − 1) and P^t(k|k − 1), respectively, are the state and the covariance prediction of target t at frame k; T denotes the matrix transpose. The residual covariance S^t(k) and the filter gain W^t(k), respectively, are obtained as

S^t(k) = H P^t(k|k − 1) H^T + R,  W^t(k) = P^t(k|k − 1) H^T [S^t(k)]^(−1).
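In code, the prediction and gain computations are a few lines of linear algebra; a sketch assuming F, H, Q, and R as defined in the system model (the function names are our own).

```python
import numpy as np

def predict(x, P, F, Q):
    """State and covariance prediction of the Kalman filter."""
    return F @ x, F @ P @ F.T + Q

def gain(P_pred, H, R):
    """Residual covariance S and Kalman filter gain W."""
    S = H @ P_pred @ H.T + R
    W = P_pred @ H.T @ np.linalg.inv(S)
    return S, W
```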

Measurement-Track Association
Measurement-to-track association is the process of assigning measurements to established tracks. Measurement gating is performed by the chi-square hypothesis test assuming Gaussian measurement residuals [21]. A measurement is considered a candidate for target t at frame k if it falls in the validation region

[z_m(k) − H x̂^t(k|k − 1)]^T [S^t(k)]^(−1) [z_m(k) − H x̂^t(k|k − 1)] ≤ γ_g,  m = 1, …, M(k),

where z_m(k) is the m-th measurement vector at frame k, γ_g is the gating size for measurement association, and M(k) is the number of measurements at frame k. The NN association rule assigns track t to the m̂_tk-th measurement, which is obtained as

m̂_tk = argmin_{m = 1, …, m_t(k)} [z_m(k) − H x̂^t(k|k − 1)]^T [S^t(k)]^(−1) [z_m(k) − H x̂^t(k|k − 1)],

where m_t(k) is the number of valid measurements for target t at frame k. Any remaining measurements that fail to associate with a target go to the initialization stage in Section 3.2.
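The gating and NN selection can be sketched as follows; the function name and return convention are our own. The gate γ_g would typically come from chi-square tables (e.g., 9.21 for two degrees of freedom at the 99% level).

```python
import numpy as np

def nn_associate(z_list, x_pred, P_pred, H, R, gamma_g):
    """Chi-square gating followed by nearest-neighbour selection.

    Returns the index of the associated measurement, or None if no
    measurement falls inside the validation region."""
    S = H @ P_pred @ H.T + R
    S_inv = np.linalg.inv(S)
    z_hat = H @ x_pred                       # predicted measurement
    best, best_d = None, gamma_g
    for m, z in enumerate(z_list):
        nu = np.asarray(z) - z_hat           # measurement residual
        d2 = nu @ S_inv @ nu                 # statistical (Mahalanobis) distance
        if d2 <= best_d:
            best, best_d = m, d2
    return best
```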

State Estimate and Covariance Update
The state estimate and the covariance matrix of target t are updated as follows:

x̂^t(k|k) = x̂^t(k|k − 1) + W^t(k) [ z_m̂tk(k) − H x̂^t(k|k − 1) ],
P^t(k|k) = P^t(k|k − 1) − W^t(k) S^t(k) [W^t(k)]^T.

If no measurement can be associated with target t at frame k, they merely become the predictions of the state and the covariance:

x̂^t(k|k) = x̂^t(k|k − 1),  P^t(k|k) = P^t(k|k − 1).
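The update step, including the missed-detection case in which the prediction is simply kept, can be sketched as (S and W as computed in the prediction stage; the function name is our own):

```python
import numpy as np

def update(x_pred, P_pred, z, H, S, W):
    """Kalman measurement update; with z=None the prediction is kept."""
    if z is None:
        return x_pred, P_pred
    nu = np.asarray(z) - H @ x_pred          # measurement residual
    x = x_pred + W @ nu
    P = P_pred - W @ S @ W.T
    return x, P
```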

Track-Track Association
If multiple measurements are continuously detected on a single object, more than one track can be generated. We develop the track-track association to eliminate redundant tracks. Multiple tracks on the same target have error dependencies on each other, thus the following track-association hypothesis test [21] is performed first:

[x̂^s(k|k) − x̂^t(k|k)]^T [P^s(k|k) + P^t(k|k) − P^st(k|k) − P^ts(k|k)]^(−1) [x̂^s(k|k) − x̂^t(k|k)] ≤ γ_f,  s ≠ t,  s, t = 1, …, N(k),

where x̂^s(k|k) and x̂^t(k|k) are the state vectors of tracks s and t, respectively, at frame k; P^s(k|k) and P^t(k|k) are the covariance matrices of tracks s and t, respectively, at frame k; P^st(k|k) is the cross-covariance between the two tracks; γ_f is a thresholding value for track association; and N(k) is the number of tracks at frame k. Binary indicators b_s(k) and b_t(k) become one when track s or t is associated with a measurement and zero otherwise; in the latter case, the state vector and the covariance matrix are replaced by predictions as in Equations (20) and (21). The fused covariance in Equation (24) is a linear recursion, and its initial condition is set at P^st(0|0) = [0]_4×4. When the track-association hypothesis is accepted, the most accurate track is selected, its current state is replaced with the fused estimate, and the remaining track is immediately terminated. The selection is based on the determinant of the covariance matrix, because the more accurate track has less error (covariance). A track is selected for fusion as

ĉ = argmin_{c ∈ {s, t}} det( P^c(k|k) ).

Figure 4 illustrates the track-track association process. Assuming that there are two tracks on the same target at frame k, as shown in Figure 4a, the statistical distance between the tracks is tested as shown in Figure 4b. In Figure 4c, the track with the least determinant of the covariance matrix, which is track s, is selected. The state and covariance of track s are replaced by the fused ones, and the other track t is terminated at the same time.
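A sketch of the track-track test and the determinant-based selection follows. For simplicity the fusion shown here assumes independent track errors (cross-covariance zero), which replaces the cross-covariance recursion described above; the function names are our own.

```python
import numpy as np

def track_track_associate(xs, Ps, xt, Pt, Pst, gamma_f):
    """Chi-square hypothesis test on the statistical distance between two
    track estimates.  Pst is the cross-covariance (zero if independence
    between the track errors is assumed)."""
    d = xs - xt
    T = Ps + Pt - Pst - Pst.T        # covariance of the estimate difference
    return d @ np.linalg.inv(T) @ d <= gamma_f

def select_and_fuse(xs, Ps, xt, Pt):
    """Keep the track with the smallest covariance determinant and fuse the
    two states (simple convex fusion under the independence assumption)."""
    keep_s = np.linalg.det(Ps) <= np.linalg.det(Pt)
    W = Ps @ np.linalg.inv(Ps + Pt)  # fusion weight
    x_fused = xs + W @ (xt - xs)
    P_fused = Ps - W @ Ps
    return ('s' if keep_s else 't'), x_fused, P_fused
```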
In the paper, we also establish criteria for a valid track and for track termination. One is the minimum track life to become a valid track. Track life is the number of frames between the initial frame and the last frame updated by a measurement [24]. The other criterion is for track termination: the track is terminated if the search for a measurement fails for a specific number of allowed frames. In Figure 5, the track life is six frames, and the track is terminated after six frames without updating measurements.

Video Description
Videos 1 and 2 were captured by a DJI Phantom 4 and an Inspire 2, respectively, in urban environments. The drones flew at a height of 150 m, in a straight line in Video 1 [8] and changing direction once in Video 2. The videos were captured in different weather and lighting conditions. The drone's speed was set to be constant on the controller by the operator. The camera was pointed directly at the ground and captured video clips at 30 fps for 15 and 24 s in Videos 1 and 2, respectively. Every third frame is processed for efficient image processing, so a total of 151 and 242 frames are considered in the two videos, and the actual frame rate is 10 fps. The original frame size of Video 1 is 4096 × 2160 pixels, but the frames are gray-scaled and resized by 50% to reduce the computational time. In Video 2, the original frames of 3840 × 2160 pixels are scaled the same way. One pixel corresponds to 0.11 m after resizing. There are 9 moving vehicles (6 cars, 2 buses, 1 bike) and several pedestrians in Video 1, while there are 23 moving vehicles (18 cars, 2 buses, 2 motorcycles, 1 bicycle) in Video 2. The details of the videos are described in Table 1.


Table 1. Details of the videos.

                               Video 1         Video 2
Recording time (s)             15              24
Flying direction               West→East       West→East→North
Actual frame number            151             241
Actual frame size (pixels)     2048 × 1080     1920 × 1080
Actual frame rate (fps)        10              10
Number of moving vehicles      9               23

Figure 6a shows Targets 1-6 at the 54th frame of Video 1, and Figure 6b shows Targets 7-9 at the 131st frame. The drone flew slightly upwards from west to east, and the coverage of the frame continued to shift in the same direction. Figure 7a shows Targets 1-10 at the 7th frame of Video 2, Figure 7b shows Targets 11-17 at the 117th frame, and Figure 7c shows Targets 18-23 at the 218th frame. The drone flew west to east for 130 frames and north for the remaining 111 frames in Video 2.

Moving Object Detection
The coordinates of the frame were compensated for, and then frame subtraction was performed between the current and the preceding frames, as in Equation (3). Figure 8 shows the intermediate results of the detection process for Figure 6b. Figure 8a is the binary image generated by frame subtraction and thresholding. Figure 8b is the object area after the morphological operations (erosion and dilation) are applied to Figure 8a. Figure 8c shows the rectangular windows including the detected areas after false alarm removal in Figure 8b. The red bounding boxes in Figure 8d are the boundaries of the object windows. Figure 9 shows the detection results for Figure 7b.


The average detection rate of Video 1 is 92%. A total of 23 false alarms were detected, including 16 moving pedestrians, but pedestrians were not considered targets of interest in this study. The average detection rate of Video 2 is 89%. There were 47 false alarms, of which 4 were moving pedestrians.
All the centroids of the object windows, including false alarms, of Videos 1 and 2 are shown in Figures 10 and 11, respectively, in the expanded frame. The number of detections is 709 and 1016 for each video, respectively. The expanded frame of Video 1 is 2779 × 1096 pixels, equivalent to 305.7 × 120.6 m; that of Video 2 is 2641 × 1426 pixels, equivalent to 290.5 × 156.7 m. The position coordinates of the expanded frame are compensated by p̂_x(k) and p̂_y(k) obtained in Equation (2). It is noted that the center locations are input to the next target tracking stage as measurements.

Multiple Target Tracking
In this subsection, we show the target tracking results of Videos 1 and 2. The sampling time in Equation (6) is 0.1 s because every third frame is processed. The parameters are designed as in Table 2. In Video 1, a total of 13 valid tracks are generated without track association; four redundant tracks are generated for Targets 1, 3, 4, and 7. With track association, 3 of the 4 tracks are successfully merged, resulting in 10 valid tracks. Therefore, tracking efficiency was improved from 69% to 90%. Figure 12a,b show the tracking results without and with track association, respectively, in the expanded frame. Two supplementary multimedia files (MP4 format) for tracking vehicles are available online. One is target tracking without track association (Supplementary Material Video S1) and the other is using the proposed method (Supplementary Material Video S2). The red bounding boxes of the MP4 file are the boundaries of the object windows, the blue dots are position estimates, and the numbers represent the track numbers in the order they were created.

To evaluate the accuracy of the position and velocity estimates, the ground truth of the target position for Video 1 is manually obtained as shown in Figure 17. The ground truth of the velocity is obtained as the difference in position between consecutive frames divided by the sampling time.

Table 3 shows the average position and velocity RMSEs with and without track association. If there is more than one track for a target (Targets 1, 2, and 4), the longer track is used to obtain the RMSEs. The average position RMSE decreased from 2.56 m to 2.42 m, but the average velocity RMSE increased from 2.20 m/s to 3.19 m/s. Table 4 considers only the fused states of Targets 2 and 4; the second and fourth rows count the errors only while track association occurs. Although the average velocity RMSE increases from 1.11 m/s to 1.77 m/s, the position RMSE improves from 2.61 m to 0.67 m.

In Video 2, 23 targets appear over 214 frames. Without track association, many redundant tracks are caused by multiple measurements on the same target. With the proposed method, the 56 valid tracks are reduced to 26. The tracking efficiency increased from 41% to 92%; it falls short of 100% because two segmented tracks were generated from one target. Figure 20a,b show the tracking results without and with track association, respectively. The ground truth of Video 2 is shown in Figure 21. Tables 5 and 6 show the average position and velocity RMSEs without and with track association, respectively. The average position RMSE decreased from 1.64 m to 1.21 m, while the average velocity RMSE increased from 1.65 m/s to 1.97 m/s.
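The RMSE figures above follow the usual definition over the track length. As a minimal sketch (function name and array layout are assumptions):

```python
import numpy as np

def position_rmse(estimates, truth):
    """Root-mean-square error between estimated and ground-truth 2D
    positions, both given as (N, 2) arrays in metres."""
    err = np.asarray(estimates) - np.asarray(truth)
    return float(np.sqrt(np.mean(np.sum(err**2, axis=1))))

# Toy check: a constant 1 m offset in x gives an RMSE of exactly 1 m.
est = np.array([[1.0, 0.0], [2.0, 0.0]])
gt  = np.array([[0.0, 0.0], [1.0, 0.0]])
print(position_rmse(est, gt))  # → 1.0
```

The velocity RMSE is computed the same way on the velocity components of the state.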

Discussion
The detection rates of Videos 1 and 2 are 92% and 89%, respectively. The rates were affected by the harsh conditions simulated in the experiment: the frames were converted to gray scale and resized by 50%, and the frame rate was reduced to 1/3 of the original. Missed detections degrade tracking performance; a long interval of missed detections may cause a track to be terminated or segmented. Using full-color images at the original resolution and frame rate would increase the detection performance. Apart from non-target objects (pedestrians), only a few false alarms were detected.
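The detection stage (registered-frame subtraction followed by morphological filtering and false alarm removal) can be sketched as below. The thresholds, the 3 × 3 structuring element, and the minimum-area rule are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np
from scipy import ndimage

def detect_moving_objects(prev_gray, cur_gray, diff_thresh=25, min_area=20):
    """Detect moving objects in two already-registered gray-scale frames.

    Returns a list of (x, y) centroids to be used as measurements."""
    # Frame subtraction on registered frames.
    diff = np.abs(cur_gray.astype(np.int16) - prev_gray.astype(np.int16))
    mask = diff > diff_thresh
    # Morphological opening removes isolated noise pixels.
    mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))
    # Label connected components and keep blobs large enough to be vehicles.
    labels, n = ndimage.label(mask)
    centroids = ndimage.center_of_mass(mask, labels, range(1, n + 1))
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    return [(cx, cy) for (cy, cx), a in zip(centroids, sizes) if a >= min_area]
```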
Typically, data association suffers from heavy clutter, low detection rates, state and measurement errors, and closely spaced targets. The NN association is the most computationally efficient and requires no initial assumptions about the number and states of the targets. Indeed, the measurement characteristics in this study make the NN association suitable for two reasons. (1) During object detection, the target is transformed from its visual shape in the image frame into a point object in a Cartesian coordinate system; thus, the measurement is not a continuous value but a discrete 2D signal whose resolution equals the spatial resolution of the image. Therefore, depending on the physical size of the target and the spatial resolution, no false alarm or other target can exist near the point of the true target. In other words, as long as the validation region falls inside the target body, there is no possibility of association with false alarms or other targets. Consequently, the accuracy of the position estimates matters more than the data association strategy. It is noted that the proposed track association scheme increases the accuracy of the position estimates as well as eliminating redundant tracks. (2) The false alarm rate is very low, and the false alarms are limited in space. Apart from the non-target objects (pedestrians), most false measurements were caused by buildings and some by parked cars, so the NN association is effective.
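The gated NN rule discussed above can be written compactly. This is a generic sketch; the gate value (the chi-square 99% point with 2 degrees of freedom) is a common choice, not necessarily the paper's.

```python
import numpy as np

def nn_associate(z_pred, S, measurements, gate=9.21):
    """Nearest-neighbour association with an ellipsoidal validation gate.

    Accepts the measurement with the smallest Mahalanobis distance
    d² = (z − ẑ)ᵀ S⁻¹ (z − ẑ), provided d² <= gate, where S is the
    innovation covariance. Returns None if no measurement passes the gate."""
    S_inv = np.linalg.inv(S)
    best, best_d2 = None, gate
    for z in measurements:
        v = np.asarray(z, dtype=float) - z_pred
        d2 = float(v @ S_inv @ v)
        if d2 <= best_d2:
            best, best_d2 = z, d2
    return best
```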
The NCV model is adopted in the experiments. The NCV, nearly constant acceleration (NCA), and coordinated turn (CT) models are the most commonly used among the various state-space dynamic models [35,36]. The choice of model depends on the target's maneuver, the measurement quality, and the measurement rate [36]. With high-quality measurements, the NCA model can produce an accurate estimate of the acceleration. The CT model is often used for targets in horizontal planes and can be extended to 3D space [34]. We leave the investigation of motion models in various road environments, such as the highly curved roads of highway interchanges as well as straight roads, to future work.
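For reference, the per-axis transition matrices of the three models mentioned above are standard and can be written as follows (the CT form assumes a known turn rate ω):

```python
import numpy as np

def ncv_matrix(T):
    """Nearly-constant-velocity (NCV) model, per-axis state [p, v]."""
    return np.array([[1.0, T],
                     [0.0, 1.0]])

def nca_matrix(T):
    """Nearly-constant-acceleration (NCA) model, per-axis state [p, v, a]."""
    return np.array([[1.0, T,   T**2 / 2],
                     [0.0, 1.0, T],
                     [0.0, 0.0, 1.0]])

def ct_matrix(T, w):
    """Coordinated-turn (CT) model with turn rate w, state [px, vx, py, vy]."""
    s, c = np.sin(w * T), np.cos(w * T)
    return np.array([[1.0, s / w,       0.0, -(1 - c) / w],
                     [0.0, c,           0.0, -s],
                     [0.0, (1 - c) / w, 1.0, s / w],
                     [0.0, s,           0.0, c]])
```

As ω → 0 the CT model reduces to the NCV model in two dimensions.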
In Video 1, Target 2 is moving horizontally, thus most of the errors are found in the x direction, as shown in Figures 18b and 19b. In Figures 18d and 19d, the error of Target 4 is mostly found in the y direction because the target moves vertically. When the target states are fused, a more accurate position estimate is obtained, although the velocity estimate becomes less accurate. Considering only the fused states of the two targets, the position accuracy improved by 83.6%. In Video 2, a total of 23 diverse targets are considered, and the drone changes direction once, which expands the surveillance coverage in two dimensions. A total of 26 valid tracks are generated, but two of them are segmented tracks of one target (a bicycle); thus, the tracking efficiency is 92%. Only one target (a bus) was responsible for two additional tracks; apart from these, no redundant tracks were generated. The position accuracy increased by 26%. The experimental results show that the proposed method is very effective in reducing redundant tracks and increasing the position accuracy.
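A common way to realize the state fusion of two associated tracks is a covariance-weighted convex combination; the sketch below assumes independent estimation errors and is illustrative, not the paper's exact fusion rule.

```python
import numpy as np

def fuse_tracks(x1, P1, x2, P2):
    """Fuse two state estimates of the same target by covariance weighting.

    The fused covariance is the inverse of the summed information matrices,
    and each estimate is weighted by its own information."""
    P1i, P2i = np.linalg.inv(P1), np.linalg.inv(P2)
    P = np.linalg.inv(P1i + P2i)
    x = P @ (P1i @ x1 + P2i @ x2)
    return x, P
```

With equal covariances this reduces to averaging the two estimates while halving the covariance, which matches the intuition that fusing two redundant tracks sharpens the position estimate.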

Conclusions
In this paper, multiple moving vehicles were captured by a moving drone. The frame coordinates were compensated, and the moving objects were detected based on frame subtraction. The targets were tracked with two-point initialization, a Kalman filter, and NN association. A track association scheme was proposed to merge redundant tracks and reduce the number of valid trajectories. The position accuracy was improved, with the fused states showing higher accuracy.
The proposed tracking method is useful for stand-alone drone surveillance as well as for multiple drones. A specific vehicle can be continuously locked on and tracked over a large area by drones handing over the target position. Thus, the system is suitable for vehicle pursuit as well as for traffic control or vehicle counting. Multiple drones sharing a ground unit could further improve target tracking accuracy, which remains for future study.