A Video-Based Real-Time Tracking Method for Multiple UAVs in Foggy Weather

Abstract: To address the video-based real-time tracking of multiple unmanned aerial vehicles (UAVs) under fog conditions, we propose a multitarget real-time tracking method that combines the Deepsort algorithm with detection based on improved dark channel defogging and an improved You Only Look Once version 5 (YOLOv5) algorithm. The contributions of this paper are as follows: 1. For the multitarget tracking problem under fog interference, a multialgorithm combination method is proposed. 2. By optimizing dark channel defogging, the complexity of the original algorithm is reduced from O(n²) to O(n), which shortens the processing time of the defogging algorithm. 3. The YOLOv5 network structure is optimized so that the network reduces the detection time while maintaining high-precision detection. 4. The amount of processing is reduced through image size compression, and the real-time performance under high-precision tracking is improved. In the experiments conducted, the proposed method improved tracking precision by 36.1% and tracking speed by 39%. The average tracking time per image frame was 0.036 s, satisfying the requirement for real-time tracking of multiple UAVs in foggy weather.


Introduction
UAVs have been widely used in diverse missions. Because UAVs are highly flexible and easy to conceal, they play an important role in the military field. The use of UAVs at sea to detect and attack naval vessels poses a threat to the safety of navigation. Therefore, effective detection and tracking of UAV targets are of great significance. In foggy weather, the features of UAVs in the video are weakened, which reduces the tracking accuracy.
Presently, video-based multiobject tracking mainly uses the following methods: a labeled multi-Bernoulli multiobject tracking algorithm based on detection optimization [1], a multiobject tracking algorithm that predicts position by combining the deep neural network SiamCNN with contextual information [2], and a pedestrian tracking algorithm that combines the YOLO detection algorithm with a Kalman filter [3]. These methods focus on problems such as tracking loss caused by occlusion, target ID matching errors in the tracking process, and missed tracking targets. In the process of detection and tracking, there is a trade-off between tracking accuracy and algorithm processing time. Achieving high-precision tracking often increases the processing time, which makes it difficult to meet real-time requirements.
Recently, scholars have carried out research on image defogging. Salazar-Colores et al. (2020) proposed a novel methodology based on depth approximations through DCP, local Shannon entropy, and a fast guided filter for reducing artifacts and improving image recovery in sky regions with low computation time [4]. Liu et al. (2019) combined dark channel defogging with a KCF tracking algorithm for multitarget tracking under fog conditions [5]. However, this method requires the target frames to be labeled manually in the first image. Once a tracking target is lost, the method cannot complete the follow-up tracking task. Defogging is mainly achieved by image enhancement [6][7][8][9] and by using a physical model [10][11][12][13][14]. Image enhancement does not consider the influence of fog on the image but achieves the defogging effect by adjusting the image contrast. Meanwhile, the physical model takes into account the foggy image generation process and other factors, with dark channel defogging being a typical algorithm.
In order to achieve real-time tracking of multiple UAV targets, we selected the YOLOv5 algorithm, which has excellent speed performance in the current target detection field, to carry out the UAV target detection task. In the process of matching and tracking, because UAV motion is characterized by abrupt changes in direction and speed, conventional tracking methods such as Kalman filtering and the SORT algorithm are prone to tracking and matching errors. Therefore, we chose the Deepsort algorithm to perform track association for the detected UAV targets. The Deepsort algorithm combines the motion state and appearance features of the target to perform matching and association in the tracking process. It tracks targets with abrupt changes in direction and speed well.
We implemented and tested a "detection-based tracking" algorithm for multiple UAVs in foggy weather by combining an improved dark channel algorithm with improved YOLOv5 and Deepsort [15][16][17]. Through these improvements, we reduced the complexity of the defogging algorithm from O(n²) to O(n) and simplified the YOLOv5 network, thus reducing the time of the defogging and detection algorithms. Compared to target detection and tracking without fog interference, the introduction of defogging makes the algorithm spend more time on single-frame image processing. For this reason, images are compressed without distortion to further ensure real-time tracking. The specific process is given in Figure 1.

Image Defogging Algorithm Based on Improved Dark Channel
The dark channel defogging algorithm uses the following model for image defogging [18]:

$$I(x) = J(x)\,t(x) + A\,(1 - t(x)) \tag{1}$$

where x stands for the pixel spatial coordinate of the image, I(x) represents the captured foggy image, J(x) represents the restored fogless image, A is the global atmospheric light value, and t(x) is the transmissivity, which can be estimated from Equation (1) when $J^{dark}$ tends to zero according to the dark channel prior theory [19]. Moreover, the value of A is the maximum grayscale of the pixels in the original image that correspond to the top 0.1% luminance in the dark channel [20]. The process flow of the dark channel defogging algorithm is presented in Figure 2.
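As an illustration, once t(x) and A are known, the fog model I = J·t + A·(1 − t) can be inverted for J at each pixel. The following minimal Python sketch works on a single gray value; the function names `recover_pixel` and `add_fog` and the lower bound `t_min` (which guards against division by a near-zero transmissivity) are assumptions made for this example and are not taken from the paper.

```python
def add_fog(j_val, t, a):
    """Forward fog model for one pixel: I = J*t + A*(1 - t)."""
    return j_val * t + a * (1.0 - t)


def recover_pixel(i_val, t, a, t_min=0.1):
    """Invert the fog model for J at one pixel (intensities in [0, 255]).

    t_min is an assumed floor on the transmissivity, not a value from the paper.
    """
    t = max(t, t_min)                 # avoid dividing by a near-zero transmissivity
    j_val = (i_val - a) / t + a       # algebraic inversion of I = J*t + A*(1 - t)
    return min(max(j_val, 0.0), 255.0)  # clip to the valid intensity range


foggy = add_fog(80.0, 0.5, 200.0)          # fog applied to a clear value of 80
restored = recover_pixel(foggy, 0.5, 200.0)  # recovers the clear value
```

The clipping step is a practical safeguard only; with an exact t(x) and A the inversion returns the original value without it.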

Determination of Transmissivity by Mean Filtering
In the dark channel defogging algorithm, the minimum filter is often used. However, after minimum filter processing, the restored image shows obvious white halos along the edges of the UAV target. This artifact degrades the edge features of the UAV target itself and hampers UAV target recognition. To solve this problem, we used mean filtering to process the dark channel image when estimating the transmittance. Additionally, the defogging coefficient is tied to the foggy image itself so that the defogging strength adapts to the input. The detailed calculation process is as follows. In Equation (1), I(x), J(x), t(x), and A are all above 0 and $J^{dark}$ tends to zero:

$$A\,(1 - t(x)) \le I(x) \tag{2}$$

We assumed a constant atmospheric light value, $A_0$, and took the minimum intensity over the R, G, and B channels:

$$A_0\,(1 - t(x)) \le \min_{c \in \{r,g,b\}} I^c(x) \tag{3}$$

It is transformed to

$$t(x) \ge 1 - \frac{\min_{c \in \{r,g,b\}} I^c(x)}{A_0} \tag{4}$$

Here, mean filtering is carried out on the right side of the inequality, and the transmittance t(x) is calculated. The result after mean filtering reflects the general trend of t(x), but it differs from the real value by a certain absolute amount. Therefore, an offset value ϕ (0 < ϕ < 1) is added to compensate the filtering result. Moreover, to simplify the notation, $\operatorname{average}_{sa}\left(\min_{c \in \{r,g,b\}} I^c(x)\right)$ is written as $M_{ave}(x)$, where average denotes mean filtering and sa denotes the window size of the filter. The approximate transmissivity is obtained as follows:

$$t(x) = 1 - \frac{M_{ave}(x)}{A_0} + \phi\,\frac{M_{ave}(x)}{A_0} \tag{5}$$

Let δ = 1 − ϕ; the above equation is then expressed as follows:

$$t(x) = 1 - \delta\,\frac{M_{ave}(x)}{A_0} \tag{6}$$

δ regulates the darkness of the image restored after defogging. The larger the value of δ, the lower the transmissivity t(x) and the darker the restored image. To enable δ to adjust the brightness after defogging dynamically according to the foggy image, δ is associated with the pixels of the original image as follows:

$$\delta = \rho\, m_{av} \tag{7}$$

where $m_{av}$ is the mean value of all elements in $M(x) = \min_{c \in \{r,g,b\}} I^c(x)$, that is, the mean over all pixel coordinates x of the minimum of the RGB channels in the original foggy image. Moreover, $0 \le \rho \le 1/m_{av}$. If the value of δ is too high, the transmissivity becomes too low and the image is dark after defogging. For this reason, the maximum threshold of δ is set to 0.9. Thus, we have

$$\delta = \min(\rho\, m_{av},\ 0.9) \tag{8}$$

Equations (3), (6), and (8) are combined to obtain the following:

$$t(x) = \max\left(1 - \min(\rho\, m_{av},\ 0.9)\,\frac{M_{ave}(x)}{A_0},\ 1 - \frac{M(x)}{A_0}\right) \tag{9}$$
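The transmissivity estimate described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: pixel values are assumed to be scaled to [0, 1], the helper names (`channel_min`, `mean_filter`, `transmittance`) and the defaults `rho=1.3` and `radius=1` are hypothetical, and taking the maximum of the filtered estimate and the bound 1 − M(x)/A0 is one reading of how the lower bound on t(x) is combined with the adaptive coefficient δ = min(ρ·m_av, 0.9).

```python
def channel_min(img):
    """M(x): per-pixel minimum over the R, G, B values."""
    return [[min(px) for px in row] for row in img]


def mean_filter(m, radius=1):
    """M_ave(x): box mean over a (2*radius+1)^2 window, clipped at the borders."""
    h, w = len(m), len(m[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [m[yy][xx] for yy in ys for xx in xs]
            out[y][x] = sum(vals) / len(vals)
    return out


def transmittance(img, a0, rho=1.3, radius=1):
    """t(x) = max(1 - delta*M_ave(x)/A0, 1 - M(x)/A0), delta = min(rho*m_av, 0.9)."""
    m = channel_min(img)
    m_ave = mean_filter(m, radius)
    h, w = len(m), len(m[0])
    m_av = sum(sum(row) for row in m) / (h * w)   # mean of all elements of M(x)
    delta = min(rho * m_av, 0.9)                  # adaptive coefficient, capped at 0.9
    return [[max(1.0 - delta * m_ave[y][x] / a0, 1.0 - m[y][x] / a0)
             for x in range(w)] for y in range(h)]
```

The naive window loop above is written for readability; a running-sum (integral image) box filter would be needed to keep the cost linear in the number of pixels.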

Estimation of Global Atmospheric Light Value
During dark channel defogging, the positions of the top 0.1% of pixels in the dark channel are determined. This operation requires comparing each pixel with the other pixels to arrange them in order. We assumed that the number of pixels in the image is n. Then, the algorithm complexity of this operation reaches O(n²), and the amount of computation increases significantly with the image size.
In this study, we directly used the combination of the maximum pixel value of the filtered dark channel and the maximum pixel value of the RGB channels of the original image to estimate the atmospheric light value. In the algorithm process, only the relationship between the current pixel value and the maximum pixel value needs to be compared, thus greatly reducing the computational complexity of the algorithm. The luminance of the restored image is slightly lower on the whole, but the complexity of the defogging algorithm can be reduced from O(n²) to O(n), thus shortening the processing time of the algorithm.
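Filtering is performed for inequality (4) to obtain

$$\operatorname{average}_{sa}\left(1 - \frac{M(x)}{A_0}\right) = 1 - \frac{M_{ave}(x)}{A_0} \ge 0 \tag{10}$$

From Equation (10), we obtain

$$\max\left(M_{ave}(x)\right) \le A_0 \tag{11}$$

and hence, $A_0$ may be expressed as

$$A_0 = \varepsilon \max\left(\max_{c \in \{r,g,b\}} I^c(x)\right) + (1 - \varepsilon)\max\left(M_{ave}(x)\right) \tag{12}$$

where 0 ≤ ε ≤ 1.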
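The single-pass estimate of the atmospheric light described above, which combines the maximum of the filtered dark channel with the maximum of the RGB channels, can be sketched as follows. The function name `atmospheric_light`, the default weight `eps=0.5`, and the choice of which maximum the weight multiplies are assumptions made for this example; only the idea of a weighted combination with a weight between 0 and 1 comes from the text.

```python
def atmospheric_light(img, m_ave, eps=0.5):
    """Estimate A0 as a weighted mix of two maxima, each found in one linear pass.

    img   : rows of (r, g, b) tuples
    m_ave : mean-filtered dark channel, same height and width as img
    eps   : assumed weight in [0, 1] (hypothetical default)
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    max_rgb = 0.0
    max_ave = 0.0
    for img_row, ave_row in zip(img, m_ave):
        for px, ave in zip(img_row, ave_row):
            # only a comparison against the running maximum is needed per pixel
            max_rgb = max(max_rgb, max(px))
            max_ave = max(max_ave, ave)
    return eps * max_rgb + (1.0 - eps) * max_ave
```

Each pixel is visited once and compared only with the running maxima, so no sorting of the dark channel is required.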
The implementation steps of the algorithm are given in Table 1.
$M_{ave}(x) = \operatorname{average}_{sa}\left(\min_{c \in \{r,g,b\}} I^c(x)\right)$
4: Calculate the mean value of all elements in $M_{ave}(x)$ to obtain $m_{av}$.
5: Calculate the global atmospheric light value.

Tracking of Multiple UAVs with Improved YOLOv5 and Deepsort
After defogging the video frame of UAVs under fog conditions, an algorithm combining improved YOLOv5 and Deepsort was employed to track multiple UAVs.
The network structure of YOLOv5 is divided into input, backbone, neck, and prediction [21] as given in Figure 3.

Optimization and Improvement of YOLOv5 Network
The YOLOv5 network structure can shorten the detection time while detecting objects of interest with high precision. Before target detection and tracking, the defogging algorithm is applied to the foggy image, which increases the processing time. Although the defogging time is reduced by improving dark channel defogging, the overall algorithm still cannot meet the real-time performance requirements. The YOLOv5 network structure was therefore optimized and improved to further shorten detection.

Removal of Focus Layer
The focus module is used in YOLOv5 to slice an image and expand the three RGB channels of the original image into 12 channels. Further convolution generates an initial down-sampled feature image that retains the valid information of the original image while improving the processing speed because of less calculation and fewer parameters. However, frequent slice operation in the focus module increases the amount of calculation and parameters. In the process of sampling 640 × 640 × 3 images to obtain 320 × 320 × 3 feature maps, the amount of calculation and parameters of the focus operation becomes four times that of the ordinary convolution operation. As revealed in experiments, convolution can replace the focus module and perform satisfactorily without side effects. Hence, the focus layer was removed to further improve the speed.
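The slicing step of the focus module can be illustrated as a space-to-depth rearrangement: every 2 × 2 block of pixels is stacked along the channel axis, so the height and width halve while the channel count quadruples (3 to 12 for an RGB image). The sketch below uses nested Python lists; the function name `focus_slice` and the order in which the four pixels are concatenated are assumptions for illustration.

```python
def focus_slice(img):
    """Rearrange an H x W x C image into H/2 x W/2 x 4C (H and W must be even).

    img is a list of rows; each pixel is a list of C channel values.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            # stack the four pixels of the 2 x 2 block along the channel axis
            row.append(img[y][x] + img[y + 1][x] + img[y][x + 1] + img[y + 1][x + 1])
        out.append(row)
    return out


# a 4 x 4 RGB image whose pixels store their own (row, column, 0)
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
sliced = focus_slice(image)   # 2 x 2 image with 12 channels per pixel
```

A stride-2 convolution applied directly to the input reaches the same output resolution without this rearrangement, which is the replacement the section describes.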

Backbone Optimization Based on ShuffleNet V2
The backbone of YOLOv5 adopts the C3 module to extract object features, which is simpler and faster than the BottleneckCSP [22] used in previous versions. However, the C3 module utilizes multiple separable convolutions, so it occupies a large portion of memory if there are many channels and it is invoked frequently. In this case, the speed of the equipment is reduced to some extent. As a lightweight network model, ShuffleNet V2 [23] contains two block structures. As shown in Figure 4, Structure a has a channel split, so the input feature image with c input channels is split into two branches: c1 and c2. The branch c2 is concatenated with c1 after three-layer convolution. Through the control module, input and output channels are kept consistent to increase the inference speed of the model. Subsequently, channel shuffle is performed to reduce the branches of the network structure, improve the parallelism of the network, and shorten the duration of processing. Therefore, Structure a is mainly used to deepen the layers of the network. Meanwhile, Structure b has the same right branch as Structure a. Its left branch is concatenated with the right branch after convolution of the input features, and the channels are then shuffled. However, this structure omits the channel split so as to allow the expansion of module output channels. Therefore, it was mainly employed to downsample and compress the feature layer. In the neck layer, the improved network maintained the same feature pyramid network (FPN) and path aggregation network (PAN) structure as in YOLOv5. However, the PAN structure contained as many output channels as input channels. Moreover, the "cat" operation was changed to "add", which further optimized memory access and utilization, as shown in Figure 5. The channels of the original YOLOv5 were pruned to redesign the network structure. The optimization and improvement of YOLOv5 were achieved by deleting the focus module, replacing the backbone with the ShuffleNet module, and optimizing the network in the neck layer. The improved detection network became less complex and processed faster. Its structure is presented in Figure 6.
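The channel shuffle operation mentioned above can be illustrated on a plain list of channel indices. The sketch views the channels as a (groups, channels per group) grid, transposes it, and flattens it again, so that channels from the different branches are interleaved. The function name `channel_shuffle` and the default of two groups are assumptions for this example.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: reshape (g, n/g), transpose, flatten."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# channels 0-2 come from one branch and 3-5 from the other
mixed = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)   # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each consecutive pair of channels contains one channel from each branch, which is what lets information flow between the two branches in the next block.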

Tracking of Multiple UAVs Based on Deepsort
The Deepsort algorithm takes the output of the YOLOv5 detector as its input to select the boxes for object detection and calculate the object association for matching. The tracking process flow [24] is presented in Figure 7. [...] h) and to record the motion state and information of objects. On this basis, the Kalman filter is employed to predict the motion state of objects. The motion information of objects predicted from the current frame is matched and associated with the object detection result output for the next frame. In the second operation, Deepsort associates the motion features of objects by virtue of the Mahalanobis distance. In order to overcome the matching errors that arise from abrupt changes in object speed and jitter of the shooting equipment when the Mahalanobis distance is used alone, appearance feature matching of objects is introduced for compensation. These two measures are combined through linear weighting into the final measure. Therefore, the method offers reliable association and matching with short-term prediction and compensates for ID switches during object occlusion and camera shake.
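The linear weighting of the motion and appearance measures can be sketched as follows. This is a simplified illustration, not the Deepsort implementation: the covariance is assumed to be diagonal so that the squared Mahalanobis distance reduces to a weighted sum, appearance is compared with a cosine distance between feature vectors, and the function names and the weight `lam=0.5` are hypothetical.

```python
import math


def mahalanobis_sq(measurement, predicted_mean, variances):
    """Squared Mahalanobis distance under an assumed diagonal covariance."""
    return sum((z - m) ** 2 / v
               for z, m, v in zip(measurement, predicted_mean, variances))


def cosine_distance(feat_a, feat_b):
    """1 - cosine similarity between two appearance feature vectors."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return 1.0 - dot / (norm_a * norm_b)


def combined_cost(d_motion, d_appearance, lam=0.5):
    """Linear weighting of the two measures; lam is an assumed weight in [0, 1]."""
    return lam * d_motion + (1.0 - lam) * d_appearance


# one detection compared with one predicted track
d_m = mahalanobis_sq((102.0, 50.0), (100.0, 50.0), (4.0, 4.0))
d_a = cosine_distance((1.0, 0.0), (1.0, 0.0))
cost = combined_cost(d_m, d_a, lam=0.5)
```

A small combined cost means that the detection agrees with the track in both predicted position and appearance; the assignment between detections and tracks is then solved over the matrix of such costs.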

Image Compression
While tracking multiple UAVs in a foggy video, the defogging algorithm is introduced, but it extends the time of processing frames in the video and lowers the frame rate. Hence, it is difficult to achieve real-time processing. In order to improve the processing speed of a single frame without reducing the accuracy [25,26], we used a method based on bilinear interpolation [27] to compress the image before defogging. The details are as follows.
Assuming that the size of the source image is a × b and the size of the compressed image is m × n, the horizontal and vertical compression ratios of the image are a/m and b/n. The pixel value of the compressed image can then be calculated from these ratios and the pixel points in the source image. Assuming that the coordinates of a pixel point in the compressed image are (i, j), the corresponding point in the original image is (i × a/m, j × b/n). Because i × a/m and j × b/n are generally not integers, the four adjacent pixels of the point are interpolated to obtain its pixel value. The schematic diagram of such an operation is presented in Figure 8. In Figure 8, P represents the point of the source image corresponding to the pixel point of the compressed image; $Q_{11}$, $Q_{12}$, $Q_{21}$, and $Q_{22}$ are the four pixel points adjacent to point P in the original image. $Q_{11}$ and $Q_{21}$ are interpolated in the horizontal direction to obtain the pixel value of $R_1$, and $Q_{12}$ and $Q_{22}$ are interpolated in the horizontal direction to obtain the pixel value of $R_2$. Writing $P = (x, y)$, $Q_{11} = (x_1, y_1)$, $Q_{21} = (x_2, y_1)$, $Q_{12} = (x_1, y_2)$, and $Q_{22} = (x_2, y_2)$, the specific calculation is as follows:

$$f(R_1) = \frac{x_2 - x}{x_2 - x_1}\,f(Q_{11}) + \frac{x - x_1}{x_2 - x_1}\,f(Q_{21})$$

$$f(R_2) = \frac{x_2 - x}{x_2 - x_1}\,f(Q_{12}) + \frac{x - x_1}{x_2 - x_1}\,f(Q_{22})$$

$R_1$ and $R_2$ are then interpolated in the vertical direction to obtain the pixel value of the point P:

$$f(P) = \frac{y_2 - y}{y_2 - y_1}\,f(R_1) + \frac{y - y_1}{y_2 - y_1}\,f(R_2)$$
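The compression step can be sketched in pure Python for a single-channel image. The sketch follows the mapping described above, in which pixel (i, j) of the compressed image corresponds to (i × a/m, j × b/n) in the source, and interpolates first horizontally and then vertically. The function name `bilinear_resize` and the clamping of neighbors at the image border are assumptions for this example; a production pipeline would normally call an optimized library routine instead.

```python
def bilinear_resize(src, m, n):
    """Resize a gray image (list of rows) to n rows by m columns.

    The source has b rows and a columns; neighbors are clamped at the border.
    """
    b, a = len(src), len(src[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(n):
        for i in range(m):
            x, y = i * a / m, j * b / n          # position in the source image
            x1, y1 = int(x), int(y)              # top-left neighbor
            x2, y2 = min(x1 + 1, a - 1), min(y1 + 1, b - 1)
            fx, fy = x - x1, y - y1              # fractional offsets
            r1 = (1 - fx) * src[y1][x1] + fx * src[y1][x2]   # along row y1
            r2 = (1 - fx) * src[y2][x1] + fx * src[y2][x2]   # along row y2
            out[j][i] = (1 - fy) * r1 + fy * r2              # between the rows
    return out


ramp = [[0, 10, 20, 30], [0, 10, 20, 30]]
half = bilinear_resize(ramp, 2, 1)   # 4 x 2 source reduced to 2 x 1
```

Because the neighbors are one pixel apart, the denominators x2 − x1 and y2 − y1 of the general formula equal 1 and drop out of the code.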

Preparation for Experiments
Experiments were performed on a personal computer with the Windows 10 operating system, an Intel(R) Core(TM) i7-10750 processor, 16 GB of memory, an NVIDIA Quadro T1000 GPU (Lenovo, Beijing, China), the PyTorch deep learning framework, and CUDA version 10.2.
Because current target detection datasets include no public dataset for UAV targets, we built a UAV dataset for network model training by downloading network images and taking UAV flight photos. A total of 1727 UAV photos were collected. Each image contained one to six UAVs, covering rotor UAVs of different models. The backgrounds included sky, land, underwater, city, vegetation, and other environments. Some UAV images are shown in Figure 9. The tool Labeling was used to label the UAV targets in the self-built dataset. The dataset was in the format of common objects in context (COCO). The dataset annotation process is shown in Figure 10.
After the dataset was labeled, extensible markup language (XML) documents were produced for the images. These XML documents contained object type and border information, particularly the type and position of the labeled objects in each image. All 1727 images were labeled and divided at a ratio of 7:3 into the training set and the testing set, and the network was trained for 300 iterations.
In the experiment, the indexes for evaluating detection effect were accuracy, recall, and average precision. Multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP) were taken as the indexes for the detection precision of multiple UAVs. The ID switches (IDs) of object tracking while matching indicated the stability of object tracking [28]. The processing time per frame of the image was taken as an index for speed.
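For reference, MOTA is commonly computed from the counts of missed targets, false detections, and ID switches accumulated over all frames. The sketch below assumes this common definition, which the text does not spell out; the function name `mota` and the example counts are hypothetical and are not results from the experiments.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# hypothetical counts: 100 ground-truth boxes, 10 misses, 5 false alarms, no switches
score = mota(10, 5, 0, 100)   # about 0.85
```

Because every ID switch counts as an error, a tracker that detects well but swaps identities still loses MOTA, which is why ID switches are also reported separately as a stability index.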

Comparison of YOLOv5 Network Improvement Effect
In order to verify the effect of our improvement to the YOLOv5 network structure, we carried out target detection experiments on the same dataset with the original YOLOv5s model and the improved YOLOv5 model; evaluated the complexity of the model with floating point operations (FLOPs), parameters, inference time, and model size; and evaluated the detection effect with average precision (AP).The experimental results are shown in Table 2. Based on Table 2, the following conclusions were drawn.
Compared to the original YOLOv5 network model, the complexity of the improved YOLOv5 network model was reduced. The FLOPs were only 1/6 of the original, and the number of parameters was reduced to 1/10 of the original. In addition, the size of the model was reduced from 13.7 M to 1.58 M, which greatly shortened the inference time in the detection process, while the detection accuracy remained at a high level.

Multi-UAV Target Tracking Experiments on UAVs under Normal Illumination
We used mobile devices to record the flight video images of three UAVs and verified the tracking ability of the improved algorithm for UAV targets based on the recorded video. During the video capture process, the UAVs were controlled to fly in a stable state, cross trajectories, and fly to a far distance.
Experiments were conducted to test the tracking ability of the algorithm for UAV flight in a stable state, with track crossing and occlusion, and with small targets and scale change. The specific tracking results are shown below.
By analyzing Figures 11-13 and Table 3, the following conclusions were drawn: (1) For multiple UAV targets in a stable flight state, both the YOLOv5 + Deepsort algorithm and the improved YOLOv5 + Deepsort algorithm tracked stably. (2) For multiple UAV targets with mutual occlusion, once the targets separated, the algorithm could still track each target and keep the original target ID unchanged, without target ID matching errors. (3) For small-scale targets and targets with changing scale, the algorithm could also accurately identify and track them. In the process of tracking multiple UAV targets, as long as the flying targets did not leave the video capture range, no ID switches occurred. (4) Moreover, the improved YOLOv5 algorithm significantly improved the tracking speed. The average tracking frame rate of the original tracking algorithm was 15 FPS, which could not meet the real-time tracking requirements. By reducing the complexity of the original network, the improved YOLOv5 greatly reduced the detection time in the tracking process. The average tracking frame rate reached 43 FPS, far exceeding the real-time tracking performance requirements. Real-time and high-precision tracking of multiple UAV targets was thus realized.
A self-produced dataset of UAVs was used to train the YOLOv5 model. Defogging was performed with different algorithms on 345 image frames from the foggy video. The defogged frames were then passed to the trained YOLOv5 model for the object detection experiment. The average detection results with different defogging algorithms are given in Table 4. In the process of detection accuracy evaluation, we used the IOU value between the detection frame and the real frame as the criterion for judging whether the detected target position was accurate. In the calculation process, we set the IOU judgment threshold to 0.5. When the IOU value was higher than the set threshold, the predicted position for the current target was considered to be accurate. As revealed in Table 4, the improved dark channel defogging algorithm achieved the best results in terms of detection precision indexes. Hence, we conducted defogging on a foggy video image based on the dark channel defogging and improved dark channel defogging algorithms to track multiple UAV targets.
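The IOU criterion with the threshold of 0.5 can be sketched as follows. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates; this box format and the function names `iou` and `is_accurate` are assumptions for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_accurate(detected, ground_truth, threshold=0.5):
    """A detection counts as accurate when its IOU exceeds the threshold."""
    return iou(detected, ground_truth) > threshold


overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))   # intersection 2, union 6
```

In this example the two boxes share half of each box's area, yet the IOU is only one third, so the detection would not count as accurate at a threshold of 0.5.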

Comparison of Target Tracking Results of Multiple UAVs under Fog Interference
In order to explore the tracking ability of the algorithm for multiple UAV targets under fog interference, we recorded flight videos of three UAVs in foggy weather and verified the tracking effect of the algorithm based on the recorded videos. As the UAV target tracking ability under stable flight, mutual occlusion, and small scale and scale change had already been verified, as noted in Section 5.2.2, in order to ensure experimental safety, we only recorded relevant videos of UAV stable flight status under fog conditions. The test video contained 345 images in total, and the image size was 1920 × 1080.
An experiment was carried out to compare the tracking results of multiple UAVs with four methods. These methods were as follows: As revealed in Figure 15, Method 1 could detect and track only one UAV, at a confidence level of 0.78. Methods 2 and 3 could each detect and track two UAVs, at confidence levels of 0.67 and 0.67, and 0.71 and 0.83, respectively. However, Method 4 could detect all UAVs, at confidence levels of 0.77, 0.82, and 0.47. Therefore, Method 4 had the best tracking effect. In order to quantitatively evaluate the tracking effect, the tracking indicators under the four tracking methods were calculated. The specific tracking indicators are shown in Table 5 and Figure 15. Based on Table 5 and Figure 15, the following conclusions were drawn: (1) Defogging could effectively improve the effects of multiobject tracking. In the experimental results of Method

Real-Time Tracking with Compressed Images
It can be seen from Section 5.3.2 that by improving the dark channel defogging algorithm and the YOLOv5 network structure, the time of the defogging and tracking algorithm was considerably shortened, but the average processing time per image frame was 0.188 s, which cannot meet the requirement for real-time processing.
Considering that the processing speed of the defogging algorithm is affected by the size of the input image, an image compression method was introduced to reduce the size of the input image before defogging, and bilinear interpolation was selected as the compression method. Target tracking experiments were conducted after compressing the same fog video outlined in Section 5.2.2. The tracking method used improved dark channel defogging + improved YOLOv5 + Deepsort. The tracking effects of different compression sizes are shown in Table 6 and Figure 16.
The analysis based on Table 6 and Figure 16 revealed the following: (1) The smaller the compression size of images without distortion, the shorter the average defogging time and the shorter the tracking time per frame. However, when the size was too small, the tracking effect decreased dramatically. As the size was compressed further, the UAV target in the image became too small for its feature information to stand out, so the tracking accuracy decreased and the discontinuity of tracking led to ID matching errors. (2) Within a certain range of compression size without distortion, the accuracy of object tracking could exceed 85% in all cases. When the images were compressed to 576 × 324, the tracking time per frame was 0.036 s, and the accuracy of object tracking was 88.9%. Therefore, the requirements for accuracy and real-time processing were satisfied simultaneously. (3) At the compression size of 576 × 324, Method 4 achieved a 36.1% higher tracking precision and 39% higher tracking speed than Method 1 in the tracking of multiple UAVs.
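The reported per-frame times and image sizes can be related with a short calculation. The sketch below converts the two per-frame times quoted in this section to frame rates and computes the fraction of pixels that remains after compressing 1920 × 1080 frames to 576 × 324. The function names are hypothetical, and no particular frame-rate threshold for "real time" is assumed, because the text does not state one.

```python
def frames_per_second(seconds_per_frame):
    """Convert an average per-frame processing time to a frame rate."""
    return 1.0 / seconds_per_frame


def pixel_ratio(src_size, dst_size):
    """Fraction of pixels kept when resizing from src_size to dst_size (width, height)."""
    (src_w, src_h), (dst_w, dst_h) = src_size, dst_size
    return (dst_w * dst_h) / (src_w * src_h)


fps_full = frames_per_second(0.188)    # about 5.3 frames per second at full size
fps_small = frames_per_second(0.036)   # about 27.8 frames per second at 576 x 324
kept = pixel_ratio((1920, 1080), (576, 324))   # 0.09, i.e. 9% of the original pixels
```

Both dimensions are scaled by 0.3, so the defogging stage processes roughly one eleventh of the original pixels, which is consistent with the large drop in per-frame time.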

Conclusions
This paper presented the study and improvement of the dark channel defogging algorithm and the YOLOv5 network structure. With the Deepsort algorithm, a video-based method for tracking multiple UAVs in foggy weather was designed and further combined with image compression without distortion. In this way, the proposed method meets the requirement for real-time processing while ensuring highly precise and stable tracking. It is therefore of considerable practical value.

Figure 1. Process flow of tracking multiple UAVs in foggy weather.

Figure 2. Process flow of the dark channel defogging algorithm.

Figure 5. Comparison of neck with PAN structure. (a) Improved neck PAN structure. (b) PAN structure in YOLOv5.

Figure 9. Images of UAV dataset. (a) UAV images downloaded from the network. (b) UAV images taken by a mobile phone.

Figure 10. Image labeling of the dataset.

Figure 15. Average object tracking time per frame of the image with different algorithms.

Table 2. Comparison of YOLOv5 and improved YOLOv5 network structures.

Table 3. Target tracking effect index comparison of multiple UAVs.

Table 4. Comparison of detection results with different defogging algorithms for multiple UAVs.

Table 5. Multiobject tracking indexes with different methods.
1, the target detection for UAVs failed due to fog interference, the tracking accuracy was only 52.8%, and the discontinuity of tracks led to target ID matching errors and ID switches. Overall, Method 4 achieved the best tracking indexes. It improved MOTA by 35.2% and MOTP by 31.7% compared to Method 1. (2) Defogging increased the time of processing a single frame of the image. With Method 2, the time increased by 603.4% after defogging. With Method 4, after improving the dark channel algorithm and optimizing the YOLOv5 network structure, the time decreased by 54.7% compared to Method 2, but it was still 218.6% higher than with Method 1.

Table 6. Indexes for evaluating the tracking effect with different compression sizes.

Figure 16. Comparison of tracking time with different compression sizes.