Small Target Recognition and Tracking Based on UAV Platform

Target recognition and tracking based on multi-rotor UAVs have the advantages of low cost and high flexibility. It can monitor low-altitude targets with high intensity. It has great application prospects in national defense, military, and civil fields. The existing algorithms for aerial small target recognition and tracking have the disadvantages of slow speed, low accuracy, poor robustness, and insufficient intelligence. Aiming at the problems of existing algorithms, this paper first makes a lightweight improvement for the YOLOv4 network recognition algorithm suitable for small target recognition and tests it on the VisDrone dataset. The accuracy of the improved algorithm is increased by 1.5% and the speed is increased by 3.3 times. Then, by analyzing the response value, the KCF tracking situation is judged, and the template update of the adaptive learning rate is realized. When the tracking fails, the target is re-searched and tracked based on the recognition results and the similarity judgment. Finally, experiments are carried out on the multi-rotor UAV, and the adaptive zoom tracking strategy is designed to track pedestrians, cars, and UAVs. The results show that the proposed algorithm can achieve stable tracking of long-distance small targets.


Introduction
The tracking and recognition system of UAVs based on multi-rotor design has the advantages of low cost, good concealment, high flexibility, and small volume. It can make up for the lack of satellite acquisition of near-ground and low-altitude information [1]. Therefore, multi-rotor UAVs have been widely used in civil and military fields. In the military field, multi-rotor UAV recognition and tracking can be used for reconnaissance and tracking of ground military targets, and can also carry weapons to attack targets independently [2]. After finding the target through the recognition and tracking algorithm, the multi-rotor UAV can maintain a safe distance for covert tracking and reconnaissance of the target. Compared with the traditional way of tracking vehicles or personnel, the UAV can achieve low-cost and zero casualty tracking. Because the UAV has a broad field of view from the top of the air, it can better track the target. In the civil field, after fusing the target recognition algorithm, UAVs can be used to monitor the growth of crops [3], rescue and disaster relief, search and rescue trapped people, and also can be used to track and hunt criminals when they escape [4], which has a very good application prospect.
At present, target recognition and tracking tasks are mainly oriented to natural scene images and have been relatively mature in related application fields, such as face recognition, pedestrian detection, and so on. However, due to the different imaging angles of UAVs, the effect of directly applying the existing algorithms to the field of UAVs is poor. Therefore, it is of great significance to study the target detection and tracking algorithm suitable for UAVs [5]. A major feature of traditional target detection algorithms is that the features used are artificially designed features, such as scale-invariant feature transformation (SIFT) [6], histogram of oriented gradients (HOG) [7], and so on.
However, the detection accuracy of traditional methods is poor. In 2012, the team led by Professor Hinton designed AlexNet using convolutional neural network (CNN) and won the championship on image net, making CNN gradually become one of the most map is divided into S × S grid cells, and each grid of the feature map is a predicted BBox, as shown in Figure 2. Each BBox prediction bounding box [26] contains location information (t x , t y , t w , t h ) and confidence (t 0 ), in which t x , t y are mapped between 0 and 1 through the sigmoid function, and the prediction center is fixed in the cell, making the model more stable. c x , c y represent the distance between the cell and the upper left coordinate of the characteristic diagram, p w , p h represent the width and height of the anchor, and (b x , b y , b w , b h ) represents the prediction result based on the anchor. The calculation formula is shown in Equation (1), where σ represents the sigmoid function.

YOLOv4 Algorithm
YOLOv4 was completed by Alexey Bochkovskiy in 2020. It is an end-to-end one-stage network. Its network structure is shown in Figure 1. Firstly, the input image is resized into a multiple of 32, such as 640 × 640. Multi-scale feature maps are extracted of 20 × 20, 40 × 40, and 80 × 80 through the backbone network CSPDarknet53. The obtained feature map is divided into S × S grid cells, and each grid of the feature map is a predicted BBox, as shown in Figure 2. Each BBox prediction bounding box [26] contains location information ( , , , ) and confidence ( ), in which , are mapped between 0 and 1 through the sigmoid function, and the prediction center is fixed in the cell, making the model more stable. , represent the distance between the cell and the upper left coordinate of the characteristic diagram, , represent the width and height of the anchor, and ( , , , ) represents the prediction result based on the anchor. The calculation formula is shown in Equation (1) σ(t y )

The Improved YOLOv4
YOLOv4 is mainly used for target recognition of conventional size from the head-up perspective. Using YOlOv4 for direct detection wastes computing resources for detecting large targets and is easy to lose the details of small targets. The improvement starts from the network depth and network width, and a backbone network is designed suitable for small target recognition to improve the accuracy of the network.

YOLOv4 Algorithm
YOLOv4 was completed by Alexey Bochkovskiy in 2020. It is an end-to-end one-stage network. Its network structure is shown in Figure 1. Firstly, the input image is resized into a multiple of 32, such as 640 × 640. Multi-scale feature maps are extracted of 20 × 20, 40 × 40, and 80 × 80 through the backbone network CSPDarknet53. The obtained feature map is divided into S × S grid cells, and each grid of the feature map is a predicted BBox, as shown in Figure 2. Each BBox prediction bounding box [26] contains location information ( , , , ) and confidence ( ), in which , are mapped between 0 and 1 through the sigmoid function, and the prediction center is fixed in the cell, making the model more stable. , represent the distance between the cell and the upper left coordinate of the characteristic diagram, , represent the width and height of the anchor, and ( , , , ) represents the prediction result based on the anchor. The calculation formula is shown in Equation (1), where represents the sigmoid function.

The Improved YOLOv4
YOLOv4 is mainly used for target recognition of conventional size from the head-up perspective. Using YOlOv4 for direct detection wastes computing resources for detecting large targets and is easy to lose the details of small targets. The improvement starts from the network depth and network width, and a backbone network is designed suitable for small target recognition to improve the accuracy of the network.

The Improved YOLOv4
YOLOv4 is mainly used for target recognition of conventional size from the head-up perspective. Using YOlOv4 for direct detection wastes computing resources for detecting large targets and is easy to lose the details of small targets. The improvement starts from the network depth and network width, and a backbone network is designed suitable for small target recognition to improve the accuracy of the network.
Although the over-deep network can extract more advanced semantic information, which has great advantages in dealing with large data sets, it will cause the loss of target spatial information, that is, it is difficult to complete the target positioning due to the lack of target details. Therefore, the original backbone network is improved for the problem that the loss of target detail information leads to low accuracy of the algorithm. Firstly, aiming at the problem of small target information loss, the residual block Resblock4, which is mainly used for medium target feature extraction in the original backbone network of Figure 1, is discarded. Secondly, in the Path Aggregation Network (PANnet) [27], the feature maps are downsampled and upsampled three times. Although this process finally restores the feature map to the original size, the restoration process is completed by linear interpolation, which causes the feature map to become blurred. Therefore, the detector is appropriately advanced, to enhance the ability of the network to detect small targets. Finally, to improve the recognition accuracy, an SPP module is added to the network, and 10 convolutional layers are used for feature extraction afterward.
On the other hand, for the problem that the network width greatly increases the net-work computation amount. Firstly, the number of channels of Resblock5 in Figure 1 is reduced from 512 to 256, which can reduce the calculation amount of the Resblock by a factor of four. Secondly, the SPP module quadruples the number of channels, resulting in a sharp increase in the amount of computation and parameters of a convolutional layer behind the SPP module. Therefore, the size of the feature map is changed from the original 80 × 80 to 40 × 40 through a downsample before the SPP module, so that the calculation amount between downsample and upsample is reduced by a factor of four. The improved YOLOv4 network structure is shown in Figure 3.
which has great advantages in dealing with large data sets, it will cause the loss of target spatial information, that is, it is difficult to complete the target positioning due to the lack of target details. Therefore, the original backbone network is improved for the problem that the loss of target detail information leads to low accuracy of the algorithm. Firstly, aiming at the problem of small target information loss, the residual block Resblock4, which is mainly used for medium target feature extraction in the original backbone network of Figure 1, is discarded. Secondly, in the Path Aggregation Network (PANnet) [27], the feature maps are downsampled and upsampled three times. Although this process finally restores the feature map to the original size, the restoration process is completed by linear interpolation, which causes the feature map to become blurred. Therefore, the detector is appropriately advanced, to enhance the ability of the network to detect small targets. Finally, to improve the recognition accuracy, an SPP module is added to the network, and 10 convolutional layers are used for feature extraction afterward.
On the other hand, for the problem that the network width greatly increases the network computation amount. Firstly, the number of channels of Resblock5 in Figure 1 is reduced from 512 to 256, which can reduce the calculation amount of the Resblock by a factor of four. Secondly, the SPP module quadruples the number of channels, resulting in a sharp increase in the amount of computation and parameters of a convolutional layer behind the SPP module. Therefore, the size of the feature map is changed from the original 80 × 80 to 40 × 40 through a downsample before the SPP module, so that the calculation amount between downsample and upsample is reduced by a factor of four. The improved YOLOv4 network structure is shown in Figure 3. The improved algorithm is tested based on the VisDrone dataset [28], and the test results are shown in Table 1. The test results show that the parameters of the improved backbone network are reduced to 10.4% of the original network, the accuracy is improved by 2.7%, and the reasoning speed is 1.8 times that of the original network in the task of small target recognition from the perspective of UAV. However, due to the network improvement, only the number of channels is designed for the entire Resblock, and there is still a redundancy problem in the network. The improved algorithm is tested based on the VisDrone dataset [28], and the test results are shown in Table 1. The test results show that the parameters of the improved backbone network are reduced to 10.4% of the original network, the accuracy is improved by 2.7%, and the reasoning speed is 1.8 times that of the original network in the task of small target recognition from the perspective of UAV. However, due to the network improvement, only the number of channels is designed for the entire Resblock, and there is still a redundancy problem in the network. To solve the problem of a large amount of calculation and slow reasoning speed caused by redundant channels in the network, the redundant channels are pruned based on BN layer coefficients. The BN layer standardizes the input of each layer of the network so that the data distribution shifts to the mean and variance of the overall data, which makes the network easier to initialize and accelerates network training. The BN layer preserves the sample characteristics through two learnable parameters to improve the flexibility of the network. The process of the BN algorithm is shown in Formula (2), where x i , y i are the input and output samples, µ B , σ 2 B are the mean and variance of batch data, and γ, β are the learnable scaling factor and shift factor.
It can be seen from Formula (2) that when γ is small, the channel has only a small input to the subsequent module, that is, the module is less important. Redundant channels can be removed by evaluating the importance of channels by sparse γ coefficients. According to this principle, the designed network is pruned, the parameters of the BN layer of the network model are thinned using L1 regularization, and the γ value of the sparse model is sorted. The channel pruning of the network is carried out by using the strategy of global pruning of 54% of the parameter quantity and at least 10% of the single layer, the network test results after pruning are shown in Table 2. By pruning redundant channels, the amount of network parameters is reduced to 32.1% of the original, and the amount of calculation is reduced to 4.3% of the original. When the network loses only 0.9% of the accuracy rate, the network inference speed reaches 1.9 times that before pruning. After the pruning is completed, to further improve the inference speed of the algorithm, TensorRT is used to quantify and accelerate the network. The pruning quantification test results are shown in Table 3. After accelerating the improved network with TensorRT, the network accuracy is only reduced by 0.3%, the inference speed is increased by 26%, and the performance of the algorithm on the embedded platform is greatly enhanced. Overall, compared with the original algorithm, the accuracy of the improved algorithm is increased by 1.5% and the speed is increased by 3.3 times.

Basic Principles of KCF
The basic idea of the tracking method based on correlation filtering is to design a filtering template according to the first frame sample, and the template will perform a correlation operation on the image of the search area. The maximum response position is the current target position, as shown in Formula (3), where g represents the response output, f represents the input image, h represents the filter template, and * represents the convolution operation.
To improve the operation speed, the convolution in the time domain can be transformed into dot multiplication in the frequency domain. As shown in Formula (4), F represents Fourier transformation, and represents point multiplication. The core task of correlation filtering is to find the optimal filtering template.
MOSSE, the first tracking algorithm based on correlation filtering, was proposed by Bolme D S et al. In 2010. Firstly, the filter template is designed through a set of training images F i and expected output G i . As shown in Formula (5).
For the whole image sequence, the algorithm hopes to minimize the square sum of the error between the true value and the predicted value of all sequences and solve the H * to obtain the template. Such as Formula (6) Due to the illumination change, rotation, and other transformations in the tracking process, the filter template needs to be updated online to adapt to the changes in the environment. The update strategy is shown in Formula (7), in which η is the learning rate, which makes the weight of the current frame larger, and the influence of the past frame attenuates with time.
KCF kernel correlation filtering algorithm has a very high speed and accuracy. Compared with MOSSE, KCF uses the HOG feature to replace the original pixels and proposes a multi-channel training strategy, which performs better in the case of illumination change and enhances the robustness of the algorithm. In the training stage, the nonlinear problem is introduced into the high-dimensional space through the kernel function to make it linearly separable. The filter is trained by ridge regression. The objective function is as shown in Equation (8), w is the filter template, x is the input image, y is the true value, and * represents the correlation operation.
Introduce Formula (8) into Fourier transform, as shown in Formula (9). The solution ofŵ can be obtained by derivingŵ and making its value 0. The result is shown in Formula (10). KCF uses a kernel function to map input features to high-dimensional space, such as Formula (11). The Formula (11) is introduced into the frequency domain and the derivative of α is 0 to obtain the optimal solution, as shown in Formula (12), wherek xz is the kernel correlation matrix.
In the process of tracking the target, the appearance of the target will change due to scale, attitude, and light and shadow, so the classifier needs to be updated in real-time. As shown in Formulas (13) and (14), whereâ t is the update coefficient of the classifier,x t is the appearance of the target, and η is the update rate.
3.2. Improvement of KCF 3.2.1. Adaptive Learning Rate KCF algorithm has the advantages of fast speed, low requirements for hardware platform, online template updating, and can adapt to the changes of objectives to a certain extent. However, in the application scenario of UAV tracking, there are some problems such as target occlusion and scale transformation. Aiming at the poor robustness of KCF in UAV tracking, an improved KCF algorithm with an adaptive learning rate is proposed.
Because KCF adopts a fixed learning rate, the tracking template will be polluted when the tracking target is blocked, similar targets interfere, and the tracking background is complex. This paper judges the tracking situation based on the tracking response parameters reduces the learning rate when the tracking is poor, avoids the template from being polluted, and realizes a highly robust update through an adaptive learning rate. Reference [29] mentioned a method to judge occlusion based on the response matrix. As shown in Formula (15), F max , F min is the maximum and minimum value in the response matrix, and F w,h is the response value at (w, h) in the response matrix.
However, due to different types of tracking targets, large differences in background complexity, different moving speeds of targets, and other factors, the APCE value will be different, so it is difficult to use a single threshold to judge the tracking situation of different sequences. Therefore, given the defects of APCE, based on the idea of data filtering, this paper proposes a method of adaptive APCE threshold for each sequence. The specific process is shown in Figure 4. First, the APCE mean of each sequence is calculated, and the outliers are eliminated. Then, whether to update the template is determined by the relevant threshold.
complexity, different moving speeds of targets, and other factors, the APCE value will be different, so it is difficult to use a single threshold to judge the tracking situation of different sequences. Therefore, given the defects of APCE, based on the idea of data filtering, this paper proposes a method of adaptive APCE threshold for each sequence. The specific process is shown in Figure 4. First, the APCE mean of each sequence is calculated, and the outliers are eliminated. Then, whether to update the template is determined by the relevant threshold.  Algorithms are compared on objects' occluded sequences of dataset UAV123. The one-pass evaluation method (OPE) [30] is used for experiments. This method initializes the first frame with the position of the target in the true value, then runs the tracking algorithm to obtain the algorithm result and calculates the average precision and success rate through the set threshold. In this section, the accuracy map is used to evaluate the results, and the experimental results are shown in Figure 5. The experimental results show that the speed of the improved algorithm is consistent with the original algorithm, and the accuracy of the improved algorithm is 1.4% higher than the original algorithm.

Scale Improvement
KCF has certain advantages in accuracy and speed, but the algorithm does not update the scale. When the size of the tracking target changes, the algorithm cannot adapt to the size of the target, resulting in tracking failure. In response to the appealing problem, and to ensure the real-time performance of the algorithm, this paper adopts three scale pools (1.05, 1, 0.95) to adapt to the change in target scale. In each frame of tracking, three scales are used for tracking. To adapt to the rapid changes of the target, the three basic scales are transformed using an adaptive scale transformation factor to achieve the effect of multiple scale pooling. Algorithms are compared on objects' occluded sequences of dataset UAV123. The one-pass evaluation method (OPE) [30] is used for experiments. This method initializes the first frame with the position of the target in the true value, then runs the tracking algorithm to obtain the algorithm result and calculates the average precision and success rate through the set threshold. In this section, the accuracy map is used to evaluate the results, and the experimental results are shown in Figure 5. The experimental results show that the speed of the improved algorithm is consistent with the original algorithm, and the accuracy of the improved algorithm is 1.4% higher than the original algorithm.
complexity, different moving speeds of targets, and other factors, the APCE value will be different, so it is difficult to use a single threshold to judge the tracking situation of different sequences. Therefore, given the defects of APCE, based on the idea of data filtering, this paper proposes a method of adaptive APCE threshold for each sequence. The specific process is shown in Figure 4. First, the APCE mean of each sequence is calculated, and the outliers are eliminated. Then, whether to update the template is determined by the relevant threshold.  Algorithms are compared on objects' occluded sequences of dataset UAV123. The one-pass evaluation method (OPE) [30] is used for experiments. This method initializes the first frame with the position of the target in the true value, then runs the tracking algorithm to obtain the algorithm result and calculates the average precision and success rate through the set threshold. In this section, the accuracy map is used to evaluate the results, and the experimental results are shown in Figure 5. The experimental results show that the speed of the improved algorithm is consistent with the original algorithm, and the accuracy of the improved algorithm is 1.4% higher than the original algorithm.

Scale Improvement
KCF has certain advantages in accuracy and speed, but the algorithm does not update the scale. When the size of the tracking target changes, the algorithm cannot adapt to the size of the target, resulting in tracking failure. In response to the appealing problem, and to ensure the real-time performance of the algorithm, this paper adopts three scale pools (1.05, 1, 0.95) to adapt to the change in target scale. In each frame of tracking, three scales are used for tracking. To adapt to the rapid changes of the target, the three basic scales are transformed using an adaptive scale transformation factor to achieve the effect of multiple scale pooling.

Scale Improvement
KCF has certain advantages in accuracy and speed, but the algorithm does not update the scale. When the size of the tracking target changes, the algorithm cannot adapt to the size of the target, resulting in tracking failure. In response to the appealing problem, and to ensure the real-time performance of the algorithm, this paper adopts three scale pools (1.05, 1, 0.95) to adapt to the change in target scale. In each frame of tracking, three scales are used for tracking. To adapt to the rapid changes of the target, the three basic scales are transformed using an adaptive scale transformation factor to achieve the effect of multiple scale pooling.
Record the target scale changes of five consecutive frames, when the frame of the target scale becomes larger, it is marked as 1, when the frame of the target scale becomes smaller, it is marked as −1, and the unchanged frame is marked as 0. Add the five numbers, convert the absolute value of their sum, and multiply it with the default scale transformation factor to obtain the scale adjustment factor of the current frame. When the target continuously becomes larger and smaller, the change rate can be automatically adjusted to achieve fast and accurate scale change. The calculation formula is shown in Formula (16), in which scales are the current scale pool of the scale, s 0 · · · s 4 is the scale transformation of the previous 5 frames of the record number. When the suppressed response peak is still larger than the response peak of the original scale, the scale is updated to obtain the scale with The improved algorithm is tested by the UAV123 dataset, and the experimental results are shown in Figure 6. The experimental results show that, through the adaptive scale, three basic scales can be used as six changed scales, and the improved algorithm improves the accuracy by 6.8% compared with the original KCF.
smaller, it is marked as −1, and the unchanged frame is marked as 0. Add the five numbers, convert the absolute value of their sum, and multiply it with the default scale transformation factor to obtain the scale adjustment factor of the current frame. When the target continuously becomes larger and smaller, the change rate can be automatically adjusted to achieve fast and accurate scale change. The calculation formula is shown in Formula (16), in which scales are the current scale pool of the scale, ⋯ is the scale transformation of the previous 5 frames of the record number. When the suppressed response peak is still larger than the response peak of the original scale, the scale is updated to obtain the scale with the largest response value as the new scale. After the tracking scale has not changed for eight consecutive frames, the scale is adjusted to the initial scale.
The improved algorithm is tested by the UAV123 dataset, and the experimental results are shown in Figure 6. The experimental results show that, through the adaptive scale, three basic scales can be used as six changed scales, and the improved algorithm improves the accuracy by 6.8% compared with the original KCF.

Target Re-Recognition
To improve the robustness and intelligence of the tracking algorithm, the autonomous initialization and intelligent re-recognition of the target are realized by combining the target recognition algorithm and the image similarity recognition algorithm. The flow chart is shown in Figure 7.  Figure 7. Autonomous initialization tracking combined with object recognition.
In Figure 7, i is the video frame sequence number and fc is the flag bit of whether the KCF tracking is successful. When the tracking fails according to the response, fc is set to

Target Re-Recognition
To improve the robustness and intelligence of the tracking algorithm, the autonomous initialization and intelligent re-recognition of the target are realized by combining the target recognition algorithm and the image similarity recognition algorithm. The flow chart is shown in Figure 7.
smaller, it is marked as −1, and the unchanged frame is marked as 0. Add the five numbers, convert the absolute value of their sum, and multiply it with the default scale transformation factor to obtain the scale adjustment factor of the current frame. When the target continuously becomes larger and smaller, the change rate can be automatically adjusted to achieve fast and accurate scale change. The calculation formula is shown in Formula (16), in which scales are the current scale pool of the scale, ⋯ is the scale transformation of the previous 5 frames of the record number. When the suppressed response peak is still larger than the response peak of the original scale, the scale is updated to obtain the scale with the largest response value as the new scale. After the tracking scale has not changed for eight consecutive frames, the scale is adjusted to the initial scale.
The improved algorithm is tested by the UAV123 dataset, and the experimental results are shown in Figure 6. The experimental results show that, through the adaptive scale, three basic scales can be used as six changed scales, and the improved algorithm improves the accuracy by 6.8% compared with the original KCF.

Target Re-Recognition
To improve the robustness and intelligence of the tracking algorithm, the autonomous initialization and intelligent re-recognition of the target are realized by combining the target recognition algorithm and the image similarity recognition algorithm. The flow chart is shown in Figure 7.  Figure 7. Autonomous initialization tracking combined with object recognition.
In Figure 7, i is the video frame sequence number and fc is the flag bit of whether the KCF tracking is successful. When the tracking fails according to the response, fc is set to In Figure 7, i is the video frame sequence number and fc is the flag bit of whether the KCF tracking is successful. When the tracking fails according to the response, fc is set to 0. At the beginning of the algorithm, i = 0, the target is recognized through the YOLOv4 target recognition algorithm. Calculate the similarity between the results of the same category as the target of interest and the pre-stored target image through the color histogram, use the recognition result with the largest similarity to initialize the KCF, and set the tracking flag to 1 to start tracking continuously. When the tracking fails, the initialization operation is carried out again.
In the tracking process, there are situations such as the target is blocked, the target moves rapidly, etc., which causes the tracking to fail. To achieve stable tracking of the target, it is necessary to combine the information before the target disappears to re-recognition and continue to track. By integrating the target recognition algorithm, the algorithm flow is shown in Figure 8. After the target is completely occluded, take the place where the target disappears as the center, and target recognition searches in an area four times the size of the target. When the target is detected and the similarity is greater than the threshold, the target recognition result is used to re-initialize the tracking target, and again to track.
use the recognition result with the largest similarity to initialize the KCF, and set the tracking flag to 1 to start tracking continuously. When the tracking fails, the initialization operation is carried out again.
In the tracking process, there are situations such as the target is blocked, the target moves rapidly, etc., which causes the tracking to fail. To achieve stable tracking of the target, it is necessary to combine the information before the target disappears to re-recognition and continue to track. By integrating the target recognition algorithm, the algorithm flow is shown in Figure 8. After the target is completely occluded, take the place where the target disappears as the center, and target recognition searches in an area four times the size of the target. When the target is detected and the similarity is greater than the threshold, the target recognition result is used to re-initialize the tracking target, and again to track.  To verify the effectiveness of the algorithm, experiments were carried out on sequences such as BlurBody, Human6, Jogging1, Jogging2, etc. in the OTB100 dataset. The experimental results are shown in Table 4. The experimental results show that by integrating the target recognition algorithm, the algorithm can re-recognize the tracking target when it is occluded or moves rapidly, and the accuracy rate is increased by 52.6% on average in the four data sequences, which has good usability.

Tracking System
To verify the effectiveness of the algorithm, the Onboard SDK (OSDK) is used to control the UAV based on DJI M210, DJI Z30 camera, airborne computer Manifold-2G, and other hardware. The completed hardware platform is shown in Figure 9. To verify the effectiveness of the algorithm, experiments were carried out on sequences such as BlurBody, Human6, Jogging1, Jogging2, etc. in the OTB100 dataset. The experimental results are shown in Table 4. The experimental results show that by integrating the target recognition algorithm, the algorithm can re-recognize the tracking target when it is occluded or moves rapidly, and the accuracy rate is increased by 52.6% on average in the four data sequences, which has good usability.

Tracking System
To verify the effectiveness of the algorithm, the Onboard SDK (OSDK) is used to control the UAV based on DJI M210, DJI Z30 camera, airborne computer Manifold-2G, and other hardware. The completed hardware platform is shown in Figure 9. To verify the effectiveness of the algorithm, the experimental flow as shown in Figure  10 is designed. The specific process is as follows: (1) The UAV will take off autonomously from the take-off point to an altitude of 10 m and 3 m away from the take-off point, and run the improved YOLOv4 algorithm; (2) Based on the target recognition results, in the process of tracking the target, the similarity between all targets detected by YOLOv4 and the reserved target template is calculated by using the color histogram. To ensure the stability of the algorithm, the target with the highest similarity and the distance from the center of the picture is To verify the effectiveness of the algorithm, the experimental flow as shown in Figure 10 is designed. The specific process is as follows: (1) The UAV will take off autonomously from the take-off point to an altitude of 10 m and 3 m away from the take-off point, and run the improved YOLOv4 algorithm; (2) Based on the target recognition results, in the process of tracking the target, the similarity between all targets detected by YOLOv4 and the reserved target template is calculated by using the color histogram. To ensure the stability of the algorithm, the target with the highest similarity and the distance from the center of the picture is less than a certain threshold is regarded as the interest target in the experiment; (3) The coordinate information of the target is used to initialize the improved KCF. After the KCF tracks the target, the pixel value of the target center deviating from the image center is used as the control amount, which is calculated by PID as the speed control amount, and the target is actually tracked. (4) In the process of tracking, the APCE value is used to judge the tracking situation.
When the tracking target is lost, the target re-recognition process will intervene.
To verify the effectiveness of the algorithm, the experimental flow as shown in Figure  10 is designed. The specific process is as follows: (1) The UAV will take off autonomously from the take-off point to an altitude of 10 m and 3 m away from the take-off point, and run the improved YOLOv4 algorithm; (2) Based on the target recognition results, in the process of tracking the target, the similarity between all targets detected by YOLOv4 and the reserved target template is calculated by using the color histogram. To ensure the stability of the algorithm, the target with the highest similarity and the distance from the center of the picture is less than a certain threshold is regarded as the interest target in the experiment; (3) The coordinate information of the target is used to initialize the improved KCF. After the KCF tracks the target, the pixel value of the target center deviating from the image center is used as the control amount, which is calculated by PID as the speed control amount, and the target is actually tracked. (4) In the process of tracking, the APCE value is used to judge the tracking situation.
When the tracking target is lost, the target re-recognition process will intervene.
The target is always in the center of the picture. When the target is not in the center of the picture, the pixel offset is converted into the UAV speed control through PID control, to complete the actual tracking task.

Experimental Result
Firstly, the intelligent tracking algorithm is tested, and a building corridor is selected as the experimental site. The scene contains pillars. When the tracked target passes through the area, it will be blocked by the pillar, and the tracked target will completely disappear from the picture.
The experimental effect is shown in Figure 11. When the target walks behind the pillar and the target is blocked, the algorithm judges the tracking failure through the response value and enters the re-initialization process. Then it starts target recognition, Figure 10. Experimental procedures.
The target is always in the center of the picture. When the target is not in the center of the picture, the pixel offset is converted into the UAV speed control through PID control, to complete the actual tracking task.

Experimental Result
Firstly, the intelligent tracking algorithm is tested, and a building corridor is selected as the experimental site. The scene contains pillars. When the tracked target passes through the area, it will be blocked by the pillar, and the tracked target will completely disappear from the picture.
The experimental effect is shown in Figure 11. When the target walks behind the pillar and the target is blocked, the algorithm judges the tracking failure through the response value and enters the re-initialization process. Then it starts target recognition, recognizes the target, and combines the position before the target disappears. When the similarity is greater than the threshold, it re-initializes the tracker and continues to track the target. The experimental results show that the designed algorithm can achieve accurate target recognition, automatic initialization, and robust tracking. recognizes the target, and combines the position before the target disappears. When the similarity is greater than the threshold, it re-initializes the tracker and continues to track the target. The experimental results show that the designed algorithm can achieve accurate target recognition, automatic initialization, and robust tracking. Secondly, experiments were carried out on the tracking of various types of small targets, and the tracking of UAVs and smart cars were carried out respectively. Figure 12 shows the platform tracks a DJI M210 multi-rotor UAV. In the 1920 × 1080 pixel image, the target only occupies 69 × 63 pixels. The target trajectory is shown by the yellow arrow. Secondly, experiments were carried out on the tracking of various types of small targets, and the tracking of UAVs and smart cars were carried out respectively. Figure 12 shows the platform tracks a DJI M210 multi-rotor UAV. In the 1920 × 1080 pixel image, the target only occupies 69 × 63 pixels. The target trajectory is shown by the yellow arrow. In Figure 13, the UAV has tracked a ground unmanned vehicle, which occupies only 39 × 54 pixels in the picture. The tracking results show that the platform and methods can stabilize and effectively track small targets. Secondly, experiments were carried out on the tracking of various types of small targets, and the tracking of UAVs and smart cars were carried out respectively. Figure 12 shows the platform tracks a DJI M210 multi-rotor UAV. In the 1920 × 1080 pixel image, the target only occupies 69 × 63 pixels. The target trajectory is shown by the yellow arrow. In Figure 13, the UAV has tracked a ground unmanned vehicle, which occupies only 39 × 54 pixels in the picture. The tracking results show that the platform and methods can stabilize and effectively track small targets.  Finally, the target tracking in a complex environment is tested, and a pedestrian in the path between tall buildings is tracked. As shown in Figure 14, during the whole tracking process, the UAV flew at a height of 16 m and a tracking distance of 493 m, which took  Secondly, experiments were carried out on the tracking of various types gets, and the tracking of UAVs and smart cars were carried out respectivel shows the platform tracks a DJI M210 multi-rotor UAV. In the 1920 × 1080 the target only occupies 69 × 63 pixels. The target trajectory is shown by the ye In Figure 13, the UAV has tracked a ground unmanned vehicle, which occup 54 pixels in the picture. The tracking results show that the platform and meth bilize and effectively track small targets.  Finally, the target tracking in a complex environment is tested, and a p the path between tall buildings is tracked. As shown in Figure 14, during the w ing process, the UAV flew at a height of 16 m and a tracking distance of 493 m Finally, the target tracking in a complex environment is tested, and a pedestrian in the path between tall buildings is tracked. As shown in Figure 14, during the whole tracking process, the UAV flew at a height of 16 m and a tracking distance of 493 m, which took 7 min and 13 s, and the average speed was 1.14 m/s. The tracking trajectory is shown in Figure 15. 7 min and 13 s, and the average speed was 1.14 m/s. The tracking trajectory is shown in Figure 15.   In the tracking of a complex environment, the tracked target is a pedestrian, and the trajectory of the UAV is consistent with that of the tracked target of interest. When multiple pedestrians appear in the picture at the same time, the algorithm can eliminate the interference of similar targets and track the specific pedestrian. In the complex background, the algorithm updates the template adaptively to avoid the pollution of the tracking template and realizes the target tracking in the complex environment. In addition, when the target is blocked, it can still be tracked again by the re-recognition algorithm, which shows the characteristics of strong robustness and high level of intelligence of the algorithm. In the tracking of a complex environment, the tracked target is a pedestrian, and the trajectory of the UAV is consistent with that of the tracked target of interest. When multiple pedestrians appear in the picture at the same time, the algorithm can eliminate the interference of similar targets and track the specific pedestrian. In the complex background, the algorithm updates the template adaptively to avoid the pollution of the tracking template and realizes the target tracking in the complex environment. In addition, when the target is blocked, it can still be tracked again by the re-recognition algorithm, which shows the characteristics of strong robustness and high level of intelligence of the algorithm.

Conclusions
An improved algorithm for a lightweight network for target recognition is proposed. Backbone network, redundant channels are pruned based on the BN layer coefficients, and TensorRT is used to accelerate the network, which improves the accuracy and speed of small target recognition. At the same time, aiming at the shortcomings of low robustness and intelligence of the existing KCF tracking algorithm, an improved KCF algorithm with an adaptive learning rate is proposed. The APCE is calculated based on the target response value, and a threshold is set to judge the tracking situation. The template adaptive learning rate update is realized, and the tracking template pollution problem caused by the fixed learning rate is solved. Besides, adaptive scale updating is used to greatly improve the accuracy of the algorithm. Finally, the improved tracking algorithm is combined with the improved target recognition algorithm, which improves the robustness of the algorithm while realizing automatic re-recognition when tracking fails. A physical experimental platform is built up based on DJI M210 UAV. The physical experiment results show that the improved algorithm proposed in this paper realizes real-time, accurate, and highly robust recognition and tracking of small targets.