2.1. Proposed Recognition Network
We optimized the recognition network as follows. A deep-learning-based target recognition network with good generalization is used to recognize airports, bridges, and ports in low-resolution imagery. The following introduces the backbone structure of the network, candidate box generation, the calculation of the network loss function, and the training strategy.
Step (1) Design of image target recognition backbone network
First, the basic structure of the low-resolution remote sensing image target recognition network is introduced. The basic structure of the remote sensing target recognition network used in this subject is shown in Figure 1. The network backbone follows the basic structure of VGG16: the first five stages retain the five convolutional blocks of the VGG16 network, while the sixth and seventh fully connected layers of VGG16 are discarded and replaced by two new convolutional layers built with dilated convolution [38].
A conventional pooling layer in a deep neural network enlarges the receptive field at the cost of reduced resolution, and this loss of resolution discards some feature information. The advantage of dilated convolution is that it avoids the resolution loss caused by pooling [38]. A comparison between dilated convolution and ordinary convolution is shown in Figure 2: with the same number of parameters, a larger receptive field can be obtained by using dilated convolution instead of ordinary convolution.
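As a minimal illustration (assuming PyTorch; the channel count and input size are arbitrary), the following sketch shows that a dilated 3 × 3 convolution keeps the output resolution and parameter count of an ordinary 3 × 3 convolution while covering a larger receptive field:

```python
import torch
import torch.nn as nn

# An ordinary 3x3 convolution versus a dilated 3x3 convolution (dilation = 2).
# With dilation 2, the same 3x3 kernel covers a 5x5 receptive field.
ordinary = nn.Conv2d(256, 256, kernel_size=3, padding=1)
dilated = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 256, 19, 19)
print(ordinary(x).shape, dilated(x).shape)  # identical output sizes

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(ordinary) == count(dilated))    # True: same number of parameters
```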
After the newly added sixth and seventh convolutional layers, three more convolutional layers (conv8, conv9, and conv10) are added, and a global average pooling layer at the end of the network converts the output feature map of the previous layer into a one-dimensional vector. The remote sensing targets studied in this subject exhibit large intra-class appearance gaps as well as large scale variations within the same target type, so multi-scale recognition is particularly important. To handle the scale changes of the target objects, the network outputs feature maps of different scales at different layers and sends them to the detector to predict the confidence of each category and the position coordinate offsets. As shown in Figure 3, the front-most feature map is output after the Conv4_3 layer. The feature maps of the first few layers in the network describe the shallower features of the input image, and their receptive fields are relatively small. In contrast, the deeper feature maps describe higher-level composite features; their receptive fields are larger, and they carry stronger high-level semantic information. At the end of the network, a non-maximum suppression step is added to prevent the same target from being detected simultaneously by the detectors of multiple feature layers, as shown in Figure 3, and the final detection result is obtained from it. The network backbone does not use a fully connected layer: on one hand, each output location then responds only to the features of the region near the target rather than to global information; on the other hand, this also reduces the number of parameters computed in the network.
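The following is a minimal sketch of the appended layers, assuming SSD-style channel sizes consistent with the feature map dimensions listed in Step (2); the exact configuration is illustrative rather than the authors' implementation:

```python
import torch.nn as nn

# Layers appended to the truncated VGG16 backbone (illustrative configuration).
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)  # dilated, replaces FC6
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)                        # replaces FC7
conv8 = nn.Sequential(nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
                      nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True))
conv9 = nn.Sequential(nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
                      nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True))
conv10 = nn.Sequential(nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
                       nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True))
gap = nn.AdaptiveAvgPool2d(1)  # converts the last feature map to a 1 x 1 x 256 vector
```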
Step (2) Candidate box generation in the network
The network adopts an idea similar to the anchor in Faster R-CNN [33] to generate candidate regions, which are called priority boxes here. For the aforementioned network, six sets of feature maps are generated by the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and global average pooling layers, with sizes of 38 × 38 × 512, 19 × 19 × 1024, 10 × 10 × 512, 5 × 5 × 256, 3 × 3 × 256, and 1 × 1 × 256, respectively. For the feature maps of different scales output by different layers, candidate regions matching the different aspect ratios of the target objects can be simulated by using different aspect ratios in each feature map.
Figure 4 shows the process of generating priority boxes during airport image training. Consider the generation of priority boxes on a feature map of a given scale, taking Conv9_2 as an example, whose feature map has size 5 × 5 × 256. Its default box parameter is set to 6 in the network; that is, 6 priority boxes with different aspect ratios are generated around each anchor point. For this layer's feature map, a total of 5 × 5 × 6 = 150 candidate priority boxes is obtained for predicting the category confidence and the 4 position coordinate scores. Across the output feature maps of all layers, the network generates 8732 priority boxes for prediction. During network training, the prediction on an input image is therefore equivalent to classification and position regression on 8732 sub-images of the input image at different scales.
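As a quick sanity check of the figure quoted above, the total can be reproduced by assuming the standard SSD per-layer box counts (4, 6, 6, 6, 4, 4), of which the text fixes 6 for Conv9_2:

```python
# Prior-box count per feature map: (side length)^2 x (boxes per anchor point).
feature_sizes = [38, 19, 10, 5, 3, 1]   # Conv4_3 ... global average pooling
boxes_per_pos = [4, 6, 6, 6, 4, 4]      # assumed SSD-style per-layer counts
total = sum(s * s * b for s, b in zip(feature_sizes, boxes_per_pos))
print(total)  # 8732
```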
In the process of generating boxes with different aspect ratios, two parameters, scale and ratio, are used to control the sizes of the generated boxes. The scale parameter varies with the layer number.
During network prediction, the scale value of the lowest-level feature map is set to 0.2, that is, $S_{\min} = 0.2$, and the scale value of the highest-level feature map is set to $S_{\max} = 0.95$. The ratio values are taken from $a_r \in \{1, 2, 3, 1/2, 1/3\}$, and this parameter controls the aspect ratio of the candidate boxes around each anchor point. Scale and ratio are used to calculate the size of the priority box in each layer's feature map. Let the width of each priority box be $w_k^a$ and the height be $h_k^a$. Then, the width and height of each priority box can be calculated by:

$$w_k^a = S_k \sqrt{a_r}, \qquad h_k^a = S_k / \sqrt{a_r}$$

where $S_k$ is the scale parameter of layer $k$, and its calculation formula is:

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{m - 1}(k - 1), \quad k \in [1, m]$$
For a ratio of 1, that is, an aspect ratio of 1, two candidate boxes are generated around each anchor point: one with scale $S_k$ and an extra one with scale $S_k' = \sqrt{S_k S_{k+1}}$. In this way, 6 different boxes are obtained for each anchor point.
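A minimal sketch of this sizing rule (for $m = 6$ feature maps; in this sketch the scale for $k = m + 1$ is simply extrapolated beyond $S_{\max}$):

```python
import math

def scale(k, m=6, s_min=0.2, s_max=0.95):
    # Per-layer scale S_k, varying linearly from S_min to S_max.
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)

def prior_sizes(k, ratios=(1, 2, 3, 1/2, 1/3)):
    s_k = scale(k)
    boxes = [(s_k * math.sqrt(r), s_k / math.sqrt(r)) for r in ratios]
    s_extra = math.sqrt(s_k * scale(k + 1))   # extra box for aspect ratio 1
    boxes.append((s_extra, s_extra))
    return boxes                              # six (width, height) pairs per anchor

print(prior_sizes(4))  # e.g., the Conv9_2 feature map (k = 4)
```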
Step (3) Network loss function design
The network in this topic is a supervised learning network. For supervised learning, the target positions and target categories in the manually labeled annotations are essential, so in training it is important to associate the manually labeled position and category information with the priority boxes generated by the network. The first issue is the definition of positive and negative samples, for which the concept of IoU (intersection over union) is introduced. For the target recognition task in this topic, as shown in Figure 5, the red dashed box on the left is a priority box generated during training, and the solid green box is the manually labeled target position. Let $A \cap B$ be the overlapping area of the two boxes and $A \cup B$ be the total area covered by the two boxes. Then, IoU is defined as follows:

$$\mathrm{IoU} = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$
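A minimal implementation of this definition for axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ form might look as follows:

```python
def iou(a, b):
    # Corners of the intersection rectangle of boxes a and b.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter         # total area covered by both boxes
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143
```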
During the training process, for the priority boxes generated by the network, if there is a manually labeled target (ground truth) near a priority box and the IoU between the box and the ground truth is greater than 50%, the box is regarded as a positive sample; otherwise, it is regarded as a negative sample, so every box receives a definite positive or negative label. With this strategy, each ground truth corresponds to multiple positive samples, which also alleviates the imbalance between positive and negative samples caused by the excess of negative samples during training.
During training, because there are two training objectives (category confidence and the prediction of the four position parameters), the corresponding objective function is also divided into two parts. The objective function follows the idea of the MultiBox loss function [39] and computes the classification confidence of the category to which the target belongs together with the regression accuracy of the target location. For the classification task of each box, the confidence in the network is computed with a softmax-type cross-entropy loss function:

$$L_{conf}(x, c) = -\sum_{i \in Pos} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \qquad \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p}\exp\left(c_i^{p}\right)}$$

where $x_{ij}^{p} \in \{0, 1\}$ indicates whether the $i$-th priority box is matched to the $j$-th ground truth of category $p$, and $c$ is the predicted class score.
The position regression loss uses smooth L1 loss:

$$L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right), \qquad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where $l$ is the predicted position offset and $\hat{g}$ is the encoded ground truth position.
The total loss function of the network is the weighted sum of the above two loss functions:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where $N$ is the number of positive samples and $\alpha$ weights the two terms.
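A minimal sketch of this two-part objective, assuming PyTorch; the tensor names (`conf`, `labels`, `loc`, `loc_t`, `pos`) are placeholders rather than the authors' variables:

```python
import torch
import torch.nn.functional as F

def multibox_loss(conf, labels, loc, loc_t, pos, alpha=1.0):
    # conf: per-box class scores; labels: matched class indices;
    # loc/loc_t: predicted and target offsets; pos: boolean mask of positives.
    n = pos.sum().clamp(min=1).float()                       # number of positives N
    l_conf = F.cross_entropy(conf, labels, reduction='sum')  # softmax cross-entropy
    l_loc = F.smooth_l1_loss(loc[pos], loc_t[pos], reduction='sum')  # positives only
    return (l_conf + alpha * l_loc) / n                      # weighted sum of the two
```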
Step (4) Network training strategy
In response to the problem of insufficient data sets during training, this topic expands the data sets as follows: the number of labeled samples is doubled, the expanded data set is used for training, and all other training parameters and the environment are kept the same. After multiple experiments on the target data set, the data expansion improves the accuracy of target recognition by an average of 3 to 5 percentage points. Taking airport training as an example, as shown in Figure 6, the left side is the test accuracy before expansion and the right side is the recognition accuracy after data expansion.
In the training process, most of the priority boxes around each anchor point are negative samples. If the original positive and negative samples were used for training directly, the proportion of positive to negative samples would be extremely imbalanced, and the excess of negative samples would degrade the training accuracy of the network to a certain extent. Therefore, the hard example mining method is used during training to balance the positive and negative samples. The priority boxes with an IoU greater than 50% are regarded as positive samples; for the remaining boxes, the values of the classification loss function are sorted for each type of target, and the boxes with the largest loss values are selected as negative samples, so that the ratio of positive to negative samples is finally controlled at 1:3.
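A minimal sketch of this mining step, assuming PyTorch and 1-D per-box tensors; the names are placeholders:

```python
import torch

def mining_mask(cls_loss, pos, ratio=3):
    # cls_loss: per-box classification loss; pos: boolean mask of positives
    # (IoU > 50%). Returns the mask of boxes kept for the classification loss.
    neg_loss = cls_loss.clone()
    neg_loss[pos] = 0.0                        # positives never count as negatives
    _, idx = neg_loss.sort(descending=True)    # hardest negatives first
    _, rank = idx.sort()                       # rank of each box by loss
    num_neg = ratio * pos.sum()                # keep three negatives per positive
    neg = rank < num_neg
    return pos | neg
```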
In the initialization stage of training, the weights of the convolution kernels in the convolutional layers newly added beyond those of VGG16 are initialized with the Xavier initialization [40] method. During training, Adam (adaptive moment estimation) [41] is selected as the optimization method instead of the commonly used stochastic gradient descent (SGD) to accelerate model convergence. The Adam optimization algorithm is a weight update method based on a dynamic learning rate: it adaptively selects a suitable learning rate for different parameter states during training, making the convergence process more stable and faster. In practice, the initial learning rate, momentum, weight decay, and other parameter values differ slightly across data sets.
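A minimal sketch of these initialization and optimizer choices, assuming PyTorch; the layer and hyperparameter values shown are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

# Xavier initialization for a newly added layer, then Adam as the optimizer.
new_layer = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
nn.init.xavier_uniform_(new_layer.weight)   # Xavier initialization [40]
nn.init.zeros_(new_layer.bias)
optimizer = torch.optim.Adam(new_layer.parameters(),
                             lr=1e-4, weight_decay=5e-4)  # Adam [41] instead of SGD
```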
In addition, in order to improve the training results of the algorithm, this topic introduces transfer learning to improve the training recognition rate. Although the data set has been greatly expanded, the amount of data is still insufficient for a deep recognition network. For the low-level feature extraction network, the introduction of transfer learning can greatly improve the training results. Transfer learning focuses on training problems where data are insufficient: its goal is to use the weights learned on one task to accelerate the learning and convergence process of a new task. With the help of transfer learning, a large existing data set (such as the Pascal VOC data set) is used directly for pre-training, and the parameters are then loaded directly from the existing model during training. In the low-resolution remote sensing image target recognition algorithm of this subject, when a new target recognition training task is introduced, the existing model can be loaded directly at the start of training, thereby speeding up convergence and improving the correct recognition rate to a certain extent. This method also achieves the incremental learning of existing models required by the technical indicators.
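A minimal sketch of this loading step, assuming PyTorch; the checkpoint file name is hypothetical and the tiny model stands in for the full recognition network:

```python
import torch
import torch.nn as nn

# Start a new training task from an existing checkpoint (placeholder model).
model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU())
state = torch.load('vgg16_voc_pretrained.pth', map_location='cpu')  # hypothetical file
model.load_state_dict(state, strict=False)  # keep matching layers, ignore the rest
```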
During the test process, more than 8000 candidate regions are obtained, and the same target may be framed by different priority boxes at different scales. For each output target area, non-maximum suppression is used to merge the target bounding boxes: the boxes are sorted by score, the box with the highest score is selected, and the IoU between this highest-scoring box and every other box in the surrounding area is calculated. All boxes whose IoU exceeds a certain threshold are deleted, and the process is repeated on the remaining bounding boxes until the final target boxes are obtained.
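A minimal sketch of this suppression loop, reusing the `iou` function sketched earlier; boxes are $(x_1, y_1, x_2, y_2)$ tuples:

```python
def nms(boxes, scores, thresh=0.5):
    # Sort box indices by descending score, then greedily keep the best box
    # and discard every remaining box that overlaps it above the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                  # highest-scoring remaining box
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= thresh]
    return keep

print(nms([(0, 0, 2, 2), (0.1, 0, 2, 2), (5, 5, 7, 7)], [0.9, 0.8, 0.7]))  # [0, 2]
```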
2.2. Proposed Parallel Computation Framework
The overall architecture of the UAV ground station target detection system in this paper is shown in Figure 7. The system can be divided into three parts: data transmission to the ground station, deep-learning-based local target detection, and stable display of the global targets.
The three parts of the system operate independently, but their information is interrelated; the entire system consists of four processes. Considering the overall real-time requirements of the system, the processes communicate with each other through shared memory. There are usually four modes of inter-process communication: pipes, semaphores, message queues, and shared memory. Shared memory is designed to solve the efficiency problem of inter-process communication and is the fastest inter-process communication method. The basic communication principle is shown in Figure 8.
To realize rapid transmission and sharing of data, images, and other information between two independent processes, the same physical address is used to store the information, and each process accesses this address to obtain the other process's information. Each process maps the physical address of the shared memory into its own virtual address space through a page table. Since the data are stored directly in memory, the repeated data copying of ordinary data transmission is avoided, which speeds up transmission, and the time required to store the information is almost negligible. Considering the requirements of this system, writing and reading of information must be sequential, and only one process may access the shared memory at a time. Therefore, a mutex lock mechanism is added to achieve mutually exclusive access between processes.
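A minimal sketch of one such shared-memory channel with a mutex, assuming Python's multiprocessing primitives; the segment name and frame size are illustrative, and in a real system the lock would be passed to each child process at creation:

```python
import numpy as np
from multiprocessing import Lock
from multiprocessing import shared_memory

lock = Lock()  # mutex for mutually exclusive access between processes
shm = shared_memory.SharedMemory(create=True, size=1920 * 1080 * 3, name='frame_buf')
frame = np.ndarray((1080, 1920, 3), dtype=np.uint8, buffer=shm.buf)

with lock:           # only one process may read or write at a time
    frame[:] = 0     # the acquisition process writes a video frame here

shm.close()
shm.unlink()         # the creating process releases the segment
```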
A total of four shared memory areas are used for information transfer between the four processes in this paper. First, the video data collected by the drone are shared with the ground station in real time through a memory space that stores the original video stream data. Second, the initial position information and target slice information obtained from deep-learning local detection of the original video are stored in the second shared memory. Unlike the first, this shared memory stores local target information partitioned according to the image: it is decomposed into multiple sub-shared-memory areas corresponding to the number of local regions, as shown in Figure 9. Then, considering the stability and long-term nature of the detection results, the information in each sub-shared memory is used for subsequent supplementation, screening, and fusion. After the final processing, it is stored in the last complete shared memory area, which is used to display the global target detection results.
Through the design of the above framework, the entire process from data acquisition and target detection processing to the stable, real-time display of the final detection results is realized, and a complete system that can be applied to target detection at actual UAV ground stations is established.
The time complexity determines the training and prediction time of the model. If the complexity is too high, model training and prediction will take a long time, which makes it impossible to verify ideas and improve the model quickly, or to achieve rapid prediction. The time complexity in this paper is defined as:

$$\text{Time} \sim O\!\left(\sum_{l=1}^{D} M_l^{2} \cdot K_l^{2} \cdot C_{l-1} \cdot C_l\right)$$

where $D$ is the number of convolutional layers, $M_l$ is the side length of the output feature map of layer $l$, $K_l$ is the side length of its convolution kernel, and $C_l$ is its number of output channels.
The spatial complexity determines the number of parameters of the model. Owing to the curse of dimensionality, the more parameters the model has, the more data are required to train it. In practice, real data sets are usually not very large, which makes model training prone to overfitting. The spatial complexity in this paper is defined as:

$$\text{Space} \sim O\!\left(\sum_{l=1}^{D} K_l^{2} \cdot C_{l-1} \cdot C_l + \sum_{l=1}^{D} M_l^{2} \cdot C_l\right)$$

where the first term counts the weight parameters and the second the feature maps computed in a forward pass.
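A minimal sketch of how these two measures can be tallied for a stack of convolutional layers; the layer values are illustrative:

```python
# M: output feature map side; K: kernel side; C_in/C_out: channel counts.
layers = [            # (M, K, C_in, C_out) -- illustrative, not the full network
    (38, 3, 256, 512),
    (19, 3, 512, 1024),
]
time_cost = sum(M * M * K * K * Cin * Cout for M, K, Cin, Cout in layers)
space_cost = sum(K * K * Cin * Cout + M * M * Cout for M, K, Cin, Cout in layers)
print(time_cost, space_cost)  # multiply-accumulate count, parameter + feature memory
```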