UAV Detection with Transfer Learning from Simulated Data of Laser Active Imaging

With the development of our society, unmanned aerial vehicles (UAVs) appear more frequently in people's daily lives, which could become a threat to public security and privacy, especially at night. At the same time, laser active imaging is an important detection method for night vision. In this paper, we implement a UAV detection model for our laser active imaging system based on deep learning and a simulated dataset that we constructed. Firstly, the model is pre-trained on the largest available dataset. Then, it is transferred to a simulated dataset to learn the UAV features. Finally, the trained model is tested on real laser active imaging data. The experimental results show that the proposed method performs much better than a model not trained on the simulated dataset, which verifies the transferability of features learned from the simulated data, the effectiveness of the proposed simulation method, and the feasibility of our solution for UAV detection in the laser active imaging domain. Furthermore, a comparative experiment with previous methods is carried out. The results show that our model can achieve high-precision, real-time detection at 104.1 frames per second (FPS).


Introduction
With the increasing demand for high-precision data collection, commercial quadrotor UAVs have emerged. As ideal platforms for data acquisition, UAVs have high maneuverability even in complex environments, and they now play a significant role in a variety of fields such as photography, disaster monitoring, and traffic guidance [1,2]. UAVs can also be deployed in many military applications [3]. Due to the lack of effective regulation, their abuse poses a threat to the privacy of citizens and to flight safety in specific places, such as airports. Moreover, the lack of regulation gives rise to smuggling, terrorist attacks, and other illegal activities. To build a system of regulation, it is crucial to carry out research on the detection of UAVs. However, this is not an easy task due to the small size of UAVs and the limited field of view (FOV) of the detection system, and it becomes even more difficult when there is no proper illumination.
The laser is a promising source to compensate for low illumination, due to its high intensity and high collimation. Therefore, many laser active imaging systems have been proposed for long-range target identification at night [4]. Gating technology is usually adopted in these systems to mitigate the scattering effects of obscurants and reduce the influence of background clutter. A common implementation of a laser active imaging system uses a long-wave infrared camera with a large FOV as the detection device to search for the target. The distance of the target is determined by the time of flight (ToF) of the laser for range gating. Finally, a laser range-gated short-wave infrared camera is used for target identification [5]. In 2015, the authors of [6] proposed a target recognition algorithm for laser active imaging based on fast contour torque features. However, the algorithm only considers a single background. All of the studies mentioned above recognize the target in a narrow FOV and cannot be extended to complex scenes. Our work focuses on using laser active imaging to perform target detection, including target location and identification, even in a complex scene.
There are many traditional hand-crafted, feature-based models, such as histograms of oriented gradients (HOG), scale-invariant feature transform (SIFT), Haar-like features, and the deformable part-based model (DPM) [7], which have achieved good object detection performance in natural images. However, compared with a conventional passive imaging system, laser active imaging produces gray images with higher noise owing to the laser's coherence, which hinders the performance of traditional object detection algorithms. Since the great success of AlexNet [8] in 2012, convolutional neural networks (CNNs) have been widely used in a variety of fields of computer vision, including image classification [9], object detection [10], and image segmentation [11]. CNNs can learn more robust features from data automatically, whereas traditional algorithms require sophisticated features to be designed manually. However, CNN-based object detection algorithms rely on a large amount of annotated data for training, and collecting enough data to satisfy this requirement is time-consuming, laborious, and even impossible in many situations. Laser active imaging is a typical situation where few samples can be gathered. Transfer learning is an effective and promising method to alleviate the problem of insufficient data in many domains [12]. Deep transfer learning aims to transfer the prior features extracted from a source domain $D_{src}$ with a large dataset to a target domain $D_{tar}$ where data is scarce. Yang et al. [13] achieved a significant improvement in military object recognition by equipping the model with the prior knowledge learned from ImageNet [14]. Another study [15] applied transfer learning in the medical field, achieving glioma grading on conventional magnetic resonance images. The works mentioned above still need some labeled data in the target domain, while there is no dataset available for our application.
Besides, it is also difficult to collect and construct such datasets under various scenarios. Therefore, we propose a method that simulates the imaging process of laser active illumination to generate synthetic images, which form the simulated dataset $D_{sim}$ used to train the proposed deep neural network. Much research has been conducted on UAV detection. Zhu et al. [16] used deep transfer learning to recognize targets in a UAV-to-ground situation, which is the opposite of our ground-to-UAV situation. Sommer et al. [17] trained a CNN to detect and recognize UAVs in a natural light scene, while we mainly want to detect UAVs at night. Zhao et al. [18] used a Doppler radar signal to detect small UAVs, while we use laser active illumination. To the best of our knowledge, our work is the first attempt to detect small UAVs through deep transfer learning from simulated data in the laser active imaging field. Figure 1 shows the schematic illustration of the proposed method. Firstly, the model learns general features from the large Common Objects in Context (COCO) dataset [19]. This step gives proper initial values to the network parameters. Then, the model learns the UAV features from the constructed simulated dataset. Finally, the trained model is applied to real laser active imaging images to realize high-precision detection of UAVs. Our main contributions can be summarized as follows:
(1) A real-time UAV detection framework is established based on a CNN combined with transfer learning. To the best of our knowledge, this is the first study to analyze the problem of zero-shot object detection in the laser active imaging domain.
(2) A dataset is constructed by simulating the process of laser active imaging. The knowledge learned from the simulated dataset is beneficial to UAV detection in real data.
(3) We experimentally show that our algorithm can realize high-precision UAV detection for our laser active imaging system, which supports the authenticity of the simulated data and the success of our solution.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the related work. The simulation process is analyzed in Section 3. Then, the adopted algorithm and its improvement, as well as the experimental process, are described in Section 4. The analysis of the experimental results is in Section 5, and Section 6 summarizes our work and discusses future prospects.

Laser Active Imaging
Active imaging systems are widely used at night or in low light conditions. Many 2D laser-illuminated imaging systems have been proposed due to their advantages over passive imaging systems. Renold et al. [20] quantitatively analyzed the effect of laser speckle on the target identification performance of a 2D laser active imaging system. A model of such a system, which mainly emphasizes the effects of speckle and atmospheric scintillation, is presented in [5]. In [21], the author designs and implements a range-gated underwater laser imaging system and realizes underwater target detection at a distance of 40 m. In this paper, our system uses a continuous laser to illuminate the target and does not need the target distance information, so we can quickly obtain and process 2D images of targets.

Object Detection
Traditional object detection methods use handcrafted features. The performance of the algorithm depends to a large extent on the robustness of the features. With the appearance of huge amounts of annotated data and high-performance hardware, deep learning-based methods have been widely adopted in object detection, since they can automatically learn semantic, high-level features with good robustness. Deep learning-based object detection methods can be roughly divided into two types. The first is the two-step method, which first generates region proposals and then classifies and identifies each proposal with a CNN. Typical frameworks of this type include the region-based CNN (R-CNN) series, Mask R-CNN, etc. The second regards object detection as a regression problem and directly gives the object category and location information simultaneously, for example the you only look once (YOLO) series or the single shot multibox detector (SSD) [7]. Compared with the two-step method, the second type can achieve faster detection speed while maintaining comparable detection accuracy. YOLOv5 [22] is the latest version of the YOLO series and shows a significant improvement in performance compared to older versions. In this paper, we adopt the smallest model, YOLOv5s, as the backbone model, which meets the requirements of our application scenarios well.

Transfer Learning
Transfer learning is a promising technology to ease the problem of limited labeled data in the majority of domains of interest. The definition of transfer learning can be summarized as follows: given a source domain $D_{src}$ with a large amount of labeled data and a target domain $D_{tar}$ with little labeled data, transfer learning aims to learn knowledge in $D_{tar}$ based on the prior knowledge learned from $D_{src}$. When the knowledge is learned by a deep convolutional neural network, transfer learning becomes deep transfer learning. As the simplest and most effective measure of deep transfer learning, fine-tuning was widely used in the early stage. Jason et al. conducted a survey of the transferability of deep neural networks and stated that fine-tuning is a desirable means to overcome the domain gap between different datasets [23]. Transfer learning has also been applied to UAV detection. In [24], the author achieved UAV-bird image classification using deep transfer learning with a synthetic dataset; Sommer et al. [17] detected UAVs in visible imagery with a two-step approach: flying object detection and subsequent object classification. In the latter stage, fine-tuning pre-existing weights is important for stable training with the small amount of UAV data. In this paper, our goal is to explore the transferability of knowledge learned from the simulated dataset we constructed, so we choose the simple fine-tuning method to complete deep transfer learning.

Data Simulation
To simulate the imaging process, we need complete knowledge of the laser active imaging system. The active imaging system uses its own light source to irradiate the target area and receives the signal reflected by the target. Hence, it is not limited by the illumination of the scene. At the same time, the laser has high brightness and high collimation, so it is an ideal light source. A commonly used laser active illumination imaging system is shown in Figure 2. The laser is emitted to irradiate the object, and the camera forms an image at the image plane by receiving the signal reflected by the object.
In the process of laser illumination, the laser can be modeled as a Gaussian beam [25], so the intensity of light can be written as:
$$I(r, z) = I_0 \left(\frac{w_0}{w(z)}\right)^2 \exp\left(-\frac{2r^2}{w^2(z)}\right)$$
where $w(z) = w_0\sqrt{1 + (z/z_R)^2}$ is the spot size in the forward propagation direction $z$; $r$ is the radial coordinate, taking the optical axis center as a reference; $I_0 = I(0, 0)$ is the irradiance at the center of the beam waist $w_0$; and $z_R = \pi w_0^2/\lambda$ is the Rayleigh length, with $\lambda$ being the wavelength.
Reflection occurs when the laser reaches the target surface. It is a non-trivial problem to determine the reflectivity of each point on the target plane. For simplicity, the reflectivity is approximated by the gray level of the panchromatic visible image. Ignoring the influence of atmospheric turbulence, the irradiance at the target plane can be approximated as:
$$I_0(x_0, y_0) = I_c \, I_{gray}(x_0, y_0) \exp\left(-\frac{2\left[(x_0 - x_c)^2 + (y_0 - y_c)^2\right]}{w^2(z)}\right)$$
where $I_{gray}$ is the gray level of the image in natural light, and $I_c$ denotes the irradiance at the center point $(x_c, y_c)$ of the Gaussian beam at the target plane.
Then the process from the object surface to the imaging surface can be represented based on standard statistical optics:
$$I_i(x_i, y_i) = \iint\!\!\iint h(x_i - x_0, y_i - y_0)\, h^*(x_i - x_0', y_i - y_0')\, J_0(x_0, y_0; x_0', y_0')\, dx_0\, dy_0\, dx_0'\, dy_0'$$
where $(x_i, y_i)$ denotes the position in the image plane; $J_0(x_0, y_0; x_0', y_0')$ is the mutual intensity of $(x_0, y_0)$ and $(x_0', y_0')$, two points in the target plane; and $h(x_i, y_i)$ is the amplitude spread function of the imaging system. This representation of the imaging process is general, but it is difficult to determine the mutual intensity. For simplicity, the mutual intensity is usually modeled in two extreme cases: fully coherent and completely incoherent. The real laser works between these two extremes.
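The illumination step can be sketched numerically as follows. This is a minimal sketch, assuming the target-plane irradiance is the gray-level image (used as a reflectivity proxy) multiplied by a Gaussian spot profile exp(-2 r^2 / w^2) centered at (x_c, y_c), with the spot size expressed in pixels. The function names `spot_size` and `illuminate` are illustrative and do not come from the original implementation.

```python
import numpy as np

def spot_size(z, w0, wavelength):
    """Gaussian beam radius w(z) = w0 * sqrt(1 + (z / z_R)^2)."""
    z_r = np.pi * w0**2 / wavelength  # Rayleigh length
    return w0 * np.sqrt(1.0 + (z / z_r) ** 2)

def illuminate(gray, center, w, i_c=1.0):
    """Target-plane irradiance under a Gaussian spot.

    gray   : 2D array in [0, 1], visible image used as reflectivity proxy
    center : (x_c, y_c) beam center in pixel coordinates
    w      : spot size at the target plane, in pixels (assumed unit)
    i_c    : irradiance at the beam center
    """
    rows, cols = gray.shape
    y, x = np.mgrid[0:rows, 0:cols]
    r2 = (x - center[0]) ** 2 + (y - center[1]) ** 2
    return i_c * gray * np.exp(-2.0 * r2 / w**2)
```

For a uniform gray image, the result is brightest at the beam center and falls off toward the edges, which is the vignetting-like appearance expected from a collimated spot.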

Coherent Imaging
When the light is coherent, the mutual intensity is given by
$$J_0(x_0, y_0; x_0', y_0') = U_0(x_0, y_0)\, U_0^*(x_0', y_0')$$
where $U_0(x_0, y_0)$ and $U_0^*(x_0', y_0')$ denote time-averaged field quantities at the target plane. Under this limiting condition, the image intensity can be simplified as:
$$I_i(x_i, y_i) = \left| \iint h(x_i - x_0, y_i - y_0)\, U_0(x_0, y_0)\, dx_0\, dy_0 \right|^2$$
According to [20], the amplitude of $U_0$ can be approximated by taking the square root of the target-plane irradiance $I_0$. The phase of $U_0$ is drawn from a random variable uniformly distributed over the interval $[0, 2\pi)$.

Incoherent Imaging
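When the light is spatially incoherent, the imaging system is linear in irradiance:
$$I_i(x_i, y_i) = \iint \left| h(x_i - x_0, y_i - y_0) \right|^2 I_0(x_0, y_0)\, dx_0\, dy_0$$
where $I_0(x_0, y_0)$ is the object irradiance. That is, the image irradiance is the convolution of the object irradiance with the squared magnitude of the amplitude spread function.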
We use both coherent imaging and incoherent imaging to generate the simulated image of laser active imaging. Figure 3 shows a real laser active illumination image of the UAV and two simulated images of coherent imaging and incoherent imaging, respectively. In addition to the different imaging mechanisms, there are also the factors of background and target distance, which together cause the visual differences between the three images.
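The two limiting cases can be illustrated with a short numerical sketch. It assumes a hypothetical Gaussian amplitude spread function (the real one depends on the optics of the system), uses circular FFT convolution for brevity, and follows the text in taking the field amplitude as the square root of the irradiance with a uniformly random phase. All function names are illustrative.

```python
import numpy as np

def gaussian_asf(shape, sigma):
    """Hypothetical amplitude spread function h: centered Gaussian with unit energy."""
    rows, cols = shape
    y, x = np.mgrid[0:rows, 0:cols]
    g = np.exp(-((x - cols // 2) ** 2 + (y - rows // 2) ** 2) / (2.0 * sigma**2))
    return g / np.sqrt(np.sum(g**2))

def _circular_conv2(a, kernel):
    """Circular 2D convolution via FFT; the kernel is centered in its array."""
    return np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(np.fft.ifftshift(kernel)))

def coherent_image(i0, asf, rng):
    """Coherent limit: |h * U0|^2 with |U0| = sqrt(I0), phase uniform on [0, 2*pi)."""
    phase = rng.uniform(0.0, 2.0 * np.pi, size=i0.shape)
    u0 = np.sqrt(i0) * np.exp(1j * phase)
    return np.abs(_circular_conv2(u0, asf)) ** 2

def incoherent_image(i0, asf):
    """Incoherent limit: I0 convolved with |h|^2."""
    return np.real(_circular_conv2(i0, np.abs(asf) ** 2))

rng = np.random.default_rng(0)
i0 = np.ones((32, 32))                 # uniform target irradiance
asf = gaussian_asf(i0.shape, sigma=1.5)
speckled = coherent_image(i0, asf, rng)
smooth = incoherent_image(i0, asf)
```

On a uniform target the incoherent image stays uniform, while the coherent image shows strong speckle, which is consistent with the noisier appearance of the coherent simulation.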

Methodology
In this section, we first give a brief introduction to the principle of YOLO, the network architecture of YOLOv5s, and the bounding box regression loss used in our solution. We then describe the datasets used for training and evaluation, and finally introduce the training protocol of our method.

Principle of YOLO
The YOLO series of algorithms is a typical representative of the deep convolutional neural network (DCNN) in the field of image object detection. It is called a DCNN because it has a multilayer structure and can extract very deep features. The key structures that make a neural network deeper are the convolution layer and the pooling layer. The function of a convolution layer is to extract local features; the function of a pooling layer is to select features and prevent overfitting. With the development of DCNNs, many mature design techniques have been proposed and adopted by the YOLO series. For example, small (3 × 3) convolution filters are adopted throughout the whole network, since a stack of 3 × 3 convolution layers can cover the same receptive field as a 5 × 5 or 7 × 7 convolution filter while using fewer parameters [9]; batch normalization [26] is usually used to accelerate the training process of DCNNs; and the leaky rectified linear unit (LReLU) [27] is chosen as the activation function to accelerate optimization and improve performance. Furthermore, inspired by feature pyramid networks (FPN) [10], YOLOv3 [28] and later versions predict bounding boxes at three different scales to improve detection accuracy.
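The claim about stacked 3 × 3 filters can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions, equal input and output channel counts, and no bias terms; the channel width `C = 64` is a hypothetical value chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convolutions with the same kernel size."""
    return 1 + num_layers * (kernel - 1)

def conv_weights(kernel, channels, num_layers=1):
    """Weight count (no bias) of num_layers kernel x kernel convs, channels in = channels out."""
    return num_layers * kernel * kernel * channels * channels

C = 64  # hypothetical channel width

# Two 3x3 layers see a 5x5 window: 18*C^2 weights versus 25*C^2 for one 5x5 layer.
rf_two, w_two, w_5x5 = stacked_receptive_field(2), conv_weights(3, C, 2), conv_weights(5, C)

# Three 3x3 layers see a 7x7 window: 27*C^2 weights versus 49*C^2 for one 7x7 layer.
rf_three, w_three, w_7x7 = stacked_receptive_field(3), conv_weights(3, C, 3), conv_weights(7, C)
```

The stacked version also inserts a nonlinearity after each layer, which a single large filter does not.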

Network Structure
In YOLOv5, the author adopts a variety of tricks to improve network performance. The network structure of YOLOv5s used in this paper is shown in Figure 4. Compared with the previous version, there are mainly two modules adopted. One is the BottleneckCSP module, which draws lessons from the cross stage partial network (CSPNet) [29]; the other one is the path augmentation structure, which is inspired by the path aggregation network (PANet) [30]. CSP module divides the shallow feature map into two parts, then merges them after going through different paths. The network can extract abundant features while reducing the amount of computation necessary by utilizing this strategy. In order to better integrate high-level and low-level features, the path aggregation structure is used to add a bottom-up path on the basis of FPN. Although FPN can mix the high-level features and low-level features, the shallow features have to pass through dozens or even hundreds of convolution layers in the bottom-up process of the original path, which will cause serious loss of shallow information. By adding the path of fewer than ten layers, the shallow features can be better preserved to integrate with deep features for improving detection ability. The specific structure of YOLOv5s is listed in Table 1. It is composed of two parts, a backbone which is used to extract features and a head which is designed to detect targets. The backbone part consists of one Focus unit, four convolutional layers, four BottleneckCSP units, and one SPP layer. The Focus unit slices the 640 × 640 × 3 inputs into 320 × 320 × 12, and then turns it into 320 × 320 × 32 feature maps via 32 separate 3 × 3 filters. The function of the Focus unit is to realize down-sampling and feature extraction without losing information. Each convolution layer is followed by a pooling layer and each BottleneckCSP unit contains two convolutional layers of 1 × 1 and 3 × 3 filters. 
The purpose of incorporating the SPP layer is to increase the robustness of the model against object deformations. The head part consists of four convolutional layers, two Upsample units, four Concatenation units, and one Detection unit. The role of Upsample and Concatenation is to fuse features from different levels. Detection is carried out on three different scales: 20 × 20, 40 × 40, and 80 × 80 outputs (i.e., P5, P4, and P3 in Figure 4). YOLOv5 has released four models of different sizes ranging from YOLOv5s to YOLOv5x, which are scaled versions of one another. In model depth, the number of BottleneckCSP modules in YOLOv5m/l/x is 2/3/4 times that of YOLOv5s; in model width, the number of layer filters in YOLOv5m/l/x is 1.5/2/2.5 times that of YOLOv5s, respectively.
Figure 4. Illustrations of the YOLOv5s model [22]. There is a path augmentation structure in the red dotted box and a structural schematic diagram of the CSP module in the green dotted box.
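The slicing performed by the Focus unit, which turns a 640 × 640 × 3 input into 320 × 320 × 12 without discarding any pixel, can be sketched as a space-to-depth rearrangement. This is a minimal NumPy sketch assuming a channel-last layout; the subsequent 3 × 3 convolution to 32 channels is omitted, and `focus_slice` is an illustrative name.

```python
import numpy as np

def focus_slice(x):
    """Space-to-depth: (H, W, C) -> (H/2, W/2, 4C) by interleaved sampling of pixels."""
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )

# Every input value is distinct, so we can verify that none is lost.
img = np.arange(640 * 640 * 3, dtype=np.float32).reshape(640, 640, 3)
out = focus_slice(img)
```

Each of the four sub-images is a half-resolution copy sampled at a different pixel offset, so the spatial down-sampling is lossless and the information moves into the channel dimension.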

GIoU Loss
Conventional deep learning algorithms usually use the MSE loss to directly regress the center coordinates, width, and height of the bounding box. Estimating these values directly treats them as independent variables and ignores the integrity of the object. To deal with this problem, researchers adopted the intersection over union (IoU) loss as an optimization objective, which is scale-invariant. However, the IoU loss also has a serious problem: if the predicted box and the ground truth box do not intersect, the gradient of the loss is zero and the prediction cannot be optimized during the iteration. To solve this problem, we adopt the generalized IoU (GIoU) proposed in [31] as the bounding box regression loss. The formula of GIoU is as follows:
$$GIoU = IoU - \frac{A_C - U}{A_C}$$
where IoU denotes the ratio between the intersection and the union of the two boxes, $U$ is the area of their union, and $A_C$ is the area of the smallest rectangle enclosing both boxes. Similar to IoU, GIoU is insensitive to scale. The loss is defined as $L_{GIoU} = 1 - GIoU$. Compared to the IoU loss, the GIoU loss still provides a non-zero gradient when the two bounding boxes do not overlap. Therefore, we choose the GIoU loss as the optimization objective in our work.
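A plain-Python sketch of GIoU, defined as IoU minus the fraction of the smallest enclosing rectangle not covered by the union, makes the behavior for non-overlapping boxes concrete. It assumes axis-aligned boxes in (x1, y1, x2, y2) form with positive area; it is an illustration of the formula, not the loss code used in training.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection is empty (zero) when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest rectangle enclosing both boxes.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return inter / union - (enclose - union) / enclose

def giou_loss(box_a, box_b):
    """Bounding box regression loss 1 - GIoU, in the range [0, 2)."""
    return 1.0 - giou(box_a, box_b)

near = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))  # disjoint, one unit apart
far = giou_loss((0, 0, 1, 1), (4, 0, 5, 1))   # disjoint, three units apart
```

Both pairs have IoU = 0, yet the loss is larger for the more distant pair, so the optimizer still receives a signal that pulls the predicted box toward the ground truth.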

Dataset
There are two datasets used in the experiments, which are described below.
(1) Simulated dataset: The dataset consists of simulated laser active illumination images generated according to the method described in Section 3. Firstly, 744 natural images of UAVs in different scenes are collected by a camera. Then, ten simulated images with different illumination centers $(x_c, y_c)$ and spot sizes $w(z)$ are generated from each image under coherent imaging and incoherent imaging, respectively. The illumination center and spot size are selected randomly, subject to the constraint that the illumination area covers the UAV target and does not exceed the image edge.
(2) Real dataset: We first construct our laser active imaging system according to Figure 2. We choose a continuous laser as the illumination source. The laser beam is collimated and expanded by the transmitting lens and then illuminates the target. The reflected signal is acquired by an intensified CCD camera after passing through the collection lens. The detailed parameters of the camera and laser are listed in Table 2. The setup and the experimental scene are shown in Figure 5, left and right, respectively. The transmitting and receiving equipment are placed on a turntable to facilitate scene scanning and subsequent tracking and monitoring. We collect three laser active imaging videos of UAVs using this system in different scenes including city, forest, and sky. The distance between the UAV and the imaging system is 100-500 m. Then we extract 861 images from the videos to make up the real dataset.
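The random selection of illumination parameters can be sketched with rejection sampling. The sketch rests on one assumed reading of the constraint: the illuminated area is the disc of radius w around (x_c, y_c), which must contain all four corners of the UAV bounding box and stay inside the image. The function names and the example box are hypothetical.

```python
import math
import random

def sample_illumination(img_w, img_h, bbox, rng, max_tries=10_000):
    """Draw one (x_c, y_c, w) that covers bbox and stays inside the image."""
    x1, y1, x2, y2 = bbox
    for _ in range(max_tries):
        xc = rng.uniform(0, img_w)
        yc = rng.uniform(0, img_h)
        # Largest disc around this center that does not cross the image edge.
        w_max = min(xc, yc, img_w - xc, img_h - yc)
        # Smallest disc that still reaches the farthest corner of the UAV box.
        w_min = max(math.hypot(cx - xc, cy - yc) for cx in (x1, x2) for cy in (y1, y2))
        if w_min <= w_max:
            return xc, yc, rng.uniform(w_min, w_max)
    raise ValueError("no feasible illumination found for this box")

def simulate_params(img_w, img_h, bbox, n=10, seed=0):
    """Ten parameter sets per source image, as in the simulated dataset."""
    rng = random.Random(seed)
    return [sample_illumination(img_w, img_h, bbox, rng) for _ in range(n)]

params = simulate_params(640, 480, (300, 200, 340, 230))
```

With ten draws per image and per imaging mode, 744 source images would yield 7440 coherent and 7440 incoherent simulated images.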

Training Protocol
The hardware used for training was a single NVIDIA GTX1080 graphics processing unit (GPU) with 8 GB of memory. The operating system was Ubuntu 16.04, and the models were implemented in the PyTorch framework. The training process was organized in two stages: initial training and deep transfer learning. During the initial training stage, the model was pre-trained on the publicly available COCO dataset, which consists of over 200,000 images with over 500,000 annotated object instances from 80 categories. For convenience, the trained model weights derived from YOLOv5's model export were downloaded. Then we employed transfer learning by fine-tuning the entire model on the simulated dataset. The fine-tuning process used the following settings: SGD optimizer with initial learning rate = 0.01, weight decay = 0.0005, number of epochs = 200, batch size = 16, and single-class training mode = True, with the real dataset used for evaluation.

Experimental Results
In this section, we test the model and carry out experiments to explore the transferability of the knowledge learned from simulated data. The experimental results also show that the GIoU loss has appealing properties compared with the IoU loss. Furthermore, the model is compared with previous algorithms. Finally, the UAV detection results of our algorithm on real data are shown and discussed.

Model Initialization
We first design an experiment to verify the importance of the first step of the proposed method, that is, pre-training on the COCO dataset. We run this experiment on the incoherent imaging simulated data, using randomly initialized weights and weights pre-trained on the COCO dataset, respectively. Figure 6 shows the loss on the training set and the precision on the test set for each epoch, where one epoch is one full pass over all images in the training set. It is clear that pre-training on the COCO dataset makes the network converge faster on the simulated dataset and achieve higher precision on the real data. This result is consistent with our understanding that pre-training on a large dataset helps the model grasp general features that are beneficial for a specific task, and it shows the feasibility of this approach for the laser active imaging task.

Transferability of Simulated Data
We carry out another comparative study to illustrate the significance of knowledge learned from simulated data, using four different training sets: natural images, coherent imaging images, incoherent imaging images, and a mixture of coherent and incoherent imaging images. All the models are tested on the same real dataset. The evaluation metrics we use are precision, recall, F1 score, AP, AP50, and AP75, as described in [19]. Table 3 shows the quantitative results of the adopted YOLOv5s backbone on the four training sets. From the table, the models trained on any of the simulated data, whether coherent or incoherent, perform significantly better than the one trained on natural gray images. This is because the simulation process reduces the difference between the training set and the test set. These results show the effectiveness of knowledge learned from simulated data, which in turn supports the rationality of our simulation of a laser active imaging system. Furthermore, the model trained on incoherent imaging images performs better than those trained on coherent or mixed images. The reason is that the noise introduced by the coherent imaging simulation destroys the contour and texture information of the UAVs, which is of critical importance for a deep learning object detection algorithm. To verify this, we use the widely accepted peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [32] metrics to measure the similarity between each training set and the real data, as shown in Table 4. The SSIM assesses the visual impact of three characteristics of an image: luminance, contrast, and structure. The more similar the images, the higher the PSNR and SSIM scores. We calculate the metrics between 10% of the training set and the whole real dataset, then take the mean value.
It can be seen that the metrics scores agree with the model performance across different training sets, which confirms our explanation. Therefore, we will eventually use the incoherent imaging simulation method to generate the training set.
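The similarity comparison described above can be illustrated with a minimal Python sketch. It assumes 8-bit grayscale images of equal size stored as nested lists, computes SSIM over a single global window rather than the local sliding windows of [32], and averages over every simulated/real pair. The function names and the pairing scheme are illustrative assumptions, not the implementation used for Table 4.

```python
import math


def _flatten(img):
    """Flatten a 2-D nested list of pixel values into a 1-D list."""
    return [float(p) for row in img for p in row]


def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized images."""
    a, b = _flatten(img_a), _flatten(img_b)
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def global_ssim(img_a, img_b, max_val=255.0, k1=0.01, k2=0.03):
    """Simplified SSIM using one window that covers the whole image.

    Assumption: the reference SSIM averages this quantity over local
    windows; the global form is used here only to keep the sketch short.
    """
    a, b = _flatten(img_a), _flatten(img_b)
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def mean_similarity(sim_images, real_images, metric):
    """Mean of `metric` over all (simulated, real) pairs (assumed pairing)."""
    scores = [metric(s, r) for s in sim_images for r in real_images]
    return sum(scores) / len(scores)
```

Both scores increase as the simulated images become more similar to the real ones, so a higher mean for one training set indicates a smaller gap to the real data under these two metrics.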

GIoU Loss vs. IoU Loss
In this subsection, we explore the advantages of GIoU loss over IoU loss using the incoherent imaging training set. The results on the test set are reported in Table 5. Moreover, Figure 7 shows the mAP values of the models trained with IoU and GIoU losses against different IoU thresholds, i.e., IoU = {0.5, 0.55, ..., 0.95}. The results in Table 5 show that using GIoU loss as the bounding box regression loss improves the performance of the model. As shown in Figure 7, the improvement holds across the IoU thresholds, although its magnitude varies with the threshold. Overall, incorporating GIoU loss into our algorithm improves UAV detection performance in the laser active imaging domain.
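To make the difference between the two losses concrete, the following sketch computes IoU and GIoU for axis-aligned boxes given as (x1, y1, x2, y2) and uses 1 - IoU and 1 - GIoU as the losses. The box format and function names are assumptions for illustration; the training code used for YOLOv5s may differ in details.

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_area(box_a) + box_area(box_b) - inter
    if union <= 0:
        raise ValueError("boxes must have positive area")
    iou = inter / union

    # Smallest axis-aligned box C that encloses both boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # GIoU subtracts the fraction of C not covered by the union.
    giou = iou - (c_area - union) / c_area
    return iou, giou


def iou_loss(box_a, box_b):
    return 1.0 - iou_and_giou(box_a, box_b)[0]


def giou_loss(box_a, box_b):
    return 1.0 - iou_and_giou(box_a, box_b)[1]


if __name__ == "__main__":
    target = (0, 0, 1, 1)
    for predicted in [(2, 0, 3, 1), (4, 0, 5, 1)]:
        print(predicted, iou_loss(target, predicted), giou_loss(target, predicted))
```

For non-overlapping boxes the IoU loss stays at 1 regardless of distance, whereas the GIoU loss grows as the boxes move apart, so it still distinguishes a near miss from a distant one.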

Comparison with the Previous Method
In the experiments, we compare the adopted YOLOv5s model with several representative methods, including HOG [33], DPM [34], and YOLOv3 [28]. The methods are evaluated in terms of both detection accuracy and detection speed, and the experimental results are shown in Table 6. The HOG and DPM methods, which rely on manually designed features, are implemented on the CPU; the CNN-based YOLOv3 and YOLOv5s are implemented on the GPU. The comparison shows that the CNN-based algorithms are more accurate than the traditional algorithms by a large margin. In terms of detection speed, the HOG and DPM methods are inefficient because of the redundancy of the sliding window strategy, while the two YOLO methods achieve real-time detection. Furthermore, our adopted YOLOv5s model is more than three times faster than YOLOv3 while maintaining comparable detection accuracy. The fast detection benefits from the small model size: YOLOv5s is 14.3 MB, much smaller than the 243.7 MB of YOLOv3. Therefore, YOLOv5s was selected as the basic framework, which is sufficient to meet the requirements of our laser active imaging system.
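As an illustration of how a detection speed in FPS can be measured, the sketch below times a detector over a list of frames after a short warm-up. The `detect` callable and the frame list are hypothetical placeholders, and this is not the benchmarking procedure behind Table 6; on a GPU, asynchronous execution would also need to be synchronized before the clock is read.

```python
import time


def measure_fps(detect, frames, warmup=2):
    """Average frames per second of `detect` over `frames`.

    `detect` is a hypothetical callable that processes one frame.
    """
    if not frames:
        raise ValueError("at least one frame is required")

    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for frame in frames[:warmup]:
        detect(frame)

    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start

    if elapsed <= 0:
        return float("inf")  # timer resolution too coarse for this workload
    return len(frames) / elapsed


if __name__ == "__main__":
    def dummy_detect(frame):
        # Stand-in workload; a real detector would run inference here.
        return sum(range(10_000))

    print(f"{measure_fps(dummy_detect, list(range(50))):.1f} FPS")
```

Timing many frames and dividing by the total elapsed time gives a more stable figure than timing a single frame, since per-call overheads are averaged out.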
Figure 8 shows the detection results of the model on the data collected by our laser active imaging system. The results show that our algorithm can accurately detect the UAV against different backgrounds and in different states. Unlike buildings and trees, the sky returns no background reflection signal, so the camera presents only the illuminated UAV without a visible illumination area, which differs from our simulated data. Nevertheless, our method obtains satisfactory detection results in this situation. Furthermore, the model can also detect UAVs in flight, although the score is lower than that of a hovering UAV. In the last example, our method is still effective even though the imagery suffers from motion blur due to the flight of the UAV. In summary, our algorithm is a useful UAV detection method for our laser active imaging system.

Conclusions
In this paper, we propose a high-precision, real-time UAV detection method for laser active imaging that combines a deep CNN-based object detection algorithm with transfer learning, which provides the basis for subsequent tracking and monitoring. For specific tasks such as small-scale UAV detection in laser active imaging, where few samples are available for training, the performance of the model degrades sharply. To solve this problem, we first embed prior knowledge learned from a large available dataset into the model. Then, we provide sufficient data for the specific task by simulating the imaging process of laser active illumination. The experiment carried out on our real laser active imaging system demonstrates the transferability of simulated data and the effectiveness of our solution.
However, limited by the laser power, we can only collect data on targets within 500 m. In future work, larger and more realistic datasets will be gathered, covering longer distances, more complex scenes such as occlusion by trees, and confusing objects such as birds. Another direction is to add the depth information of the target plane to the simulation process, which will help produce more realistic images. This measure has great potential to improve the performance of small-scale UAV detection in the laser active imaging domain.