Mixed YOLOv3-LITE: A Lightweight Real-Time Object Detection Method.

Embedded and mobile smart devices face problems related to limited computing power and excessive power consumption. To address these problems, we propose Mixed YOLOv3-LITE, a lightweight real-time object detection network that can be used with non-graphics processing unit (GPU) and mobile devices. Based on YOLO-LITE as the backbone network, Mixed YOLOv3-LITE supplements residual block (ResBlocks) and parallel high-to-low resolution subnetworks, fully utilizes shallow network characteristics while increasing network depth, and uses a "shallow and narrow" convolution layer to build a detector, thereby achieving an optimal balance between detection precision and speed when used with non-GPU based computers and portable terminal devices. The experimental results obtained in this study reveal that the size of the proposed Mixed YOLOv3-LITE network model is 20.5 MB, which is 91.70%, 38.07%, and 74.25% smaller than YOLOv3, tiny-YOLOv3, and SlimYOLOv3-spp3-50, respectively. The mean average precision (mAP) achieved using the PASCAL VOC 2007 dataset is 48.25%, which is 14.48% higher than that of YOLO-LITE. When the VisDrone 2018-Det dataset is used, the mAP achieved with the Mixed YOLOv3-LITE network model is 28.50%, which is 18.50% and 2.70% higher than tiny-YOLOv3 and SlimYOLOv3-spp3-50, respectively. The results prove that Mixed YOLOv3-LITE can achieve higher efficiency and better performance on mobile terminals and other devices.


Introduction
Recently, object detection based on convolutional neural networks has been a popular research topic in the field of computer vision with a focus on object location and classification. Feature extraction and classification of original images can be conducted via multi-layer convolution operations, and the position of an object in an image can be predicted using boundary boxes, providing the capability of visual understanding. The results of these studies can be widely applied in facial recognition [1], attitude prediction [2], video surveillance, and a variety of other intelligent applications [3][4][5].
Currently, convolutional neural network structures are becoming deeper and more complex. Although such network structures can match or even exceed human vision in precision, they usually require enormous amounts of computing power and involve ultra-high energy consumption. There has been significant progress in fast object detection methods [6][7][8]; nevertheless, it remains difficult to deploy convolutional neural networks on non-graphics processing unit (GPU) or mobile devices with limited computing resources.

YOLOv3
YOLOv3 [14] learns from a residual network structure to form a deeper network. It uses multi-scale features for object detection and logistic rather than softmax classification to improve the mean average precision (mAP) and the detection of small objects. At equal precision, the speed of YOLOv3 is three to four times greater than that of other models. Its network structure is illustrated in Figure 1.

High-Resolution Network (HRNet)
HRNet proposed by Sun et al. [15] maintains a high-resolution representation by parallel subnetworks of high-resolution to low-resolution convolution and enhances high-resolution representation by repeatedly performing multi-scale fusion across parallel convolution. This network can maintain high-resolution representation rather than just recover high-resolution representation from low-resolution representation. The effectiveness of the method was demonstrated in pixel-level classification, region-level classification, and image-level classification.


MobileNetV1 and MobileNetV2
MobileNetV1 [16] and MobileNetV2 [17] are efficient models proposed by Google for mobile and embedded devices. MobileNetV1 is based on a streamlined structure. Its underlying innovation is the use of depthwise-separable convolutions (as in Xception) to build a lightweight deep neural network that greatly reduces the number of parameters and the amount of computation. It also achieves a desirable balance between detection speed and precision by introducing the parameters α (width multiplier) and ρ (resolution multiplier). Building on depthwise-separable convolution, MobileNetV2 uses an inverted residual and linear bottleneck structure to maintain the representational ability of the model.
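To illustrate why depthwise-separable convolution shrinks a model, the parameter and multiply-accumulate counts of the two layer types can be compared directly. The following is a minimal sketch using the standard MobileNetV1 cost formulas; the function names and layer sizes are illustrative, not from the paper's code:

```python
# Parameter and FLOP counts for a standard vs. a depthwise-separable
# convolution layer, following the MobileNetV1 cost analysis.

def standard_conv_cost(k, c_in, c_out, m_out):
    """k: kernel side, c_in/c_out: channels, m_out: output feature-map side."""
    params = k * k * c_in * c_out
    flops = params * m_out * m_out          # one MAC per weight per output pixel
    return params, flops

def depthwise_separable_cost(k, c_in, c_out, m_out):
    # depthwise: k*k weights per input channel; pointwise: 1x1 across channels
    params = k * k * c_in + c_in * c_out
    flops = params * m_out * m_out
    return params, flops

# Example: 3x3 conv, 128 -> 128 channels, on a 56x56 output map
p_std, _ = standard_conv_cost(3, 128, 128, 56)
p_dw, _ = depthwise_separable_cost(3, 128, 128, 56)
print(p_std, p_dw, round(p_std / p_dw, 1))  # roughly an 8-9x reduction
```

For a 3 × 3 kernel the savings approach a factor of k² = 9, which is why MobileNets can cut parameters so sharply at a modest cost in accuracy.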

Tiny-YOLO and YOLO-LITE
Tiny-YOLO [18] is a lightweight implementation of the YOLO network. It can be used as an alternative structure for YOLOv2 or YOLOv3 in scenarios where the demand for precision is not high. Its detection speed is faster than that of the original network. However, in the case of non-GPU based devices, Tiny-YOLO still encounters difficulty meeting the requirements of real-time detection. YOLO-LITE [19] is a lightweight version of YOLOv2, which is faster than Tiny-YOLOv2 but with a lower average precision.
This section introduced complex networks such as ResNet, YOLOv3, and HRNet, and lightweight networks such as MobileNet and YOLO-LITE. Owing to their large numbers of parameters and heavy computation, complex networks place high demands on device performance and suffer from slow inference, which makes them difficult to migrate to embedded and mobile devices. Although lightweight networks such as MobileNet and YOLO-LITE greatly improve detection speed, their accuracy still requires improvement.

Mixed YOLOv3-LITE Network Structure
To enable real-time object detection with convolutional networks on embedded platforms, for applications such as augmented reality, we propose a simplified model structure, Mixed YOLOv3-LITE, a lightweight object detection framework suitable for non-GPU devices and mobile terminals. Its simplified model structure is presented in Figure 2. The model is composed of fifteen 3 × 3 convolution layers, seven 1 × 1 convolution layers, three ResBlocks, and eight max-pooling (MP) layers. It has the following characteristics:

1.
For the feature extraction part, ResBlocks and the parallel high-to-low resolution subnetworks of HRNet are added based on the backbone network of YOLO-LITE, and the shallow and deep features are deeply integrated to maintain the high-resolution features of the input image. This improves the detection precision. This part includes four 3 × 3 standard convolution layers, four maximum pooling layers, three residual blocks, modules A, B, and C for reconstructing a multi-resolution pyramid, and concat modules A, B, and C. The concat-N module is located between the backbone network and the detector, and is used to reconstruct feature maps with the same resolution at different depths.

2.
For the detection part, a structure similar to that of YOLOv3 is used, with fewer convolution layers and channels. The detector processes the recombined feature maps of each concat-N module separately to improve the accuracy of detecting small objects, and then selects the best detection results through non-maximum suppression.
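The non-maximum suppression step that keeps the best detection among overlapping candidates can be sketched as follows. This is a generic greedy, single-class version with an illustrative IoU threshold, not the paper's exact implementation:

```python
# Minimal greedy non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two overlapping boxes collapse to one
```

The same routine is applied per class in practice; YOLO-family detectors additionally filter boxes by an objectness-score threshold before this step.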

Mixed YOLOv3-LITE Network Module
The excellent performance of YOLOv3 is largely attributable to the application of the backbone network Darknet-53 [14]. To further improve the detection speed of the network, Mixed YOLOv3-LITE uses the shallow backbone network of YOLO-LITE to replace Darknet-53 and adds a residual structure and parallel high-to-low-resolution subnetworks to achieve the fusion of shallow and deep features, thereby improving the detection precision.

Shallow Network and Narrow Channel
YOLO-LITE employs a backbone network with seven convolution layers and four MP layers [19]. As shown in Table 1, it is a "shallow network and narrow channel" network. The amount of computation and the number of parameters are essentially reduced in comparison with a deep network, and the detection speed of the network is improved significantly. In Mixed YOLOv3-LITE, we used a backbone network with a structure similar to that shown in Table 1, and we simultaneously narrowed the channel according to the structure at the detection end to reduce the number of parameters and amount of computation, and to improve the network training speed.


ResBlock and Parallel High-to-Low Resolution Subnetworks
By adding a shortcut [11] to the network, the residual structure solves the problem of model precision decreasing rather than increasing as the number of layers in a VGG-style [27] network grows. The residual structure used in this study is consistent with that of YOLOv3 [14].
The principle of parallel high-to-low-resolution subnetworks [15] is shown in Figure 3; the dotted frames mark the parallel high-to-low-resolution subnetwork structure. We borrowed this idea so that feature maps at three different scales are reconstructed, fused, and then output to the detection end for object detection, thereby improving the detection precision of the network. Both the residual structure and the parallel high-to-low-resolution subnetworks are designed to mitigate the degradation problem of deep networks. The difference is that the residual structure continuously transmits shallow features to deep layers through a shortcut over a small range, whereas parallel high-to-low-resolution subnetworks perform multi-resolution reconstruction of deep and shallow features at multiple scales through large-scale and multi-scale fusion, so that the multi-scale feature maps carry both deep and shallow features at the same time.
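The identity-shortcut idea can be made concrete with a toy residual block: the input is added back to the transformed features, so shallow features (and gradients) pass through unchanged. The "convolution" below is a per-pixel channel mix standing in for a real 1 × 1 convolution; all shapes and names are illustrative, not the paper's layers:

```python
import numpy as np

def fake_conv(x, w):
    """1x1 'convolution' as a per-pixel channel mix with ReLU (illustrative)."""
    return np.maximum(0.0, x @ w)

def res_block(x, w1, w2):
    y = fake_conv(x, w1)   # bottleneck transform
    y = y @ w2             # restore channel count, no activation
    return x + y           # identity shortcut: input added back unchanged

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))        # H x W x C feature map
w1 = rng.standard_normal((16, 8)) * 0.1    # reduce: 16 -> 8 channels
w2 = rng.standard_normal((8, 16)) * 0.1    # restore: 8 -> 16 channels
out = res_block(x, w1, w2)
print(out.shape)  # (8, 8, 16): shape preserved, so the addition is well-defined
```

Because the shortcut requires matching shapes, residual blocks in YOLOv3 pair a 1 × 1 channel-reducing convolution with a 3 × 3 channel-restoring one, exactly so the elementwise addition stays valid.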


Experiment and Discussion
This section describes the experimental environment, datasets, parameter settings of the training network, and the evaluation index of the model effect. The settings of the network structure are also presented through a series of comparative experiments conducted in the process of designing the proposed network and selecting the network that yielded the optimal performance. This selection was based on a comparison of the experimental results obtained using the PASCAL VOC dataset [28]. The precision index of the network for object detection was verified using the VisDrone 2018-Det dataset [29] and the ShipData dataset. Finally, the speed index of the network was verified on the embedded platform Jetson AGX Xavier [30].

Experimental Environment Setup
We performed training using a TensorFlow-based version of YOLOv3 as the baseline, in which the YOLO-LITE model file [19] was also converted into the TensorFlow version for performance evaluation. The training was performed on a server equipped with an Intel Core i7-9700K central processing unit (CPU) and an NVIDIA RTX 2080Ti GPU. During testing, the GPU of the server was disabled, and only the CPU was used to execute the video detection script under the TensorFlow framework. The configuration details of the server are listed in Table 2. In addition, the NVIDIA Jetson AGX Xavier was used as an embedded mobile terminal for performance testing. The Jetson AGX Xavier hardware comprises an NVIDIA self-developed eight-core ARM v8.2 64-bit CPU, a 512-core Volta GPU, and 16 GB of RAM. It is a small, fully functional, low-power computing system with a module size of no more than 105 mm × 105 mm, designed especially for robotics, industrial automation, and other neural network application platforms. When deployed in intelligent devices such as unmanned vehicles and robots, a power consumption of only 10 to 30 W can provide powerful and efficient artificial intelligence (AI), computer vision, and high-performance computing capabilities [30][31][32].

Experimental Datasets
The datasets used in our experiments were PASCAL VOC [28], VisDrone 2018-Det [29], and a ship dataset of remote-sensing images that we collected from Google Earth. The PASCAL VOC [28] and VisDrone 2018-Det [29] datasets were each divided into a training set and a test set (Table 3) so that our model could be trained under the same experimental settings and compared with the benchmark models. The following is a detailed description of the three datasets.

B. VisDrone2018-Det
VisDrone2018-Det [29] is a large UAV-based dataset consisting of 8599 images, of which 6471 were used for training, 548 for validation, and 1580 for testing. The dataset contains rich annotations, including object bounding boxes, object categories, occlusions, and truncation rates. The label data of the training and validation sets have been made public and were used as the training set and test set, respectively. The images cover many real-world scenarios (thousands of cities and kilometers) under various weather and lighting conditions. We focused on the following object categories: pedestrians, people, cars, trucks, buses, bicycles, awning tricycles, and tricycles.
C. ShipData
The ShipData dataset produced in this study is a remote-sensing ship dataset with 1009 images collected from Google Earth and labeled in PASCAL VOC format. The image backgrounds vary greatly, and many different types of ships are present. The dataset was randomly divided into subset A (706 images) and subset B (303 images) according to a 7:3 proportion. Both subsets were used in the experiment: in the first round, subset A was used for training and subset B for testing; in the second round, subset B was used for training and subset A for testing.
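The 7:3 split with the two-round train/test swap can be sketched as below. The filenames and seed are placeholders, not the actual dataset contents:

```python
import random

def split_7_3(items, seed=42):
    """Shuffle a copy of the items and cut it at the 70% mark."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]

# Placeholder image list matching ShipData's 1009 images
images = [f"ship_{i:04d}.jpg" for i in range(1009)]
subset_a, subset_b = split_7_3(images)

# Two evaluation rounds: (train, test) pairs with the roles swapped
rounds = [(subset_a, subset_b), (subset_b, subset_a)]
print(len(subset_a), len(subset_b))  # 706 303
```

Swapping the roles of the subsets gives two complementary evaluations on a small dataset, so every image is tested exactly once, at the cost of training two models.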

Evaluation Metrics
In this study, the precision, recall rate, F1 score, and mAP were used to evaluate the detection accuracy of the model. Floating point operations (FLOPs), the number of parameters, and the model size were used to evaluate the performance of the model, which was finally reflected in the frames per second (FPS) index.
The objects considered can be divided into four categories based on their actual and predicted classes [33]: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The relationships are shown in Table 4. Precision reflects the proportion of true positives among the cases the classifier labels positive, while recall reflects the proportion of correctly detected positives among all actual positives. F1 is the weighted harmonic mean of precision and recall, combining both results; a higher F1 indicates a more effective method. Average precision (AP) is the area under the precision-recall (P-R) curve. For example, Figure 4 shows the P-R curve of our method for the horse category in the PASCAL VOC dataset; the AP of the horse category is the shaded area in the figure, which accounts for 65.04%. In general, the better the classifier, the higher the AP value. The mAP is the average of the AP values over all categories: mAP = (1/N) Σ_{i=1}^{N} AP_i, where N is the number of categories and AP_i is the AP of the ith category.
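These metrics follow directly from the TP/FP/FN counts, and AP is just an area under the P-R curve. The following sketch uses made-up counts and a toy three-point curve with simple trapezoidal integration (the PASCAL VOC evaluation uses interpolated AP, which differs slightly):

```python
# Precision, recall, F1 from confusion counts, and AP as the area under
# a precision-recall curve (trapezoidal rule). All numbers are illustrative.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision(recalls, precisions):
    """Trapezoidal area under the P-R curve; points sorted by recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

p, r = precision(80, 20), recall(80, 40)
print(round(p, 2), round(r, 3), round(f1(p, r), 3))        # 0.8 0.667 0.727
print(round(average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.5]), 3))  # 0.775
```

The mAP reported in the paper is then the mean of such per-category AP values.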

FLOPs measure the number of operations performed by the model and can be used to evaluate its time complexity. The parameter count of the model consists of two parts, the total number of weights and the size of each layer's output feature map, and can be used to evaluate the space complexity of the algorithm and the model. The overall time and space complexities of a CNN can be estimated as follows:

Time ~ O(Σ_{l=1}^{D} M_l^2 · K^2 · C_{l-1} · C_l) (4)

Space ~ O(Σ_{l=1}^{D} K^2 · C_{l-1} · C_l + Σ_{l=1}^{D} M_l^2 · C_l) (5)

In Equations (4) and (5), D represents the number of layers of the CNN, i.e., the depth of the network; l denotes the lth convolution layer; M_l is the side length of the output feature map of the lth convolution layer; K is the side length of each convolution kernel; C_{l-1} is the number of input channels of the lth convolution layer, i.e., the number of output channels of the (l − 1)th layer; and C_l is the number of output channels of the lth convolution layer, i.e., the number of convolution kernels in that layer.
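The layer-wise complexity counts above translate directly into code. The sketch below sums, per convolution layer, M_l² · K² · C_{l−1} · C_l multiply-accumulates for time and the weight plus activation counts for space; the two-layer toy network is illustrative, not one of the paper's models:

```python
# Time/space complexity of a plain CNN from per-layer shapes,
# following the standard convolution cost formulas.

def cnn_complexity(layers):
    """layers: list of (m_out, k, c_in, c_out) tuples, one per conv layer."""
    time = sum(m * m * k * k * c_in * c_out for m, k, c_in, c_out in layers)
    weights = sum(k * k * c_in * c_out for _, k, c_in, c_out in layers)
    activations = sum(m * m * c_out for m, _, _, c_out in layers)
    return time, weights + activations

# Toy 2-layer net: 3x3 convs producing 112x112x16 then 56x56x32 maps
layers = [(112, 3, 3, 16), (56, 3, 16, 32)]
t, s = cnn_complexity(layers)
print(t, s)
```

Summing such counts over all layers is how the GFLOPs figures compared across the trials in this section are typically obtained.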

Experimental Setup
This section describes the network models proposed during the design of Mixed YOLOv3-LITE and the network structure of each trial, as shown in Table 5. In all trials, 60 epochs of training were carried out using the PASCAL VOC 2007-2012 training dataset to obtain the final model. The input image size used in model training and testing was set to 224 × 224, which is consistent with that of YOLO-LITE. As YOLOv3 did not publish the precision data associated with the PASCAL VOC dataset, 60 epochs of training with YOLOv3 were performed under the same experimental environment and parameter settings, which were adopted as the evaluation baseline. The experimental results for all the networks-i.e., YOLO-LITE, YOLOv3, MobileNetV1-YOLOv3, and MobileNetV2-YOLOv3-are shown in Table 6. The details of the experiment are presented below.

Model Structure Description
YOLO-LITE
The raw YOLO-LITE network structure [19], as shown in Table 1.

YOLOv3
The raw YOLOv3 network structure [14], as shown in Figure 1.

MobileNetV1-YOLOv3
The backbone uses MobileNetV1 with the YOLOv3 detector part.

MobileNetV2-YOLOv3
The backbone uses MobileNetV2 with the YOLOv3 detector part.

Trial 1
All convolution layers in YOLOv3 were replaced by depthwise-separable convolutions, and the numbers of ResBlocks in Darknet-53 were changed from 1-2-8-8-4 to 1-2-4-6-4.

Trial 2
The detector part of Trial 1 was reduced by one convolution layer.

Trial 4
A parallel structure was added based on Trial 2, the resolution was reconstructed using a 1 × 1 convolutional kernel, and the channel was fused using a 3 × 3 convolutional kernel after the connection.
Trial 5
Based on Trial 4, the numbers of ResBlocks in the backbone network were changed to 1-1-2-4-2, and the resolution was reconstructed using a 3 × 3 convolutional kernel.

Trial 6
A parallel structure was added based on YOLOv3, using 1 × 1 ordinary convolutions.

Trial 7
All convolutions in Trial 6 were replaced by depthwise-separable convolutions.

Trial 8
The region (detection) part was exactly the same as that of YOLOv3, and the last layer of the feature-extraction backbone was made wider.

Trial 9
The backbone was exactly the same as that in Trial 8, and three region levels were reduced by two layers for each.

Trial 10
Three region levels were reduced by two layers for each, the region was narrowed simultaneously, and the backbone was exactly the same as that of YOLO-LITE.

Trial 11
The backbone was exactly the same as that in Trial 8, and three region levels were reduced by four layers for each.

Trial 12
The backbone was exactly the same as that of YOLO-LITE, three region levels were reduced by four layers for each, and the region was narrowed simultaneously (three region levels were reduced by two layers for each based on Trial 10).

Trial 13
Three ResBlocks were added based on Trial 12.

Trial 14
Three HR structures were added based on Trial 12.

Trial 15
Based on Trial 14, the downsampling method was changed from the convolution step to the maximum pool, and a layer of convolution was added after the downsampling.

Trial 16
The convolution kernel of the last layer of HR was changed from 1 × 1 to 3 × 3 based on Trial 15.

Trial 17
Three ResBlocks were added to Trial 15.

Trial 18
Nine layers of inverted-bottleneck ResBlocks were added to Trial 15.

Trial 19
Based on Trial 18, one 3 × 3 convolution layer was added to each output layer of the HR structure, for a total of three added layers.

Trial 20
The number of ResBlocks per part was adjusted to three, based on Trial 17.

Trial 21
The last ResBlock was moved forward to reduce the number of channels, based on Trial 20.
A. Depthwise-separable convolutions
Depthwise-separable convolution (shown on the right in Figure 5), used in MobileNets [16] instead of ordinary convolution (shown on the left in Figure 5), can significantly reduce the number of parameters and the amount of computation required.
By comparing YOLOv3 with Trials 1, 2, 3, 6, and 7, it can be observed that replacing ordinary convolutions with depthwise-separable convolutions, without otherwise changing the network structure, causes the performance of the model to decrease remarkably even as the number of parameters, the amount of computation, and the model size shrink.
B. Shallow network
In Trial 8, the Darknet-53 backbone of YOLOv3 was replaced by the seven-layer structure of the YOLO-LITE backbone, and the number of channels in each layer was adjusted to couple with YOLOv3. Compared to the original YOLOv3, the number of parameters, the amount of computation, and the model size were reduced by approximately 50%, whereas the mAP, recall, and F1 score decreased only slightly. Thus, the relative efficiency of the YOLO-LITE backbone structure in a lightweight model was verified.
In Trials 9 and 11, the amount of computation and the model size were greatly reduced compared with Trial 8 as the number of convolution layers in the detection part was gradually reduced, while the mAP, recall, and F1 score decreased only slightly. This confirmed that a smaller detection part is effective for a lightweight model.


C. Narrow channel
In Trials 10 and 12, the backbone network was adjusted according to the structure in Table 1. The number of channels of each layer was significantly reduced compared to that of YOLOv3. Trials 10 and 12 were designed to verify the effectiveness of the narrow-channel backbone network.
In the comparative trials between Trials 10 and 9 and between Trials 12 and 11, the numbers of convolution layers of the models were exactly the same, but the numbers of channels in Trials 10 and 12 were remarkably reduced compared to those of Trials 9 and 11. The mAP of Trials 10 and 12 decreased by 6.76% and 5.99%, respectively, while the amount of computation, the number of parameters, and the model size were reduced by factors of approximately 3.5, 7.6, and 7.7, respectively. This confirmed the relative efficiency of the narrow channel in the lightweight model.

D. ResBlock
Based on Trial 12, one layer of ResBlock was added before the output of each of the three-scale feature maps in Trial 13, and the mAP, recall, and F1 score of the model increased by approximately 0.8%. However, the amount of computation, the number of parameters, and the model size increased by approximately 30%, 66%, and 61%, respectively, which is not cost-effective.
E. Parallel high-to-low-resolution subnetworks
In the first set of comparative experiments, parallel high-to-low-resolution subnetworks were added based on Trial 2, and the three-scale feature maps were fused before output in Trials 4 and 5. All convolutional layers in this set were depthwise-separable convolutions. The difference between Trials 4 and 5 is that in Trial 4 the multi-scale feature maps were connected after resolution reconstruction using a convolution operation for channel feature fusion, which made its mAP decrease by 0.24% compared with Trial 2. Building on Trial 4, Trial 5 reduced the number of residual blocks in the backbone, and its mAP decreased by 2.53% compared with Trial 4. This comparison shows that a parallel structure built with depthwise-separable convolutions cannot improve the accuracy of the network; it also shows that the number of residual blocks, i.e., the depth of the backbone, affects network performance.
In the second set of comparative experiments, parallel high-to-low-resolution subnetworks were added based on Trial 12, and the three-scale feature maps were fused before output in Trials 14 and 15. All convolutional layers in this set were standard convolutions. The difference between Trials 14 and 15 is that downsampling in Trial 14 was achieved by convolution with a stride of two, whereas in Trial 15 it was achieved by max pooling. The mAP of Trials 14 and 15 improved by 0.05% and 1.85%, respectively. This comparison demonstrated the effectiveness of using standard convolutions, max pooling, and the parallel structure.
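The max-pooling downsampling compared against the stride-2 convolution in Trials 14/15 can be sketched with a toy 2 × 2 pool; the input values are illustrative:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a single-channel H x W map."""
    h, w = x.shape
    # Group pixels into 2x2 windows, then take the max within each window
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))
```

Unlike a stride-2 convolution, max pooling halves the resolution with no learned weights, which keeps the strongest activation in each window and adds nothing to the parameter count.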
F. Comprehensive tests
Trials 16, 17, and 18 were all modified based on Trial 15. In Trial 16, the convolution kernel of the parallel high-to-low-resolution subnetworks was changed from 1 × 1 to 3 × 3. In Trial 17, one layer of ResBlock was added before the output of each three-scale feature map. In Trial 18, three layers of ResBlocks with inverted-bottleneck structures [17] were added before the output of each three-scale feature map. Compared with Trial 15, the mAP of Trials 16, 17, and 18 increased by 1.12%, 3.32%, and 2.88%, respectively. Trial 17 exhibited the best performance in terms of precision, recall, and F1 score, which increased by 13.68%, 8.48%, and 12.67%, respectively, while the amount of computation increased by only 0.389 GFLOPs.
Trial 19 added a 3 × 3 convolution layer before the output of the parallel-structure three-scale feature maps of Trial 18, Trial 20 adjusted the number of ResBlocks in each part to three layers based on Trial 17, and Trial 21 moved the last ResBlock of Trial 20 forward to reduce the number of channels. The results show that the mAP of Trials 19 and 20 was slightly higher than that of the original network, but the amount of computation and the number of parameters increased more significantly. The adjustment in Trial 21 significantly reduced the amount of computation but sacrificed 1.31% of the mAP.

PASCAL VOC
A total of 21 different trials were performed in this study; the results are shown in Table 6. The precision, recall, F1 score, mAP, and FPS of YOLO-LITE, YOLOv3, MobileNetV1-YOLOv3, MobileNetV2-YOLOv3, and the different trials obtained using the PASCAL VOC 2007 test dataset are illustrated in Figure 6. As seen from the experimental results, YOLO-LITE achieved 102 FPS (non-GPU) in the experimental environment, but its mAP was only 33.77%. The mAP of YOLOv3 was 55.81%, but its speed was only approximately 11 FPS (non-GPU), far lower than that of YOLO-LITE. Under the same parameter settings and experimental environment, MobileNetV1-YOLOv3's mAP was approximately 6.27% at a detection speed of approximately 19 FPS, whereas MobileNetV2-YOLOv3's mAP was 13.26% at 21 FPS. These results demonstrate that it is difficult for these models to achieve real-time object detection on non-GPU-based computers or mobile terminals. Considering precision, recall, and F1 score together, Trial 17 (Mixed YOLOv3-LITE) yielded the best performance, achieving 49.53%, 69.54%, and 57.85% for these indices, respectively. The amount of computation of the model was 2.48 GFLOPs, only 13% of that of YOLOv3; the model size was 20.5 MB, only 8.3% of that of YOLOv3; and 60 FPS was achieved in the non-GPU experimental environment, approximately 5.5 times the speed of YOLOv3. Although its speed was lower than that of YOLO-LITE, its mAP was 14.48% higher. A portion of the experimental results for the Mixed YOLOv3-LITE model using the PASCAL VOC 2007 test dataset is shown in Figure 7.

VisDrone 2018
We selected Trial 17 (i.e., Mixed YOLOv3-LITE), which yielded the best results using the PASCAL VOC dataset, to train on the VisDrone 2018 dataset. Sixty epochs of training were performed using the training set with input images of size 832 × 832; the model was then tested using the validation dataset and compared with the data for SlimYOLOv3 [34]. The results are shown in Table 7. The mAP of Mixed YOLOv3-LITE was clearly higher than those of the tiny-YOLOv3 and SlimYOLOv3 series networks, and it also outperformed the other two networks in terms of computational cost and model size. Mixed YOLOv3-LITE achieved 47 FPS in the test environment when an NVIDIA RTX 2080Ti GPU was used. The detection efficacy of Mixed YOLOv3-LITE (832 × 832) for each type of object in the VisDrone2018-Det validation dataset is shown in Table 8. The category distribution of the VisDrone2018-Det dataset is highly uneven, which makes it more challenging. For example, car instances accounted for approximately 36.29% of the total, whereas awning tricycles accounted for relatively few samples, only 1.37% of the total number of instances. This introduces an imbalance problem into detector optimization: the AP achieved for cars was 70.79%, whereas that for awning tricycles was only 6.24%. In the Mixed YOLOv3-LITE design process, the convolution layer structure was reorganized and pruned, but the problem of category imbalance was not addressed, which provides guidance for further optimization of the network in the future.
A portion of the results for Mixed YOLOv3-LITE obtained using the VisDrone2018-Det validation dataset is shown in Figure 8.
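One common mitigation for the category imbalance noted above is to weight each class's loss contribution by inverse frequency. The sketch below illustrates the idea; only the car (~36.29%) and awning-tricycle (~1.37%) proportions come from the text, and the remaining counts are made-up placeholders:

```python
# Hypothetical instance counts echoing the imbalance described above
# (only 'car' and 'awning-tricycle' proportions reflect the paper; the
# other categories and all absolute counts are illustrative).
counts = {"car": 36290, "pedestrian": 20000, "van": 15000, "awning-tricycle": 1370}

def inverse_frequency_weights(counts):
    """Per-class loss weights proportional to inverse frequency,
    normalized so the weights average to 1."""
    total = sum(counts.values())
    raw = {c: total / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

weights = inverse_frequency_weights(counts)
# Rare classes receive proportionally larger loss weights than frequent ones,
# pushing the detector to spend capacity on under-represented categories.
print(weights["awning-tricycle"] > weights["car"])  # True
```

Such reweighting (or related techniques such as focal loss or class-balanced sampling) would be a natural direction for the future optimization the text mentions.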

ShipData Results
Mixed YOLOv3-LITE and YOLOv3 were trained using the ShipData dataset. The experiment was divided into two parts: (1) training using subset A and testing using subset B; and (2) training using subset B and testing using subset A. The input image size was 224 × 224, all other training and test parameter values were kept the same, and 60 epochs of training were conducted. The results are shown in Table 9. For this single-category dataset, when the proportion of training data was 70%, the mAPs of Mixed YOLOv3-LITE and YOLOv3 were 98.88% and 98.60%, respectively. When the proportion of training data was 30%, the recall rates of Mixed YOLOv3-LITE and YOLOv3 were 64.68% and 51.65%, respectively; however, the precision and F1 scores were slightly lower. The results of the two groups of experiments show that the network proposed in this study yields better detection results for the single-category remote-sensing ShipData set. A portion of the detection results for Mixed YOLOv3-LITE in the first experiment is shown in Figure 9.
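The two-part protocol above amounts to a single 70/30 split used in both directions, with each subset taking a turn as the training set while the other is held out. A minimal sketch of that split, with illustrative file names rather than the authors' actual data handling:

```python
# Two-fold protocol as described above: split once into A (70%) and B (30%),
# then swap the roles of training and test sets between the two experiments.
# Sample names and the fold structure are illustrative, not the authors' code.
def split_dataset(samples, train_fraction=0.7):
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]  # (subset_A, subset_B)

samples = [f"ship_{i:04d}.jpg" for i in range(1000)]
subset_a, subset_b = split_dataset(samples)

folds = [(subset_a, subset_b),  # experiment 1: train on A, test on B
         (subset_b, subset_a)]  # experiment 2: train on B, test on A
print(len(subset_a), len(subset_b))  # 700 300
```

Swapping the subsets checks that the comparison between the two networks holds whether the model sees 70% or only 30% of the data during training.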

Performance Tests Based on Embedded Platform
Mixed YOLOv3-LITE was tested with a Jetson AGX Xavier device; the results are shown in Table 10. With an input image size of 224 × 224, it reached 43 FPS; for UAV imagery with the input size adjusted to 832 × 832, it still reached 13 FPS. In summary, the proposed method meets the established real-time requirements.
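These throughput figures correspond to per-frame latencies of roughly 1000/43 ≈ 23 ms and 1000/13 ≈ 77 ms. A minimal harness for measuring FPS in this way might look as follows; the `infer` callable is a stand-in for the real forward pass, which is assumed and not shown:

```python
import time

def measure_fps(infer, n_frames=100):
    """Average frames per second of `infer` over n_frames calls.
    `infer` stands in for a single model forward pass."""
    start = time.perf_counter()
    for _ in range(n_frames):
        infer()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed

# Dummy workload so the harness is runnable on its own; a real benchmark
# would call the detector on preloaded images here.
fps = measure_fps(lambda: sum(range(1000)))
print(fps > 0)  # True
```

Averaging over many frames (and discarding warm-up iterations in a real benchmark) smooths out scheduler jitter, which matters on embedded platforms like the Xavier.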


Conclusions
In this study, we proposed an efficient lightweight object detection network that uses a shallow-layer, narrow-channel, multi-scale parallel feature-fusion structure. On the one hand, the residual blocks and the parallel structure fuse deep and shallow features and output multi-scale feature maps, maximizing the utilization of the original features to improve accuracy. On the other hand, the detector is constructed using shallower and narrower convolutional layers than YOLOv3, reducing the amount of computation and the number of trainable parameters and speeding up network operation. The resulting Mixed YOLOv3-LITE thus has a narrower and shallower structure than YOLOv3 with fewer trainable parameters, significantly reducing the amount of computation and increasing running speed, while its detection precision is greatly improved compared to YOLO-LITE. Because computing power and power consumption are generally limited on non-GPU-based devices, mobile terminals, and other intelligent devices, efficient lightweight deep neural networks are needed to ensure longer battery life and stable operation for such devices. After comprehensive consideration, Mixed YOLOv3-LITE has been proven capable of achieving higher efficiency and better performance than YOLOv3 and YOLO-LITE on mobile terminals and other devices.