A Light-Weight CNN for Object Detection with Sparse Model and Knowledge Distillation

Abstract: This study details the development of a lightweight, high-performance model for real-time object detection. Several design features were integrated into the proposed framework to achieve a small model size, rapid execution, and strong detection performance. First, a sparse and lightweight structure was chosen as the network's backbone, and feature fusion was performed using modified feature pyramid networks. Recent learning strategies in data augmentation, mixed precision training, and network sparsity were incorporated to substantially enhance the generalization of the lightweight model and boost the detection accuracy. Moreover, knowledge distillation was applied to tackle accuracy drop issues, and a student–teacher learning mechanism was integrated to ensure the best performance. The model was comprehensively tested on the MS-COCO 2017 dataset, and the experimental results demonstrated that the proposed model obtains a high detection performance in comparison to state-of-the-art methods while requiring minimal computational resources, making it feasible for many real-time deployments.


Introduction
Object detection [1] is ubiquitous in real-time computer vision applications, but it still poses many challenges for deployment. The main purpose of object detection is to find the target objects in an image, and to determine the position and type of each of them. Traditional object detection methods have many limitations, due to the computing hardware and the availability of data [2,3]. In addition, the majority of these methods rely on human annotations to create a knowledge base for a particular task, and computers are only required to apply the inference rules. In recent years, however, with the growth of artificial intelligence and computational resources, the whole process can be machine driven, without much human effort. The major difference is that classical object detection methods obtain features through hand-crafted rules and expert opinions, whereas artificial intelligence makes use of a complex neural network architecture, which can be trained to automatically extract strong and discriminative features. Computer vision covers a very wide range of research fields, such as scene recognition [4], image reconstruction [5], image generation [6], face recognition [7], traffic scenes [8], and object detection [9]. The proposed method was developed with important practical constraints in mind: good detection accuracy, a high frame rate, and minimal computational resources. The main application scope of the proposed method is autonomous driving and smart video surveillance systems.

Related Work
Object detection is generally performed on images or videos. The objective is to locate the borders of each object, indicating its range and location, and subsequently to classify the category of the object (such as people, cars, airplanes, or horses) and give the classification probability. In comparison with simple image classification, this is a more challenging task, because the positions of multiple objects need to be detected in the image or video. Objects may vary greatly in size or be occluded by other objects (so that only part of their appearance can be seen); there may also be changes in environmental light and shadow (backlighting, shadows), differences within the same category (dogs of different breeds and colors), and differences in viewing angle (head-up view, aerial photography, depression angle, etc.). Additional challenges occur in practical scenarios, which demand multiple-object detection algorithms that perform within limited time and computational resources.
Object detection methods are mainly divided into two types. In terms of architecture flow, they can be divided into one-stage object detection and two-stage object detection.
The earlier methods in object detection used low-level features for preliminary estimation. To avoid missed detections, a large number (thousands) of candidate regions were generally used for subsequent object judgements. This process consumes a lot of computing resources, and is still incapable of listing all the large and small candidate areas. The early strategy was to refer to the lines and textures of the picture to estimate the possible object positions. For example, the selective search method, which is still frequently used, is based on this strategy. In other approaches, the texture distribution in the area is used as the feature value, and then a classifier, such as a support vector machine (SVM) [10], is used to make judgments. A common feature is the histogram of oriented gradients (HOG) [11], which uses the magnitude and direction of the change in shading (the difference between adjacent image values) to describe the texture. As for the evaluation of object detection, intersection over union (IoU) is most commonly used to calculate the overlap between the detected object and its real position. The main issues with the traditional methods are their lack of robustness to object size, lighting conditions, and object categories, and the fact that most are limited to single-object detection. With the advancements in deep learning, recent works have predominantly been based on deep convolutional neural network architectures.
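The IoU measure mentioned above can be illustrated with a minimal sketch. The `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions made for this example and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed format).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes that overlap in a 1 x 1 square: IoU = 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A detection is usually accepted as correct when this value exceeds a chosen threshold, such as 0.5.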
Earlier deep learning-based approaches use two-stage object detection, in which object localization and classification are performed separately, and this significantly limits performance. Some early representatives include region-based convolutional neural networks (R-CNN) [12] and the spatial pyramid pooling network (SPP-Net) [13]. To improve on these methods, many streamlined one-stage object detection approaches have been proposed. The main difference is that object localization and classification are combined into a single stage, in models such as you only look once (YOLO) [14][15][16], single-shot detector (SSD) [17], fully convolutional one-stage object detector (FCOS) [18], and EfficientDet [19].
Recently, the YOLOv4 [16] method verified the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials object detection methods on the detector training process. These training techniques successfully improved the speed and accuracy. The backbone architecture uses CSPDarknet53, an improvement on the backbone of YOLOv3 [14]. The base layer is divided into two parts, and the two separated parts are recombined through a transition layer that includes a 1 × 1 convolutional layer and a pooling layer. Thus, the network can learn richer gradient combinations, and the fused features reduce the amount of calculation. This strategy increases the overall learning ability of the CNN and also makes it lighter. In the neck, the original feature pyramid network (FPN) [20] structure is replaced with a path aggregation network (PA-Net) [21], which builds on FPN by turning the single path from the upper layer to the lower layer into two-way transmission, and a spatial pyramid pooling (SPP) [22] method is used to change the original feature addition to feature merging. Although there is a slight increase in calculation, the detection performance is improved.
More recently, YOLOv4-tiny [23] was proposed, which is more efficient than YOLOv4 and had promising results on the MS-COCO [24] dataset. This method used a powerful model scaling method, and the FPN structure is inherited in its once-for-all structure. Another notable work, termed Nanodet [25], was proposed based on the assign guidance module (AGM) and the dynamic soft label assigner (DSLA), and was implemented on mobile devices. The Nanodet model achieves a higher FPS rate than YOLOv4-tiny and has a better accuracy. In this work, we considered these two recent lightweight object detection models as the baselines, and developed an even more efficient and lightweight model, which performs better than the above methods in terms of FPS and detection accuracy.
Relative to the existing works discussed above, the main improvements of the proposed work are in terms of the lightweight backbone, anchor-free detection, sparse modelling, data augmentation, and knowledge distillation. Their integration has substantially improved the training speed, inference speed, and detection accuracy.
Figure 1 compares the leading object detection models with respect to the average precision (AP) and frames per second (FPS). In comparison to all of the former methods, the proposed model achieves the highest FPS, as well as a better precision than YOLOv4-tiny and EfficientDet. Moreover, the proposed model has a lightweight backbone that can be quickly trained and combined with anchor-free detection methods.
Figure 1. Comparison of average precision (AP) with respect to the frame processing speed (FPS) against the state-of-the-art methods: SSD [17], EfficientDet [19], RetinaNet [26], DETR [27], YOLOv3 [14], YOLOv3-tiny [15], YOLOv4 [16], and YOLOv4-tiny [23].
Most of the existing solutions are based on model search and iterative methods, which are computationally intensive and time-consuming. To overcome this problem, the proposed solution uses mixed precision [28] and sparse network generation [29] to improve the model training and inference speed.
In general, YOLOv4-tiny trades a lower accuracy for its high FPS. In the proposed approach, the accuracy is improved through data augmentation and knowledge distillation [30], without compromising the processing speed. The proposed model outperforms YOLOv4-tiny by around 8%, and this is achieved with a model size reduction of around 40%.
In a nutshell, the proposed method has advantages compared with other methods in terms of the processing efficiency and accuracy.

Proposed Work
The main network architecture of this study is shown in Figure 2 and is similar to one-stage object detection algorithms such as FCOS [18] and EfficientDet [19], and is mainly composed of a backbone, neck, and head.
Electronics 2022, 11, x FOR PEER REVIEW 4 of 13

Figure 2. Network architecture: an EfficientNet-Lite backbone [31] and a modified path aggregation network (PAN) [20], using Generalized focal loss v2 [32] as the detection head representation and loss function.
Normally, to localize and classify a target in an image, we first extract the necessary features from the image, such as HOG features, and subsequently use these features for positioning and classification. In deep learning, the backbone of the network is responsible for extracting features from images. There are many well-known models to choose from, such as ResNet [33], MobileNet [34,35], and ShuffleNet [36]. The reasons for choosing EfficientNet-lite [31] as the backbone for object detection in this study were as follows. EfficientNet-based CNN models have achieved state-of-the-art performance in object detection. EfficientNet uses the AutoML MNAS [37] framework to execute a neural architecture search (NAS) to develop the base model, and combines it with compound scaling. It has a good accuracy and an excellent model size, and has attracted widespread attention recently. EfficientNet-lite is a lightweight and improved version of EfficientNet: the model removes the squeeze-and-excite module, as this module is not optimized for mobile use, and the swish activation function is replaced by ReLU6 [38] to make the model easier to quantize. In addition, the stem and head modules are kept fixed during model scaling to preserve the lightweight advantage. Compared with MobileNetV2 [35], ResNet-50 [33], and Inception-V4 [39], EfficientNet-lite has a better trade-off between accuracy and size.
The classification of objects by this model does not have much room for improvement, due to the use of a pre-trained backbone network. The accuracy of positioning, however, has a great impact on the performance of the object detection algorithm. Localization ambiguity is widespread across datasets and actual application scenarios, and it is especially difficult for lightweight object detection networks to detect fuzzy boundaries. Generalized focal loss [40] uses a general discrete distribution to express the uncertainty of the bounding box, so that each box is expressed as a probability distribution, without any prior knowledge restrictions. Going further, Zheng [28] introduced distillation learning into the localization branch of the object detection network, and proposed localization distillation to improve the bounding box predictions of the student using a high-performance teacher network. Through distillation, the student network learns to resolve location ambiguity in a similar way to the teacher network.
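The two ideas in this paragraph can be sketched in a few lines: a box edge represented as a discrete distribution over offset bins, and a distillation loss that pulls the student's distribution towards the teacher's. This is a minimal illustration, not the authors' implementation; the function names, the bin layout `0..n`, and the temperature value are assumptions chosen for the example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_sum for z in scaled]


def expected_offset(edge_logits):
    """Box-edge offset as the expectation of a discrete distribution over bins 0..n."""
    probs = [math.exp(lp) for lp in log_softmax(edge_logits)]
    return sum(bin_index * p for bin_index, p in enumerate(probs))


def localization_distillation_loss(student_logits, teacher_logits, temperature=10.0):
    """KL(teacher || student) between temperature-softened edge distributions.

    The temperature of 10 is a hypothetical value used only for illustration.
    """
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


# A sharply peaked teacher and a flat (ambiguous) student over four bins.
teacher = [0.0, 8.0, 0.0, 0.0]
student = [0.0, 0.0, 0.0, 0.0]
loss = localization_distillation_loss(student, teacher)
```

The loss is zero when both networks predict the same distribution and grows as the student's edge distribution departs from the teacher's.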
Conventional object detectors are usually trained offline, and, thus, researchers can make full use of this to find better training methods that improve accuracy without increasing the inference overhead. In this approach, the training strategy follows the Bag of Freebies idea of improving accuracy at training time, for example through image pre-processing methods such as MixUp and CutMix. Furthermore, mosaic data augmentation, GIoU loss, label smoothing, and knowledge distillation [30] are used, together with mixed precision training [28] and sparsity model generation [29].
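As one concrete example of the training-time techniques listed above, label smoothing can be sketched as follows. The smoothing factor `epsilon=0.1` and the function name are assumptions for illustration; the paper does not state the value it used.

```python
def smooth_labels(class_index, num_classes, epsilon=0.1):
    """Replace a one-hot target with a smoothed target distribution.

    Every class receives epsilon / num_classes, and the true class
    additionally receives the remaining 1 - epsilon.
    """
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[class_index] += 1.0 - epsilon
    return target


# Smoothed target for class 2 out of 4 classes.
target = smooth_labels(class_index=2, num_classes=4)
```

Because the target no longer demands full confidence in a single class, the classifier is discouraged from producing over-confident predictions.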
The object detection and identification system proposed in this study is a one-stage object detector, inspired by Nanodet and based on a lightweight convolutional network. It modifies the path aggregation network (PAN) and uses Generalized focal loss v2 as the detection head representation and loss function to achieve real-time object detection. In object detection, anchor-based models such as YOLO and SSD have long occupied a dominant position. The main purpose of this research was to develop a real-time anchor-free detection model, which provides a performance not inferior to YOLOv4-tiny, and that is also convenient to train and port. Some plug-in modules or post-processing methods can improve the detection accuracy with only a small impact on the inference time, and these methods are called 'bag of specials'. Plug-in modules are usually used to enhance specific attributes of the model, such as expanding the receptive field, introducing an attention mechanism, or strengthening the ability to combine features. Post-processing involves fine-tuning some mechanisms in accordance with the predicted results of the model. For example, the structure of the FPN is adjusted, along with an alternative activation function and Generalized Focal Loss v2 [32], to obtain a superior classification accuracy.
Most of the previous studies have used FP32 (single precision) [23,25] for neural network training. In recent years, NVIDIA has proposed the use of a mixed FP32 + FP16 format, namely mixed precision training (MPT). MPT [28] uses half-precision floating-point numbers to accelerate training, while minimizing precision loss. It uses FP16 to record weights and gradients, which accelerates training while reducing memory usage. The representable range of FP16 is $5.96 \times 10^{-8}$ to 65,504, while that of FP32 is $1.4 \times 10^{-45}$ to $3.4 \times 10^{38}$. The much narrower range of FP16 shows that the biggest problem in comparison to FP32 training is the loss of numerical precision. The GPU resource requirements of existing models, and techniques for using these resources efficiently, are discussed next.
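The effect of the narrow FP16 range can be reproduced with Python's standard library alone. The sketch below is an illustration and not the training code of this study: it rounds values to half precision with the `struct` format character `e`, shows a small gradient underflowing to zero, and shows how multiplying by a loss scale before the conversion preserves it. The gradient value and the scale of 1024 are hypothetical.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def overflows_fp16(value):
    """Return True if the value is too large to be stored in half precision."""
    try:
        struct.pack("<e", value)
        return False
    except OverflowError:
        return True


tiny_gradient = 1e-8   # hypothetical gradient below the smallest FP16 value
loss_scale = 1024.0    # hypothetical loss-scaling factor

# Stored directly, the gradient is lost.
underflowed = to_fp16(tiny_gradient)

# Scaled before the FP16 conversion and unscaled afterwards in higher
# precision, the gradient survives with a small rounding error.
recovered = to_fp16(tiny_gradient * loss_scale) / loss_scale
```

This is the reason mixed precision schemes typically keep a higher-precision copy of the weights and scale the loss before back-propagation.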
Even though the introduction of GPUs has accelerated the training process, the rapid growth in model complexity keeps increasing the demand for training on multiple GPUs and tensor processing units (TPUs). For instance, AlexNet used two GPUs to train on ImageNet for 5-6 days in 2012. Similarly, training ResNeXt-101 in 2017 required eight GPUs for more than 10 days. By 2019, Noisy-Student [41], a self-training framework based on knowledge distillation proposed by Google, required training on more than 1000 TPUs for 7 days. Some studies have begun to explore structured pruning and have achieved good results, while other studies have used unstructured sparsity pruning to preserve accuracy. However, unstructured sparsity (0-D) is difficult to exploit with modern vector and matrix math instructions, which has led to the development of more complex methods integrated with deep learning frameworks for acceleration.
As model training often uses the ReLU activation function, many weights are 0. In this study, it was observed that the gradient vanished when training with ReLU6 after generating the sparsity model. This is because 50% of the weights are not involved in the gradient update as a result of the mask, as shown in Figure 3. ReLU6 sets negative values to 0, so the overall network cannot continue to converge when very few weights can hold values. The unstructured sparse matrix caused by ReLU6 also cannot benefit from hardware acceleration, and, thus, this study chose to use automatic sparsity [29] to generate 50% sparsity. This was the main reason for switching the activation function to the self-regularized non-monotonic activation function (Mish) [42] and sigmoid weighted linear units (SiLU) [43].
This study used the automatic sparsity module under the NVIDIA PyTorch Extension to generate sparse networks. As shown in Figure 4, two of every four elements in the weight matrix are masked, which is called 2:4 sparsity, and this pattern is supported by Tensor Core hardware-level acceleration. With this approach, the original network is first trained to a certain stage, and the sparsity is introduced afterwards. This avoids masking weights prematurely at the beginning of training, which would affect the training convergence too much. Once the convergence is stable, sparse network generation is used to improve the training speed and training accuracy.
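The 2:4 pattern can be illustrated without any deep learning framework. The sketch below is not the NVIDIA module used in the study; it assumes, for illustration, that the two weights kept in each group of four are the two with the largest magnitude, and the function names are hypothetical.

```python
def two_four_mask(row):
    """Return a 0/1 mask that keeps 2 weights in every group of 4.

    Assumption for this sketch: the two largest-magnitude weights are kept.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    mask = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest absolute values within this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


def apply_mask(row, mask):
    """Zero out the masked weights."""
    return [weight * keep for weight, keep in zip(row, mask)]


weights = [0.1, -0.9, 0.4, 0.05, 2.0, 0.0, -0.3, 1.0]
mask = two_four_mask(weights)
sparse_weights = apply_mask(weights, mask)
```

Every group of four retains exactly two non-zero candidates, so the overall sparsity is 50%, matching the figure quoted in the text.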

Experiments and Results
This section covers the extensive experiments carried out to validate the performance of the proposed work. The description covers the dataset, evaluation metrics, and computational resources, and inference studies are provided. Furthermore, the implementation details of this study, including the model architecture used for training, its parameters, and the ablation studies, are explained in detail. Finally, the proposed method was tested on the public MS COCO [24] dataset and compared with the former methods, as shown in Table 1. For model training and testing, 118 k and 5 k images were used, respectively. The results of each experiment were analyzed to verify the effectiveness of the proposed method. For evaluation, average precision (AP) was used, which refers to the area under the precision-recall curve. Generally, AP is calculated for each class, and the average over all classes is defined as the mean average precision (mAP). Furthermore, AP50 counts a detection as correct when its IoU with the ground truth is at least 50%, and AP75 requires an IoU of at least 75%.
This study ensured the performance of the model for real-time applications with a good detection accuracy. To achieve a greater adaptability, EfficientNet-Lite with integrated scaling capabilities was used as the detection network (backbone). After fine-tuning, SiLU was chosen as the activation function, in comparison with ReLU6, Mish, and LeakyReLU. The reason for this is that sparse pruning was used in the acceleration strategy. If ReLU is used, it cannot preserve negative values during the training process, as shown in Figure 5, which can easily lead to failures. Regarding the vanishing gradient issues, the model was tested with Mish and SiLU. SiLU was found to provide a better performance empirically, and, thus, was used as the activation function for the backbone model.
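The difference between the activation functions compared here is easy to see numerically. The following sketch uses the standard definitions of ReLU6, SiLU, and Mish; it is intended for moderate input values and is not taken from the authors' code.

```python
import math


def relu6(x):
    """ReLU6: negative inputs become 0, large inputs are capped at 6."""
    return min(max(0.0, x), 6.0)


def silu(x):
    """SiLU: x * sigmoid(x); small negative outputs are preserved."""
    return x / (1.0 + math.exp(-x))


def mish(x):
    """Mish: x * tanh(softplus(x)); also keeps small negative outputs."""
    return x * math.tanh(math.log1p(math.exp(x)))


# For a negative input, ReLU6 discards the value while SiLU and Mish keep
# a small negative response, so a gradient can still flow through the unit.
responses = {name: fn(-1.0) for name, fn in
             (("relu6", relu6), ("silu", silu), ("mish", mish))}
```

When half of the weights are already masked by sparsity, preserving these small negative responses leaves more of the network able to contribute to the gradient update.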
Figure 5. Activation functions compared, including SiLU [43] and ReLU [38].
The learning rate was initialized at 0.0003 and increased to 0.16 over 500 steps. The computer system comprised two Intel(R) Gold 5218 CPUs with 8 × 64 GB of DDR4 2666 MHz ECC RDIMM RAM, and two NVIDIA GeForce RTX 3090 GPUs with 24 GB of memory each. The batch size was set to 128. Mixed FP16 and FP32 precisions were adopted for training, the mixed precision parameters were set to NVIDIA Apex AMP O3, and the training time was around 4 days and 12 h.
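The warm-up described above can be written as a small schedule function. The start value, peak value, and number of steps come from the text; the linear shape of the ramp and the constant rate after warm-up are assumptions of this sketch, since the paper does not specify them here.

```python
def warmup_lr(step, start_lr=0.0003, peak_lr=0.16, warmup_steps=500):
    """Learning rate at a given step, assuming a linear warm-up.

    After the warm-up the rate is held at peak_lr; any later decay
    schedule is outside the scope of this sketch.
    """
    if step >= warmup_steps:
        return peak_lr
    return start_lr + (peak_lr - start_lr) * step / warmup_steps


# Learning rate at the start, the midpoint, and the end of the warm-up.
schedule = [warmup_lr(step) for step in (0, 250, 500)]
```

Starting from a small rate gives the randomly initialized detection head time to stabilize before the large peak rate is applied.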
During the inference stage, the model weights were saved and converted to the ONNX [44] format. The converted model size was only 15.1 MB, compared to the 245 MB of YOLOv4 and the 23.1 MB of YOLOv4-tiny. The lightweight model has great performance advantages for mobile or edge processing. Running on an NVIDIA RTX 2080 Ti with NCNN [45], the proposed model reaches an AP50 of more than 47 percent at a detection speed exceeding 300 FPS. We used the frames per second (FPS) and the COCO2017 Val mAP as evaluation indicators, to balance the evaluation of speed and accuracy. The FPS was measured on a single NVIDIA RTX 2080 Ti by timing predictions with a batch size of 1 and converting the result to frames per second. In total, there are 80 object categories in COCO, and mAP was calculated by averaging the AP of all object categories.
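The two evaluation indicators can be sketched as follows. The helper names, the warm-up count, and the stand-in `predict` callable are assumptions for this example; a real measurement would call the deployed model and synchronize with the GPU before reading the clock.

```python
import time


def measure_fps(predict, inputs, warmup=2):
    """Frames per second for batch-size-1 prediction over the given inputs."""
    for sample in inputs[:warmup]:
        predict(sample)          # warm-up runs are not timed
    start = time.perf_counter()
    for sample in inputs:
        predict(sample)          # one image per call, i.e. batch size 1
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed


def mean_average_precision(ap_per_class):
    """mAP as the plain average of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Stand-in predictor used only so that the sketch runs on its own.
fps = measure_fps(lambda image: image, list(range(1000)))
```

Reporting both numbers together makes the speed and accuracy trade-off of each configuration explicit.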
The ablation tests are provided in Table 2, in which EfficientNet-Lite0 and EfficientNet-Lite1 were adopted as the backbone of the network, to allow adaptation in the experiments and achieve a good balance between accuracy and processing speed. Experimental Group 1 served as the baseline, using EfficientNet-Lite0 as the backbone with an input resolution of 320 × 320. Although the mAP of Group 1 was 5% lower than that of Group 5, it had an advantage in terms of speed, and the detection network could still achieve sufficient accuracy to reduce the probability of misjudgment or missed detection. The other experiments tested alternative activation functions and used different training strategies. Using Generalized focal loss v2 slightly affected the model size; although the impact on speed was small, it had a positive impact on accuracy. The MS-COCO [24] Val set was used to verify the performance of the proposed module. Experimental Group 2 used Mish as the activation function, with reference to YOLOv4, and increased the input size to 416 × 416, the same size as YOLOv4-tiny; its accuracy was 2.6% higher than the baseline and around 5.3% higher than that of YOLOv4-tiny. Experimental Group 3 replaced the activation function with SiLU and added mixed precision training, which increased the performance by around 0.3% compared with Group 2. Experimental Groups 4 and 5 both used the SiLU activation function and Generalized focal loss v2 as the detection head, and added mixed precision training and sparse network pruning, which improved the performance by around 2.1% in comparison with Group 3. Compared to the baseline Group 1, this is an improvement of around 5%, which is about 14.4% higher than YOLOv3-tiny and 7.7% higher than YOLOv4-tiny.
Experimental Group 5 added localization distillation on this basis to finally increase the accuracy to 30.7%. The overall parameters were only 4 M, and the final recorded model size was 15.1 MB.
This study focused on balancing the execution speed and accuracy of the model. The one-stage anchor-free FCOS was used with ATSS [46] target sampling, and Generalized Focal Loss v2 [24] was used as the loss function for target classification and bounding-box regression. This method removes the FCOS centerness branch, which saves a large number of convolution operations. Furthermore, we added additional branches that use bounding-box regression statistics to make up for the accuracy loss in classification. Training strategies such as mixed precision training, sparse network pruning, fine-tuning of the training parameters, mosaic data augmentation, and knowledge distillation not only preserved the computation speed but also improved the accuracy. The proposed model outperformed YOLOv4-tiny by 7.7% in terms of mAP. Several practical images were used to compare Nanodet-320, YOLOv4-tiny-416, and the proposed method with the Group 5 configuration.
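The regression statistics mentioned above can be illustrated as follows. In Generalized Focal Loss v2, each side of a box is predicted as a discrete distribution over distance bins, and the sharpness of that distribution serves as a cue for localization quality. The sketch below is a simplified illustration rather than the implementation used in this study: the bin count, the logits, k = 4, and taking the mean over the top-k values are all assumptions, and the small network that maps these features to a quality score is omitted.

```python
import math


def softmax(logits):
    """Convert raw bin logits into a discrete probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def side_statistics(logits, k=4):
    """Top-k bin probabilities of one box side, followed by their mean."""
    probs = softmax(logits)
    topk = sorted(probs, reverse=True)[:k]
    return topk + [sum(topk) / k]


def box_statistics(side_logits, k=4):
    """Concatenate the statistics of the four sides (left, top, right, bottom)."""
    features = []
    for logits in side_logits:
        features.extend(side_statistics(logits, k))
    return features  # length 4 * (k + 1)


# Hypothetical logits over 8 distance bins: one confident side, one uncertain side.
sharp = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [0.0] * 8

print("sharp side:", [round(v, 3) for v in side_statistics(sharp)])
print("flat side: ", [round(v, 3) for v in side_statistics(flat)])
print("feature length:", len(box_statistics([sharp, flat, sharp, flat])))
```

A sharply peaked distribution yields a large top-1 probability, whereas a flat distribution yields small, nearly equal values, so the feature vector carries information about how certain the box regression is.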
As shown in Figure 6, YOLOv4-tiny does not fit the boundary of the detected person, and the skis are not detected. In addition, the person at the top-left corner is not detected. Nanodet, on the other hand, misclassifies the skis as a surfboard. In contrast, the proposed method produces more accurate results: it correctly classifies the skis, the bounding boxes are precisely placed on the targets, and more individuals are detected at the top of the image. Similar results are found in Figure 7, in which the proposed model detected more objects with well-fitted bounding boxes. From Figure 7a, it can be seen that Nanodet failed to detect small vehicles. This is due to its smaller image input size, which improves the detection speed but loses the ability to detect small objects. In Figure 7b, YOLOv4-tiny failed to detect some people standing near the vehicles, whereas from Figure 7c it can be inferred that the proposed method had a better multiple object detection capacity and also executed rapidly. However, some traffic light signals and people cycling were not detected by the proposed method. From our analysis, we found that an increase in input size can resolve some of these issues; however, this also reduces the processing speed. Hence, there exists a clear trade-off between the prediction accuracy and detection speed, which will be addressed in the future. Figure 7d-f illustrates the same phenomenon in city scenes, where the proposed model had a superior detection performance.

Conclusions
In this study, a lightweight CNN model was designed, and an accelerated training strategy was used to improve the inference speed, while maintaining a good detection accuracy. Robust object detection models normally require a deep architecture, which requires more parameters and is harder to train. In addition, many state-of-the-art object detection models are trained with multiple GPUs or TPUs, which is not feasible for real-time applications and is costlier for wide deployment. In this study, effective data augmentation and the knowledge distillation method were used to improve the detection accuracy of the lightweight model. Moreover, the use of increased network sparsity, the reduction of excessive convolutional layer connections, the removal of the anchor-box pre-selection mechanism, and the optimization of the loss function feedback mechanism of the detection head led to a large reduction in the number of model parameters. Experimental results on the public dataset MS-COCO 2017 showed that the proposed method could achieve a good detection accuracy, with rapid execution. The future scope of this work is to further improve the detection of multiple objects in high-resolution images, without compromising the prediction speed. To summarize, the critical advantages of the proposed method are that the model can be easily trained and requires limited computational resources. Hence, the proposed method is feasible for many real-time applications and provides a more cost-effective solution.

Figure 2. Proposed object detection architecture. The main components are the backbone, which is based on the EfficientNet-Lite model [31], and a modified path aggregation network (PAN) [20], with Generalized Focal Loss v2 [32] as the detection head representation and loss function.


Figure 6. Performance comparison for water-skiing image.


Table 1. Comparison among the state-of-the-art detectors.


Table 2. Performance comparison of different EfficientNet versions.