A Surveillance Video Real-Time Object Detection System Based on Edge-Cloud Cooperation in Airport Apron

Abstract: The airport apron hosts many of the preparations for flight operation, and the progress of its various tasks is of great significance to flight operation. To build a more intelligent and easy-to-deploy airport apron operation analysis and guarantee system, a low-cost, fast, real-time object detection scheme is needed. In this article, a real-time object detection solution based on an edge-cloud system for airport apron operation surveillance video is proposed, which includes the lightweight detection model Edge-YOLO, an edge video detection acceleration strategy, and a cloud-based detection results verification mechanism. Edge-YOLO reduces the number of parameters and the computational complexity through model lightweighting, achieving better detection speed on edge embedded devices with weak computing power, and adds an attention mechanism to compensate for the accuracy loss. The edge video detection acceleration strategy further accelerates Edge-YOLO by exploiting the motion information of objects in the video to achieve real-time detection. The cloud-based detection results verification mechanism verifies and corrects the detection results generated at the edge through a multi-level intervention mechanism to improve their accuracy. Through this solution, reliable, real-time monitoring of airport apron video can be achieved on edge devices with the support of only a small amount of cloud computing power.


Introduction
Video surveillance has become very important nowadays. An enormous number of surveillance cameras are installed in public places, especially railway stations and airports. These cameras serve not only to protect infrastructure and public safety, but also as sensors for data collection. The flight operation process is mainly divided into two stages: the flight stage and the ground service stage. During the flight stage, the flight is dispatched by air traffic control and transports passengers under the service of the crew members. During the ground service stage, the flight must complete various pre-flight preparations, which can be divided into five processes: passengers, luggage, cargo, fueling, and cleaning [1]. These operations are completed on the apron. Consequently, the normal progress of airport apron operations is essential for the normal operation of the flight, and the associated operating data are of great significance for flight security and airport operation analysis. However, it is difficult to collect operational data for airport apron operations using traditional sensors. At present, the main approach is to have humans monitor the operation surveillance video to safeguard the operation process. Nonetheless, manual processing inevitably leads to problems such as untimely responses and operator fatigue. The main contributions of this paper can be summarized as follows:
• An object detection model, Edge-YOLO, with better real-time performance on edge embedded devices with weak computing power;
• A detection acceleration strategy that quickly generates non-key-frame detection results based on motion inference, further improving detection speed;
• A cloud-based detection result verification and correction mechanism, which can bring the system close to the accuracy of pure cloud detection using only a small amount of cloud computing power.

CNN-Based Object Detectors
CNN-based object detection methods can be organized into two main categories: the two-stage detection framework with region proposals, and the one-stage, region-proposal-free detection framework. In the two-stage framework, category-independent region proposals are generated from an image, CNN features are extracted from these regions, and category-specific classifiers then determine the category labels of the proposals. The R-CNN series of methods is representative of this category. Girshick proposed R-CNN in 2014 [12], which generates candidate regions using the selective search algorithm [13], extracts features with a convolutional neural network, and classifies objects with an SVM. Inspired by SPP-net [14], and to address the fact that per-region feature extraction in R-CNN [12] causes a large number of repeated, redundant computations and a large cache footprint, Girshick improved R-CNN and proposed Fast R-CNN [15], which feeds the entire image into the convolutional neural network once and then obtains the features of each candidate region directly from the feature map of the entire image according to the positional relationship, greatly improving detection speed. Subsequently, Ren proposed the faster Faster R-CNN [16] by replacing the time-consuming selective search [13] with a CNN-based Region Proposal Network (RPN).
The one-stage object detection method abandons the time-consuming candidate region generation process and directly predicts the position coordinates and category probabilities of the objects from the original image, which greatly improves the real-time performance of object detection. Typical representative methods of this branch include YOLO (You Only Look Once) [17] proposed by Redmon, CornerNet [18] proposed by Hei, and SSD [19] proposed by Liu. One-stage object detection methods are widely used in detection tasks with high real-time requirements due to their good real-time performance.

Edge Computing and Lightweight Object Detection Models
Edge computing is a way to compensate for the shortcomings of cloud computing. With the sharp increase in the number of terminal devices, cloud computing alone will not be able to meet future network and computing cost requirements. Fortunately, the development of embedded devices has enhanced the capability of edge computing to assist cloud computing with data processing. Edge computing offers the advantages of low latency, low bandwidth requirements, and low cost. However, compared with cloud devices, edge devices are much weaker in computing power and storage. Therefore, to obtain a lightweight model that meets the hardware constraints of front-end embedded devices, the neural network model needs to be made lightweight. For object detection tasks, there are currently three main ways to deploy models on edge embedded devices: lightweight detection model design, model pruning, and model quantization.
Lightweight Model Design: Lightweight detection model design applies more efficient convolution methods or more efficient structures to reduce the computing power required by the object detection model. Aiming at real-time detection of surface defects, Zhou [20] proposed a reusable and high-efficiency Inception-based MobileNet-SSD method for surface defect inspection in industrial environments. Zhang [21] proposed a lightweight object detection method based on MobileNetv2, the YOLOv4 algorithm, and attentional feature fusion for underwater object detection.
Model Pruning: Model pruning evaluates the importance of model weights according to a certain strategy and eliminates unimportant weights to reduce the amount of computation. To address the threat of drones intruding into high-security areas, Liu [22] pruned the convolutional channels and shortcut layers of YOLOv4 to obtain thinner and shallower models. Aiming at the low detection accuracy and inaccurate localization of lightweight networks in traffic sign recognition, Wang [23] proposed an improved lightweight traffic sign recognition algorithm based on YOLOv4-Tiny.
Model Quantization: The essence of model quantization is to convert floating-point operations into integer fixed-point operations, which can drastically reduce model storage requirements. Zhang [5] proposed a data-free quantization method for a CNN-based remote sensing detection model using 5-bit quantization. Guo [24] proposed a hybrid fixed-point/binary deep neural network design methodology for object detection to achieve low power consumption.
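For illustration, the following is a minimal sketch of post-training dynamic quantization in PyTorch. It is not the 5-bit scheme of [5] nor the hybrid design of [24]; it only shows the floating-point-to-integer conversion idea that model quantization relies on. The toy model and filename are illustrative assumptions.

```python
import torch
from torch import nn

# Toy model standing in for a detector sub-network (illustrative only).
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-training dynamic quantization: Linear weights are stored as int8
# and the corresponding matrix multiplications run in integer arithmetic.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is smaller on disk and cheaper to run on CPUs.
torch.save(quantized.state_dict(), "model_int8.pt")
```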

Edge-Cloud Cooperation
Edge-cloud cooperation refers to the integration of cloud computing and edge computing: edge computing provides users with low-latency, low-power services, while cloud computing is used to optimize the inference capabilities of edge computing. The edge-cloud collaboration approach has been applied in various fields. Wang [25] proposed a smart surface inspection system using the Faster R-CNN algorithm in a cloud-edge computing environment for automated surface inspection. Ye [25] used embedded devices to perform preliminary processing of collected data and then further analyzed the data in the cloud to monitor the health of urban pipelines. Xu [26] proposed an edge-cloud cooperation framework for real-time intelligent analysis of coal mine safety production surveillance video; in this framework, cloud computing processes non-real-time, global tasks, while the edge is responsible for real-time processing of local surveillance videos.

Lightweight Detection Model Edge-YOLO
As shown in Figure 2, the proposed lightweight detection model, Edge-YOLO, combines YOLOv5, ShuffleNetv2, and coordinate attention [2]. It is designed for edge embedded devices to detect objects in airport apron surveillance video. Edge-YOLO is built by simplifying the structure of YOLOv5, and the new network architecture reduces the number of parameters by more than 90%. YOLOv5 [3] is an end-to-end, region-proposal-free object detection method and a state-of-the-art one-stage detector. Its network structure is mainly divided into two parts: the backbone (feature extraction network) and the head (detection head). There are four versions of YOLOv5, namely YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x; their model size and accuracy increase in turn, and channel and layer control factors are used to select an appropriately sized model for the application scenario. To obtain a model that is easier to deploy on resource-limited embedded devices, we choose YOLOv5s, the version with the best real-time performance, as the baseline for optimization, and we simplify the structure of both the backbone and the head to improve its detection speed.
For the backbone, we use a lightweight image classification network for feature extraction. After comparing several lightweight classification networks, we chose ShuffleNetv2 [10], which has the fewest parameters and the best real-time performance, to replace the original backbone. ShuffleNetv2 [10] is an image classification network designed for mobile devices. It greatly reduces the computation of convolution operations by using grouped pointwise convolutions, and an inter-group information exchange mechanism called channel shuffle is used to fuse information between groups. Compared with the original YOLOv5s feature extraction network, ShuffleNetv2 has better speed performance on edge devices. Table 1 compares the parameter counts and computational complexity of the YOLOv5s backbone and ShuffleNetV2 [10].

For the detection head, the number of channels is cropped, since far fewer categories need to be detected in the airport apron scene than in general scenes: there are no more than 20 categories on the airport apron, while the COCO dataset has 80. Only a quarter of the channels are kept, which gives the best trade-off between accuracy loss and compression ratio. Table 2 compares the parameter counts and computational complexity of the YOLOv5s head and the Edge-YOLO head.

Reducing the computational complexity of the feature extraction network may weaken its feature extraction capability. To make full use of the extracted features, the coordinate attention [27] module is applied to them; it makes the model attend to the parts that have the most significant impact on the final result. As shown in Figure 2, the coordinate attention [27] module decomposes channel attention into two 1D feature encoding processes. Since these two processes aggregate features along different directions, long-range dependencies can be captured along one spatial direction while precise location information is preserved along the other. The generated feature maps are then separately encoded into a pair of direction-aware and position-sensitive attention maps, which enhance the representation of the objects of interest.
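The following PyTorch sketch illustrates the coordinate attention structure described above. It follows the published design of [27] in outline (pooling along each spatial axis, a shared 1×1 transform, then per-direction attention maps), but details such as the activation function and reduction ratio are simplifying assumptions, not the exact configuration used in Edge-YOLO.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Sketch of coordinate attention [27]: channel attention decomposed
    into two 1D encodings, one per spatial direction, so long-range
    dependencies along one axis are captured while positions along the
    other axis are preserved."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # aggregate along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # aggregate along height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)  # [27] uses a hard-swish variant
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                       # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)   # (n, c, w, 1)
        # Shared 1x1 transform over the concatenated directional encodings.
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w  # position-sensitive reweighting of features
```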

Edge Video Detection Acceleration Based on Motion Inference
Based on the characteristics of airport apron operations, two acceleration strategies, shown in Figure 3, are proposed to further accelerate edge detection and reduce the power consumption of edge devices.
The first strategy is to skip detection during idle periods. Figure 4 shows statistics of the working time and idle time of one apron. Apron operation is not continuous: there are long idle periods between jobs, totaling more than half of the overall time. Obviously, it is unnecessary to run detection during idle periods when no jobs are taking place. Therefore, it is important to determine whether the apron is currently in operation and skip detection during idle periods to save computing power and energy.
The video is divided into small segments with a duration of T (to avoid an excessive detection delay, T should take a small value; T takes 1 s, 2 s, 3 s, 4 s, and 5 s in the experiments). According to whether object positions change within the segment, a judger decides whether detection can be skipped (as shown in Figure 5). Algorithm 1 is used to determine whether the position of an object has changed in a segment. In Algorithm 1, the parameter T_area represents the pixel area of the smallest object to be detected, which can be determined according to the specific detection task, and the parameter T_pixel represents the minimum pixel change that counts as object movement; we recommend a value of 10 to 20.
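Algorithm 1 itself is not reproduced here; the following is a minimal sketch of such a judger built from the quantities defined above, assuming a simple difference between the first and last frames of the segment thresholded by T_pixel and T_area. The function and parameter names are illustrative.

```python
import cv2
import numpy as np

def segment_needs_detection(frames, t_pixel=15, t_area=400):
    """Decide whether a video segment needs detection.

    Compares the first and last frames of the segment: if any connected
    region of sufficiently changed pixels is at least as large as the
    smallest object of interest, the segment is considered in operation.
    """
    first = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    last = cv2.cvtColor(frames[-1], cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(first, last)
    # Pixels whose change exceeds T_pixel are marked as moving.
    _, mask = cv2.threshold(diff, t_pixel, 255, cv2.THRESH_BINARY)
    # Count connected changed regions; ignore regions smaller than T_area.
    num, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for label in range(1, num):  # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] >= t_area:
            return True  # an object at least as large as T_area moved
    return False  # idle segment: skip detection
```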
The second strategy employs motion inference to avoid frame-by-frame detection. Owing to the relevant civil aviation rules, objects on the airport apron usually move slowly and regularly, so the same objects appear in most neighboring frames of the video. This means that not every frame needs to be detected: detecting only a few frames, defined as key frames, may be enough to find all objects. The video is therefore divided into small segments, and the detection results for non-key frames are inferred from the detections on key frames according to the motion information. Two methods are used to extract the motion information of objects: matching based on IOU, and matching based on the improved T-frame dynamic and static separation method.

The effect of IOU-based matching is shown in Figure 6. For the detection results of two key frames, if the IOU of two objects is greater than 0.2 and less than 0.9, we regard them as the same object. The IOU of two objects A and B is given by Equation (1):

$$\mathrm{IOU}(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$
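For reference, a direct implementation of Equation (1) and the matching rule above might look as follows; boxes are assumed to be axis-aligned and given as (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def same_object(box_a, box_b):
    """Matching rule of the text: same object if 0.2 < IOU < 0.9."""
    return 0.2 < iou(box_a, box_b) < 0.9
```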
However, the IOU-based matching method sometimes misses relatively small objects on the airport apron, such as staff (as shown in Figure 7). In this case, the motion trajectory of the object may be incomplete. Thus, we propose the T-frame dynamic and static separation method, based on the three-frame difference method, as a supplement to handle this problem. The three-frame difference method is a well-known method for moving object detection: the difference between frame i and frame i − 1, and the difference between frame i + 1 and frame i, are used to obtain the motion mask layers $D_1$ and $D_2$, respectively.
The final motion mask layer D of frame i is calculated as $D_1$ AND $D_2$. The formulas are expressed as Equations (2)-(4):

$$D_1(x, y) = \begin{cases} 1, & |F_i(x, y) - F_{i-1}(x, y)| > T_{pixel} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

$$D_2(x, y) = \begin{cases} 1, & |F_{i+1}(x, y) - F_i(x, y)| > T_{pixel} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

$$D(x, y) = D_1(x, y) \wedge D_2(x, y) \tag{4}$$
In the above expressions, D(x, y) denotes the value of the motion mask layer at position (x, y), i.e., the pixel change at that point; $T_{pixel}$ denotes the minimum pixel change threshold above which motion is considered to have occurred; and $F_i(x, y)$ denotes the pixel value at position (x, y) in frame i.
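As a concrete reference, the following NumPy/OpenCV sketch implements Equations (2)-(4) directly; the grayscale conversion and the default threshold value are assumptions.

```python
import cv2
import numpy as np

def three_frame_difference(prev_f, cur_f, next_f, t_pixel=15):
    """Motion mask of the current frame via the three-frame difference
    method of Equations (2)-(4)."""
    g = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in (prev_f, cur_f, next_f)]
    d1 = cv2.absdiff(g[1], g[0]) > t_pixel   # Equation (2)
    d2 = cv2.absdiff(g[2], g[1]) > t_pixel   # Equation (3)
    return np.logical_and(d1, d2)            # Equation (4)
```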
The three-frame difference method does not perform well for objects with uniform internal pixels, and its effect depends heavily on the parameter $T_{pixel}$, which represents the minimum pixel change of the object movement. Thus, we propose the T-frame dynamic and static separation method to make up for these deficiencies. The static layer, serving as the dynamic background of these T frames, is obtained according to Algorithm 2; in Algorithm 2, we recommend setting $T_{pixel}$ in the range of 10 to 20. The motion layer is then obtained by subtracting the dynamic background from $f_0$, as expressed in Equation (5):

$$M(x, y) = |f_0(x, y) - B(x, y)| \tag{5}$$

where B denotes the static layer (dynamic background) obtained by Algorithm 2.
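Algorithm 2 is not reproduced in the text; the sketch below assumes a per-pixel temporal median over the T frames as the static layer B, which is one common way to build such a background, and then applies Equation (5).

```python
import numpy as np

def motion_layer(frames, t_pixel=15):
    """Motion layer of Equation (5) for a segment of T grayscale frames.
    A per-pixel temporal median is assumed here as the static layer B;
    the paper's Algorithm 2 may differ."""
    stack = np.stack([f.astype(np.float32) for f in frames])  # (T, H, W)
    background = np.median(stack, axis=0)        # assumed static layer B
    diff = np.abs(stack[0] - background)         # Equation (5): |f0 - B|
    return diff > t_pixel                        # binary motion layer
```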
The final motion information is inferred from the motion layer together with the IOU matching mechanism. As shown in Figure 8, if the value of the motion mask is greater than a certain threshold, the objects involved in this motion are regarded as the same object, and the detection results of the key frames can then be propagated to the non-key frames.

Cloud-Based Detection Results Verification Mechanism
Although the proposed edge detection greatly improves real-time performance, it inevitably decreases detection accuracy. However, through experimental analysis, we found that the accuracy drop is not uniform across the whole video but is concentrated in some busy segments. Based on this observation, we use a small amount of cloud computing power to re-detect these segments, improving overall detection accuracy with little impact on detection speed. Figure 9 shows the cloud calibration mechanism that improves the reliability of detection results.
For each group of edge detection results $[EDR_i, EDR_{i+T}]$ (EDR: Edge Detection Result), the cloud detection model DC (Cloud Detector) re-detects the group's intermediate frame $F_{center}$ ($F_{i+T/2}$), and $CDR_{center}$ (CDR: Cloud Detection Result) denotes the result of this re-detection. Since the edge result for $F_{center}$ is inferred from $EDR_i$, $EDR_{i+T}$, and the motion layer, it reflects the correctness of both the edge detection model and the motion layer. Thus, if the edge result for $F_{center}$ is consistent with $CDR_{center}$, the whole set of detection results is inferred to be correct. Otherwise, the result of DE (Edge Detector) or of DC is incorrect. We then let DC re-detect the first frame $F_i$ and the last frame $F_{i+T}$ of the group to obtain $CDR_i$ and $CDR_{i+T}$, and let the motion inference strategy proposed in Section 3.2 regenerate the detection results $[DR_i, DR_{i+T}]$ (DR: Detection Result) from $CDR_i$, $CDR_{i+T}$, and the motion layer. The set of detection results is inferred to be correct if the regenerated intermediate frame result $DR_{center}$ is consistent with $CDR_{center}$. Otherwise, the extraction of the motion layer is inferred to be incorrect, since the motion of the objects in the segment is not linear; in this case, DC detects this group of video frames one by one, and the obtained results $[CDR_i, CDR_{i+T}]$ are used as the final detection results of this video segment. Figure 10 shows the number of frames detected in the cloud for a T-frame segment in the different cases.
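The following Python sketch summarizes this three-level cascade. Here `cloud_detect`, `infer_motion`, and `consistent` stand in for the cloud detector DC, the motion inference of Section 3.2, and the result-consistency test; they are placeholders, not the paper's exact interfaces.

```python
def verify_segment(segment, edge_results, cloud_detect, infer_motion, consistent):
    """Multi-level cloud verification of one T-frame segment (Figure 9).

    segment      -- list of T video frames
    edge_results -- per-frame edge detection results for the segment
    """
    center = len(segment) // 2
    cdr_center = cloud_detect(segment[center])       # level 1: 1 cloud frame
    if consistent(edge_results[center], cdr_center):
        return edge_results                          # edge results accepted

    cdr_first = cloud_detect(segment[0])             # level 2: 2 more frames
    cdr_last = cloud_detect(segment[-1])
    dr = infer_motion(segment, cdr_first, cdr_last)  # re-run motion inference
    if consistent(dr[center], cdr_center):
        return dr                                    # corrected by re-inference

    # Level 3: motion layer unreliable; the cloud detects every frame.
    return [cloud_detect(f) for f in segment]
```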

Evaluation on Edge-YOLO
The proposed lightweight object detection model Edge-YOLO is evaluated on a set of surveillance images captured from surveillance videos of a real airport apron environment. The set contains 20,885 annotated frames of outdoor airport apron surveillance video under different lighting and weather conditions, covering 10 categories such as aircraft, people, and the various kinds of apron vehicles. The training set is made up of 80% of the frames, and the test set of the remaining 20%. Figure 11 shows some example frames from the set.

The experiments testing accuracy and inference speed and analyzing the size and computation of the model are run on a cloud server and on an edge device (Raspberry Pi 4B), respectively. Raspberry Pi is a series of small single-board computers (SBCs) developed in the United Kingdom by the Raspberry Pi Foundation in association with Broadcom; a detailed introduction is available at https://en.wikipedia.org/wiki/Raspberry_Pi (accessed on 7 September 2022). The details of the experiment platform are shown in Table 3.

The comparison results are shown in Tables 4 and 5. Compared with the baseline model YOLOv5s, the proposed lightweight model Edge-YOLO achieves considerable improvements in detection speed, model size, and computational complexity. The accuracy loss of Edge-YOLO is no more than 5%, and the loss is further reduced by Edge-YOLO-CA, which adds the coordinate attention [27] module. Compared with the large-scale network SSD [19], the accuracy and detection speed of our proposed model have obvious advantages in the airport apron scene. Compared with the lighter models MobileNet-SSD [20] and YOLOv4-tiny [23], our proposed model achieves a large improvement in detection speed with comparable accuracy. We also propose strategies, evaluated below, to further improve the detection speed of the model on airport apron operation videos and to make up for the loss of detection accuracy.

Evaluation on the Edge Detection Acceleration Strategies
According to the first acceleration strategy, the detection module does not work during idle periods. Thirty 5-min segments of non-operating apron monitoring video are used to validate the strategy in this experiment, and the acceleration for each segment is shown in Figure 12. The experimental results show that this strategy can effectively avoid idle detection; it fails only when detection is triggered by staff or vehicles passing through the operation area of the apron.
Another experiment validates acceleration strategy 2, using two apron operation videos from Obihiro Airport in Japan and Guiyang Airport in China. As shown in Table 6, the durations of the two videos are 49 min 28 s and 57 min 35 s, respectively. Since their frame rates are 30 and 25 FPS, respectively, there are 89,029 frames in the Obihiro Airport video and 86,395 frames in the Guiyang Airport video. Both videos record the whole process from the flight taxiing into the parking bay to departure.

Tables 7 and 8 show the accuracy and time cost of YOLOv5s, Edge-YOLO, and Edge-YOLO with the acceleration strategy on these two videos. The experiments show that the proposed acceleration strategy enables real-time detection on edge devices with weak computing power. Figure 13 shows the mAP and time cost of Edge-YOLO-CA with the acceleration strategy on the Raspberry Pi 4B. According to the curves in Figure 13, the mAP remains high until T exceeds 2 s, after which the mAP loss grows rapidly; therefore, the parameter T should be set to 2 s for these two videos. For other videos, the parameter T can be estimated by testing on a video segment, so the acceleration strategy is effective beyond the experiment videos.

Table 9 shows the experimental results verifying the effectiveness of the cloud-based detection results verification mechanism. In this experiment, Edge-YOLO-CA with the acceleration strategy generates detection results for the surveillance video on edge devices, while YOLOv5s deployed on cloud devices verifies and corrects the detection results using the mechanism described in Section 3.3. As shown in Table 9, by checking less than one-tenth of the frames, the edge detection results can be corrected to a level close to the cloud detection results.

Conclusions
In this article, a real-time object detection solution based on an edge-cloud system for airport apron operation surveillance video is proposed, which includes the lightweight detection model Edge-YOLO, an edge video detection acceleration strategy, and a cloud-based detection results verification mechanism. The lightweight detection model is based on YOLOv5s. By replacing the backbone with ShuffleNetv2, cropping the channels of the detection head, and adding coordinate attention, the model size of Edge-YOLO is reduced by more than 90% compared with YOLOv5s, greatly reducing the requirements on deployment equipment. Tested on a dataset of more than 20,000 real-world airport apron scenes, the detection speed of Edge-YOLO on the Raspberry Pi 4B is increased by 3.37 times, while its accuracy loss does not exceed 5%. The edge video detection acceleration strategy further accelerates Edge-YOLO on airport surveillance video through motion information inference: the video is segmented, IOU matching and the improved T-frame difference method extract the motion information of objects, and the detection results of Edge-YOLO on key frames are combined with this motion information to quickly generate the detection results of the entire segment. The cloud-based detection results verification mechanism uses a non-lightweight, higher-accuracy detection model to verify and correct the results generated at the edge, improving overall detection accuracy while keeping the proportion of cloud intervention low through a multi-level intervention mechanism. The feasibility of these components is verified by experiments on two fully annotated real airport apron surveillance videos containing the complete operation process. In the future, we will focus on designing more flexible edge-cloud cooperation strategies, strengthening the versatility of the method, and extending it to a wider range of video surveillance scenarios.