1. Introduction
As airports continue to grow in size, accidents occurring on airport surfaces have accounted for 90 percent of civil aviation accidents over the past 20 years [1]. The operational safety of airport surfaces is the core of civil aviation safety. However, owing to imperfect surveillance facilities, trespassing into restricted airport areas and unlawful interference with airport operations occur from time to time [2]. Therefore, real-time and robust surveillance of individual targets on the airport surface, especially around the fuselage, is essential. Existing airport surface surveillance technologies can be divided into two categories: cooperative and non-cooperative surveillance.
Cooperative surveillance systems require radio transponders to be installed on the objects under surveillance [3], so only equipped targets can be monitored and flexibility is poor. Non-cooperative surveillance techniques provide only the distance and location of the target [4], yielding limited information at higher cost. In contrast, video surveillance systems have several advantages, including low investment cost, a wide monitoring range, automation, and intelligent analysis [5].
Current video surveillance devices can usually only capture and store video, with insufficient support for more advanced functions and intelligent analysis [6]. Only after a hazardous event has occurred can the saved surveillance video be analyzed manually to understand its details and course. This approach is prone to delayed responses to sudden hazardous events and makes it difficult to obtain decision-making information in real time [7]. Therefore, there is an urgent need for an airport surface surveillance system that can autonomously analyze, in real time, the behavior of people moving on the airport surface from captured video sequences [8].
As shown in Figure 1, typical non-cooperative objects in airport scenes appear small in surveillance videos, which makes it difficult to capture their detailed characteristics accurately. At the same time, special vehicles and aircraft moving across the airfield can occlude them [9]. The first core task of this research is to build an intelligent surveillance framework that integrates modules for target detection, human keypoint detection, and behavioral classification to address these problems.
The non-cooperative object detection module builds on the collaborative processing of multiple telephoto auto-zoom lenses. The overall detection task can be viewed as a two-stage, coarse-to-fine process.
The first stage is regular surveillance. The surveillance camera remains in wide-angle mode to monitor the apron area in all directions, and the initial target detection covers all people within the surveillance area. When a suspicious person is detected entering the apron area, the surveillance camera autonomously switches to telephoto mode, narrowing the coverage of the surveillance area to pay close attention to the targets within the local region.
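As a rough illustration of this mode-switching rule, the following sketch shows one possible stage-one control loop; the Detection type, the mode names, and the switching condition are simplified assumptions rather than the system's actual interfaces.

```python
from dataclasses import dataclass

WIDE, TELE = "wide-angle", "telephoto"

@dataclass
class Detection:
    box: tuple          # (x, y, w, h) in image coordinates
    inside_apron: bool  # whether the detection falls inside the apron region

def next_camera_mode(mode: str, detections: list) -> str:
    """Stage-one switching rule: zoom in when a person enters the apron,
    zoom back out once the local surveillance area is clear again."""
    intruders = [d for d in detections if d.inside_apron]
    if mode == WIDE and intruders:
        return TELE  # hand the local region over to the stage-two pipeline
    if mode == TELE and not intruders:
        return WIDE  # resume all-direction surveillance of the apron
    return mode

# Example: a person detected inside the apron triggers the telephoto mode.
assert next_camera_mode(WIDE, [Detection((120, 80, 30, 60), True)]) == TELE
```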
In the second stage, the multi-scale non-cooperative object localization mode captures the behavioral details of these targets and mitigates the effects of factors such as lighting and target texture. This module uses deep learning [10] for precise target localization and behavioral recognition to extract the key features of the target. At the same time, it applies adaptive feature extraction and correction techniques to handle lighting and texture changes, ensuring an accurate analysis of target behavior.
Through this multi-stage surveillance and analysis process, the scale inconsistency of airport surface surveillance images caused by camera distance and angle can be overcome. Moreover, a multi-stage framework provides a comprehensive understanding of the behavioral details of non-cooperative objects, offering more reliable support for airport security management and risk identification. Notably, in the second stage, we use keypoints to construct a human skeleton that characterizes different behavioral categories. Unlike classification based on pixel features, the keypoint representation is more abstract and high-level, focusing on the pose and movement of the human body; it can ignore background information and concentrate on the motion of the body itself. The main contributions of this paper are summarized as follows:
(1) A coarse-to-fine two-stage framework, NC-ABLD, is proposed for locating and detecting abnormal behaviors of typical non-cooperative subjects on airport aprons, providing both the category of each abnormal behavior and the range of its duration in the time dimension.
(2) In the first stage, a dynamic target detection strategy is designed to mitigate the inconsistency of target scales caused by the varying distance between a person and the camera, which would otherwise reduce detection accuracy.
(3) In the second stage, based on the extracted features of non-cooperative objects, a behavioral recognition module and a duration-range localization strategy for abnormal behaviors are designed to solve the difficult problem of recognizing sudden abnormal behaviors of non-cooperative objects.
The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the main components of the proposed method. Section 4 reports the associated experiments. Section 5 summarizes the conclusions.
4. Experiments
This section presents the results of qualitative and quantitative experiments using the NC-ABLD framework on the IIAR-30 dataset.
4.1. Datasets
Datasets comprising training, validation, and test sets are essential for research on deep learning-based algorithms. However, to date, no public dataset is available for evaluating human abnormal behavior detection in airport scenes. Therefore, a dedicated dataset had to be constructed for abnormal behavior recognition on airport surfaces: IIAR-30, which is used to evaluate the performance of each module in the NC-ABLD framework.
The IIAR-30 dataset is derived from the surveillance videos of two civil airports and contains three sub-datasets: a target detection dataset, a human keypoint dataset, and a behavioral classification dataset. The target detection dataset is used to train the human body detector and evaluate its performance. The human keypoint dataset is used to train the human keypoint detection network and evaluate its performance. The software LabelMe 3.16.7 was used to annotate human targets and keypoints in the format of the COCO dataset. The details of each sub-dataset are shown in Table 1.
4.2. Experimental Details
4.2.1. Experimental Configuration
The platform used for the experiments was a desktop workstation with the following specifications: a 12th Gen Intel(R) Core(TM) i9-12900K CPU, 64 GB of RAM, and a GA102 [GeForce RTX 3090 Ti] GPU with 24 GB of video memory, running Ubuntu 20.04. The programming language was Python 3.9.12, and all experimental environments used the PyTorch deep learning framework with CUDA 11.1 for GPU-accelerated computation.
4.2.2. Evaluation Indicators
Mean average precision (mAP) is used to measure localization accuracy on a dataset containing multiple categories. It computes the average precision (AP) for each category and then averages these AP values. AP is calculated as the area under the precision-recall (PR) curve, which plots recall on the x-axis and precision on the y-axis; a higher AP value indicates better detector performance. Precision measures the ratio of correctly detected airport field targets to all targets reported by the detector. Recall measures the ratio of correctly detected airport field targets to all ground-truth targets. The calculation of both Precision and Recall is shown in Formula (8).
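In standard notation, with TP, FP, and FN denoting true positives, false positives, and false negatives, Formula (8) takes the usual form:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```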
The mean average precision based on object keypoint similarity (OKS) is chosen as the criterion for judging the accuracy of human keypoint detection. OKS measures network performance by computing the similarity between the predicted skeleton points and their ground-truth locations under different scaling ranges. It is calculated as shown in Formula (9), where k_i and d_i represent the category constant and the Euclidean distance of the i-th skeleton point, respectively, and s is the scale factor of the target.
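In the standard COCO formulation consistent with these symbols, with v_i the visibility flag of the i-th keypoint and δ the indicator function, Formula (9) takes the usual form:

```latex
\mathrm{OKS} = \frac{\sum_i \exp\!\left(-\,d_i^2 / (2 s^2 k_i^2)\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}
```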
4.2.3. Parameter Settings
The dataset samples are divided into training, validation, and test sets in a 7:2:1 ratio, and the same computing platform is used for all comparison experiments.
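For reproducibility, a minimal sketch of such a split is shown below; the fixed seed and the helper function are illustrative, not the original tooling.

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle and split samples into train/val/test at a 7:2:1 ratio,
    matching the partition used in the experiments."""
    rng = random.Random(seed)
    samples = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train, val, test = split_dataset(list(range(1000)))
assert (len(train), len(val), len(test)) == (700, 200, 100)
```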
The multi-scale non-cooperative object localization module uses a training batch size of 16 and is trained for 200 epochs. Training employs a stochastic gradient descent (SGD) optimizer with an initial learning rate of 1 × 10⁻². The learning rate is decayed using the cosine annealing method with a decay coefficient of 5 × 10⁻⁴.
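In PyTorch, this configuration corresponds roughly to the following sketch; interpreting the 5 × 10⁻⁴ coefficient as the optimizer's weight decay and the momentum value of 0.9 are our assumptions, as the text does not state them explicitly.

```python
import torch

model = torch.nn.Linear(256, 4)  # stand-in for the localization network

# SGD with the reported initial learning rate; weight decay of 5e-4 is
# one common reading of the "decay coefficient" named in the text.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# Cosine annealing over the full 200-epoch schedule.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... iterate over batches of size 16, compute loss, backpropagate ...
    optimizer.step()   # placeholder for the actual per-batch update
    scheduler.step()   # one cosine-annealed learning-rate step per epoch
```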
The human keypoint detection module applies image augmentation techniques such as mirror flipping and scale jittering to the samples during training. Pre-trained weights from the COCO dataset are utilized, and training runs for 200 epochs.
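A minimal torchvision sketch of the two named augmentations follows; the flip probability and scale range are assumptions, and note that for keypoint labels a horizontal flip must also swap left/right joint indices, which is omitted here.

```python
import torchvision.transforms as T

# Mirror flipping and scale jitter, the two augmentations named above.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                  # mirror flipping
    T.RandomAffine(degrees=0, scale=(0.75, 1.25)),  # scale jitter
    T.ToTensor(),
])
```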
In the behavioral classification module, each video is sampled into 30-frame windows of temporal skeleton point data, nudged forward frame by frame; this frame-by-frame nudged sampling serves as a form of dataset augmentation. The training process uses the Adam optimizer with a learning rate of 1 × 10⁻³ and a batch size of 16, and runs for 30 epochs. These are the best parameters we obtained after extensive experiments; when the framework is applied to a different scenario, they can be used for pre-training and then fine-tuned according to the user's own evaluation results.
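The frame-by-frame nudged sampling can be read as a stride-1 sliding window over the skeleton sequence; a minimal sketch (the array layout is an assumption) is shown below.

```python
import numpy as np

def sliding_windows(skeleton_seq: np.ndarray, win: int = 30, stride: int = 1):
    """Every 30-frame window of a (T, K, 2) skeleton sequence becomes one
    training sample; nudging the window forward one frame at a time
    multiplies the samples per video, acting as dataset augmentation."""
    T = skeleton_seq.shape[0]
    return [skeleton_seq[t:t + win] for t in range(0, T - win + 1, stride)]

# Example: a 100-frame sequence of 13 keypoints yields 71 windows.
seq = np.zeros((100, 13, 2))
assert len(sliding_windows(seq)) == 71
```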
4.3. Quantitative Comparison with Other Methods
We trained the current mainstream methods listed in Table 2 on the six behaviors in IIAR-30 (standing, walking, running, climbing, touch paddle, squatting) and counted the parameters of each model. We evaluated metrics such as GFLOPs and FPS on the GPU-3090ti (NVIDIA, Santa Clara, CA, USA) and Jetson-Orin-AGX (NVIDIA, Santa Clara, CA, USA) devices. The NC-ABLD framework achieves 67 FPS on the GPU-3090ti and 15 FPS on the edge-side Jetson-Orin-AGX despite a high computational complexity of 98.7 GFLOPs. The code is available at https://github.com/3083156185/NC-ABLD.git (accessed on 12 July 2024).
The performance of SSD, Fast-RCNN, YOLOv3, YOLOv5-s, YOLOX-s, and the improved model used in this paper is compared using the average precision (AP) and average recall (AR) metrics introduced in the previous section. The comparison of the six models is shown in Table 3.
As observed from the data in Table 3, the YOLOX-s target detection model is more robust than the other popular target detection networks, adapting better to the complex backgrounds of airport aprons and to targets with large scale variations. The improved model achieved a 1.7% increase in AP over the baseline model, and recall improved by 1.2%.
Based on the final predictions of the human keypoint detection, a coordinate error analysis was conducted; the results are presented in Table 4, which includes the minimum, maximum, and average error for the 13 skeleton points. The error values are the absolute pixel Euclidean distances between the ground-truth and predicted values. As Table 4 shows, the predicted keypoint locations exhibit low error relative to the ground truth.
When recording abnormal behavior, we push the detected time node backward by the length of the sliding window and take the resulting time node as the final prediction. We use the error between the predicted and ground-truth time nodes as the evaluation index; the six sets of experimental results are shown in Table 5.
The experimental results show that the average error of the temporal node localization of abnormal behavior in the field was 0.4 s, which is almost equivalent to the ability of the human eye to identify abnormal behavior.
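The push-back rule itself is simple arithmetic; the sketch below assumes a 25 fps stream (the frame rate is not stated here) purely for converting frames to seconds.

```python
FPS = 25.0             # assumed frame rate for illustration
WINDOW_SEC = 30 / FPS  # 30-frame sliding window

def locate_onset(detected_sec: float, window_sec: float = WINDOW_SEC) -> float:
    """A window covering frames [t - win, t] first fires at its right edge,
    so the estimated onset is pushed back to the window's left edge."""
    return detected_sec - window_sec

# The evaluation index is |predicted onset - annotated onset| in seconds.
error = abs(locate_onset(12.0) - 10.9)
```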
4.4. Ablation Study
4.4.1. Comparison of the Results of Ablation Experiments with Improved YOLOX-s
In order to verify the improvements contributed by the mosaic data augmentation strategy, the attention mechanism module, and the loss function optimization to the benchmark network, ablation comparison experiments were carried out on the constructed airport field target detection dataset. The experimental results are shown in Table 6.
Comparing the results of each method in Table 2 with those of each configuration in Table 6 makes the contribution of each component clear.
As shown in Figure 11, when only the mosaic data augmentation strategy is used, the AP value grows from 85.2% to 85.8% and the recall from 88.2% to 88.6%. When only the attention module is added, the AP value grows from 85.2% to 85.6% and the recall from 88.2% to 88.5%. When only the confidence loss function is optimized, the AP value grows from 85.2% to 85.8% and the recall from 88.2% to 88.7%, indicating that all three modules improve detection precision and recall.
4.4.2. Type of Behavior Predicted
The experiment yielded the confusion matrix shown in Figure 12. The recognition results indicate that standing, walking, and climbing, which possess distinctive skeleton features, are recognized with 100% accuracy. However, because standing and touch paddle actions often occur simultaneously, there is a small misclassification rate between them. Similarly, the initial stages of squatting and running are often mistaken for standing and walking, respectively. As a result, 3% of squatting actions are recognized as standing, 1% of running actions as walking, and 1% of touch paddle actions as standing.
Overall, the adopted spatiotemporal graph convolutional network is highly accurate when applied to skeleton-based human behavior recognition in the field, which meets the requirements for adequate recognition of personnel behavior in real-field apron surveillance.
Figure 12. Predicted results for behavioral categories.
4.5. Visualization of Experimental Results
To compare the prediction error more intuitively and further validate the robustness of the human keypoint detection module, the visualization results of the actual skeleton points with and without occlusion are shown in Figure 13; because the original images are large, only the region containing the human target is cropped. The green skeleton points indicate ground-truth values, and the red skeleton points indicate predicted values. The comparison shows that the human keypoint detection model can still accurately locate each keypoint in scenes with occlusions. As shown in Figure 14, comparing Figure 14b,c reveals that the improvement to YOLOX-s reduces missed detections caused by occlusions.
Six abnormal behavior localization experiments were conducted, involving both single-person and multi-person scenes. Figure 15 presents the experimental results for one group of single-person and multi-person instances at specific time nodes in the time series. The time series is divided into frames; a blue circle indicates normal behavior of the corresponding target in a given frame, and a red circle denotes the period in which abnormal behavior occurs in the video sequence.
To visually present the detection effect after joint inference of the models, the visualization includes human bounding boxes, tracked identity IDs, human skeleton maps, and the classification results of the temporal behaviors.
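A compact OpenCV sketch of how such an overlay can be composed is given below; the limb list is truncated and illustrative, and the colors and layout are assumptions rather than the original rendering code.

```python
import cv2

# A few illustrative limb pairs over 13 keypoints (indices 0-12).
LIMBS = [(0, 1), (0, 2), (1, 3), (2, 4), (5, 6), (5, 7), (6, 8)]

def draw_result(frame, box, track_id, keypoints, behavior):
    """Draw one target: bounding box, tracked ID with behavior label,
    and a skeleton built from integer (x, y) keypoint coordinates."""
    x, y, w, h = box
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(frame, f"ID {track_id}: {behavior}", (x, y - 6),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    for a, b in LIMBS:
        cv2.line(frame, tuple(keypoints[a]), tuple(keypoints[b]),
                 (0, 0, 255), 2)
    return frame
```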
The temporal action detection method employed in this paper uses a sliding window with a length of 30 frames. Behavioral classification is not performed when fewer than 30 image frames contain the target, so the results of each experiment shown here start from the 30th frame.
Taking the single-person case as an example, the temporal behavior category of person 1 is output from the first frame of the experiment. In the 80th frame, the person is detected climbing, an abnormal behavior, which continues until the 243rd frame, when the behavior returns to normal. In the 404th frame, the person is detected touching the paddle, another abnormal behavior, which persists until the 439th frame, when the behavior again returns to normal.
5. Conclusions and Future Works
In this paper, we propose a new scheme for the autonomous monitoring of non-cooperative object behavior in civil aviation airport scenarios, which provides new ideas for applying abnormal behavior detection algorithms in specific industrial areas. The NC-ABLD framework for localizing and detecting abnormal behaviors of typical non-cooperative targets in airport areas was developed by designing three modular networks ranging from coarse to fine grained. In addition, the IIAR-30 airport video dataset is proposed, on which the overall performance of NC-ABLD is validated, providing a dataset contribution for subsequent studies targeting specific industrial areas. The NC-ABLD framework achieves 67 FPS on the GPU-3090ti and 15 FPS on the edge-side Jetson-Orin-AGX despite a high computational complexity of 98.7 GFLOPs. As Table 5 shows, the average error of the temporal node localization of abnormal behavior in the field is 0.4 s, almost equivalent to the ability of the human eye to detect abnormal behavior.
Although the framework performs well in light-occlusion situations, some limitations remain. For example, when the human body is heavily occluded or the shape of the occluder resembles that of the target, the model may fail to recognize the target or produce incorrect detections. In addition, the model performs poorly in low-light environments. In future work, we will explore the use of infrared cameras to improve the system's performance in low-visibility conditions, thereby increasing its robustness and reliability across a variety of operational situations.