Robust Airport Surface Object Detection Based on Graph Neural Network

Abstract: Accurate and robust object detection is of critical importance in airport surface surveillance to ensure the security of air transportation systems. Owing to the constraints imposed by a relatively fixed receptive field, existing airport surface detection methods have not yet achieved substantial advancements in accuracy. Furthermore, these methods are vulnerable to adversarial attacks with carefully crafted adversarial inputs. To address these challenges, we propose the Vision GNN-Edge (ViGE) block, an enhanced block derived from the Vision GNN (ViG). ViGE defines the receptive field in pixel space and represents the spatial relation between pixels directly. Moreover, we implement an adversarial training strategy with augmented training samples generated by adversarial perturbation. Empirical evaluations on the public remote sensing dataset LEVIR and a manually collected airport surface dataset show that (1) our proposed method surpasses the original model in precision and robustness, and (2) defining the receptive field in pixel space performs better than defining it in representation space.


Introduction
Airport surface object detection is a crucial technology for maintaining the security and efficiency of airport operations by analyzing real-time data streams, including video and radar. Specifically, airport surface video plays an important role in detecting and tracking planes automatically and localizing ground crews and vehicles. Leveraging airport surface object detection technologies, an airport administrator can surveil the airport efficiently, reducing the reliance on manual monitoring [1].
With the development of artificial intelligence techniques, a number of deep learning-based methods have been proposed to address the object detection problem, with advancements in accuracy and speed. Existing object detection algorithms can be mainly divided into two categories: one-stage and two-stage approaches. One-stage methods, such as SSD [2] and the YOLO series [3][4][5][6], offer higher detection speed, which makes them more suitable for real-time detection; they are the focus of this work. Two-stage methods, such as Faster RCNN [7] and R-FCN [8], provide greater accuracy at the cost of reduced speed.
Additionally, from the view of the neural network architecture, these object detection methods can be divided into convolution-based [7] and transformer-based [9] frameworks. The convolution-based method extracts image features by leveraging convolutional operations with learnable convolutional kernels. Due to the square-like kernel shape, this method is better suited to integrating information from a square-like receptive field (neighborhood). To go beyond the square-like receptive field, Dai et al. proposed deformable convolution [10] and deformable kernel [11], which explore different shapes of the receptive field. However, these deformable methods are not capable of representing spatial relations between two pixels (nodes). The transformer-based method [9] collects information from the entire feature map through the self-attention mechanism, i.e., a patch's neighborhood is designed to contain all patches, resulting in a receptive field covering the entire feature map. Though they can reach great performance, these methods suffer from substantial computational costs.
Apart from these two methods, object detection based on graph neural networks (GNN) has recently been proposed. GNN is traditionally used for non-Euclidean data with inherent graph structures, such as a molecular graph with its atoms and chemical bonds corresponding to nodes and edges, respectively. GNN was first introduced in [12,13], and the graph convolutional network [14] was proposed based on spectral graph theory [15]. In recent years, a series of GNNs were introduced to deal with graph data efficiently, such as the mixture model network [16], the message passing neural network [17], and the graph attention network [18].
The usages of GNN for computer vision mainly involve scene graph construction [19] and point cloud classification [20], because these two problems contain naturally constructed graphs. When GNN is introduced to image feature extraction, the first and most important problem is how to construct a graph for an image or a feature map. For a pixel in the feature map, Vision GNN (ViG) [21] defines its neighborhood (receptive field) as its k-nearest neighbors in representation space, which achieves great performance. However, as there is no naturally constructed graph structure among pixels, it is important to explore which way of constructing a graph on pixels works better when introducing GNN into image feature extraction.
Recently, many researchers have begun to study how to apply object detection in airport scenes. For airport surface object detection, Zhu et al. [1] proposed an attentional multiscale feature fusion enhancement network based on CNN. Li et al. [22] applied Tiny-YOLO to airport aircraft object recognition and proposed a simplified Tiny-YOLO framework to improve the detection speed. Guo et al. applied the YOLO v3 [23] algorithm to aircraft detection on the airport surface and proposed several improvements. Han [24] introduced a unified and effective moving object detection architecture, which combined appearance and motion cues. As far as we know, no research has explored the potential of GNN-based methods in this domain.
For airport surface object detection, the adversarial attack is considered a major challenge for model robustness and has been studied by more and more researchers. Adversarial attacks, whether arbitrarily caused or artificially crafted, are mainly performed by adding perturbation to the input image. Arbitrary perturbation results from weather changes, transmission failure, camera failure, and so on. This kind of perturbation can be detected by human eyes and resolved immediately. In contrast, artificial perturbation is implemented by malicious hackers who create imperceptible perturbation on an image. A wrong prediction in airport surveillance caused by artificial perturbation may lead to a serious safety problem. It is therefore necessary to improve model robustness against adversarial attacks in our scenario.
Gradient-based adversarial attack methods, e.g., FGSM [25] and PGD [26], have been well studied in recent years, as have black-box attack methods [27]. Meanwhile, researchers have proposed several methods to tackle the problem of model robustness. Adversarial training methods, such as AdvProp [28] for classification and Det-AdvProp [29] for detection, create adversarial samples and add them to the training process to improve the model's robustness. RobustDet [30] modifies the model framework to defend against attacks.
In this paper, we propose a Vision GNN-Edge (ViGE) block that is integrated into the backbone of a one-stage detector (i.e., YOLOv5s) for airport surface surveillance. We explore different receptive field definitions and implement adversarial training. Our method aims to improve the detection accuracy and robustness of the GNN-based backbone; the block can be applied to various backbones and other areas.
Our contributions are mainly summarized in three aspects:
• We propose a novel GNN-based block named ViGE and plug it into the backbone of a one-stage detector (i.e., YOLOv5s) to improve the performance of object detection.
• We explore several receptive field (neighbor) definition approaches and find empirically that defining the receptive field in pixel space is better than defining it in representation space.
• We apply an adversarial training framework to improve model robustness against adversarial attacks for airport surface object detection.

Materials and Methods
In this section, we first briefly introduce and analyze the original Vision GNN block. Subsequently, we focus on the improvements of Vision GNN-Edge in the second part, including the graph construction and the message-passing neural network architecture. Finally, we present the adversarial training strategy in the third part.

ViG Block
Vision GNN (ViG) [21] is a pioneering work that first introduces GNN into image feature extraction. Several ViG blocks are stacked as a backbone, and each ViG block takes a feature map F ∈ R^{H×W×C} as input and outputs an updated feature map F' ∈ R^{H×W×C}. There are three basic modules in the ViG block: two convolutional neural networks (CNN) and a message-passing neural network (MPNN) (Figure 1). Firstly, in the CNN module, the ViG block applies a convolutional operation to the feature map to capture local feature information. Then, in the MPNN module, the ViG block applies a convolutional operation to the feature map before the message-passing layer. The feature map is treated as a graph in the message-passing layer, where message passing is performed between nodes. After that, the feature map goes through the following CNN module. Before output, the ViG block adds the original feature map F to the updated feature map F' as a residual connection to overcome the gradient vanishing and over-smoothing caused by the graph neural network. We describe the details of the message-passing layer in the second part.
As there is no naturally defined graph structure for a feature map, we can define any graph structure with an inductive bias. Specifically, ViG defines the graph structure in the representation space and achieves competitive performance. After graph construction, GNN performs message passing among graph nodes, aggregating information from the receptive field, i.e., the set of neighboring nodes. Despite its innovative approach, ViG has yet to fully leverage the potential of GNN, for example by defining edge features or investigating alternative approaches to neighbor definition.

ViGE Block
Our ViGE block is constructed based on the ViG framework with three principal enhancements: graph construction, message passing, and model usage. First, in the graph construction phase, while ViG defines the k-nearest neighbors based on feature proximity in the representation space and leverages position embedding to represent the position information, it overlooks the edge feature. In contrast, our ViGE defines neighbors in the pixel space and employs carefully designed edge features to represent the spatial relation between two nodes more directly. After graph construction, we design a novel message-passing operation that leverages edge features to gather information from neighbors. Different from ViG, which stacks several ViG blocks as a backbone, our approach integrates a single ViGE block within the backbone of a one-stage detector, such as YOLOv5, to assess its compatibility with mainstream CNN architectures. The following sections explain these differences in detail.
Graph construction. Considering a feature map F with dimensions H × W × C, where H, W, and C represent height, width, and channel, respectively, we segment it into H × W nodes, each representing a pixel, thereby transforming the feature map into a set of node features V = {v_1, ..., v_HW} with v_i ∈ R^C. The proximity between any two distinct nodes is determined by their Euclidean distance in the pixel space, representing the spatial separation of the corresponding pixels in the feature map. The closer two nodes are in pixel space, the more likely they are neighbors in the graph. Based on the pixel Euclidean distance between nodes, we associate each node v_i with its k-nearest neighbors N(v_i) and add an edge e_ji ∈ E from node v_j to v_i, where v_j ∈ N(v_i). Furthermore, we employ the dilation number to integrate information from different distances. With the candidate nodes sorted by their distance to v_i, one node is selected as a neighbor at every increment of the dilation number in the sorted sequence, which ensures a systematic and sparse selection of neighboring nodes based on spatial proximity. A diagram of neighbor definition in pixel space is shown in Figure 2. In our final setting, k is set to 12 and the dilation number is set to 3.
After defining neighbors, we construct the edge feature e_ji according to the distance and direction between two nodes. In this definition, x_i ∈ R^2 denotes the Cartesian coordinate vector of node v_i in feature map F; ∥·∥_2 denotes the L2 norm, which is used to calculate the Euclidean distance; Concat(·, ·) denotes the concatenation operation; and RBF_ji ∈ R is a radial basis function scalar between nodes v_j and v_i, computed from their distance in pixel space.
By transforming the feature map F into a graph G(F) = (V, E), we enable the application of graph neural network (GNN) methods for effective feature extraction. It is important to note that the vanilla ViG block does not define edge features as described above but adds a position embedding to each node feature v_i to encode positional information. The original position embedding is sinusoidal and periodic, and it does not directly represent the spatial relation between two nodes.
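The neighbor selection and edge feature described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `pixel_knn` and `edge_feature` are hypothetical, the dilated selection keeps every `dilation`-th entry of the distance-sorted candidate list starting from the nearest one, and the Gaussian form of the radial basis function with width `sigma` is an assumed choice, since the text only states that the RBF scalar is derived from the pixel distance. The brute-force sort is quadratic in the number of nodes and is meant for small grids only.

```python
import math


def pixel_knn(height, width, k, dilation=1):
    """Return (coords, neighbors) for a height x width grid of nodes.

    neighbors[i] holds up to k node indices for node i: candidates are sorted
    by Euclidean distance in pixel space and every `dilation`-th one is kept.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    neighbors = []
    for i, (ri, ci) in enumerate(coords):
        # Sort all other nodes by squared pixel distance (ties broken by index).
        order = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: ((coords[j][0] - ri) ** 2 + (coords[j][1] - ci) ** 2, j),
        )
        # Assumed dilation rule: keep every `dilation`-th candidate, then the first k.
        neighbors.append(order[::dilation][:k])
    return coords, neighbors


def edge_feature(x_j, x_i, sigma=1.0):
    """Edge feature for the edge j -> i: unit direction concatenated with an RBF scalar.

    The Gaussian RBF and its width `sigma` are assumptions made for illustration.
    """
    d0, d1 = x_j[0] - x_i[0], x_j[1] - x_i[1]
    dist = math.hypot(d0, d1)
    direction = (d0 / dist, d1 / dist)  # distinct nodes, so dist > 0
    rbf = math.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    return (direction[0], direction[1], rbf)


coords, neighbors = pixel_knn(5, 5, k=4, dilation=1)
center = 2 * 5 + 2  # node at row 2, column 2
print(neighbors[center])
print(edge_feature(coords[neighbors[center][0]], coords[center]))
```

With a dilation number larger than 1, the same call skips intermediate candidates, so the selected neighbors spread over a wider area of the feature map for the same k.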
Message passing. Having defined a graph with nodes, edges, and corresponding features, we can perform message passing to update the node representations. Given the set of node features on the l-th message-passing layer, the update for a node v_i is expressed with the following notation: agg_i^(l) denotes the information aggregated from the neighbor nodes; v_i^(l) denotes the feature of node v_i on the l-th layer, with v_i^(0) equal to the original feature v_i before message passing; e_ji represents the edge feature between two nodes; and Conv_1(·) and Conv_2(·) denote convolutional operations with different parameters. To stabilize the training process, the parameters of each of the two convolution layers are shared across the different message-passing layers. After L message-passing layers, we obtain the updated node features {v_i^(L)}, i = 1, ..., HW, which form a matrix in R^{HW×C}.
Different from convolution and transformer operations, the ViGE block aggregates information from each node's k-nearest neighbors defined in pixel space. The convolution method aggregates information from each node's surrounding neighbors in pixel space, and the transformer method aggregates information from all nodes through the self-attention mechanism. Treating the feature map as a graph, we illustrate the neighbor definitions of convolution and transformer in Figure 3, where the red patch denotes the center node and the blue patches denote the neighbor nodes.
After building the ViGE block, we add it into the YOLOv5s backbone to gather information from the defined neighbors (Figure 4). YOLOv5 is typically separated into three parts: backbone, neck, and detection head. We follow vanilla YOLOv5 and implement the backbone network with CSPDarknet [4], which employs Cross-Stage Partial (CSP) and CSP Bottleneck with a Lite head (CBL) for information integration. For simplicity, CSP1_X denotes a single CSP block with X residual units. The single ViGE block is integrated after the second CSP block to gather information from the feature map generated by the preceding convolution layers.
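As a rough illustration of the message-passing update described in this section, the sketch below performs one layer on plain Python lists. It rests on several assumptions that the text does not confirm: each message is taken to be Conv_1 applied to the concatenation of a neighbor feature and its edge feature, messages are aggregated with an element-wise maximum, and the node update is Conv_2 applied to the concatenation of the node feature and the aggregate. The helper `linear` stands in for a 1 × 1 convolution, and all function and parameter names are hypothetical.

```python
def linear(weight, x):
    """Apply a weight matrix (list of rows) to a vector; stands in for a 1x1 convolution."""
    return [sum(w * value for w, value in zip(row, x)) for row in weight]


def message_passing_layer(features, neighbors, edge_feats, w_msg, w_upd):
    """One assumed message-passing layer.

    features[i]      : list of C floats for node i
    neighbors[i]     : list of neighbor indices j with an edge j -> i
    edge_feats[(j, i)]: list of E floats for the edge j -> i
    w_msg            : C x (C + E) matrix (plays the role of Conv_1)
    w_upd            : C x 2C matrix (plays the role of Conv_2)
    """
    updated = []
    for i, v_i in enumerate(features):
        # Message from each neighbor: transform [neighbor feature, edge feature].
        messages = [linear(w_msg, features[j] + edge_feats[(j, i)]) for j in neighbors[i]]
        # Assumed aggregation: element-wise maximum over the neighbors.
        agg = [max(column) for column in zip(*messages)] if messages else [0.0] * len(v_i)
        # Node update: transform [own feature, aggregated message].
        updated.append(linear(w_upd, v_i + agg))
    return updated


# Tiny two-node example with C = 1 feature channel and E = 1 edge channel.
features = [[1.0], [2.0]]
neighbors = [[1], [0]]
edge_feats = {(1, 0): [0.5], (0, 1): [0.5]}
print(message_passing_layer(features, neighbors, edge_feats, [[1.0, 1.0]], [[1.0, 1.0]]))
```

Reusing the same `w_msg` and `w_upd` when the layer is applied repeatedly mirrors the parameter sharing across message-passing layers mentioned in the text.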

Adversarial Training
Adversarial sample generation. Given a clean image I and an object detector D_θ, we generate an adversarial perturbation based on the gradient derived from the loss function [29]:

$$\hat{I} = B\left(I + \epsilon \cdot \mathrm{sign}\left(\nabla_{I} L\left(D_{\theta}(I), GT\right)\right)\right) \quad (1)$$

where Î denotes the adversarial sample derived from the clean image I; B(·) is a boundary function that restricts the amplitude of the perturbation such that ∥Î − I∥_∞ ≤ ϵ; ϵ is the step size of the adversarial perturbation; sign(·) is the signum function, with sign(x) = 1 when x ≥ 0 and sign(x) = −1 otherwise; L(·) is a loss function of the object detector in the training phase, e.g., classification loss, location loss, and so on; D_θ(I) is the prediction of the object detector for the clean image I; θ denotes the detector's parameters; and GT denotes the ground truth of image I, i.e., the classification labels and ground truth bounding boxes.
According to Equation (1), we first apply the model D_θ to the clean image I and calculate its loss with respect to the ground truth GT. Then, we calculate the gradient of the loss L w.r.t. the clean image I and perform one step of gradient ascent on image I with step size ϵ. Adding this adversarial perturbation may cause the model to produce wrong detection results. If the step size ϵ is small enough, the perturbation cannot be perceived by human eyes. Moreover, we can iterate Equation (1) several times to obtain adversarial samples with different numbers of adversarial iterations.
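The sign-gradient ascent step and its iteration can be sketched on a toy problem as follows. This is a minimal sketch under stated assumptions rather than the attack code used in the experiments: the image is a flat list of pixel values in [0, 1], the detector loss is replaced by a toy quadratic loss with an analytic gradient, and the per-iteration step size `step` is kept separate from the perturbation bound `eps`, which is a common convention in iterative attacks but an assumption here. The names `adversarial_sample`, `toy_loss`, and `toy_grad` are hypothetical.

```python
def sign(x):
    """Signum as defined in the text: 1 when x >= 0, otherwise -1."""
    return 1.0 if x >= 0 else -1.0


def adversarial_sample(image, grad_fn, eps, step, iterations=1):
    """Iterated sign-gradient ascent on the input, kept within eps of the clean image.

    grad_fn(image) must return the gradient of the loss w.r.t. each pixel.
    """
    adv = list(image)
    for _ in range(iterations):
        grad = grad_fn(adv)
        # Gradient ascent on the input: move each pixel to increase the loss.
        adv = [a + step * sign(g) for a, g in zip(adv, grad)]
        # Boundary function: stay within eps of the clean pixel and inside [0, 1].
        adv = [min(max(a, p - eps, 0.0), p + eps, 1.0) for a, p in zip(adv, image)]
    return adv


# Toy stand-in for a detector loss: squared distance to a fixed target.
target = [0.5, 0.5]


def toy_loss(image):
    return sum((x - t) ** 2 for x, t in zip(image, target))


def toy_grad(image):
    return [2.0 * (x - t) for x, t in zip(image, target)]


clean = [0.2, 0.9]
adv = adversarial_sample(clean, toy_grad, eps=0.1, step=0.05, iterations=3)
print(toy_loss(clean), toy_loss(adv))
```

With a real detector, `grad_fn` would be obtained by backpropagating the chosen loss (for example the location loss) to the input image instead of using an analytic gradient.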
Contrary to adversarial sample generation, the model update is based on gradient descent, as given in Equation (2), where θ′ denotes the updated parameters. Equation (2) calculates the gradient w.r.t. the model parameters θ and aims to adjust them to decrease the loss value. In contrast, Equation (1) adds an imperceptible perturbation to the image to increase the loss value, i.e., to cause a wrong detection.
Adversarial training. For a clean image I, we can generate many adversarial samples by attacking different losses or attacking several times. After calculating the loss of image I, we perform an adversarial attack on it and obtain adversarial samples. Then, we apply the same model to these adversarial images and calculate the loss with the same ground truth as that of I to form the total training objective, in which λ_adv is the weight that balances the gradients calculated from the clean image and the adversarial samples and is set to 0.2. The pseudo-code of adversarial training is given in Algorithm 1.
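One adversarial training step can be sketched on a one-parameter toy model as follows. The sketch assumes that the total objective is the clean loss plus λ_adv times the loss on a single adversarial sample, which matches the description of λ_adv as a balancing weight but is not confirmed as the exact form used in the paper. The model `y_hat = w * x`, the learning rate `lr`, and the function names are hypothetical stand-ins for the detector and its optimizer.

```python
def loss_and_grads(w, x, y):
    """Squared error of the toy model y_hat = w * x.

    Returns (loss, dloss/dw, dloss/dx).
    """
    err = w * x - y
    return err * err, 2.0 * err * x, 2.0 * err * w


def adversarial_training_step(w, x, y, eps=0.01, lam_adv=0.2, lr=0.1):
    """One update using a clean sample and one adversarial sample.

    The adversarial input is treated as a constant when updating w.
    """
    clean_loss, grad_w_clean, grad_x = loss_and_grads(w, x, y)
    # Attack: one sign-gradient ascent step on the input.
    x_adv = x + eps * (1.0 if grad_x >= 0 else -1.0)
    adv_loss, grad_w_adv, _ = loss_and_grads(w, x_adv, y)
    # Assumed total objective: clean loss plus weighted adversarial loss.
    total_loss = clean_loss + lam_adv * adv_loss
    # Gradient descent on the parameter with the combined gradient.
    w_new = w - lr * (grad_w_clean + lam_adv * grad_w_adv)
    return w_new, total_loss


w = 0.0
for epoch in range(50):
    w, total = adversarial_training_step(w, x=1.0, y=1.0)
print(w, total)
```

The attack ascends the loss with respect to the input while the update descends it with respect to the parameter, which is the contrast between Equation (1) and Equation (2) drawn in the text.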

Results
We introduce the datasets and experiment settings in the following sections, along with three experiments: model comparison on LEVIR, model comparison on the airport surface dataset, and model robustness under adversarial attack on both datasets. For copyright reasons, we do not have access to a well-annotated airport surface dataset. We therefore evaluate our method on the open remote sensing dataset LEVIR and a manually collected airport surface dataset.

Dataset and Set Up
Dataset. The datasets we use include LEVIR [31] and the airport surface dataset. LEVIR is a popular remote sensing target detection dataset derived from Google Earth, which is one order of magnitude larger than any other dataset in this field. This dataset contains 3.8 k labeled images with three kinds of labels: plane, ship, and tank (i.e., oil pot). The instance numbers of plane, ship, and tank are 5030, 3197, and 2851, respectively. Each image in LEVIR has a size of 800 × 600 pixels and a resolution of 0.2 to 1.0 m/pixel. Following prior research [32], we adopt a randomized dataset division, apportioning the images into training, validation, and testing sets in a ratio of 8:1:1.
To meet the demand for airport surveillance, we collect several airport surface videos from public resources on the Internet. Since the difference between adjacent frames is too slight, we select one image every five frames and annotate it manually. Finally, we build a small-scale dataset with only 143 images of 1920 × 1080 pixels and two label categories: airplane and vehicle. The instance numbers of airplane and vehicle are 187 and 56, respectively. We randomly partitioned the dataset by video three times, maintaining a training, validation, and testing split ratio of 8:1:1. We ensured that no single video contributed frames to more than one of the training, validation, and testing subsets.
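The frame sampling and video-wise partitioning described above can be sketched as follows. This is a hedged illustration, not the authors' data pipeline: the helper names `sample_frames` and `split_by_video` are hypothetical, and the greedy rule that fills each subset with whole videos until it reaches its share of frames is an assumed way of approximating the 8:1:1 ratio while keeping every video in exactly one subset.

```python
import random


def sample_frames(frames, stride=5):
    """Keep one frame out of every `stride` consecutive frames."""
    return frames[::stride]


def split_by_video(frame_to_video, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole videos to train/val/test so that no video spans two subsets."""
    frames_per_video = {}
    for frame, video in frame_to_video.items():
        frames_per_video.setdefault(video, []).append(frame)
    videos = sorted(frames_per_video)
    random.Random(seed).shuffle(videos)

    total = len(frame_to_video)
    names = ("train", "val", "test")
    subsets = {name: [] for name in names}
    current = 0
    for video in videos:
        # Move to the next subset once the current one has reached its share.
        while current < 2 and len(subsets[names[current]]) >= ratios[current] * total:
            current += 1
        subsets[names[current]].extend(frames_per_video[video])
    return subsets


# Hypothetical example: 10 videos with 5 annotated frames each.
frame_to_video = {f"v{v}_f{i}": f"v{v}" for v in range(10) for i in range(5)}
split = split_by_video(frame_to_video, seed=0)
print({name: len(frames) for name, frames in split.items()})
```

Calling `split_by_video` with three different seeds gives three random video-wise partitions, in the spirit of the repeated partitioning mentioned in the text.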
Experiment setup. All experiments in this paper are performed on the same server with an Intel Xeon Gold 6226R 2.90 GHz CPU, four NVIDIA RTX 3090 GPUs, Ubuntu 18.04, and PyTorch 1.10.1. Each minibatch comprises 32 images with a resolution of 512 × 512 pixels. We use the SGD optimizer with an initial learning rate of 0.1, momentum of 0.937, and weight decay of 0.0005. Image preprocessing is the same as in YOLOv5: it keeps the width-height ratio, resizes the image to a given size, and uses gray padding. To improve the performance on small objects, we follow YOLOv5 and employ mosaic data augmentation, which joins four images into a single image in which objects appear at a smaller scale. We train each model for no more than 300 epochs with a patience value of 100. That is, if the loss does not decrease for 100 consecutive epochs, we treat the training as converged, stop training, and save the parameters of both the best and the last models. The best model is the one that performs best on the validation set. We test both the best and the last models on the test set and record the one that reaches higher accuracy.
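The stopping rule with a maximum of 300 epochs and a patience of 100 can be sketched as follows. This is a simplified illustration under stated assumptions: the monitored quantity is passed in as a precomputed list of per-epoch loss values rather than produced by real training, and the function name `run_with_patience` is hypothetical.

```python
def run_with_patience(monitored_losses, max_epochs=300, patience=100):
    """Return (best_epoch, last_epoch) under a patience-based stopping rule.

    Training stops when the monitored loss has not improved for `patience`
    consecutive epochs, or when `max_epochs` epochs have been run.
    """
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, loss in enumerate(monitored_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            # New best model: this is where the best checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and keep the last model too.
            break
    return best_epoch, last_epoch


# Hypothetical loss curve that stops improving after the second epoch.
losses = [1.0, 0.5] + [0.6] * 200
print(run_with_patience(losses))
```

Both the best and the last epochs are returned because the text evaluates the best and the last checkpoints on the test set and reports the better of the two.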
The models we compare in this paper include vanilla YOLOv5s, YOLOv5m, YOLOv5s-ViGE, YOLOv5s-ViG, YOLOv5s-A, YOLOv5s-ViGE-A, and YOLOX-s [5]. The vanilla YOLOv5 series is taken from Ultralytics' source code on GitHub, including the small-size YOLOv5s and middle-size YOLOv5m mentioned above. '-ViGE' denotes a base model with a ViGE block added in the backbone, which defines 12-nearest-neighbor in pixel space with dilation number = 3; '-ViG' denotes a base model with an original ViG [21] block added in the backbone, which defines 12-nearest-neighbor in representation space with dilation number = 3; '-A' denotes adversarial training implemented by attacking the location loss for one iteration. Considering the computational cost, we set only one message-passing layer in each ViGE block. YOLOX-s is another outstanding model of the YOLO series released after YOLOv5. All the models we use are trained from scratch.

Result and Analysis
Result on LEVIR. We train and evaluate several models on the LEVIR dataset; the results are in Table 1. We illustrate the detection result on an image in Figure 5. According to the experiment above, we can conclude that:

• Compared with vanilla YOLOv5s, the ViGE block brings a significant improvement of about 2% in mAP.5. This suggests that introducing a new and proper receptive field into the YOLOv5s backbone helps extract more effective features.

• Compared with the original ViG block, the ViGE block performs better, with an mAP.5 increase of about 3%. The ViG block defines neighbors in representation space with positional embedding, while the ViGE block defines neighbors in pixel space with a well-designed edge feature. The reason why ViGE outperforms the original ViG may come from two aspects. One is that defining neighbors in pixel space works better than defining them in representation space. This observation is consistent with the previous success of CNN, which gathers information from neighbors in pixel space. The other is that our edge feature can capture spatial relations better than positional embedding.
• YOLOv5s-ViGE is competitive with other popular models, i.e., YOLOv5m and YOLOX-s. Note that YOLOX adopts several tricks on top of YOLOv5, e.g., a decoupled head, SimOTA, and so on. We only add a ViGE block to YOLOv5s and obtain an accuracy close to that of YOLOX-s, which supports the effectiveness of the ViGE block in improving detection accuracy.
Result on airport surface dataset. We show different models' test results with their average value and standard deviation on the airport surface dataset in Table 2, and we illustrate the detection result on an image in Figure 6. The conclusions drawn from this experiment are listed after the Table 2 caption.
Result of robustness under adversarial attack. When applying detection models for airport surveillance, it is necessary to improve their robustness against potential attacks. Considering the importance of ensuring that perturbations remain imperceptible within images, we focus on gradient-based artificial perturbation methods. These methods have been well studied in recent years and constitute the primary challenge that adversarial training approaches aim to mitigate [28,29]. For each salient sample, we attack the model's location loss for five iterations to generate a corresponding adversarial sample based on the PGD algorithm [26]. Then, the adversarial samples are used to test the robustness of the models. The result is in Table 3. According to the experiment above, we come to the following conclusions:
• On the LEVIR and airport surface datasets, the adversarial attack causes a significant loss in object detection accuracy for all models above. Detectors with a GNN block in particular are more vulnerable to attack than purely convolution-based detectors (i.e., YOLOv5s and YOLOv5m). In detail, detectors with a GNN block obtain mAP.5 decreases of about 38% and 67%, while vanilla YOLOv5s models obtain about 47% and 60% on the two datasets, respectively. When using a GNN-based detector in safety supervision, its poor robustness must be taken into consideration.
• Adversarial training also helps to improve detection accuracy. This framework uses adversarial samples as auxiliary samples to train the model, which can be seen as a data augmentation approach. It is known that data augmentation can improve a model's performance, especially for small-scale datasets.

• Model performance under various settings of adversarial attack is provided in Table 4. As the number of attack iterations rises to 10, there is a significant drop in the mAP.5 of the GNN-based model without adversarial training. However, the application of adversarial training mitigates these adverse effects under different adversarial settings.

Conclusions
In this paper, we design a GNN-based feature extraction block, called ViGE, and implement adversarial training for the popular object detector YOLOv5s. Different from CNN, which aggregates local messages, and the transformer, which gathers global information through self-attention, the ViGE block aggregates information from each node's k-nearest neighbors in pixel (or representation) space. Compared to the original ViG block, our ViGE block defines edge features that can represent spatial relations more directly. Considering potential perturbation in airport surveillance, it is necessary to enhance the model's robustness, especially towards adversarial attacks. Our experiments support the effectiveness of our method in improving detection accuracy and robustness.
When introducing GNN into image representation learning, how to construct a graph for an image (or a feature map) is the first problem we encounter. Though we have explored several neighbor definition approaches in this article (i.e., in different spaces with various parameters), there are still other promising approaches we have not explored. Moreover, the noisy node trick [33] may help to improve GNN's robustness. Furthermore, the performance of the GNN on datasets with a broader range of object classes, such as the DOTA dataset [34], remains to be researched. We anticipate that GNN, like transformers, can also achieve great success in the computer vision domain. However, there is a long way to go.

Figure 1. Illustration of ViG and ViGE block (number of neighbors is set to 4).

Figure 2. Neighbor definition in pixel space. (The blue numbers denote the corresponding neighbors of the center node.)

Figure 3. Neighbor definitions of convolution and transformer.

Algorithm 1: Pseudo-code of adversarial training

Figure 6. Detection results for airport surface dataset (a sample).
Note: The ↑ symbol means the larger the better; the best performance is marked in bold text. The -- symbol means that metrics are not provided in the result.

Table 2. Model performance on airport surface dataset.
Note: The ↑ symbol means the larger the better; the best performance is marked in bold text. The -- symbol means that metrics are not provided in the result.

• Our YOLOv5s-ViGE model achieves an mAP.5 improvement of about 7% compared to vanilla YOLOv5s, which verifies the ViGE block's potential for feature extraction. Moreover, the original ViG block brings an accuracy improvement to YOLOv5s as well. These results demonstrate that GNN can also perform well in image feature extraction.
• Though YOLOX-s performs better on LEVIR, its accuracy is lower than that of YOLOv5s-ViGE on the airport surface dataset. This result indicates, from another angle, that YOLOv5s-ViGE converges better on small-scale datasets.
• In our experiments, we have explored several neighbor definition approaches with different setups, including k-nearest-neighbor in pixel space, k-nearest-neighbor in representation space, and k-farthest-neighbor in representation space. The worst definition is k-farthest-neighbor in representation space, whose accuracy always stays at 0 during training. Evidently, gathering information from irrelevant parts does not help to improve model performance. Although the original ViG performs well, our experiments show that defining neighbors in pixel space surpasses defining them in representation space. As for the parameter setups, i.e., the neighbor number k and the dilation number, they need to be adjusted carefully to define a proper receptive field. Due to space limitations, we do not show this part of the experiments.

Table 3. Model robustness under adversarial attack.

• Adversarial training does help to improve model robustness against attack. Although adversarially trained models still suffer from accuracy loss under adversarial attacks, the loss is clearly smaller than for models without adversarial training.

Table 4. Model performance under various settings of adversarial attack.
Note: The best performance is marked in bold text.