4.1. Dataset Information
To verify the effectiveness and feasibility of the proposed method, we conducted experiments on three UAV aerial image benchmark datasets: the UAVid dataset [2], the Semantic Drone dataset [70], and the AeroScapes dataset [71].
The details of these datasets are as follows.
UAVid is a challenging benchmark dataset for UAV aerial image semantic segmentation that contains many static and moving objects in complex urban scenes. The dataset is captured in 4K resolution video recording mode, and the images are 3840 × 2160 pixels in size. Because UAVid is collected mainly from urban scenes, it covers eight object classes: building, road, tree, low-vegetation, moving-car, static-car, background-clutter, and human. The dataset contains 420 images, of which we use 200 for training, 70 for validation, and the remaining 150 for testing. Because the original images are large, we scale them to 1024 × 1024 pixels. Sample images and the corresponding labels are shown in Figure 3.
Semantic Drone focuses on the semantic understanding of urban scenes and observes ground objects from a bird’s-eye perspective at an altitude of 5 to 30 m. A high-resolution camera captures images of 6000 × 4000 pixels, and the dataset contains eighteen classes of ground objects: trees, rocks, dogs, fences, grass, water, bicycle, fence-pole, vegetation, dirt, pool, door, gravel, wall, obstacle, car, window, and paved-area. An original image and its label are shown in Figure 3. The dataset contains 400 publicly available images, of which 280 are used as the training set, 40 as the validation set, and 80 as the testing set. To facilitate training, we crop the original images from 6000 × 4000 pixels to 2048 × 1024 pixels.
The AeroScapes dataset is more challenging for semantic segmentation because it covers ground objects in complex urban and suburban scenes. It contains 3269 images and eleven categories of ground objects: person, bike, car, drone, boat, animal, obstacle, construction, vegetation, road, and sky. As shown in Figure 3, the number of pixels varies greatly across object categories. The images are 1280 × 720 pixels, and we keep this original size during training. Of the 3269 images, we use 2288 as the training set, 654 as the validation set, and the remaining 327 for testing.
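The split sizes reported above can be checked and reproduced with a short sketch. The counts come from the text; the seeded random shuffle and the helper name `split_indices` are assumptions made for illustration, since the text does not state how images were assigned to each subset.

```python
import random

# (train, val, test) image counts as reported in the text.
SPLITS = {
    "UAVid": (200, 70, 150),
    "Semantic Drone": (280, 40, 80),
    "AeroScapes": (2288, 654, 327),
}


def split_indices(n_images, sizes, seed=0):
    """Shuffle image indices and cut them into train/val/test lists.

    The shuffle and the seed are assumptions; only the counts are
    taken from the text.
    """
    n_train, n_val, n_test = sizes
    if n_train + n_val + n_test != n_images:
        raise ValueError("split sizes must sum to the number of images")
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


for name, sizes in SPLITS.items():
    train, val, test = split_indices(sum(sizes), sizes)
    print(name, len(train), len(val), len(test))
```

The three subsets are disjoint by construction, and the sizes sum to 420, 400, and 3269 images, matching the dataset totals given above.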
4.4. Comparison with State-of-the-Art Methods
For the experimental comparison, we evaluate both the semantic segmentation performance of each model and its robustness against adversarial attacks. On the UAVid dataset, we compare GFANet with the existing aerial image semantic segmentation networks LANet [10], AERFC [11], and AFNet [12]. On the Semantic Drone dataset, we compare the proposed method with MCLNet [13], BSNet [14], and SBANet [15]. On the AeroScapes dataset, GFANet is compared with MANet [16], HPSNet [17], and TCHNet [72].
Comparison on the UAVid Dataset: First, we compare performance on the clean example test set; the quantitative results and visual comparisons are shown in Table 1 and Figure 4. Second, we verify robustness against adversarial attacks on the adversarial example test set generated by the FGSM attack [24]; the results are shown in Table 2 and Figure 5. Next, we analyze the performance and robustness of the different methods.
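For readers unfamiliar with FGSM, the following minimal sketch illustrates the single-step update x_adv = x + ε · sign(∇_x L). It is not the experimental setup used here: the toy per-pixel linear softmax classifier, the weights `W`, and the budget `epsilon = 0.1` are all assumptions chosen so that the gradient can be written analytically without a deep learning framework.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_input_grad(weights, x, label):
    """Cross-entropy loss of a linear softmax classifier and dL/dx.

    weights[k][j] is the weight of input channel j for class k.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    # dL/dz_k = p_k - 1[k == label]; dL/dx_j = sum_k dL/dz_k * W[k][j]
    dz = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    grad = [sum(dz[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(x))]
    return loss, grad


def fgsm(weights, x, label, epsilon):
    """One signed-gradient step of size epsilon, clipped to [0, 1]."""
    _, grad = loss_and_input_grad(weights, x, label)
    stepped = [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
    return [min(1.0, max(0.0, v)) for v in stepped]


# Hypothetical two-class, three-channel "pixel" (assumed values).
W = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]
x = [0.6, 0.4, 0.5]
x_adv = fgsm(W, x, label=0, epsilon=0.1)
print("clean loss:", loss_and_input_grad(W, x, 0)[0])
print("adversarial loss:", loss_and_input_grad(W, x_adv, 0)[0])
```

Each input channel moves by at most ε, yet the loss on the true label increases. In a segmentation setting the same step would be applied to the whole image using the gradient of the pixel-wise loss.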
(1) LANet [10]: This network consists of a patch attention mechanism and an attention embedding module, which mine local feature information of ground objects to guide semantic segmentation. As shown in Table 1, LANet achieves 66.52% mIoU on the clean example test set, while it reaches only 15.85% mIoU on the adversarial example test set. The visualization results in Figure 4 and Figure 5 show that LANet predicts the pixels of each category well for clean examples, whereas for adversarial examples it makes serious mistakes, such as misclassifying “tree” as “low-vegetation”. The results in Figure 6 further show the performance gap between clean and adversarial examples for LANet, with the mIoU decreasing by 50.67 percentage points. These results also demonstrate that local features perform poorly against adversarial attacks.
(2) AERFC [11]: To segment objects of different scales accurately, AERFC constructs an adaptive convolution kernel to extract multi-scale feature information of ground objects. The results in Table 1 show that AERFC segments the different object categories well; for example, its mPA and mIoU reach 79.85% and 69.28%, respectively. For adversarial examples, the results in Table 2 show that the mPA and mIoU of AERFC reach only 18.23% and 15.46%. The visualization results in Figure 4 and Figure 5 further show the performance gap of AERFC between clean and adversarial examples: on the clean example test set, AERFC predicts the pixels of different object categories well, while its performance on the adversarial example test set degrades significantly. The experimental results of AERFC show that multi-scale features cannot withstand adversarial attacks.
(3) AFNet [12]: For feature enhancement, AFNet constructs a scale-feature attention mechanism and a scale-layer attention module, which achieve semantic segmentation by enhancing features of different scales and different convolution layers. Table 1 shows that AFNet performs well on the clean example test set, while Table 2 shows that it performs poorly on the adversarial example test set. The visualization results also show this gap. For example, in Figure 4, AFNet accurately predicts the object “road”, while for adversarial examples, “road” is misclassified as “background-clutter”. Figure 6 shows that the mIoU of AFNet decreases from 70.47% on clean examples to 19.75% on adversarial examples. The experimental results of AFNet show that simple feature enhancement cannot alleviate the impact of adversarial examples on model performance.
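The mPA and mIoU values quoted above are not defined in this section. The sketch below uses the common definitions, which is an assumption about the evaluation protocol: mPA is the mean over classes of the per-class pixel accuracy, and mIoU is the mean over classes of TP / (TP + FP + FN), both computed from a confusion matrix. Skipping classes that never occur is a further assumption.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_pixel_accuracy(cm):
    """Mean over classes of (correct pixels / pixels of that class)."""
    accs = []
    for k, row in enumerate(cm):
        total = sum(row)
        if total > 0:  # skip classes absent from the ground truth
            accs.append(row[k] / total)
    return sum(accs) / len(accs)


def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both label and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Six flattened "pixels" with three classes (toy labels, not dataset values).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print("mPA:", mean_pixel_accuracy(cm))
print("mIoU:", mean_iou(cm))
```

In this toy case a single misclassified pixel gives mPA = 8/9 and mIoU = 7/9; mIoU is lower because the error counts as a false negative for one class and a false positive for another.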
For our proposed GFANet, Table 1 and Table 2 show that it achieves the best results on both the clean and adversarial example test sets. For clean examples, its mIoU reaches 71.89%, while for adversarial examples, its mIoU reaches 69.51%. The visualization results in Figure 4 and Figure 5 also show that GFANet completes accurate semantic segmentation for both clean and adversarial examples. Figure 6 shows that the mIoU gap of GFANet between clean and adversarial examples is only 2.38 percentage points, which further indicates the robustness of GFANet against adversarial attacks. The experimental results of GFANet show that global features support accurate aerial image semantic segmentation and are strongly robust against adversarial attacks.
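The robustness gaps on UAVid can be recomputed directly from the clean and adversarial mIoU values quoted in the text. The sketch below only performs this subtraction; the dictionary layout and the helper name `miou_drop` are illustrative.

```python
# (clean mIoU, adversarial mIoU) on UAVid in percent, as quoted in the text.
UAVID_MIOU = {
    "LANet": (66.52, 15.85),
    "AERFC": (69.28, 15.46),
    "AFNet": (70.47, 19.75),
    "GFANet": (71.89, 69.51),
}


def miou_drop(clean, adversarial):
    """Absolute mIoU drop in percentage points, rounded to two decimals."""
    return round(clean - adversarial, 2)


for name, (clean, adv) in UAVID_MIOU.items():
    print(f"{name}: {clean:.2f} -> {adv:.2f} "
          f"(drop {miou_drop(clean, adv):.2f} points)")
```

The computed drops for LANet (50.67) and GFANet (2.38) match the gaps reported from Figure 6, and the compared networks each lose more than 50 points under the FGSM attack.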
Comparison on the Semantic Drone Dataset: Because this dataset contains more object categories and more complex scenes, it can further verify the semantic segmentation accuracy and adversarial robustness of the different methods. We use the C&W attack [
27] with
norm to generate an adversarial example test set.
Table 3 and Figure 7 show the experimental results of the different methods on the clean example test set, and Table 4 and Figure 8 show the results on the adversarial example test set. Next, we analyze the experimental results of the different methods in detail.
(1) MCLNet [13]: To enhance the correlation between multi-scale features, MCLNet constructs a multi-scale calibration learning strategy and performs semantic segmentation by mining the correlation between local and global features. The results on the clean example test set in Table 3 and Figure 7 show that MCLNet segments objects of different scales well, with mPA and mIoU of 73.81% and 62.52%. However, on the adversarial example test set, as shown in Table 4, its performance degrades significantly, with mPA and mIoU of 23.16% and 12.85%. The visualization results in Figure 8 show that the adversarial attack strongly affects MCLNet, which cannot complete accurate semantic segmentation on the adversarial example test set. Figure 6 also shows that the adversarial attack reduces the mIoU of MCLNet by 49.67 percentage points (from 62.52% to 12.85%). These results further illustrate that establishing the correlation between local and global features alone is ineffective against adversarial attacks.
(2) BSNet [14]: This network consists of dynamic hybrid gradient convolution and coordinate-sensitive attention, and completes semantic segmentation by obtaining the salient boundary information of the object region. As shown in Table 3, the mPA and mIoU of BSNet are 74.29% and 65.13%, which shows the contribution of boundary feature information to accurate semantic segmentation. The visualization results in Figure 7 show that BSNet segments contour boundaries finely. On the adversarial example test set, as shown in Table 4, the mPA and mIoU of BSNet are only 26.35% and 15.07%, clearly inferior to its results on the clean example test set. Figure 6 shows that the mIoU of BSNet decreases from 65.13% to 15.07%. The results in Figure 8 further confirm the impact of adversarial examples on BSNet, which cannot complete the semantic segmentation task under adversarial attack. The results of BSNet also show that enhancing boundary features alone cannot alleviate the impact of adversarial examples on model performance.
(3) SBANet [15]: To obtain fine-grained semantic features of the object region, SBANet uses a boundary attention mechanism to locate the object region and uses adaptive weighted multi-task learning to guide the model in extracting semantic features. As shown by the clean example results in Table 3 and Figure 7, SBANet obtains mPA and mIoU of 76.82% and 68.07% and segments the different object categories accurately. However, the results in Table 4 and Figure 8 show that SBANet is ineffective against adversarial attacks. Its mIoU on the adversarial example test set is only 16.72%, and there are pixel classification errors, such as “water” being misclassified as “vegetation”. The results in Figure 6 show that the adversarial examples reduce the mIoU from 68.07% to 16.72%. The experimental results of SBANet verify that semantic features alone are ineffective against adversarial attacks.
As shown in Table 3 and Table 4, GFANet achieves mIoU of 74.80% and 73.20% on the clean and adversarial example test sets, respectively, which is superior to the other compared methods. The visualization results in Figure 7 and Figure 8 show that GFANet completes accurate semantic segmentation and effectively alleviates the impact of adversarial attacks.
Comparison on the AeroScapes Dataset: The dataset contains many suburban scenes and has higher resolution and fine annotation information, which can effectively verify the robustness and generalization ability of a semantic segmentation network. We use the PGD attack [28] to generate the adversarial example test set. Table 5 and Table 6 show the quantitative comparison results of the different methods on the clean and adversarial example test sets, and Figure 9 and Figure 10 show the visual comparison results. The specific performance analysis of the different methods is as follows.
(1) MANet [16]: This network uses a multi-attention cascade to obtain multi-scale context features and uses dot-product attention for feature fusion, which effectively alleviates feature loss during fusion. As shown in Figure 9 and Table 5, on the clean example test set, MANet obtains complete contour information of the object regions and classifies the pixels of different object categories accurately, with mPA and mIoU of 81.36% and 69.89%. However, on the adversarial example test set, the quantitative results in Table 6 show that the mIoU of MANet reaches only 9.95%. Figure 10 shows that MANet misclassifies “vegetation” as “road” on the adversarial example test set, indicating that the adversarial examples seriously damage model performance. The experimental results of MANet show that context information alone cannot effectively resist the interference of adversarial examples.
(2) HPSNet [17]: To establish the correlation between different features, HPSNet constructs a hidden path selection strategy, which completes accurate semantic segmentation through correlation modeling and global connection of different features. On the clean example test set, HPSNet obtains mPA and mIoU of 82.45% and 71.52%. The results in Figure 9 show that HPSNet obtains accurate semantic segmentation results by establishing the relationship between different object features. On the adversarial example test set, as shown in Table 6 and Figure 10, HPSNet is seriously affected by the adversarial examples, with mPA and mIoU of 29.32% and 10.86%. The experiment further verifies that simply establishing the correlation between features cannot improve resistance to adversarial attacks.
(3) TCHNet [72]: This network consists of atrous spatial pyramid pooling and a channel attention mechanism, and realizes semantic segmentation by extracting fine-grained spatial structure features and enhancing local channel features. As shown in Figure 9 and Table 5, TCHNet segments the different object categories accurately, with mPA and mIoU of 83.06% and 71.81%, indicating good semantic segmentation performance. However, on the adversarial example test set, as shown in Table 6, the mPA and mIoU of TCHNet are only 32.57% and 13.28%. The visualization results in Figure 10 illustrate that adversarial examples seriously affect its segmentation performance. The experiment shows that neither the channel attention mechanism nor the feature enhancement strategy can mitigate the impact of adversarial examples on model performance.
Our proposed GFANet, as shown in Figure 9 and Table 5, obtains the best results on the clean example test set and segments the different object categories accurately. On the adversarial example test set, as shown in Figure 10 and Table 6, GFANet maintains nearly the same performance as on the clean example test set. The results in Figure 6 show that the mIoU gap of GFANet between clean and adversarial examples is only 1.66 percentage points, further illustrating the performance advantage and robustness of the proposed method.
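The PGD attack [28] used to build the AeroScapes adversarial test set is an iterative version of the signed-gradient step: each iteration moves the input by a small step along the sign of the loss gradient and then projects it back into a ball of radius ε around the clean input. The sketch below is illustrative only. The toy linear softmax classifier, the L∞ ball, the absence of a random start, and the values of `epsilon`, `step_size`, and `num_steps` are all assumptions, since the attack settings are not given in this section.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_input_grad(weights, x, label):
    """Cross-entropy loss of a linear softmax classifier and dL/dx."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    dz = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    grad = [sum(dz[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(x))]
    return loss, grad


def pgd(weights, x, label, epsilon, step_size, num_steps):
    """Iterated signed-gradient ascent with projection (assumed L-inf ball)."""
    x_adv = list(x)
    for _ in range(num_steps):
        _, grad = loss_and_input_grad(weights, x_adv, label)
        x_adv = [v + step_size * ((g > 0) - (g < 0))
                 for v, g in zip(x_adv, grad)]
        # Project back into the epsilon-ball around the clean input.
        x_adv = [min(x0 + epsilon, max(x0 - epsilon, v))
                 for x0, v in zip(x, x_adv)]
        # Keep values in the valid pixel range [0, 1].
        x_adv = [min(1.0, max(0.0, v)) for v in x_adv]
    return x_adv


# Hypothetical two-class, three-channel "pixel" (assumed values).
W = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]
x = [0.6, 0.4, 0.5]
x_adv = pgd(W, x, label=0, epsilon=0.1, step_size=0.04, num_steps=5)
print("clean loss:", loss_and_input_grad(W, x, 0)[0])
print("adversarial loss:", loss_and_input_grad(W, x_adv, 0)[0])
```

Because the projection is applied after every step, the final perturbation never exceeds ε per channel regardless of the number of iterations.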