AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation
Abstract
1. Introduction
- The proposed approach incorporates a high-resolution CNN Stem that preserves feature maps at half the input resolution, providing larger and more detailed features to the decoder. This improvement enhances the segmentation of tiny and densely packed objects.
- This paper introduces a unique multi-dilated CNN (MDC) Decoder that efficiently integrates both local and global context. The module splits feature maps channel-wise and processes each split with one of three distinct dilated filters, maintaining computational efficiency while enhancing the diversity of feature representations (a code sketch of the Stem and MDC block follows this list).
- The proposed method demonstrates the effectiveness of combining a Transformer Encoder with a multi-dilated CNN Decoder and CNN Stem. This integration successfully addresses challenges such as background–foreground imbalance, intra-class heterogeneity, inter-class homogeneity, and the segmentation of tiny and dense objects in aerial image segmentation tasks.
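The two modules highlighted above can be sketched compactly. The PyTorch code below is an illustrative reconstruction, not the authors' released implementation: the class names, channel widths, normalization/activation choices, and the specific dilation rates (1, 2, 3) are assumptions. Only the two ideas stated in the list are taken from the text: a convolutional stem that keeps features at half the input resolution, and a decoder block that splits its input channel-wise and applies three convolutions with different dilation rates.

```python
import torch
import torch.nn as nn


class CNNStem(nn.Module):
    """Convolutional stem that keeps features at 1/2 of the input resolution.

    Minimal sketch: one stride-2 convolution followed by a stride-1 refinement
    convolution. The channel width (48) is an assumed value.
    """

    def __init__(self, in_ch: int = 3, out_ch: int = 48):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)  # (B, out_ch, H/2, W/2)


class MDCBlock(nn.Module):
    """Multi-dilated convolution block (sketch).

    The input is split channel-wise into three groups, and each group is
    processed by a 3x3 convolution with a different dilation rate, mixing
    local and global context at low cost. The rates (1, 2, 3) are assumed.
    """

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        assert channels % len(dilations) == 0, "channels must split evenly"
        split = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Conv2d(split, split, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, len(self.branches), dim=1)  # channel-wise split
        out = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return self.fuse(out)


if __name__ == "__main__":
    stem, mdc = CNNStem(), MDCBlock(channels=96)
    print(stem(torch.randn(1, 3, 512, 512)).shape)  # torch.Size([1, 48, 256, 256])
    print(mdc(torch.randn(1, 96, 64, 64)).shape)    # torch.Size([1, 96, 64, 64])
```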
2. Related Works
2.1. DL-Based Image Segmentation
2.2. Aerial Image Segmentation
3. Methods
3.1. Network Overview
3.2. CNN Stem
3.3. Transformer Encoder
3.3.1. Patch Embedding
3.3.2. Transformer Encoder Block
3.3.3. Patch Merging
3.4. Multi-Dilated CNN Decoder
3.4.1. MDC Block
3.4.2. Deconv Block
3.5. Loss Function
4. Experiments
4.1. Datasets
- iSAID: The iSAID dataset [86] is a large-scale and densely annotated aerial segmentation dataset that contains 655,451 instances across 2806 high-resolution images for 15 classes, i.e., ship (Ship), storage tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground track field (GTF), bridge (Bridge), large vehicle (LV), small vehicle (SV), helicopter (HC), swimming pool (SP), roundabout (RA), soccer ball field (SBF), plane (Plane), and harbor (Harbor). The dataset is challenging due to foreground–background imbalance, a large number of objects per image, limited appearance details, a variety of tiny objects, large scale variations, and high class imbalance. The images were collected from multiple sensors and platforms with multiple resolutions and varying image sizes. Following the experimental setup of [11,70], the dataset was split into 1411/458/937 images for train/val/test; the network was trained on the trainset and benchmarked on the valset. Each image was overlap-partitioned into a set of fixed-size sub-images with a step size of 512 pixels in each direction (a code sketch of this tiling step follows the dataset list).
- Potsdam: The Potsdam dataset [87] contains 38 high-resolution images of 6000 × 6000 pixels over Potsdam City, Germany, with a ground sampling distance of 5 cm. Two modalities are provided, i.e., the true orthophoto (TOP) and the digital surface model (DSM); in this work, only the RGB TOP images were used, and the DSM data were ignored. The dataset presents a complex background and challenging inter- and intra-class variations due to the unique characteristics of Potsdam. For example, the similarity between the low-vegetation and building classes caused by roof greening illustrates the inter-class difficulty. The dataset offers two types of annotations, non-eroded (NE) and eroded (E), i.e., with and without boundary labels; to avoid ambiguity in labeling boundaries, all experiments were performed and benchmarked on the eroded-boundary annotations. Following the experimental setup of [77,88], the dataset was divided into 24 images for training and 14 images for testing; the testset comprised images 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13. The dataset consists of six categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background, and performance is reported in two scenarios: with and without clutter. Each image was overlap-partitioned into a set of fixed-size sub-images with a step size of 256 pixels in each direction.
- LoveDA: The LoveDA dataset [8] consists of 5987 high-resolution images of 1024 × 1024 pixels with a spatial resolution of 30 cm. The data cover 18 complex urban and rural scenes and 166,768 annotated objects from three different cities (Nanjing, Changzhou, and Wuhan) in China. This dataset is challenging due to its diverse geographical sources, which lead to complex and varied backgrounds as well as inconsistent appearances within the same class, such as differences in scale and texture. Following the experimental setup in [8], the dataset was partitioned into 2522/1669/1796 images for training, validation, and testing, respectively. For evaluations on the testset, the training and validation sets of LoveDA were merged into a combined training set, while the testset was kept unchanged.
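All three datasets are consumed as overlapping fixed-size crops rather than full scenes, as noted in the partitioning remarks above. The snippet below is a minimal sketch of such a sliding-window tiler, not the authors' preprocessing pipeline: the crop size argument is hypothetical (the text above only gives the step sizes of 512 and 256 pixels), and shifting border tiles inward is just one reasonable convention.

```python
import numpy as np


def overlap_partition(image: np.ndarray, tile: int = 512, stride: int = 256):
    """Split a large aerial image (H, W, C) into overlapping square tiles.

    `tile` and `stride` are illustrative defaults; the paper uses
    dataset-specific values (e.g., a 512-pixel step on iSAID and a
    256-pixel step on Potsdam). Tiles at the right/bottom border are
    shifted inward so that every tile has the full size.
    """
    h, w = image.shape[:2]
    last_y, last_x = max(h - tile, 0), max(w - tile, 0)
    ys = list(range(0, last_y + 1, stride))
    xs = list(range(0, last_x + 1, stride))
    if ys[-1] != last_y:          # cover the bottom border
        ys.append(last_y)
    if xs[-1] != last_x:          # cover the right border
        xs.append(last_x)
    tiles, coords = [], []
    for y in ys:
        for x in xs:
            tiles.append(image[y:y + tile, x:x + tile])
            coords.append((y, x))
    return tiles, coords


# Example: a 6000 x 6000 Potsdam-sized scene, 512-pixel tiles, 256-pixel stride
scene = np.zeros((6000, 6000, 3), dtype=np.uint8)
tiles, coords = overlap_partition(scene, tile=512, stride=256)
print(len(tiles), tiles[0].shape)  # 529 (512, 512, 3)
```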
4.2. Evaluation Metrics
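The tables that follow report mean intersection over union (mIoU), overall accuracy (OA), and mean/per-class F1. Since this section does not spell out the formulas, the snippet below shows one standard way to compute these metrics from a confusion matrix; it is a generic reference sketch, not the evaluation code behind the reported numbers.

```python
import numpy as np


def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Accumulate a (num_classes x num_classes) confusion matrix (rows = ground truth)."""
    mask = (gt >= 0) & (gt < num_classes)              # ignore void / invalid pixels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def segmentation_metrics(cm: np.ndarray):
    """Return (mIoU, OA, mF1, per-class F1) computed from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)             # per-class IoU
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)      # per-class F1
    oa = tp.sum() / np.maximum(cm.sum(), 1)            # overall pixel accuracy
    return iou.mean(), oa, f1.mean(), f1


# Toy example: 3 classes on a 4-pixel image
gt = np.array([0, 1, 2, 2])
pred = np.array([0, 1, 2, 1])
miou, oa, mf1, _ = segmentation_metrics(confusion_matrix(pred, gt, num_classes=3))
print(round(miou, 3), round(oa, 3), round(mf1, 3))     # 0.667 0.75 0.778
```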
4.3. Implementation Details
- AerialFormer-T: , , ;
- AerialFormer-S: , , ;
- AerialFormer-B: , , .
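The width and depth settings of the three variants are not given above. Purely as a placeholder, the dictionary below shows how such a family of models is commonly parameterized when the encoder follows standard Swin-T/S/B settings; every concrete number here is an assumption, and the actual AerialFormer hyperparameters should be taken from the original paper.

```python
# Hypothetical variant table. The concrete values are assumptions based on the
# common Swin-T/S/B encoder settings and may differ from the actual AerialFormer
# configurations.
AERIALFORMER_VARIANTS = {
    "AerialFormer-T": {"embed_dim": 96,  "depths": (2, 2, 6, 2),  "num_heads": (3, 6, 12, 24)},
    "AerialFormer-S": {"embed_dim": 96,  "depths": (2, 2, 18, 2), "num_heads": (3, 6, 12, 24)},
    "AerialFormer-B": {"embed_dim": 128, "depths": (2, 2, 18, 2), "num_heads": (4, 8, 16, 32)},
}


def variant_config(name: str) -> dict:
    """Look up a variant configuration by name (placeholder factory)."""
    cfg = AERIALFORMER_VARIANTS[name]
    print(f"{name}: embed_dim={cfg['embed_dim']}, depths={cfg['depths']}")
    return cfg


variant_config("AerialFormer-T")  # AerialFormer-T: embed_dim=96, depths=(2, 2, 6, 2)
```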
4.4. Quantitative Results and Analysis
4.4.1. iSAID Semantic Segmentation Results
4.4.2. Potsdam Semantic Segmentation Results
- Potsdam with Clutter: Table 2 reports the performance comparison between AerialFormer and the existing methods across six classes (i.e., including the clutter class). It should be noted that among all existing methods, Segformer [58], a strong transformer-based segmentation model, obtains the best performance. The proposed AerialFormer-B gains a notable improvement of 1.7% in mIoU, 0.9% in OA, and 1.2% in mF1 over Segformer.
- Potsdam without Clutter: In this experimental setting, FT-UNetformer [15], HMANet [76], and DC-Swin [16] obtain the best existing scores on the mIoU, OA, and mF1 metrics, yet none of them achieves the best score on all three. In contrast, the proposed AerialFormer-B scores the best on all three metrics, gaining improvements of 1.5% in mIoU, 1.6% in OA, and 0.7% in mF1 compared to FT-UNetformer, HMANet, and DC-Swin, respectively. Compared with Table 2, which includes the clutter class, it can be seen that ignoring clutter tends to alleviate ambiguity between the remaining classes.
4.4.3. LoveDA Semantic Segmentation Results
4.4.4. Ablation Study
4.4.5. Network Complexity
4.5. Qualitative Results and Analysis
- Foreground–background imbalance: As mentioned in the Introduction (Section 1), the iSAID dataset exhibits a notable foreground–background imbalance. This imbalance is particularly evident in Figure 7, where certain images contain only a few labeled objects. Despite this extreme imbalance, AerialFormer accurately segments the objects of interest, as depicted in the figure.
- Tiny objects: As evidenced in Figure 8, AerialFormer is capable of accurately identifying and segmenting tiny objects such as cars on the road, which may be represented by only a small number of pixels. This showcases the model’s remarkable capability to handle small-object segmentation in high-resolution aerial images. Additionally, the proposed model can accurately segment cars that are not present in the ground-truth labels (red boxes). However, this poses a problem for evaluating the proposed model, as its prediction could be penalized as a false positive even when it is correct with respect to the given image.
- Dense objects: Figure 9 demonstrates the proficiency of the proposed model in accurately segmenting dense objects, particularly clusters of small vehicles, which baseline models often overlook or struggle to identify. The success of the proposed model on dense objects is attributed to the MDC decoder, which captures the global context, and the CNN Stem, which preserves the local details of tiny objects.
- Intra-class heterogeneity: Figure 10 visually demonstrates the existence of intra-class heterogeneity in aerial images, where objects of the same category can appear in diverse shapes, textures, colors, scales, and structures. The red boxes indicate two regions that are classified as belonging to the category of ‘Agriculture’. However, their visual characteristics differ significantly due to the presence of greenhouses. Notably, while baseline models encounter challenges in correctly classifying the region with greenhouses, misclassifying it as ‘Building’, the proposed model successfully identifies and labels the region as ‘Agriculture’. This showcases the superior performance and effectiveness of the proposed model in handling the complexities of intra-class variations in aerial image analysis tasks.
- Inter-class homogeneity: Figure 11 illustrates inter-class homogeneity in aerial images, where objects of different classes may exhibit similar visual properties. The regions enclosed within the red boxes exhibit similar visual characteristics, i.e., a rooftop greened with lawn and a park. However, these regions are classified differently, with the former labeled as ‘Building’ and the latter falling into the ‘Low Vegetation’ category. Although the baseline models are confused by this appearance and produce mixed predictions, the proposed model produces more robust results.
- Overall performance: Figure 12 showcases these qualitative outcomes across three datasets: (a) iSAID, (b) Potsdam, and (c) LoveDA. Each dataset possesses unique characteristics and presents a wide spectrum of challenges encountered in aerial image segmentation. The major differences among the methods are highlighted in red boxes. Figure 12a visually demonstrates the efficiency of the proposed model in accurately recognizing dense and tiny objects. Unlike the baseline models, which often overlook or misclassify these objects into different categories, the proposed model exhibits its robustness in handling dense and tiny objects, e.g., small vehicle (SV) and helicopter (HC). As depicted in Figure 12b, the proposed model demonstrates a reduced level of inter-class confusion in comparison to the baseline models. An example of this is evident in the prediction of building structures, where the baseline models exhibit confusion. In contrast, the proposed model delivers predictions closely aligned with the ground truth. Similarly, in Figure 12c, the proposed model’s predictions are less noisy, further asserting its robustness in scenarios where scenes belong to different categories but exhibit similar visual appearances. As in the quantitative analysis, the performance of the proposed model on the ‘Road’ class is visually appealing. The ability of the proposed model to accurately delineate road structures, despite their narrow and elongated features, is visibly superior.
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Schumann, G.J.; Brakenridge, G.R.; Kettner, A.J.; Kashif, R.; Niebuhr, E. Assisting flood disaster response with earth observation data and products: A critical assessment. Remote Sens. 2018, 10, 1230. [Google Scholar] [CrossRef]
- Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402. [Google Scholar] [CrossRef]
- Griffiths, D.; Boehm, J. Improving public data for building segmentation from Convolutional Neural Networks (CNNs) for fused airborne lidar and image data using active contours. ISPRS J. Photogramm. Remote Sens. 2019, 154, 70–83. [Google Scholar] [CrossRef]
- Shamsolmoali, P.; Zareapoor, M.; Zhou, H.; Wang, R.; Yang, J. Road segmentation for remote sensing images using adversarial spatial pyramid networks. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4673–4688. [Google Scholar] [CrossRef]
- Samie, A.; Abbas, A.; Azeem, M.M.; Hamid, S.; Iqbal, M.A.; Hasan, S.S.; Deng, X. Examining the impacts of future land use/land cover changes on climate in Punjab province, Pakistan: Implications for environmental sustainability and economic growth. Environ. Sci. Pollut. Res. 2020, 27, 25415–25433. [Google Scholar] [CrossRef]
- Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107. [Google Scholar] [CrossRef]
- Xia, J.; Yokoya, N.; Adriano, B.; Broni-Bediako, C. OpenEarthMap: A Benchmark Dataset for Global High-Resolution Land Cover Mapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 6254–6264. [Google Scholar]
- Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Virtual. 6–14 December 2021; Vanschoren, J., Yeung, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 1. [Google Scholar]
- O’neill, S.J.; Boykoff, M.; Niemeyer, S.; Day, S.A. On the use of imagery for climate change engagement. Glob. Environ. Change 2013, 23, 413–421. [Google Scholar] [CrossRef]
- Andrade, R.; Costa, G.; Mota, G.; Ortega, M.; Feitosa, R.; Soto, P.; Heipke, C. Evaluation of semantic segmentation methods for deforestation detection in the amazon. ISPRS Arch. 2020, 43, 1497–1505. [Google Scholar] [CrossRef]
- Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual. 14–19 June 2020; pp. 4096–4105. [Google Scholar]
- Shafique, A.; Cao, G.; Khan, Z.; Asad, M.; Aslam, M. Deep learning-based change detection in remote sensing images: A review. Remote Sens. 2022, 14, 871. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Wang, F.; Piao, S.; Xie, J. CSE-HRNet: A context and semantic enhanced high-resolution network for semantic segmentation of aerial imagery. IEEE Access 2020, 8, 182475–182489. [Google Scholar] [CrossRef]
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
- Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Minaee, S.; Boykov, Y.Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
- Le, N.; Bui, T.; Vo-Ho, V.K.; Yamazaki, K.; Luu, K. Narrow band active contour attention model for medical segmentation. Diagnostics 2021, 11, 1393. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual. 3–7 May 2021. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692. [Google Scholar]
- Hoang, D.H.; Diep, G.H.; Tran, M.T.; Le, N.T.H. Dam-al: Dilated attention mechanism with attention loss for 3d infant brain image segmentation. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, Virtual. 25–29 April 2022; pp. 660–668. [Google Scholar]
- Le, N.; Yamazaki, K.; Quach, K.G.; Truong, D.; Savvides, M. A multi-task contextual atrous residual network for brain tumor detection & segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR); Virtual: 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5943–5950. [Google Scholar]
- Le, T.H.N.; Duong, C.N.; Han, L.; Luu, K.; Quach, K.G.; Savvides, M. Deep contextual recurrent residual networks for scene labeling. Pattern Recognit. 2018, 80, 32–41. [Google Scholar] [CrossRef]
- He, J.; Deng, Z.; Zhou, L.; Wang, Y.; Qiao, Y. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7519–7528. [Google Scholar]
- Hsiao, C.W.; Sun, C.; Chen, H.T.; Sun, M. Specialize and fuse: Pyramidal output representation for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7137–7146. [Google Scholar]
- Hu, H.; Ji, D.; Gan, W.; Bai, S.; Wu, W.; Yan, J. Class-wise dynamic graph convolution for semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII 16. Springer: Cham, Switzerland, 2020; pp. 1–17. [Google Scholar]
- Jin, Z.; Gong, T.; Yu, D.; Chu, Q.; Wang, J.; Wang, C.; Shao, J. Mining contextual information beyond image for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7231–7241. [Google Scholar]
- Jin, Z.; Liu, B.; Chu, Q.; Yu, N. ISNet: Integrate image-level and semantic-level context for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7189–7198. [Google Scholar]
- Yu, C.; Wang, J.; Gao, C.; Yu, G.; Shen, C.; Sang, N. Context prior for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12416–12425. [Google Scholar]
- Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation transformer: Object-contextual representations for semantic segmentation. arXiv 2019, arXiv:1909.11065. [Google Scholar]
- Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar]
- Bertasius, G.; Shi, J.; Torresani, L. Semantic segmentation with boundary neural fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3602–3610. [Google Scholar]
- Le, N.; Le, T.; Yamazaki, K.; Bui, T.; Luu, K.; Savides, M. Offset curves loss for imbalanced problem in medical segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9189–9195. [Google Scholar]
- Ding, H.; Jiang, X.; Liu, A.Q.; Thalmann, N.M.; Wang, G. Boundary-aware feature propagation for scene segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6819–6829. [Google Scholar]
- Li, X.; Li, X.; Zhang, L.; Cheng, G.; Shi, J.; Lin, Z.; Tan, S.; Tong, Y. Improving semantic segmentation via decoupled body and edge supervision. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII 16. Springer: Cham, Switzerland, 2020; pp. 435–452. [Google Scholar]
- Zhen, M.; Wang, J.; Zhou, L.; Li, S.; Shen, T.; Shang, J.; Fang, T.; Quan, L. Joint semantic segmentation and boundary detection using iterative pyramid contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13666–13675. [Google Scholar]
- Harley, A.W.; Derpanis, K.G.; Kokkinos, I. Segmentation-aware convolutional networks using local attention masks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5038–5047. [Google Scholar]
- He, J.; Deng, Z.; Qiao, Y. Dynamic multi-scale filters for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3562–3572. [Google Scholar]
- Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
- Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. In Proceedings of the British Machine Vision Conference, Virtual. 22–25 November 2018. [Google Scholar]
- Sun, G.; Wang, W.; Dai, J.; Van Gool, L. Mining cross-image semantics for weakly supervised semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Cham, Switzerland, 2020; pp. 347–365. [Google Scholar]
- Wang, W.; Zhou, T.; Qi, S.; Shen, J.; Zhu, S.C. Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3508–3522. [Google Scholar] [CrossRef]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329. [Google Scholar]
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Tran, M.; Vo, K.; Yamazaki, K.; Fernandes, A.; Kidd, M.; Le, N. AISFormer: Amodal Instance Segmentation with Transformer. In Proceedings of the British Machine Vision Conference, London, UK, 21–24 November 2022; BMVA Press: London, UK, 2022. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual. 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations, Virtual. 3–7 May 2021. [Google Scholar]
- Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
- Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10502–10511. [Google Scholar]
- Vo, K.; Joo, H.; Yamazaki, K.; Truong, S.; Kitani, K.; Tran, M.T.; Le, N. AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation. In Proceedings of the British Machine Vision Conference, Virtual. 22–25 November 2021. [Google Scholar]
- Vo, K.; Truong, S.; Yamazaki, K.; Raj, B.; Tran, M.T.; Le, N. AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation. Int. J. Comput. Vis. 2022, 131, 302–323. [Google Scholar] [CrossRef]
- Yamazaki, K.; Truong, S.; Vo, K.; Kidd, M.; Rainwater, C.; Luu, K.; Le, N. Vlcap: Vision-language with contrastive learning for coherent video paragraph captioning. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3656–3661. [Google Scholar]
- Yamazaki, K.; Vo, K.; Truong, S.; Raj, B.; Le, N. VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
- Li, X.; He, H.; Li, X.; Li, D.; Cheng, G.; Shi, J.; Weng, L.; Tong, Y.; Lin, Z. Pointflow: Flowing semantics through points for aerial image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4217–4226. [Google Scholar]
- Xue, G.; Liu, Y.; Huang, Y.; Li, M.; Yang, G. AANet: An attention-based alignment semantic segmentation network for high spatial resolution remote sensing images. Int. J. Remote Sens. 2022, 43, 4836–4852. [Google Scholar] [CrossRef]
- Ma, A.; Wang, J.; Zhong, Y.; Zheng, Z. FactSeg: Foreground Activation-Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
- Hou, J.; Guo, Z.; Wu, Y.; Diao, W.; Xu, T. Bsnet: Dynamic hybrid gradient convolution based boundary-sensitive network for remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–22. [Google Scholar] [CrossRef]
- You, H.; Tian, S.; Yu, L.; Lv, Y. Pixel-level remote sensing image recognition based on bidirectional word vectors. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1281–1293. [Google Scholar] [CrossRef]
- Mou, L.; Hua, Y.; Zhu, X.X. Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7557–7569. [Google Scholar] [CrossRef]
- Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18. [Google Scholar] [CrossRef]
- Wang, D.; Zhang, J.; Du, B.; Xia, G.S.; Tao, D. An empirical study of remote sensing pretraining. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–20. [Google Scholar] [CrossRef]
- Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. Rssformer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
- Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. Ringmo: A remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–22. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 28–37. [Google Scholar]
- 2D Semantic Labeling Contest-Potsdam. International Society for Photogrammetry and Remote Sensing. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 5 August 2024).
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Liu, G.; Wang, Q.; Zhu, J.; Hong, H. W-Net: Convolutional neural network for segmenting remote sensing images by dual path semantics. PLoS ONE 2023, 18, e0288311. [Google Scholar] [CrossRef]
- Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. FarSeg++: Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13715–13729. [Google Scholar] [CrossRef]
- Gong, Z.; Duan, L.; Xiao, F.; Wang, Y. MSAug: Multi-Strategy Augmentation for rare classes in semantic segmentation of remote sensing images. Displays 2024, 84, 102779. [Google Scholar] [CrossRef]
- He, S.; Jin, C.; Shu, L.; He, X.; Wang, M.; Liu, G. A new framework for improving semantic segmentation in aerial imagery. Front. Remote Sens. 2024, 5, 1370697. [Google Scholar] [CrossRef]
- Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9167–9176. [Google Scholar]
- Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 905–909. [Google Scholar] [CrossRef]
- Ma, X.; Ma, M.; Hu, C.; Song, Z.; Zhao, Z.; Feng, T.; Zhang, W. Log-can: Local-global class-aware network for semantic segmentation of remote sensing images. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
- Sun, L.; Zou, H.; Wei, J.; Cao, X.; He, S.; Li, M.; Liu, S. Semantic segmentation of high-resolution remote sensing images based on sparse self-attention and feature alignment. Remote Sens. 2023, 15, 1598. [Google Scholar] [CrossRef]
- Wang, T.; Xu, C.; Liu, B.; Yang, G.; Zhang, E.; Niu, D.; Zhang, H. MCAT-UNet: Convolutional and Cross-shaped Window Attention Enhanced UNet for Efficient High-resolution Remote Sensing Image Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9745–9758. [Google Scholar] [CrossRef]
- Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 426–435. [Google Scholar] [CrossRef]
- Xu, Q.; Yuan, X.; Ouyang, C.; Zeng, Y. Spatial–spectral FFPNet: Attention-Based Pyramid Network for Segmentation and Classification of Remote Sensing Images. arXiv 2020, arXiv:2008.08775. [Google Scholar]
- Zhang, Q.; Yang, Y.B. Rest: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485. [Google Scholar]
- Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
- Chen, Y.; Fang, P.; Yu, J.; Zhong, X.; Zhang, X.; Li, T. Hi-resnet: A high-resolution remote sensing network for semantic segmentation. arXiv 2023, arXiv:2305.12691. [Google Scholar]
- Wang, L.; Dong, S.; Chen, Y.; Meng, X.; Fang, S. MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images. arXiv 2023, arXiv:2312.12735. [Google Scholar]
- Zhang, X.; Weng, Z.; Zhu, P.; Han, X.; Zhu, J.; Jiao, L. ESDINet: Efficient Shallow-Deep Interaction Network for Semantic Segmentation of High-Resolution Aerial Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5607615. [Google Scholar] [CrossRef]
- Chaurasia, A.; Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
- Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing plain vision transformer towards remote sensing foundation model. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5607315. [Google Scholar] [CrossRef]
- Liu, B.; Zhong, Z. GDformer: A lightweight decoder for efficient semantic segmentation of remote sensing urban scene imagery. In Proceedings of the Fourth International Conference on Computer Vision and Data Mining (ICCVDM 2023), Changchun, China, 19–21 October 2023; SPIE: Bellingham, WA, USA, 2024; Volume 13063, pp. 149–154. [Google Scholar]
- Long, Y.; Xia, G.S.; Li, S.; Yang, W.; Yang, M.Y.; Zhu, X.X.; Zhang, L.; Li, D. On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and million-aid. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4205–4230. [Google Scholar] [CrossRef]
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part III. Springer: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]
Per-category IoU (↑) is reported in the class columns (class groups: Vehicles, Artifacts, Fields).
Method | Year | mIoU ↑ | LV | SV | Plane | HC | Ship | ST | Bridge | RA | Harbor | BD | TC | GTF | SBF | SP | BC
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
UNet [81] | 2015 | ||||||||||||||||
PSPNet [91] | 2017 | ||||||||||||||||
DeepLabV3 [92] | 2017 | ||||||||||||||||
DeepLabV3+ [24] | 2018 | ||||||||||||||||
HRNet [69] | 2019 | ||||||||||||||||
FarSeg [11] | 2020 | 63.7 | 60.6 | 46.3 | 82.0 | 35.8 | 65.4 | 61.8 | 36.7 | 71.4 | 53.9 | 77.7 | 86.4 | 56.7 | 72.5 | 51.2 | 62.1 |
HMANet [76] | 2021 | 62.6 | 59.7 | 50.3 | 83.8 | 32.6 | 65.4 | 70.9 | 29.0 | 62.9 | 51.9 | 74.7 | 88.7 | 54.6 | 70.2 | 51.4 | 60.5 |
PFNet [70] | 2021 | 66.9 | 64.6 | 50.2 | 85.0 | 37.9 | 70.3 | 74.7 | 45.2 | 71.7 | 59.3 | 77.8 | 87.7 | 59.5 | 75.4 | 50.1 | 62.2 |
Segformer [58] | 2021 | 65.6 | 64.7 | 51.3 | 85.1 | 40.3 | 70.8 | 73.9 | 40.8 | 60.9 | 56.9 | 74.6 | 87.9 | 58.9 | 75.0 | 51.2 | 59.1 |
FactSeg [72] | 2022 | 64.8 | 62.7 | 49.5 | 84.1 | 42.7 | 68.3 | 56.8 | 36.3 | 69.4 | 55.7 | 78.4 | 88.9 | 54.6 | 73.6 | 51.5 | 64.9 |
BSNet [73] | 2022 | 63.4 | 63.4 | 46.6 | 81.8 | 31.8 | 65.3 | 69.1 | 41.3 | 70.0 | 57.3 | 76.1 | 86.8 | 50.3 | 70.2 | 48.8 | 55.9 |
AANet [71] | 2022 | 66.6 | 63.2 | 48.7 | 84.6 | 41.8 | 71.2 | 65.7 | 40.2 | 72.4 | 57.2 | 80.5 | 88.8 | 60.5 | 73.5 | 52.3 | 65.4 |
RSP-Swin-T [77] | 2022 | 64.1 | 62.0 | 50.6 | 85.2 | 37.6 | 67.0 | 74.6 | 44.3 | 64.9 | 53.8 | 73.7 | 70.7 | 60.1 | 76.2 | 46.8 | 59.0 |
Ringmo [80] | 2022 | 67.2 | 63.9 | 51.2 | 85.7 | 40.1 | 73.5 | 73.0 | 43.2 | 67.3 | 58.9 | 77.0 | 89.1 | 63.0 | 78.5 | 48.9 | 62.5 |
RSSFormer [78] | 2023 | 65.9 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
W-Net [93] | 2023 | 63.7 | 55.9 | 64.8 | 50.6 | 18.6 | 88.9 | 42.1 | 61.5 | 56.7 | 67.8 | 59.7 | 72.1 | 43.2 | 80.0 | 29.5 | 44.4 |
FarSeg++ [94] | 2023 | 67.9 | 65.9 | 53.6 | 86.5 | 42.7 | 71.7 | 65.7 | 41.8 | 75.8 | 62.0 | 76.0 | 89.7 | 59.4 | 75.8 | 53.6 | 66.6 |
MSAug [95] | 2024 | 68.4 | 67.5 | 52.6 | 85.3 | 41.7 | 71.7 | 74.6 | 46.1 | 72.2 | 60.3 | 79.1 | 89.8 | 60.4 | 77.6 | 52.3 | 64.4 |
PFMFS [96] | 2024 | 67.3 | 62.8 | 49.7 | 84.2 | 43.1 | 68.5 | 68.7 | 37.1 | 72.6 | 56.3 | 80.0 | 89.2 | 56.9 | 74.1 | 52.8 | 65.4 |
AerialFormer-T | – | 67.5 | 67.0 | 52.6 | 86.1 | 42.0 | 68.6 | 74.9 | 45.3 | 73.0 | 58.2 | 77.5 | 88.8 | 57.5 | 75.1 | 50.5 | 63.4 |
AerialFormer-S | – | 68.4 | 66.5 | 53.6 | 86.5 | 40.0 | 72.1 | 74.1 | 44.8 | 74.0 | 60.9 | 78.8 | 89.2 | 59.5 | 77.0 | 52.1 | 66.5 |
AerialFormer-B | – | 69.3 | 67.8 | 53.7 | 86.5 | 46.7 | 75.1 | 76.3 | 46.8 | 66.1 | 60.8 | 81.5 | 89.8 | 65.0 | 78.3 | 52.4 | 62.4 |
Per-category F1 scores (↑) are reported in the last six columns.
Method | Year | mIoU ↑ | OA ↑ | mF1 ↑ | Imp. Surf. | Building | Low Veg. | Tree | Car | Clutter
---|---|---|---|---|---|---|---|---|---|---
FCN [21] | 2015 | 64.2 | — | 75.9 | 87.6 | 91.6 | 77.8 | 84.6 | 73.5 | 40.3 |
PSPNet [91] | 2017 | 77.1 | 90.1 | 85.6 | 92.6 | 96.2 | 86.2 | 88.0 | 95.3 | 55.4 |
DeeplabV3 [92] | 2017 | 77.2 | 90.0 | 85.6 | 92.4 | 95.9 | 86.4 | 87.6 | 94.9 | 56.7 |
UPerNet [97] | 2018 | 76.8 | 89.7 | 85.6 | 92.5 | 95.5 | 85.5 | 87.5 | 94.9 | 58.0 |
DeepLabV3+ [24] | 2018 | 77.1 | 90.1 | 85.6 | 92.6 | 96.4 | 86.3 | 87.8 | 95.4 | 55.1 |
Denseaspp [25] | 2018 | 64.7 | — | 76.4 | 87.3 | 91.1 | 76.2 | 83.4 | 77.1 | 43.3 |
DANet [98] | 2019 | 65.3 | — | 77.1 | 88.5 | 92.7 | 78.8 | 85.7 | 73.7 | 43.2 |
EMANet [99] | 2019 | 65.6 | — | 77.7 | 88.2 | 92.7 | 78.0 | 85.7 | 72.7 | 48.9 |
CCNet [46] | 2019 | 64.3 | — | 75.9 | 88.3 | 92.5 | 78.8 | 85.7 | 73.9 | 36.3 |
SCAttNet V2 [100] | 2020 | 68.3 | 88.0 | 78.4 | 81.8 | 88.8 | 72.5 | 66.3 | 80.3 | 20.2 |
PFNet [70] | 2021 | 75.4 | — | 84.8 | 91.5 | 95.9 | 85.4 | 86.3 | 91.1 | 58.6 |
Segformer [58] | 2021 | 78.0 | 90.5 | 86.4 | 92.9 | 96.4 | 86.9 | 88.1 | 95.2 | 58.9 |
LOGCAN++ [101] | 2023 | 78.6 | - | 86.6 | 87.5 | 93.8 | 77.2 | 79.8 | 93.1 | 40.1 |
SAANet [102] | 2023 | 73.8 | 88.2 | 83.6 | 83.4 | 90.8 | 72.5 | 74.5 | 84.1 | 37.5 |
MCAT-UNet [103] | 2024 | 75.4 | 83.3 | 84.8 | 84.6 | 92.5 | 74.3 | 76.3 | 83.8 | 41.5 |
AerialFormer-T | — | 79.5 | 91.1 | 87.5 | 93.5 | 96.9 | 87.2 | 89.0 | 95.9 | 62.5 |
AerialFormer-S | — | 79.3 | 91.3 | 87.2 | 93.5 | 97.0 | 87.7 | 88.9 | 96.0 | 60.2 |
AerialFormer-B | — | 79.7 | 91.4 | 87.6 | 93.5 | 97.2 | 88.1 | 89.3 | 95.7 | 61.9 |
Per-category F1 scores (↑) are reported in the last five columns.
Method | Year | mIoU ↑ | OA ↑ | mF1 ↑ | Imp. Surf. | Building | Low Veg. | Tree | Car
---|---|---|---|---|---|---|---|---|---
DeepLabV3+ [24] | 2018 | 81.7 | 89.6 | 89.8 | 92.3 | 95.5 | 85.7 | 86.0 | 89.4 |
DANet [98] | 2019 | — | 89.7 | 89.1 | 91.6 | 96.4 | 86.1 | 88.0 | 83.5 |
LANet [104] | 2020 | — | 90.8 | 92.0 | 93.1 | 97.2 | 87.3 | 88.0 | 94.2 |
S-RA-FCN [75] | 2020 | 72.5 | 88.5 | 89.6 | 90.7 | 94.2 | 83.8 | 85.8 | 93.6 |
FFPNet [105] | 2020 | 86.2 | 91.1 | 92.4 | 93.6 | 96.7 | 87.3 | 88.1 | 96.5 |
ResT [106] | 2021 | 85.2 | 90.6 | 91.9 | 92.7 | 96.1 | 87.5 | 88.6 | 94.8 |
ABCNet [107] | 2021 | 86.5 | 91.3 | 92.7 | 93.5 | 96.9 | 87.9 | 89.1 | 95.8 |
Segmenter [56] | 2021 | 80.7 | 88.7 | 89.2 | 91.5 | 95.3 | 85.4 | 85.0 | 88.5 |
TransUNet [79] | 2021 | 86.1 | — | 88.1 | 92.4 | 94.9 | 82.9 | 88.9 | 91.3 |
HMANet [76] | 2021 | 87.3 | 92.2 | 93.2 | 93.9 | 97.6 | 88.7 | 89.1 | 96.8 |
DC-Swin [16] | 2022 | 87.6 | 92.0 | 93.3 | 94.2 | 97.6 | 88.6 | 89.6 | 96.3 |
BSNet [73] | 2022 | 77.5 | 90.7 | 91.5 | 92.4 | 95.6 | 86.8 | 88.1 | 94.6 |
UNetFormer [15] | 2022 | 86.8 | 91.3 | 92.8 | 93.6 | 97.2 | 87.7 | 88.9 | 96.5 |
FT-UNetformer [15] | 2022 | 87.5 | 92.0 | 93.3 | 93.9 | 97.2 | 88.8 | 89.8 | 96.6 |
UperNet RSP-Swin-T [77] | 2022 | — | 90.8 | 90.0 | 92.7 | 96.4 | 86.0 | 85.4 | 89.8 |
UperNet-RingMo [80] | 2022 | — | 91.7 | 91.3 | 93.6 | 97.1 | 87.1 | 86.4 | 92.2 |
Hi-ResNet [108] | 2023 | 86.1 | 91.1 | 92.4 | 93.2 | 96.5 | 87.9 | 88.6 | 96.1 |
MetaSegNet [109] | 2023 | 87.5 | 92.1 | 93.2 | 94.6 | 97.3 | 88.1 | 89.7 | 96.4 |
ESDINet [110] | 2024 | 85.3 | 90.5 | 92.0 | 92.7 | 96.3 | 87.3 | 88.1 | 95.4 |
AerialFormer-T | — | 88.5 | 93.5 | 93.7 | 95.2 | 98.0 | 89.1 | 89.1 | 97.3 |
AerialFormer-S | — | 88.6 | 93.6 | 93.8 | 95.3 | 98.1 | 89.2 | 89.1 | 97.4 |
AerialFormer-B | — | 89.0 | 93.8 | 94.0 | 95.4 | 98.0 | 89.6 | 89.7 | 97.4 |
Per-category IoU (↑) is reported in the last seven columns.
Method | Year | mIoU ↑ | Background | Building | Road | Water | Barren | Forest | Agriculture
---|---|---|---|---|---|---|---|---|---
FCN [21] | 2015 | 46.7 | 42.6 | 49.5 | 48.1 | 73.1 | 11.8 | 43.5 | 58.3 |
UNet [81] | 2015 | 47.8 | 43.1 | 52.7 | 52.8 | 73.1 | 10.3 | 43.1 | 59.9 |
LinkNet [111] | 2017 | 48.5 | 43.6 | 52.1 | 52.5 | 76.9 | 12.2 | 45.1 | 57.3 |
SegNet [112] | 2017 | 47.3 | 41.8 | 51.8 | 51.8 | 75.4 | 10.9 | 42.9 | 56.7 |
UNet++ [113] | 2018 | 48.2 | 42.9 | 52.6 | 52.8 | 74.5 | 11.4 | 44.4 | 58.8 |
DeeplabV3+ [24] | 2018 | 47.6 | 43.0 | 50.9 | 52.0 | 74.4 | 10.4 | 44.2 | 58.5 |
FarSeg [11] | 2020 | 48.2 | 43.4 | 51.8 | 53.3 | 76.1 | 10.8 | 43.2 | 58.6 |
TransUNet [79] | 2021 | 48.9 | 43.0 | 56.1 | 53.7 | 78.0 | 9.3 | 44.9 | 56.9 |
Segmenter [56] | 2021 | 47.1 | 38.0 | 50.7 | 48.7 | 77.4 | 13.3 | 43.5 | 58.2 |
Segformer [58] | 2021 | 49.1 | 42.2 | 56.4 | 50.7 | 78.5 | 17.2 | 45.2 | 53.8 |
DC-Swin [16] | 2022 | 50.6 | 41.3 | 54.5 | 56.2 | 78.1 | 14.5 | 47.2 | 62.4 |
ViTAE-B+RVSA [114] | 2022 | 52.4 | — | — | — | — | — | — | — |
FactSeg [72] | 2022 | 48.9 | 42.6 | 53.6 | 52.8 | 76.9 | 16.2 | 42.9 | 57.5 |
UNetFormer [15] | 2022 | 52.4 | 44.7 | 58.8 | 54.9 | 79.6 | 20.1 | 46.0 | 62.5 |
RSSFormer [78] | 2023 | 52.4 | 52.4 | 60.7 | 55.2 | 76.3 | 18.7 | 45.4 | 58.3 |
LOGCAN++ [101] | 2023 | 53.4 | 47.4 | 58.4 | 56.5 | 80.1 | 18.4 | 47.9 | 64.8 |
Hi-ResNet [108] | 2023 | 52.5 | 46.7 | 58.3 | 55.9 | 80.1 | 17.0 | 46.7 | 62.7 |
MetaSegNet [109] | 2023 | 52.2 | 44.0 | 57.9 | 58.1 | 79.9 | 18.2 | 47.7 | 59.4 |
ESDINet [110] | 2024 | 50.1 | 41.6 | 53.8 | 54.8 | 78.7 | 19.5 | 44.2 | 58.0 |
GDformer [115] | 2024 | 52.2 | 45.1 | 57.6 | 56.6 | 79.7 | 17.9 | 45.8 | 62.2 |
AerialFormer-T | — | 52.0 | 45.2 | 57.8 | 56.5 | 79.6 | 19.2 | 46.1 | 59.5 |
AerialFormer-S | — | 52.4 | 46.6 | 57.4 | 57.3 | 80.5 | 15.6 | 46.8 | 62.8 |
AerialFormer-B | — | 54.1 | 47.8 | 60.7 | 59.3 | 81.5 | 17.9 | 47.9 | 64.0 |
Method | Stem | MDC | Params (M) | mIoU | OA w/out bg | OA |
---|---|---|---|---|---|---
Baseline | ✗ | ✗ | 44.92 | 66.7 | 74.25 | 99.03 |
Stem-only | ✓ | ✗ | 45.08 | 66.9 | 75.06 | 99.05 |
MDC-only | ✗ | ✓ | 42.56 | 67.2 | 75.60 | 99.05 |
AerialFormer-T | ✓ | ✓ | 42.71 | 67.5 | 75.89 | 99.06 |
Methods | Params (M) | FLOPs (G) | Inference Time (s) | mIoU
---|---|---|---|---
PSPNet [91] | 12.8 | 54.26 | 0.0097 | 60.3 |
DeepLabV3+ [24] | 12.5 | 54.21 | 0.010 | 61.4
Segformer [58] | 3.72 | 6.38 | 0.0089 | 65.6 |
UNet [81] | 31.0 | 184.6 | 0.020 | 65.5
SwinUNet [117] | 41.4 | 237.4 | 0.021 | 68.3 |
TransUNet [79] | 90.7 | 233.7 | 0.023 | 70.3
DC-SwinS [16] | 66.9 | - | 0.030 | 87.6 |
UNetFormer-SwinB [15] | 96.0 | - | 0.043 | 87.5
AerialFormer-T | 42.7 | 49.0 | 0.020 | 88.5 |
AerialFormer-S | 64.0 | 72.2 | 0.028 | 88.6 |
AerialFormer-B | 113.8 | 126.8 | 0.047 | 89.0 |