Article

AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation

by Taisei Hanyu 1,*,†, Kashu Yamazaki 1,†, Minh Tran 1, Roy A. McCann 1, Haitao Liao 1, Chase Rainwater 1, Meredith Adkins 2, Jackson Cothren 1 and Ngan Le 1
1 College of Engineering, University of Arkansas, Fayetteville, AR 72701, USA
2 Institute for Integrative and Innovative Research, University of Arkansas, Fayetteville, AR 72701, USA
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Remote Sens. 2024, 16(16), 2930; https://doi.org/10.3390/rs16162930
Submission received: 26 June 2024 / Revised: 5 August 2024 / Accepted: 6 August 2024 / Published: 9 August 2024
(This article belongs to the Special Issue Deep Learning and Computer Vision in Remote Sensing-III)

Abstract:
When performing remote sensing image segmentation, practitioners often encounter various challenges, such as a strong foreground–background imbalance, the presence of tiny objects, high object density, intra-class heterogeneity, and inter-class homogeneity. To overcome these challenges, this paper introduces AerialFormer, a hybrid model that strategically combines the strengths of Transformers and Convolutional Neural Networks (CNNs). AerialFormer features an integrated CNN Stem module that preserves low-level, high-resolution features, enhancing the model’s capability to process the details of aerial imagery. The proposed AerialFormer is designed with a hierarchical structure, in which a Transformer encoder generates multi-scale features and a multi-dilated CNN (MDC) decoder aggregates the information from the multi-scale inputs. As a result, information is taken into account in both local and global contexts, so that powerful representations and high-resolution segmentation can be achieved. The proposed AerialFormer was evaluated on three benchmark datasets: iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that the proposed AerialFormer remarkably outperforms state-of-the-art methods.

Graphical Abstract

1. Introduction

The use of aerial images provides a view of the Earth from above, which consists of various geospatial objects such as cars, buildings, airplanes, ships, etc., and allows us to regularly monitor certain large areas of the planet. Recent advances in sensor technology have promoted the potential use of remote sensing images in broader applications, attributed to the ability to capture high-spatial resolution (HSR) images with abundant spatial details and rich potential semantic content. Aerial image segmentation (AIS) is a particular semantic segmentation task that aims to assign a semantic category to each image pixel. Thus, AIS plays an important role in the understanding and analysis of remote sensing data, offering both semantic and localization cues for targets of interest. Understanding and analyzing these objects from the top-down perspective offered by remote sensing imagery is crucial for urban monitoring and planning. This understanding finds utility in numerous practical urban-related applications, such as disaster monitoring [1], agricultural planning [2], street view extraction [3,4], land change [5,6,7], land cover [8], climate change [9], deforestation [10], etc. However, due to the large size of aerial images and limited sensor bandwidth, several challenging characteristics need to be investigated. Figure 1 delineates five principal challenges, as follows: (i) background–foreground imbalance [11], characterized by a disproportionate ratio of foreground to background elements (2.86%/97.14%); (ii) the presence of tiny objects [12], defined as those with dimensions less than 32 × 32 pixels according to the absolute size definition introduced by [13]; (iii) high object density [12], where objects are closely aggregated; (iv) intra-class heterogeneity [14], denoting the variability within a single category in terms of shape, texture, color, scale, and structure, exemplified by tennis courts displaying diverse appearances; and (v) inter-class homogeneity [14], indicating the visual similarities shared among objects of distinct categories, such as the comparable appearances of tennis and basketball courts.
Recently, many studies [15,16] have adopted neural networks, including Convolutional Neural Networks (CNNs) and Transformers, for AIS due to the great success shown in the natural image domain. However, many existing image segmentation methods in computer vision, proposed for use cases such as self-driving vehicles [17] and medical imaging [18], are unable to effectively address the five characteristics of AIS within a reasonable computational budget. Typical image backbone models [19,20], for example, involve a 1/4 downsampling process that does not prioritize tiny, densely packed objects, frequently resulting in their oversight. Therefore, when designing a segmentation model for remote sensing imagery, it is crucial to focus on addressing the unique challenges specific to this domain. CNNs are based on using convolution to compute the local correlation among neighboring pixels. Consequently, CNNs are good at extracting high-frequency components and localized structures. However, this property leads to locality and strong inductive biases. Transformers, on the other hand, treat images as a sequence of embedded patches and model the global correlation among patches with self-attention mechanisms. Consequently, Transformers are good at capturing low-frequency components and the global structure. Generally, CNNs and Transformers exhibit opposite behaviors, where CNNs act like high-pass filters and Transformers act like low-pass filters. This analysis shows that CNNs and Transformers are naturally complementary to each other. Thus, combining CNNs and Transformers can overcome the weaknesses of the two models and strengthen their advantages simultaneously. To alleviate tiny- and dense-object difficulties, it is beneficial to utilize detailed local features captured by CNNs. For background–foreground imbalance, intra-class heterogeneity, and inter-class homogeneity, it is important to utilize strong semantic representations at both the local level (e.g., boundary) from CNNs and the global context level (e.g., the relationship between objects/classes) from Transformers.
To address the five challenges of (i) background–foreground imbalance, (ii) the presence of tiny objects, (iii) high object density, (iv) intra-class heterogeneity, and (v) inter-class homogeneity, this paper proposes AerialFormer. The proposed approach effectively combines the robust feature extraction capabilities of CNNs with the advanced contextual understanding offered by Transformers, while maintaining an acceptable increase in parameters. The proposed AerialFormer integrates three key components to address the five challenges. The CNN Stem is crucial for high-resolution feature extraction, targeting the precise identification of tiny and densely packed objects. By maintaining half the size of the input image, this module serves larger, more detailed features to the decoder. The Transformer Encoder, leveraging the Swin Transformer, focuses on capturing complex global relationships. Lastly, the multi-dilated CNN Decoder combines local and global context to improve detail recognition and accuracy at various scales. This module ensures computational efficiency comparable to that of plain convolution while benefiting from diverse feature representations. The combination of the Transformer Encoder and multi-dilated CNN (MDC) Decoder effectively addresses issues such as background–foreground imbalance, intra-class heterogeneity, and inter-class homogeneity. Our contributions are summarized as follows:
  • The proposed approach incorporates a high-resolution CNN Stem that preserves half the input image size, providing larger and more detailed features to the decoder. This improvement enhances the segmentation of tiny and densely packed objects.
  • This paper introduces a unique multi-dilated CNN Decoder that efficiently integrates both the local and global context. This module utilizes chopped channel-wise feature maps with three distinct filters, maintaining computational efficiency while enhancing the diversity of feature representations.
  • The proposed method demonstrates the effectiveness of combining a Transformer Encoder with a multi-dilated CNN Decoder and CNN Stem. This integration successfully addresses challenges such as background–foreground imbalance, intra-class heterogeneity, inter-class homogeneity, and the segmentation of tiny and dense objects in aerial image segmentation tasks.

2. Related Works

Generally, image segmentation is categorized into three tasks: instance segmentation, semantic segmentation, and panoptic segmentation. Each of these tasks is distinguished on the basis of its respective semantic considerations. This work focuses on the second task of semantic segmentation, a form of dense prediction tasks where each pixel of an image is associated with a class label. Unlike instance segmentation, it does not distinguish each individual instance of the same object class. The goal of semantic segmentation is to divide an image into several visually meaningful or interesting areas for visual understanding according to semantic information. Semantic segmentation plays an important role in a broad range of applications, e.g., scene understanding, medical image analysis, autonomous driving, video surveillance, robot perception, satellite image segmentation, agriculture analysis, etc. This section begins by reviewing DL-based semantic image segmentation and the advancements made in computer vision with Transformers. It then shifts focus to a review of aerial image segmentation using deep neural networks.

2.1. DL-Based Image Segmentation

Convolutional Neural Networks (CNNs) are widely regarded as the de facto standard for various tasks within the field of computer vision. Long et al. [21] showed that fully convolutional networks can be used to segment images without fully connected layers, and they have become one of the principal networks for semantic segmentation. With the advancements brought by fully convolutional networks into semantic segmentation, many improvements have been achieved by designing the network deeper, wider, or more effective. This includes enlarging the receptive field [22,23,24,25,26,27], strengthening context cues [27,28,29,30,31,32,33,34,35,36], leveraging boundary information [18,37,38,39,40,41], and incorporating neural attention [42,43,44,45,46,47,48,49,50]. Recently, a new paradigm of neural network architecture that does not employ any convolutions and mainly relies on a self-attention mechanism, called the Transformer, has been rapidly adopted for computer vision (CV) tasks [51,52,53] and has achieved promising performance. The core idea behind the Transformer architecture [54] is the self-attention mechanism used to capture long-range relationships. In addition, Transformers can be easily parallelized, facilitating training on larger datasets. The Vision Transformer (ViT) [20] is considered one of the first works that applied the standard Transformer to vision tasks. Unlike the CNN structure, the ViT processes a 2D image as a 1D sequence of image patches. Thanks to the powerful sequence-to-sequence modeling ability of the Transformer, the ViT demonstrates superior characterization for extracting global context, especially in lower-level features, compared to its CNN counterparts. Recent advancements in Transformers over the past few years have demonstrated their effectiveness as backbone networks for visual tasks, surpassing the performances of numerous CNN-based models trained on large datasets. Transformer-based image segmentation approaches [55,56,57,58,59,60] inherit the flexibility of Transformers in modeling long-range dependencies, yielding remarkable results. Transformers have been applied with notable success across a variety of computer vision tasks, including image recognition [20,61], object detection [51,62,63], image segmentation [57,59,64], action localization [65,66], and video captioning [67,68], thereby showcasing their capability to augment global information.

2.2. Aerial Image Segmentation

Computer vision techniques have long been employed for the analysis of satellite images. Historically, satellite images had a lower resolution, and the goal of segmentation was primarily to identify boundaries such as straight lines and curves in aerial pictures. However, modern satellite imagery possesses a significantly higher resolution, and consequently, the demands of segmentation tasks have substantially increased to include the segmentation of tiny objects, objects with substantial scale variation, and entities exhibiting visual ambiguities. To this end, fully convolutional networks and their variants have become the mainstream solution for aerial image segmentation and have led to state-of-the-art performances across numerous datasets [24,69,70,71,72,73]. To capture contextual interrelations among pixels in remote sensing images, techniques from natural language processing have also been incorporated into aerial image segmentation [74]. By imitating the channel attention mechanism [45], S-RA-FCN [75] employs a spatial relation module to capture global spatial relations, and [76] introduced HMANet with spatial interaction while balancing the size of the receptive field against the computation cost. In HMANet, a region shuffle attention module is proposed to improve the efficiency of the self-attention mechanism by reducing redundant features and forming region-wise representations. In recent years, the advancements in Transformer-based networks, which leverage self-attention mechanisms to achieve receptive fields as large as the entire image, have sparked increased interest in their applications. Consequently, there has been a surge in research studies [15,16,58,77,78,79,80] that have integrated Transformers into remote sensing applications. In recent state-of-the-art models, hybrid architectures that combine Transformers and CNNs have demonstrated significant progress. RSSFormer [78] introduces an Adaptive Transformer Fusion Module that employs Adaptive Multi-Head Attention (AMHA) and an MLP with Dilated Convolution to mitigate background noise, enhance object saliency, and capture broader contextual information during multi-scale feature fusion. DC-Swin [16] introduces a decoder with a Densely Connected Feature Aggregation Module (DCFAM) to aggregate multi-scale features from the Swin Transformer encoder. By incorporating dense connections and attention mechanisms, DC-Swin enhances the ability to capture and utilize multi-scale information and relation-enhanced context. UNetFormer [15] introduces a Global–Local Transformer Block (GLTB), which incorporates efficient global–local attention using an attention-based branch and a convolutional branch.
This paper introduces AerialFormer, an innovative fusion of a Transformer encoder enhanced by CNN Stem and a multi-dilated CNN decoder. Although Transformer-based approaches excel at modeling long-range dependencies, they face challenges in capturing local details and struggle to handle tiny objects. Thus, the proposed AerialFormer incorporates a CNN Stem module to effectively support the Transformer encoders and multi-dilated convolution to capture long-range dependence without increasing the memory footprint at the decoder. The proposed novel AerialFormer combines the strengths of a Transformer encoder and a multi-dilated CNN decoder, aided by skip connections, to capture both local context and long-range dependencies effectively in aerial image segmentation.

3. Methods

3.1. Network Overview

An overview of the AerialFormer architecture is presented in Figure 2. The architecture design is fundamentally rooted in the renowned UNet structure for semantic segmentation [81], characterized by its encoder–decoder network with skip connections between matched blocks of identical spatial resolution on the encoder and decoder sides. The composition of the model is threefold: a CNN Stem, a Transformer Encoder, and a multi-dilated CNN Decoder. The CNN Stem is designed as a complementary module to the Transformer Encoder that generates locally specific features preserving low-level information at high resolution, which are subsequently passed to the final concatenation in the decoder, ensuring that the model retains focus on (ii) tiny and (iii) densely packed objects. The Transformer Encoder is designed as a sequence of $s$ stages of Transformer Encoder Blocks ($s$ is set to 4 in the architecture) aimed at extracting long-range context representations. The multi-dilated CNN (MDC) decoder consists of $s + 1$ MDC blocks with skip connections to obtain information from multiple scales and wide contexts. The MDC module effectively decodes the rich feature maps of the Stem module and the Transformer Encoder, addressing the challenges related to distinguishing similar objects within (i) complex backgrounds and handling (iv) intra-class heterogeneity and (v) inter-class homogeneity. These components are detailed in the following subsections. Given a high-resolution aerial image, it is first partitioned into a set of overlapping sub-images of size $H \times W \times 3$, where 3 corresponds to the three color channels. Then, each sub-image is fed to AerialFormer, and the output is a segmentation map of size $H \times W$.

3.2. CNN Stem

This work proposes a simple yet effective way to inject the local and detailed features of the input image into our decoder through the CNN Stem module. The proposed CNN Stem serves two crucial purposes in the architecture. Firstly, it generates larger spatial feature maps to preserve the fine-grained details of tiny objects, which are often lost during the downsampling process inherent in encoders. Secondly, the Stem module complements the Transformer-based encoder by capturing local features, leveraging the inherent strengths of convolutions. These functionalities of the CNN Stem empower the model to effectively handle (ii) tiny objects and (iii) high object density. This module is expected to model the local spatial contexts of images in parallel with the patch embedding layer. As shown in Figure 3, our CNN Stem consists of four convolution layers, each followed by BatchNorm [82] and GELU [83] activation layers. The first 3 × 3 convolutional layer with a stride of 2 × 2 reduces the input spatial size by half, and through the following three layers of 3 × 3 convolution with a stride of 1 × 1, the local features for tiny and dense objects are obtained.
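As a concrete illustration, the following PyTorch sketch shows one way the CNN Stem could be realized under the description above (four 3 × 3 convolutions with BatchNorm and GELU, the first with stride 2); the channel width is an illustrative assumption rather than a value taken from the paper.

```python
import torch.nn as nn

class CNNStem(nn.Module):
    """Minimal sketch of the CNN Stem described above (not the official code):
    four 3x3 convolutions, each followed by BatchNorm and GELU; the first layer
    uses stride 2 to halve the spatial resolution."""
    def __init__(self, in_channels=3, out_channels=64):  # out_channels is an assumed value
        super().__init__()
        def conv_bn_gelu(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
                nn.BatchNorm2d(c_out),
                nn.GELU(),
            )
        self.layers = nn.Sequential(
            conv_bn_gelu(in_channels, out_channels, stride=2),   # H x W -> H/2 x W/2
            conv_bn_gelu(out_channels, out_channels, stride=1),
            conv_bn_gelu(out_channels, out_channels, stride=1),
            conv_bn_gelu(out_channels, out_channels, stride=1),
        )

    def forward(self, x):  # x: (B, 3, H, W) -> (B, out_channels, H/2, W/2)
        return self.layers(x)
```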

3.3. Transformer Encoder

The Transformer Encoder starts by processing an input image of size $H \times W \times 3$, which is tokenized by the Patch Embedding layer, resulting in a feature map of size $\frac{H}{p} \times \frac{W}{p} \times C$. The feature map is then passed through a sequence of $s = 4$ Transformer Encoder Blocks and produces multi-level outputs of different sizes at each block: $\frac{H}{4} \times \frac{W}{4} \times C$, $\frac{H}{8} \times \frac{W}{8} \times 2C$, $\frac{H}{16} \times \frac{W}{16} \times 4C$, and $\frac{H}{32} \times \frac{W}{32} \times 8C$. Each Transformer Encoder Block is followed by a Patch Merging layer, which reduces the spatial dimension by half before the features are passed to the next, deeper Transformer Encoder Block.

3.3.1. Patch Embedding

The Transformer Encoder starts by taking an image of size $H \times W \times 3$ as input and dividing it into patches of size $p \times p$ in a non-overlapping manner. Each patch is embedded into a vector in the space $\mathbb{R}^C$ by a linear projection, which can be simplified as a single convolution operation with a kernel size of $p \times p$ and a stride of $p \times p$. Patch embedding produces feature maps of size $\frac{H}{p} \times \frac{W}{p} \times C$. The patch size determines the spatial resolution of the input sequence of the Transformer, and therefore, a smaller patch size is favored for dense prediction tasks, including semantic segmentation. Although ViT [20], which processes $16 \times 16$ patches and is able to capture a wider range of context, is a commonly used vision Transformer in computer vision, it may not be suitable for capturing detailed information. One of the most challenging aspects of aerial image segmentation is dealing with tiny objects. On the other hand, the Swin Transformer [84], one of the Transformer variants, utilizes a smaller patch size of $4 \times 4$. Thus, the Swin Transformer [84] was adopted to implement the patch embedding layer to better capture the detailed information of tiny objects in aerial image segmentation.
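To make the equivalence concrete, the sketch below implements patch embedding as a single strided convolution, assuming PyTorch; the embedding dimension of 96 matches the $C$ used by the tiny and small variants later in the paper, while the rest is an illustrative assumption.

```python
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal sketch: non-overlapping p x p patch embedding realized as one
    convolution whose kernel size and stride are both p (here p = 4)."""
    def __init__(self, in_channels=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):  # x: (B, 3, H, W) -> (B, C, H/p, W/p)
        return self.proj(x)
```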

3.3.2. Transformer Encoder Block

In general, let $x \in \mathbb{R}^{h \times w \times d}$ denote the input of a Transformer Encoder Block. The Transformer Encoder Block processes the input with a series of self-attention operations and a feed-forward network, each with a residual connection. To compensate for the increase in computation caused by the smaller patch size, the Swin Transformer [84] utilizes local self-attention instead of global self-attention. Global self-attention, used in standard Transformers, has a computational cost of $O(N^2 \cdot d)$, where $N$ is the number of tokens (i.e., $N = h \times w$) and $d$ is the representation dimension, which can be prohibitively expensive for large images and small patch sizes. The Swin Transformer introduces window-based self-attention (WSA), which divides the image into non-overlapping windows and performs self-attention within each window. With WSA, the computational cost is linear in the number of tokens, i.e., $O(M^2 \cdot N \cdot d)$, where $M^2$ is the number of patches within a window and $M^2 \ll N$. In order to apply the WSA, an input $x \in \mathbb{R}^{h \times w \times d}$ is partitioned into a group of local windows $x \in \mathbb{R}^{\frac{h \times w}{M^2} \times M^2 \times d}$, and the first dimension $\frac{h \times w}{M^2}$ is treated as a batch dimension, i.e., the network parameters are shared along the first dimension. Considering the multi-head attention operation with $h$ heads, the feature dimension $d$ is split into $h$ identical blocks, i.e., $\mathbb{R}^{\frac{h \times w}{M^2} \times M^2 \times \frac{d}{h} \times h}$. Then, the WSA can be formulated as
$$\mathrm{WSA}(x) = [\mathrm{head}_1; \ldots; \mathrm{head}_h] \, W^O$$
where $[\,\cdot\,; \ldots ;\,\cdot\,]$ denotes the channel-wise concatenation of the tensors, $W^O \in \mathbb{R}^{d \times d}$ denotes the output projection weights, and each head $\mathrm{head}_i$ is calculated as
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d/h}} + B\right) V_i$$
where $Q_i = x_i W_i^Q$, $K_i = x_i W_i^K$, and $V_i = x_i W_i^V \in \mathbb{R}^{M^2 \times \frac{d}{h}}$ are the query, key, and value tensors, which are created from the local window of $M \times M$ patches with $\frac{d}{h}$ feature dimensions by linear projection with the learnable weights $W^Q$, $W^K$, $W^V \in \mathbb{R}^{\frac{d}{h} \times \frac{d}{h}}$. $B \in \mathbb{R}^{M^2 \times M^2}$ is the relative position bias [84] that introduces relative positional information into the model.
Because the WSA applies self-attention to the local window, the WSA alone cannot obtain a global context of the image. To alleviate this issue, the Swin Transformer stacks Transformer blocks using WSA and alternates the window location by half of the window size to gradually build global context by integrating information from different windows. Specifically, the Swin Transformer block consists of a shifted WSA, followed by a 2-layer FFN with a GELU activation function in between, which is formulated as
$$\hat{x}^{l} = x^{l} + \mathrm{WSA}(\mathrm{norm}(x^{l})), \qquad x^{l+1} = \hat{x}^{l} + \mathrm{FFN}(\mathrm{norm}(\hat{x}^{l}))$$
where $\mathrm{norm}$ indicates the LayerNorm [85] operation, $\mathrm{FFN}$ indicates the feed-forward network, and the partition of the input $x$ is shifted by $\left(\frac{M}{2}, \frac{M}{2}\right)$ from the regularly partitioned windows when layer $l$ is even. This process is illustrated in Figure 4. For each Transformer Encoder Block, the total number of layers is denoted as $L_s$.
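The following PyTorch sketch mirrors the residual structure of the equations above and the (shifted) window partitioning, using a standard multi-head attention layer as a stand-in for WSA; the relative position bias and the attention masking that the Swin Transformer applies to shifted windows are omitted for brevity, so this is an illustrative approximation rather than the published implementation.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Illustrative sketch of one (shifted) window-attention block:
    x_hat = x + WSA(norm(x));  x_next = x_hat + FFN(norm(x_hat))."""
    def __init__(self, dim, num_heads, window_size=7, shifted=False):
        super().__init__()
        self.M = window_size
        self.shift = window_size // 2 if shifted else 0
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in for WSA: plain multi-head attention inside each window
        # (no relative position bias, no shifted-window masking).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, H, W, d), with H and W divisible by the window size
        B, H, W, d = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:  # shift the window partition by (M/2, M/2)
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        # Partition into non-overlapping M x M windows treated as a batch dimension.
        win = (x.view(B, H // self.M, self.M, W // self.M, self.M, d)
                .permute(0, 1, 3, 2, 4, 5).reshape(-1, self.M * self.M, d))
        win, _ = self.attn(win, win, win)
        x = (win.reshape(B, H // self.M, W // self.M, self.M, self.M, d)
                .permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, d))
        if self.shift:  # undo the cyclic shift
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x
        return x + self.ffn(self.norm2(x))
```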

3.3.3. Patch Merging

In order to generate a hierarchical representation, the spatial resolution of each Transformer Encoder Block is reduced by half through the patch merging layer. The patch merging layer takes as input a feature map $x \in \mathbb{R}^{h \times w \times d}$. The layer first splits and gathers the features in a checkerboard pattern, creating four sub-feature maps $x_1$ to $x_4$ with half of the spatial dimension of the original feature map, where $x_1$ contains pixels from ‘black’ squares in even rows, $x_2$ from ‘white’ squares in even rows, $x_3$ from ‘black’ squares in odd rows, and $x_4$ from ‘white’ squares in odd rows. Then, these four feature maps are concatenated along the channel dimension, resulting in a tensor of size $\frac{h}{2} \times \frac{w}{2} \times 4d$. Finally, a linear projection is applied to reduce the channel dimension from $4d$ to $2d$.
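A minimal PyTorch sketch of this operation is given below; the strided slicing implements the checkerboard gathering, and a linear layer performs the $4d \to 2d$ projection (the LayerNorm before the projection follows the common Swin implementation and is an assumption here).

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Minimal sketch: (B, h, w, d) -> (B, h/2, w/2, 2d). A 2x2 neighborhood is
    gathered into the channel dimension and then projected from 4d to 2d."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):  # x: (B, h, w, d) with even h and w
        x1 = x[:, 0::2, 0::2, :]  # even rows, even columns
        x2 = x[:, 0::2, 1::2, :]  # even rows, odd columns
        x3 = x[:, 1::2, 0::2, :]  # odd rows, even columns
        x4 = x[:, 1::2, 1::2, :]  # odd rows, odd columns
        x = torch.cat([x1, x2, x3, x4], dim=-1)  # (B, h/2, w/2, 4d)
        return self.reduction(self.norm(x))      # (B, h/2, w/2, 2d)
```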

3.4. Multi-Dilated CNN Decoder

The proposed decoder with MDC blocks was designed to address challenges including (i) complex backgrounds, (iv) intra-class heterogeneity, and (v) inter-class homogeneity, effectively decoding rich feature maps from the Stem module and the Swin Transformer. Although local fine-grained features are important for segmenting tiny objects, it is also crucial to consider the global context at the same time. In the decoder, multiple dilated convolutional operations are used in parallel with different dilation rates to obtain a wider context for decoding without any additional parameters. Efficient feature aggregation in the dilated convolutions selectively emphasizes important spatial information, improving object boundary delineation, and reducing class confusion. To incorporate these benefits, the multi-dilated CNN decoder contains a sequence of multi-dilated CNN (MDC) blocks followed by Deconvolutional (Deconv) blocks, which are detailed as follows.

3.4.1. MDC Block

An MDC Block is defined by three parameters $[r_1, r_2, r_3]$ corresponding to three receptive fields and consists of three parts: a Pre-Channel Mixer, a dilated convolutional layer (DCL), and a Post-Channel Mixer.
The MDC Block starts by applying the Pre-Channel Mixer to the input, which is the concatenation of the previous MDC block’s output and the skip connection from the mirrored encoder, in order to exchange the information in the channel dimension. The channel mixing operation can be implemented with any operator that enforces information exchange in the channel dimension. Here, the Pre-Channel Mixer is implemented as a point-wise convolution layer without any normalization or activation layer.
The DCL utilizes three convolutional kernels with different dilation rates of $d_1$, $d_2$, and $d_3$, which allows it to obtain multi-scale receptive fields. The length of one side of the receptive field $r$ of a dilated convolution, given a kernel size $k$ and a dilation rate $d$, is calculated as follows:
$$r_i = d_i (k - 1) + 1$$
where the kernel size $k$ is set to 3 for receptive fields of size $3 \times 3$ or larger and to 1 for smaller receptive fields. For example, a dilation rate of $d_i = 3$ with $k = 3$ yields a receptive field of $r_i = 3(3-1)+1 = 7$. The dilated convolutional operation with a receptive field of $r \times r$ is denoted by $\mathrm{Conv}_r(\cdot)$. Then, the proposed DCL can be formulated as follows:
$$\mathrm{DCL}_{r_1, r_2, r_3}(x) = [\mathrm{Conv}_{r_1}(x_1); \mathrm{Conv}_{r_2}(x_2); \mathrm{Conv}_{r_3}(x_3)]$$
where $x = [x_1; x_2; x_3]$, i.e., the tensor $x$ after the Pre-Channel Mixer is sliced into three sub-tensors of equal channel length. As the features are split and processed by the DCL at three different spatial resolutions, the Post-Channel Mixer is applied to exchange the information from the three convolutional layers. The Post-Channel Mixer is implemented as a sequence of point-wise and $3 \times 3$ convolution layers, each of which is followed by BatchNorm and ReLU activation layers. This lets us formulate the multi-dilated convolution (MDC) block as follows; the entire operation of the MDC Block is illustrated in Figure 5.
$$\mathrm{MDC}(x) = \mathrm{PostMixer}(\mathrm{DCL}_{r_1, r_2, r_3}(\mathrm{PreMixer}(x)))$$
where PreMixer refers to the Pre-Channel Mixer and PostMixer refers to the Post-Channel Mixer.
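To ground the formulation, the following PyTorch sketch assembles a Pre-Channel Mixer, a three-branch DCL, and a Post-Channel Mixer under the assumptions above (kernel size 3 with dilation $d = (r-1)/2$ for receptive fields of 3 or more, and a $1 \times 1$ kernel when $r = 1$); the channel handling details are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class MDCBlock(nn.Module):
    """Illustrative sketch of an MDC block with receptive fields [r1, r2, r3]:
    point-wise Pre-Channel Mixer -> channel-wise split into three sub-tensors,
    each passed through a dilated convolution -> concatenation -> Post-Channel
    Mixer (point-wise and 3x3 convolutions with BatchNorm and ReLU)."""
    def __init__(self, in_channels, out_channels, rfields=(3, 5, 7)):
        super().__init__()
        self.pre_mixer = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        base = in_channels // 3
        self.split_sizes = [base, base, in_channels - 2 * base]
        branches = []
        for r, c in zip(rfields, self.split_sizes):
            if r == 1:
                branches.append(nn.Conv2d(c, c, kernel_size=1))
            else:
                d = (r - 1) // 2  # from r = d * (k - 1) + 1 with k = 3
                branches.append(nn.Conv2d(c, c, kernel_size=3, dilation=d, padding=d))
        self.branches = nn.ModuleList(branches)
        self.post_mixer = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):  # x: concatenation of the upsampled decoder feature and the skip connection
        x = self.pre_mixer(x)
        parts = torch.split(x, self.split_sizes, dim=1)
        x = torch.cat([branch(p) for branch, p in zip(self.branches, parts)], dim=1)
        return self.post_mixer(x)
```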

3.4.2. Deconv Block

The Deconv Block serves to increase the spatial dimensions of the feature map by a factor of two, while concurrently decreasing the channel dimension by half. Concretely, this block takes a feature map of size $x \in \mathbb{R}^{h/2 \times w/2 \times 2d}$ and transforms it into a feature of size $x \in \mathbb{R}^{h \times w \times d}$. This is achieved utilizing a transposed convolution layer followed by BatchNorm and ReLU activation layers. This learnable upsampling operation is applied to the output of the MDC block before concatenating it with the bypassed high-resolution features from the encoder, providing an opposite functionality to the patch merging block.
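A minimal sketch of this block in PyTorch is shown below; the 2 × 2 transposed convolution with stride 2 is an assumed choice that realizes the stated doubling of resolution and halving of channels.

```python
import torch.nn as nn

class DeconvBlock(nn.Module):
    """Minimal sketch: (B, 2d, h/2, w/2) -> (B, d, h, w) via a learnable
    transposed convolution followed by BatchNorm and ReLU."""
    def __init__(self, in_channels):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_channels, in_channels // 2, kernel_size=2, stride=2),
            nn.BatchNorm2d(in_channels // 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.up(x)
```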

3.5. Loss Function

The network is trained using supervised learning with cross-entropy loss, which can be formulated as follows:
$$\mathcal{L}_{CE} = -\sum_{i=1}^{n} t_i \log(p_i)$$
where $t_i$ represents the ground truth, and $p_i$ is the softmax probability for the $i$-th class.
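For reference, a minimal PyTorch snippet of per-pixel cross-entropy over segmentation logits is shown below; the tensor shapes are illustrative (e.g., the six Potsdam classes) and not taken from the training code.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()              # applies log-softmax internally
logits = torch.randn(2, 6, 512, 512)           # (batch, classes, height, width)
target = torch.randint(0, 6, (2, 512, 512))    # integer class label per pixel
loss = criterion(logits, target)               # scalar cross-entropy loss
```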

4. Experiments

4.1. Datasets

The proposed AerialFormer is benchmarked on three standard aerial imaging datasets, i.e., iSAID, Potsdam, and LoveDA, as described below.
  • iSAID: The iSAID dataset [86] is a large-scale and densely annotated aerial segmentation dataset that contains 655,451 instances of 2806 high-resolution images for 15 classes, i.e., ship (Ship), storage tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground field track (GTF), bridge (Bridge), large vehicle (LV), small vehicle (SV), helicopter (HC), swimming pool (SP), roundabout (RA), soccer ball field (SBF), plane (Plane), and harbor (Harbor). This dataset is challenging due to foreground–background imbalance, the presence of a large number of objects per image, limited-appearance details, a variety of tiny objects, large-scale variations, and high class imbalance. These images were collected from multiple sensors and platforms with multiple resolutions and image sizes ranging from 800 × 800 pixels to 4000 × 13,000 pixels. Following the experiment setup [11,70], the dataset was split into 1411/458/937 images for train/val/test. The network was trained on the trainset and benchmarked on the valset. Each image was overlap-partitioned into a set of sub-images sized 896 × 896 with a step size of 512 by 512.
  • Potsdam: The Potsdam dataset [87] contains 38 high-resolution images of 6000 × 6000 pixels over Potsdam City, Germany, with a ground sampling distance of 5 cm. Two modalities are included in the Potsdam dataset, i.e., the true orthophoto (TOP) and the digital surface model (DSM); in this work, only the RGB TOP images were used, and the DSM data were ignored. The dataset presents a complex background and challenging inter- and intra-class variations due to the unique characteristics of Potsdam. For example, the similarity between the low-vegetation and building classes caused by roof greening illustrates the inter-class difficulty. The dataset offers two types of annotations, with non-eroded (NE) and eroded (E) options, i.e., with and without the boundary. To avoid ambiguity in labeling boundaries, all experiments were performed and benchmarked on the eroded-boundary dataset. Following the setup of the experiment [77,88], the dataset was divided into 24 images for training and 14 images for testing. The testset of 14 images included 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13. The dataset consists of six categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. The performance is reported in two scenarios: with and without clutter. Each image is overlap-partitioned into a set of sub-images sized 512 × 512 with a step size of 256 × 256.
  • LoveDA: The LoveDA dataset [8] consists of 5987 high-resolution images of 1024 × 1024 pixels and 30 cm in spatial resolution. The data include 18 complex urban and rural scenes and 166,768 annotated objects from three different cities (Nanjing, Changzhou, and Wuhan) in China. This dataset presents challenges due to its diverse geographical sources, leading to complex and varied backgrounds as well as inconsistent appearances within the same class, such as differences in scale and texture. In alignment with the experimental setup delineated in [8], the dataset was partitioned into 2522/1669/1796 images for training, validation, and testing, respectively. In evaluation scenarios involving the testset, the training and validation sets of LoveDA were amalgamated to create a combined training set, while keeping the testset unchanged.

4.2. Evaluation Metrics

To evaluate the performance, three commonly used metrics were adopted: mean intersection over union (mIoU), overall accuracy (OA), and mean F1 score (mF1). These metrics are computed on the basis of four fundamental values, namely true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The calculation of these four values involves the prediction $P \in \mathbb{R}^{L \times H \times W}$ and the class-wise binary ground truth mask $\mathrm{GT} \in \mathbb{R}^{L \times H \times W}$, where $H$ and $W$ are the height and width of the input image, and $L$ is the number of classes/categories existing in the input. In the context of multi-class segmentation, these values are computed for each class $l \in \{1, 2, \ldots, L\}$ across all pixels.
$$\mathrm{TP}_l = \sum_{h=1}^{H}\sum_{w=1}^{W} \mathrm{GT}_{l,h,w} \wedge P_{l,h,w}, \qquad \mathrm{TN}_l = \sum_{h=1}^{H}\sum_{w=1}^{W} \neg\left(\mathrm{GT}_{l,h,w} \vee P_{l,h,w}\right),$$
$$\mathrm{FP}_l = \sum_{h=1}^{H}\sum_{w=1}^{W} \neg\mathrm{GT}_{l,h,w} \wedge P_{l,h,w}, \qquad \mathrm{FN}_l = \sum_{h=1}^{H}\sum_{w=1}^{W} \mathrm{GT}_{l,h,w} \wedge \neg P_{l,h,w}$$
Based on the four values above, the IoU and F1 score of an individual category $l$ are calculated as follows:
$$\mathrm{IoU}_l = \frac{\mathrm{TP}_l}{\mathrm{TP}_l + \mathrm{FN}_l + \mathrm{FP}_l}$$
$$\mathrm{F1}_l = \frac{2\,\mathrm{TP}_l}{2\,\mathrm{TP}_l + \mathrm{FN}_l + \mathrm{FP}_l}$$
$\mathrm{IoU}_l$ and $\mathrm{F1}_l$ are referred to as the IoU and F1 score of category $l$. OA is then further computed as the ratio of correctly predicted pixels to the total number of pixels. The mIoU and mF1 are computed as the arithmetic means of the IoU and F1 score, respectively, over all class categories.
$$\mathrm{OA} = \frac{\sum_{l=1}^{L} \mathrm{TP}_l}{\sum_{l=1}^{L}\left(\mathrm{TP}_l + \mathrm{FP}_l + \mathrm{TN}_l + \mathrm{FN}_l\right)}$$
$$\mathrm{mIoU} = \frac{1}{L} \sum_{l=1}^{L} \mathrm{IoU}_l$$
$$\mathrm{mF1} = \frac{1}{L} \sum_{l=1}^{L} \mathrm{F1}_l$$
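As a sanity check on these definitions, the following NumPy sketch computes the per-class IoU and F1 together with mIoU, mF1, and OA from integer-label prediction and ground truth maps; it is an illustrative implementation, not the evaluation code used for the reported numbers.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: integer label maps of identical shape."""
    ious, f1s = [], []
    for l in range(num_classes):
        tp = np.sum((pred == l) & (gt == l))
        fp = np.sum((pred == l) & (gt != l))
        fn = np.sum((pred != l) & (gt == l))
        iou_denom = tp + fn + fp
        f1_denom = 2 * tp + fn + fp
        ious.append(tp / iou_denom if iou_denom else np.nan)
        f1s.append(2 * tp / f1_denom if f1_denom else np.nan)
    oa = np.mean(pred == gt)  # ratio of correctly predicted pixels
    return {"IoU": ious, "F1": f1s,
            "mIoU": np.nanmean(ious), "mF1": np.nanmean(f1s), "OA": oa}

# Example usage with random labels for a 512 x 512 tile and 6 classes.
rng = np.random.default_rng(0)
print(segmentation_metrics(rng.integers(0, 6, (512, 512)),
                           rng.integers(0, 6, (512, 512)), num_classes=6))
```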

4.3. Implementation Details

The proposed AerialFormer-T was trained on a single RTX 8000 GPU, and the proposed AerialFormer-S and AerialFormer-B were trained on two RTX 8000 GPUs. The Adam [89] optimizer was employed with a learning rate of $6 \times 10^{-5}$, a weight decay of 0.01, betas of (0.9, 0.999), and a batch size of 8. The experimental models were trained for 160k iterations on the LoveDA and Potsdam datasets and 800k iterations on the iSAID dataset. During all training processes, data augmentations such as random horizontal flipping and photometric distortions were applied.
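A minimal PyTorch sketch of this optimizer configuration is shown below; the placeholder model stands in for an AerialFormer instance, and the use of `torch.optim.Adam` with a weight-decay argument follows the settings stated above rather than the released training code.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)  # placeholder for an AerialFormer model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=6e-5,              # learning rate of 6 x 10^-5
                             betas=(0.9, 0.999),
                             weight_decay=0.01)
```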
AerialFormer was trained with three different backbones, i.e., Swin Transformer-Tiny (Swin-T), Swin Transformer-Small (Swin-S), and Swin Transformer-Base (Swin-B). The first two backbones were pre-trained on the ImageNet-1k dataset [90], and the last backbone was pre-trained on the ImageNet-22k dataset [90]. As a result, the experimental performances of the three models AerialFormer-T, AerialFormer-S, and AerialFormer-B were compared. As introduced in Section 3.3, the model hyperparameters, including the number of channels $C$, the window size $M^2$, and the set of layer counts $L = \{L_s\}_{s=1}^{4}$ in the Transformer Encoder Blocks, are delineated for each model as follows:
  • AerialFormer-T: $C = 96$, $M^2 = 7^2$, $L = \{2, 2, 6, 2\}$;
  • AerialFormer-S: $C = 96$, $M^2 = 7^2$, $L = \{2, 2, 18, 2\}$;
  • AerialFormer-B: $C = 128$, $M^2 = 12^2$, $L = \{2, 2, 18, 2\}$.
In addition to the aforementioned parameters, the receptive field sizes of the MDC decoder, which remain constant across the models, are detailed as follows: $[r_1, r_2, r_3] = \{[1, 3, 3], [3, 3, 3], [3, 5, 7], [3, 5, 7], [3, 5, 7]\}$, as demonstrated in Figure 2.
It is worth highlighting that, relative to the commonly utilized CNN backbones, the proposed model does not significantly increase computational cost, as the computational complexities of Swin-T and Swin-S align closely with those of ResNet-50 and ResNet-101, respectively. Our source code is publicly available at https://github.com/UARK-AICV/AerialFormer (accessed on 5 August 2024).

4.4. Quantitative Results and Analysis

Quantitative performance comparisons between AerialFormer and other existing methods are presented in Table 1, Table 2, Table 3 and Table 4 for three different datasets under various settings of iSAID (valset), Potsdam (with clutter), Potsdam (without clutter), and LoveDA (testset), respectively. For each dataset, the performance of the proposed AerialFormer is reported on three backbones of Swin-T, Swin-S, and Swin-B, called AerialFormer-T, AerialFormer-S, and AerialFormer-B, respectively. The proposed AerialFormer was benchmarked with CNN-based and Transformer-based image segmentation methods. The comparison of each dataset is detailed as follows.

4.4.1. iSAID Semantic Segmentation Results

Performance comparisons of the proposed AerialFormer with existing state-of-the-art methods on the iSAID dataset are presented in Table 1. The iSAID dataset consists of 15 categories divided into three groups: vehicles, artifacts, and fields. In general, it is observed that AerialFormer-B achieves the best performance, while both AerialFormer-S and AerialFormer-T obtain results comparable to the second-best methods. All three models significantly outperform other existing methods. Specifically, AerialFormer-T obtains an mIoU of 67.5%, AerialFormer-S achieves an mIoU of 68.4%, and AerialFormer-B attains an mIoU of 69.3%. These results represent improvements of 0.3%, 1.2%, and 2.1% over the previous highest score of 67.2% from RingMo [80]. Moreover, on some small and dense classes (e.g., small vehicles (SVs), planes, helicopters (HCs), etc.), AerialFormer gains a large margin over the existing methods. Taking the small-vehicle (SV) class as an example, AerialFormer-T achieves a 1.4% IoU gain, AerialFormer-S gains a 2.4% IoU margin, and AerialFormer-B gains a 2.5% IoU margin over the best existing method, i.e., RingMo [80]. It should be noted that RingMo utilizes Swin-B as its backbone, which shares a similar computational cost with AerialFormer-B. This analysis further shows that AerialFormer-T and AerialFormer-S, despite being smaller models, outperform the best existing method, RingMo.

4.4.2. Potsdam Semantic Segmentation Results

The segmentation performance was analyzed on the Potsdam dataset in two cases, with and without clutter/background, and the results are summarized in Table 2 and Table 3, respectively. The clutter class is the most challenging class, as it can contain anything except for the five classes of impervious surface, building, low vegetation, tree, and car. As in other existing works [15,80,100], the proposed AerialFormer was benchmarked using the metrics of mIoU, OA, mF1, and per-category F1.
  • Potsdam with Clutter: Table 2 reports the performance comparisons between our AerialFormer and the existing methods on six classes (that is, including the clutter class). It should be noted that among all existing methods, Segformer [58] is a strong Transformer-based segmentation model and obtains the best performance. The proposed model gains a notable improvement of 1.7% in mIoU, 0.9% in OA, and 1.2% in mF1 compared to Segformer, the best existing method.
Unlike the experiment on iSAID (Section 4.4.1), the trade-off between performance and model size does not seem favorable for this dataset. We speculate that the cause for this could be the difference in the spatial resolution of the datasets. According to [116], while the iSAID dataset includes images with spatial resolutions of up to 0.3 m, the spatial resolution of the Potsdam dataset is finer at 0.05 m. Consequently, objects in the Potsdam dataset are represented with more pixels, appearing much larger. This might reduce the requirement for architectural enhancements specifically aimed at improving the segmentation of tiny objects.
As the most challenging category, clutter has the lowest F1 score among the six categories. Due to this difficulty, many methods have ignored this category and focused on training the network on only the five other categories, as shown in Table 3.
  • Potsdam without Clutter: In this experimental setting, FT-UNetformer [15], HMANet [76], and DC-Swin [16] obtained the best scores on the mIoU, OA, and mF1 metrics, respectively, but none of them achieved the best score on all three metrics. On the other hand, the proposed AerialFormer-B scores the best on all three metrics, with improvements of 1.6% in mIoU, 1.7% in OA, and 0.9% in mF1 over FT-UNetformer, HMANet, and DC-Swin, respectively. Compared to Table 2, which contains clutter, it can be seen that ignoring the clutter class tends to alleviate ambiguity between the remaining classes.
Similarly to the observation on the iSAID dataset (Section 4.4.1), it is observed that AerialFormer-B achieves the best performance, while both AerialFormer-S and AerialFormer-T obtain results comparable to those of the second-best methods on the Potsdam dataset in both settings, with and without the clutter category.

4.4.3. LoveDA Semantic Segmentation Results

The performance comparisons with existing methods on the testset split of the LoveDA dataset are reported in Table 4. In this experiment, the proposed method was evaluated on a public test server (https://codalab.lisn.upsaclay.fr/competitions/421 (accessed on 5 August 2024)) by submitting our predictions. Our smaller model, AerialFormer-S, achieved a performance comparable to those of existing state-of-the-art methods, such as UNetFormer [15] and RSSFormer [78], with an mIoU of 52.4%. However, our best model, AerialFormer-B, shows a significant improvement of 1.7% in mIoU compared to the existing state-of-the-art methods. Notably, AerialFormer-B outperforms the existing methods by 4.1% IoU for the road category, 5.2% IoU for the water category, 2.5% IoU for the forest category, and 5.7% IoU for the agriculture category. In particular, the ‘Road’ category is typically characterized by narrow and elongated features. Segmenting such objects necessitates both local and global perspectives, a capability that the proposed model exhibits effectively.

4.4.4. Ablation Study

The proposed network’s components were ablated as follows. An ablation study was performed on the CNN Stem and the multi-dilated CNN (MDC) decoder using the tiny (T) model, as shown in Table 5. The performance is reported using the number of parameters, mIoU, OA without the background class, and OA. To account for the significant influence of the background class on OA, the OA excluding the background class is provided. A baseline model featuring a Swin Transformer encoder and a UNet decoder was evaluated, in which the MDC block was replaced with standard CNN blocks, retaining all original parameters except for the dilation parameters. Note that MDC does not add any parameters compared to the plain CNN. The baseline performance is the lowest among the configurations, suggesting that both the Stem and the MDC decoder contribute positively to the overall effectiveness of the model. Specifically, adding the CNN Stem improves mIoU by +0.2 and +0.3 points, and using MDC improves mIoU by +0.5 and +0.6 points. The improvement was further verified by examining the overall accuracy (OA) without the background class. Specifically, adding the CNN Stem enhances accuracy by +0.81 and +0.29 percentage points, while implementing MDC boosts accuracy by +1.35 and +0.83 percentage points. It is worth noting that the background class is excluded from this comparison, as it tends to dominate OA calculations. Figure 6 qualitatively verifies that the Stem module improves the ability of the model to recognize small and dense objects; however, due to the absence of GT labels on some of the small and dense objects, such as small vehicles in the iSAID dataset, the improvements in quantitative results remain relatively marginal compared to those of the MDC.

4.4.5. Network Complexity

Besides the qualitative analysis, an analysis of the network complexity is included, as presented in Table 6. This section first details the model parameters (M), computation (GFLOPs), and inference time (seconds per image) for AerialFormer. A comparison with baseline models, including recent Transformer-based architectures [15,16,79,117], then follows. To calculate the inference time, we averaged the results of 10,000 runs of the model using a 512 × 512 input with a batch size of 1. All the measurements were conducted on the Potsdam dataset without the clutter class. While AerialFormer-T, with a model size of 42.7 M parameters, has a model size and inference time similar to those of SwinUNet [117], it requires fewer computational resources and achieves a significantly higher performance (+20.2 in mIoU). The comparison to TransUnet [79] also highlights the parameter efficiency of the proposed model, which achieves a higher performance (+18.2 in mIoU) with an inference time of 0.02 seconds per image, as opposed to 0.023 seconds per image. The comparison to more recent architectures using the Swin Transformer, e.g., DC-SwinS [16] and UnetFormer-SwinB [16], underscores the effectiveness of the proposed architecture. While having fewer parameters and a faster inference speed, AerialFormer-T still outperforms those two models with noticeable performance gaps of +0.9 and +1.0, respectively. The smallest model in our series, AerialFormer-T, can perform inference at a rate of 50 images per second, while AerialFormer-S, with a model size of 64.0 M parameters, achieves 35.7 images per second. Even the largest model, AerialFormer-B, with a model size of 113.82 M parameters, can achieve a real-time inference speed of 21.3 images per second.

4.5. Qualitative Results and Analysis

This section presents the qualitative results obtained from the proposed model, comparing them with well-established and robust baseline models, specifically PSPNet [91] and DeepLabV3+ [24]. This section will illustrate the advantages of AerialFormer in dealing with the challenging characteristics of remote sensing images.
  • Foreground–background imbalance: As mentioned in Section 1, the Introduction, the iSAID dataset exhibits a notable foreground and background imbalance. This imbalance is particularly evident in Figure 7, where certain images contain only a few labeled objects. Despite this extreme imbalance, AerialFormer shows its ability to accurately segment objects of interest, as depicted in the figure.
  • Tiny objects: As evidenced in Figure 8, AerialFormer is capable of accurately identifying and segmenting tiny objects, such as cars on the road, which might be represented by only approximately 10 × 5 pixels. This showcases the model’s remarkable capability to handle small-object segmentation in high-resolution aerial images. Additionally, the proposed model demonstrates the ability to accurately segment cars that are not present in the ground truth labels (red boxes). However, this poses a problem for evaluating the proposed model, as its prediction could be penalized as a false positive even if the prediction is correct based on the given image.
  • Dense objects: Figure 9 demonstrates the proficient ability of the proposed model to accurately segment dense objects, particularly clusters of small vehicles, which often pose challenges for baseline models. Baseline models often overlook or struggle to identify such objects. The success of the proposed model in segmenting dense objects is attributed to the MDC decoder, which captures the global context, and the CNN Stem, which preserves the local details of the tiny objects.
  • Intra-class heterogeneity: Figure 10 visually demonstrates the existence of intra-class heterogeneity in aerial images, where objects of the same category can appear in diverse shapes, textures, colors, scales, and structures. The red boxes indicate two regions that are classified as belonging to the category of ‘Agriculture’. However, their visual characteristics differ significantly due to the presence of greenhouses. Notably, while baseline models encounter challenges in correctly classifying the region with greenhouses, misclassifying it as ‘Building’, the proposed model successfully identifies and labels the region as ‘Agriculture’. This showcases the superior performance and effectiveness of the proposed model in handling the complexities of intra-class variations in aerial image analysis tasks.
  • Inter-class homogeneity: Figure 11 illustrates the inter-class homogeneity in aerial images, where objects of different classes may exhibit similar visual properties. The regions enclosed within the red boxes represent areas that exhibit similar visual characteristics, i.e., a rooftop greened with lawn and a park. However, there is a distinction in the classification of these regions, with the former being labeled as ‘Building’ and the latter falling into the ‘Low Vegetation’ category. Although the baseline models are confused by the appearance and produce mixed predictions, the proposed model produces more robust results.
  • Overall performance: Figure 12 showcases these qualitative outcomes across three datasets: (a) iSAID, (b) Potsdam, and (c) LoveDA. Each dataset possesses unique characteristics and presents a wide spectrum of challenges encountered in aerial image segmentation. The major differences among the methods are highlighted in red boxes. Figure 12a visually demonstrates the efficiency of the proposed model in accurately recognizing dense and tiny objects. Unlike the baseline models, which often overlook or misclassify these objects into different categories, the proposed model exhibits its robustness in handling dense and tiny objects, e.g., small vehicle (SV) and helicopter (HC). As depicted in Figure 12b, the proposed model demonstrates a reduced level of inter-class confusion in comparison to the baseline models. An example of this is evident in the prediction of building structures, where the baseline models exhibit confusion. In contrast, the proposed model delivers predictions closely aligned with the ground truth. Similarly, in Figure 12c, the proposed model’s predictions are less noisy, further asserting its robustness in scenarios where scenes belong to different categories but exhibit similar visual appearances. As in the quantitative analysis, the performance of the proposed model on the ‘Road’ class is visually appealing. The ability of the proposed model to accurately delineate road structures, despite their narrow and elongated features, is visibly superior.

5. Discussion

This section discusses the unique challenges posed by aerial imagery, particularly in terms of object scale variations and object cutoff, and proposes potential future research directions to address these challenges.
In datasets like iSAID, objects within the same category, such as ships, can vary significantly in size, with differences in area of up to $10^5$ times [86]. Current fixed-window and fixed-patch methods often struggle to accommodate such drastic changes in scale, potentially leading to misclassification. Moreover, the patch-based approach used in the encoder can result in object cutoff, where objects are only partially included in a patch. This can lead to insufficient information for accurate classification, as is evident for objects such as the building class in the Potsdam dataset or the field class group in the iSAID dataset. The limited context provided by the patches can result in incorrect classifications. To address these challenges, future research should focus on developing methods that can adapt to the varying scales of objects in aerial imagery. Techniques that dynamically adjust window sizes or patch dimensions based on object characteristics could be explored. Additionally, incorporating contextual information from surrounding patches or employing attention mechanisms to focus on relevant regions could help mitigate the issue of object cutoff. Multi-scale feature fusion techniques could also be investigated to capture and integrate information from objects at different scales.
While the proposed AerialFormer demonstrated superior performance in remote sensing image segmentation, there is still room for improvement in terms of handling scale variations and object cutoff. Addressing these challenges will be crucial for developing even more robust and accurate segmentation models in the future.

6. Conclusions

This study introduced AerialFormer, a novel approach specifically designed to address the unique and challenging characteristics encountered in remote sensing image segmentation. These challenges include the presence of tiny objects, dense objects, foreground–background imbalance, intra-class heterogeneity, and inter-class homogeneity. To overcome these challenges, this work designed AerialFormer by combining the strengths of both Transformer and CNN architectures, creating a hybrid model that incorporates a Transformer encoder with a multi-dilated CNN decoder. Furthermore, this work incorporated a CNN Stem module to facilitate the transmission of low-level, high-resolution features to the decoder. This comprehensive design allows AerialFormer to effectively capture global context and local features simultaneously, significantly enhancing its ability to handle the complexities inherent in aerial images.
The proposed AerialFormer was evaluated using three different backbone sizes: Swin Transformer-Tiny, Swin Transformer-Small, and Swin Transformer-Base. The proposed model was benchmarked on three standard datasets: iSAID, Potsdam, and LoveDA. Through extensive experimentation, it was demonstrated that AerialFormer-T and AerialFormer-S, with smaller model sizes and lower computational costs, achieve performances that are superior or comparable to those of existing state-of-the-art methods, ranking them as second-best performers. Moreover, the proposed AerialFormer-B surpasses all existing state-of-the-art methods, showcasing its exceptional performance in the field of remote sensing image segmentation.

Author Contributions

Conceptualization, T.H., K.Y. and M.T.; formal analysis, J.C. and N.L.; methodology, T.H., K.Y., M.T., R.A.M. and H.L.; project administration, J.C. and N.L.; validation, T.H., K.Y. and M.T.; writing—review and editing, T.H., K.Y., C.R., M.A., J.C. and N.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation #1946391, National Science Foundation #2119691, and National Science Foundation #2236302.

Data Availability Statement

In this study, we utilized the iSAID, Potsdam, and LoveDA datasets. The iSAID dataset is available at https://captain-whu.github.io/iSAID/dataset.html, accessed on 7 May 2024. The Potsdam dataset can be accessed at https://www.isprs.org/education/benchmarks/UrbanSemLab/default.aspx, accessed on 7 May 2024. Lastly, the LoveDA dataset is publicly available at https://zenodo.org/records/5706578#.YZvN7SYRXdF, accessed on 7 May 2024. Please refer to the provided links for further details.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schumann, G.J.; Brakenridge, G.R.; Kettner, A.J.; Kashif, R.; Niebuhr, E. Assisting flood disaster response with earth observation data and products: A critical assessment. Remote Sens. 2018, 10, 1230. [Google Scholar] [CrossRef]
  2. Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402. [Google Scholar] [CrossRef]
  3. Griffiths, D.; Boehm, J. Improving public data for building segmentation from Convolutional Neural Networks (CNNs) for fused airborne lidar and image data using active contours. ISPRS J. Photogramm. Remote Sens. 2019, 154, 70–83. [Google Scholar] [CrossRef]
  4. Shamsolmoali, P.; Zareapoor, M.; Zhou, H.; Wang, R.; Yang, J. Road segmentation for remote sensing images using adversarial spatial pyramid networks. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4673–4688. [Google Scholar] [CrossRef]
  5. Samie, A.; Abbas, A.; Azeem, M.M.; Hamid, S.; Iqbal, M.A.; Hasan, S.S.; Deng, X. Examining the impacts of future land use/land cover changes on climate in Punjab province, Pakistan: Implications for environmental sustainability and economic growth. Environ. Sci. Pollut. Res. 2020, 27, 25415–25433. [Google Scholar] [CrossRef]
  6. Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107. [Google Scholar] [CrossRef]
  7. Xia, J.; Yokoya, N.; Adriano, B.; Broni-Bediako, C. OpenEarthMap: A Benchmark Dataset for Global High-Resolution Land Cover Mapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 6254–6264. [Google Scholar]
  8. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Virtual. 6–14 December 2021; Vanschoren, J., Yeung, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 1. [Google Scholar]
  9. O’Neill, S.J.; Boykoff, M.; Niemeyer, S.; Day, S.A. On the use of imagery for climate change engagement. Glob. Environ. Change 2013, 23, 413–421. [Google Scholar] [CrossRef]
  10. Andrade, R.; Costa, G.; Mota, G.; Ortega, M.; Feitosa, R.; Soto, P.; Heipke, C. Evaluation of semantic segmentation methods for deforestation detection in the amazon. ISPRS Arch. 2020, 43, 1497–1505. [Google Scholar] [CrossRef]
  11. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual. 14–19 June 2020; pp. 4096–4105. [Google Scholar]
  12. Shafique, A.; Cao, G.; Khan, Z.; Asad, M.; Aslam, M. Deep learning-based change detection in remote sensing images: A review. Remote Sens. 2022, 14, 871. [Google Scholar] [CrossRef]
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  14. Wang, F.; Piao, S.; Xie, J. CSE-HRNet: A context and semantic enhanced high-resolution network for semantic segmentation of aerial imagery. IEEE Access 2020, 8, 182475–182489. [Google Scholar] [CrossRef]
  15. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  16. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  17. Minaee, S.; Boykov, Y.Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  18. Le, N.; Bui, T.; Vo-Ho, V.K.; Yamazaki, K.; Luu, K. Narrow band active contour attention model for medical segmentation. Diagnostics 2021, 11, 1393. [Google Scholar] [CrossRef] [PubMed]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual. 3–7 May 2021. [Google Scholar]
  21. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  22. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  23. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  24. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  25. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692. [Google Scholar]
  26. Hoang, D.H.; Diep, G.H.; Tran, M.T.; Le, N.T.H. Dam-al: Dilated attention mechanism with attention loss for 3d infant brain image segmentation. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, Virtual. 25–29 April 2022; pp. 660–668. [Google Scholar]
  27. Le, N.; Yamazaki, K.; Quach, K.G.; Truong, D.; Savvides, M. A multi-task contextual atrous residual network for brain tumor detection & segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Virtual, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5943–5950. [Google Scholar]
  28. Le, T.H.N.; Duong, C.N.; Han, L.; Luu, K.; Quach, K.G.; Savvides, M. Deep contextual recurrent residual networks for scene labeling. Pattern Recognit. 2018, 80, 32–41. [Google Scholar] [CrossRef]
  29. He, J.; Deng, Z.; Zhou, L.; Wang, Y.; Qiao, Y. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7519–7528. [Google Scholar]
  30. Hsiao, C.W.; Sun, C.; Chen, H.T.; Sun, M. Specialize and fuse: Pyramidal output representation for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7137–7146. [Google Scholar]
  31. Hu, H.; Ji, D.; Gan, W.; Bai, S.; Wu, W.; Yan, J. Class-wise dynamic graph convolution for semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII 16. Springer: Cham, Switzerland, 2020; pp. 1–17. [Google Scholar]
  32. Jin, Z.; Gong, T.; Yu, D.; Chu, Q.; Wang, J.; Wang, C.; Shao, J. Mining contextual information beyond image for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7231–7241. [Google Scholar]
  33. Jin, Z.; Liu, B.; Chu, Q.; Yu, N. ISNet: Integrate image-level and semantic-level context for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7189–7198. [Google Scholar]
  34. Yu, C.; Wang, J.; Gao, C.; Yu, G.; Shen, C.; Sang, N. Context prior for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12416–12425. [Google Scholar]
  35. Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation transformer: Object-contextual representations for semantic segmentation. arXiv 2019, arXiv:1909.11065. [Google Scholar]
  36. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar]
  37. Bertasius, G.; Shi, J.; Torresani, L. Semantic segmentation with boundary neural fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3602–3610. [Google Scholar]
  38. Le, N.; Le, T.; Yamazaki, K.; Bui, T.; Luu, K.; Savides, M. Offset curves loss for imbalanced problem in medical segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9189–9195. [Google Scholar]
  39. Ding, H.; Jiang, X.; Liu, A.Q.; Thalmann, N.M.; Wang, G. Boundary-aware feature propagation for scene segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6819–6829. [Google Scholar]
  40. Li, X.; Li, X.; Zhang, L.; Cheng, G.; Shi, J.; Lin, Z.; Tan, S.; Tong, Y. Improving semantic segmentation via decoupled body and edge supervision. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII 16. Springer: Cham, Switzerland, 2020; pp. 435–452. [Google Scholar]
  41. Zhen, M.; Wang, J.; Zhou, L.; Li, S.; Shen, T.; Shang, J.; Fang, T.; Quan, L. Joint semantic segmentation and boundary detection using iterative pyramid contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13666–13675. [Google Scholar]
  42. Harley, A.W.; Derpanis, K.G.; Kokkinos, I. Segmentation-aware convolutional networks using local attention masks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5038–5047. [Google Scholar]
  43. He, J.; Deng, Z.; Qiao, Y. Dynamic multi-scale filters for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3562–3572. [Google Scholar]
  44. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283. [Google Scholar]
  45. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  46. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  47. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. In Proceedings of the British Machine Vision Conference, Virtual. 22–25 November 2018. [Google Scholar]
  48. Sun, G.; Wang, W.; Dai, J.; Van Gool, L. Mining cross-image semantics for weakly supervised semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Cham, Switzerland, 2020; pp. 347–365. [Google Scholar]
  49. Wang, W.; Zhou, T.; Qi, S.; Shen, J.; Zhu, S.C. Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3508–3522. [Google Scholar] [CrossRef]
  50. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  51. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  52. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329. [Google Scholar]
  53. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
  54. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  55. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
  56. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
  57. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  58. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  59. Tran, M.; Vo, K.; Yamazaki, K.; Fernandes, A.; Kidd, M.; Le, N. AISFormer: Amodal Instance Segmentation with Transformer. In Proceedings of the British Machine Vision Conference, London, UK, 21–24 November 2022; BMVA Press: London, UK, 2022. [Google Scholar]
  60. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  61. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual. 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  62. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations, Virtual. 3–7 May 2021. [Google Scholar]
  63. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  64. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10502–10511. [Google Scholar]
  65. Vo, K.; Joo, H.; Yamazaki, K.; Truong, S.; Kitani, K.; Tran, M.T.; Le, N. AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation. In Proceedings of the British Machine Vision Conference, Virtual. 22–25 November 2021. [Google Scholar]
  66. Vo, K.; Truong, S.; Yamazaki, K.; Raj, B.; Tran, M.T.; Le, N. AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation. Int. J. Comput. Vis. 2022, 131, 302–323. [Google Scholar] [CrossRef]
  67. Yamazaki, K.; Truong, S.; Vo, K.; Kidd, M.; Rainwater, C.; Luu, K.; Le, N. Vlcap: Vision-language with contrastive learning for coherent video paragraph captioning. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3656–3661. [Google Scholar]
  68. Yamazaki, K.; Vo, K.; Truong, S.; Raj, B.; Le, N. VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  69. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  70. Li, X.; He, H.; Li, X.; Li, D.; Cheng, G.; Shi, J.; Weng, L.; Tong, Y.; Lin, Z. Pointflow: Flowing semantics through points for aerial image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4217–4226. [Google Scholar]
  71. Xue, G.; Liu, Y.; Huang, Y.; Li, M.; Yang, G. AANet: An attention-based alignment semantic segmentation network for high spatial resolution remote sensing images. Int. J. Remote Sens. 2022, 43, 4836–4852. [Google Scholar] [CrossRef]
  72. Ma, A.; Wang, J.; Zhong, Y.; Zheng, Z. FactSeg: Foreground Activation-Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  73. Hou, J.; Guo, Z.; Wu, Y.; Diao, W.; Xu, T. Bsnet: Dynamic hybrid gradient convolution based boundary-sensitive network for remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–22. [Google Scholar] [CrossRef]
  74. You, H.; Tian, S.; Yu, L.; Lv, Y. Pixel-level remote sensing image recognition based on bidirectional word vectors. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1281–1293. [Google Scholar] [CrossRef]
  75. Mou, L.; Hua, Y.; Zhu, X.X. Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7557–7569. [Google Scholar] [CrossRef]
  76. Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18. [Google Scholar] [CrossRef]
  77. Wang, D.; Zhang, J.; Du, B.; Xia, G.S.; Tao, D. An empirical study of remote sensing pretraining. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–20. [Google Scholar] [CrossRef]
  78. Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. Rssformer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef]
  79. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  80. Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. Ringmo: A remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–22. [Google Scholar] [CrossRef]
  81. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  82. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  83. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  84. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  85. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  86. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 28–37. [Google Scholar]
  87. 2D Semantic Labeling Contest-Potsdam. International Society for Photogrammetry and Remote Sensing. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 5 August 2024).
  88. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  89. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  90. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  91. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  92. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  93. Liu, G.; Wang, Q.; Zhu, J.; Hong, H. W-Net: Convolutional neural network for segmenting remote sensing images by dual path semantics. PLoS ONE 2023, 18, e0288311. [Google Scholar] [CrossRef]
  94. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. FarSeg++: Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13715–13729. [Google Scholar] [CrossRef]
  95. Gong, Z.; Duan, L.; Xiao, F.; Wang, Y. MSAug: Multi-Strategy Augmentation for rare classes in semantic segmentation of remote sensing images. Displays 2024, 84, 102779. [Google Scholar] [CrossRef]
  96. He, S.; Jin, C.; Shu, L.; He, X.; Wang, M.; Liu, G. A new framework for improving semantic segmentation in aerial imagery. Front. Remote Sens. 2024, 5, 1370697. [Google Scholar] [CrossRef]
  97. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
  98. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  99. Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9167–9176. [Google Scholar]
  100. Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 905–909. [Google Scholar] [CrossRef]
  101. Ma, X.; Ma, M.; Hu, C.; Song, Z.; Zhao, Z.; Feng, T.; Zhang, W. Log-can: Local-global class-aware network for semantic segmentation of remote sensing images. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  102. Sun, L.; Zou, H.; Wei, J.; Cao, X.; He, S.; Li, M.; Liu, S. Semantic segmentation of high-resolution remote sensing images based on sparse self-attention and feature alignment. Remote Sens. 2023, 15, 1598. [Google Scholar] [CrossRef]
  103. Wang, T.; Xu, C.; Liu, B.; Yang, G.; Zhang, E.; Niu, D.; Zhang, H. MCAT-UNet: Convolutional and Cross-shaped Window Attention Enhanced UNet for Efficient High-resolution Remote Sensing Image Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9745–9758. [Google Scholar] [CrossRef]
  104. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 426–435. [Google Scholar] [CrossRef]
  105. Xu, Q.; Yuan, X.; Ouyang, C.; Zeng, Y. Spatial–spectral FFPNet: Attention-Based Pyramid Network for Segmentation and Classification of Remote Sensing Images. arXiv 2020, arXiv:2008.08775. [Google Scholar]
  106. Zhang, Q.; Yang, Y.B. Rest: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485. [Google Scholar]
  107. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  108. Chen, Y.; Fang, P.; Yu, J.; Zhong, X.; Zhang, X.; Li, T. Hi-resnet: A high-resolution remote sensing network for semantic segmentation. arXiv 2023, arXiv:2305.12691. [Google Scholar]
  109. Wang, L.; Dong, S.; Chen, Y.; Meng, X.; Fang, S. MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images. arXiv 2023, arXiv:2312.12735. [Google Scholar]
  110. Zhang, X.; Weng, Z.; Zhu, P.; Han, X.; Zhu, J.; Jiao, L. ESDINet: Efficient Shallow-Deep Interaction Network for Semantic Segmentation of High-Resolution Aerial Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5607615. [Google Scholar] [CrossRef]
  111. Chaurasia, A.; Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4. [Google Scholar]
  112. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  113. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  114. Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing plain vision transformer towards remote sensing foundation model. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5607315. [Google Scholar] [CrossRef]
  115. Liu, B.; Zhong, Z. GDformer: A lightweight decoder for efficient semantic segmentation of remote sensing urban scene imagery. In Proceedings of the Fourth International Conference on Computer Vision and Data Mining (ICCVDM 2023), Changchun, China, 19–21 October 2023; SPIE: Bellingham, WA, USA, 2024; Volume 13063, pp. 149–154. [Google Scholar]
  116. Long, Y.; Xia, G.S.; Li, S.; Yang, W.; Yang, M.Y.; Zhu, X.X.; Zhang, L.; Li, D. On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and million-aid. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4205–4230. [Google Scholar] [CrossRef]
  117. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part III. Springer: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]
Figure 1. Examples of challenging characteristics in remote sensing image segmentation. (Left) (i) The distribution of the foreground and background is highly imbalanced (black). (Top right) Objects in some classes are (ii) tiny (yellow) and (iii) dense (orange), so they are hard to identify. (Bottom right) Within a class, there is large diversity in appearance: (iv) intra-class heterogeneity (purple); some different classes share a similar appearance: (v) inter-class homogeneity (pink). The image is from the iSAID dataset and is best viewed in color.
Figure 2. Overall network architecture of the proposed AerialFormer, which consists of three components, i.e., CNN Stem, Transformer Encoder, and multi-dilated CNN decoder.
Figure 3. Illustration of the CNN Stem. The Stem takes the input image and produces feature maps at half of the original spatial resolution.
Figure 4. Illustration of a Transformer Encoder Block, which employs localized attention within shifted windows to progressively capture the global context as the network depth increases.
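For readers unfamiliar with the mechanism mentioned in this caption, the short sketch below demonstrates the non-overlapping window partition and the cyclic shift that Swin-style blocks [84] alternate between. The window size and tensor shapes are arbitrary examples, and the snippet is didactic rather than the encoder's actual code.

```python
# Didactic sketch: non-overlapping window partition and the cyclic shift used by
# shifted-window attention (Swin-style). Window size and tensor shapes are
# arbitrary examples; this is not the encoder's actual implementation.
import torch

def window_partition(x, win):
    """Split (B, H, W, C) into (num_windows*B, win*win, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, win * win, C)

B, H, W, C, win = 1, 8, 8, 32, 4
feat = torch.randn(B, H, W, C)

# Regular windows: attention is computed independently inside each 4x4 window.
windows = window_partition(feat, win)
print(windows.shape)                    # torch.Size([4, 16, 32])

# Shifted windows: rolling the feature map by win//2 lets the next block mix
# tokens across the previous window boundaries, growing the receptive field.
shifted = torch.roll(feat, shifts=(-win // 2, -win // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, win)
print(shifted_windows.shape)            # torch.Size([4, 16, 32])
```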
Figure 5. An illustration of the MDC Block, which consists of Pre-Channel Mixer, DCL, and Post-Channel Mixer.
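As a rough illustration of how such a block can widen the receptive field, the sketch below places parallel 3 × 3 convolutions with different dilation rates between a 1 × 1 pre-mixer and a 1 × 1 post-mixer. The dilation rates, the equal channel split, and the residual connection are assumptions made for exposition, not the paper's exact configuration.

```python
# Rough sketch of a multi-dilated convolution block: 1x1 pre-mixer, parallel
# 3x3 convolutions with different dilation rates, 1x1 post-mixer. Dilation
# rates, channel split, and the residual are assumptions, not the paper's settings.
import torch
import torch.nn as nn

class MultiDilatedBlock(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.pre = nn.Conv2d(channels, branch_ch * len(dilations), kernel_size=1)
        # Each branch sees a different receptive field; padding=d keeps the spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(branch_ch, branch_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations)
        self.post = nn.Conv2d(branch_ch * len(dilations), channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        mixed = self.act(self.pre(x))
        chunks = torch.chunk(mixed, len(self.branches), dim=1)   # split channels per branch
        out = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return self.act(self.post(out)) + x                      # residual added for illustration

block = MultiDilatedBlock(96)
y = block(torch.randn(1, 96, 64, 64))
print(y.shape)                                                   # torch.Size([1, 96, 64, 64])
```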
Figure 6. The qualitative ablation study of the CNN Stem and multi-dilated CNN decoder on the iSAID dataset. Comparing Swin-Unet (Baseline) and MDC-Only with Stem-Only and AerialFormer shows that the Stem module helps to segment small and dense objects, highlighted by the red line.
Figure 7. Qualitative comparison among AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of foreground–background imbalance. From left to right are the original image, ground truth, PSPNet, DeepLabV3+, and AerialFormer. The first row shows the overall results, and the second row shows zoomed-in regions. The corresponding regions in the first row are highlighted with a red frame, and the zoomed-in regions in the second row are connected to their respective locations in the first row with red lines.
Figure 8. Qualitative comparison among AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of tiny objects. From left to right are the original image, ground truth, PSPNet, DeepLabV3+, and AerialFormer. The first row shows the overall results, and the second row shows zoomed-in regions. The corresponding regions in the first row are highlighted with a red frame, and the zoomed-in regions in the second row are connected to their respective locations in the first row with red lines. Some objects that are evident in the input are not annotated in the ground truth label.
Figure 9. Qualitative comparison among AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of dense objects. From left to right are the original image, ground truth, PSPNet, DeepLabV3+, and AerialFormer. The first row shows the overall results, and the second row shows zoomed-in regions. The corresponding regions in the first row are highlighted with a red frame, and the zoomed-in regions in the second row are connected to their respective locations in the first row with red lines.
Figure 10. Qualitative comparison among AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of intra-class heterogeneity: the regions highlighted in the box are both classified under the ‘Agriculture’ category; however, one region features green land, while the other depicts greenhouses. From left to right are the original image, ground truth, PSPNet, DeepLabV3+, and AerialFormer. The first row shows the overall results, and the second row shows zoomed-in regions. The corresponding regions in the first row are highlighted with a red frame, and the zoomed-in regions in the second row are connected to their respective locations in the first row with red lines.
Figure 11. Qualitative comparison among AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of inter-class homogeneity: the regions highlighted in the box share similar visual characteristics, but one region is classified as ‘Building’ while the other belongs to the ‘Low Vegetation’ category. From left to right are the original image, ground truth, PSPNet, DeepLabV3+, and AerialFormer. The first row shows the overall results, and the second row shows zoomed-in regions. The corresponding regions in the first row are highlighted with a red frame, and the zoomed-in regions in the second row are connected to their respective locations in the first row with red lines.
Figure 12. Qualitative comparison on various datasets: (a) iSAID, (b) Potsdam, and (c) LoveDA. From left to right: original image, ground truth, PSPNet, DeepLabV3+, and the proposed AerialFormer. The major differences are highlighted in red boxes.
Table 1. Performance comparison on the iSAID validation set between AerialFormer and other state-of-the-art approaches. The performance is reported in mIoU and IoU for each category. The bold and italic–underline values in each column show the best and the second best performances. An arrow (↑) indicates that a higher score is better.
The per-category IoU columns are grouped into Vehicles, Artifacts, and Fields in the original layout; the abbreviations are defined in the footnote below the table.

| Method | Year | mIoU ↑ | LV | SV | Plane | HC | Ship | ST | Bridge | RA | Harbor | BD | TC | GTF | SBF | SP | BC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet [81] | 2015 | 37.4 | 49.9 | 35.6 | 74.7 | 0.0 | 49.0 | 0.0 | 7.5 | 46.5 | 45.6 | 6.5 | 78.6 | 5.5 | 9.7 | 38.0 | 22.9 |
| PSPNet [91] | 2017 | 60.3 | 58.0 | 43.0 | 79.5 | 10.9 | 65.2 | 52.1 | 32.5 | 68.6 | 54.3 | 75.7 | 85.6 | 60.2 | 71.9 | 46.8 | 61.1 |
| DeepLabV3 [92] | 2017 | 59.0 | 54.8 | 33.7 | 75.8 | 31.3 | 59.7 | 50.5 | 32.9 | 66.0 | 45.7 | 77.0 | 84.2 | 59.6 | 72.1 | 44.7 | 57.9 |
| DeepLabV3+ [24] | 2018 | 61.4 | 61.9 | 46.7 | 82.1 | 0.0 | 66.2 | 71.5 | 37.5 | 63.1 | 56.9 | 73.1 | 87.2 | 56.2 | 73.8 | 46.6 | 59.8 |
| HRNet [69] | 2019 | 62.3 | 61.6 | 48.5 | 82.3 | 6.9 | 67.5 | 70.3 | 38.4 | 65.7 | 54.7 | 75.4 | 87.1 | 55.5 | 75.5 | 46.4 | 62.1 |
| FarSeg [11] | 2020 | 63.7 | 60.6 | 46.3 | 82.0 | 35.8 | 65.4 | 61.8 | 36.7 | 71.4 | 53.9 | 77.7 | 86.4 | 56.7 | 72.5 | 51.2 | 62.1 |
| HMANet [76] | 2021 | 62.6 | 59.7 | 50.3 | 83.8 | 32.6 | 65.4 | 70.9 | 29.0 | 62.9 | 51.9 | 74.7 | 88.7 | 54.6 | 70.2 | 51.4 | 60.5 |
| PFNet [70] | 2021 | 66.9 | 64.6 | 50.2 | 85.0 | 37.9 | 70.3 | 74.7 | 45.2 | 71.7 | 59.3 | 77.8 | 87.7 | 59.5 | 75.4 | 50.1 | 62.2 |
| Segformer [58] | 2021 | 65.6 | 64.7 | 51.3 | 85.1 | 40.3 | 70.8 | 73.9 | 40.8 | 60.9 | 56.9 | 74.6 | 87.9 | 58.9 | 75.0 | 51.2 | 59.1 |
| FactSeg [72] | 2022 | 64.8 | 62.7 | 49.5 | 84.1 | 42.7 | 68.3 | 56.8 | 36.3 | 69.4 | 55.7 | 78.4 | 88.9 | 54.6 | 73.6 | 51.5 | 64.9 |
| BSNet [73] | 2022 | 63.4 | 63.4 | 46.6 | 81.8 | 31.8 | 65.3 | 69.1 | 41.3 | 70.0 | 57.3 | 76.1 | 86.8 | 50.3 | 70.2 | 48.8 | 55.9 |
| AANet [71] | 2022 | 66.6 | 63.2 | 48.7 | 84.6 | 41.8 | 71.2 | 65.7 | 40.2 | 72.4 | 57.2 | 80.5 | 88.8 | 60.5 | 73.5 | 52.3 | 65.4 |
| RSP-Swin-T [77] | 2022 | 64.1 | 62.0 | 50.6 | 85.2 | 37.6 | 67.0 | 74.6 | 44.3 | 64.9 | 53.8 | 73.7 | 70.7 | 60.1 | 76.2 | 46.8 | 59.0 |
| Ringmo [80] | 2022 | 67.2 | 63.9 | 51.2 | 85.7 | 40.1 | 73.5 | 73.0 | 43.2 | 67.3 | 58.9 | 77.0 | 89.1 | 63.0 | 78.5 | 48.9 | 62.5 |
| RSSFormer [78] | 2023 | 65.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| W-Net [93] | 2023 | 63.7 | 55.9 | 64.8 | 50.6 | 18.6 | 88.9 | 42.1 | 61.5 | 56.7 | 67.8 | 59.7 | 72.1 | 43.2 | 80.0 | 29.5 | 44.4 |
| FarSeg++ [94] | 2023 | 67.9 | 65.9 | 53.6 | 86.5 | 42.7 | 71.7 | 65.7 | 41.8 | 75.8 | 62.0 | 76.0 | 89.7 | 59.4 | 75.8 | 53.6 | 66.6 |
| MSAug [95] | 2024 | 68.4 | 67.5 | 52.6 | 85.3 | 41.7 | 71.7 | 74.6 | 46.1 | 72.2 | 60.3 | 79.1 | 89.8 | 60.4 | 77.6 | 52.3 | 64.4 |
| PFMFS [96] | 2024 | 67.3 | 62.8 | 49.7 | 84.2 | 43.1 | 68.5 | 68.7 | 37.1 | 72.6 | 56.3 | 80.0 | 89.2 | 56.9 | 74.1 | 52.8 | 65.4 |
| AerialFormer-T | | 67.5 | 67.0 | 52.6 | 86.1 | 42.0 | 68.6 | 74.9 | 45.3 | 73.0 | 58.2 | 77.5 | 88.8 | 57.5 | 75.1 | 50.5 | 63.4 |
| AerialFormer-S | | 68.4 | 66.5 | 53.6 | 86.5 | 40.0 | 72.1 | 74.1 | 44.8 | 74.0 | 60.9 | 78.8 | 89.2 | 59.5 | 77.0 | 52.1 | 66.5 |
| AerialFormer-B | | 69.3 | 67.8 | 53.7 | 86.5 | 46.7 | 75.1 | 76.3 | 46.8 | 66.1 | 60.8 | 81.5 | 89.8 | 65.0 | 78.3 | 52.4 | 62.4 |
* Categories in iSAID dataset: large vehicle (LV), small vehicle (SV), plane, helicopter (HC), ship, storage tank (ST), bridge, roundabout (RA), harbor, baseball diamond (BD), tennis court (TC), ground track field (GTF), soccerball field (SBF), swimming pool (SP), and basketball court (BC).
Table 2. Performance comparison on the Potsdam validation set with clutter. The performance is reported in the mIoU, OA, mF1, and F1 score for each category. Note that both training and evaluation were performed on the eroded dataset. The bold and italic–underline values in each column show the best and the second best performances. An arrow (↑) indicates that a higher score is better.
| Method | Year | mIoU ↑ | OA ↑ | mF1 ↑ | Imp. Surf. | Building | Low Veg. | Tree | Car | Clutter |
|---|---|---|---|---|---|---|---|---|---|---|
| FCN [21] | 2015 | 64.2 | - | 75.9 | 87.6 | 91.6 | 77.8 | 84.6 | 73.5 | 40.3 |
| PSPNet [91] | 2017 | 77.1 | 90.1 | 85.6 | 92.6 | 96.2 | 86.2 | 88.0 | 95.3 | 55.4 |
| DeeplabV3 [92] | 2017 | 77.2 | 90.0 | 85.6 | 92.4 | 95.9 | 86.4 | 87.6 | 94.9 | 56.7 |
| UPerNet [97] | 2018 | 76.8 | 89.7 | 85.6 | 92.5 | 95.5 | 85.5 | 87.5 | 94.9 | 58.0 |
| DeepLabV3+ [24] | 2018 | 77.1 | 90.1 | 85.6 | 92.6 | 96.4 | 86.3 | 87.8 | 95.4 | 55.1 |
| Denseaspp [25] | 2018 | 64.7 | - | 76.4 | 87.3 | 91.1 | 76.2 | 83.4 | 77.1 | 43.3 |
| DANet [98] | 2019 | 65.3 | - | 77.1 | 88.5 | 92.7 | 78.8 | 85.7 | 73.7 | 43.2 |
| EMANet [99] | 2019 | 65.6 | - | 77.7 | 88.2 | 92.7 | 78.0 | 85.7 | 72.7 | 48.9 |
| CCNet [46] | 2019 | 64.3 | - | 75.9 | 88.3 | 92.5 | 78.8 | 85.7 | 73.9 | 36.3 |
| SCAttNet V2 [100] | 2020 | 68.3 | 88.0 | 78.4 | 81.8 | 88.8 | 72.5 | 66.3 | 80.3 | 20.2 |
| PFNet [70] | 2021 | 75.4 | - | 84.8 | 91.5 | 95.9 | 85.4 | 86.3 | 91.1 | 58.6 |
| Segformer [58] | 2021 | 78.0 | 90.5 | 86.4 | 92.9 | 96.4 | 86.9 | 88.1 | 95.2 | 58.9 |
| LOGCAN++ [101] | 2023 | 78.6 | - | 86.6 | 87.5 | 93.8 | 77.2 | 79.8 | 93.1 | 40.1 |
| SAANet [102] | 2023 | 73.8 | 88.2 | 83.6 | 83.4 | 90.8 | 72.5 | 74.5 | 84.1 | 37.5 |
| MCAT-UNet [103] | 2024 | 75.4 | 83.3 | 84.8 | 84.6 | 92.5 | 74.3 | 76.3 | 83.8 | 41.5 |
| AerialFormer-T | | 79.5 | 91.1 | 87.5 | 93.5 | 96.9 | 87.2 | 89.0 | 95.9 | 62.5 |
| AerialFormer-S | | 79.3 | 91.3 | 87.2 | 93.5 | 97.0 | 87.7 | 88.9 | 96.0 | 60.2 |
| AerialFormer-B | | 79.7 | 91.4 | 87.6 | 93.5 | 97.2 | 88.1 | 89.3 | 95.7 | 61.9 |
* Categories in Potsdam dataset with clutter: impervious surface (Imp. Surf), building, low vegetation (Low Veg.), tree, car, and clutter/background.
Table 3. Performance comparison on the Potsdam validation set without clutter. The performance is reported using the mIoU, OA, mF1, and F1 score for each category. Note that both the training and evaluation were performed on the eroded dataset, and the clutter category was ignored. The bold and italic–underline values in each column show the best and the second best performances. An arrow (↑) indicates that a higher score is better.
| Method | Year | mIoU ↑ | OA ↑ | mF1 ↑ | Imp. Surf. | Building | Low Veg. | Tree | Car |
|---|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ [24] | 2018 | 81.7 | 89.6 | 89.8 | 92.3 | 95.5 | 85.7 | 86.0 | 89.4 |
| DANet [98] | 2019 | - | 89.7 | 89.1 | 91.6 | 96.4 | 86.1 | 88.0 | 83.5 |
| LANet [104] | 2020 | - | 90.8 | 92.0 | 93.1 | 97.2 | 87.3 | 88.0 | 94.2 |
| S-RA-FCN [75] | 2020 | 72.5 | 88.5 | 89.6 | 90.7 | 94.2 | 83.8 | 85.8 | 93.6 |
| FFPNet [105] | 2020 | 86.2 | 91.1 | 92.4 | 93.6 | 96.7 | 87.3 | 88.1 | 96.5 |
| ResT [106] | 2021 | 85.2 | 90.6 | 91.9 | 92.7 | 96.1 | 87.5 | 88.6 | 94.8 |
| ABCNet [107] | 2021 | 86.5 | 91.3 | 92.7 | 93.5 | 96.9 | 87.9 | 89.1 | 95.8 |
| Segmenter [56] | 2021 | 80.7 | 88.7 | 89.2 | 91.5 | 95.3 | 85.4 | 85.0 | 88.5 |
| TransUNet [79] | 2021 | 86.1 | 88.1 | - | 92.4 | 94.9 | 82.9 | 88.9 | 91.3 |
| HMANet [76] | 2021 | 87.3 | 92.2 | 93.2 | 93.9 | 97.6 | 88.7 | 89.1 | 96.8 |
| DC-Swin [16] | 2022 | 87.6 | 92.0 | 93.3 | 94.2 | 97.6 | 88.6 | 89.6 | 96.3 |
| BSNet [73] | 2022 | 77.5 | 90.7 | 91.5 | 92.4 | 95.6 | 86.8 | 88.1 | 94.6 |
| UNetFormer [15] | 2022 | 86.8 | 91.3 | 92.8 | 93.6 | 97.2 | 87.7 | 88.9 | 96.5 |
| FT-UNetformer [15] | 2022 | 87.5 | 92.0 | 93.3 | 93.9 | 97.2 | 88.8 | 89.8 | 96.6 |
| UperNet RSP-Swin-T [77] | 2022 | - | 90.8 | 90.0 | 92.7 | 96.4 | 86.0 | 85.4 | 89.8 |
| UperNet-RingMo [80] | 2022 | - | 91.7 | 91.3 | 93.6 | 97.1 | 87.1 | 86.4 | 92.2 |
| Hi-ResNet [108] | 2023 | 86.1 | 91.1 | 92.4 | 93.2 | 96.5 | 87.9 | 88.6 | 96.1 |
| MetaSegNet [109] | 2023 | 87.5 | 92.1 | 93.2 | 94.6 | 97.3 | 88.1 | 89.7 | 96.4 |
| ESDINet [110] | 2024 | 85.3 | 90.5 | 92.0 | 92.7 | 96.3 | 87.3 | 88.1 | 95.4 |
| AerialFormer-T | | 88.5 | 93.5 | 93.7 | 95.2 | 98.0 | 89.1 | 89.1 | 97.3 |
| AerialFormer-S | | 88.6 | 93.6 | 93.8 | 95.3 | 98.1 | 89.2 | 89.1 | 97.4 |
| AerialFormer-B | | 89.0 | 93.8 | 94.0 | 95.4 | 98.0 | 89.6 | 89.7 | 97.4 |
* Categories in Potsdam dataset without clutter: impervious surface (Imp. Surf), building, low vegetation (Low Veg.), tree, and car.
Table 4. Performance comparison on the LoveDA test set between AerialFormer and other existing state-of-the-art semantic segmentation approaches. The evaluation is based on a submission to the official server. The performance is reported based on the mIoU and IoU for each category. The bold and italic–underline values in each column show the best and the second best performances.
| Method | Year | mIoU ↑ | Background | Building | Road | Water | Barren | Forest | Agriculture |
|---|---|---|---|---|---|---|---|---|---|
| FCN [21] | 2015 | 46.7 | 42.6 | 49.5 | 48.1 | 73.1 | 11.8 | 43.5 | 58.3 |
| UNet [81] | 2015 | 47.8 | 43.1 | 52.7 | 52.8 | 73.1 | 10.3 | 43.1 | 59.9 |
| LinkNet [111] | 2017 | 48.5 | 43.6 | 52.1 | 52.5 | 76.9 | 12.2 | 45.1 | 57.3 |
| SegNet [112] | 2017 | 47.3 | 41.8 | 51.8 | 51.8 | 75.4 | 10.9 | 42.9 | 56.7 |
| UNet++ [113] | 2018 | 48.2 | 42.9 | 52.6 | 52.8 | 74.5 | 11.4 | 44.4 | 58.8 |
| DeeplabV3+ [24] | 2018 | 47.6 | 43.0 | 50.9 | 52.0 | 74.4 | 10.4 | 44.2 | 58.5 |
| FarSeg [11] | 2020 | 48.2 | 43.4 | 51.8 | 53.3 | 76.1 | 10.8 | 43.2 | 58.6 |
| TransUNet [79] | 2021 | 48.9 | 43.0 | 56.1 | 53.7 | 78.0 | 9.3 | 44.9 | 56.9 |
| Segmenter [56] | 2021 | 47.1 | 38.0 | 50.7 | 48.7 | 77.4 | 13.3 | 43.5 | 58.2 |
| Segformer [58] | 2021 | 49.1 | 42.2 | 56.4 | 50.7 | 78.5 | 17.2 | 45.2 | 53.8 |
| DC-Swin [16] | 2022 | 50.6 | 41.3 | 54.5 | 56.2 | 78.1 | 14.5 | 47.2 | 62.4 |
| ViTAE-B+RVSA [114] | 2022 | 52.4 | - | - | - | - | - | - | - |
| FactSeg [72] | 2022 | 48.9 | 42.6 | 53.6 | 52.8 | 76.9 | 16.2 | 42.9 | 57.5 |
| UNetFormer [15] | 2022 | 52.4 | 44.7 | 58.8 | 54.9 | 79.6 | 20.1 | 46.0 | 62.5 |
| RSSFormer [78] | 2023 | 52.4 | 52.4 | 60.7 | 55.2 | 76.3 | 18.7 | 45.4 | 58.3 |
| LOGCAN++ [101] | 2023 | 53.4 | 47.4 | 58.4 | 56.5 | 80.1 | 18.4 | 47.9 | 64.8 |
| Hi-ResNet [108] | 2023 | 52.5 | 46.7 | 58.3 | 55.9 | 80.1 | 17.0 | 46.7 | 62.7 |
| MetaSegNet [109] | 2023 | 52.2 | 44.0 | 57.9 | 58.1 | 79.9 | 18.2 | 47.7 | 59.4 |
| ESDINet [110] | 2024 | 50.1 | 41.6 | 53.8 | 54.8 | 78.7 | 19.5 | 44.2 | 58.0 |
| GDformer [115] | 2024 | 52.2 | 45.1 | 57.6 | 56.6 | 79.7 | 17.9 | 45.8 | 62.2 |
| AerialFormer-T | | 52.0 | 45.2 | 57.8 | 56.5 | 79.6 | 19.2 | 46.1 | 59.5 |
| AerialFormer-S | | 52.4 | 46.6 | 57.4 | 57.3 | 80.5 | 15.6 | 46.8 | 62.8 |
| AerialFormer-B | | 54.1 | 47.8 | 60.7 | 59.3 | 81.5 | 17.9 | 47.9 | 64.0 |
Table 5. The ablation study results on the CNN Stem and multi-dilated CNN decoder on the iSAID dataset. The table outlines the number of parameters (Params in millions), mean Intersection over Union (mIoU), overall accuracy without background (OA w/out bg), and overall accuracy (OA). Checkmarks (✓) indicate the presence of the module, and crosses (✗) indicate the absence of the module. Note that the MDC Block was replaced with a normal CNN Block to evaluate the favorable properties of the MDC Block.
| Method | Stem | MDC | Params (M) | mIoU | OA w/out bg | OA |
|---|---|---|---|---|---|---|
| Baseline | ✗ | ✗ | 44.92 | 66.7 | 74.25 | 99.03 |
| Stem-only | ✓ | ✗ | 45.08 | 66.9 | 75.06 | 99.05 |
| MDC-only | ✗ | ✓ | 42.56 | 67.2 | 75.60 | 99.05 |
| AerialFormer-T | ✓ | ✓ | 42.71 | 67.5 | 75.89 | 99.06 |
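The metrics reported in this table can be reproduced from a per-pixel confusion matrix; the sketch below shows one standard way to compute mIoU, OA, and an OA that excludes the background class. The class count and the assumption that index 0 is the background are illustrative, and the paper's exact evaluation protocol may differ.

```python
# Minimal sketch of how mIoU and overall accuracy (OA) can be computed from a
# confusion matrix; the class count and the background-index convention are
# assumptions for illustration, not the exact evaluation protocol of the paper.
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_oa(cm):
    tp = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - tp          # per-class union of prediction and ground truth
    iou = tp / np.maximum(union, 1)
    oa = tp.sum() / cm.sum()
    return iou.mean(), oa

pred = np.random.randint(0, 16, size=(512, 512))   # 16 classes; index 0 assumed background
gt = np.random.randint(0, 16, size=(512, 512))
cm = confusion_matrix(pred, gt, num_classes=16)
miou, oa = miou_and_oa(cm)
oa_wo_bg = np.diag(cm)[1:].sum() / np.maximum(cm[1:, :].sum(), 1)  # accuracy over non-background pixels
print(miou, oa, oa_wo_bg)
```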
Table 6. Comparative analysis of model complexity and performance for different semantic segmentation methods on the Potsdam dataset without clutter, measured using a 512 × 512 input size. The table outlines the number of parameters (Params) in millions (M), computational cost in Giga Floating Point Operations (GFLOPs), inference time in seconds (s), and mean Intersection over Union (mIoU) scores.
| Methods | Params (M) | FLOPs (G) | Inference Time (s) | mIoU |
|---|---|---|---|---|
| PSPNet [91] | 12.8 | 54.26 | 0.0097 | 60.3 |
| DeepLabV3+ [22] | 12.5 | 54.21 | 0.010 | 61.4 |
| Segformer [58] | 3.7 | 26.38 | 0.0089 | 65.6 |
| Unet [81] | 31.0 | 184.6 | 0.020 | 65.5 |
| SwinUNet [117] | 41.4 | 237.4 | 0.021 | 68.3 |
| TransUnet [79] | 90.7 | 233.7 | 0.023 | 70.3 |
| DC-SwinS [16] | 66.9 | - | 0.030 | 87.6 |
| UnetFormer-SwinB [15] | 96.0 | - | 0.043 | 87.5 |
| AerialFormer-T | 42.7 | 49.0 | 0.020 | 88.5 |
| AerialFormer-S | 64.0 | 72.2 | 0.028 | 88.6 |
| AerialFormer-B | 113.8 | 126.8 | 0.047 | 89.0 |
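For readers who wish to reproduce this style of complexity accounting, the snippet below shows one common way to report the Params (M) and inference time columns in PyTorch; FLOP counting usually relies on an external profiler and is omitted here. The tiny model used in the example is a placeholder, not any of the networks in the table.

```python
# One common way to report the Params (M) and inference time columns above:
# count trainable parameters and time a forward pass at a 512x512 input.
# The model below is a placeholder; plug in any segmentation network instead.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.GELU(), nn.Conv2d(16, 6, 1)).eval()
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params: {params_m:.2f} M")

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    model(x)                                   # warm-up pass
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    print(f"Inference time: {(time.perf_counter() - start) / 10:.4f} s per image")
```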