Semantic Segmentation of UAV Images Based on Transformer Framework with Context Information

Abstract: With the advances in Unmanned Aerial Vehicle (UAV) technology, aerial images with huge variations in the appearance of objects and complex backgrounds have opened a new direction of work for researchers. The task of semantic segmentation becomes more challenging when capturing inherent features in the global and local context for UAV images. In this paper, we propose a transformer-based encoder-decoder architecture to address this issue for the precise segmentation of UAV images. The inherent feature representation of the UAV images is exploited in the encoder network using a self-attention-based transformer framework to capture long-range global contextual information. A Token Spatial Information Fusion (TSIF) module is proposed to take advantage of a convolution mechanism that can capture local details. It fuses the local contextual details about the neighboring pixels with the encoder network and makes semantically rich feature representations. We propose a decoder network that processes the output of the encoder network for the final semantic-level prediction of each pixel. We demonstrate the effectiveness of this architecture on the UAVid and Urban Drone datasets, where we achieved mIoU of 61.93% and 73.65%, respectively.


Introduction
Semantic segmentation of images, i.e., assigning labels to each pixel, has been widely studied in the field of computer vision. Its major applications are in medical imaging [1,2], autonomous driving [3][4][5], satellite image segmentation [6], and point cloud scene segmentation [7]. Due to the rapid development of smart technologies, the application of Unmanned Aerial Vehicles (UAVs), i.e., drones, has significantly increased. UAV devices can capture photos even in remote areas that are difficult for humans to access. Aerial images captured by UAV devices contain rich information [8] that can be utilized for different tasks such as traffic density estimation [9], building extraction [10], and flooded area monitoring [11].
Recently, semantic segmentation of UAV images has shown decent performance in a variety of applications [12]. Dutta et al. [13] designed a segmentation framework for UAV images for the early detection of disease in cruciferous crops. Song et al. [14] designed an image fusion-based segmentation network for sunflower lodging recognition using UAV images. Lobo Torres et al. [15] tested Fully Convolutional Network (FCN) based approaches for conducting endangered tree species analyses using UAV urban scene areas. The semantic segmentation of UAV street scene images has opened a new direction of work for researchers [16]. Pixel-level dense prediction results of objects in street scene images are beneficial for land use monitoring. Convolution-based architectures are widely exploited for segmentation in computer vision. The majority of segmentation models [17][18][19][20] adopt an encoder network and design different decoder architectures. However, the FCN-based encoder backbone results in coarse predictions [22], and the receptive field becomes constant after a certain threshold [4]. Due to their local receptive fields, Convolutional Neural Networks (CNNs) maintain local context properly but fail to capture global context [23]. For UAV scene segmentation, Lyu et al. [16] constructed a segmentation dataset and designed a segmentation model, which showed satisfactory results for UAV image segmentation. Girisha et al. [24] adopted Long Short-Term Memory (LSTM) and optical flow modules for UAV urban scene segmentation.
UAV images normally contain very complex backgrounds, with lots of variation in object appearance and scale, which poses a significant challenge for semantic segmentation [12]. Even though the existing segmentation frameworks show satisfactory results, including the ones specially designed for UAV scene segmentation, their feature extractors struggle to capture the inherent features of aerial images. Segmenting minority classes, especially small objects such as humans, which have the fewest pixels in the whole image, becomes challenging. For dense prediction tasks on complex UAV images, i.e., semantic segmentation, global and local context information is an important component [22,25]. Considering both global and local contextual details can give better performance [8,12,26]. The focus of this work is to achieve precise semantic segmentation of UAV street scene images. Inspired by the transformer-based design paradigms for computer vision tasks [23,27], an encoder-decoder framework is proposed to address such issues in this work. It incorporates a self-attention-based encoder network, which captures long-range information in the UAV images by maintaining global receptive fields. This allows the network to maintain global contextual details. For modeling low-level contextual details, using the advantage of CNNs in capturing local information, a convolution-based element is introduced in the encoder network, i.e., the Token Spatial Information Fusion (TSIF) module. Capturing local context information lets the network maintain the proper shape and size of the objects in the segmentation results. The output of the powerful self-attention-based encoder network contains semantically very rich information, both global and local contextual details. A decoder network is proposed for final pixel-level predictions, which processes this output information from the encoder network.
In short, the main contributions of this paper are as follows:
1. We propose a new encoder-decoder-based framework for semantic segmentation of UAV street scene images;
2. The encoder network is designed using a transformer-based module to capture global and local context information;
3. A decoder network is constructed to upsample the output of the encoder network to the original image size without losing the semantically rich information;
4. Experiments on the UAVid and Urban Drone Dataset (UDD-6) datasets demonstrate that the proposed method outperforms state-of-the-art methods.
The rest of the paper is organized as follows: Section 2 discusses the related work. In Section 3, we briefly describe the proposed design. We present the experimental results in Section 4. Based on our findings, we discuss the advantages of the proposed method in Section 5. Finally, in Section 6, we conclude the paper.

Related Work

Semantic Segmentation
In the direction of semantic image segmentation, the Fully Convolutional Network (FCN) [21] is the first work that used Convolutional Neural Networks (CNNs). Following this work, many different methods have been designed to improve its segmentation results. DeconvNet [28] used a progressive upsampling strategy to upsample the coarse output of the FCN. SegNet [29] upsampled the coarse feature map using max-pooling indices transferred from the encoder. U-Net [1] used skip connections and introduced contracting and expanding paths, fusing information from the encoder network to the decoder network during each upsampling stage. GCN [30] and DeepLabV3+ [17] adopted dilation rates greater than 1 to enlarge the receptive field. DANet [18], CCNet [31], and ISNet [3] used attention-based operations to capture long-range global context. Li et al. [32] used the Multitask Low-rank Affinity Pursuit (MLAP) method to annotate and segment images. Li et al. [33] performed automatic labeling of infrared images for UAVs using a probabilistic framework.

Transformer
Transformer-based [34] frameworks have transformed the entire field of Natural Language Processing (NLP). Following their huge success in NLP, researchers have also started exploring their application to computer vision tasks. The Vision Transformer (ViT) [23] is the first work in this direction; it used a fully transformer-based design for image classification. After that, many researchers [35][36][37][38] built on this work to improve classification accuracy. DeiT [35] introduced a teacher-student framework to make the network easier to train.
ViT splits an input image into a sequence of tokens, and the sequence is processed by a set of stacked transformer encoder blocks. It uses a self-attention-based mechanism to maintain a global receptive field, which CNN-based architectures lack [4,23]. After its introduction, many architectures were designed to solve dense prediction tasks such as image segmentation [4,39] and object detection [40,41]. SETR [4] used ViT as an encoder and developed different variants of decoder designs. On top of a ViT-based encoder, Segmenter [42] designed a mask-based decoder network. PVT [27], DPT [43], and SegFormer [44] adopted different design choices to generate multi-scale feature representations like CNNs. DPT generated hierarchical feature representations by assembling tokens from different stages of the ViT. PVT and SegFormer adopted a framework that progressively shrinks the pyramid to output features at varying scales. Yuan et al. [45], Xie et al. [44], Wu et al. [46], and Chu et al. [37] adopted different convolution-based variants to fuse positional information.

Semantic Segmentation of Aerial Images
With the increasing adoption of UAV devices, semantic segmentation of aerial images [8,25] has opened new research opportunities. UNetFormer [25] used a ResNet-based encoder module and designed a decoder network using a transformer framework to model image-level and semantic-level details. MFNet [8] employed a framework to maintain low-level details and inter-class discriminative characteristics. UAVFormer [26] used aggregation window-based self-attention modules to model complex features in UAV scenes. ABCNet [22], CANet [47], and BANet [48] used a dual-path approach to capture long-range and fine-grained information. Liu et al. [49] designed a lightweight attention network, which uses spatial and channel attention modules. Iqbal et al. [50] followed a weakly-supervised domain adaptation framework for the segmentation of aerial and satellite images; it handles cross-domain discrimination issues between aerial and satellite imagery. Yeung et al. [6] adopted steerable-filter-based and transfer-learning-based paradigms for segmenting satellite images, addressing the limited availability of labeled data for satellite scenes. Gebrehiwot et al. [51] segmented flooded regions in UAV images, where the flooded water is differentiated from buildings, vegetation, and roads. Ichim et al. [11] used a decision fusion-based strategy to segment the flooded areas and vegetation in UAV scenes. Zhang et al. [52] combined a series of residual U-Net modules for segmenting plants in UAV images. Dutta et al. [13] designed a framework to semantically segment unhealthy leaves in aerial images. ICENet [53] used a positional and channel-wise fusion of attentive features for segmenting ice in rivers. Gevaert et al. [54] used a semantic classification method to manage infrastructural development in informal settlement regions.

Encoder Network
The encoder network is constructed using a self-attention-based transformer [23] framework. Overall, it consists of an image tokenization module, a linear projection module, transformer encoder modules, and a Token Spatial Information Fusion (TSIF) module. Figure 2a gives an overview of the encoder network. It takes an image as input and splits it into a sequence of tokens using the image tokenization module.
The image tokenization module transforms the two-dimensional input data, $x \in \mathbb{R}^{H \times W \times 3}$, into a sequence of tokens, $x_t \in \mathbb{R}^{S \times (t^2 \cdot 3)}$, where $(H, W)$ is the input image size, $(t, t)$ is the token size, and $S = HW/t^2$ is the number of tokens extracted from the input image. The linear projection module projects the sequence to a projection dimension $P$, giving $x_t \in \mathbb{R}^{S \times P}$. Here, the values of $t$ and $P$ are set to 16 and 1024, respectively. The output of the linear projection module is processed by $H$ transformer encoder modules. Each transformer encoder module consists of a Multi-head Self-attention (MHA) block and a Multi-Layer Perceptron (MLP) block. Layer Normalization (LN) is applied before each block, and a residual connection is applied after each block. Figure 2b shows the overall architecture of the transformer encoder module. Here, $H$ is set to 24.
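The tokenization and projection steps above can be sketched in a few lines of NumPy. This is an illustrative toy: the paper uses t = 16 and P = 1024, while here t = 4 and P = 8 keep the example small, and the projection weights are random placeholders rather than learned parameters.

```python
import numpy as np

def tokenize(image, t=4, P=8):
    """Split an image into non-overlapping t x t tokens and linearly project them.

    image: (H, W, 3) array; H and W must be divisible by t.
    Returns an (S, P) token matrix with S = H*W / t**2.
    """
    H, W, C = image.shape
    # Rearrange into S = (H/t)*(W/t) flattened tokens of length t*t*C.
    tokens = (image.reshape(H // t, t, W // t, t, C)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(-1, t * t * C))
    # Linear projection to dimension P (random weights, for illustration only).
    rng = np.random.default_rng(0)
    W_proj = rng.standard_normal((t * t * C, P)) * 0.02
    return tokens @ W_proj

x = np.ones((16, 16, 3))
xt = tokenize(x)          # S = 16*16 / 4**2 = 16 tokens
print(xt.shape)           # (16, 8)
```

With the paper's settings, a 512 × 512 crop would yield S = 512 · 512 / 16² = 1024 tokens of dimension 1024.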
An MHA block is a collection of multiple Self-Attention (SA) operations. The input sequence $x_t \in \mathbb{R}^{S \times P}$ is projected into query ($Q$), key ($K$), and value ($V$) matrices, and the SA operation is performed as follows:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$

where $d$ is the feature dimension of each head. The feature spaces of $Q$, $K$, and $V$ are split $k$ times, and the SA operation is performed in parallel over the heads. Here, $k$ denotes the number of heads and is set to 16. The outputs of the heads are concatenated together and projected back to the projection dimension. The self-attention-based transformer encoder module provides a global receptive field at each stage, and hence supports the network in capturing global contextual information [23].
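A minimal NumPy sketch of the multi-head self-attention described above, with toy sizes (the paper uses P = 1024 and k = 16; random matrices stand in for the learned projection weights):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, k):
    """Multi-head self-attention over a token sequence x of shape (S, P)."""
    S, P = x.shape
    d = P // k                                               # per-head dimension
    split = lambda m: m.reshape(S, k, d).transpose(1, 0, 2)  # -> (k, S, d)
    Qh, Kh, Vh = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    # Scaled dot-product attention, computed in parallel across the k heads.
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d))
    out = (A @ Vh).transpose(1, 0, 2).reshape(S, P)          # concatenate heads
    return out @ Wo                                          # project back to P

rng = np.random.default_rng(0)
S, P, k = 16, 64, 16
W = [rng.standard_normal((P, P)) * 0.02 for _ in range(4)]
y = multi_head_attention(rng.standard_normal((S, P)), *W, k=k)
print(y.shape)            # (16, 64)
```

Each of the S tokens attends to all S tokens, which is what gives the encoder its global receptive field.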
The MLP block consists of two linear layers with a non-linearity function in between. It performs feature expansion and reduction operations using these layers: the first layer expands the feature space by a factor of $e$, and the second layer restores the expanded dimension back to the original, where $e$ is the feature expansion/reduction ratio. The output of the MLP block is as follows:

$$\mathrm{MLP}(x) = \nabla(x W_1 + b_1) W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{P \times L}$ and $W_2 \in \mathbb{R}^{L \times P}$ correspond to the weights of the first and second layers, respectively, $P$ is the projection dimension, and $L$ is the expanded dimension, set to 4096 here. $b_1 \in \mathbb{R}^{L}$ and $b_2 \in \mathbb{R}^{P}$ are the biases of the first and second layers, respectively, and $\nabla(\cdot)$ is the non-linearity function. The overall output of an individual transformer encoder module is then:

$$x' = \mathrm{MHA}(\mathrm{LN}(x)) + x, \qquad y = \mathrm{MLP}(\mathrm{LN}(x')) + x'.$$

The Token Spatial Information Fusion (TSIF) module helps capture low-level spatial context details about neighboring pixels in the image tokens. Since the receptive field of a CNN is local in nature, it can extract local information around neighboring pixels well [45]; hence, convolution retains the ability to capture local context details for dense prediction tasks such as semantic segmentation [44]. The TSIF operation applies an $R \times R$ convolution $\psi$ to the tokens, where $R$ is set to 3 in this work. $\psi$ can also be replaced with other convolution variants for fusing local fine-grained details.
Using this design paradigm of the TSIF module also increases the representation power of features.
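The TSIF idea, reshaping the token sequence back into its 2D grid and applying a small convolution to mix neighboring tokens, can be sketched as below. The single shared R × R kernel and the residual-style fusion (x + ψ(x)) are simplifying assumptions made for illustration; the paper only specifies that ψ is an R × R convolution with R = 3 and that other convolution variants may be substituted.

```python
import numpy as np

def tsif(tokens, grid_h, grid_w, kernel):
    """Fuse local spatial context into a token sequence (sketch).

    tokens: (S, P) with S = grid_h * grid_w.
    kernel: (R, R) kernel applied identically to every channel (a
            simplification of a full learned convolution).
    The residual fusion (x + conv(x)) is our assumption, not from the paper.
    """
    S, P = tokens.shape
    R = kernel.shape[0]
    pad = R // 2
    grid = tokens.reshape(grid_h, grid_w, P)       # tokens back on their 2D grid
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))
    conv = np.zeros_like(grid)
    for i in range(R):                              # naive R x R convolution
        for j in range(R):
            conv += kernel[i, j] * padded[i:i + grid_h, j:j + grid_w]
    return (grid + conv).reshape(S, P)              # residual fusion, back to (S, P)

kernel = np.zeros((3, 3))
kernel[1, 1] = 1.0                       # identity kernel: conv(x) == x
fused = tsif(np.ones((12, 5)), 4, 3, kernel)   # residual fusion doubles the input
print(fused.shape)                       # (12, 5)
```

The shape of the token sequence is preserved, so the module can be dropped into the encoder without changing the downstream interface.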

Decoder Network
The decoder network consists of four upsampling stages. It takes the output feature map of the transformer-based encoder as input for final segmentation and uses a gradual upsampling policy to restore the feature map to the original image size. Before feeding the output of the encoder network to the decoder network, the feature maps are reshaped into a 3D representation as follows:

$$x_{3D} = \mathcal{G}(x_{in}), \qquad x_{3D} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times P},$$

where $\mathcal{G}$ is the reshape operation, which takes the output feature of the encoder network, $x_{in}$, as input, $P$ is the projection dimension, and $\frac{H}{16} \times \frac{W}{16}$ is the output feature resolution of the encoder network. This reshaped 3D feature is fed to the decoder network. Figure 3 presents an overview of the decoder network. Each stage consists of bilinear upsampling with a scale factor of 2, followed by a 3 × 3 convolution operation and a Gaussian Error Linear Units (GELU) activation function. $N_c$ represents the number of feature channels at each stage and is set to 256. The output feature maps of the last stage have the same resolution as the original image, i.e., (H, W). The channel dimension of these features is projected to the number of semantic categories, $N_{cls}$, using a 1 × 1 convolution operation.
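The four-stage decoder can be sketched as follows. For brevity, nearest-neighbor repetition stands in for the paper's bilinear upsampling, the 3 × 3 convolution is reduced to a 1 × 1 channel projection with random weights, and toy channel counts replace N_c = 256; only the stage structure (2× upsample → convolution → GELU, then a final 1 × 1 classifier) follows the text.

```python
import numpy as np

def decode(feat, n_c=8, n_cls=6, stages=4):
    """Four-stage decoder sketch: 2x upsample -> projection -> GELU per stage.

    feat: (h, w, P) reshaped encoder output; returns (h*16, w*16, n_cls) logits.
    """
    rng = np.random.default_rng(0)
    # tanh approximation of GELU
    gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
    x = feat
    for _ in range(stages):
        x = x.repeat(2, axis=0).repeat(2, axis=1)           # 2x spatial upsampling
        W = rng.standard_normal((x.shape[-1], n_c)) * 0.02  # stand-in for 3x3 conv
        x = gelu(x @ W)
    W_cls = rng.standard_normal((n_c, n_cls)) * 0.02        # final 1x1 classifier
    return x @ W_cls

logits = decode(np.ones((4, 4, 16)))
print(logits.shape)   # (64, 64, 6): four 2x stages restore 16x the input resolution
```

The per-pixel class is then the argmax over the last axis of the logits.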
The output feature of the encoder network contains semantically rich information in terms of both global and local context details.This design paradigm of the decoder network generates the segmentation output without losing those details.

Dataset Overview
The performance of the proposed method was evaluated on two publicly available Unmanned Aerial Vehicle (UAV) semantic segmentation datasets: UAVid [16] and the Urban Drone Dataset (UDD-6) [55]. The UAVid dataset contains 42 sequences of urban street scene images. Each sequence consists of 10 RGB images of 3840 × 2160 or 4096 × 2160 size. Twenty sequences are for training, seven sequences are for validation, and the remaining are for testing. It consists of eight semantic categories: building, road, tree, vegetation, static car, moving car, humans, and clutter. The UDD-6 dataset was collected in multiple cities. It contains 141 RGB images, split into training and validation sets of 106 and 35 images, respectively. Each image has 3840 × 2160, 4096 × 2160, or 4000 × 3000 size. It contains six semantic labels: facade, road, vegetation, vehicle, roof, and others.

Implementation Details
Since it is difficult and computationally expensive to use high-resolution UAV images for training, we split the original image into non-overlapping patches of 512 × 512 resolution. Because the original image size is not divisible by 512, some border regions do not fit into a 512 × 512 patch; for those regions, we extract a 512 × 512 patch with the required overlap. For evaluation of the metrics, we merge the predictions on each patch back into the original image size. Images are normalized using a mean and standard deviation of (0.4914, 0.4824, 0.4467) and (0.2471, 0.2436, 0.2616), respectively. Random horizontal flipping is used for augmentation. All experiments are conducted on an NVIDIA RTX A6000 GPU, using the PyTorch library. A step learning rate schedule is used with an initial learning rate of 1 × 10−3, and Stochastic Gradient Descent (SGD) is used as the optimizer with a momentum of 0.9 and a weight decay of 1 × 10−4 on both datasets. The network is trained for 120 and 180 epochs on the UAVid and UDD-6 datasets, respectively, using a batch size of 4. Cross-entropy loss is minimized during training.
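The patch-cropping strategy described above can be expressed as a small helper that returns the top-left coordinates of the 512-pixel crops along one image axis, adding a final border-aligned (overlapping) patch when the axis length is not divisible by 512. The function name is ours, not from the paper.

```python
def patch_starts(length, patch=512):
    """Top-left coordinates of `patch`-wide crops covering `length` pixels.

    Non-overlapping strides of `patch`, plus one final patch aligned to the
    image border (overlapping the previous one) when `length` is not
    divisible by `patch` -- mirroring the cropping strategy in the text.
    """
    starts = list(range(0, length - patch + 1, patch))
    if length % patch != 0:
        starts.append(length - patch)   # border-aligned patch with overlap
    return starts

print(patch_starts(1024))   # [0, 512]: divides evenly, no overlap needed
print(patch_starts(2160))   # [0, 512, 1024, 1536, 1648]: last patch overlaps
```

Applying the helper to both axes of a 3840 × 2160 UAVid frame yields the full grid of 512 × 512 crops; at prediction time, each crop's output is written back to its coordinates to rebuild the full-resolution segmentation map.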

Results on UAVid Dataset
In this subsection, we present the test results on the UAVid dataset. It is a UAV urban scene segmentation dataset and, due to many scale variations, it is very challenging to achieve high segmentation accuracy. We selected Dilation Net [56], U-Net [1], MSD [16], ERFNet [57], BiSeNetV2 [58], Fast-SCNN [59], ShelfNet [60], SETR-PUP [4], and Segmenter [42] for comparison. Per-class Intersection over Union (IoU), mean IoU (mIoU), and Overall Accuracy (OA) are used as performance metrics. Appendix A.1 describes the evaluation protocol for these metrics. The quantitative results compared with the competitive methods are reported in Table 1. The highest value of each metric in the table is marked in bold. The proposed framework achieves an mIoU and OA of 61.93% and 84.49%, respectively. It outperforms the other methods by at least 2.25% in mIoU. The per-class IoU is also competitive with the other methods.
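For reference, the standard confusion-matrix computation of per-class IoU is sketched below (mIoU is the mean over classes, and OA is the trace of the matrix divided by the total pixel count). We assume the Appendix A.1 protocol matches these usual definitions.

```python
import numpy as np

def per_class_iou(pred, gt, n_cls):
    """Per-class IoU from a confusion matrix (standard definition)."""
    cm = np.zeros((n_cls, n_cls), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)   # rows: ground truth, cols: prediction
    tp = np.diag(cm).astype(float)                 # true positives per class
    union = cm.sum(0) + cm.sum(1) - tp             # pred + gt - intersection
    return np.where(union > 0, tp / union, np.nan)

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
iou = per_class_iou(pred, gt, 2)
# class 0: tp=1, union=2 -> 0.5 ; class 1: tp=2, union=3 -> 2/3
print(iou, np.nanmean(iou))   # per-class IoU and mIoU
```

The same confusion matrix also yields OA as `np.diag(cm).sum() / cm.sum()`.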
ERFNet and BiSeNetV2 show mIoU of 59.28% and 59.68%, respectively, which is comparable to the performance of the proposed framework with an mIoU of 61.93%. However, the proposed method outperforms Dilation Net, U-Net, MSD, Fast-SCNN, ShelfNet, SETR-PUP, and Segmenter by decent margins of 13.35%, 3.51%, 4.94%, 15.99%, 14.9%, 7.94%, and 3.18%, respectively, in mIoU. In per-class IoU, the proposed method outperforms Segmenter in five out of eight categories. Humans are the minority class in the whole dataset and also the smallest object to segment; hence, it is very challenging to segment humans. For the human class, the proposed method shows decent performance, with an IoU of 22.94%. The proposed method trails ERFNet by 0.2% in OA but outperforms it by 2.68% in mIoU. It outperforms U-Net, BiSeNetV2, and SETR-PUP by 1.06%, 3.39%, and 4.63%, respectively, in OA. Qualitative results of the proposed method on the UAVid validation and test datasets are depicted in Figures 4 and 5, respectively. We can see that the proposed approach generates smooth segmentation of UAV images. It captures the global and local contextual information well. Cars are well segmented while maintaining their proper shape and size. In the third and fifth images of Figure 5, small objects such as humans are well segmented.

Results on UDD-6 Dataset
In this subsection, we present the validation results on the UDD-6 dataset. FCN-8s [21], U-Net [1], GCN [30], ENet [61], ERFNet [57], BiSeNetV2 [58], DeepLabV3+ [17], and SETR-PUP [4] are selected for comparison. Per-class IoU, mIoU, mean F1 score, and OA are used as performance metrics. Appendix A.1 describes the evaluation protocol for these metrics. The quantitative results compared with the competitive methods are reported in Table 2. The highest value of each metric in the table is marked in bold. The proposed method achieves an mIoU, mean F1 score, and OA of 73.65%, 84.40%, and 86.98%, respectively. It outperforms the other methods in most metrics by a significant margin.
Qualitative results of the proposed method on the UDD-6 validation dataset are depicted in Figures 6 and 7. The proposed method preserves the local and global context details well and produces fine segmentation output. Red and yellow boxes in Figure 7 represent the local and global contextual details, respectively, in an image. SETR-PUP captures global context well, but it struggles to capture local details.

Ablation Studies
In this subsection, we discuss the results of the sensitivity tests for the adopted approach using the UDD-6 validation dataset. We investigate the effect of different convolution variants for ψ in the TSIF module, testing a 3 × 3 convolution and a 3 × 3 dilated convolution with a dilation rate of 2. It can be seen from Table 3 that the 3 × 3 convolution gives better results, outperforming the dilated convolution by 0.56%, 0.41%, and 0.006% in mIoU, mean F1 score, and OA, respectively. To study the effect of distinct upsampling operations in the decoder network, we tested the nearest-neighbor and bilinear upsampling policies. It can be seen from Table 4 that the bilinear upsampling approach performs best, outperforming the nearest-neighbor policy by 0.81%, 0.59%, and 0.81% in mIoU, mean F1 score, and OA, respectively. The number of feature channels in the decoder network, N c , is a crucial hyperparameter. We investigate the sensitivity for two values, 512 and 256, using bilinear upsampling and a 3 × 3 convolution for ψ in the TSIF module in both tests. The results are presented in Table 5. Both settings show very close performance in OA; however, N c of 256 shows superior results compared to 512 in mIoU and mean F1 score. Overall, it outperforms N c of 512 by 0.51%, 0.37%, and 0.02% in mIoU, mean F1 score, and OA, respectively.
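The trade-off probed in this ablation can be made concrete: a dilated kernel covers a wider but sparser neighborhood. A one-line helper (ours, for illustration) gives the effective spatial extent of an R × R kernel with dilation rate d:

```python
def effective_receptive_field(kernel=3, dilation=1):
    """Effective spatial extent of a dilated convolution kernel:
    kernel + (kernel - 1) * (dilation - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)

print(effective_receptive_field(3, 1))  # 3: standard 3x3 convolution
print(effective_receptive_field(3, 2))  # 5: 3x3 kernel with dilation rate 2
```

The dilated variant spans 5 × 5 pixels but samples only 9 of them, skipping immediate neighbors, which may explain why the dense 3 × 3 kernel better captures contiguous same-class pixels in Table 3.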

Discussion
Heavy background clutter and large appearance variations in aerial images pose many challenges for semantic segmentation. The proposed framework precisely captures the inherent features in the global and local context of UAV images. The qualitative visualization of results shows a smooth segmentation output. The global context between similar objects in an image, such as cars, low vegetation, and trees, is well maintained. This can be observed in Figures 4 and 5. The self-attention mechanism in the encoder network captures these long-range details about similar objects. We can also see in Figure 6 that our method handles global information about cars and vegetation precisely. Similarly, in the first and second rows of Figure 7, global details about the cars are well captured by our method. Long-range vegetation is also well segmented in the third and sixth rows.
The shape and size of the objects are well preserved, and the boundary information between two classes is maintained smoothly. This can also be seen in the figures. The Token Spatial Information Fusion (TSIF) module in the encoder network captures these local context details around neighboring pixels. The roof and facade are smoothly segmented in Figure 6, while preserving the local details. Compared to SETR-PUP [4], the proposed method gives better results. SETR-PUP struggles to preserve information around neighboring pixels belonging to the same class. In the fourth and sixth rows of Figure 7, our method segments the roof well. Similarly, in the first and third rows, our method smoothly handles the local context for the vegetation and road categories. These can be seen in the regions marked with red boxes.
The encoder network produces a strong feature representation, which is semantically rich in both high- and low-level details. The gradual upsampling strategy, followed by a convolution operation, in the decoder network helps preserve this rich information, and hence it generates smooth predictions. The quantitative and qualitative findings on the UAVid and UDD-6 datasets show the advantage of the proposed framework for UAV image segmentation.
We performed sensitivity tests for various hyperparameters in our method. As shown in Table 3, the 3 × 3 convolution for ψ in the TSIF module gives superior results compared to the dilated convolution. It captures the inherent features of pixels belonging to the same class in a better way. Similarly, the bilinear upsampling policy in the decoder network gives the best performance, preserving the overall semantic contextual details precisely for UAV images. The combination of a 3 × 3 convolution for ψ, the bilinear upsampling approach, and 256 feature channels in the decoder network generates the best results in segmenting aerial scenes, as presented in Table 5. Increasing the number of feature channels leads to a decrease in performance.
Overall, the proposed design shows competitive performance, but it still struggles with small objects such as humans. Compared to other methods, it shows decent performance for the human category, yet segmenting this category remains very challenging. The UAVid and UDD-6 datasets were captured in high-visibility conditions, so the performance of the proposed method may degrade for images captured in different weather conditions. The design of an effective framework that handles small objects in UAV scenes and is also resilient to different weather conditions can be considered in future work.

Conclusions
In this study, we proposed a transformer-based design for the semantic segmentation of UAV street scene images. We used an encoder-decoder framework that captures the inherent features in the global and local context of UAV images. The self-attention-based encoder network captures long-range information; therefore, the global context between similar objects in the image is well maintained. The convolution-based TSIF module fuses local contextual details into the network, which helps the network segment neighboring pixels while maintaining the proper shape and size of the objects. The decoder network uses the semantically rich feature representations from the encoder network for final pixel-level predictions. It generates smooth segmentation with well-preserved boundary information between classes. We performed a set of ablation studies for the sensitivity tests. Overall, the proposed framework shows competitive results on two public datasets: UAVid and UDD-6. However, it struggles to segment small objects in complex UAV images. The datasets used in this work were captured in high-visibility conditions, so the performance of the framework may degrade for images from different weather conditions. In future research, we will explore further segmentation designs for handling small objects in aerial images that are also resilient to changing weather conditions.
Pixels belonging to the same class but far apart from each other represent the global context, whereas neighboring pixels of the same class represent the local context. Global and local context modeling is shown in Figure 1; the yellow and red arrows in the figure represent global and local contextual information modeling, respectively.

Figure 1 .
Figure 1. Illustrating global and local context information for UAV image semantic segmentation. (a) Input image. (b) Global context modeling. The yellow arrows represent the global contextual information. (c) Local context modeling. The red arrows represent the local contextual information.

Figure 2 .
Figure 2. (a) Overview of the encoder network design. It consists of linear projection, transformer encoder, and Token Spatial Information Fusion (TSIF) modules. (b) Transformer encoder module.

Figure 3 .
Figure 3. Overview of the decoder network. It consists of four upsampling stages.

Figure 7 .
Figure 7. Qualitative prediction results on the UDD-6 validation dataset. The red and yellow boxes represent the local and global context, respectively, in the images.

Table 1 .
Quantitative comparison of the UAVid test dataset result with competitive methods.

Table 2 .
Quantitative comparison of UDD-6 validation dataset result with competitive methods.

Table 3 .
The effect of different convolution variants for ψ in the TSIF module using the UDD-6 dataset.

Table 4 .
The effect of a different upsampling policy in the decoder network using the UDD-6 dataset.

Table 5 .
The effect of different feature channels, N c , in the decoder network using the UDD-6 dataset.