Semi-Supervised Adversarial Semantic Segmentation Network Using Transformer and Multiscale Convolution for High-Resolution Remote Sensing Imagery

Abstract: Semantic segmentation is a crucial approach for remote sensing interpretation. High-precision semantic segmentation results are obtained at the cost of manually collecting massive pixelwise annotations. Remote sensing imagery contains complex and variable ground objects and obtaining abundant manual annotations is expensive and arduous. The semi-supervised learning (SSL) strategy can enhance the generalization capability of a model with a small number of labeled samples. In this study, a novel semi-supervised adversarial semantic segmentation network is developed for remote sensing information extraction. A multiscale input convolution module (MICM) is designed to extract sufficient local features, while a Transformer module (TM) is applied for long-range dependency modeling. These modules are integrated to construct a segmentation network with a double-branch encoder. Additionally, a double-branch discriminator network with different convolution kernel sizes is proposed. The segmentation network and discriminator network are jointly trained under the semi-supervised adversarial learning (SSAL) framework to improve segmentation accuracy in cases with small amounts of labeled data. Taking building extraction as a case study, experiments on three datasets with different resolutions are conducted to validate the proposed network. Semi-supervised semantic segmentation models, in which DeepLabv2, the pyramid scene parsing network (PSPNet), UNet and TransUNet are taken as backbone networks, are utilized for performance comparisons. The results suggest that the approach effectively improves the accuracy of semantic segmentation. The F1 and mean intersection over union (mIoU) accuracy measures are improved by 0.82–11.83% and 0.74–7.5%, respectively, over those of other methods.


Introduction
With the progress of sensor technology, massive quantities of high-resolution remote sensing data are collected every day, which poses great challenges for fast and accurate information extraction from remote sensing imagery. Recently, convolutional neural networks (CNNs) have achieved excellent performance in remote sensing imagery interpretation owing to their powerful feature representation capability [1,2]. Semantic segmentation techniques represented by fully convolutional networks (FCNs) [3] can achieve accurate pixelwise image classification with sufficient training data. They have become the mainstream technology in the information extraction field and are widely used for remote sensing imagery object extraction, including buildings, roads, and water bodies [4][5][6].
Classical semantic segmentation networks, such as the pyramid scene parsing network (PSPNet) [7], DeepLabs [8] and dual attention network (DANet) [9], are trained in a fully supervised mode, which relies on massive manual annotations. Remote sensing imagery is characterized by multisource, multitemporal and complex scenes and acquiring adequate pixelwise annotations is extremely expensive. Although some datasets have been established for remote sensing semantic segmentation, such as the Gaofen Image Dataset (GID) [10], the EVLab-Semantic Segmentation (EVLab-SS) Dataset [11], and the International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam datasets [12], the quantity of training data for semantic segmentation is still small, considering the complexity of remote sensing information extraction tasks. The existing datasets have difficulty in covering different regions and image types simultaneously, which seriously affects the generalization capability of models. Therefore, many existing approaches rely on semisupervised training schemes to reduce annotation requirements [13,14]. Research on using unlabeled samples to assist model training and improving the accuracy of object extraction with a small quantity of annotated data, namely, semi-supervised learning (SSL) strategies, is of great significance.
SSL can automatically utilize unlabeled samples to enhance the generalization ability of learners, without interacting with the outside world. End-to-end semi-supervised deep learning methods include proxy-label methods [15,16], consistency regularization [17,18], hybrid methods [19,20], and SSL methods combined with generative adversarial networks (GANs) [21]. GAN-based SSL methods, namely semi-supervised adversarial learning (SSAL) techniques, have become popular in recent years and have been applied for remote sensing tasks, involving image segmentation and image interpretation [22,23]. Figure 1 shows a typical SSAL framework for image semantic segmentation [24]. The generator in an initial GAN framework [25] is replaced by a segmentation network, which inputs labeled and unlabeled data and outputs the corresponding prediction maps. The discriminator network inputs the prediction maps and ground-truth maps and outputs confidence maps, which are taken as supervisory signals for the unlabeled data to guide the SSL process. Some studies [24] have shown that this framework enables segmentation networks to learn higher-order structural information without postprocessing, thereby improving the generalization ability of the networks.
FCNs are commonly used to construct segmentation networks and discriminator networks under the SSAL framework. FCNs have powerful feature extraction capabilities. However, restricted by the given receptive fields, convolution operations have difficulty acquiring global contextual information [26]. To overcome this limitation, some multiscale modules [7,8] have been proposed to improve the feature extraction capability of the resulting models. In addition, utilizing deep networks with complex components [27] and integrating attention modules into FCN architectures, such as DANet [9] and the squeeze-and-excitation network (SENet) [28], can provide effective global context. However, these approaches cannot avoid the loss of details when the resolutions of feature maps are gradually reduced during the encoding phase.
The Transformer first appeared in machine translation tasks and has recently attracted considerable attention in the computer vision field [29][30][31][32]. Transformer layers [33], which contain stacked multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks, can capture global contextual information and the long-range dependencies between objects. In complex remote sensing scenes, acquiring contextual long-range dependencies is important for accurate object recognition and extraction. Methods combining convolutions with a Transformer can acquire local features and global contextual relationships simultaneously. Some works have shown that this combination effectively improves image segmentation accuracy [26,34]. However, such studies are rare in semi-supervised remote sensing image segmentation.
In this article, we develop a novel semi-supervised adversarial semantic segmentation approach for remote sensing information extraction that combines the advantages of both convolution and Transformer, called TRANet. The main contributions include the following:

• A multiscale input convolution module (MICM) and an improved strip-max pooling (SMP) structure are provided. The MICM adopts multiscale downsampling and skip connections to capture information of different input scales, while maintaining the spatial details of objects in complex remote sensing scenes. The SMP preserves both the global and horizontal/vertical information during feature extraction, thereby reducing the information loss when the resolutions of the feature maps are gradually reduced.
• TRANet is developed with two subnetworks. The segmentation network is characterized by a double-branch encoder, which integrates the Transformer module (TM) and the MICM. The discriminator network is designed by using a parallel convolution architecture with different kernel sizes. Two subnetworks are trained under the SSAL framework. TRANet can extract local features and long-range contextual information simultaneously and improve generalization capability with the assistance of unlabeled data.
• Taking building extraction as a case study, experiments on the WHU Building Dataset (WBD) [35], Massachusetts Building Dataset (MBD) [36] and GID [10] are carried out to validate TRANet. DeepLabv2, PSPNet, UNet and TransUNet are used as segmentation networks for a performance comparison under the same SSAL scheme. The results demonstrate that TRANet improves segmentation accuracy compared to other approaches when only a few labeled samples are available.
The remainder of this article is arranged as follows. Section 2 introduces some related works. The design of the proposed approach is detailed in Section 3. The experimental setup and results are illustrated in Section 4. Section 5 discusses ablation experiments and parameter selections. Section 6 summarizes this article.

Semi-Supervised Semantic Segmentation
Many existing methods rely on the SSL scheme to reduce the workload of manual annotation [37,38]. Currently, end-to-end SSL methods can be roughly divided into four categories: (1) Proxy-label methods. Such methods use models trained with labeled data to produce pseudo-labels for unlabeled data; examples include pseudo-label [15] and co-training [16]. Their training depends on experience. (2) Consistency regularization. These approaches, such as the temporal ensembling [17] and mean teacher [18] methods, assume that if noise is applied to samples, the predictions for noisy and non-noisy samples should be as consistent as possible. They require high robustness to perturbations to achieve improved generalization ability. (3) Hybrid methods. These techniques, such as MixMatch [19] and FixMatch [20], integrate the aforementioned two SSL methods into one framework and have complex model structures. (4) SSL methods combined with GANs [21]. Such methods use the discriminator to facilitate the training of the generator, thereby improving the performance of the resulting models.
SSL methods combined with GANs have been widely applied in semantic segmentation tasks and have achieved good performance. Souly et al. [39] used a GAN generator to create pseudosamples and used a discriminator to classify the pixels into different semantic categories. Four datasets were used to verify the developed method. Hung et al. [24] replaced the generator in a GAN framework with the DeepLabv2 model and designed a fully convolutional discriminator. They utilized the confidence maps generated by the discriminator as the supervisory signals for the unlabeled data to improve the segmentation accuracy under adversarial training. Zhang et al. [40] utilized a segmentation network with two self-attention modules to learn the spatial semantic relationship. They simultaneously used a discriminator containing spectral normalization to improve the training performance. Sun et al. [41] designed a segmentation network with a channel-weighted multiscale feature module and a discriminator network integrating a boundary attention module and residual blocks. Their method alleviated the boundary blur of objects and obtained improved segmentation accuracy on remote sensing datasets.

Convolution Neural Network and Variants
FCN-based architectures are used to construct both the segmentation and discriminator networks in the classical semi-supervised adversarial semantic segmentation framework. CNN is a hierarchical data representation method that gradually abstracts features with rich semantic information from shallow to deep. FCNs [3], which are extended on the basis of CNNs, contain encoder-decoder structures and replace the fully connected layers of CNNs with convolution layers for image segmentation. FCNs can automatically obtain precise local features and abstract high-level features via end-to-end training, and they have strong feature representation ability for specific tasks.
Deep learning-based semantic segmentation networks are mostly implemented with FCNs. However, restricted by the receptive fields, the features captured by the convolution layers fail to effectively learn long-range dependency information. To overcome this limitation, multiscale modules, such as the atrous convolution module [7] and spatial pyramid pooling [8], use convolution or pooling operations with different scales to obtain features with different receptive fields, thereby enhancing the feature representation ability of the resulting model. In addition, simply increasing the depths of networks [27], acquiring multiscale image characteristics, and integrating attention modules into FCN architectures can provide effective global context. For instance, Luo et al. [42] utilized two uniform residual networks with five levels in the encoder to process input images and auxiliary feature data. They also added the channel attention mechanism into the decoder for remote sensing image feature selection. Huang et al. [43] used a channel-wise attention mechanism to refine coarse labels of different scales and fused features of different levels via an attention-based module. Their method reduced the feature differences and improved the segmentation accuracy in remote sensing datasets. However, the attention modules are usually placed at the top of the employed convolution architecture, which restricts attention learning to high-level features. Such strategies still cannot prevent the loss of details when the resolutions of feature maps are gradually reduced.

Transformer
The vision Transformer (ViT) [29] was the first work to apply a pure Transformer with self-attention to image classification. ViT divides the input image into a series of image patches for sequence-to-sequence prediction and has achieved state-of-the-art performance on the ImageNet dataset. Context modeling is extremely important for semantic segmentation. The Transformer can capture global contextual information via self-attention, which compensates for the deficiency of convolution operations. Therefore, some scholars have studied combining Transformers with CNNs to improve semantic segmentation accuracy. Zheng et al. [26] proposed a segmentation model with a Transformer-alone encoder, which replaced the stacked convolution layers with a pure Transformer to extract features and combined it with a convolution-based decoder for image segmentation. Chen et al. [44] inserted a Transformer into the top of the encoder in UNet to extract global information and then upsampled the features by a convolution-based decoder to obtain precise segmentation results. However, the aforementioned methods are applied to natural scenes and medical images in a fully supervised training mode. Few studies have used the Transformer to segment high-resolution remote sensing images containing complex objects. Furthermore, few studies have focused on constructing semi-supervised segmentation networks by using Transformers.
The proposed TRANet is mainly characterized by its double-branch encoder segmentation network. The unique MICM enables the network to acquire features of different input scales and maintain spatial information. Furthermore, the long-range modeling advantages of the Transformer compensate for the deficiency regarding the limited receptive fields of convolution operations. Relying on the SSAL framework, TRANet uses the confidence map generated by the unique double-branch discriminator network to guide the training of unlabeled data and further refines the segmentation network, thereby achieving increased image segmentation accuracy.

Algorithm Overview
The semi-supervised adversarial semantic segmentation task is expressed as follows. Given (m + n) images with sizes of H × W × C and corresponding labels as inputs, $x_{lm}$ and $x_{un}$ denote the m labeled images $x_l$ and the n unlabeled images $x_u$, respectively. Generally, $n \gg m$; that is, unlabeled data are far more abundant than labeled data. $y_{lm}$ is the binary label map corresponding to $x_{lm}$, which contains a target value of 1 and a background value of 0. The segmentation network generates prediction maps by training with the labeled and unlabeled data. The discriminator network distinguishes the approximation degree between segmented results and sample labels and optimizes the segmentation model during adversarial training. Figure 2 illustrates the TRANet graphically. The segmentation network comprises a classical encoder-decoder structure, and the discriminator network includes double-branch convolution structures with different kernel sizes. The two networks are combined for image segmentation under the SSAL framework (Figure 1).

Segmentation Network
As shown in part I of Figure 2, the encoder of the segmentation network contains a TM and an MICM. The TM acquires the global contextual features $F_A$ by self-attention. The MICM obtains the spatial information of multiscale input images and extracts local features $F_B$ through convolution and pooling operations. The joint feature $F$ is obtained by Equation (2):

$$F = F_A \oplus F_B \quad (2)$$

where $\oplus$ denotes the feature concatenation operation.

Transformer Module
The TM serializes the input images and captures global contextual information by using self-attention, which maintains the complete object features and alleviates the detail loss while gradually reducing the resolutions of the feature maps. The standard Transformer [33] receives a 1D sequence as input. As displayed in Figure 3, to handle a 2D image [29], we divide the input $X \in \mathbb{R}^{H \times W \times C}$ into a series of image patches $X_p \in \mathbb{R}^{N \times (P \times P \times C)}$ and then flatten them into a sequence, where (H, W) indicates the size of the input images, $N = H \times W / P^2$ indicates the patch number, C indicates the channel number, and P represents the length and width of each patch, which is set as 16 in our study. Each vector patch is mapped to D dimensions with a learnable linear projection, resulting in a patch embedding. Then a 1D position embedding is added to this patch embedding to retain the associated location information, as displayed in Equation (3):

$$z_0 = [X_p^1 E;\ X_p^2 E;\ \cdots;\ X_p^N E] + E_{pos} \quad (3)$$

where $E$ and $E_{pos}$ denote linear projection functions of the patch embedding and position embedding, respectively, and $X_p^N$ denotes the N-th image patch.
Subsequently, the resulting embedding sequences are input into the Transformer layers. Each layer is composed of stacked MSA and MLP blocks. Layer normalization (LN) is used before each block, and residual connections are applied after each block [29]. The hidden feature representations are obtained by Equations (4) and (5):

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \quad (4)$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l \quad (5)$$

where $z_l$ represents the l-th encoder feature. After the L Transformer layers, a hidden feature representation of size $(H \times W / P^2) \times D$ is obtained and reshaped to $(H/P) \times (W/P) \times D$, resulting in the middle feature $F_A$. In this study, D is set to 768, and the TM module contains 12 Transformer layers and 8 heads in each MSA layer. Section 5 analyses and discusses the parameter selection.
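To make the patch serialization step concrete, the following minimal sketch splits an image into non-overlapping P × P patches and flattens each into a vector. It is an illustration only: the function name `patchify`, the nested-list image representation, and the three-channel 256 × 256 input are assumptions, and the learnable projection to D dimensions is omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened P x P x C patches.

    Returns N = (H // P) * (W // P) vectors of length P * P * C,
    ordered row by row over the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vector = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    vector.extend(image[row][col])  # append the C channel values
            patches.append(vector)
    return patches


# Assumed example: a 256 x 256 three-channel input with P = 16.
H, W, C, P = 256, 256, 3, 16
image = [[[0.0] * C for _ in range(W)] for _ in range(H)]
patches = patchify(image, P)
print(len(patches), len(patches[0]))  # N = H*W/P^2 and P*P*C
```

Under these assumptions the sequence has N = 256 patches, each of length 768, before the linear projection is applied.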

Multiscale Input Convolution Module
The MICM consists of four submodules, each of which has the same double-branch architecture (Figure 4). Taking X as an input, the lower branch extracts features $\delta_k$ by using two convolution layers, each of which contains a batch normalization (BN) layer and a rectified linear unit (ReLU) activation function, where $g(\cdot)$ denotes the double convolution operations and $\delta_k$ denotes the convolution feature of the k-th submodule; when k = 1, $\delta_0 = X$. Then, the SMP is employed for feature abstraction and dimensionality reduction.
In this article, SMP is used to replace the max pooling operations of classical networks. Max pooling probes information within square windows, which limits the flexibility in capturing anisotropic context features. Strip pooling [45] resolves this problem well. The given convolution feature $\delta_k$ is fed into a horizontal and vertical strip pooling layer simultaneously, resulting in two 1D features $\delta_k^h \in \mathbb{R}^{C \times H}$ and $\delta_k^v \in \mathbb{R}^{C \times W}$. Subsequently, $\delta_k^h$ and $\delta_k^v$ are converted into feature matrices with sizes of H × W via a 1D convolution. Then, the feature map $\delta_k$ of the SMP structure in the k-th submodule is obtained by Equation (9), where $\mathrm{MP}(\cdot)$ denotes a max pooling, $f_{st}(\cdot)$ denotes a 1 × 1 convolution with a stride size of st, and $\oplus$ represents the feature concatenation operation. The upper branch downsamples the input and reshapes the feature dimensions to make them consistent with $\delta_k$. The resulting feature maps are connected with $\delta_k$, and subsequently a 1 × 1 convolution is utilized to acquire the subfeature $F_k$, where $F_k$ denotes the intermediate feature of the k-th submodule (when k = 1, $F_0 = X$), $d(\cdot)$ denotes the downsampling operation, s =
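The difference between square-window max pooling and strip pooling can be illustrated with a small single-channel example. This is only a sketch: it assumes that strip pooling averages along each axis, as in the original strip pooling formulation [45], and it does not reproduce how the SMP combines the pooled features. The function names are hypothetical.

```python
def strip_pool(feature):
    """Average-pool one H x W channel along each axis.

    Returns (horizontal, vertical): one value per row (length H)
    and one value per column (length W).
    """
    height, width = len(feature), len(feature[0])
    horizontal = [sum(row) / width for row in feature]
    vertical = [sum(feature[r][c] for r in range(height)) / height
                for c in range(width)]
    return horizontal, vertical


def max_pool_2x2(feature):
    """Standard 2 x 2 max pooling with stride 2, shown for comparison."""
    return [
        [max(feature[r][c], feature[r][c + 1],
             feature[r + 1][c], feature[r + 1][c + 1])
         for c in range(0, len(feature[0]) - 1, 2)]
        for r in range(0, len(feature) - 1, 2)
    ]


# A thin horizontal structure occupying one row of a 4 x 4 channel.
feature = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
h, v = strip_pool(feature)
print(h)                      # row-wise response singles out the second row
print(v)                      # column-wise response is uniform
print(max_pool_2x2(feature))  # square windows merge rows 0 and 1
```

In this toy case the horizontal strip response isolates the row that contains the elongated structure, whereas the 2 × 2 window cannot distinguish it from its neighboring row.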

Decoder
The decoder takes the joint feature $F$, which concatenates the outputs of the TM and MICM, as the input for feature restoration (Figure 2). Two convolution layers are used to reshape the feature dimensions to 16 × 16 × 1024. The resulting feature is restored to the same dimension as the input image by Equation (11), where $\gamma_k$ denotes the feature map of the k-th upsampling step (when k = 1, $\gamma_0 = F$) and $\mathrm{TransposeConv}(\cdot)$ denotes the transposed convolution layer. Four skip connections [46] are adopted to combine the convolution features in the MICM with the upsampled feature maps. Such an operation effectively alleviates the loss of features over successive convolution and pooling operations.
Each upsampling step is followed by a double convolution, where $\gamma_k$ denotes the feature map of the k-th double convolution, and the numbers of feature channels are {512, 256, 128, 64}. Finally, the feature maps with 2 channels are acquired via a 1 × 1 convolution, and these maps are fed into the sigmoid layer to obtain the prediction result $R$.

Discriminator Network
An FCN-based discriminator network is designed; it contains a double-branch structure with different convolution kernel sizes. More information about different receptive fields can be obtained by multiscale inputs and convolution kernels with different sizes. The discriminator network receives the segmentation result $R$ or ground-truth maps as input, as shown in part II of Figure 2. Features are extracted from the upper and lower branches (Equations (13) and (14)), where $F_k^U$ and $F_k^D$ denote the features obtained by the k-th convolution in the upper and lower branches, respectively. When k = 1, $F_0^U = R$; $\mathrm{Conv}_{ke}^{st}(\cdot)$ represents a convolution with strides of st and kernel sizes of ke, $\mathrm{LeakyReLU}(\cdot)$ denotes the leaky ReLU activation function, and $d_s(\cdot)$ denotes the downsampling operation with a parameter s = 1/2. The numbers of channels in the resulting four feature maps are {64, 128, 256, 512}. Subsequently, the feature maps generated by the two branches are concatenated and fed into a 1 × 1 convolution and a classification layer. Last, the confidence map is acquired via a sigmoid operation, in which each pixel represents the approximation degree of the pixels in the segmented map with respect to the sample label. This map is utilized as a supervisory signal for unlabeled data.

Loss Function
The segmentation network and discriminator network are trained jointly via labeled samples. When inputting unlabeled samples, the discriminator network generates confidence maps to supervise the training of the segmentation network in a self-taught mechanism. The discriminator network is optimized by minimizing the binary cross-entropy loss $L_D$:

$$L_D = -\sum_{i,j} \left[ (1 - y) \log\left(1 - O_R^{(i,j)}\right) + y \log\left(O_Y^{(i,j)}\right) \right] \quad (15)$$

where $O_R^{(i,j)}$ and $O_Y^{(i,j)}$ represent confidence maps for the prediction maps $R$ and ground-truth labels $Y$, respectively, (i, j) denotes pixel locations, and y represents the label of each pixel.
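The multitask loss in [24] is optimized to train the segmentation network:

$$L_{Seg} = L_{CE} + \lambda_{adv} L_{adv} + \lambda_{semi} L_{semi} \quad (16)$$

where $L_{CE}$, $L_{adv}$ and $L_{semi}$ respectively indicate the cross-entropy loss, adversarial loss, and semi-supervised loss, and $\lambda_{adv}$ and $\lambda_{semi}$ are weights utilized for adjusting $L_{Seg}$. In this study, $\lambda_{adv}$ is set to 0.01 and 0.001 when using labeled and unlabeled samples, respectively, and $\lambda_{semi}$ is equal to 0.1. Taking C as the number of categories, $L_{CE}$ is obtained by Equation (17):

$$L_{CE} = -\sum_{i,j} \sum_{c=1}^{C} Y^{(i,j,c)} \log R^{(i,j,c)} \quad (17)$$

The adversarial loss and semi-supervised loss are shown in Equations (18) and (19), respectively:

$$L_{adv} = -\sum_{i,j} \log O_R^{(i,j)} \quad (18)$$

$$L_{semi} = -\sum_{i,j} \sum_{c=1}^{C} I\left(O^{(i,j)} > \tau\right) Y_u^{c} \log R_u^{(i,j,c)} \quad (19)$$

where $R_u^{(i,j,c)}$ denotes the class c prediction results of the unlabeled data at location (i, j), $Y_u^{c}$ denotes the pseudo-label of the class c of unlabeled data, $O^{(i,j)}$ represents the confidence map, $I(\cdot)$ is the indicator function, and $\tau$ is a threshold value of 0.2.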
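The role of the confidence threshold in the semi-supervised loss can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes, following the scheme in [24], that the pseudo-label of a pixel is the class with the highest predicted probability, and it represents an image as a flat list of pixels. The function names are hypothetical.

```python
import math


def semi_supervised_loss(probs, confidence, tau=0.2):
    """Masked cross-entropy over unlabeled pixels.

    probs:      per-pixel lists of class probabilities from the segmentation network
    confidence: per-pixel discriminator confidences in [0, 1]
    Pixels whose confidence does not exceed tau are ignored. For the others,
    the pseudo-label is assumed to be the argmax class of the prediction.
    """
    total = 0.0
    for pixel_probs, conf in zip(probs, confidence):
        if conf > tau:
            pseudo = max(range(len(pixel_probs)), key=pixel_probs.__getitem__)
            total -= math.log(pixel_probs[pseudo])
    return total


def segmentation_loss(l_ce, l_adv, l_semi, lambda_adv, lambda_semi=0.1):
    """Weighted sum of the three loss terms of the segmentation network."""
    return l_ce + lambda_adv * l_adv + lambda_semi * l_semi


# Three unlabeled pixels, two classes (background, building).
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
confidence = [0.5, 0.1, 0.3]  # the second pixel falls below tau = 0.2
l_semi = semi_supervised_loss(probs, confidence)
print(l_semi)
print(segmentation_loss(l_ce=0.0, l_adv=1.0, l_semi=l_semi, lambda_adv=0.001))
```

Only the first and third pixels contribute to the loss, so low-confidence regions of the prediction do not act as training targets.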

Datasets
Three open-source remote sensing datasets with different spatial resolutions, including the WBD [35], MBD [36] and GID [10], were used for method verification. We clipped all images and labels into 256 × 256 image patches for model training and classification. Some building examples contained in the three datasets are shown in Figure 5. The labels were uniformly processed into binary images with a target value of 1 and a background value of 0.
• WBD: This building dataset consists of 8189 aerial image tiles and contains 187,000 buildings with diverse usages, sizes and colors in Christchurch, New Zealand. The spatial resolution is 0.3 m. After cropping without overlap, 15,256 image patches were selected and randomly split into 14,256 patches for training and 1000 patches for testing.
• MBD: The MBD is a large dataset for building segmentation that consists of 151 aerial images of the Boston area with 1500 × 1500 pixels. The spatial resolution is 1 m. A total of 11,384 image patches containing buildings with 256 × 256 pixels were chosen after cropping. These patches were further randomly divided into 10,384 patches for training and 1000 patches for testing.
• GID: This land-use dataset contains 5 land-use categories and 150 Gaofen-2 satellite images, obtained from more than 60 different cities in China. The spatial resolution is 4 m. We extracted the building class and constructed a dataset containing 13,671 image patches for our experiments, among which 12,175 patches were used for training and 1496 were used for testing.
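Clipping an image into 256 × 256 patches without overlap can be sketched as below. The handling of border pixels and the criteria used to select patches are not described above, so this sketch simply assumes that incomplete border tiles are discarded; the function name `tile_origins` is hypothetical.

```python
def tile_origins(height, width, tile=256):
    """Top-left corners of non-overlapping tile x tile patches.

    Border regions smaller than a full tile are dropped (an assumption).
    """
    return [
        (top, left)
        for top in range(0, height - tile + 1, tile)
        for left in range(0, width - tile + 1, tile)
    ]


# A 1500 x 1500 image yields a 5 x 5 grid of full 256 x 256 patches.
origins = tile_origins(1500, 1500)
print(len(origins), origins[0], origins[-1])
```

Each origin (top, left) identifies the crop covering rows top to top + 255 and columns left to left + 255; the same origins would be applied to the image and its label map.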

Method Implementation
Several well-known semantic segmentation networks, i.e., DeepLabv2 [8], PSPNet [7], UNet [46], and TransUNet [44], with combinations of Transformer and convolution, were used for method comparisons under the SSAL framework. ResNet-101 was used as the backbone for DeepLabv2 and PSPNet. The numbers of Transformer layers and attention heads in TransUNet are set to 12 [44]. To validate the proposed method, we randomly sampled 1/8, 1/4 and 1/2 of images as labeled data and the remainder as unlabeled data. The quantities of labeled data are displayed in Table 1.
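The random sampling of labeled subsets can be sketched as follows. The fixed seed, the function name `split_labeled`, and the use of integer sample indices are assumptions for illustration; the counts printed here follow from the arithmetic alone and are not taken from Table 1.

```python
import random


def split_labeled(sample_ids, labeled_fraction, seed=0):
    """Randomly split samples into a labeled subset and an unlabeled remainder."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_labeled = int(len(ids) * labeled_fraction)
    return ids[:n_labeled], ids[n_labeled:]


# Example with the 14,256 WBD training patches and a 1/8 labeled ratio.
labeled, unlabeled = split_labeled(range(14256), 1 / 8)
print(len(labeled), len(unlabeled))
```

The labels of the samples in `unlabeled` are simply withheld during training, so the two subsets are disjoint and together cover the full training set.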

Method Implementation
Several well-known semantic segmentation networks, i.e., DeepLabv2 [8], PSPNet [7], UNet [46], and TransUNet [44], the last of which combines a Transformer with convolution, were used for method comparisons under the SSAL framework. ResNet-101 was used as the backbone for DeepLabv2 and PSPNet. The numbers of Transformer layers and attention heads in TransUNet were both set to 12 [44]. To validate the proposed method, we randomly sampled 1/8, 1/4 and 1/2 of the training images as labeled data and used the remainder as unlabeled data. The quantities of labeled data are displayed in Table 1.
All models were implemented with Python 3.6 and PyTorch 1.2.0 and run on a 24-GB NVIDIA GeForce RTX 3090 GPU. The segmentation network was optimized using stochastic gradient descent. The initial learning rate was 2.5 × 10⁻⁴ and was decreased via polynomial decay with a power of 0.9. The Adam optimizer [47], with a learning rate of 1 × 10⁻⁴, was used to optimize the discriminator network. All networks were trained for 80,000 iterations with a batch size of 4. Adopting the same strategy used in [24], we started SSL after training for 5000 iterations with labeled samples, to prevent the model from being influenced by the initial noisy masks and predictions.
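The training schedule can be illustrated with a short sketch. The text specifies only "polynomial decay with a power of 0.9"; the formula below, lr = base_lr × (1 − iteration / max_iter)^power, is the commonly used form and is assumed here. The helper names `poly_lr` and `split_labeled`, the random seed, and the exact boundary at which SSL switches on are likewise illustrative assumptions.

```python
import random

def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

def split_labeled(sample_ids, labeled_fraction, seed=0):
    """Randomly pick a fraction of samples as labeled; the rest are unlabeled."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = int(len(ids) * labeled_fraction)
    return ids[:n_labeled], ids[n_labeled:]

BASE_LR = 2.5e-4      # initial SGD learning rate of the segmentation network
MAX_ITER = 80_000     # total training iterations
SSL_START = 5_000     # labeled-only iterations before unlabeled data is used

for it in (0, 5_000, 40_000, 79_999):
    # Assumed boundary: unlabeled data is used from iteration 5000 onward.
    phase = "semi-supervised" if it >= SSL_START else "labeled only"
    print(it, f"{poly_lr(BASE_LR, it, MAX_ITER):.3e}", phase)

# Splitting the 14,256 WBD training patches with a 1/2 labeled ratio.
labeled, unlabeled = split_labeled(range(14_256), 1 / 2)
print(len(labeled), len(unlabeled))  # 7128 7128
```

In a PyTorch training loop, the value returned by `poly_lr` would typically be written into the optimizer's parameter groups at each iteration, while the discriminator's Adam optimizer keeps its fixed learning rate.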

Method Evaluation Measures
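Four assessment indices, precision, recall, F1 and mean intersection over union (mIoU), were utilized to evaluate the different methods. Equation (20) gives the definitions of these metrics:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i + FN_i} \tag{20}$$

where TP indicates the quantity of building pixels correctly categorized, FP indicates the quantity of nonbuilding pixels categorized as buildings, FN indicates the quantity of building pixels incorrectly categorized as nonbuildings, and C is the quantity of categories; in mIoU, the subscript i denotes counts taken for category i. The F1 and mIoU metrics were utilized to comprehensively assess the model performance.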
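As a worked illustration of these measures, the sketch below computes precision, recall, F1 and mIoU from pixel counts using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean, and mIoU as the per-class IoU averaged over C = 2 classes (building and background). The function name and the pixel counts are hypothetical, and the sketch assumes that no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 for the building class and mIoU over C = 2 classes."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_building = tp / (tp + fp + fn)
    # For the background class the roles swap: TN are its true positives,
    # FN its false positives, and FP its false negatives.
    iou_background = tn / (tn + fn + fp)
    miou = (iou_building + iou_background) / 2
    return precision, recall, f1, miou

# Hypothetical pixel counts, not taken from the paper's tables.
p, r, f1, miou = binary_metrics(tp=80, fp=20, fn=20, tn=880)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f} mIoU={miou:.4f}")
```

Because background pixels usually dominate building scenes, the background IoU tends to be high, which is one reason mIoU can exceed F1 for the same prediction.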

Experimental Results and Analysis
All the networks were trained on the WBD, MBD and GID using different quantities of labeled samples under the SSAL framework. The test sets did not participate in the model training and were used only for evaluating and comparing the performance of the methods.

Quantitative Analyses
Tables 2-4 show the building extraction accuracies achieved on the three datasets. In general, increasing the quantity of labeled samples improved the accuracy measures of each approach. The F1 and mIoU measures of the proposed TRANet were the best on all three datasets, and this finding was consistent with the subsequent visualization analysis.
As shown in Table 2, the building extraction accuracies of all methods on the WBD were higher than 90%, except for those of DeepLabv2 and PSPNet. PSPNet performed worst among all the models. When trained with fully labeled data, the four measures yielded by TRANet were 5.51%, 11.27%, 8.53% and 8.82% higher than those of PSPNet. The UNet model performed second best. With only 1/8 of the labeled data, UNet's F1 and mIoU values were 92.88% and 91.82%, respectively, which were 0.5% lower than those of TRANet. The accuracy of TransUNet was slightly lower than that of UNet. The Transformer structure is added only at the top of the TransUNet encoder, which limits the global information it can capture. TRANet, which combines the Transformer and convolution, performed the best on the WBD.
Table 3 lists the accuracy measures produced by the different methods on the MBD. The accuracies of all models were lower than 80%. With 1/8 of the labeled data, the F1 and mIoU measures of TRANet were 72.21% and 74.54%, respectively, which were 5% lower than those obtained using fully labeled data. However, this method still performed the best. TRANet's F1 and mIoU were 0.82–11.83% and 0.74–7.5% higher, respectively, than those of the other methods. The UNet model performed second best. The F1 and mIoU measures of TransUNet were 3.19% and 2.24% lower than those of UNet, respectively, with 1/8 of the labeled data. The performances of DeepLabv2 and PSPNet were poor, and all their F1 and mIoU values were lower than 70%. The DeepLabv2 model performed slightly better than PSPNet.
On the GID, as shown in Table 4, TransUNet, which uses the Transformer structure, achieved better building extraction accuracy than UNet. When trained with 1/8 labeled samples, TransUNet's F1 and mIoU values were 1.63% and 1.44% higher than those of UNet, respectively. DeepLabv2 performed better than PSPNet and UNet. When trained with fully labeled data, DeepLabv2's F1 and mIoU were 1.51% and 1.29% lower than those of TRANet, respectively. TRANet performed the best. When trained with 1/2 labeled data, the four measures of TRANet decreased by only 0.27%, 1.23%, 0.8%, and 0.68% relative to those obtained with fully labeled data, indicating that TRANet achieved accuracy similar to that of fully supervised training.

Qualitative Analyses
The semantic segmentation results obtained when training with 1/8 labeled samples under the SSAL framework were used for visual analysis. Figures 6-8 show representative building regions from the three datasets.
The WBD has high resolution and good image quality. Figure 6c,d show that the results obtained by DeepLabv2 and PSPNet exhibited many missed extractions and falsely extracted areas, and obvious distortions were present on the edges of buildings, especially in subregions 1 and 2. The extraction results of UNet and TransUNet had fewer missed extractions (subregions 1 and 2) and falsely extracted areas (subregion 4). TRANet extracted more complete building surfaces in subregions 2-5, and the details were closer to the reference labels.
The resolution of the MBD is 1 m. Many buildings with small areas are represented by only a few to more than a dozen pixels in the corresponding images, which makes the fine extraction of buildings difficult. As shown in Figure 7c,d, all results …
The GID has good image quality but relatively low resolution. Multiple complex objects, i.e., water bodies, roads, farmland, bare land, etc., are contained in one image. Buildings have irregular edges and are mostly distributed in pieces, so they are easily confused with other types of objects. Such a situation increases the difficulty of building extraction. Overall, all extraction results had missed extractions and falsely extracted areas. The falsely extracted areas in the results obtained by DeepLabv2, PSPNet, UNet and TransUNet were smaller, as shown in Figure 8c-f, but there were more missed extractions in subregions 2, 4 and 5. TRANet extracted more complete buildings than the other models.
Based on the aforementioned quantitative and qualitative analyses, the proposed TRANet performed the best. TRANet uses the Transformer to obtain global contextual information and the MICM to extract local multiscale features simultaneously. The proposed SMP structure is designed to retain horizontal and vertical features, which alleviates the loss of details over continuous convolution operations. All these designs contribute to the improved building extraction accuracy.

Discussion
We performed four groups of ablation experiments to validate the performance of the designed double-branch segmentation network, the MICM, the SMP, and the discriminator network. The double-branch encoder is the core of TRANet, and it was verified by semi-supervised experiments on the WBD, MBD and GID with different amounts of labeled data, to fully illustrate the advantages of combining the Transformer with convolution. For the other three groups, 7128 labeled samples and 7128 unlabeled samples from the WBD were selected for the ablation experiments.

Comparison between Single/Double-Branch Encoder Structures
The encoder of the TRANet segmentation network contains a parallel TM and MICM, and it was verified via module replacement, with the decoder and discriminator network fixed, under the SSAL framework. Table 5 shows that the accuracies were low when the TM was used alone as the encoder: the F1 and mIoU were approximately 8.11–18.96% and 8.44–12.69% lower, respectively, than those obtained by the encoder using the MICM alone. The Transformer focuses on context modeling during the encoding phase and ignores the detailed localization of low-level features, which can hardly be restored by upsampling. Convolution operations, in contrast, can extract rich low-level features. Combining the Transformer with convolution therefore improves the segmentation accuracy. The F1 and mIoU increased by approximately 0.13–19.44% and 0.14–13.09%, respectively, over the results obtained by using a single-branch encoder. TRANet thus exploits the advantages of both the Transformer and convolution to extract robust features, thereby improving semantic segmentation accuracy.

Comparison among Different Pooling Modules
The proposed SMP was verified by module replacement, with the decoder and discriminator network fixed, under the SSAL framework. One set of experiments used a single-branch encoder containing four simple "convolution-pooling" architectures, where the pooling layer was successively replaced by max pooling, strip pooling [45], and the SMP structure. These variants are denoted CNN_MP, CNN_SP, and CNN_SMP. Another set of experiments used a double-branch encoder combining the TM and the aforementioned "convolution-pooling" architectures; these variants are denoted TM+CNN_MP, TM+CNN_SP, and TM+CNN_SMP. The achieved accuracy measures are listed in Table 6. Both the single- and double-branch encoders using the SMP performed better than those using the other pooling structures, demonstrating the effectiveness of the proposed SMP structure.
Table 6. Accuracy assessment of TRANet in terms of building extraction with different pooling modules. The highest accuracy is displayed in bold.

Comparison among Different Multiscale Modules
The MICM was verified by module replacement, with the decoder and discriminator network fixed, under the SSAL framework. One set of experiments used a single-branch encoder containing four simple "convolution-pooling" architectures, to which atrous spatial pyramid pooling (ASPP) [8], selective kernel (SK) [48], and MICM modules were added in turn; these variants are denoted CNN, CNN+ASPP, CNN+SK, and CNN+MICM, respectively. Another set of experiments used the aforementioned double-branch encoder with different multiscale modules; these variants are denoted TM+CNN, TM+CNN+ASPP, TM+CNN+SK, and TM+CNN+MICM. Table 7 shows that the methods using multiscale modules achieved higher accuracy than those without them. Both the single- and double-branch encoders using the MICM performed better than those using the other multiscale modules. The MICM captures multiscale input maps before feature extraction, which reduces the loss of details caused by continuous convolution operations with limited receptive fields.

Comparison among Different Discriminator Networks
The discriminator network in [24] and that proposed in this paper (marked with an asterisk, *), along with five segmentation networks, including DeepLabv2, PSPNet, UNet, TransUNet and TRANet, were utilized for model training under the SSAL framework. Table 8 presents the achieved accuracy measures. With the developed discriminator network, each segmentation network obtained higher segmentation accuracy than with the discriminator in [24]; the strategy was effective for all five segmentation networks. The proposed discriminator network can capture more information with different receptive fields by utilizing multiscale inputs and convolutions with different kernel sizes.

Model Parameter Discussions
Two important parameters in the TM of TRANet, the number of Transformer layers and the number of attention heads, are represented by layer_num and head_num, respectively. We used 7128 labeled samples and 7128 unlabeled samples from the WBD for semi-supervised training with different parameter settings and analyzed the network performance. When the influence of layer_num was analyzed, head_num was fixed to 8, and layer_num was set to {4, 8, 12, 16, 20}. When the influence of head_num was analyzed, layer_num was fixed to 12, and head_num was set to {2, 4, 8, 12, 16}. Tables 9 and 10 show that the highest accuracy was obtained when layer_num was 12 and head_num was 8. Therefore, this set of values was used in all experiments in this study.
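The parameter study varies one factor at a time rather than searching the full 5 × 5 grid. The short sketch below enumerates the configurations implied by the description; the variable names are illustrative, and the count assumes that the shared setting (layer_num = 12, head_num = 8) is trained only once.

```python
LAYER_VALUES = [4, 8, 12, 16, 20]
HEAD_VALUES = [2, 4, 8, 12, 16]

# Sweep over layer_num with head_num fixed to 8.
layer_sweep = [(layer_num, 8) for layer_num in LAYER_VALUES]
# Sweep over head_num with layer_num fixed to 12.
head_sweep = [(12, head_num) for head_num in HEAD_VALUES]

# The two sweeps share the setting (12, 8), so it is counted once.
configs = sorted(set(layer_sweep + head_sweep))
print(len(configs))  # 9
print(len(LAYER_VALUES) * len(HEAD_VALUES))  # 25 for a full grid
```

A one-factor-at-a-time design keeps the training cost low, at the price of not testing interactions between layer_num and head_num.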

Conclusions
In this article, we designed a novel semi-supervised adversarial semantic segmentation network for object extraction from high-resolution remote sensing imagery, which leverages both the local feature extraction advantages of CNNs and the global context modeling abilities of the Transformer. Experimental results on three datasets with different spatial resolutions show that TRANet significantly increases the building extraction accuracies and brings the segmentation results close to those obtained via fully supervised learning when only a small amount of labeled data is available. Future work will further fuse the multilevel features of the Transformer and CNNs to obtain more refined object information, thereby enhancing the performance of the segmentation network, and will apply the network to segmentation tasks involving other objects in high-resolution remote sensing imagery.