Scale-Invariant Multi-Level Context Aggregation Network for Weakly Supervised Building Extraction

Abstract: Weakly supervised semantic segmentation (WSSS) methods, utilizing only image-level annotations, are gaining popularity for automated building extraction due to their advantages in eliminating the need for costly and time-consuming pixel-level labeling. Class activation maps (CAMs) are crucial for weakly supervised methods to generate pseudo-pixel-level labels for training networks in semantic segmentation. However, CAMs only activate the most discriminative regions, leading to inaccurate and incomplete results. To alleviate this, we propose a scale-invariant multi-level context aggregation network to improve the quality of CAMs in terms of fineness and completeness. The proposed method integrates two novel modules into a Siamese network: (a) a self-attentive multi-level context aggregation module that generates and attentively aggregates multi-level CAMs to create fine-structured CAMs and (b) a scale-invariant optimization module that cooperates with mutual learning and coarse-to-fine optimization to improve the completeness of CAMs. Experiments on two open building datasets demonstrate that our method achieves new state-of-the-art building extraction results using only image-level labels, producing more complete and accurate CAMs with IoUs of 0.6339 on the WHU dataset and 0.5887 on the Chicago dataset, respectively.


Introduction
Automatic building extraction from high-resolution images has become an active topic in the field of remote sensing in recent decades. It plays a vital role in a variety of applications, such as urban monitoring [1], population and economic estimation [2], and geospatial database creation and updating [3]. Building extraction aims to classify each pixel as building or non-building, which can be regarded as binary semantic segmentation. However, this task is very challenging due to the difficulty of distinguishing buildings with complex appearances and varying scales in high-resolution images with rich details and intra-class variance characteristics [4].
Convolutional neural networks (CNNs) have gained widespread popularity in various domains in recent years, including computer vision [5], climate change [6], and industrial detection [7,8]. With remarkable success in image segmentation, CNNs have been applied to building extraction from high-resolution remote sensing imagery using networks such as UNet [9], DeeplabV3+ [10], and PSPNet [11]. Researchers have also proposed building-specific approaches based on analyzing the characteristics of buildings, which have shown promising results [12][13][14][15]. However, these approaches have limited applications due to the need for a large number of pixel-level annotations, which are time-consuming and labor-intensive to collect. Instead, weakly supervised semantic segmentation (WSSS) tries to alleviate this issue by utilizing weak supervision, such as image-level labels [16], bounding boxes [17], and scribbles [18]. As image-level labels are more readily available than other forms of supervision, this paper focuses on weakly supervised building extraction using image-level labels.
Image-level labels only indicate the presence or absence of buildings in the image without any localization cue, making it challenging for WSSS to achieve results comparable to fully supervised semantic segmentation. Fortunately, the class activation map (CAM) [16] was proposed to perform object localization using only image-level labels. Most advanced WSSS approaches are based on CAMs and follow a three-stage learning paradigm: (1) using image-level labels to train a classification network to obtain the initial CAMs; (2) refining the initial CAMs to generate pseudo-pixel-level labels with semantic affinity-based methods [19], dense conditional random fields (CRF) [20], or saliency detection methods [21]; and (3) training a semantic segmentation network with these pseudo-labels. As supervised information is only utilized in the first stage, the key to the WSSS method is generating a promising CAM that is accurately activated on entire objects and not the background. To this end, many methods have been extensively investigated in the field of computer vision [22][23][24][25][26] and have shown effectiveness in processing natural scene images. However, these methods may not be suitable for high-resolution remote sensing images, which often contain vast amounts of visual information, significant spatio-spectral variability, and a much wider field of view [27].
In recent years, the remote sensing community has seen a growing interest in utilizing WSSS techniques for building extraction in high-resolution images. Advances in this area have resulted in several notable contributions, such as the SPMF-Net [28], which incorporates a superpixel-pooling mechanism to enhance the CAM and better preserve the shape and boundary information of buildings. The MSG-SR-Net [29] further advances this line of research by integrating a multi-scale generation strategy to improve the fineness of the CAM. Other methods, such as that of Li et al. [30], have utilized conditional random fields (CRF) to optimize both the CAM and the segmentation results. In an effort to learn more building-specific information and encourage the network to perform more accurately, some studies have explored the exploitation of inherent relationships, such as pixel affinity [31] and adversarial information [32], to benefit building representation. Despite the demonstrated effectiveness of these techniques, the quality of the pseudo-labels generated by the CAM remains a critical factor affecting the performance of WSSS for building extraction. As illustrated in Figure 1, the CAMs often only activate the most discriminative regions, making it challenging to generate complete buildings. Furthermore, over-activation and under-activation of the CAMs can result in fuzzy boundaries, presenting opportunities for further improvement in the performance of weakly supervised building extraction. The limitations in building extraction using weakly supervised methods stem from the absence of localization information in image-level labels, creating a supervision gap between fully and weakly supervised methods. To overcome this challenge, it is crucial to incorporate more spatial information into the weakly supervised method, such as by utilizing the outputs of various layers in a neural network. As depicted in Figure 1, our research findings indicate that the CAM generated by lower-level layers possesses more details, and also more noise, than the CAM produced by high-level layers. By effectively combining CAMs from various levels, the accuracy of the CAMs can be significantly improved. Furthermore, our observations suggest that CAMs generated at different image scales do not always align with the scale variations of the buildings. The CAM generated at a coarser scale tends to highlight the complete area of the buildings but lacks detail at the boundaries, whereas the CAM generated at a finer scale exhibits the opposite trend. This discrepancy can serve as supervisory information to enhance the integrity of the building representations in a CAM.
Based on the above observation and analysis, we present a unified network that aims to enhance the quality of a CAM with regard to building representation. This is achieved through two key improvements to CAMs: (1) multi-level context aggregation for fine-structured refinement and (2) utilization of multi-scale supervision for integrity improvement. To achieve the first improvement, we introduce the self-attentive multi-level context aggregation module (SMCAM), which is based on GradCAM++ [23]. This module generates CAMs from multiple levels while suppressing noise in the network and uses a self-attention mechanism to effectively combine these CAMs, resulting in a more refined representation of building structures. For the second improvement, we propose the scale-invariant optimization module (SIOM) to further improve the integrity of CAMs. This module uses CAMs generated on a coarse scale as pixel-level supervision, guiding the network in learning to improve the integrity of the CAMs, thus ensuring that the CAMs are more aligned with the scale variation of buildings. Therefore, by incorporating the two enhancements, the proposed unified network is capable of generating high-quality CAMs that not only preserve fine structures but also ensure the integrity of the building representation. This leads to more reliable pixel-level training samples that are crucial for the performance of subsequent semantic segmentation steps. The main contributions of this study are summarized as follows:
• A self-attentive method that effectively generates and aggregates multi-level CAMs is proposed to produce fine-structured CAMs;
• A scale-invariant optimization method that incorporates multi-scale supervision is proposed to improve the completeness of CAMs;
• A Siamese network that integrates the above two improvements with designed losses is introduced with the aim of narrowing the supervision gap between fully and weakly supervised building extraction.
The organization of this paper is as follows: In Section 2, we give a comprehensive overview of previous studies on building extraction and weakly supervised semantic segmentation methods. The proposed network and its crucial components, SMCAM and SIOM, are thoroughly explained in Section 3. In Section 4, the performance of the proposed network is evaluated and compared with existing methods on two commonly used building datasets. A thorough analysis and discussion of the results are presented in Section 5, followed by the conclusions and future work in Section 6.

Related Works
This section presents a comprehensive overview of the most recent deep learning methods for building extraction from high-resolution remote sensing images. With the rapid advancements in the field, many researchers have proposed new techniques for weakly supervised semantic segmentation and building extraction specifically. Instead of giving a complete assessment of all current approaches in the field, this review only focuses on the related studies.

Building Extraction
Recently, fully convolutional networks (FCNs) have gained significant attention in the remote sensing community for their ability to perform efficient and accurate building extraction from high-resolution images. As a type of convolutional neural network (CNN), FCNs use a combination of a convolutional encoder and decoder to make dense predictions for every pixel in the input image, which makes them a promising solution for building extraction tasks. Studies have shown that FCNs for building extraction outperform traditional methods that rely on hand-crafted features in terms of accuracy and computational efficiency [12,15,[33][34][35][36][37][38][39][40][41][42]. Some of these studies modify existing semantic segmentation networks to adapt them to building extraction, such as Schuegraf and Bittner [12], who combine two parallel U-Net-like [9] FCNs to extract the spatial and spectral features, respectively, and then fuse them to extract buildings, and Yuan et al. [40], who improve PSPNet [11] by adding a designed feature pooling layer to capture both local and global relationships in building extraction. Others have proposed FCN models for building extraction that are specific to building characteristics. Guo et al. [42] propose a novel coarse-to-fine boundary refinement network (CBR-Net) that accurately extracts buildings. Li et al. [39] develop CrossGeoNet with a Siamese network and a cross-geolocation attention module to provide a general building representation across different cities. There are also FCN models that utilize auxiliary data, such as digital surface models [36,38], to improve building extraction results.
The aforementioned methods can produce desirable outcomes; however, they heavily rely on large quantities of pixel-level annotations during training. To mitigate this issue, semi-supervised, unsupervised, and weakly supervised methods have emerged as alternatives for building extraction. Existing semi-supervised building extraction methods can be divided into three categories: self-training, generative adversarial network (GAN)-based, and consistency regularization methods [43]. Self-training models are trained with labeled samples, and then predictions for unlabeled samples are utilized as pseudo-labels for supervised training [4,44]. GAN-based methods mainly use limited annotated data to train the generators, which, in turn, generate synthetic annotations for the unannotated training images [45]. Consistency regularization methods can learn the distribution of unlabeled data by detecting the consistency of the output before and after perturbation [46]. However, these methods still rely on pixel-level labels in essence. Moreover, some researchers have attempted unsupervised building extraction [47]. As expected, the task presents significant challenges, and much progress must still be made before it can be accomplished successfully. More studies are focused on weakly supervised building extraction and are based on annotations that are less supervised than pixel-level labels, such as scribbles [48], image-level labels, and bounding boxes. In this paper, the approach taken is a weakly supervised segmentation method that exploits image-level labels, as described in the next subsection.

Weakly Supervised Semantic Segmentation
Weakly supervised semantic segmentation (WSSS) methods with image-level labels have gained widespread attention in the field of computer vision due to their cost-effectiveness compared to pixel-level labeling. WSSS methods typically involve localizing objects with class activation maps (CAMs), generating pseudo-labels from CAMs, and training a semantic segmentation network. However, native CAMs [16,23] tend to produce inaccurate pseudo-labels as they only highlight the most discriminative regions, which are often incomplete and rough for objects. Hence, as the key to WSSS, many efforts have been made to improve CAMs for more complete object localization, such as iterative erasing [49], random dropping [50], super-pixel pooling [25], and local-to-global transferring [51]. Other studies have focused on refining CAMs using techniques such as multi-task learning [52], region growing [53], pixel affinity [19], and inter-pixel relations [54]. Researchers also believe that the limitations of WSSS methods can be attributed to the supervision gap between the classification and segmentation tasks and have thus proposed methods to introduce auxiliary supervisory information to narrow this gap.
Wang et al. [24] propose SEAM to exploit pixel-level supervision through constraints between various affine changes. Du et al. [22] explore pixel-level supervisory signals with a combination of contrastive learning and consistency regularization. Sub-pixel supervision information is also introduced to compensate for the lack of supervision information [55]. Although these WSSS methods have achieved promising results for handling natural images in the field of computer vision, they may struggle to perform efficiently with remote sensing images as they are not specifically designed for this domain. Generalizing weakly supervised methods to different domains remains a challenge, as has been established by research [27].
Researchers have recently started exploring the potential of weak supervision in building extraction from remote sensing imagery, recognizing the need for methods specifically tailored to building characteristics to improve the accuracy and consistency of building representations in CAMs. Several studies have been conducted to address this challenge and enhance the quality of CAMs. For example, Fu et al. [56] develop WSF-Net for binary segmentation in remote sensing images and address class imbalance through a balanced binary training approach. MSG-SR-Net [29] further improves CAM fineness by integrating a multi-scale generation strategy. Meanwhile, methods such as conditional random fields by Li et al. [30] and superpixel pooling by Chen et al. [28] have been utilized to explore the spatial context and enhance building representation, respectively. The use of semantic affinity [32] and pixel affinity [31] has also been shown to improve the quality of CAMs. In contrast to these methods, we propose a novel WSSS method, based on the observations of CAMs (Figure 1), that addresses two crucial aspects: multi-scale and multi-level. By generating and aggregating CAMs at multiple levels and training the images with multi-scale inputs, our method generates auxiliary pixel-level supervised information. This interplay of multi-scale and multi-level results in higher-quality CAMs compared to existing methods. It is noteworthy that MSG-SR-Net also leverages multi-level CAM fusion; however, our approach incorporates a self-attentive mechanism that enables CAMs to automatically calculate their importance during fusion, thus preserving valuable information and suppressing noise.

Scale-Invariant Multi-Level Context Aggregation Network
This section provides a description of the methodology and design of the network. Firstly, we define the problem context and describe the process of generating a CAM. Secondly, the overall architecture of the proposed network is presented. Thirdly, we delve into the two modules that are utilized to improve the quality of the CAMs with respect to building characteristics. Finally, we discuss the loss functions that are used in the proposed network.

Prerequisites
The problem of weakly supervised building extraction can be defined as follows: Given a set of images and corresponding image-level labels, the goal is to learn a model that can predict a pixel-level segmentation mask for the buildings in the image. Specifically, each training image, represented as I_D ∈ R^(W×H×3) in dataset D, is associated with a binary image-level label y ∈ {0, 1}, where 1 indicates the presence of buildings in image I and 0 indicates their absence. Since the image-level labels do not provide any localization information, current methods typically follow a two-stage approach to tackle this task, i.e., first training a classification network to identify regions in the image that correspond to buildings and then using these regions to generate pseudo-labels for training a semantic segmentation network for building extraction. The majority of existing WSSS approaches rely on CAMs to produce the localization map. The training phase of the task can be formulated as follows:

M = F_CAM(I),  Ŷ = F_pseudo(M),  Y_pred = F_SEG(I, Ŷ)

Here, M represents the CAMs for the image I generated by the classification network F_CAM, which is trained on the images I_D and corresponding image-level labels y_D in dataset D. F_SEG is an FCN for semantic segmentation, and F_pseudo is used to generate the required pseudo-pixel-level labels Ŷ. The final score map of the buildings is denoted by Y_pred. Accurate and complete CAMs are crucial for the success of semantic segmentation, as they significantly influence the quality of the pseudo-labels. Therefore, this paper focuses on improving the generation of CAMs for building extraction.
The proposed network is based on GradCAM++ [23] to improve the quality of CAMs. We chose GradCAM++ due to its two key features. Firstly, it can be loosely considered as a generalized version of the original CAM with an improved ability to localize features. Secondly, it does not require any modifications to the network, preserving the original classification network's learning capabilities. GradCAM++ generates a visual explanation for the specified class label c by using a weighted combination of the positive partial derivatives of the last convolutional layer's feature maps with respect to the class score, which is calculated as:

M^c = relu( Σ_k α^c_k A^k )

where A^k denotes the k-th channel of the output feature map generated by the last convolutional layer. The rectified linear unit, relu(), filters out the features with negative values, where relu(x) = max(0, x). α^c_k denotes the weight of class c corresponding to the k-th channel, which is calculated as follows:

α^c_k = Σ_{i,j} [ (∂²Y^c / (∂A^k_{ij})²) / ( 2 ∂²Y^c / (∂A^k_{ij})² + Σ_{a,b} A^k_{ab} ∂³Y^c / (∂A^k_{ij})³ ) ] · relu( ∂Y^c / ∂A^k_{ij} )

where ∂Y^c/∂A^k_{ij}, ∂²Y^c/(∂A^k_{ij})², and ∂³Y^c/(∂A^k_{ij})³ represent the first-order, second-order, and third-order gradients of the prediction score Y^c, respectively. For computational convenience, it is common to set Y^c = exp(f^c); here, f^c is the output score of the classification network.
It should be noted that the CAMs generated by GradCAM++ have the same size as feature map A and are not normalized. To make them more suitable for subsequent processing, we resample them to the size of the input image and normalize them using the following formula:

M̂ = ( M − MIN(M) ) / ( MAX(M) − MIN(M) )

where MAX() represents the maximum value, and MIN() represents the minimum value.
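To make the generation step concrete, the following NumPy sketch (an illustration, not the authors' implementation; the array shapes and the 1e-8 stabilizers are our assumptions) computes a GradCAM++ map from one layer's activations and gradients and then applies the min-max normalization described above:

```python
import numpy as np

def gradcampp_cam(acts, grads):
    """GradCAM++ map from one layer's activations A^k and gradients
    dY^c/dA^k, both of shape (K, H, W)."""
    g2, g3 = grads ** 2, grads ** 3
    # per-location alpha weights; the spatial sum of A^k scales the 3rd-order term
    denom = 2.0 * g2 + acts.sum(axis=(1, 2), keepdims=True) * g3
    alpha = g2 / np.where(np.abs(denom) > 1e-8, denom, 1e-8)
    # channel weights alpha^c_k: alpha-weighted positive gradients, summed spatially
    w = (alpha * np.maximum(grads, 0.0)).sum(axis=(1, 2))
    # relu of the weighted channel combination gives the class activation map
    return np.maximum((w[:, None, None] * acts).sum(axis=0), 0.0)

def normalize_cam(cam):
    """Min-max normalization of a (resampled) CAM to [0, 1]."""
    lo, hi = cam.min(), cam.max()
    return (cam - lo) / (hi - lo + 1e-8)
```

In a real network, `grads` would come from backpropagating the score Y^c = exp(f^c) to the chosen layer; here they are treated as given arrays.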

Overall Network Architecture
We propose a network for the task of weakly supervised building extraction, as depicted in Figure 2, which leverages multi-level features and multi-scale information to improve the quality of CAMs. It consists of three main components: a Siamese classification network, the self-attentive multi-level context aggregation module (SMCAM), and the scale-invariant optimization module (SIOM). The Siamese network utilizes shared weights to classify two input images with different scales, using the ground-truth image-level labels as the target. Both images are processed simultaneously using the same network architecture, which can be designed based on well-known networks, such as ResNet [57] and VGG [58]. The main purpose of the Siamese network is to generate multi-level features at different scales through training with an image-level labeled dataset. The other two modules are designed to utilize these features to guide the improvement of the CAM. Specifically, SMCAM generates multi-level CAMs based on GradCAM++ and fuses these CAMs through a self-attention mechanism. SIOM further exploits these features to improve the integrity of CAMs through mutual learning between different scale supervisions. Both modules are described in more detail in the subsequent subsections. It is noteworthy that the proposed network takes two images of different scales as input during the training phase in order to provide multi-scale supervision. During the inference phase, the network only requires a single input image, and the CAM is generated through sequential optimization by SMCAM and SIOM.
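The two-scale, shared-weight forward pass can be sketched as follows (a hypothetical illustration: the `classify` callback, the naive stride-2 downscaling, and the parameter handling are our assumptions, standing in for a real backbone such as ResNet or VGG):

```python
import numpy as np

def siamese_forward(image, classify, params):
    """Shared-weight Siamese pass: the original image and a rescaled copy
    are processed by the same classifier with the same `params`.
    `classify(image, params)` returns (features, image-level score)."""
    rescaled = image[::2, ::2]                     # stand-in for bilinear downscaling
    feats, score = classify(image, params)         # full-scale branch
    feats_s, score_s = classify(rescaled, params)  # coarse branch, shared weights
    return (feats, score), (feats_s, score_s)
```

Because both calls use the same `params`, any gradient signal from either scale updates a single set of weights, which is what enables the mutual learning described for SIOM.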

Self-Attentive Multi-Level Context Aggregation Module
The proposed module, SMCAM, is designed to address two major problems in building extraction tasks: the generation of multi-level CAMs and the aggregation of these CAMs. The main aim of SMCAM is to utilize GradCAM++ to generate CAMs at different levels of a deep neural network and to aggregate these CAMs to produce more detailed CAMs. To achieve the first goal, we utilize GradCAM++ to backpropagate gradients from score maps to any node of the network (such as the red dot on each ResNet block in Figure 3). However, it is observed that CAMs generated from low-level features may contain a significant amount of noise, making them challenging to use directly. This noise is mainly due to two factors: disturbance introduced during gradient backpropagation due to the long path and the presence of noise in the low-level features themselves. To tackle these issues, SMCAM introduces auxiliary classification branches at each node to improve the semantic depth of low-level features and shorten the gradient backpropagation path. These branches consist of a 1 × 1 convolutional layer with 1024 output channels, followed by an average pooling layer, which ultimately outputs an image-level score. The auxiliary classification branches enhance the semantic information in low-level features, making the generated CAMs more useful for the building extraction task.
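An auxiliary classification branch of this kind can be sketched as below (a NumPy illustration under our assumptions; the paper uses 1024 output channels, while the shapes here are arbitrary):

```python
import numpy as np

def aux_classification_branch(feat, w, b):
    """Auxiliary head at a network node: a 1x1 convolution followed by
    global average pooling to an image-level score.
    feat: (C, H, W); w: (C_out, C); b: (C_out,)."""
    # a 1x1 convolution is a per-pixel linear map over the channel axis
    conv = np.tensordot(w, feat, axes=([1], [0])) + b[:, None, None]
    score = conv.mean(axis=(1, 2))  # global average pooling -> image-level score
    return conv, score
```

The `conv` output is the feature map from which the branch's GradCAM++ map is generated, while `score` feeds the branch's classification loss.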
To achieve the second goal of aggregating multi-level CAMs, our proposed module incorporates an attention-based mechanism. Unlike traditional fusion methods, such as averaging and concatenation, which ignore the relative importance of features at different levels, our model aims to learn the varying contribution of each level for every pixel in the CAMs. Specifically, the process starts by passing the feature maps generated by the average pooling layer of each auxiliary branch through a channel average pooling layer, generating a score map for each level, l ∈ {1, ..., L}, where each score map has a single channel. The score maps are then upsampled to match the size of the CAMs. Mathematically, let H_l denote the weight score generated at level l. A softmax function is applied across the levels to compute the specific weights, w_l, of the CAMs for each level:

w_l = exp(H_l) / Σ_{l'=1}^{L} exp(H_{l'})

Finally, the CAMs are aggregated into a single map, M, through a weighted sum over all levels:

M = Σ_{l=1}^{L} w_l ⊙ M_l

where M_l represents the CAM generated by GradCAM++ at the auxiliary branch of level l, ⊙ denotes element-wise multiplication, and the weights are determined by the softmax function applied to the weight scores generated at each level.
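The softmax-weighted fusion can be sketched as follows (a NumPy illustration; the stacked-array layout with levels on the first axis is our assumption):

```python
import numpy as np

def aggregate_cams(cams, weight_maps):
    """Attentive fusion: a per-pixel softmax across the L level weight maps
    H_l yields w_l, and the fused map is M = sum_l w_l * M_l.
    cams, weight_maps: (L, H, W), already resampled to a common size."""
    # numerically stable softmax across the level axis
    e = np.exp(weight_maps - weight_maps.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)  # weights sum to 1 at every pixel
    return (w * cams).sum(axis=0)
```

With equal weight maps this reduces to plain averaging; the learned weight maps let each pixel favor whichever level is most reliable there.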

Scale-Invariant Optimization Module for Improving Integrity of CAMs
Although the CAMs generated by SMCAM utilize multi-level features, there still exists a drawback in that the generated CAMs may not cover the entire building object, as they often activate only the most discriminative regions without adequate pixel-level supervision. One of our key observations is that the CAMs generated by the classification network for different scales of the input image activate different regions of building objects, providing valuable additional supervisory information. Therefore, we propose the scale-invariant optimization module (SIOM) to leverage this multi-scale information. As illustrated in Figure 2, SIOM consists of two crucial parts: a mutual learning mechanism for CAMs and hierarchical feature optimization. For the former, the mutual learning mechanism benefits from the architecture of the Siamese network, which consists of two subnetworks with shared parameters that process two different scale images simultaneously and, with SMCAM, generate a series of CAMs at multiple levels under the two scales. By motivating the CAMs of different scales to be similar at multiple levels, the network can learn scale-invariant representations that better capture complete building objects. Therefore, we propose the multi-level invariant constraint loss L_MIC by extending the equivariant constraint to multiple levels, as presented in the next subsection.
For the latter, although the multi-level invariant constraint loss can provide additional pixel-level supervision at multiple levels, the CAM generation is still limited to the framework of the classification network. To alleviate this, we further propose a separate learnable branch, hierarchical feature optimization, for enhancing the completeness of a CAM. In particular, it optimizes a CAM in a progressive manner, leveraging the image and the multi-level features generated by the classification network through a three-stage process, as shown in Figure 4. The first two stages utilize self-attention mechanisms to uncover non-local relationships [59] within the multi-level features of the classification network, resulting in improved object integrity. Each stage consists of two convolutional layers with learnable parameters. The optimization process can be mathematically expressed as follows:

M̂ = softmax( (W_1 x)^T (W_2 x) ) M

where M̂ denotes the optimized CAM; W_1 and W_2 denote the parameters of the first and second convolutional layers, respectively; and the spatial dimensions are flattened so that the attention map relates every pair of pixels. The utilized features are indicated by x: for stage 1, x = [x_4, Down(x_2)], and for stage 2, x = [x_3, Down(x_1)], where [] and Down() denote concatenation and downsampling, respectively. This self-attention mechanism essentially uses the relationships between all of the pixels in the feature maps to refine the CAM and has been shown to be particularly effective at mining global relationships. However, as the size of the CAM increases, the size of the attention map (W_1 x)^T (W_2 x) grows quadratically with the number of pixels, leading to memory explosions on the graphics card during training; hence, we only use this mechanism in the first two stages. In the third stage, we fully exploit the local information in the image, which is used to enhance the details of the CAM. We use a fixed-parameter pixel-adaptive convolution [60], and the optimization process can be represented by the following equation:

M̂_i = Σ_{j∈N(i)} D_{i,j} M_j

where M_i denotes the value at pixel i in the CAM; N(i) denotes the set of neighboring pixels of pixel i; and D denotes the affinity map, where D_{i,j} denotes the similarity relationship between pixel i and pixel j, as calculated by the following equation:

D_{i,j} = softmax_{j∈N(i)}( k(I_i, I_j) )

where I_i denotes the spectral vector at image pixel i; k is the kernel function, for which the Gaussian function is used, i.e., k(I_i, I_j) = exp(−(1/2)(I_i − I_j)^T (I_i − I_j)); and the softmax function is used to normalize the affinities over the neighborhood. It is important to note that this operation is distinct from the Hadamard product, as indicated by the notation in Figure 4. The parameters of SIOM are learned by comparing optimized and unoptimized CAMs at the two scales. The gradient propagation path of this module is detached from the classification network to prevent interference with learning the classification representation. By hierarchically optimizing non-local and local information, the problems of building edges and integrity can be solved to a large extent, effectively improving the quality of the CAM for building extraction.
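The third-stage pixel-adaptive refinement can be sketched as below (a NumPy illustration; the window radius and the reading of the softmax normalization as dividing the Gaussian affinities by their sum over the neighborhood are our assumptions):

```python
import numpy as np

def pixel_adaptive_refine(cam, img, radius=1):
    """Each CAM value becomes an affinity-weighted average of its neighbors,
    with a Gaussian kernel on the image spectral vectors.
    cam: (H, W); img: (H, W, C)."""
    h, w = cam.shape
    out = np.zeros_like(cam)
    for i in range(h):
        for j in range(w):
            affs, vals = [], []
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        d = img[i, j] - img[ni, nj]
                        affs.append(np.exp(-0.5 * float(d @ d)))  # k(I_i, I_j)
                        vals.append(cam[ni, nj])
            affs = np.array(affs) / sum(affs)  # normalized affinities D_{i,j}
            out[i, j] = float(affs @ np.array(vals))
    return out
```

Pixels whose spectra resemble their neighbors' are smoothed toward them, while sharp image edges (large spectral differences) suppress the affinity and preserve CAM boundaries.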
It should be noted that the basic architecture of SIOM is inspired by SEAM [24] but has three differences. Firstly, our method is based on GradCAM++, which is more generalized than the original CAM applied in SEAM. Secondly, we propose hierarchical feature optimization, which can utilize multi-level features and the image to optimize the CAM with regard to building extraction. Thirdly, our loss function is specifically designed to address the objective of building extraction. Therefore, the proposed SIOM is more appropriate for weakly supervised building extraction than SEAM.

The Design of Loss Function
Since only image-level labels are available, the design of the loss function is exceptionally important in order to introduce more supervised information, especially pixel-level supervision. In summary, the total loss of the proposed network is defined as follows:

L = L_CLS + L_MIC + L_ECR

where the classification loss is denoted as L_CLS, which exploits image-level supervision, and L_MIC is the multi-level invariant constraint loss in SIOM. The equivariant cross regularization loss L_ECR is adopted to train the branch of hierarchical feature optimization. The binary cross-entropy loss is utilized as the classification loss in the proposed network and is defined as follows:

L_cls = −(1/N) Σ_{n=1}^{N} [ y_n log(y_pred^n) + (1 − y_n) log(1 − y_pred^n) ]

where N is the batch size, y_pred^n denotes the output of the classification network for the n-th image, and y_n is the corresponding image-level label. The network is a Siamese network and can generate two scores for the original image I and the rescaled image Ĩ, respectively. In addition, the proposed SMCAM module has multiple output branches for the various levels. The classification loss of the network is therefore made up of several sub-losses:

L_CLS = L_cls + L̃_cls + Σ_{l=1}^{L} ( L_cls^l + L̃_cls^l )

where L_cls and L̃_cls are the losses calculated for the two branches of the Siamese network, while SMCAM at level l has its own losses, represented by L_cls^l and L̃_cls^l. The multi-level invariant constraint (MIC) loss is built with a multi-level equivariant metric based on the L1 loss. The aim is to enhance the CAMs generated by the different branches of the Siamese network through multiple levels of features, leading to scale-invariant features that result in more complete building representations in the CAMs. The loss is calculated as:

L_MIC = Σ_{l=1}^{L} ‖ M_l − M̃_l ‖_1

where M_l and M̃_l are the level-l CAMs of the two branches, resampled to a common size.
For the equivariant cross regularization loss, we refer to SEAM [24]; it can be represented as:

L_ECR = ‖ M̂ − M̃ ‖_1

where M̂ is the CAM optimized by hierarchical feature optimization and M̃ denotes the CAM generated by SMCAM from the rescaled image, resampled to a common size. This loss can further improve the quality of the CAMs and prevent degeneration.
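The loss composition can be sketched as follows (a NumPy illustration; the unit weighting of the three terms and the per-level accumulation in the MIC term are our assumptions, and scale alignment by resampling is omitted):

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy classification loss L_cls over a batch."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def mic_loss(cams, cams_rescaled):
    """Multi-level invariant constraint: L1 distance between the CAMs of
    the two branches, accumulated over levels (inputs pre-aligned in size)."""
    return float(sum(np.abs(a - b).mean() for a, b in zip(cams, cams_rescaled)))

def total_loss(l_cls, l_mic, l_ecr):
    """Total objective L = L_CLS + L_MIC + L_ECR (unit weights assumed)."""
    return l_cls + l_mic + l_ecr
```

Identical CAMs across the two scales drive `mic_loss` to zero, which is exactly the scale-invariance the constraint rewards.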

Experiments

Experimental Datasets
The proposed method is evaluated using two publicly available high-resolution remote sensing datasets, namely the WHU Building Dataset [34] and the Inria Aerial Image Labeling Dataset [61]. The WHU Building Dataset (abbreviated as the "WHU dataset") consists of 8189 tiles of 512 × 512 pixels, covering approximately 450 km² in the Christchurch area. Each image has three channels (RGB) and a spatial resolution of 0.3 m. These images, with their corresponding ground-truth labels, are split into training, validation, and test sets of 4736, 1036, and 2416 patches, respectively. The Inria Aerial Image Labeling Dataset includes images and building labels for 10 cities; a subset covering Chicago (abbreviated as the "Chicago dataset") is selected for the experiment. It contains 36 color image tiles of 5000 × 5000 pixels with a 0.3 m spatial resolution. All of these images have corresponding building annotations and are divided into a training set of 24 images, a test set of 8 images, and a validation set of 4 images in this experiment.
Both datasets provide pixel-level labels and are primarily used to test and evaluate fully supervised building extraction methods. The proposed weakly supervised method uses only image-level labels to train the model; the pixel-level labels are therefore used to generate image-level annotations for evaluating the method. Following the preprocessing in previous works [29] and [32], the images are cropped into patches of 256 × 256 pixels with a stride of 128 pixels. The image-level label of each patch is determined by the percentage of building pixels in its pixel-level annotation. Specifically, patches with more than 22 percent building pixels (about 2/9) are labeled as building, while patches with a percentage of 0, i.e., without any building pixels, are labeled as non-building. The remaining patches, with a percentage between 0 and 22, are simply ignored. After processing, the WHU dataset contains 27,879 image patches for training (14,316 building patches and 13,563 non-building patches) and 18,364 patches for testing. The Chicago dataset contains 24,736 patches for training (12,793 building patches and 12,943 non-building patches) and 11,020 patches for testing. Some processed examples from both datasets are shown in Figure 5. To gain a better understanding of the two datasets, we utilize t-SNE [62] to visualize their feature distributions. The features are extracted from the last-layer output of a residual neural network [57]. Despite having the same resolution and purpose, the t-SNE visualization (Figure 5b) reveals that the feature distributions of the two datasets are notably distinct.
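The patch extraction and labeling rule above can be sketched as follows. This is an illustrative implementation (the function names are ours), assuming binary NumPy masks:

```python
import numpy as np

def image_level_label(mask, building_thresh=0.22):
    """Derive an image-level label for a patch from its pixel mask,
    following the paper's rule: more than 22% building pixels -> building
    (1), exactly 0% -> non-building (0), anything in between -> ignored
    (None)."""
    frac = mask.astype(bool).mean()
    if frac > building_thresh:
        return 1
    if frac == 0.0:
        return 0
    return None  # ambiguous patch, excluded from training

def crop_patches(image, size=256, stride=128):
    """Crop an H x W x C image into size x size patches using the
    sliding-window tiling described in the preprocessing."""
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]
```

For a 512 × 512 tile, this tiling yields a 3 × 3 grid of overlapping 256 × 256 patches.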

Experimental Setup

Methods for Comparison
In the experiments, seven WSSS methods are compared with the proposed method on the selected datasets. The main information about these methods, including ours, is summarized as follows: • CAM-GAP: Zhou et al. [16] Since most of these methods, like ours, focus on improving the quality of a CAM for target extraction, we mainly compare the completeness and fineness of buildings in the CAMs.

Evaluation Criteria
In the evaluation of our proposed method, we employ three commonly used quantitative criteria: the F1 score, overall accuracy (OA), and intersection over union (IoU). The F1 score is computed as the harmonic mean of precision and recall:

F1 = 2 × precision × recall / (precision + recall),

where precision and recall are defined as

precision = TP / (TP + FP), recall = TP / (TP + FN),

where TP and TN, respectively, denote true positives and true negatives, while FP denotes false positives and FN false negatives.
The OA is calculated as the ratio of correctly classified instances to the total number of instances:

OA = (TP + TN) / (TP + TN + FP + FN).

The IoU is the area of intersection between the predicted and ground-truth segmentation divided by the area of their union:

IoU = TP / (TP + FP + FN).

The above three metrics are used to quantitatively evaluate the results of the proposed methods, both for CAMs and for building extraction. Additionally, visual comparisons are also utilized to evaluate the results.
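The three criteria can be computed directly from the confusion-matrix counts; the following sketch mirrors the definitions above (the helper name is ours):

```python
def segmentation_metrics(tp, fp, tn, fn):
    """F1 score, overall accuracy (OA), and IoU for binary segmentation,
    computed from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, oa, iou
```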

Implementation Details
The proposed method employs ResNet50 [57] as the backbone of the Siamese network, and pre-trained ImageNet weights (provided by PyTorch) are utilized for backbone initialization. The generation of multi-scale CAMs is depicted in Figure 3 and derived from the blocks of ResNet50. Training is carried out using stochastic gradient descent (SGD) with momentum, with a momentum coefficient of 0.9 and a weight decay of 0.0005 [11,29,31]. In accordance with the approach presented in [11], a poly-like learning rate rule is employed, where the learning rate is defined as lr = base × (1 − t/T)^power, with the base set to 0.01, the power to 0.9, t denoting the current iteration, and T the maximum number of iterations. The network is trained with a batch size of 8 over a total of 30 epochs using 2 GPUs. In the initial 5 epochs, only the classification loss (L_CLS) is employed, while the total loss is utilized in the subsequent epochs. Following SEAM [24], the proposed method applies online hard example mining (OHEM) to the ECR loss (L_ECR), retaining the top 20% of pixel losses. Random data augmentation, including mirroring, rotation, Gaussian blurring, and color jittering, is applied during training, and the Siamese structure applies the matched augmentation to the rescaled image. During inference, the shared-weight Siamese network only requires one branch to be restored.
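The poly learning-rate rule and the OHEM step can be sketched as follows. This is an illustrative reimplementation (the names are ours), with OHEM shown on a flat list of per-pixel losses rather than the actual ECR tensors:

```python
def poly_lr(base_lr, t, T, power=0.9):
    """Poly learning-rate rule: lr = base_lr * (1 - t / T) ** power."""
    return base_lr * (1.0 - t / T) ** power

def ohem_mean(pixel_losses, keep_ratio=0.2):
    """Online hard example mining: average only the top `keep_ratio`
    fraction of per-pixel losses, discarding the easy (small) ones."""
    flat = sorted(pixel_losses, reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    return sum(flat[:k]) / k
```

With base 0.01 and power 0.9, the learning rate decays from 0.01 at the first iteration toward 0 at the last.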
To ensure a fair comparison, all comparison methods are reimplemented with the ResNet50 [57] backbone, with the exception of SPN, whose main modules are difficult to separate from its classification network. Both SPN and MSG-SR-Net employ the simple linear iterative clustering algorithm [63] to pre-segment the images into approximately 256 superpixels per image. Furthermore, DeepLabV3+ (based on ResNet50) is utilized as the semantic segmentation network F_SEG for all methods. To train DeepLabV3+, we employ cross-entropy as the loss function along with the same SGD optimizer and parameters as mentioned previously, except that the initial learning rate is set to 0.007 [10]. The training batch size is set to 8, and a total of 20 epochs are trained. For data augmentation, we only apply random mirroring and rotation. All experimental settings are kept as consistent as possible across methods, though some hyperparameters (e.g., learning rate and momentum) specific to each model may be adjusted to enable efficient convergence during training. It should be noted that the background thresholds of the CAMs also differ between approaches; for each method, the threshold yielding the best F1 score of the pseudo-labels is chosen after traversing all candidate thresholds on the validation set.
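The per-method background-threshold selection can be sketched as a simple sweep that maximizes pseudo-label F1 on a validation set; the helper below is a hypothetical sketch, not the paper's code:

```python
import numpy as np

def best_background_threshold(cams, gt_masks, thresholds):
    """Return the CAM background threshold (and its score) that maximizes
    the mean pseudo-label F1 over a validation set of CAMs and masks."""
    def f1(pred, gt):
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    best_t, best_score = None, -1.0
    for t in thresholds:
        score = np.mean([f1(cam > t, gt.astype(bool))
                         for cam, gt in zip(cams, gt_masks)])
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```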
The proposed method and all comparison methods are implemented on a Linux 5.15 platform, using Python 3.9 and PyTorch for deep learning. CUDA 11.6 is utilized for GPU acceleration. The experiments are run on a machine equipped with two Intel Xeon 8-core CPUs @ 2.2 GHz and two NVIDIA RTX 4000 GPUs with 8 GB memory and 2304 shading units.

Results on the Chicago Dataset
Regarding the quantitative results, Table 1 reports the three evaluation criteria for the CAM results in terms of building extraction accuracy. The proposed method yields the highest F1, IoU, and OA of all compared methods, achieving the best pseudo-label accuracy. In particular, CAM-GAP and GradCAM++ obtain poor accuracy because they only activate the most discriminative regions, while WILDCAT achieves the worst results, which suggests that this method is not suitable for building extraction in the remote sensing field. Ours-SMCAM still achieves remarkable results, with an F1 of 0.6774 and an OA of 0.8081, only slightly behind MSG-SR-Net, which benefits from the ability of superpixels to exploit low-level features. This also illustrates the effectiveness of the attentive aggregation of multi-scale CAMs, since we do not employ any superpixel-like technique. As for ours-SIOM, it achieves slightly better results than SEAM; both of these methods utilize Siamese networks. Overall, the results indicate the effectiveness of our method, which incorporates SMCAM and SIOM for optimal performance. Five image patches containing buildings of different sizes and shapes are selected for visual comparison, as shown in Figure 6. It can be observed that WILDCAT performs the worst in identifying buildings of various sizes, making it difficult to separate the buildings from the background. As inferred above, CAM-GAP and GradCAM++ only activate the most discriminative parts of buildings (e.g., edges and texture); the difference is that the former is relatively complete and the latter relatively fine. SEAM achieves promising results in terms of building completeness but still has some over-activation and under-activation issues. The CAMs generated by WSF-Net are relatively coarse and not well suited to building extraction, as this method is designed for water and cloud extraction in remote sensing images. Both SPN and MSG-SR-Net benefit from using superpixels for more detailed information but are also limited by the inaccurate activation of non-building regions. In addition, the results of MSG-SR-Net are notably better than those of SPN for building extraction. The CAMs obtained by the proposed methods are better than those obtained by the other methods. Specifically, ours-SIOM obtains relatively complete buildings in its CAMs, while ours-SMCAM preserves more detailed information. The proposed method combines the advantages of both modules to obtain the best overall results.

Results on the WHU Dataset
Table 1 also reports the F1 score, IoU, and OA of the different methods on the WHU dataset. It can be observed that: (1) WILDCAT cannot successfully generate a CAM for buildings. Beyond that, CAM-GAP, GradCAM++, and WILDCAT have the worst results, while MSG-SR-Net, designed for remote sensing images, achieves remarkable results, as on the Chicago dataset. (2) The proposed method outperforms all other methods, with a larger lead than on the Chicago dataset. (3) Ours-SMCAM also shows competitive performance against the other methods, which reveals the advantages of the proposed module. (4) Ours-SIOM performs better than SEAM, with a higher F1 score and IoU and an improvement of over one percent in OA, demonstrating its effectiveness in distinguishing buildings from the background.
In order to visually compare the performance of the different methods, five image patches featuring buildings of various sizes and shapes are selected, as shown in Figure 7. Notably, the quality of the images and labels in the WHU dataset is better than in the Chicago dataset: the WHU images are well calibrated and contain fewer building shadows. Since buildings and shadows often accompany each other, almost all methods struggle to distinguish them from one another due to the lack of relevant supervision information. Therefore, both qualitatively and quantitatively, the results for the WHU dataset are superior to those for the Chicago dataset. Additionally, the more accurate pixel-level labels produce better image-level labels. Similar to the results for the Chicago dataset, WILDCAT, CAM-GAP, and GradCAM++ still struggle to accurately identify buildings. Specifically, WILDCAT over-activates too many regions, which overwhelms the buildings, while CAM-GAP and GradCAM++ still highlight only the most distinctive areas. SEAM performs better than the others but still shows over-activation in some areas. The CAMs generated by WSF-Net are coarse for buildings and more suitable for large targets. Although MSG-SR-Net outperforms SPN by utilizing superpixel segmentation, it still produces false negatives along building boundaries. Our proposed methods outperform the other methods, as in the results for the Chicago dataset. In particular, ours-SIOM generates more complete building activations and distinguishes buildings from backgrounds more effectively, although the edges of the buildings remain somewhat blurred. Conversely, ours-SMCAM preserves the structural details and boundaries of the buildings well but may produce some false activations. By combining both modules, the proposed method achieves the best results, as it fuses multi-level and multi-scale features to enhance the quality of the CAMs for building extraction. The results confirm the effectiveness of the proposed method for building extraction in remote sensing images.

Comparison of Building Extraction Results
A CAM can produce building extraction results directly with a suitable threshold, but such results often have significant limitations and are therefore commonly used as pseudo-labels to train a segmentation network. For a more comprehensive comparison, we train a fully convolutional network for each method to perform building extraction. Table 2 presents the building extraction results on the two datasets. Firstly, compared with the CAM results listed in Table 1, almost all methods show improvement in their building extraction results after training the fully convolutional network, with the exception of WILDCAT, whose poor CAM results in ineffective training of the segmentation network. Secondly, the building extraction accuracies of SEAM and MSG-SR-Net are relatively good; both methods achieve high overall accuracy (OA) scores, demonstrating their ability to effectively distinguish between buildings and non-building objects. Thirdly, on the WHU dataset, our proposed method achieves an F1 score of 0.7759, an OA of 0.9457, and an IoU of 0.6339; similarly, on the Chicago dataset, it achieves an F1 score of 0.7411, an OA of 0.8436, and an IoU of 0.5887. Both significantly outperform the other methods. This highlights the superior effectiveness and robustness of our proposed method for building extraction from remote sensing images, which can be attributed to its ability to produce high-quality CAMs.
A visual comparison of building extraction results from three image patches with varying building densities from each dataset is presented in Figure 8. The results reveal a strong correlation between the quality of the building extraction and the accuracy of the CAM. The higher quality of the annotations and images in the WHU dataset may explain the improved performance compared to the Chicago dataset. Unfortunately, the results of WILDCAT and SPN fall below expectations and fail to effectively extract the buildings. On the other hand, CAM-GAP, GradCAM++, WSF-Net, and SEAM tend to over-highlight building regions, resulting in coarse outcomes. Although MSG-SR-Net performs better than the aforementioned methods, it still yields some inaccurate and incomplete building extractions. Our proposed method, however, demonstrates the best results, especially on the WHU dataset, where individual buildings are clearly distinguishable and have well-defined edges. This highlights the superior effectiveness of our approach for WSSS building extraction.

The Influence of Auxiliary Branches for Classification Network
The proposed SMCAM introduces auxiliary classification branches that enhance the performance of GradCAM++ by shortening the gradient propagation path and improving the semantic representation of low-level features. This allows for the effective fusion of CAMs at multiple levels. However, the alteration of the original classification network structure and the reliance on image-level annotations for supervised learning raise concerns about the impact on the network's feature learning capabilities. In order to assess this, the classification accuracies of the trunks and branches of the Siamese network are calculated, and the results are presented in Table 3. The results indicate that the accuracies of all branches and trunks are higher than 0.97, demonstrating that the auxiliary branches do not compromise the training accuracy of the classification network. Specifically, branch 1 has slightly lower training and testing accuracies compared to the other branches and trunks, due to its weaker feature map generalization capabilities, yet it still meets the accuracy requirement. As shown in Figure 9, the different levels of CAMs obtained from the branching network are also satisfactory, effectively suppressing the noise of low-level CAMs while highlighting the regions related to the buildings. In summary, the auxiliary branches introduced by SMCAM do not negatively impact the performance of the classification network.

Comparison of Different Fusion Strategies in SMCAM
The main objective of SMCAM is to attain a fine-structured CAM by aggregating the multi-level CAMs generated by GradCAM++. A critical aspect of this challenge is to effectively fuse these CAMs. In this study, we propose a self-attention mechanism for fusing multi-level CAMs by exploiting multi-level features. To evaluate the effectiveness of the proposed self-attention mechanism (denoted as "Attention"), we conduct experiments comparing it with direct addition fusion (denoted as "Addition") in the framework of the proposed method. Additionally, we include the CAM generated from the last convolutional layer (denoted as "Last Conv.") for comparison. The results in Table 4 reveal that the self-attention mechanism outperforms Addition, demonstrating its suitability for fusing multi-level CAMs. Additionally, the accuracies of both Attention and Addition are significantly higher than those of Last Conv., implying that utilizing multi-level CAMs enhances CAM quality. The comparison with the GradCAM++ results listed in Table 1 further highlights the effectiveness of the mutual learning mechanism (L_MIC) in the Siamese network in enhancing CAM quality. Furthermore, the improvement in the accuracies of Addition, Attention, and Last Conv. after incorporating SIOM further verifies the effectiveness of SIOM.
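Attention-based fusion of multi-level CAMs can be sketched as per-pixel softmax weighting. The module below is a hypothetical simplification of SMCAM's fusion step, not the paper's exact architecture: a 1 × 1 convolution over a feature map predicts one weight per level at each pixel, and the fused CAM is the weighted sum of the level CAMs.

```python
import torch
import torch.nn as nn

class AttentiveCAMFusion(nn.Module):
    """Hypothetical sketch of attention-based multi-level CAM fusion.

    A 1x1 convolution predicts per-pixel softmax weights, one per level,
    which are used to take a convex combination of the level CAMs.
    Names and shapes are illustrative only."""

    def __init__(self, feat_channels, num_levels):
        super().__init__()
        self.weight_head = nn.Conv2d(feat_channels, num_levels, kernel_size=1)

    def forward(self, features, cams):
        # features: (B, C, H, W); cams: (B, L, H, W) stacked level CAMs
        weights = torch.softmax(self.weight_head(features), dim=1)
        return (weights * cams).sum(dim=1)  # (B, H, W) fused CAM
```

Because the weights sum to one at every pixel, the fused CAM is always bounded by the per-pixel minimum and maximum over the level CAMs; direct addition fusion, by contrast, weights every level equally regardless of content.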
Visualizations of the CAMs in this experiment can be seen in Figure 10, with the self-attention method producing more complete and fine-grained CAMs than the others.
To conclude, the proposed self-attention mechanism effectively fuses multi-level CAMs and can be used in conjunction with SIOM to produce better results.

Performance of Hierarchical Feature Optimization in SIOM
The role of SIOM is to optimize the CAM with multi-scale information, which is effective in improving the integrity of the CAM. Its core component is hierarchical feature optimization, a learnable module that leverages multi-level features to further improve CAM quality. It consists of three stages of optimization: the first two stages focus on mining the non-local relationships between the multi-level features to improve the integrity of the CAM, while the third stage leverages local image information to further improve the fineness of the CAM. To evaluate the efficacy of hierarchical feature optimization, each stage of optimization is analyzed. The pixel correlation module (PCM) in SEAM, which also leverages feature relationships but only considers features from the last two convolutional layers, is included in the experiments for comparison. The experimental results are listed in Table 5, showing that the CAMs generated with hierarchical feature optimization in SIOM are better than those generated with PCM and GradCAM++. Specifically, the results of SIOM improve as the number of stages increases, especially when the third stage with local image information is included, which demonstrates the effectiveness of hierarchical optimization. The CAMs produced by each stage of SIOM are displayed in Figure 11, and it is evident that the CAMs generated by SIOM are more comprehensive and clearer than those generated by GradCAM++ and PCM. As the number of stages increases, the clarity and fineness of the CAMs also improve. It is noteworthy that after including the image features in the third stage, the fineness is further enhanced, but some details are magnified to an excessive extent, which may not align with expectations (e.g., the texture of a roof). The difference in semantic representation between the layers remains substantial, and, in practical applications, the direct use of image features may require additional processing.

Effect of Scale Setting
The proposed method employs a Siamese network that is trained using both the original image and a rescaled image from a multi-scale representation. It is therefore important to discuss the potential impact of different scale settings on the method's performance. To this end, experiments are conducted on the two datasets by varying the scale of the rescaled image while keeping the original image as the other input, where the scale denotes the ratio of the rescaled image's size to that of the original image. Since the task only involves building extraction, the F1 score is selected as the evaluation metric for these experiments.
The results are depicted in Figure 12. The Siamese network struggles to train effectively when the rescaled image and the original image are of the same size, as indicated by the dashed lines in the figure. From the results on the WHU dataset, firstly, the best accuracy is achieved when the scale of the rescaled image lies within the ranges [0.7, 0.8] and [1.2, 1.3]; secondly, the F1 score decreases, and training even fails, when the scale is less than 0.6, while the F1 score also decreases slowly when the scale is larger than 1.5; thirdly, the F1 score decreases as the scale approaches one, as previously discussed. These observations suggest that the scale difference between the rescaled image and the original image must be neither too large nor too small. The results from the Chicago dataset are similar, except that the highest-scoring points are 0.8 for the WHU dataset and 0.7 and 1.4 for the Chicago dataset, which highlights the importance of considering dataset bias in the scale setting. In our initial experiments with the same scale setting as SEAM (0.3), the network failed to complete training and even experienced gradient explosion, further emphasizing the importance of an appropriate scale setting. For this reason, we set the scale ratio of the rescaled image to 0.8 in our experiments.

Limitations and Future Work
From the experimental results and the above analysis, the proposed method stands out among WSSS methods by not only extracting buildings with high accuracy but also ensuring their completeness. However, the accuracy of building extraction using the proposed method still falls considerably short of fully supervised segmentation approaches. As shown in Figures 6 and 7, the CAM generated by the proposed method can sometimes blur the distinction between buildings and shadows, leading to inaccurate building edges. We hypothesize that this can be attributed to the insufficient availability of supervisory information for differentiating buildings from shadows. Additionally, we have observed that our method performs better on the WHU dataset than on the Chicago dataset (Figure 5b), indicating that our approach is sensitive to dataset bias, which can hinder its generalization capability.
To further improve the performance of building extraction, incorporating more supervised information derived from underlying image features or semantic supervision, such as additional category labels and image restoration information, may be a potential solution. Utilizing these extra supervisions is crucial for effectively and accurately distinguishing between buildings and non-buildings. Additionally, to overcome dataset bias, an unsupervised domain adaptation approach using generative adversarial networks could be incorporated, thereby enabling the weakly supervised approach to be applied to a broader range of domains. However, developing specific models and conducting the corresponding experiments is left for future work.

Conclusions
In this work, we introduce a novel scale-invariant multi-level context aggregation network for the task of weakly supervised building extraction from high-resolution imagery. Our approach focuses on two main aspects: (1) The self-attentive multi-level context aggregation module (SMCAM) is proposed to generate noise-free, fine-structured CAMs by shortening the gradient back-propagation path and utilizing a self-attention mechanism to aggregate multi-level CAMs, allowing the model to learn the contribution of each CAM at every pixel. (2) The scale-invariant optimization module (SIOM) is designed to narrow the supervision gap between segmentation and classification by leveraging multi-scale representations. This module includes a mutual learning mechanism for pixel-level auxiliary supervision through optimization with a multi-level constraint loss, and a hierarchical optimization module to improve the CAM's completeness in a coarse-to-fine manner. The Siamese network integrates these two modules, leading to improved CAM quality in terms of both completeness and fineness and allowing for accurate building extraction with only image-level labels.
The proposed method is evaluated through two experiments: CAM generation and building extraction. The results of the CAM generation experiment show that the proposed method outperforms state-of-the-art methods from the fields of remote sensing and computer vision, as evidenced by its improved performance on both the WHU and Chicago datasets. The building extraction results are similarly impressive, with the proposed method achieving an F1 score of 0.7759, an OA of 0.9457, and an IoU of 0.6339 on the WHU dataset, and an F1 score of 0.7411, an OA of 0.8436, and an IoU of 0.5887 on the Chicago dataset. These results demonstrate the effectiveness of the proposed method in accurately extracting buildings using only image-level labels.
Additionally, we present an analysis of the proposed modules. Our results indicate that the self-attentive aggregation component in SMCAM and the hierarchical optimization module in SIOM are both reasonable and effective. Furthermore, we examine the influence of the auxiliary classification branches on the network and the scale setting.
In our study, we have observed limitations in the ability of the proposed method to distinguish between shadows and buildings, as well as its sensitivity to dataset bias.In future work, we plan to incorporate additional supervision information to improve shadow differentiation and investigate domain adaptation techniques to mitigate the impact of dataset bias with the aim of achieving enhanced performance.

Figure 1. Visualization of CAMs generated by GradCAM++: (a) CAMs generated at different levels; (b) CAMs generated from different scales.

Figure 2. The framework of the proposed network.

Figure 3. The structure of the self-attentive multi-level context aggregation module.
v a v h H W p p g x t c F k b g j 9 7 8 j x U T v L + e f 7 s 5 j R X u J r G k Y F 9 O I R j 8 O E C C l C E E p S B w T 0 8 w j O 8 O A / O k / P q v E 1 a F 5 z p z B 7 8 k f P + B c t Q l 4 I = < / l a t e x i t > H ⇥ W ⇥ n ⇥ n < l a t e x i t s h a 1 _ b a s e 6 4 = " L 5 i 3 f A X K T F u C b o i 0 G d K X i X Z b S m g = " > A A A B 8 X i c b V D L S g N B E O y N r x h f U Y9 e B o P g K e y K r 2 P Q i 8 c I 5 o H J E m Y n s 8 m Q 2 d l l p l c I S / 7 C i w d F v P o 3 3 v w b J 8 k e N L G g o a j q p r s r S K Q w 6 L r f T m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x 8 0 T Z x q x h s s l r F u B 9 R w K R R v o E D J 2 4 n m N A o k b w W j 2 6 n f e u L a i F g 9 4 D j h f k Q H S o S C U b T S o y J d F B E 3 R P X K F b f q z k C W i Z e T C u S o 9 8 p f 3 X 7 M 0 o g r Z J I a 0 / H c B P 2 M a h R M 8 k m p m x q e U D a i A 9 6 x V F G 7 x s 9 m F 0

3 <
3 h z j P P i v D s f 8 9 a C k 8 8 c w h 8 4 n z 8 F x p C C < / l a t e x i t > l a t e x i t s h a 1 _ b a s e 6 4 = " d V c P D I u A e D t i d e e g m 3 n 6 w f 7 l g X w = " > A A A B 8 X i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K r 2 P Q S 4 4 R z A O T J c x O Z p M h s 7 P L T K 8 Q l v y F F w + K e P V v v P k 3 T p I 9 a L S g o a j q p r s r S K Q w 6 L p f T m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x + 0T J x q x p s s l r H u B N R w K R R v o k D J O 4 n m N A o k b w f j 2 5 n f f u T a i F j d 4 y T h f k S H S o S C U b T S Q 5 3 0 U E T c k H a / X H G r 7 h z k L / F y U o E c j X 7 5 s z e I W R p x h U x S Y 7 q e m 6 C f U Y 2 C S T 4 t 9 V L D E 8 r G d M i 7 l i p q 1 / j Z / O I p O b H K g I S x t q W Q z N W f E x m N j J l E g e 2 M K I 7 M s j c T / / O 6 K Y b X f i Z U k i J X b L E o T C X B m M z e J w O h O U M 5 s Y Q y L e y t h I 2 o p g x t S C U b g r f 8 8 l / S O q t 6 l 9 W L u / N K 7 S a P o w h H c A y n 4 M E V 1 K A O D W g C A w V P 8 A K v j n G e n T f n f d F a c P K Z Q / g F 5 + M b p + 2 Q R Q = = < / l a t e x i t > H ⇥ W < l a t e x i ts h a 1 _ b a s e 6 4 = " d V c P D I u A e D t i d e e g m 3 n 6 w f 7 l g X w = " > A A A B 8 X i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K r 2 P Q S 4 4 R z A O T J c x O Z p M h s 7 P L T K 8 Q l v y F F w + K e P V v v P k 3 T p I 9 a L S g o a j q p r s r S K Q w 6 L p f T m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x + 0 T J x q x p s s l r H u B N R w K R R v o k D J O 4 n m N A o k b w f j 2 5 n f f u T a i F j d 4 y T h f k S H S o S C U b T S Q 5 3 0 U E T c k H a / X H G r 7 h z k L / F y U o E c j X 7 5 s z e I W R p x h U x S Y 7 q e m 6 C f U Y 2 C S T 4 t 9 V L D E 8 r G d M i 7 l i p q 1 / j Z / O I p O b H K gI S x t q W Q z N W f E x m N j J l E g e 2 M K I 7 M s j c T / / O 6 K Y b X f i Z U k i J X b L E o T C X B m M z e J w O h O U M 5 s Y Q y L e y t h I 2 o p g x t S C U b g r f 8 8 l / S O q t 6 l 9 W L u / N K 7 S a P o w 
h H c A y n 4 M E V 1 K A O D W g C A w V P 8 A K v j n G e n T f n f d F a c P K Z Q / g F 5 + M b p + 2 Q R Q = = < / l a t e x i t > H ⇥ W < l a t e x i ts h a 1 _ b a s e 6 4 = " d V c P D I u A e D t i d e e g m 3 n 6 w f 7 l g X w = " > A A A B 8 X i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K r 2 P Q S 4 4 R z A O T J c x O Z p M h s 7 P L T K 8 Q l v y F F w + K e P V v v P k 3 T p I 9 a L S g o a j q p r s r S K Q w 6 L p f T m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x + 0 T J x q x p s s l r H u B N R w K R R v o k D J O 4 n m N A o k b w f j 2 5 n f f u T a i F j d 4 y T h f k S H S o S C U b T S Q 5 3 0 U E T c k H a / X H G r 7 h z k L / F y U o E c j X 7 5 s z e I W R p x h U x S Y 7 q e m 6 C f U Y 2 C S T 4 t 9 V L D E 8 r G d M i 7 l i p q 1 / j Z / O I p O b H K g I S x t q W Q z N W f E x m N j J l E g e 2 M K I 7 M s j c T / / O 6 K Y b X f i Z U k i J X b L E o T C X B m M z e J w O h O U M 5 s Y Q y L e y t h I 2 o p g x t S C U b g r f 8 8 l / S O q t 6 l 9 W L u / N K 7 S a P o w h H c A y n 4 M E V 1 K A O D W g C A w V P 8 A K v j n G e n T f n f d F a c P K Z Q / g F 5 + M b p + 2 Q R Q = = < / l a t e x i t > H ⇥ W < l a t e x i t s h a 1 _ b a s e 6 4 = " n U 1 u d F 7 R e t c 5 e L U w j N t d Y G f I 1 H

Figure 4. The structure of the scale-invariant optimization module.

Figure 5. Sample images (a) and t-SNE visualization (b) for the two datasets.

Figure 6. Qualitative comparison of the CAM results obtained by various methods on the Chicago dataset.

Figure 7. Qualitative comparison of the CAM results obtained by various methods on the WHU dataset.

Figure 8. Examples of building extraction results of the proposed method and other comparison methods on two datasets: (a) WHU dataset; (b) Chicago dataset.

Figure 9. Visualizations of CAMs generated at various levels.

Figure 11. Visualizations of CAMs under different stage settings in hierarchical feature optimization.

Figure 12. F1 scores of CAM results under different scale settings on the two datasets.
• Grad-CAM++: Chattopadhay et al. [23] propose Grad-CAM++, which is gradient-based and requires no changes to the network structure. Its goal is to provide visual interpretations for CNN-based models, and it can also be used for WSSS; it can be regarded as a generalization of the CAM;
• WILDCAT: Durand et al. [26] introduce WILDCAT to simultaneously align image regions for spatial invariance and learn strongly localized features for WSSS. It uses a single generic training scheme for classification, object localization, and semantic segmentation;
• SPN: The superpixel pooling network (SPN), proposed by Kwak et al. [25], uses a superpixel segmentation of the input image as a pooling layout that cooperates with low-level features for semantic segmentation learning and inference. The architecture decouples the semantic segmentation task into classification and segmentation, allowing the network to learn a class-agnostic shape prior from the noisy annotations. It achieves outstanding performance on the challenging PASCAL VOC 2012 segmentation benchmark;
• WSF-Net: WSF-Net [56] is proposed for binary segmentation in remote sensing images, with the potential to handle class imbalance through a balanced binary training strategy. It introduces a feature fusion strategy adapted to the characteristics of objects in remote sensing images. The experiments achieve a promising performance for
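All of the comparison baselines above build on the basic CAM mechanism: the final convolutional feature maps are weighted by the classifier weights of the target class and summed over channels, so that spatial locations contributing most to the class score light up. The following is a minimal sketch of that generic computation (Zhou et al.'s original CAM, not the method proposed in this paper); the function name and toy dimensions are illustrative:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Generic CAM: weight the last-layer feature maps by the
    classifier weights of the target class, then normalize to [0, 1].

    features:   (C, H, W) feature maps from the final conv layer
    fc_weights: (num_classes, C) weights of the classifier that follows
                global average pooling
    class_idx:  index of the class to visualize (e.g. "building")
    """
    # Weighted sum over the channel axis -> (H, W) activation map
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)          # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()           # normalize to [0, 1]
    return cam

# Toy example: 4 channels, an 8x8 spatial grid, 2 classes
rng = np.random.default_rng(0)
feats = rng.random((4, 8, 8)).astype(np.float32)
weights = rng.random((2, 4)).astype(np.float32)
cam = class_activation_map(feats, weights, class_idx=1)
assert cam.shape == (8, 8)
```

Because only the locations most useful for classification receive large weighted activations, such a map tends to highlight the most discriminative parts of a building rather than its full extent, which is exactly the incompleteness problem the proposed multi-level aggregation and scale-invariant optimization aim to address.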

Table 2. Evaluation of building extraction results generated by various methods with IoU (%), OA (%), and F1 score (%) on the WHU and Chicago datasets.

Table 3. The training and testing accuracies of branches and trunks in the classification network.

Figure 10. Visualizations of CAMs with different fusion strategies.