Making Low-Resolution Satellite Images Reborn: A Deep Learning Approach for Super-Resolution Building Extraction

Abstract: Existing methods for building extraction from remotely sensed images rely strongly on very-high-resolution aerial or satellite images, which are usually limited by spatiotemporal accessibility and cost. In contrast, relatively low-resolution images have better spatial and temporal availability but cannot directly contribute to fine- and/or high-resolution building extraction. In this paper, based on image super-resolution and segmentation techniques, we propose a two-stage framework (SRBuildingSeg) for achieving super-resolution (SR) building extraction using relatively low-resolution remotely sensed images. SRBuildingSeg can fully utilize inherent information from the given low-resolution images to achieve high-resolution building extraction. In contrast to the existing building extraction methods, we first utilize an internal pairs generation module (IPG) to obtain SR training datasets from the given low-resolution images and an edge-aware super-resolution module (EASR) to improve the perceptional features, followed by the dual-encoder building segmentation module (DES). Both qualitative and quantitative experimental results demonstrate that our proposed approach is capable of achieving high-resolution (e.g., 0.5 m) building extraction results at 2×, 4×, and 8× SR. Our approach outperforms eight other methods in mean Intersection over Union (mIoU) of the extraction results by 9.38%, 8.20%, and 7.89% at SR ratio factors of 2, 4, and 8, respectively. The results indicate that the edges and borders reconstructed in super-resolved images play a pivotal role in subsequent building extraction and reveal the potential of the proposed approach to achieve super-resolution building extraction.


Introduction
With rapid urbanization in recent years, high-resolution building extraction plays an increasingly essential role in urban planning, change monitoring, and population estimation [1][2][3][4]. With a rich set of remotely sensed images, it is possible to infer and distinguish buildings from background objects at pixel level [5]. Such a process is defined as building segmentation or building extraction [6].
In terms of data source, very high resolution (VHR) remotely sensed images, such as 0.1 m airborne images [7,8] and 0.5 m space-borne images [9], were viewed in previous studies as an essential data source for producing high-resolution building extraction. Nevertheless, those VHR images are restricted in spatial extent and temporal availability, which makes methods that demand VHR images difficult to apply over large areas. In contrast, relatively low-resolution images such as satellite-based images of the WorldView series (1.2-2.4 m) and the Planet series (3 m) have better spatiotemporal availability. However, Mariana Belgiu and Lucian Drǎguţ [10] along with Huiping Huang et al. [11] have shown that remotely sensed images with relatively lower resolution generally lead to lower accuracy and coarser boundaries in segmentation results. The resolution inconsistency between the remotely sensed images and the building extraction target greatly impacts the segmentation results. Ryuhei Hamaguchi and Shuhei Hikosaka [12] pointed out that deep learning models trained using low-resolution images could hardly extract buildings at a significantly different, higher resolution. Haut et al. [13] pointed out that the resolution of remotely sensed images significantly affects the distribution of the spatial features, which is important in distinguishing the pixels of buildings from those of the background. Therefore, it remains challenging to develop an automated framework for achieving super-resolution building extraction results using relatively low-resolution remotely sensed images [14][15][16].
Despite the difficulties, achieving super-resolution building extraction results from relatively low-resolution remotely sensed images can be of great value. First, for long-term building change detection research, relatively low-resolution remotely sensed images are irreplaceable and exclusive, especially for the late 20th century and early 21st century [17][18][19]. In these cases, relatively low-resolution remotely sensed images are the only choice for building extraction. In addition, with the rich diversity of satellites and remote sensor technologies, it is common to observe inconsistent spatial resolutions in source datasets and target results for a certain task. For example, Chen et al. [20] transferred collected low-resolution training material into remotely sensed image pixel classification of another resolution version, making it possible to generate building segmentation results over large areas [10,21,22] or long time periods [23][24][25] via spatiotemporally available low-resolution remotely sensed images.
To conduct such a task, the simplest and most widely used solution is to interpolate all the resolution-inconsistent images to one desired resolution during preprocessing, for example, by bilinear or bicubic interpolation [26]. However, interpolation-based methods, in which the generated pixels are calculated from adjacent pixels, suffer a loss of spatial information, especially at edges and in high-frequency regions, where interpolation generates insufficient gradients [26,27].
Alternatively, super-resolution (SR) methods aim at reconstructing low-resolution images into high-resolution versions with finer spatial details [28]. SR provides a promising alternative to map remotely sensed images with inconsistent resolution into a version with uniform resolution for high-resolution building extraction. However, existing SR approaches in remote sensing require a number of external high-resolution images to obtain training datasets. Haut et al. [29] retrieved 2100 external high-resolution images for training, while Shao et al. [30] collected more than 100,000 image patches for training. Developing a novel SR approach with no need for external high-resolution images remains challenging but valuable. Moreover, previous studies mainly focus on the perceptual improvements of super-resolved images, with no evaluation regarding how much the improvement of image perceptual quality can be transferred into the improvement of subsequent building extraction.
Here, we propose the edge-aware super-resolved building segmentation network (SRBuildingSeg) as a novel framework to achieve super-resolution building extraction. The major contributions of this paper are as follows:
• We propose a two-stage framework for attaining super-resolution building extraction, named SRBuildingSeg, which can make use of the extracted features of the given low-resolution images to improve the performance of building extraction in high-resolution representation.
• Considering the self-similarity between buildings in remotely sensed images, we develop an internal pairs generation module (IPG) and an edge-aware super-resolution module (EASR). Using the two proposed modules, we can fully utilize the internal information of the given images to improve the perceptional features for subsequent building segmentation without any external high-resolution images.
• We propose a dual-encoder building segmentation module (DES) which enables our approach to attain super-resolution building extraction by fully utilizing the texture features and enhanced perceptional features.
• We demonstrate that the reconstructed high frequency information of the super-resolved image can be transferred into the improvement of the super-resolution building extraction task. The assessment results reveal that our proposed approach ranks the best among all eight compared methods.
The rest of the paper is organized as follows: in Section 2, we introduce the related work, including the existing deep learning-based building extraction methods and single image super-resolution techniques. In Section 3, we provide a detailed description of the proposed approach. Experimental results and discussion are given in Sections 4 and 5. We present our conclusions in Section 6.

Building Extraction Using Deep Learning Approaches
Since Sakrapee Paisitkriangkrai et al. [31] first proposed a CNN based framework to extract buildings in multispectral images, deep learning based building extraction approaches were proposed and have proven to be effective using VHR images [32][33][34]. Despite the great success of deep learning approaches in building semantic segmentation, only a few discussions focus on building extraction in which the given images and the target results differ in spatial resolution. Belgiu and Drǎguţ [10], and Hamaguchi and Hikosaka [12] compared the building segmentation results of several different approaches using multi-resolution remotely sensed images. They found that the accuracy of extraction results differs with respect to each building size and each specific resolution. Schuegraf and Bittner [35] proposed a hybrid deep learning network for obtaining high-resolution (0.5 m) building extraction results using low-resolution (2 m) multi-spectral and panchromatic images, but their experimental results only show slight improvement in extracting buildings of small size. Guo et al. [36] proposed a framework to extract buildings from relatively low-resolution remotely sensed images while using relatively high-resolution images as training material. Nevertheless, their proposed framework could only generate low-resolution segmentation results from the given high-resolution training material. In addition, they used 0.5 m remotely sensed images as "low-resolution images", and their extraction accuracy rapidly declines as the ratio of unaligned resolutions enlarges. Thus, it remains a challenge to obtain fine and high-resolution building extraction from low-resolution remotely sensed images.

Single Image Super-Resolution
Single image super-resolution (SISR), which aims at reconstructing an image into a high-resolution version with finer spatial details than those of the original [28], has emerged as a promising alternative for mapping low-resolution remotely sensed images into versions of higher resolution [37][38][39]. Although super-resolution (SR) can reconstruct essential details of land features from the original datasets at a specific desired spatial resolution, it generally requires a very large number of external paired high-resolution images for training [30,40]. Moreover, the images reconstructed by those SR models rely strongly on the external information provided by the training material, which makes the collection of training samples more difficult [41]. At the same time, SR models trained in an unsupervised way, e.g., the unsupervised Generative Adversarial Network (GAN) model for SR [42], have emerged as practical alternatives. However, the performance of those unsupervised algorithms is usually unsatisfactory in high-frequency regions compared with the supervised approaches [43,44]. Another unsupervised SR model, zero-shot super-resolution (ZSSR) [41], requires thousands of gradient updates in image reconstruction. In addition, remotely sensed images are usually large in size, whereas the ZSSR model is designed for natural images of small size. Thus, it is still challenging to generate fine super-resolved images without using external high-resolution images. Furthermore, the contributions of SR methods to subsequent building extraction lack qualitative evaluation and discussion.

Methodology
In this paper, we aim to utilize the given low-resolution remotely sensed images to achieve building extraction in high-resolution representation. As illustrated in Figure 1, the overall framework SRBuildingSeg is a two-stage architecture. Stage one focuses on reconstructing a high-resolution version of the given low-resolution images. We first propose an internal pairs generation module (IPG) to construct LR-HR training pairs, which can improve the model trained in an unsupervised way. We then reconstruct the super-resolved images using an edge-aware super-resolution module (EASR), which is trained on the constructed training dataset. Stage two exploits the dual-encoder building segmentation module (DES) to achieve building extraction in high-resolution representation; it takes both the super-resolved images and the enhanced perceptional features as input in order to improve the segmentation performance. We elaborate on the details of SRBuildingSeg in the following sections. Sections 3.1 and 3.2 introduce the IPG module and the EASR module, respectively. Section 3.3 describes the DES, and Section 3.4 presents the assessment criteria and loss function.

Internal Pairs Generation Module
Existing supervised SR methods in the remote sensing domain require a large number of LR-HR pairs as training material. In addition, the performance of supervised approaches strongly relies on the external information extracted from LR-HR pairs, e.g., the representativeness of the training dataset. Considering the limitations of supervised SR approaches, we propose an internal pairs generation module (IPG) to obtain LR-HR training pairs without any external high-resolution images. Different from existing supervised approaches, the IPG can fully exploit the self-similarity of the remotely sensed image, which generally covers a large area and thus contains buildings of nearly all various colors, shapes, surroundings, materials, heights, and forms. The internal information of the remotely sensed images is a generalized and representative information source, which proved its effectiveness in the training of the SR model [23,45].
The proposed IPG consists of four steps to generate the HR and corresponding LR (LR-HR) training pairs from the given low-resolution image $I_{low}$. First, we obtain the "HR training pairs" by simply splitting and cropping the given low-resolution image $I_{low}$. In other words, the "HR training pairs" $I_{LR}$ are actually presented in relatively low resolution but are treated as the high-resolution information source in the process of training dataset generation. Second, the corresponding LR training pairs $I_{LLR_{s\downarrow}}$ are obtained by downscaling each image in $I_{LR}$ using bicubic interpolation, where $s$ represents the desired SR scale factor. The "LR training pairs" $I_{LLR_{s\downarrow}}$ are thus a lower-resolution version of the given $I_{low}$. The generated $I_{LR}$ and $I_{LLR_{s\downarrow}}$ consist of many LR-HR image pairs, which can be used as input and target in the training process of the subsequent SR module. Third, for the sake of robustness, as well as to enrich the diversity of building sizes, we generate many versions of the LR-HR image pairs using a random upscale factor. Finally, the training material is enriched by randomly transforming each image in the LR-HR pairs using four rotations (0°, 90°, 180°, 270°), mirror reflection in the vertical and horizontal directions, flipping, resampling, and adding Gaussian noise.
Taking the training pairs generated with a scale factor of 4 as an example, as illustrated in Figure 2, we first generate HR training pairs $I_{LR}$ (2 m) by cropping and splitting the given images $I_{low}$ (2 m). We then downscale each image in $I_{LR}$ to generate the corresponding LR training pairs $I_{LLR_{s\downarrow}}$ (8 m). After dataset enrichment, the generated LR-HR pairs are used as training material for the SR model. Finally, the properly trained SR model takes the given images $I_{low}$ (2 m) as input to generate super-resolved images $I_{H_{s\uparrow}}$ (0.5 m).
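The pair-generation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the patch size, and the random test image are hypothetical, and block-mean downscaling is used here as a dependency-free stand-in for the bicubic interpolation named in the text. Only rotations and a horizontal flip are shown from the enrichment schemes.

```python
import numpy as np

def crop_patches(image, patch, stride):
    """Split an (H, W, C) array into square patches of side `patch`."""
    h, w = image.shape[:2]
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, stride)
            for c in range(0, w - patch + 1, stride)]

def downscale(patch, s):
    """Downscale by an integer factor s using block means.
    Assumption: stands in for bicubic interpolation to avoid extra dependencies."""
    h, w, c = patch.shape
    return patch.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))

def augment(hr, lr, rng):
    """Apply the same random 90-degree rotation and horizontal flip to both images."""
    k = int(rng.integers(0, 4))
    hr, lr = np.rot90(hr, k), np.rot90(lr, k)
    if rng.random() < 0.5:
        hr, lr = np.fliplr(hr), np.fliplr(lr)
    return hr, lr

def make_pairs(i_low, s, patch=64, stride=64, seed=0):
    """Build (HR, LR) training pairs from the low-resolution image alone."""
    rng = np.random.default_rng(seed)
    pairs = []
    for hr in crop_patches(i_low, patch, stride):   # "HR" patches, e.g. 2 m
        lr = downscale(hr, s)                       # "LR" patches, e.g. 8 m for s = 4
        pairs.append(augment(hr, lr, rng))
    return pairs

# Hypothetical 256 x 256 RGB tile standing in for I_low.
i_low = np.random.default_rng(1).random((256, 256, 3))
pairs = make_pairs(i_low, s=4)
```

With a 64-pixel patch and a scale factor of 4, each pair holds a 64 × 64 target and a 16 × 16 input, so the SR model learns the same ×4 mapping that is later applied to the full $I_{low}$.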

Figure 2.
An example workflow of the proposed LR-HR training pairs generation with a scaling factor of 4. In this example, $I_{low}$ represents the raw remote sensing images (2 m), which usually consist of one or two large images covering a large area. $I_{LR}$ represents the HR patches (2 m), which are generated by cropping and splitting $I_{low}$. For each image patch in $I_{LR}$, we generate its downscaled version; together these constitute the corresponding LR training pairs $I_{LLR_{s\downarrow}}$ (8 m). Several dataset enrichment schemes are applied to the LR-HR training pairs. After the super-resolution model is properly trained on the LR-HR training pairs, the trained SR model takes the given images $I_{low}$ (2 m) as input to generate super-resolved images $I_{H_{s\uparrow}}$ (0.5 m).

Edge-Aware Super-Resolution Module for Reconstructing High-Resolution Images
Considering that our LR-HR training pairs are generated using only the given low-resolution images and contain no external information, the high-frequency information of the reconstructed images, which plays a vital role in subsequent building extraction [46], remains to be improved. We employ an edge-aware super-resolution module (EASR) to better reconstruct the high-frequency content of any given low-resolution remotely sensed image. EASR integrates an initial generative adversarial subnetwork and a gradient-based enhancement subnetwork. In the training phase, the EASR utilizes the constructed $I_{LR}$ and $I_{LLR_{s\downarrow}}$ as training material. In the test phase, the EASR takes the given LR image $I_{low}$ as input and outputs super-resolved images $I_{H_{s\uparrow}}$ at the given scale factor $s$. The proposed EASR network is illustrated in Figure 3. EASR is a GAN-based architecture consisting of a generator and a discriminator.
The generator, which aims to reconstruct the HR image $I_{H_{s\uparrow}}$ from the given LR image $I_{low}$ at the given scale factor $s$, consists of an initial reconstruction subnetwork and a gradient-based enhancement subnetwork. The reconstruction process contains the following three steps. The first step reconstructs an initial SR image $I_{init}$ using the initial generative adversarial subnetwork, which is composed of several residual blocks and a reconstruction layer that acts as a decoder; this intermediate HR result supports the overall quality of the reconstructed $I_{H_{s\uparrow}}$. The second step focuses on the reconstruction of the high-frequency information $I^{+}_{edge}$, which plays an important role in distinguishing the borders and edges of buildings in remotely sensed images. In this step, we first utilize a gradient guidance operation to detect and extract gradient information from $I_{init}$, which is intuitively useful for better inferring the local intensity of sharpness. In addition, a frame branch and a mask branch are utilized to extract fine edge maps from the gradient information. These two branches learn a noise mask through the attention mechanism so that the network can focus on the real edge information and thereby remove noise and artifacts. Specifically, the mask branch consists of three convolutional layers, which adaptively learn weight matrices that apply soft attention to the relevant information. The frame branch contains several residual blocks to infer and extract the sharp edge information. The gradient-based enhancement subnetwork therefore reconstructs $I^{+}_{edge}$ from $I_{init}$ through a mapping function $GE(\cdot)$, which consists of the gradient calculator, the frame branch, and the mask branch. The enhancement subnetwork can reconstruct the edges while reducing noise and maintaining sharpness.
The third step concatenates the initial SR image $I_{init}$ and the enhanced $I^{+}_{edge}$ to produce the final enhanced SR image $I_{SR}$. While the generator module is dedicated to reconstructing an SR image that is similar to the ground truth HR image, the discriminator module aims to distinguish the reconstructed SR images from ground truth HR images. For the discriminator, we take the architectural design in [46] as a reference but use max pooling to replace the strided convolution.
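To make the gradient guidance and soft-attention masking more concrete, the following sketch computes a finite-difference gradient magnitude and multiplies a frame map by a sigmoid mask. It is only an illustration under stated assumptions: in the EASR both branches are learned convolutional networks, whereas here the frame map and mask logits are plain arrays, and the function names are hypothetical.

```python
import numpy as np

def gradient_magnitude(img):
    """Gradient magnitude of a 2-D image from central finite differences."""
    gy, gx = np.gradient(img.astype(float))  # derivatives along rows, then columns
    return np.sqrt(gx ** 2 + gy ** 2)

def masked_edges(frame, mask_logits):
    """Soft attention: weight each frame value by a sigmoid mask in (0, 1).
    Assumption: the mask logits would come from a learned mask branch."""
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    return frame * mask

# A vertical step edge: left half 0, right half 1.
img = np.zeros((6, 8))
img[:, 4:] = 1.0
grad = gradient_magnitude(img)

# A mask that keeps everything (large positive logits) versus one that suppresses everything.
kept = masked_edges(grad, np.full(grad.shape, 50.0))
suppressed = masked_edges(grad, np.full(grad.shape, -50.0))
```

The gradient map is non-zero only in the two columns adjacent to the step, which is the kind of edge response the mask is meant to keep while down-weighting noisy responses elsewhere.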

Segmentation Network for Building Extraction
Using the reconstructed HR image $I_{H_{s\uparrow}}$ and the corresponding building footprint labels as training material, we train a dual-encoder segmentation module (DES) for building extraction in stage two. The proposed DES is a modified version of DlinkNet, which was first proposed by Zhou et al. [47] and has proved effective in several recent studies [48][49][50]. The proposed DES contains two encoder submodules and one decoder submodule. As discussed above, the high-frequency information of the reconstructed images can help define the building boundaries. Hence, we append an extra encoder module which takes the reconstructed high-frequency information $I^{+}_{edge}$ as input to assist the segmentation module in distinguishing building areas from the background. The final building segmentation $Seg$ is produced from these two inputs. Each encoder of the proposed DES is initialized with ResNet-34 weights pre-trained on the ImageNet dataset. In addition, we employ dilated convolution layers with dilation rates of 1, 2, 4, and 8 to improve the global and local representativeness of the buildings. The other submodule is the decoder of the segmentation network, which follows the decoder in U-net. The decoder uses transposed convolution layers to upscale the feature map to the same size as the input images.
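As a small worked example of why dilation rates of 1, 2, 4, and 8 enlarge the context seen by the network, the sketch below computes the receptive field of a stack of dilated convolutions. It assumes 3 × 3 kernels, stride 1, and layers applied in series; these are assumptions for illustration, since the text does not state the kernel size or how the dilated layers are arranged in the DES.

```python
def stacked_receptive_field(dilations, kernel=3):
    """Receptive field (in feature-map pixels) of stride-1 dilated convolutions in series.
    Each layer with dilation d widens the field by (kernel - 1) * d."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

rf = stacked_receptive_field([1, 2, 4, 8])
print(rf)  # prints 31
```

Under these assumptions the four layers cover a 31 × 31 window of the feature map, compared with 9 × 9 for four ordinary 3 × 3 convolutions.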

Loss Function
In stage one, we utilize commonly used loss functions for SR methods, including the reconstruction loss $L_{rec}$, the adversarial loss $L_{adv}$ [46], the content loss $L_{cont}$ [51], and a total variation (TV) regularization $L_{tv}$ [52] to constrain the smoothness of $I_{SR}$. The overall loss is a weighted combination of these terms, where $\alpha$, $\beta$, and $\gamma$ denote the weights of each loss. The reconstruction loss $L_{rec}$ preserves the consistency of image content between the super-resolved image $I_{SR}$ and the HR image $I_{LR}$ and is computed pixel-wise. The content loss forces the generator to generate an intermediate $I_{SR}$ image similar to $I_{init}$ and is likewise computed pixel-wise. The adversarial loss aims to help the network improve the perceptual quality of the generated images. It comprises a generator term $L_{adv}$ and a discriminator term $L_{adv-D}$; the generator term is defined as:
$L_{adv} = -\log(D(G(I_{LR})))$ (8)
The total variation (TV) loss constrains the smoothness of $I_{SR}$ and is defined in terms of $\nabla(\cdot)$, the gradient operator along the horizontal and vertical directions. In stage two, we utilize the commonly used binary cross entropy loss for the segmentation task.
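Two of the stage-one terms can be sketched numerically. The snippet below is an illustration rather than the paper's definition: the TV term is written in its common anisotropic form (sum of absolute horizontal and vertical differences), which is an assumption because the exact norm and normalization are not given here, and the generator term follows Equation (8) with a small `eps` added for numerical safety. Function names are hypothetical.

```python
import numpy as np

def tv_loss(img):
    """Anisotropic total variation of a 2-D image (assumed form):
    sum of absolute differences between vertical and horizontal neighbours."""
    dv = np.abs(np.diff(img, axis=0)).sum()
    dh = np.abs(np.diff(img, axis=1)).sum()
    return float(dv + dh)

def generator_adv_loss(d_out, eps=1e-12):
    """Generator adversarial term -log(D(G(x))), averaged over a batch.
    `d_out` holds discriminator probabilities for generated images."""
    return float(-np.log(np.asarray(d_out, dtype=float) + eps).mean())

flat = np.ones((4, 4))                       # perfectly smooth image
step = np.array([[0.0, 1.0], [0.0, 1.0]])    # one vertical edge

tv_flat = tv_loss(flat)
tv_step = tv_loss(step)
adv_fooled = generator_adv_loss([1.0])       # discriminator fully fooled
adv_unsure = generator_adv_loss([0.5])       # discriminator undecided
```

A smooth image has zero TV, while each edge adds to it, which is why this term discourages noisy reconstructions; the adversarial term shrinks toward zero as the discriminator assigns higher probability to generated images.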

Study Area and Data
The study area contains the main urban zones of three megacities in China: Beijing, Shanghai, and Xi'an. It covers a total of approximately 1860 km² and contains multiple building types. As shown in Figure 4, the study areas cover various types, forms, and shapes of buildings, including the most modern buildings in developed areas and factories under development, all of which make the selected areas representative for this study. With regard to the annotated dataset, the building footprints were annotated manually with reference to remotely sensed images retrieved in 2018. The annotated dataset contains the spatial coordinates of all annotated building polygons, and a further rasterization process was conducted in the QGIS platform to generate ground truth labels at the corresponding resolution for each baseline case. Note that a few mismatches between the annotation results and the actual 'ground truth' are inevitable as a result of limitations in human-based interpretation and minor time inconsistency between the retrieved images and the reference images used for annotating buildings.

Implementation Details
Two experiments are conducted to verify the effectiveness of our proposed two-stage SRBuildingSeg. In the first experiment, we compare the building footprint segmentation results of various SR approaches combined with the same segmentation approach. The second experiment compares the building extraction performance of the proposed SR method combined with various segmentation approaches.
Our method is implemented in PyTorch. All the networks in this paper are trained by mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The learning rate of the super-resolution stage is initialized as 0.001 and the learning rate of the segmentation stage is initialized as 0.0001. We apply a decay rate of 0.9 after every five epochs for both stages. Our network converges in 100 epochs in both stages, and the batch size is set to 5. An NVIDIA 2080Ti GPU is used for training.
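The learning-rate schedule described above can be written as a short function. This sketch assumes that "a rate of 0.9 after every five epochs" means the learning rate is multiplied by 0.9 at epochs 5, 10, 15, and so on, which is the usual step-decay reading; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr, decay=0.9, step=5):
    """Step decay (assumed reading): multiply base_lr by `decay` once per `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Initial rates from the text: 0.001 for super-resolution, 0.0001 for segmentation.
sr_schedule = [learning_rate(e, 0.001) for e in range(100)]
seg_schedule = [learning_rate(e, 0.0001) for e in range(100)]
```

Under this reading the rate is decayed 19 times over 100 epochs, so the final learning rate is roughly 13.5% of the initial value (0.9 to the power 19).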

The Effect of Super-Resolution in Building Extraction
In this section, we focus on comparing the effects contributed by the super-resolution stage in achieving super-resolution building extraction. Therefore, we train each DlinkNet [47] for the segmentation stage under the same conditions while using different SR methods in the super-resolution stage. Considering that the IPG module can help train SR techniques in an unsupervised way, we select two unsupervised SR approaches (i.e., TSE [53], ZSSR [41]) as well as four supervised SR methods (i.e., SRGAN [46], EEGAN [51], waifu [54], DASR [55]), which are trained on the dataset generated by our proposed IPG module. All segmentation networks are trained under the same conditions using 0.5 m super-resolved images. According to the scale ratios used in generating those SR images, our building segmentation experiment consists of three cases: ratio ×2 (the resolution of LR images is 1 m), ratio ×4 (the resolution of LR images is 2 m), and ratio ×8 (the resolution of LR images is 4 m). Note that the case of using bicubic interpolated remotely sensed images to train the segmentation model (BCI) is viewed as the baseline in this experiment. Table 1 presents the quantitative evaluation of the super-resolved building extraction with scale factors of 2, 4, and 8 using those methods. Note that it presents the average results over all images collected for testing to provide a global view. All original high-resolution images in the test area are used for testing. According to the quantitative assessment reported in Table 1, the proposed EASR achieves better performance than the other methods, which reveals that the EASR module helps to make buildings in the reconstructed images more distinguishable from the background.
In comparison with unsupervised methods, building extraction achieved by integrating a supervised SR method with the proposed IPG module performs better, indicating that our proposed IPG module helps supervised SR methods fully utilize the internal information of low-resolution remotely sensed images; this is also supported by the qualitative evaluation in Figure 5. As shown in Figure 5, the proposed approach exhibits a great advantage in extracting borders and primary structures of buildings in remotely sensed images. As the SR scale ratio enlarges, the IPG module contributes less improvement and even regresses in achieving super-resolution building extraction. The reason for this phenomenon is that the IPG module takes LR images as input to extract useful information for reconstructing edges and borders of super-resolved images. Meanwhile, as the SR scale ratio enlarges, the LR images tend to contain more noise and blurring, which makes it difficult to extract useful information.

The Effect of the Segmentation Module in Building Extraction
In this section, we compare our proposed DES module with six other state-of-the-art segmentation methods (i.e., Unet [56], DeepLabv3p [57], PSPNet [58], DlinkNet [47], UNet++ [59], and HRNet [60]). For a fair comparison, all segmentation networks are trained under the same conditions using the same 0.5 m super-resolved images reconstructed via our proposed EASR.
We perform the qualitative evaluation by visualizing the predicted results against the ground-truth labels. As demonstrated in Figure 6, the segmentation results of the proposed DES maintain the main structures and borders of buildings, while the others fail to extract buildings, especially in building extraction tasks with a large scale factor (the ratio ×8 cases in Figure 6). This reveals that our approach can significantly improve feature representativeness in the process of building extraction, especially along building borders. Furthermore, our proposed approach shows its robustness in extracting buildings of variable density, height, texture, and form, while the others produce unclear building contours (the ratio ×4 cases in Figure 6). The quantitative evaluation, as shown in Table 2, indicates that the proposed approach achieves better performance than the other methods with regard to IoU, recall, F1 score, overall accuracy, and kappa coefficient. This signifies that our proposed approach can enhance the comprehensive features and information of super-resolved remotely sensed images with an appropriate SR scale factor. It can be inferred from the segmentation details, as shown in Figure 6, that the proposed approach yields fewer false positive cases than the other methods, especially in the vicinity of building boundaries.
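For reference, the evaluation measures named above can all be derived from the binary confusion matrix of predicted and ground-truth building masks. The sketch below uses the standard definitions of IoU, precision, recall, F1 score, overall accuracy, and Cohen's kappa; the function name and the toy masks are hypothetical, and it assumes both classes occur so that no denominator is zero.

```python
import numpy as np

def building_metrics(pred, truth):
    """Standard binary segmentation metrics for the building class.
    Assumes both building and background pixels are present (no zero denominators)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))      # building predicted as building
    fp = int(np.sum(pred & ~truth))     # background predicted as building
    fn = int(np.sum(~pred & truth))     # building missed
    tn = int(np.sum(~pred & ~truth))    # background predicted as background
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    oa = (tp + tn) / n
    # Expected agreement by chance, used by Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "iou": tp / (tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "overall_accuracy": oa,
        "kappa": (oa - pe) / (1 - pe),
    }

# Toy 1 x 4 masks: one of two building pixels is detected, with no false alarms.
metrics = building_metrics(pred=[1, 0, 0, 0], truth=[1, 1, 0, 0])
```

In this toy case IoU and recall are both 0.5 while overall accuracy is 0.75, which shows why accuracy alone can look favourable when background pixels dominate.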

Discussion
Since the above experimental results show the potential of the proposed approach in achieving high-resolution building extraction using low-resolution images, the mechanism and limits of the proposed approach necessitate further discussion. In this section, we primarily discuss two topics: (1) how enhanced high frequency information influences the super-resolution building extraction and (2) what the limits of the proposed approach are in conducting super-resolution building extraction tasks.

The Effectiveness of High Frequency Information in Building Extraction
As demonstrated in Tables 1 and 2, it seems feasible to achieve high-resolution building extraction by integrating super-resolution and building segmentation methods. In comparison to the building extraction results using bicubic interpolated images, all SR-integrated methods achieve better performance at scaling ratios of 2, 4, and 8. In addition, the details of images reconstructed using our approach are well maintained in comparison to those of other integrated methods. A simple but vital question is: how does enhanced high frequency information influence super-resolution building extraction?
As demonstrated in Figure 7, we visualized the penultimate CNN layer of DlinkNet (Figure 7a) and of our proposed DES (Figure 7b). Specifically, we visualized the variance between the penultimate CNN layers of DlinkNet and DES (Figure 7c), which highlights the high frequency information enhanced in the segmentation task. Owing to the fusion of enhanced details and input images, the features at the edges and borders of buildings are clearly well represented, resulting in better building extraction results. Meanwhile, the super-resolved images generated via other SR methods maintain better edges and borders than bicubic interpolated images, thus resulting in better performance. This reveals that the edges and borders of buildings play pivotal roles in building segmentation from remotely sensed images.

The Limitations of the Proposed Approach
Since we can generate high-resolution building extraction results with acceptable accuracy using training material with a resolution of around 4 m, it seems theoretically possible to use even lower resolution material, such as 6 m, to achieve high-resolution building extraction results following the same approach. As shown in the experimental results of remotely sensed image SR and building extraction, the improvements contributed by SR methods rapidly decline as the ratio factor enlarges, especially in the cases with a ratio factor of 8. This is primarily attributed to the unsatisfactory reconstruction results obtained for the edges and borders. Nevertheless, whether there is a limiting scaling ratio for conducting high-resolution building extraction using low-resolution material remains to be determined. We therefore test the proposed approach in conducting high-resolution building extraction with ratio factors of 2, 4, 8, and 12. Note that the training material with a ratio factor of 12 was generated from OHGT using bilinear interpolation. As shown in Table 3 and Figure 8, two shortcomings emerge as the ratio factor of super-resolution in building extraction continually increases, which could impose a theoretical limit on the ratio our proposed approach can handle. On the one hand, the higher the ratio is, the harder the training process becomes. The training pairs generation module first downscales the given input images, after which much coarser images are generated as input for the SR module, resulting in remotely sensed images without sufficient detail for subsequent reconstruction. The ability of the proposed SR module to provide finer details in the output images becomes weaker as the scale ratio of the building extraction task enlarges.
On the other hand, the improvement contributed by our enhancement module diminishes as the scale ratio of the building extraction task enlarges, which results from the insufficiency of details retrieved from the given low-resolution images. As demonstrated in Figure 8, the high frequency information reconstructed via the proposed EASR becomes coarse and may even introduce a few artifacts. This indicates that the proposed approach is reaching its limits in conducting the building extraction task at a scale ratio of 12.

Conclusions
In this study, we propose a novel two-stage framework (SRBuildingSeg) to achieve super-resolution building extraction using relatively low-resolution remotely sensed images. SRBuildingSeg can fully utilize inherent information of the given low-resolution images to achieve relatively high-resolution building extraction. For generating LR-HR training pairs, we propose an internal pairs generation module (IPG) with no need for external high-resolution images, which makes it possible to reconstruct super-resolved images with only the given low-resolution images. The edge-aware super-resolution (EASR) module then generates super-resolved images at the desired higher resolution, after which the super-resolution building extraction result is obtained using the dual-encoder building segmentation module (DES). The experimental results demonstrate the capability of the proposed approach in achieving super-resolution building extraction, which outperforms other methods in terms of both the perceptual quality of the super-resolved remotely sensed image and the building extraction accuracy for small (×2), middle (×4), and large (×8) scale ratios. Furthermore, we demonstrate how the reconstructed high frequency information affects the subsequent building extraction. The assessment results reveal that our proposed approach ranks the best among all SR-integrated methods. In summary, we show the potential of the proposed straightforward approach for using widely available low-resolution data to obtain high-resolution building extraction results. This approach is practical and especially useful when extra datasets of high-resolution remotely sensed images are unavailable.

Data Availability Statement: The code of this study is available at https://github.com/xian1234/SRBuildSeg (accessed on 4 June 2021). The data presented in this study are available on request from the corresponding author.