FSRSS-Net: High-Resolution Mapping of Buildings from Middle-Resolution Satellite Images Using a Super-Resolution Semantic Segmentation Network

Abstract: Satellite mapping of buildings and built-up areas used to be delineated from high-spatial-resolution (e.g., meter or sub-meter) and middle-spatial-resolution (e.g., tens or hundreds of meters) satellite images, respectively. It is therefore worthwhile to explore a deep-learning approach that delineates high-resolution semantic maps of buildings from middle-resolution satellite images; this approach is termed super-resolution semantic segmentation in this paper. Specifically, we design a neural network that integrates low-level super-resolution image features and high-level super-resolution semantic features, trained with Sentinel-2A images (i.e., 10 m) and higher-resolution semantic maps (i.e., 2.5 m). The network, based on super-resolution semantic segmentation features, is called FSRSS-Net. In China, 35 cities are partitioned into three groups: 19 cities for model training, four cities for quantitative testing and the other 12 cities for qualitative analysis of the generalization ability of the learned networks. A large-scale sample dataset is created and utilized to train and validate the performance of the FSRSS-Net, comprising 8597 training samples and 766 quantitative accuracy evaluation samples. Quantitative evaluation results show that: (1) based on the 10 m Sentinel-2A image, the FSRSS-Net can achieve super-resolution semantic segmentation and produce 2.5 m building recognition results, and there is little difference between the accuracy of the 2.5 m results of the FSRSS-Net and the 10 m results of the U-Net. More importantly, the 2.5 m building recognition results of the FSRSS-Net are more accurate than the 2.5 m results obtained by interpolation up-sampling of the 10 m U-Net building recognition results; (2) in the spatial visualization of the results, the 2.5 m building recognition results are more precise than the 10 m ones, and the outlines of the buildings are better depicted.
Qualitative analysis shows that: (1) the learned FSRSS-Net can also be well generalized to other cities that are far from the training regions; (2) the FSRSS-Net can still achieve results comparable to the 2 m building recognition results of the U-Net, even when the U-Net is directly trained using 2 m resolution GF2 satellite images and their corresponding semantic labels.


Introduction
Human settlements are among the most important elements on the Earth's surface, and they play an important role in urban management, development planning and disaster emergency response. With the launch of a large number of remote sensing satellites, human settlement mapping from satellite remote sensing images has become one of the

Related Work On Super Resolution
Studies related to super-resolution semantic segmentation are summarized in this section, e.g., image super-resolution using CNNs and feature deconvolution used in image semantic segmentation.

Super-Resolution Semantic Segmentation
A super-resolution (SR) semantic segmentation model needs to achieve both resolution enhancement and pixel-level classification. Because there is little relevant research on super-resolution semantic segmentation, we summarize its strategies based on empirical knowledge. From the perspective of the relationship among images, their features and their semantics, there are three kinds of approaches to super-resolution semantic segmentation, i.e.: (1) SR semantic segmentation of up-sampled images, (2) up-sampling of semantic segmentation results, and (3) SR semantic segmentation of up-sampled features.
As shown in Figure 1a, lower-resolution images are first up-sampled into higher-resolution images, followed by semantic segmentation of the up-sampled images. This kind of approach builds on image super-resolution results and is therefore only suitable when a large number of paired low- and high-resolution images exist for training. For large-scale satellite image mapping, one cannot collect enough pairs of low- and high-resolution images of the same region in the same period of time, so this method is difficult to apply. From another perspective, if there were enough high-resolution images, high-resolution semantic segmentation maps could be generated directly from them.
Figure 1. The three kinds of approaches to super-resolution semantic segmentation. The "Encoder-Decoder" represents the process of semantic segmentation using a convolutional neural network, e.g., U-Net. The V-shaped up-sampling denotes the process of enlarging the size of the input images, features or semantic maps.
As shown in Figure 1c, a segmentation result is first generated from the image by a semantic segmentation network, and higher-resolution semantic maps are then obtained by up-sampling or deconvolution of the lower-resolution semantic segmentation results. Consequently, the quality of the lower-resolution semantic maps dominates the final results, since the rich image features are excluded from the up-sampling processing. This approach is therefore limited, because up-sampling from low-resolution to high-resolution semantic results discards the information in the image.
As shown in Figure 1b, super-resolution semantic segmentation can also be achieved by up-sampling the learned features of images. This approach can make full use of the information in both the low-resolution image features and the high-resolution semantic maps. The method proposed in this paper belongs to this kind of approach to super-resolution semantic segmentation.

Image Super-Resolution Based on Deep Learning
It is well known that image super-resolution (SR) is an important class of image processing techniques for enhancing the resolution of images and videos in computer vision.
The key point of image SR is how to perform up-sampling (i.e., generating high-resolution images from low-resolution images). The existing up-sampling SR techniques based on deep learning can be roughly grouped into four kinds of frameworks [24], i.e., pre-up-sampling SR, post-up-sampling SR, progressive up-sampling SR and iterative up-and-down sampling SR. The SRCNN [25] was the first approach to image super-resolution using a CNN. In the SRCNN, features are first extracted from low-resolution images, then mapped to high-resolution representations, and finally the mapped features are restored to high-resolution images. Based on the SRCNN, the FSRCNN [26] was proposed to improve and accelerate image super-resolution. The ESPCN [27] proposes an efficient method that extracts features directly from low-resolution images and computes high-resolution images. Based on the residual network [28], the VDSR [29] takes the interpolated low-resolution image as the input of the network, and then adds the input image to the residual learned by the network to obtain the final output. In the past two years, with the success of generative adversarial networks in image feature learning and image domain mapping, the SRGAN was proposed [30], which uses a perceptual loss and a discriminator loss to improve the authenticity of the restored images.
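To make the post-up-sampling idea behind ESPCN-style networks concrete, the following is a minimal, hypothetical PyTorch sketch (not the architecture of any cited paper): features are computed entirely at low resolution, and a final sub-pixel (pixel-shuffle) layer rearranges channels into an image that is `scale` times larger.

```python
import torch
import torch.nn as nn

class TinySubPixelSR(nn.Module):
    """Minimal ESPCN-style network: features are extracted at low
    resolution, then a pixel-shuffle layer rearranges channels into
    an r-times larger image (post-up-sampling SR)."""
    def __init__(self, channels=3, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels * scale ** 2, 3, padding=1),
        )
        self.shuffle = nn.PixelShuffle(scale)  # (C*r^2, H, W) -> (C, rH, rW)

    def forward(self, x):
        return self.shuffle(self.body(x))

lr = torch.randn(1, 3, 16, 16)   # a toy low-resolution input
sr = TinySubPixelSR()(lr)        # super-resolved output, twice the size
print(sr.shape)                  # torch.Size([1, 3, 32, 32])
```

Because all convolutions run at the low input resolution, this layout is cheaper than pre-up-sampling designs such as SRCNN, which first interpolate the image and then convolve at full size.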
There are many related research projects on the generation of high-resolution images from low-resolution images [25,27,29-33]. However, research on the generation of high-resolution semantic maps from low-resolution images is very rare. High-resolution semantic maps carry more detailed and abundant information than low-resolution ones. If a model can generate a high-resolution semantic map from a low-resolution image, it will show the contours of the targets more precisely. Obtaining a large number of high-resolution remote sensing images is difficult, but low- and middle-resolution images (Landsat, Sentinel, etc.) can be downloaded freely and widely. For extracting buildings or other ground objects from remote sensing images, producing fine high-resolution results from low- or middle-resolution images would therefore be a breakthrough of great practical significance.
These classical semantic segmentation neural networks are mainly based on the idea of an encoder-decoder. The encoders of these architectures often consist of a set of layered convolution and pooling operations for feature extraction. However, the decoders have different structures, such as bilinear interpolation up-sampling in the FCN, and feature skip-connections and deconvolution in the U-Net. The task of the decoder is to project the lower-resolution discriminative features learned by the encoder onto a higher-resolution image plane to finish the dense classification.
The decoding process is the inverse operation of encoding. The features of the encoder are up-sampled step by step, and a segmentation map with the same size as the original image is finally obtained. As shown in Figure 2, there are three kinds of up-sampling methods in the decoder, i.e., un-pooling, interpolation and deconvolution. Un-pooling is often used in CNNs as the reverse operation of max-pooling [42]. In practice, un-pooling is rarely used because of its inflexibility in size recovery. Interpolation can also realize the size recovery of an image or feature map: on the basis of the original pixels, nearest neighbor, bilinear and cubic convolution interpolation are often used to insert new elements between the pixel values. Because interpolation uses only a few local pixel values and the interpolation rule is fixed, it is rigid in the process of model learning and might therefore not be the best choice in practice.
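The fixed nature of interpolation can be seen directly in a framework such as PyTorch, where nearest-neighbor or bilinear up-sampling has no learnable parameters (a small illustrative sketch, not code from the paper):

```python
import torch
import torch.nn.functional as F

feat = torch.arange(4.0).reshape(1, 1, 2, 2)   # a tiny 2x2 feature map

# Nearest-neighbor and bilinear up-sampling to 4x4: the rule for
# inserting new values is fixed, so nothing here can be learned.
nearest = F.interpolate(feat, scale_factor=2, mode="nearest")
bilinear = F.interpolate(feat, scale_factor=2, mode="bilinear",
                         align_corners=False)

print(nearest[0, 0])    # each original pixel is replicated into a 2x2 block
print(bilinear.shape)   # torch.Size([1, 1, 4, 4])
```

Whatever the training data look like, these operators always produce the same output for a given input, which is the rigidity criticized above.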
Deconvolution is the inverse process of convolution, which constructs the mapping relationship of pixel values by learning, for the size recovery of features [43]. With the successful application of deconvolution in neural network visualization, it has been widely adopted in more and more research [44]. In this paper, deconvolution is used to implement feature up-sampling.
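As a concrete example, a stride-2 transposed convolution doubles the size of a feature map while learning its own up-sampling kernel. The kernel size and channel counts below are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

# A transposed (de)convolution that doubles the spatial size of a
# feature map; unlike interpolation, its kernel weights are learned
# jointly with the rest of the network.
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=64,
                            kernel_size=2, stride=2)

feat = torch.randn(1, 64, 64, 64)   # e.g., 64 feature maps at 64x64
up = deconv(feat)
print(up.shape)                     # torch.Size([1, 64, 128, 128])
```

With kernel size 2 and stride 2, the output height is (H - 1) * 2 + 2 = 2H, so a 64×64 map becomes 128×128, matching the doubling used in the decoders discussed here.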

The Proposed Method for Super-Resolution Semantic Segmentation
The structure of the neural network used in the proposed method, coupled with its key components, is described in this section.

The Structure of FSRSS-Net
As shown in Figure 3, there are two paths of feature deconvolution before classification in the FSRSS-Net. The first deconvolutes low-level features, e.g., spectral, geometric and morphological features of buildings, which are learned from the input image. The second extracts high-level semantic features using a typical encoder-decoder neural network, followed by a deconvolution of these features. Since the low-level features and the high-level semantic features are used together, the network is called FSRSS-Net.

Inspired by the idea of encoding and decoding for semantic segmentation, we design the structure of the FSRSS-Net as shown in Figure 4. To learn the corresponding relationship between middle-resolution (i.e., 10 m Sentinel-2A) images and high-resolution (i.e., 2.5 m building) semantic maps, the resolution of the images is doubled twice, and pixel-level semantic segmentation is achieved on the up-sampled image plane.
As shown in Figure 4, the input of the FSRSS-Net is a Sentinel-2A image with a size of 64×64. On the one hand, the low-level features with a size of 64×64 are directly deconvoluted into feature maps with a size of 128×128. On the other hand, they pass through a variant of the U-Net [41], so that high-level features can be extracted by encoding and decoding.

Then, both low-level and high-level features are concatenated and further deconvoluted to features with a size of 256×256. Finally, a convolution layer and a softmax layer are added to realize pixel-level classification. In general, the FSRSS-Net structure makes full use of the primary low-level features and the high-level semantic features of images, and after two deconvolution operations the spatial resolution is upgraded twice, i.e., from 10 m images to 2.5 m semantic maps.
The specification of the FSRSS-Net is given in Figure 5, where the input is of size 64×64×4, since three visible bands and one near-infrared band of the Sentinel-2A images are used in this paper. First, the input image is transformed into 64 feature maps with a size of 64×64 by a 3×3 convolution with a zero-padding of 1. Then, a batch normalization layer is followed by another convolution. After that, the network goes through two parallel processes for low-level and high-level feature extraction and deconvolution, as shown in Figure 3. At last, a softmax is used to finish the classification for each pixel on the up-sampled image plane. Please note that there are six kinds of operations in the network, as shown in the legend of Figure 5.
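The data flow described above can be sketched in PyTorch as follows. This is a simplified, hypothetical reconstruction for illustration only: `encoder_decoder` is a placeholder standing in for the U-Net variant, and the layer widths are illustrative rather than the exact specification of Figure 5.

```python
import torch
import torch.nn as nn

class FSRSSNetSketch(nn.Module):
    """Simplified two-branch sketch: low-level features are deconvoluted
    directly, high-level features come from an encoder-decoder; both are
    concatenated and deconvoluted again, quadrupling the resolution."""
    def __init__(self, in_bands=4, n_classes=2):
        super().__init__()
        self.stem = nn.Sequential(                  # 64x64 low-level features
            nn.Conv2d(in_bands, 64, 3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        # Placeholder for the U-Net variant (encoder-decoder branch).
        self.encoder_decoder = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.up_low = nn.ConvTranspose2d(64, 32, 2, stride=2)    # 64 -> 128
        self.up_high = nn.ConvTranspose2d(64, 32, 2, stride=2)   # 64 -> 128
        self.up_joint = nn.ConvTranspose2d(64, 32, 2, stride=2)  # 128 -> 256
        self.classify = nn.Conv2d(32, n_classes, 1)              # + softmax

    def forward(self, x):                    # x: (N, 4, 64, 64), 10 m bands
        low = self.stem(x)
        high = self.encoder_decoder(low)
        fused = torch.cat([self.up_low(low), self.up_high(high)], dim=1)
        logits = self.classify(self.up_joint(fused))
        return torch.softmax(logits, dim=1)  # (N, 2, 256, 256), 2.5 m map

out = FSRSSNetSketch()(torch.randn(1, 4, 64, 64))
print(out.shape)                             # torch.Size([1, 2, 256, 256])
```

The two stride-2 transposed convolutions are what lift the 64×64 input grid to a 256×256 output grid, i.e., the 10 m to 2.5 m resolution upgrade.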

A Variant of the U-Net Used in This Paper
Among the many semantic segmentation networks, the U-Net [41] stands out for its unique design and excellent performance. In the process of up-sampling, the feature maps from down-sampling are reused, which makes full use of the primary features of the shallow layers in the deep convolution. Consequently, the input of each convolution becomes richer, and the results better reflect the original information of the images. Specifically, we have made the following two changes with respect to the original U-Net, as shown in Figure 5: (1) in the last layer, two convolution layers are added, which makes the network deeper, and a batch-normalization layer and a dropout layer are added to prevent over-fitting of the model; (2) in the process of down-sampling and feature extraction, the second and fourth convolutions are modified into dilated convolutions with a dilation rate of 2 [45], which enlarges the receptive field.
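The effect of the second change can be illustrated in isolation: a 3×3 convolution with dilation rate 2 covers a 5×5 window with the same number of weights (a generic sketch, not the exact layer configuration of the modified U-Net):

```python
import torch
import torch.nn as nn

# A standard 3x3 convolution sees a 3x3 window; with dilation=2 the same
# nine weights are spread over a 5x5 window, enlarging the receptive
# field without extra parameters. padding=2 keeps the output size fixed.
standard = nn.Conv2d(16, 16, kernel_size=3, padding=1)
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 16, 64, 64)
print(standard(x).shape, dilated(x).shape)  # both torch.Size([1, 16, 64, 64])

# Parameter counts are identical: 16 * 16 * 3 * 3 weights each.
print(standard.weight.numel() == dilated.weight.numel())   # True
```

This is why dilation is attractive for building extraction: larger spatial context is captured at no cost in model size or feature-map resolution.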

To reveal the characteristics of the FSRSS-Net, the variant of the U-Net in Figure 5 is used as a baseline method for comparison in this paper. Specifically, five convolution layers and a softmax layer are added to the variant U-Net in order to generate 10 m semantic segmentation results from 10 m Sentinel-2A images; this modified U-Net is shown in Figure 6. It should be emphasized that this paper does not compare the advantages and disadvantages of the FSRSS-Net and the U-Net, nor does it attempt to prove that the FSRSS-Net is better than the U-Net. Rather, we aim to show that the FSRSS-Net can generate high-resolution semantic maps from middle-resolution images and thereby realize more precise building recognition from remote sensing images.

Experimental Data
For super-resolution semantic segmentation of buildings, due to the lack of directly available datasets, we created a set of experimental data covering 35 cities in China. This section introduces the data sources, the preprocessing of the images and the building ground-truth in detail. The experimental data covering the 35 cities are partitioned into three groups: a model training area, a quantitative accuracy evaluation area and a qualitative generalization performance analysis area. Finally, we describe the samples for model training and the way these samples were selected. Figure 7 shows a Sentinel-2 image and its corresponding semantic map of the city of Beijing. These data originate from the Google Earth Engine (GEE) platform [46,47] and the Chinese map-world known as "Tiandi Map", respectively.

Data Source and Preprocessing
Sentinel-2 is a wide-swath, middle-resolution, multispectral imaging mission with a global five-day revisit frequency. The Multispectral Instrument (MSI) samples 13 spectral bands: visible and near-infrared (NIR) at 10 m, red edge and short-wave infrared (SWIR) at 20 m, and atmospheric bands at 60 m spatial resolution. The images used in our experiments are the Level-2A products of Sentinel-2 (Sentinel-2A) with the red, green, blue and NIR bands. The images were screened according to imaging time and cloud cover, and preprocessed by mosaicking, cloud removal and clipping.


The Chinese map-world known as 'Tiandi Map' is a comprehensive geographic information service website built by the National Platform for Common Geospatial Information Service of China (https://www.tianditu.gov.cn/, accessed on 10 June 2021). Digital line drawing map data can be obtained from the platform. The semantic map of buildings is extracted by color value filtering. Then, the 10 m and 2.5 m building ground-truth maps are derived by nearest neighbor resampling of the digital maps. It can be seen from Figure 7d,e that the 2.5 m ground-truth map exhibits more precise building boundaries than the 10 m map.
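Nearest-neighbor resampling between label grids can be sketched with NumPy. The factor of 4 reflects the 10 m vs. 2.5 m grids; the array values are illustrative toy data, not the actual ground-truth:

```python
import numpy as np

def nearest_upsample(labels: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor up-sampling of a 2-D label map: every pixel is
    replicated into a factor x factor block, so no new label values
    appear (important for categorical maps, unlike bilinear methods)."""
    return np.repeat(np.repeat(labels, factor, axis=0), factor, axis=1)

# A toy 2x2 building mask (1 = building, 0 = background) at 10 m ...
mask_10m = np.array([[1, 0],
                     [0, 1]], dtype=np.uint8)

# ... replicated to the 2.5 m grid (10 m / 2.5 m = factor 4).
mask_2_5m = nearest_upsample(mask_10m, 4)
print(mask_2_5m.shape)   # (8, 8)
```

Nearest-neighbor is the natural choice for semantic maps because interpolating class labels would create meaningless intermediate values along building boundaries.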

Dataset
As shown in Figure 8, Sentinel-2A images covering the 35 cities were collected from the GEE platform after cloud removal, radiation correction and other preprocessing. These images are grouped into three subsets according to the distribution of the cities; Figure 8 shows their geographic locations.


The first group, used to train and validate the models, consists of 19 cities and represents the situation of major cities under different climates and landforms in China. Buildings in China are mainly located in the big cities, and the big cities are mainly located in coastal areas or beside large rivers and lakes (e.g., the Yangtze River, the Yellow River and Lake Taihu). Therefore, the selection of the 19 training areas mainly considers the eastern and southeastern coastal cities, whose climate types represent the most typical tropical and subtropical monsoon climates in China. Among the 19 training areas, Zhengzhou has the largest area of 3957.75 km² and Guangzhou has the smallest area of 268.73 km².
The second group, including four cities, is utilized to quantitatively evaluate the accuracy of building extraction based on the learned models. The four cities are Chongqing, Wuhan, Qingdao and Shanghai, each with approximately the same area of 203.48 km², and the image size for each city is 1416×1437 pixels. The four accuracy evaluation regions represent different landforms of China. Chongqing is a typical mountain city, in which buildings are distributed on slopes because of the large topographic relief. Qingdao is a seaport city in the North China Plain, in which the buildings are arranged in order. Shanghai, located in the plain of the middle and lower reaches of the Yangtze River, is a typical international metropolis with densely distributed buildings and numerous tall buildings. Wuhan is located in the central hills, with a dense water network, and its buildings are distributed near small mountains and rivers.
The third group, consisting of 12 cities, is used for qualitative analysis of the generalization ability of the learned model; some of these cities are far from the cities in the first group. The image coverage area of each city in the qualitative analysis area is close to that of each city in the quantitative accuracy evaluation area. There is no ground-truth corresponding to the images of the 12 qualitative analysis cities. The 12 cities can be divided into four regions, representing different geographical divisions: (1) in the eastern and southeastern regions (Wenzhou, Hong Kong and Yangzhou), the climate is a tropical and subtropical monsoon climate; the urbanization of this region is well developed and the buildings are densely distributed; (2) in the north and northeast regions (Hohhot, Changchun and Harbin), the climate is a temperate monsoon climate; these cities belong to an industrial and agricultural base and contain many large buildings; (3) in the southwest and central regions (Changsha, Chengdu and Nanning), the climate is a subtropical monsoon climate and the urbanization process is fast, with many buildings demolished and newly built; (4) in the west and northwest (Lhasa, Urumqi and Xining), the climate is a temperate continental climate and plateau alpine climate, and the buildings are relatively low and distributed on both sides of valleys and rivers. Figure 9 shows two typical training samples for each of three kinds of situations, i.e., high-density buildings in cities, low-density buildings in cities and sparse buildings in rural areas. Each sample is related to three subfigures, i.e., one 10 m image with a size of 64×64, one 10 m building ground-truth with a size of 64×64 and one 2.5 m building ground-truth with a size of 256×256.

Training Samples
Since the mapping time of 'Tiandi Map' might not match the imaging time of the Sentinel-2A images, the building labels cannot be directly used as ground-truth for training. We carefully selected 8597 samples to train both the U-Net and the FSRSS-Net. These 8597 training samples have been made public and can be downloaded from: https://drive.google.com/open?id=1-ui_KshUbCgaQINbiZ0nvQ3k55HOhO9wj (accessed on 10 June 2021). In order to keep the balance between building and non-building pixels, the number of samples is in proportion to the ratio of buildings. Specifically, among all samples, the proportion of building pixels is 83% and the proportion of background pixels is 17%. Each sample has more than one building, and there are no pure background samples for training. There are 47 samples with more than 50% buildings, 211 samples with 40~50% buildings, 3577 samples with 20~40% buildings and 4762 samples with 0~20% buildings. In addition, two images containing approximately 10% cloud, in the Kunming and Nanjing regions, are added to the training dataset. As shown in Figure 10, the outline of the thick cloud is drawn from the image by visual interpretation and manual work, and the cloud-covered area is regarded as background in the ground-truth. Therefore, the pixels covered by clouds in the image are all considered background pixels in the ground-truth, regardless of whether their actual category is background or building.
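The building-pixel proportion used to bin the samples above can be computed per tile as follows (a minimal sketch; the function names are ours, and the binning thresholds mirror the statistics listed in the text):

```python
import numpy as np

def building_fraction(label_tile: np.ndarray) -> float:
    """Fraction of building pixels (value 1) in a ground-truth tile."""
    return float((label_tile == 1).mean())

def proportion_bin(frac: float) -> str:
    """Bin a tile by building proportion, as in the sample statistics."""
    if frac > 0.5:
        return ">50%"
    if frac > 0.4:
        return "40~50%"
    if frac > 0.2:
        return "20~40%"
    return "0~20%"

# A toy 256x256 tile whose top quarter is buildings (25% of pixels).
tile = np.zeros((256, 256), dtype=np.uint8)
tile[:64, :] = 1
print(proportion_bin(building_fraction(tile)))   # 20~40%
```

Sweeping such a statistic over candidate tiles is one straightforward way to assemble a sample set with a controlled building/background balance.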


Experiment and Discussion
In this section, the training process of the models is first described. Then, the trained FSRSS-Net and U-Net models are used to identify buildings from Sentinel-2A images, and five accuracy evaluation indices are applied to quantitatively evaluate and compare the 10 m and 2.5 m building identification results. Finally, the transfer and generalization performance of the model is qualitatively analyzed.

Experimental Setting
Although the balance between buildings and non-buildings has been considered during sample selection, the number of pixels in a batch of samples might still be imbalanced.
In order to avoid this problem, we design a selectively weighted loss function, given by Equation (1). For each batch of samples, if the background pixel ratio is higher than 50%, the loss uses weighted binary cross entropy; otherwise, it uses plain binary cross entropy. Note that this formula only applies when every sample contains both building pixels and background pixels.
where y is the label of a pixel, i.e., 1 for a building pixel and 0 otherwise, p is the predicted probability that the pixel is a building, and Bp is the ratio of background pixels in each batch of samples.
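Since Equation (1) itself is not reproduced in this excerpt, the selection logic can be sketched as below. The batch-level switch on Bp follows the text exactly; the specific weighting in the imbalanced branch (building term scaled by Bp, background term by 1 − Bp) is an assumption, not the paper's verified formula.

```python
import numpy as np

def selective_bce(y, p, eps=1e-7):
    """Selectively weighted binary cross entropy (sketch of Equation (1)).

    y : ground-truth labels for one batch (1 = building, 0 = background).
    p : predicted building probabilities.
    Bp: ratio of background pixels in the batch.
    If Bp > 0.5 the batch is imbalanced and a weighted BCE is used
    (weights assumed here, not taken from the paper); otherwise plain BCE.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    Bp = 1.0 - y.mean()                      # background pixel ratio of the batch
    if Bp > 0.5:                             # imbalanced batch: weighted BCE
        loss = -(Bp * y * np.log(p) + (1.0 - Bp) * (1.0 - y) * np.log(1.0 - p))
    else:                                    # balanced batch: plain BCE
        loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return loss.mean()
```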
In our experiments, 8 NVIDIA Tesla K80 GPUs are utilized to train both the U-Net and the FSRSS-Net. The batch size is set to 32 and the number of epochs to 400. We use the cross entropy defined in Equation (1) as the loss function, and stochastic gradient descent (SGD) as the optimizer to learn the model parameters. The initial learning rate of SGD is set to 0.01 and decreases exponentially with the training epoch index until it reaches a floor of 0.000001. The learning rate is given by Equation (2), where n is the training epoch index.
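The schedule above can be sketched as exponential decay with a floor. This is a hedged reconstruction: the decay base 0.98 is an assumed placeholder, since Equation (2) with the paper's exact decay factor is not reproduced in this excerpt.

```python
INITIAL_LR = 0.01   # initial SGD learning rate (from the paper)
LR_FLOOR = 1e-6     # the learning rate never drops below this value
DECAY = 0.98        # assumed decay base; Equation (2) gives the exact form

def learning_rate(n: int) -> float:
    """Learning rate at training epoch index n: exponential decay from 0.01,
    clipped at a floor of 1e-6 (sketch of the paper's Equation (2))."""
    return max(INITIAL_LR * DECAY ** n, LR_FLOOR)
```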

Training Accuracy and Loss
The 8597 training samples are used to train both the FSRSS-Net and the U-Net. The accuracy and loss of the training processes are shown in Figure 11.


Quantitative Accuracy Evaluation and Comparison
The data of the quantitative accuracy evaluation areas, covering four cities, is used to evaluate and compare the accuracy of the 10 m and 2.5 m building recognition results. The trained FSRSS-Net and U-Net are used to extract buildings from the images covering Chongqing, Qingdao, Shanghai and Wuhan. First, we map the spatial distribution of the building extraction results of the four cities. Figures 12-15 show the building identification results in Chongqing, Qingdao, Shanghai and Wuhan, respectively. In each figure, the left part is the whole range of the quantitative accuracy evaluation city, and the right part is an enlarged view of the small area in the red box on the left, used to show the details of building identification. The ground-truth is shown in dark green, and the identified buildings are shown in bright red. On a large scale, both the 10 m and the 2.5 m results are fairly good. There is no case in which urban green land or wetland is incorrectly classified as buildings, and the roads between buildings are clearly distinguished. In the local small areas, the 2.5 m building recognition results are more precise than the 10 m results, and the outlines of the buildings are better depicted. From Figures 12-15, both the FSRSS-Net and the U-Net were trained successfully, and both trained models can recognize buildings. Most importantly, the FSRSS-Net can indeed produce high-resolution semantic segmentation results from middle-resolution images. In this paper, the U-Net takes 10 m resolution images as input and outputs 10 m resolution building recognition results, while the FSRSS-Net takes 10 m resolution images as input and outputs 2.5 m resolution building recognition results. Since the output spatial resolutions of the two models differ, it is inappropriate to compare the performance of the FSRSS-Net and the U-Net directly.
We pay more attention to the differences in the spatial distribution of the building recognition results, and then highlight the advantages of the 2.5 m building recognition results.
In the details of the local small areas, the 10 m results fail to identify some small buildings, and some densely distributed individual buildings are identified as contiguous built-up areas, as shown in the yellow circles of subfigures (b2) in Figures 12-15. On the contrary, as shown in subfigures (b3) in Figures 12-15, the 2.5 m building recognition results can distinguish small houses, and the building contours are more accurate. Therefore, from the spatial distribution maps of the building extraction results, it can be clearly concluded that the 2.5 m building recognition results are better than the 10 m results.
In order to quantitatively evaluate the accuracy of the 2.5 m and 10 m building recognition results, we use the overall accuracy (OA), recall (Rec), precision (Pre), F1-score (F1) and intersection over union (IOU). In particular, to make the accuracy evaluation scientific and reasonable, the 2.5 m ground-truth is used to evaluate the 2.5 m results of the FSRSS-Net, while the 10 m ground-truth is used to evaluate the 10 m results of the U-Net. Table 1 shows the relationship between the prediction results and the ground-truth. As defined in Equation (3), OA is the ratio of the number of correctly predicted pixels to the total number of predictions. As shown in Equation (4), Rec is the ratio of the number of correctly predicted building pixels to the number of actual building pixels. In Equation (5), Pre is the ratio of the number of correctly predicted building pixels to the number of all predicted building pixels. In Equation (6), F1 is the harmonic mean of precision and recall, and IOU is the ratio of the building intersection to the building union, as shown in Equation (7).
where the meanings of TP, TN, FN and FP are shown in Table 1. Table 2 displays, over the four cities, the accuracy of the 10 m building recognition results, the accuracy of the 10 m results interpolation-up-sampled to 2.5 m, and the accuracy of the 2.5 m building recognition results.
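Since Equations (3)-(7) are not reproduced in this excerpt, the five metrics can be written out from the descriptions above, which match the standard pixel-wise definitions:

```python
def segmentation_metrics(TP, TN, FP, FN):
    """Pixel-wise evaluation metrics corresponding to Equations (3)-(7)."""
    OA  = (TP + TN) / (TP + TN + FP + FN)   # (3) overall accuracy
    Rec = TP / (TP + FN)                     # (4) recall
    Pre = TP / (TP + FP)                     # (5) precision
    F1  = 2 * Pre * Rec / (Pre + Rec)        # (6) harmonic mean of Pre and Rec
    IOU = TP / (TP + FP + FN)                # (7) intersection over union
    return OA, Rec, Pre, F1, IOU
```

For example, with TP = 50, TN = 30, FP = 10 and FN = 10 pixels, OA is 0.8 and IOU is 50/70.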
On the whole, the OAs in these four cities are higher than 76%, but they are clearly lower than the training accuracy over the 19 cities in the model training areas, because the images of the four cities were never seen by the model during training. It can be seen from Table 2 and Figure 16 that there is little difference between the accuracy of the 10 m results and that of the 2.5 m results. However, the accuracy of the 10 m results interpolation-up-sampled to 2.5 m is generally lower than both. Finally, we can conclude that, based on the 10 m Sentinel-2A image, the FSRSS-Net can achieve super-resolution semantic segmentation and produce 2.5 m building recognition results, and that these 2.5 m building recognition results are better than the 10 m building recognition results interpolation-up-sampled to 2.5 m.
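The interpolation baseline above can be sketched as follows. The paper does not state which interpolation method is used, so nearest-neighbor up-sampling is assumed here; the point the sketch illustrates is that up-sampling a 10 m mask cannot add sub-pixel building detail, unlike the FSRSS-Net's learned 2.5 m output.

```python
import numpy as np

def upsample_mask(mask_10m: np.ndarray, factor: int = 4) -> np.ndarray:
    """Nearest-neighbor up-sampling of a 10 m building mask to 2.5 m
    (factor 4 along each axis). Each 10 m pixel simply becomes a 4x4 block
    of identical 2.5 m pixels, so no finer building outline is recovered."""
    return np.kron(mask_10m, np.ones((factor, factor), dtype=mask_10m.dtype))

m = np.array([[1, 0],
              [0, 1]])     # toy 2x2 mask at 10 m
up = upsample_mask(m)      # 8x8 mask at 2.5 m, blocky copies of the input
```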

Qualitative Analysis of the Generalization Ability
The learned models are used to identify buildings in the third group of images, covering the 12 cities. As shown in Figures 17-20, both the 10 m and 2.5 m building recognition results look very good in most cities.

These results show that the learned models can be well generalized to some cities in China. However, for some cities, e.g., Lhasa and Hohhot, the results are not as good as in other cities, since these two cities are located on the plateau and the grassland Gobi, so their land surface is very different from that of the cities in the training dataset.
It can be seen from the yellow circled areas in the subfigures in the second rows of Figures 17-20 that the 2.5 m results are more precise and accurate, especially in the cities of Wenzhou, Yangzhou, Changsha, Chengdu and Nanning.
In addition, as shown in the suburbs and villages in subfigures (a1) and (b1) of Figure 21, even where buildings are sparsely distributed, the learned models can still accurately identify them. When the image is covered by clouds, as shown in subfigure (c1) of Figure 21, the clouds are not mistaken for buildings. This shows that the learned model can, to some extent, deal with variation caused by the imaging environment.


Discussion
In order to further investigate the characteristics of the 2.5 m building recognition results of the FSRSS-Net, we discuss two comparison settings in this subsection. In the first, the U-Net is trained with 2 m high-resolution images from the GF2 satellite and 2 m building ground-truth. In the second, two deconvolution layers are directly added at the end of the U-Net, i.e., super-resolution semantic segmentation by deconvolution of high-level semantic features only.

Comparison between the 2.5 m Results of FSRSS-Net and the 2 m Recognition Results from GF2 Image
The U-Net in Figure 6 is also trained with high-resolution images, i.e., 2 m-resolution GF2 images, to generate 2 m building extraction results. Specifically, the 2 m-resolution GF2 images and the 2 m ground-truth of the same regions are used to train the U-Net model. After 400 epochs of training, the training accuracy reaches 94% when the model converges, and 89% on the test samples. As shown in subfigures (a1) and (b1) of Figure 22, the two small areas in Kunming represent the city and the suburbs, respectively. Subfigures (a1), (b1), (a3) and (b3) in Figure 22 show the Sentinel-2A images and the building identification results achieved with the FSRSS-Net, while subfigures (a4), (b4), (a5) and (b5) show the GF2 images and the 2 m resolution building results predicted by the U-Net. It can be seen that the 2.5 m results are obviously better than the 10 m results, but slightly worse than the 2 m results. In the 2 m results, the outlines of the buildings are more accurate and the edges of the houses are straight, while in the 2.5 m results, the edges of the houses are somewhat smoothed. This is because the information contained in the 10 m image and the 2.5 m ground-truth is necessarily less than that in the 2 m image and the 2 m ground-truth.


Comparison between FSRSS-Net and U-Net+2SR
As stated in Section 3.1, the unique structure of the FSRSS-Net is to integrate both the up-sampling of primary image features and the up-sampling of high-level semantic features. In order to further validate its rationality, we design the network structure shown in Figure 23. Two deconvolution layers and three convolution layers are added at the end of the U-Net variant, and the final classification layer still uses softmax. Based on the U-Net variant, the network is thus deconvoluted and up-sampled twice; it is called "U-Net+2SR" in the following, since there are two deconvolution layers after the U-Net. There is no up-sampling of primary image features in the U-Net+2SR network, which is the only difference from the FSRSS-Net. The same 8597 samples are used to train the U-Net+2SR network, and the training hyper-parameters and loss function are also the same as for the FSRSS-Net, with a total of 400 training epochs. After about the 220th epoch, the training accuracy of the model stabilizes at 0.893, which is 0.014 lower than that of the FSRSS-Net model.
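The up-sampling arithmetic of the two added deconvolution layers can be sketched with the standard transposed-convolution output-size formula. The kernel = 4, stride = 2, padding = 1 configuration is an assumed, commonly used setting, not taken from the paper; the point is that two stride-2 deconvolutions take the 64×64 (10 m) output of the U-Net variant to 256×256 (2.5 m).

```python
def deconv_out(size: int, kernel: int = 4, stride: int = 2, pad: int = 1) -> int:
    """Spatial output size of a transposed convolution:
    (size - 1) * stride - 2 * pad + kernel."""
    return (size - 1) * stride - 2 * pad + kernel

# U-Net+2SR: two stride-2 deconvolution layers up-sample the 64x64
# semantic feature map by a factor of 4, i.e., 64 -> 128 -> 256.
size = 64
for _ in range(2):
    size = deconv_out(size)
# size == 256
```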
Figure 23. The structure of the U-Net+2SR (the variant of U-Net, encoder-decoder).
Taking Shanghai as an example, the trained U-Net+2SR model is used to identify 2.5 m buildings, and its results are compared with those of the FSRSS-Net. To make the comparison clearer, three small areas with different building shapes and densities are selected and shown in Figure 24. In all three small areas, the results of the FSRSS-Net are significantly better than those of the U-Net+2SR. As shown in the yellow circles, the U-Net+2SR results contain obvious missed detections and misclassifications: the model mistakenly identifies background as buildings in several places, and cannot accurately distinguish neighboring buildings.

Figure 24. Comparison of the extraction results of FSRSS-Net and U-Net+2SR, taking Shanghai as an example (three small areas (a-c) are selected for detailed comparison).

Conclusions
Motivated by the idea of image super-resolution technologies and CNN semantic segmentation, this paper proposes a super-resolution semantic segmentation network, FSRSS-Net, which makes full use of the rich details in the high-resolution ground-truth and aims to learn the correspondence between middle-resolution images and high-resolution semantic segmentation maps. The main experimental results of this paper are as follows: (1) There is little difference between the accuracy of the 10 m results of the U-Net and the 2.5 m results of the FSRSS-Net. However, the accuracy of the 10 m results interpolation-up-sampled to 2.5 m is generally lower than both. (2) In the 12 cities used for generalization ability evaluation, the building recognition results are quite good, indicating that the transfer and generalization performance of the models is strong. (3) Comparing the 2.5 m results of the FSRSS-Net with the results from 2 m-resolution GF2 images, the 2.5 m results are obviously better than the 10 m results, but slightly worse than the 2 m results. Finally, two main conclusions can be drawn from the above experimental results: (1) Based on the 10 m Sentinel-2A image, the FSRSS-Net can achieve super-resolution semantic segmentation and produce 2.5 m building recognition results, and these 2.5 m results are better than the 10 m building recognition results interpolation-up-sampled to 2.5 m. (2) The learned FSRSS-Net can also be well generalized to other cities that are far from the training regions.
The research in this paper is an important step in super-resolution semantic segmentation for satellite mapping of buildings, and can serve as a reference for high-resolution mapping of buildings from middle-resolution satellite images. In this paper, a large number of samples are produced automatically from existing images and online data, which is of great significance for improving the efficiency and effectiveness of building extraction and can save a great deal of manpower and time.
Although this paper contains some innovative research content and achievements, there are still shortcomings and limitations. The following two points need to be explored and studied: (1) In the super-resolution semantic segmentation model FSRSS-Net, what role does the higher-resolution building ground-truth play in training the model? How can the mapping relationship between the 10 m image and the 2.5 m ground-truth learned by the model be understood and visualized? (2) If we design a network with more deconvolution layers, so that the resolutions of the image and the semantic segmentation result differ by a larger factor, such as a 10 m image and 1 m building ground-truth, can buildings still be extracted well?
According to the shortcomings and limitations of this paper, further research can be carried out in the following aspects: (1) To apply the FSRSS-Net model to global building extraction, samples from typical regions of other countries need to be selected, in order to train models with strong generalization performance and global applicability.
(2) Networks with multiple deconvolution layers should be designed, and how to train and optimize such models needs to be studied, so as to determine whether they can still extract buildings well when the resolution gap between the image and the semantic segmentation result is even larger.