1. Introduction
Automatic building recognition through remote sensing observations has numerous applications in the geographical and social sciences. These include collecting and updating data in Geographic Information System databases, detecting building damage related to disasters, monitoring urban settlements, mapping land-use/land-cover patterns, and managing environmental resources [1,2,3,4]. The auxiliary information obtained during building detection, such as the spatial distribution of buildings, their size, and quantity, plays a significant role in urban planning and demographic analysis [3,5].
Early research on automatic building detection was typically based on aerial imagery due to its spatial resolution of up to 0.05 m [6]. However, acquiring aerial imagery for large areas, e.g., a whole city, is time-consuming [4] and faces other limitations [7]. Moreover, it cannot be applied to damage assessment, where imagery from before and after an event is required. In recent years, the increased availability of remote sensing imagery with a wide coverage area has solved this problem [8]. Remote sensing imagery is represented mainly by multi-spectral and synthetic aperture radar (SAR) data. Compared with multi-spectral data, the processing of SAR data is more complicated due to noise and blurry boundaries, particularly in urban areas with severe geometric distortions such as layover and shadowing [9]. As a result, multi-spectral satellite data are more commonly used for the development of building recognition algorithms.
For the discrimination of separate building blocks from remote sensing imagery, the spatial resolution plays a more significant role than the number of spectral bands or a narrower wavelength interval [4,10]. High-resolution (HR) imagery is typically more expensive than lower-resolution options, as it requires more advanced instruments and systems on earth observation satellites [11]. Middle-resolution images are often freely available but may not contain important details [12]. Therefore, for a number of remote sensing tasks, it is highly important to ensure that data with both high spatial and high temporal resolution are available. To produce HR remote sensing images from middle- or low-resolution (LR) data, super-resolution (SR) methods can be used [13].
Although building segmentation and super-resolution for remote sensing data are often investigated as separate challenges [14,15], it is crucial to consider them as parts of a common pipeline to achieve higher recognition results. Moreover, recent advances in computer vision have introduced powerful tools for object recognition, such as transformers and diffusion models, which hold great promise in the general computer vision domain. Their implementation in the remote sensing domain, specifically in super-resolution and building segmentation tasks, requires special attention. It is important to integrate these advanced methods into a unified pipeline to improve the accuracy of remote sensing data analysis. Another significant issue that arises in such tasks is the availability of relevant datasets for particular regions that support representability and enable researchers to test and compare their algorithms.
In this study, our aim is to address several challenges in the building segmentation task. The first challenge is the limited availability of annotated datasets for specific geographic regions. Although there are a number of open-access datasets for building recognition via satellite data, transfer from one study area to another location can be inapplicable due to infrastructure specificity. Therefore, we focus on collecting a unique dataset to support building assessment in several regions of Russia. Another important challenge in infrastructure analytics using remote sensing is the availability, cost, and temporal resolution of satellite data. To address this, we focus specifically on Sentinel-2 data, which is a valid choice for rapid remote sensing observations due to its high temporal resolution (approximately 5 days). However, the spatial resolution of its RGB bands is only 10 m. To provide more precise building segmentation at 2.5 m per pixel, we set an objective to develop a pipeline that comprises upscaling of Sentinel-2 imagery by a factor of 4 and segmentation of the resulting HR imagery. To analyze image super-resolution and building segmentation jointly, we consider state-of-the-art models such as transformers and diffusion models and verify their performance at different spatial resolutions. One possible practical application of the proposed pipeline is a more detailed inventory of infrastructure objects. In our study, we demonstrate the possibility of segmenting medium-sized standalone buildings. With satellite data of 10 m spatial resolution, two or more separate buildings are sometimes recognized as one instance, which makes accurate quantitative assessments difficult. Such assessments would be useful, for example, to match addresses with actual buildings or to accurately assess damage after disasters such as flooding. Our intention is to use only freely available medium-resolution data, such as that obtained from Sentinel-2, due to its easy accessibility and ability to cover large territories. It is possible to adapt our pipeline to other types of infrastructure if a specific labeled dataset is available. In summary, our goals are:
To create and share a unique dataset covering several regions in Russia;
To conduct a comprehensive overview of benchmarks for the building segmentation task;
To propose an efficient pipeline for building segmentation involving image SR to leverage Sentinel-2 data adjusted to 2.5 m;
To make a comparative study among different SR and segmentation algorithms for solving similar problems in the remote sensing domain.
3. Materials and Methods
3.1. Problem Statement
The building recognition task can be described as the classification of each image pixel into two categories: building or background. In terms of machine learning algorithms, this task is called semantic segmentation [5,45]. Semantic segmentation tasks can be solved using two approaches: (1) traditional classification methods or per-pixel classifiers (e.g., support vector machine and random forest classifiers) and (2) deep learning (DL)-based methods or object-based classifiers [10,45]. Classical approaches have been popular for several decades. However, most of the studies based on traditional methods focused on relatively small regions of interest. This decision was dictated by the necessity of extracting features such as spectra, shape, texture, color, edges, and shadows. Along with the professional knowledge required for feature extraction, these features are heavily dependent on sensors, imaging conditions, image acquisition parameters, and location, which leads to instability, reduced accuracy, and limited usability. Recently, DL-based methods have been applied to overcome these drawbacks. DL algorithms allow researchers to combine low-level features with high-level features that represent more semantic information, which makes them robust [3,5,10].
The research consists of two parts, namely, developing an SR model and a semantic segmentation model (Figure 1). Such decomposition allows us to solve specific, resolution-sensitive problems that could not be solved directly using LR data, such as images from the Sentinel-2 satellite. In the first stage, we upscale the images by a factor of 4. We use initially high-resolution images (Mapbox basemap) to create pairs of low-resolution and high-resolution images, with resolutions of 10 m and 2.5 m, respectively. LR images are obtained from HR images using simple downscaling. The dataset with HR and LR images is used to train an SR model to upscale images and increase their resolution from 10 m to 2.5 m. We then apply the trained model to Sentinel-2 images, which have a spatial resolution of 10 m, to create a new dataset consisting of Sentinel-2 images with 2.5 m resolution, accompanied by the prepared OpenStreetMap (OSM)-based markup. In the next step, we use this newly collected dataset to train a semantic segmentation model. During the inference step, a 10 m RGB Sentinel-2 image is passed through both the SR and segmentation neural network models.
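As an illustration of this two-stage design, the following sketch (in Python/PyTorch, with hypothetical model interfaces) shows how a 10 m Sentinel-2 RGB tile would pass through the SR model and then the segmentation model at inference time; the function and variable names are assumptions, not the actual implementation.

```python
import torch

def segment_buildings(sentinel_rgb_10m: torch.Tensor,
                      sr_model: torch.nn.Module,
                      seg_model: torch.nn.Module) -> torch.Tensor:
    """Sketch of the two-stage inference pipeline in Figure 1: a 10 m RGB
    Sentinel-2 tile is upscaled x4 to 2.5 m by the SR model, then passed to
    the segmentation model, which returns a binary building mask."""
    sr_model.eval()
    seg_model.eval()
    with torch.no_grad():
        hr_2_5m = sr_model(sentinel_rgb_10m)   # (B, 3, 4H, 4W), 2.5 m per pixel
        logits = seg_model(hr_2_5m)            # (B, 2, 4H, 4W) class scores
        mask = logits.argmax(dim=1)            # 0 = background, 1 = building
    return mask
```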
3.2. Dataset for Building Segmentation and Super-Resolution Tasks
In this study, we create a dataset covering four regions in Russia: Moscow and the surrounding suburban area, Krasnoyarsk, Irkutsk, and Novosibirsk. These territories represent diverse geographical conditions and urban characteristics. The entire study area equals 1091.2 km² and is composed of 30 individual sites, consisting mostly of multi-storey buildings, adjacent territories, and squares. For our research problem statement of SR image segmentation, we collected satellite images of high and middle spatial resolution for the same area. The first set is based on the Mapbox product with a spatial resolution of 1 m per pixel, collected during the summer period of 2022. The basemap product consists of RGB images with values in the standard range from 0 to 255. We also use remote sensing images derived from the Sentinel-2 satellite, considering the RGB bands with a spatial resolution of 10 m per pixel. There are a number of observations available for the Sentinel-2 satellite. Therefore, in contrast to Mapbox images, we collect several Sentinel-2 images on different dates during the summer period of 2022 to extend the training dataset. We use a 10% cloud threshold to discard cloudy images. The API of the SentinelHub service [46] is used for filtering, downloading, and preprocessing the satellite data. In total, we acquired 124 images.
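As a rough illustration of this acquisition step, a hedged sketch using the sentinelhub Python package is given below; the bounding box, credentials, and evalscript are placeholders rather than the exact settings used in the study.

```python
from sentinelhub import (SHConfig, SentinelHubRequest, DataCollection,
                         MimeType, BBox, CRS, bbox_to_dimensions)

config = SHConfig()  # expects sh_client_id / sh_client_secret to be configured

# Evalscript returning the RGB bands (B04, B03, B02) of Sentinel-2 L2A
evalscript_rgb = """
//VERSION=3
function setup() {
  return {input: ["B04", "B03", "B02"], output: {bands: 3}};
}
function evaluatePixel(s) {
  return [s.B04, s.B03, s.B02];
}
"""

site_bbox = BBox(bbox=[37.35, 55.55, 37.85, 55.95], crs=CRS.WGS84)  # illustrative site
request = SentinelHubRequest(
    evalscript=evalscript_rgb,
    input_data=[SentinelHubRequest.input_data(
        data_collection=DataCollection.SENTINEL2_L2A,
        time_interval=("2022-06-01", "2022-08-31"),
        maxcc=0.1,  # discard observations with more than 10% cloud cover
    )],
    responses=[SentinelHubRequest.output_response("default", MimeType.TIFF)],
    bbox=site_bbox,
    size=bbox_to_dimensions(site_bbox, resolution=10),  # 10 m per pixel
    config=config,
)
rgb_image = request.get_data()[0]
```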
SR models are trained in a supervised manner and require both LR and HR images. However, such pairs of images cannot be obtained from the same satellite and the same sensor. Therefore, a commonly used approach is to collect images from two satellites with higher and lower spatial resolution and then to match them [32,47]. Although transferring from LR data to HR data shows remarkable performance, its main limitation is mismatches that can occur between observations on different dates and under different natural conditions such as atmospheric and lighting effects. Therefore, in this study, we investigate transferring between different resolutions rather than different sensors to train a model. For the SR task, we combine open-access datasets representing urban areas and the set of images collected for the study area. Images from the xView dataset and the Massachusetts Road and Building Detection datasets [33] are brought to a spatial resolution of 2.5 m by means of interpolation. Examples of images from these datasets are presented in Figure 2. We also downscaled images from Mapbox to 2.5 m to perform further resolution adjustment experiments. The open-access data for various regions aims at supporting pattern diversity in remote sensing observations, while the selected Mapbox images are more specific to the target urban areas. We leverage these data to create pairs of HR and LR images of 2.5 m and 10 m, respectively. The entire study area amounts to 2940 km² for the Massachusetts Road and Building Detection datasets, 1400 km² for the xView dataset, and 1091.2 km² for our dataset. The Sentinel-2 images collected for the same sites as the Mapbox observations are also involved in the SR task (see Figure 3). However, Sentinel-2 samples are considered only as LR images without corresponding HR pairs.
We supplement the satellite imagery with building segmentation markup. It is acquired based on the OpenStreetMap (OSM) database [48], which provides vector geographical data updated and maintained via open collaboration using data from surveys, aerial imagery, and freely licensed geographic data sources. We extract elements with non-empty building tags from the OSM database by querying the Overpass API server using the overpass Python library [49]. The obtained XML data is converted into GeoJSON format and manually corrected for inaccurate boundaries or missing objects. This study focuses on segmenting multi-storey buildings that can be visually identified on medium-resolution satellite data, such as Sentinel-2 imagery, with an original RGB band resolution of 10 m per pixel. Therefore, territories containing private garages (Figure 4) and low-rise buildings (Figure 5) with areas less than 5 m² are beyond the scope of the present study, and they are removed from the resulting markup. The remaining polygons are rasterized into GeoTIFF binary masks using the Python GDAL package [50] based on the size, affine transformation coefficients, projection coordinate system, and spatial reference system derived from the original satellite images. Since the vector data from OSM and the satellite images are brought to a single coordinate system, we can assume that the obtained rasterized masks coincide with real buildings with high accuracy, which means they can be used as ground truth for the building segmentation experiments. In the markup, 0 represents the background and 1 represents the target class. An example of the obtained markup is presented in Figure 6. We present the properties of the collected data in Figure 7; it includes a frequency plot of building localization on the scaled sites and shows the sizes of the buildings. For the experiments, we create two sets with markup for the same areas but with two different spatial resolutions of 2.5 m and 1 m per pixel. We aim at analyzing the importance of spatial features and the potential of data sources with lower spatial resolution than 1 m per pixel. The markup of 1 m resolution is used with Mapbox data, while the 2.5 m resolution markup is used both with Mapbox images of 2.5 m and Sentinel-2 images upscaled to 2.5 m. Details on the upscaling approach are presented in Section 3.3.
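A minimal sketch of the rasterization step is given below; it burns building polygons from a GeoJSON file into a binary GeoTIFF mask aligned with a reference satellite image using the GDAL Python bindings. File paths and layer handling are illustrative assumptions.

```python
from osgeo import gdal, ogr

def rasterize_buildings(vector_path: str, reference_raster: str, out_path: str):
    """Burn OSM building polygons (GeoJSON) into a binary GeoTIFF mask that
    shares the size, affine transform, and projection of the satellite image."""
    ref = gdal.Open(reference_raster)
    drv = gdal.GetDriverByName("GTiff")
    target = drv.Create(out_path, ref.RasterXSize, ref.RasterYSize, 1, gdal.GDT_Byte)
    # Copy the affine transform and spatial reference from the satellite image
    target.SetGeoTransform(ref.GetGeoTransform())
    target.SetProjection(ref.GetProjection())
    target.GetRasterBand(1).Fill(0)          # 0 = background

    src = ogr.Open(vector_path)
    layer = src.GetLayer()
    gdal.RasterizeLayer(target, [1], layer, burn_values=[1])  # 1 = building
    target.FlushCache()
```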
The dataset is divided into three subsets: train, validation, and test. We select 15 territories as the train set, with a total area of 679.5 km², including 119.4 km² of the target class (17% of the train area). Another five territories represent the validation set, with an area of 116.6 km² and 23.3 km² of the target label (20%). We use eight territories as test data, comprising an area of 295.1 km² with 74.77 km² of the target class (27%). The final split has a ratio of 57/16/27 for train, validation, and test territories, respectively. To make sure that the validation and test images cover diverse urban topographies, at least one image from each region was chosen. Information and statistics about the collected data are given in Table 1.
We share the collected dataset comprising original Sentinel-2 RGB images of 10 m, SR Sentinel-2 images of 2.5 m, and markup of 2.5 m. It can be used independently as a benchmark for SR algorithms and for building segmentation models, or for evaluating pipelines combining both stages for robust building recognition through upscaled satellite data.
3.3. Experiments for SR
We consider two GAN-based approaches for the SR task that have already shown remarkable results for image resolution adjustment in both the general and remote sensing domains. We also conduct experiments with the attention-based RCAN model and the diffusion-based SR3 model.
For the SRGAN architecture, we use a two-stage strategy. First, we train only the generator for 10 epochs using a reduced loss function consisting solely of the mean squared error. After that, we continue the training process in the standard GAN mode, in which the discriminator and generator are trained simultaneously. The discriminator weights are updated based on the binary cross-entropy loss, while the generator weights are updated based on a combination of binary cross-entropy, mean squared error, total variation, and perceptual loss functions. This stage takes 100 epochs.
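For illustration, a possible PyTorch sketch of such a combined generator loss is given below; the weighting coefficients and the VGG-based perceptual term are assumptions rather than the exact configuration used for SRGAN in this study.

```python
import torch
from torch import nn
from torchvision.models import vgg19

class GeneratorLoss(nn.Module):
    """Sketch of a combined SRGAN-style generator loss: adversarial BCE +
    MSE content term + total variation + VGG perceptual term.
    The weights are illustrative, not the paper's values."""
    def __init__(self, w_adv=1e-3, w_mse=1.0, w_tv=2e-8, w_perc=6e-3):
        super().__init__()
        vgg = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg
        self.bce = nn.BCEWithLogitsLoss()
        self.mse = nn.MSELoss()
        self.w = (w_adv, w_mse, w_tv, w_perc)

    def total_variation(self, x):
        # Mean absolute difference between neighbouring pixels
        return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
               (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

    def forward(self, sr, hr, disc_logits_on_sr):
        w_adv, w_mse, w_tv, w_perc = self.w
        adv = self.bce(disc_logits_on_sr, torch.ones_like(disc_logits_on_sr))
        content = self.mse(sr, hr)
        perceptual = self.mse(self.vgg(sr), self.vgg(hr))
        tv = self.total_variation(sr)
        return w_adv * adv + w_mse * content + w_tv * tv + w_perc * perceptual
```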
For the MCGR architecture, we do not pre-train the generator; rather, we train the whole model from scratch. We make several runs for each batch using the following approach. Denoting the LR domain as A and the HR domain as B, for calculating the different components of the composite loss function we consider the translated images $G_{AB}(A)$ and $G_{BA}(B)$, the cycle reconstructions $G_{BA}(G_{AB}(A))$ and $G_{AB}(G_{BA}(B))$, and the corresponding discriminator responses, where $G_{AB}$ represents the generator from domain A to domain B and $D_{A}$ represents the discriminator for domain A. Using these values, we calculate the following components of the loss function: binary cross-entropy, mean squared error, total variation, perceptual, and cycle losses, which are summed with certain coefficients and used for updating the parameters of both generators at the same time. We update the parameters of the discriminators in the same way as in SRGAN, but independently. The whole training process lasts for 100 epochs.
Furthermore, for both models, we utilize a cosine annealing schedule with warm restarts to adjust the learning rate during training, with a restart period of 10 epochs that doubles after each restart. Additionally, we set the batch size to 6 for the experiment.
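A minimal sketch of this schedule in PyTorch is shown below; the model, optimizer type, and initial learning rate are placeholders, while the restart settings follow the description above.

```python
import torch

# Cosine annealing with warm restarts: restart period of 10, doubling after each restart.
model = torch.nn.Conv2d(3, 3, 3)                                   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)          # initial LR is illustrative
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2
)

for epoch in range(100):
    # ... training loop over batches of size 6 would go here ...
    scheduler.step()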
The SR3 model is trained using a combination of loss functions that includes a mean squared error (MSE) loss and a perceptual loss, which encourages high-resolution and perceptually realistic outputs. We use the Adam optimizer with a cosine annealing scheduler to adaptively adjust the learning rate during training. The training duration is set to 1,000,000 iterations. For the diffusion process, we set the number of steps to 2000.
We train the RCAN model in two stages. The first stage involves training for 100 epochs using the mean absolute error (MAE) loss function. In the second stage, we replace the loss function with a weighted sum of SSIM loss and MAE; this stage lasts for 30 epochs. For RCAN, we also use the Adam optimizer and a cosine annealing learning rate scheduler.
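A possible implementation of the second-stage loss is sketched below using torchmetrics; the weighting coefficient alpha is a hypothetical value, as the exact weights are not reported here.

```python
import torch
from torch import nn
from torchmetrics.functional import structural_similarity_index_measure as ssim

class WeightedSSIMMAELoss(nn.Module):
    """Weighted sum of (1 - SSIM) and MAE, sketching the second-stage RCAN loss.
    The weight alpha is illustrative."""
    def __init__(self, alpha: float = 0.84):
        super().__init__()
        self.alpha = alpha
        self.mae = nn.L1Loss()

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        ssim_term = 1.0 - ssim(sr, hr, data_range=1.0)
        return self.alpha * ssim_term + (1.0 - self.alpha) * self.mae(sr, hr)
```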
In Section 3.5.1, we describe evaluation metrics that are computed based on the test areas from the collected dataset within four Russian regions.
Further, we apply the developed algorithm to up-scale Sentinel-2 RGB images from 10 m to 2.5 m. The adjusted images are used to perform the building segmentation task.
3.4. Experiments for Building Segmentation
To assess the potential of different data sources and spatial resolutions in the building segmentation task, we select three state-of-the-art neural network architectures: DeepLabv3 [51], SWIN transformer [52], and Twins transformer [53].
DeepLabv3 is a semantic segmentation architecture that improves upon its DeepLab-family predecessors. DeepLab models have already been successfully used in remote sensing tasks [54,55]. DeepLabv3 applies atrous convolutions at different scales to capture the features of objects that need to be segmented. The model shows good performance in the semantic segmentation of urban scenes. To ensure the diversification of our models, we train the DeepLabv3 network with three encoders of different sizes, namely, ResNet-18, ResNet-50, and ResNet-101.
We also consider the SWIN Transformer architecture, which uses shifted windows with a hierarchical transformer to capture pixel-level visual entities in an image and generate segmentation masks. The Swin Transformer is a fast and effective model that has been applied to various segmentation tasks [56].
Another vision transformer that we use in the study is the Twins transformer. Two architecture modifications, PCPVT and SVT, are considered for image segmentation. PCPVT uses conditional positional encoding to tackle the problem of input data of different dimensions. SVT reduces computational complexity by employing spatially separable self-attention (SSSA). SSSA consists of locally-grouped self-attention, which captures fine-grained local features, and global sub-sampled attention, which captures global information.
The study comprises experiments with three sets of images. The first one is the Mapbox dataset with a spatial resolution of 1 m; the second one is Mapbox images with a spatial resolution of 2.5 m; the third one is Sentinel-2 images brought from 10 m to 2.5 m using the developed SR models. Therefore, each neural network model is trained and validated on each dataset.
Individual areas in the collected dataset are represented by large sites; therefore, the image sizes are not suitable for training a model without data preprocessing. We crop patches with shapes of 512 × 512 and 256 × 256 pixels for the spatial resolutions of 1 m and 2.5 m, respectively. The number of patches for Mapbox with a 1 m spatial resolution amounts to 4318 samples, while splitting Sentinel-2 into smaller patches results in 10,164 samples.
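A simplified sketch of this patch extraction is given below; the actual tiling strategy (overlap handling, border padding) may differ from this illustration.

```python
import numpy as np

def crop_patches(image: np.ndarray, mask: np.ndarray, size: int):
    """Split a large site (H, W, C) and its mask (H, W) into non-overlapping
    square patches of the given size."""
    patches = []
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patches.append((image[y:y + size, x:x + size],
                            mask[y:y + size, x:x + size]))
    return patches

# 512 x 512 patches for 1 m Mapbox imagery, 256 x 256 for 2.5 m imagery
# mapbox_patches = crop_patches(mapbox_img, mapbox_mask, 512)
# sentinel_patches = crop_patches(sr_sentinel_img, sentinel_mask, 256)
```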
To support a meaningful comparison of the different architectures, we use the base implementations of each model from Open MMLab's MMSegmentation repository [57]. The models are trained using the weighted cross-entropy loss function, with the Adam optimizer for the Swin and Twins transformers and SGD for the DeepLabV3 models. A polynomial learning rate scheduler is used with a maximum of 300 epochs. We save the model with the highest mean IoU on the validation set for further analysis and performance assessment on the test set. The batch size ranges from 4 to 16 depending on the architecture. The computations are conducted on a Linux machine equipped with an Intel Xeon processor and a Tesla V100-SXM2 GPU with 16 GB of memory.
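An illustrative MMSegmentation-style configuration fragment for one of these setups is shown below; the base config paths, class weights, and batch size are assumptions, not the exact values used in the experiments.

```python
# Illustrative MMSegmentation (0.x) config: DeepLabV3 with SGD, weighted
# cross-entropy, polynomial LR decay, and best-mIoU checkpointing.
_base_ = [
    'configs/_base_/models/deeplabv3_r50-d8.py',
    'configs/_base_/datasets/custom_buildings.py',  # hypothetical dataset config
    'configs/_base_/default_runtime.py',
]

model = dict(
    decode_head=dict(
        num_classes=2,
        loss_decode=dict(
            type='CrossEntropyLoss',
            class_weight=[0.5, 1.5],  # illustrative background/building weights
        ),
    ),
)

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=True)
runner = dict(type='EpochBasedRunner', max_epochs=300)
data = dict(samples_per_gpu=8)  # batch size varies from 4 to 16 per architecture
evaluation = dict(interval=1, metric='mIoU', save_best='mIoU')
```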
3.5. Evaluation Metrics
3.5.1. Super-Resolution
To evaluate the developed SR models, we compute the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), both commonly used metrics for image adjustment tasks. We also compute the Frechet Inception Distance (FID) [58].
FID is a widely used metric to evaluate the quality of generated images in GANs. It measures both the quality and diversity of the generated images by comparing their feature representations to those of real images. To handle large images efficiently, we split them into smaller patches with a size of 64 × 64 pixels. Then, we average the metrics over all patches to obtain the final value. The metrics are computed according to the following equations:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}}\right),$$
where $\mathrm{MAX}_I$ represents the maximum possible pixel value of the image (i.e., 255 for an 8-bit grayscale image), and $\mathrm{MSE}$ represents the mean squared error between the original image and the compressed or distorted image.
$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$
where $\mu_x$ and $\mu_y$ are the pixel sample means, $\sigma_x^2$ and $\sigma_y^2$ are the variances, $\sigma_{xy}$ is the covariance, $c_1$ and $c_2$ are variables to stabilize the division with a weak denominator, and $L$ is the dynamic range of the pixel values.
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^{2} + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right),$$
where $\mu_r$ and $\mu_g$ are the mean vectors of the feature representations of real and fake images, respectively, $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the feature representations of real and fake images, respectively, and $\mathrm{Tr}$ denotes the trace of a matrix.
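The patch-wise evaluation of PSNR and SSIM can be sketched as follows using scikit-image (assuming a recent version with the channel_axis argument); this illustrates the averaging procedure rather than the exact evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def patchwise_scores(hr: np.ndarray, sr: np.ndarray, patch: int = 64):
    """Split both 8-bit RGB images (H, W, 3) into 64 x 64 tiles, compute PSNR
    and SSIM per tile, and return the averaged values."""
    psnr_vals, ssim_vals = [], []
    h, w = hr.shape[:2]
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            a = hr[y:y + patch, x:x + patch]
            b = sr[y:y + patch, x:x + patch]
            psnr_vals.append(peak_signal_noise_ratio(a, b, data_range=255))
            ssim_vals.append(structural_similarity(a, b, channel_axis=-1, data_range=255))
    return float(np.mean(psnr_vals)), float(np.mean(ssim_vals))
```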
3.5.2. Building Segmentation
To evaluate the performance of building segmentation models, we utilize two metrics: Intersection over Union (
IoU, also known as Jaccard index) and
F1-score (also known as Dice Score). The equation for computing the
IoU is the following:
The equation for computing the
F1-score is:
where
,
and
are true positives, false positives and false negatives, respectively.
We also present scores for the building class alone, along with the mean over all classes, to account for the class imbalance between background and buildings.
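A minimal sketch of computing these metrics for the building class from binary masks is given below, following the TP/FP/FN definitions above.

```python
import numpy as np

def iou_f1(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """IoU and F1 (Dice) for the building class from binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, f1
```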
4. Results
As described in the previous sections, we use test images to evaluate our model on Mapbox images using the PSNR and SSIM metrics. We also run inference of our trained model on all collected Sentinel-2 images. It is impossible to draw quantitative conclusions about the performance on Sentinel-2 images due to the absence of HR ground truth images; thus, only a visual evaluation is possible. Figure 8 depicts the appearance of additional spatial features compared with the original 10 m spatial resolution image.
In Table 2, we highlight that the best results are achieved using the MCGR model, for which the evaluation metrics are 27.54 and 0.79 for PSNR and SSIM, respectively. However, these results are not much better than those achieved by SRGAN. On the other hand, SR3 performs worse in terms of the SSIM and PSNR metrics but achieves an FID value of 1.40, which is considerably better than the values achieved by the other models.
If we visually compare the results of the algorithms for Sentinel-2 images, it is clear that both SRGAN and MCGR models perform well in sharpening the boundaries of objects, even for objects with complex shapes. However, both of these models produce images with some artifacts, while the results for the SR3 model are much clearer.
We train the DeepLabv3, SWIN, and Twins models for binary semantic image segmentation on the Mapbox and Sentinel-2 datasets. For Sentinel-2 images, we compare three SR models, MCGR, RCAN, and SR3, to upscale images and bring them to 2.5 m resolution. We evaluate the models' performance based on the IoU and F1-score metrics. The results of the evaluation for different scales on Mapbox images and Sentinel-2 enhanced images are reported in Table 3, Table 4, Table 5, Table 6 and Table 7. For the DeepLabv3 models, we achieve an average IoU of 76.2% and an F1-score of 85.2% for Mapbox images with 1 m resolution. The best result for the Mapbox dataset with 2.5 m resolution is 72.0% (IoU) and 81.7% (F1-score). For the Sentinel-2 dataset with image resolution brought to 2.5 m using the MCGR model, the IoU and F1-score equal on average 68.0% and 78.3%, respectively. The results demonstrate that DeepLabv3 is capable of accurately segmenting objects in the images and achieves good performance in terms of both the IoU and F1-score metrics.
The SWIN models show an average IoU of 75.8% and an F1-score of 84.9% for the Mapbox dataset with 1 m resolution (see Figure 9), 71.5% and 81.4% for the Mapbox dataset with 2.5 m resolution (see Figure 10), and 69.4% and 79.6% for the Sentinel-2 dataset created using the MCGR model (see Figure 11 and Figure 12). The results indicate that the SWIN architecture is effective in capturing fine-grained details in the images, leading to improved performance in terms of the IoU metric.
The Twins model leads to an average IoU of 75.5% and an F1-score of 84.7% for the Mapbox dataset with 1 m resolution, 71.4% and 81.3% for the Mapbox dataset with 2.5 m resolution, and 69.2% and 79.4% for the Sentinel-2 dataset created using the MCGR model. The experiments demonstrate that the Twins architecture is effective in capturing both local and global features in the images, resulting in improved performance in terms of the F1-score metric.
Overall, the results show that the DeepLab architecture outperforms the other models in terms of the IoU and F1-score metrics for the Mapbox datasets with 1 m and 2.5 m resolution. Additionally, two cross-test experiments are conducted to evaluate the generalization capability of the models across different data sources. First, the best-performing model for the Mapbox dataset with a resolution of 2.5 m is applied to the Sentinel-2 dataset with the same resolution. The resulting F1-score for the building class drops drastically to 28.36. Second, the best-performing model for the Sentinel-2 dataset is tested on the Mapbox 2.5 m dataset, and the F1-score declines from 62.88 to 50.52 for the building class. This shows that the model trained on the larger Sentinel-2 dataset is more robust to images from the previously unseen domain, namely Mapbox images.
To assess the need for collecting datasets from diverse geographical regions, we conducted an additional experiment using the Massachusetts Buildings Dataset [33]. We trained a DeepLab-v3 model with a ResNet-50 encoder and validated the model on the test regions from our Mapbox dataset, with the same spatial resolution of 1 m. We achieved an IoU of 0.57 and an F1-score of 0.73 for the building class. However, the model fails in building class recognition for new urban areas: the achieved IoU equals 11% and the F1-score equals 19% on the test set of the Massachusetts Buildings Dataset.
5. Discussion
Upon analysis of the quantitative and qualitative outcomes obtained from the SR process, it is observed that the numerical values and visual quality of the results derived from Mapbox images exceed those derived from Sentinel-2 images for the same region of interest (RoI). We believe this discrepancy can be attributed to the dissimilar domains of the two image sources. Accordingly, the utilization of domain adaptation techniques for the Sentinel-2 images, alongside the training images used in the SR model, is anticipated to yield enhanced visual quality of the output images and improve the effectiveness of the downstream segmentation models.
Additionally, it is worth highlighting certain characteristics of the SR3 diffusion model. Owing to the idiosyncrasies of this model, the images produced through its application are not simply enlarged versions of the original images but rather high-resolution images that closely resemble the originals. Certain small objects may be absent from the final images or, conversely, may be slightly altered. Nevertheless, within the scope of the current task, such minor deviations do not significantly impact the quality of the building segmentation, since the buildings are much larger than these artifacts.
A number of studies aim at developing robust algorithms to reduce the differences between the real scene and its perception in various computer vision domains [59]. The results of the study indicate that the quality of satellite image data also has a significant impact on the accuracy of building segmentation. The Mapbox dataset, which has the highest spatial resolution of 1 m, produces the best results in terms of IoU and F1-scores. This suggests that higher spatial resolution data better captures building features in images and leads to more accurate segmentation results. Although Sentinel-2 data upscaled to 2.5 m shows lower results than Mapbox with 1 m spatial resolution (see Figure 13), freely available data with high temporal resolution is significant for practical applications and should be studied more deeply. One of the key points in such studies is the spatial features that advanced SR architectures are supposed to extract.
Another factor that affects semantic segmentation results is the presence of inaccurate and/or outdated labels in datasets. OSM data is used to create the markup in this study. Although it is a powerful tool for environmental geo-spatial studies, one can face inaccuracies in the data collected for neural network training. The main factors are off/on-nadir satellite observations, changes in building features over time, and new buildings that are not present in the polygons but occur in the remote sensing images. This highlights the importance of having accurate and up-to-date annotations for training deep learning models, as errors in the labels can lead to lower performance. Manual markup collection and updating is a time-consuming and labor-intensive process. Therefore, one of the promising study topics is weakly supervised learning, where markup limitations are addressed automatically [60]. Such approaches have already shown significant results in the remote sensing domain [61].
We also compute metrics individually for each region to provide better insight into geographical and urban diversity (Figure 14). The best model for each dataset is presented. For Sentinel-2 with a resolution of 2.5 m, the results are almost the same for each region on the test subset. However, for the Mapbox datasets, the IoU values vary slightly. This poses a relevant option for future study of model transferability between these regions and fine-tuning to achieve higher results.
In terms of the choice of deep learning models for segmentation, the study evaluates several popular state-of-the-art models, including DeepLabv3 with different ResNet encoders, as well as the SWIN and Twins transformers. The performance of the models does not vary significantly across the different datasets, with similar IoU and F1-score metrics observed for each model. This observation suggests that the quality of the image data has a greater impact on segmentation accuracy than the choice of deep learning model. However, further experimentation with a wider range of models and architectures may be necessary to fully explore their potential for building segmentation.
Figure 15 demonstrates the ability of the proposed approach to separately identify buildings that are very close to each other (10–30 m). However, analyzing the performance of the algorithms on smaller buildings would require a different annotation approach, which is not feasible at the low resolution considered in this article. One of the main objectives of the study is to demonstrate the potential of utilizing medium-resolution satellite imagery to obtain accurate results that enable quantitative assessments.
The presented study focuses on large multi-storey buildings, but practical applications may require the recognition of different types and sizes of buildings, such as small country houses. To extend the dataset for various tasks, masks of small buildings can be additionally included. Moreover, it is necessary to conduct further research to determine the minimum size of objects that can be accurately detected at different resolutions.
One limitation of the proposed building segmentation approach based on Sentinel-2 images is the use of only the RGB bands. This choice was made to match the RGB bands used in the Mapbox basemap. However, a wider spectral range can provide more information for building segmentation. To address this limitation, applying super-resolution techniques to a wider spectral range is a promising avenue for further investigation.
In this study, we consider two sequential tasks with independent models. To optimize the computationally intensive processes of image super-resolution and subsequent image segmentation, one can develop a neural network model that integrates both stages. This would simplify both training and inference. The dataset we have collected provides valuable support for these types of studies.
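A conceptual sketch of such a unified model is shown below; both sub-networks are placeholders, and this is not the architecture used in the present study.

```python
import torch
from torch import nn

class SRSegmentationNet(nn.Module):
    """Conceptual sketch of a unified model: an SR backbone followed by a
    segmentation head trained end-to-end."""
    def __init__(self, sr_backbone: nn.Module, seg_head: nn.Module):
        super().__init__()
        self.sr = sr_backbone
        self.seg = seg_head

    def forward(self, lr_image: torch.Tensor):
        hr_image = self.sr(lr_image)       # x4 upscaled intermediate output
        logits = self.seg(hr_image)        # per-pixel building logits
        return hr_image, logits            # both outputs allow a joint loss
```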
It is challenging to compare results achieved on different datasets with diverse sensing properties and spatial resolutions. However, the numerical results obtained in our study are consistent with similar research. For instance, in [34], the authors achieve an IoU of 59.31 for the building segmentation task using the MLP model on the Inria Aerial Image Labeling dataset. Similarly, on the SpaceNet dataset [38], a reported F1-score of 0.69 is obtained for multiple cities. However, it is worth noting that this dataset comprises images with a spatial resolution of 1 m. In [13], the F1-score varies from 42.06 to 53.67 depending on the test regions for Sentinel-2 images up-sampled to 2.5 m.
We summarize the reviewed remote sensing datasets in Table 8. In addition to building semantic segmentation datasets, general datasets with other land-cover classes and man-made objects are presented. The amount of training data is a common issue in satellite image analysis. Although larger datasets provide a more comprehensive evaluation of the performance of artificial intelligence algorithms, geographical characteristics and remote sensing data properties are also of high significance. Another avenue to explore is the application of image augmentation. In the present work, we use only basic color and geometric transformations. More advanced techniques, such as object-based augmentation with various natural backgrounds, have been suggested [62]. They allow one to extend a training dataset significantly and transfer samples from one geographical region to another [55]. Another approach for selecting more appropriate backgrounds for artificially generated samples is proposed in [63]. In addition, multispectral augmentation techniques for Sentinel-2 data can be used to boost model performance through data diversity [64].
Overall, the study highlights the importance of considering image quality and label accuracy when training deep learning models for building segmentation. It provides insights into the relative performance of different models and image datasets. The conducted experiments inform the design of more accurate and effective building segmentation models with potential applications in urban planning, disaster response, and environmental monitoring using satellite data. Alternative generic monitoring approaches typically involve installing a vast wireless monitoring system with sensors, which can be very costly and challenging to maintain [65], or even impossible for particular regions.