Improving the Performance of Automated Rooftop Extraction through Geospatial Stratified and Optimized Sampling

Abstract: Accurate and timely access to building rooftop information is very important for urban management. The era of big data brings new opportunities for rooftop extraction based on deep learning and high-resolution satellite imagery. However, efficiently collecting representative datasets from such big data to train deep learning models is an essential problem that still needs to be explored. In this study, geospatial stratified and optimized sampling (GSOS), based on geographical a priori information and optimization of the spatial distribution of sample locations, is proposed to acquire representative samples. Specifically, the study area is stratified based on land cover into a rooftop-dense stratum and a rooftop-sparse stratum. Within each stratum, an equal number of samples is collected and their spatial locations are optimized. To evaluate the effectiveness of the proposed strategy, several qualitative and quantitative experiments are conducted. Compared with other common sampling approaches (e.g., random sampling, stratified random sampling, and optimized sampling), GSOS is superior in terms of the abundance and types of collected samples. Furthermore, two quantitative metrics, the F1-score and Intersection over Union (IoU), are reported for rooftop extraction based on deep learning methods and different sampling methods; the results based on GSOS are on average 9.88% and 13.20% higher, respectively, than those based on the other sampling methods. Moreover, the proposed sampling strategy is able to obtain representative training samples for the task of building rooftop extraction and may serve as a viable method to alleviate the labour-intensive problem in the construction of rooftop benchmark datasets.


Introduction
Basic geospatial data are an important foundation for urban sensing and modelling. Their collection, updating and expansion are fundamental to smart city construction [1][2][3]. Buildings, an important physical element of cities, host a large share of human activity. The rooftop area information derived from buildings can be used as significant basic data for sustainable urban development, urban planning, and integrated urban-rural development [4][5][6].
Current rooftop area information collection mainly relies on photogrammetry, manual remote sensing interpretation and airborne laser scanning, which are labour- or material-intensive and difficult to extend to large-scale data acquisition. In recent years, with [...]
Remote Sens. 2022, 14, 4961
The effectiveness of implementing sample preparation optimization based on these methods needs to be further validated.
In this paper, a geospatial stratified and optimized sampling (GSOS) strategy is proposed to improve the performance of deep learning-based extraction of building rooftop areas at the data preparation level. To maintain the high and consistent quality of the data source, we introduce Google Earth satellite (GES) imagery and recently published vectorized rooftop area data [6]. The former is utilized as the data source, and the latter is utilized as ground truth data for training and validating deep learning models. Land cover is introduced as a priori information to split the study area into various regions. Spatial simulated annealing (SSA), which considers the distance between sample locations, is then adopted to improve sample coverage and acquire highly representative samples for model training. Finally, quantitative comparison studies for sample sets and models are designed to demonstrate the effectiveness of our strategy.
The rest of this paper is organized as follows. In Section 2, the materials of this study are introduced. In Section 3, the design and evaluation metrics of the GSOS are presented. In Section 4, the comparative experiments are demonstrated, and the effectiveness of GSOS is analysed. In Section 5, the limitations of the experiment and possible improvements are discussed. Finally, we conclude in Section 6.

Study Area
The study area is Nanjing in Jiangsu Province, China (see Figure 1). By the end of 2021, this city had a built-up area of nearly 870 km², a resident population of over 9 million, an urbanization rate of 86.9% and an economic output that consistently ranks among the top ten cities in China. Nanjing is the ancient capital of the Six Dynasties and an important central city in eastern China. Its long history and important position offer a diverse and representative range of building styles.
GES imagery offers new opportunities for access to urban information due to its wide coverage, speed of update and low cost of acquisition. In this study, GES imagery at approximately 0.6 m/pixel (Figure 2a) is used as the data source. Based on the open map service API (https://www.google.com/earth, accessed on 15 March 2022) provided by Google, the image data can be downloaded according to the longitude and latitude ranges of Nanjing. Imagery at this resolution shows building rooftop details clearly while keeping the amount of data manageable.

Land Cover Data
Building rooftops have a high probability of being collected in built-up areas. This can serve as prior information for collecting building rooftop samples. Therefore, land cover data can be used to provide built-up area information to support stratified sampling. Currently, the Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC30 2017) (http://data.ess.tsinghua.edu.cn/, accessed on 8 May 2022) (Figure 2b), with a spatial resolution of 30 metres and an overall accuracy of 72.43%, is an authoritative and publicly available land cover dataset. It includes ten categories of land cover, i.e., cropland, forest, grassland, shrubland, wetland, water, tundra, impervious surface, bareland, and snow/ice.

Vectorized Rooftop Area Data of Nanjing
Considering the trade-off between the manual labelling cost and the consistent quality of labels, the vectorized rooftop area data of Nanjing (Figure 2c), a high-quality public rooftop area dataset published by Zhang, Z. et al. (2022), is adopted in this study as the ground truth data. The dataset was extracted with a deep learning segmentation model from high-resolution remote sensing imagery and provides clear and detailed rooftop area data with an overall F1-score of 83.11%.

Research Framework
The research framework consists of the following three main modules: data preparation based on the GSOS, development of the rooftop extraction models, and evaluation of the impact of the GSOS on the rooftop extraction accuracy. The overall working framework is shown in Figure 3. In GSOS-based data preparation, stratified sampling is first carried out by combining a priori information on the land cover to form sample collection strata with different rooftop densities. A single-objective optimization that maximizes the average sample distance is then used to expand the coverage of samples in each stratum, with a view to increasing the proportion of rooftops and the number of rooftop categories in the sample set. A series of city-wide sample sets of building rooftops is collected by the GSOS. The constructed sample sets are input into deep learning networks to obtain building rooftop extraction models. By evaluating the sample sets and the models, the impact of the GSOS on building rooftop extraction is quantitatively verified.

Stratification Considering the Geographical Context
The proportion of building rooftops in remote sensing imagery is much smaller than that of non-rooftop targets, and the use of simple random sampling is prone to sample category imbalance problems. This can lead to a heavily biased model. Stratified sampling allows for a balanced sample of all categories. This method divides the overall area into strata based on one criterion and randomly selects sample points within each stratum.
In this study, the study area is stratified based on land cover information: using FROM-GLC30, it is divided into built-up and non-built-up areas, yielding rooftop-dense and rooftop-sparse areas (see Figure 4a). Rooftop-dense areas are characterized by high levels of artificial construction activity and building density; as a result, samples rich in building rooftop information are easy to collect there. Rooftop-sparse areas include water, grassland, cropland, and bare ground; there, samples in which rooftops are sparsely distributed are easy to collect. Although collecting valid information in rooftop-sparse areas is less efficient, as they are much larger than rooftop-dense areas, an equal number of samples is still collected in both areas to obtain information that is as comprehensive as possible on the different styles and densities of rooftops in the study area.
However, empirical studies have shown that it is difficult to collect rich rooftop information in fragmented built-up areas. Delineating rooftop-dense areas by built-up land alone is therefore insufficient: built-up patches that are large enough provide dense rooftops, whereas those that are too small do not. We therefore consider built-up patches larger than half the area of one image sample to be rooftop-dense areas, and the remainder to be rooftop-sparse areas (see Figure 4b).
To facilitate subsequent optimization of the sample spatial locations, we generate a point matrix for sampling within the study area. The sample points that fall into the rooftop-dense area are expanded into rectangular sample cells, which constitute the rooftop-dense stratum. Likewise, the sample points that fall into the rooftop-sparse area are expanded into rectangular sample cells, which constitute the rooftop-sparse stratum (see Figure 4c,d). The size of the sample cells depends on the spacing of the point matrix, so that the sample cells do not overlap.
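The patch-size rule and the point matrix described in this section can be illustrated with a minimal sketch. It assumes the land cover map is available as a binary raster (1 = built-up, 0 = other) and that built-up patches are delineated by 4-connectivity; the function names, the connectivity choice and the placement of points at cell centres are illustrative assumptions, not details given in the paper.

```python
from collections import deque


def label_patches(grid):
    """Label 4-connected built-up patches (value 1); return labels and patch sizes in pixels."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    sizes = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue, size = deque([(r, c)]), 0
            while queue:  # breadth-first flood fill of one patch
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            sizes[next_label] = size
    return labels, sizes


def stratify_points(grid, pixel_size_m, cell_size_m, spacing_px):
    """Assign each point of a regular point matrix to the dense or sparse stratum.

    A point is 'dense' only if it lies on a built-up patch whose area exceeds
    half the area of one sample cell; every other point is 'sparse'.
    """
    labels, sizes = label_patches(grid)
    min_patch_area = 0.5 * cell_size_m ** 2
    dense, sparse = [], []
    for r in range(spacing_px // 2, len(grid), spacing_px):
        for c in range(spacing_px // 2, len(grid[0]), spacing_px):
            lab = labels[r][c]
            area = sizes.get(lab, 0) * pixel_size_m ** 2
            (dense if lab and area > min_patch_area else sparse).append((r, c))
    return dense, sparse
```

With the data used here, `pixel_size_m` would be 30 (FROM-GLC30) and `cell_size_m` 500, so a built-up patch would need to exceed 125,000 m² to count as rooftop-dense.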

Optimal Sampling Considering the Sample Coverage
According to the first and second laws of geography, the spatial distribution of buildings is characterized by both "the closer, the more similar" and spatial heterogeneity. Random sampling within strata tends to collect neighbouring and similar samples, resulting in problems such as information redundancy. In the absence of sufficient a priori knowledge to support further stratification, dispersing the samples as evenly as possible within each stratum improves the regional coverage and the performance of rooftop information collection.
Simulated annealing (SA) is a probabilistic method for finding globally optimal solutions and is widely applied to optimization problems [37][38][39][40]. The SSA algorithm is an extension of SA to space [41][42][43]. In the sample collection of this study, a certain number of sample cells are first collected randomly. SSA is then used to maximize the average distance between these sample cells and thereby extend their coverage. The inverse of the nearest neighbour index (NNI) is introduced as the cost function of SSA. It is calculated as follows:
$$\mathrm{NNI}_{\mathrm{stratum}} = \frac{\frac{1}{n_{\mathrm{stratum}}} \sum_{i=1}^{n_{\mathrm{stratum}}} \min\left(d_{\mathrm{stratum}-i}\right)}{0.5\sqrt{A_{\mathrm{stratum}} / n_{\mathrm{stratum}}}} \quad (1)$$
$$\mathrm{Cost}_{\mathrm{stratum}} = \frac{1}{\mathrm{NNI}_{\mathrm{stratum}}} \quad (2)$$
where $A_{\mathrm{stratum}}$ is the area of the sample stratum, $n_{\mathrm{stratum}}$ is the number of sample cells in the stratum and $\min(d_{\mathrm{stratum}-i})$ is the distance between the centroid of the $i$th sample cell and its nearest neighbour in the stratum, excluding itself. The denominator of Formula (1) describes the expected distance when the sample cells are randomly distributed in the stratum. The numerator of Formula (1) describes the average distance between each sample cell and its nearest neighbour in the stratum. The smaller the value of $\mathrm{Cost}_{\mathrm{stratum}}$, the more dispersed the sample cells in the stratum tend to be.
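As an illustration of the cost function and the annealing loop, the following minimal sketch selects sample cells from a list of candidate cell centroids. Several details are assumptions rather than the paper's settings: the expected nearest-neighbour distance uses the standard form 0.5·sqrt(A/n), a move swaps one selected cell for an unselected one, acceptance follows the Metropolis rule, and the temperature schedule (`t0`, `cooling`, `steps`) is arbitrary.

```python
import math
import random


def nni(points, area):
    """Nearest neighbour index: observed mean nearest-neighbour distance over the expected one."""
    n = len(points)
    mean_nn = sum(
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ) / n
    expected = 0.5 * math.sqrt(area / n)  # assumed expectation under a random pattern
    return mean_nn / expected


def cost(points, area):
    """Inverse of the NNI; smaller values mean more dispersed samples."""
    return 1.0 / nni(points, area)


def ssa(candidates, n_samples, area, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Pick n_samples candidates (2 <= n_samples < len(candidates)) that minimise cost()."""
    rng = random.Random(seed)
    current = rng.sample(range(len(candidates)), n_samples)  # random initial sample
    cur_cost = cost([candidates[i] for i in current], area)
    best, best_cost = list(current), cur_cost
    temp = t0
    for _ in range(steps):
        unused = [i for i in range(len(candidates)) if i not in current]
        proposal = list(current)
        proposal[rng.randrange(n_samples)] = rng.choice(unused)  # move one sample cell
        new_cost = cost([candidates[i] for i in proposal], area)
        # Always accept improvements; accept worse moves with a temperature-dependent probability.
        if new_cost < cur_cost or rng.random() < math.exp((cur_cost - new_cost) / temp):
            current, cur_cost = proposal, new_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        temp *= cooling
    return [candidates[i] for i in best], best_cost
```

In this sketch each stratum would be optimised separately by passing its own candidate centroids and stratum area.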

Image Semantic Segmentation
Image semantic segmentation combines image segmentation and image classification by assigning the same label to all pixels in an image that belong to the same category. It plays a crucial role in extracting feature information from remote sensing images. Compared to traditional non-deep learning image segmentation methods, deep learning-based methods can extract more abstract image features, thus better capturing the unique characteristics of different targets and achieving higher segmentation accuracies. As one kind of mainstream semantic segmentation network for deep learning, encoder-decoder networks gradually incorporate high-dimensional features into low-dimensional features, allowing the network to capture semantic information at different scales. This addresses both the resolution degradation problem and the multi-scale problem. Semantic segmentation models such as FCN, UNet, SegNet and DeepLab, which are classical encoder-decoder structures, have achieved good results in the field of remote sensing image semantic segmentation [44][45][46][47][48][49].
In this study, the FCN, UNet, SegNet, DeepLab and DeepLabV3+ models are selected to work together for rooftop recognition. Of these, we follow the study by Zhong, T. et al. [11] using DeepLabV3+ as the primary identification model to help fully evaluate the effectiveness of the GSOS. The other four models only serve as complementary models to evaluate the generalizability of the GSOS over different networks.

Evaluation Metrics
A confusion matrix is a table that summarizes the ground truth and the model predictions in supervised learning, allowing the performance of supervised learning algorithms to be evaluated quantitatively. The columns of the confusion matrix represent the true class, and the rows represent the predicted class. The confusion matrix of the binary classification model and its specific definitions are shown in Figure 5. Because the confusion matrix reports raw pixel counts, its entries are not comparable across datasets, and the results need to be normalized. Most studies calculate the precision and recall from the confusion matrix, but the two constrain each other (as the precision increases, the recall tends to decrease, and vice versa). Therefore, a combined measure of the two is required for a comprehensive evaluation.
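The F1-score is the harmonic mean of the precision and recall. Its value is close to the average of the two. In addition, Intersection over Union (IoU) is a common metric in object detection. It can be re-expressed in terms of the precision and recall, and its value is closer to the worse of the two. The precision, recall, F1-score, and IoU are calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where $TP$, $FP$ and $FN$ are the numbers of true positive, false positive and false negative pixels in the confusion matrix (Figure 5).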
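The four metrics can be computed directly from confusion-matrix counts, as in the following minimal sketch. The function names are illustrative, and returning 0.0 for an empty denominator is an assumption made here for convenience. The second function shows how IoU can be re-expressed in terms of precision and recall.

```python
def segmentation_metrics(tp, fp, fn):
    """Return (precision, recall, f1, iou) from pixel counts of the rooftop class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


def iou_from_precision_recall(precision, recall):
    """IoU rewritten as P*R / (P + R - P*R)."""
    denom = precision + recall - precision * recall
    return precision * recall / denom if denom else 0.0
```

For any counts, IoU never exceeds the smaller of precision and recall, while the F1-score lies between the two, which is why IoU is the stricter metric.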

Experiment Configuration
In the GSOS of rooftop satellite imagery in Nanjing, the spacing of the point matrix covering the rooftop-dense area and the rooftop-sparse area is 500 metres. The size of the sample cell obtained by point expansion is 500 × 500 metres (838 × 838 pixels). The sample cells do not overlap. The image samples are cropped based on the optimized sample cells. A non-overlapping sliding window of 384 × 384 pixels is then used to crop the image samples into the image patch set that can be fed into a semantic segmentation model. Of the image patch set, 70% is used for training the model and 30% for validation, as shown in Figure 6.
In this paper, the sampling is repeated eight times, with each set of samples being independent. Image samples never captured in the eight samplings are collected as the independent test set to evaluate the accuracies of the rooftop extraction models. Furthermore, data augmentation in the form of rotation, flipping, blurring and noise is performed in the training phase to reduce model bias. The detailed configuration of the training phase of the deep learning models is shown in Table 1.
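The cropping and splitting steps can be sketched as follows. Two points are assumptions, since the paper does not state them: windows that would extend past the image edge are discarded, and the 70/30 split is made by shuffling patches at random. The function names and the `seed` parameter are hypothetical.

```python
import random


def crop_windows(height, width, window=384):
    """Top-left corners of non-overlapping windows that fit entirely inside the image."""
    return [
        (top, left)
        for top in range(0, height - window + 1, window)
        for left in range(0, width - window + 1, window)
    ]


def split_patches(patches, train_fraction=0.7, seed=0):
    """Shuffle the patches and split them into training and validation sets."""
    shuffled = list(patches)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Under the discard assumption, an 838 × 838 pixel sample cell yields 2 × 2 = 4 patches of 384 × 384 pixels, leaving a 70-pixel margin unused on the right and bottom edges.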

Table 1 (excerpt). Loss function: BCE & DICE
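The loss entry above names binary cross-entropy (BCE) and Dice. The paper does not state how the two terms are combined, so the following is only a sketch of one common formulation, an unweighted sum over flattened per-pixel probabilities; the clipping constant `eps` and the `smooth` term are assumptions introduced here for numerical stability.

```python
import math


def bce_dice_loss(probs, targets, eps=1e-7, smooth=1.0):
    """Unweighted sum of BCE and Dice loss for flat lists of probabilities and 0/1 targets."""
    n = len(probs)
    clipped = [min(max(p, eps), 1 - eps) for p in probs]  # avoid log(0)
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(clipped, targets)
    ) / n
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1 - (2 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return bce + dice
```

The BCE term penalises each pixel independently, while the Dice term depends on the overlap between prediction and mask, which makes it less sensitive to the small share of rooftop pixels in each patch.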
In this study, quantitative experiments are designed to compare the proposed GSOS strategy with other sampling strategies, i.e., random spatial sampling (RSS), stratified random spatial sampling (SRSS) and distance optimized sampling (DOS). Subsequently, the sample rooftop proportion, abundance and impact on the rooftop extraction accuracies of these sampling strategies are reported.

Comparison of Rooftop Proportion
In this study, the rooftop proportion is measured as the percentage of rooftop area relative to the total area of the sample set. The comparison is conducted on 8 incremental sample sizes. The statistical comparison results of the rooftop proportion for different sampling strategies are presented in Table 2. The sampling strategies that take land cover information into account (GSOS and SRSS) are the most effective, with the highest proportion at approximately 7.1%. In contrast, the RSS and DOS obtain a lower proportion of rooftops, only approximately half of the former. This suggests that incorporating land cover helps to obtain denser rooftop objects. As a result, the sample size and labelling effort can be reduced while still obtaining enough rooftop information to support model training.
However, optimizing the sample locations, which is carried out to account for rooftop abundance and to collect information at different densities, causes a slight loss of rooftop area of only approximately 0.7%. The benefits for rooftop abundance are presented in the next subsection. In addition, the GSOS has a significantly lower standard deviation. This indicates that it is more stable over multiple samplings and less affected by randomness, which makes it easier to apply the GSOS in other studies.
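The rooftop proportion and its spread over repeated samplings can be computed as in this minimal sketch. It assumes each sample is available as a binary mask (1 = rooftop pixel) and uses the sample standard deviation; the paper does not say which estimator is reported in Table 2, and the function names are illustrative.

```python
from statistics import mean, stdev


def rooftop_proportion(masks):
    """Percentage of rooftop pixels over all pixels in a set of binary masks."""
    rooftop = sum(sum(row) for mask in masks for row in mask)
    total = sum(len(row) for mask in masks for row in mask)
    return 100.0 * rooftop / total


def summarize_repeats(proportions):
    """Mean and sample standard deviation of the proportion over repeated samplings."""
    return mean(proportions), stdev(proportions)
```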

Comparison of Rooftop Abundance
To evaluate the ability of different sampling strategies to obtain multiple classes of rooftops, the collected rooftop image patches are classified. Image patches larger than 50 × 50 pixels are converted into feature vectors by ResNet18 and then clustered and embedded into 2D space using KMeans and t-SNE, where K is set to a value much larger than the number of any possible rooftop classes in the study area. Based on the results of the automated clustering, combined with manual visual interpretation, similar classes are iteratively merged, and the number of collected rooftop classes is obtained, as shown in Figure 7.
The clustering results show that the GSOS obtained 15 classes of rooftops, which is six and four more than DOS and SRSS, respectively, as shown in Figure 7a-c. Abundant sample classes provide a more detailed portrayal of the rooftop features in the study area. This can bring more typical samples to the rooftop segmentation model.

Rooftop Extraction Model Evaluation
To further illustrate the effectiveness of the GSOS, the quantitative results in terms of the rooftop extraction accuracy and generalizability over different models are reported in this study.

Comparison of the Rooftop Extraction Accuracy
The evaluation of the impact of the sampling strategy on the model accuracy is carried out based on DeepLabV3+. Given the nature of data sampling, randomness and uncertainty in the samples cannot be avoided. To reduce this effect, multiple snapshots of local optima are captured in each training session according to the loss function. Rooftop identification and evaluation are performed based on these snapshots, and their confidence intervals are obtained, as shown in Figure 8.
The results showed that as the sample size increased, the F1-score and IoU of the model corresponding to each sampling strategy showed a significant increasing trend. This trend slows down once the sample size exceeds 2000. On the other hand, the GSOS generally outperformed the other sampling strategies in terms of model accuracy. Compared with the DSS and SRSS at the same sample size, the F1-scores increased by an average of 13.40% and 3.01%, respectively, and the IoU increased by an average of 17.62% and 4.18%, respectively. In particular, the GSOS and SRSS are significantly superior to the other two strategies, suggesting that geographic a priori information on land cover plays a significant role in improving sample preparation and increasing rooftop recognition accuracy.
In terms of the model confidence intervals, the GSOS also performs well. As the sample size increases, the confidence intervals of the GSOS gradually converge, the fluctuations stabilize, and the models become more reliable. This indicates that the method effectively reduces the effects of randomness.
In addition, it was found that with the GSOS, only a smaller sample size is required to achieve the rooftop extraction effect of the non-optimized case with a larger sample size, especially for sampling strategies that are not guided by land cover information. This helps to save significant overheads in producing building rooftop area datasets with deep learning, as high-quality labelled samples are generally expensive.
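The confidence intervals over training snapshots can be illustrated with a minimal sketch. The paper does not describe how the intervals in Figure 8 are computed; this example assumes a normal approximation around the mean of per-snapshot scores, which is only rough when the number of snapshots is small. The function name and the 95% level are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(scores, level=0.95):
    """Normal-approximation confidence interval for the mean of per-snapshot scores."""
    centre = mean(scores)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return centre - half_width, centre + half_width
```

A narrower interval at larger sample sizes corresponds to the convergence described above.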

Comparison of Generalizability
The evaluation of the sampling strategies for generalizability under different deep learning networks is carried out based on the FCN, UNet, SegNet, DeepLab and DeepLabV3+ at a sample size of 2000. As shown in Figure 9, the GSOS is superior in terms of generalizability. The stable and high accuracy of the GSOS with multiple networks indicates that the rooftop sample set captured by the GSOS is representative of the regional characteristics, so that networks of different structures can extract typical features of rooftops in the study area. This supports further exploration at the modelling level when producing building rooftop area datasets with deep learning.
It is worth mentioning that in the preceding experiments, the SRSS obtained model accuracies close to those of the GSOS. However, in the generalizability comparison, the SRSS model accuracies are extremely unstable across different networks. The model accuracy is even lower than that of the RSS when trained with DeepLabV3+. This makes it difficult for the SRSS to support more extensive and deeper studies at the model level in the future.

Uncertainty Analysis
Considering the cost of manual labelling and the variation in its quality, this study adopts as ground truth a publicly available rooftop dataset extracted from high-resolution remote sensing images with deep learning methods. However, with an F1-score of 83.11% for this dataset, there are bound to be areas that do not match the actual ground truth. The uncertainty caused by this error can be mitigated by increasing the sample size, which collects as much rooftop information with small errors as possible.
Data sampling is accompanied by randomness and uncertainty. When the sample size is small, the abundance of rooftop information inevitably decreases, and even after optimization it is difficult to cover the full region, so the impact of randomness on the rooftop extraction becomes more significant. In addition, the model introduces parameters with randomness during the training process, which increases the uncertainty of the results. In this study, multiple iterations are used to reduce the effect of randomness, but the resulting fluctuations cannot be eliminated entirely.
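One common way to report results under such randomness is to repeat the sampling and training with several random seeds and to summarize the scores by their mean and standard deviation. The sketch below assumes that "multiple iterations" refers to repeated runs of this kind; `summarize_runs` and `placeholder_evaluate` are hypothetical names, and the placeholder returns random numbers rather than any result from this study.

```python
import random
import statistics


def summarize_runs(evaluate, seeds):
    """Run evaluate(seed) for each seed and return (mean, standard deviation)."""
    scores = [evaluate(seed) for seed in seeds]
    mean = statistics.fmean(scores)
    # The sample standard deviation needs at least two runs.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


def placeholder_evaluate(seed):
    """Hypothetical stand-in for 'sample, train and score one model'.

    It returns a pseudo-random number in [0, 1), not a real accuracy.
    """
    return random.Random(seed).random()


mean_score, score_std = summarize_runs(placeholder_evaluate, seeds=range(5))
```

In practice, `placeholder_evaluate` would be replaced by a function that draws a sample set, trains a network and returns its F1-score or IoU for the given seed.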

Potential Improvements of GSOS
The proposed GSOS is constructed for the rooftop area extraction task in Nanjing, considering the built-up area, urbanization rate, and data accessibility. In cities or regions with widely varying physical and human characteristics, i.e., regions with different modalities, the main factors influencing the spatial distribution characteristics of rooftops differ. When the GSOS is extended to these regions, or even to regions at other spatial-temporal scales, the geographical priori information and optimization objectives to be considered change.
In the future, the spatial autocorrelation and spatial heterogeneity of building rooftops can be considered in more detail in conjunction with digital surface models, urban functional areas, street views, POIs, living footprints and regional economies [50][51][52]. Thus, more possible sample stratifications can be explored from different perspectives, such as building function and style. In addition, multi-source data and feature mining methods can be combined to explore further approaches to target optimization, and the sampling methods can be adapted to the different modalities of urban clusters. Finally, the integration of multi-source remote sensing imagery may help to further improve and evaluate the generalizability of the GSOS.
When extended to other research objectives, the GSOS can serve as a reference for other studies by exploring more appropriate geographic priori information and target optimization methods. For example, in a study to explore local-scale patterns of urban air pollution, researchers divide cities by landscape and by administrative and functional zones to explore urban air NO₂ pollution patterns and their causal factors [53]. On the other hand, varying the spatial simulated annealing optimization objective for different research objectives can also provide a reference for researchers. For example, in a study on lake water quality monitoring, researchers have adopted the mean spatial-temporal error (MSTE) as the optimization objective, with a view to reducing the errors arising from spatial-temporal interpolations [42].
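The idea of varying the optimization objective can be sketched as a spatial simulated annealing routine that takes the objective as a parameter. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the cooling schedule and the example objective (the mean distance from each grid point to its nearest sample) are all assumptions chosen for illustration.

```python
import math
import random


def mean_shortest_distance(samples, grid):
    """Illustrative objective: mean distance from each grid point to its nearest sample."""
    return sum(min(math.dist(g, s) for s in samples) for g in grid) / len(grid)


def spatial_simulated_annealing(candidates, n_samples, objective,
                                n_iter=500, t_start=1.0, cooling=0.99, seed=0):
    """Search for n_samples locations among candidates that minimize objective."""
    if n_samples >= len(candidates):
        raise ValueError("n_samples must be smaller than the number of candidates")
    rng = random.Random(seed)
    current = rng.sample(candidates, n_samples)
    current_cost = objective(current)
    best, best_cost = list(current), current_cost
    temperature = t_start
    for _ in range(n_iter):
        # Propose moving one randomly chosen sample to an unused location.
        unused = [c for c in candidates if c not in current]
        proposal = list(current)
        proposal[rng.randrange(n_samples)] = rng.choice(unused)
        cost = objective(proposal)
        delta = cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = list(proposal), cost
        temperature *= cooling
    return best, best_cost


# Toy 10 x 10 grid of candidate locations, purely illustrative.
grid = [(x, y) for x in range(10) for y in range(10)]
best, best_cost = spatial_simulated_annealing(
    grid, n_samples=5, objective=lambda s: mean_shortest_distance(s, grid))
```

Under this design, adapting the routine to another research objective only requires passing a different `objective` callable, for example one that estimates an interpolation error, while the annealing loop stays unchanged.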

Conclusions
Rooftop area information is an important data basis for urban planning and urban-rural integration. Using satellite imagery and deep learning to extract rooftop information is a mainstream solution. However, current studies focus mostly on algorithm development and overlook the importance of data collection. To address this challenge, an advanced sampling strategy, the GSOS, is proposed in this study to generate a high-quality dataset for training rooftop extraction models. Qualitative and quantitative evaluations show that the generated samples are representative in terms of the rooftop coverage and rooftop types in the image samples. In addition, the prediction results of the rooftop extraction models demonstrate that the GSOS-based models are capable of achieving high identification accuracies with small sample sizes. In the future, the sampling strategy may be able to incorporate more fundamental geographical and socio-statistical information to provide a customized solution for data collection in regions with different modalities.