Self-Training Classification Framework with Spatial-Contextual Information for Local Climate Zones

Abstract: Local climate zones (LCZ) have become a generic criterion for climate analysis among global cities, as they can describe not only the urban climate but also the morphology inside the city. LCZ mapping based on the remote sensing classification method is a fundamental task, and the protocol proposed by the World Urban Database and Access Portal Tools (WUDAPT) project, which consists of random forest classification and filter-based spatial smoothing, is the most common approach. However, the classification and spatial smoothing lack a unified framework, which causes the appearance of small, isolated areas in the LCZ maps. In this paper, a spatial-contextual information-based self-training classification framework (SCSF) is proposed to solve this LCZ classification problem. In SCSF, conditional random field (CRF) is used to integrate the classification and spatial smoothing processing into one model and a self-training method is adopted, considering that the lack of sufficient expert-labeled training samples is always a big issue, especially for the complex LCZ scheme. Moreover, in the unary potentials of CRF modeling, pseudo-label selection using a self-training process is used to train the classifier, which fuses the regional spatial information through segmentation and the local neighborhood information through moving windows to provide a more reliable probabilistic classification map. In the pairwise potential function, SCSF can effectively improve the classification accuracy by integrating the spatial-contextual information through CRF. The experimental results prove that the proposed framework is efficient when compared to the traditional mapping product of WUDAPT in LCZ classification.


Introduction
The local climate zone (LCZ) scheme is a novel climate-based classification scheme [1], which relates the urban climate, represented by physical traits, to the urban morphology, depicted through landscape cover. The LCZ scheme has the potential to be accepted as a standard description for cities worldwide [2], since its strong performance in indicating the diversity inside cities is also important for urban studies. It has been widely used in various fields. For example, a series of studies of the urban heat island (UHI) effect [3,4] adopted the LCZ scheme rather than the traditional urban/rural dichotomy to analyze the effects of urban heat islands, making the study of thermal phenomena more specific. The LCZ scheme has also been found to be superior in portraying urban to a revolutionary leap for urban microclimate research [35]. As a result, the LCZ scheme has attracted interest in various fields. An operable supervised LCZ workflow has been published by the World Urban Database and Access Portal Tools (WUDAPT) project, and further study is still in progress. The WUDAPT project [36,37] is an initiative for the acquisition, storage, and dissemination of climate-relevant data on the physical traits of global cities. The WUDAPT data collections have three levels: level 0 provides a rough LCZ classification based on remote sensing data; level 1 provides more detailed information on the urban form and function via crowdsourcing techniques; and level 2 is responsible for gathering the finer parameters from these zones. The supervised remote-sensing-based LCZ mapping approach [19] proposed by WUDAPT (Figure 2) has been utilized to collect level 0 data from the crowdsourcing community. With the help of Google Earth and the open-source SAGA GIS software, people without specialist knowledge can also map LCZs through data labeling, supervised classification, and filter-based spatial smoothing steps.
Unfortunately, many of the LCZ labeled samples collected via the crowdsourcing community are ambiguous, resulting in a lack of high-quality expert-labeled data, which usually causes an inferior prediction. Thus, a method which is able to perform well in the case of a small number of training samples is required.

Self-Training Method.
Semi-supervised learning was developed to deal with the limited data issue, and it can help to unlock the potential of huge unlabeled datasets. The self-training approach proposed by Scudder [38] is the earliest semi-supervised method [39], which has been widely used in gene prediction [40], parsing [41], and image classification [42,43] for its simplicity and clarity.
In the traditional self-training method, a base learner is first trained on the original, small, labeled training set; high-probability pseudo-labels are then used to enrich the labeled set, and the learner is retrained until the stopping conditions are reached. The unlabeled data provide extra information for adjusting the learner, information that is usually ignored by supervised methods, which brings the model closer to the real distribution of the data. The procedure of a basic self-training method is shown in Figure 3 and can be summarized in the following four steps:
Step 1-Initialization: Train the base classifier $C_{int}$ on the initial training set $(X_{train}, y_{train})$ from the given labeled data $(X_l, y_l)$.
Step 2-Selection: Predict the data in the unlabeled set $X_u$ with classifier $C_{int}$, and select the high-probability data $(X_{conf}, y_{conf})$ as pseudo-labels.
Step 3-Updating: Remove the selected unlabeled samples $X_{conf}$ from the unlabeled set $X_u$, and combine the original data $(X_l, y_l)$ with the pseudo-labeled data $(X_{conf}, y_{conf})$ to update the training set $(X_{train}, y_{train})$.
Step 4-Retraining: Retrain the classifier $C_{int}$ with the updated data $(X_{train}, y_{train})$, and repeat Steps 2-4 until the stopping conditions are satisfied.
However, the traditional self-training method starts learning from only a few labeled samples, from which it is difficult to capture clear boundaries between classes. Mislabeled samples are then carried into the next learning iteration, thereby confusing the classifier. Thus, more powerful pseudo-label selection strategies are urgently needed. Aydav and Minz [44] developed an improved self-training approach using granulation to select the most confident data in a block, rather than a single pixel. Although this approach takes the regional information into account, the confidence obtained from the predicted probability on only a few labels is still low.
Given this fact, in the proposed approach, the commonly used probability is replaced by a strategy that considers the spatial-contextual information. After a series of selections, from homogeneous blocks to candidates and then to pseudo-labels, the proposed strategy performs strongly and provides a more reliable probabilistic classification map for CRF.
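The four steps of the basic self-training procedure can be sketched as follows. This is only an illustrative sketch under stated assumptions: the `NearestCentroid` class is a toy stand-in for the base classifier (the framework itself uses RF), and the `threshold` and `max_iter` parameters are hypothetical choices for the "high-probability" criterion and the stopping condition, which the text does not specify.

```python
import math
from collections import defaultdict


class NearestCentroid:
    """Toy base classifier (an assumption; the paper uses a random forest)."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for d, value in enumerate(features):
                acc[d] += value
            counts[label] += 1
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict_proba(self, x):
        # Softmax over negative distances: a closer centroid gets a higher probability.
        scores = {c: math.exp(-math.dist(x, m)) for c, m in self.centroids.items()}
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}


def self_train(X_l, y_l, X_u, threshold=0.8, max_iter=10):
    """Basic self-training; threshold and max_iter are hypothetical settings."""
    X_train, y_train = list(X_l), list(y_l)
    X_u = list(X_u)
    clf = NearestCentroid().fit(X_train, y_train)        # Step 1: initialization
    for _ in range(max_iter):
        confident, remaining = [], []
        for x in X_u:                                     # Step 2: selection
            proba = clf.predict_proba(x)
            label = max(proba, key=proba.get)
            target = confident if proba[label] >= threshold else remaining
            target.append((x, label))
        if not confident:                                 # nothing new to learn from
            break
        X_u = [x for x, _ in remaining]                   # Step 3: updating
        X_train += [x for x, _ in confident]
        y_train += [label for _, label in confident]
        clf = NearestCentroid().fit(X_train, y_train)     # Step 4: retraining
    return clf, X_train, y_train
```

In this sketch an ambiguous sample that never exceeds the threshold simply stays unlabeled, which mirrors the idea that only confident predictions should enter the training set.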

The Spatial-Contextual Information-Based Self-Training Classification Framework (SCSF) for Local Climate Zones
Spatial-contextual information has been shown to be important for LCZ classification [24,45]; however, it is usually handled independently of the classification step and, without a unified theoretical basis, this causes the appearance of small, isolated noise areas. Considering this fact, the proposed spatial-contextual information-based self-training framework (SCSF) introduces conditional random fields (CRF) and a self-training method to provide a better solution. To be specific, CRF is adopted for LCZ classification to integrate the spatial-contextual information directly into the classification by simultaneously modeling the relationship between samples and labels and the spatial correlation among samples and labels, which provides strong theoretical guidance. Furthermore, the potential function in CRF uses a probabilistic classification map produced by the self-training method, which is improved by spatial-contextual information-based pseudo-label selection: when the training samples are limited in LCZ mapping, the original labeled dataset is enriched with pseudo-labels and the classifier is retrained.
Firstly, the initial feature space is built with Landsat 8 spectral data and the remote sensing indices of the modified normalized difference water index (MNDWI), the normalized difference vegetation index (NDVI), the ratio vegetation index (RVI), the bare soil index (BSI), the normalized difference building index (NDBI), and the normalized difference impervious surface index (NDISI). After principal component analysis (PCA) transformation, the features are divided into labeled and unlabeled sets, and the labeled data are used to train the base classifier. In the meantime, the efficient graph-based segmentation mask is generated from the input data. Next, the pseudo-labeled samples are selected according to the segmentation and prediction. With the adjustment of the base classifier, the performance is gradually improved and becomes much closer to the specific LCZ scheme. Finally, the CRF models the log-based unary potentials and the edge feature function-based pairwise potentials to simulate the relationship and spatial correlation between samples and labels, which offers a better and smoother prediction. The workflow of the spatial-contextual information-based self-training framework is shown in Figure 4 and is described in the following.
Remote Sens. 2019, 11, x FOR PEER REVIEW 7 of 28

Feature Extraction Based on Multiple Indices
According to the height, compactness, surface cover, and thermal admittance, the LCZ system mixes independent land-cover elements, such as building, tree, farm land, and road, to form unique LCZ types, which leads to an inferior spectral separability, especially for the similar categories. In order to express the characteristics of the LCZ categories, six beneficial remote sensing indices are extracted: MNDWI, NDVI, RVI, BSI, NDBI, and NDISI. The indices are introduced in the following.
The MNDWI [46] is a modified formula for extracting water areas, which is, of course, helpful for LCZ G. Compared with the normalized difference water index (NDWI), the MNDWI changes the original band combination by replacing the near-infrared with the mid-infrared band, which is more effective in distinguishing water information from built-up areas.
The NDVI [47] is a common ratio in vegetation identification, and it is beneficial not only for land-cover types such as LCZ A-D but also for the built types, such as LCZ 1-3. The NDVI can be used to describe the growth of plants, with a value ranging from −1 to 1, where a negative value means high-reflectivity objects and a positive value represents vegetation.
The RVI [48] is sensitive to high-density vegetation and sharply decreases when the vegetation fraction is less than 50%. Green and healthy plants make the RVI much larger than 1, while for land without vegetation, the value is around 1. The RVI has potential for types such as LCZ A-C, as it can reflect the sparseness or density of vegetation.
The BSI [49] is usually used for discriminating bare land from other land covers, such as built-up, water, and vegetation, since the value is much higher in bare land areas. With the help of the near-infrared and mid-infrared bands, the BSI is effective in identifying the soil-related LCZ categories, such as LCZ 7, C, and F.
The NDBI [50] is regarded as a substitute for the building surface fraction (BSF), the ratio of building plan area to total plan area, which is one of the 10 basic physical properties of LCZs. The NDBI can provide building information through its density, usually denoted by a specific value.
The NDISI [51] introduces the thermal infrared band to differentiate impervious surfaces from soil. Moreover, it has the ability to extract more accurate impervious area information through the restriction of the negative influence of sand and water. The NDISI is used as a replacement for the impervious surface fraction (ISF), a ratio of impervious plan area (paved, rock) to total plan area (%).
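As a brief illustration of the band arithmetic behind four of these indices, the sketch below uses their commonly cited definitions; the formulas are not spelled out in this text, so they are stated here as an assumption, as are the function names and the Landsat 8 band mapping given in the comments. The BSI and NDISI are omitted because their exact formulations are not given here.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator vanishes."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


# Assumed Landsat 8 mapping: green = band 3, red = band 4,
# near-infrared (NIR) = band 5, mid-infrared (MIR/SWIR1) = band 6.

def ndvi(nir, red):
    """Normalized difference vegetation index."""
    return normalized_difference(nir, red)


def rvi(nir, red):
    """Ratio vegetation index; around 1 without vegetation, much larger with it."""
    return nir / red if red != 0 else float("inf")


def mndwi(green, mir):
    """Modified NDWI: the NIR band of the NDWI is replaced by the MIR band."""
    return normalized_difference(green, mir)


def ndbi(mir, nir):
    """Normalized difference building index."""
    return normalized_difference(mir, nir)
```

Applied per pixel to reflectance values, each function yields one extra feature layer that can be stacked with the spectral bands before the PCA step.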

Probabilistic Classification with Self-Training
In order to provide a more reliable probabilistic classification map in the case of limited expert-labeled data, a self-training method is considered. Since the spatial-contextual information is important for LCZ classification, it is also used to improve the self-training method. The proposed self-training approach with spatial-contextual information-based pseudo-label selection depends on two main assumptions. The first is that samples with more similar features are more likely to belong to the same category, which is exploited by the regional information-based segmentation step. The second is that a pixel surrounded by others with the same label is more reliable, which is exploited by the local information-based candidate identification. The experiments undertaken in this study confirmed that both the regional constraints and the neighborhood information are beneficial for more reliable pseudo-label selection. The improved self-training method is shown in Figure 5.

Segmentation of Features
The Felzenszwalb [52] method in a Python package is used to segment the input features and to generate a segmentation mask, which introduces the regional spatial information to cluster similar features. As a classical image segmentation algorithm, it first builds Felzenszwalb's efficient graph over the pixels and then merges components along a minimum spanning tree. The final segmentation comprehensively considers the internal difference of a component and the difference between two components.
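To make the merging criterion concrete, the following is a simplified, single-band sketch of graph-based segmentation in the spirit of this method; it is not the package implementation used in the study. The function name `segment_grid`, the 4-connected grid, the absolute-difference edge weight, and the scale parameter `k` are all assumptions made for illustration.

```python
def segment_grid(image, k=1.0):
    """Segment a 2-D list of floats; returns a grid of component ids."""
    rows, cols = len(image), len(image[0])
    parent = list(range(rows * cols))
    size = [1] * (rows * cols)
    internal = [0.0] * (rows * cols)  # largest merged edge inside each component

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Build the grid graph: one edge per horizontal/vertical neighbor pair.
    edges = []
    for r in range(rows):
        for c in range(cols):
            here = r * cols + c
            if c + 1 < cols:
                edges.append((abs(image[r][c] - image[r][c + 1]), here, here + 1))
            if r + 1 < rows:
                edges.append((abs(image[r][c] - image[r + 1][c]), here, here + cols))

    # Process edges in increasing weight, as when growing a minimum spanning tree.
    for weight, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Merge only if the difference between the two components does not
        # exceed the (size-relaxed) internal difference of either one.
        if weight <= min(internal[ra] + k / size[ra], internal[rb] + k / size[rb]):
            parent[rb] = ra
            size[ra] += size[rb]
            internal[ra] = weight  # valid because weights arrive in sorted order

    return [[find(r * cols + c) for c in range(cols)] for r in range(rows)]
```

A larger `k` relaxes the merging test and therefore tends to produce larger segments.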

The LCZ Base Classifier
After feature extraction, the dataset $D$ is divided into an unlabeled set $D_u = \{X_u\}$ and a labeled set $D_l = \{X_l, y_l\}$, and the latter is further split into a training set $D_{train} = \{X_{train}, y_{train}\}$ and a test set $D_{test} = \{X_{test}, y_{test}\}$. The base classifier, RF, is then learned on the training set $D_{train}$. RF [53] is a diversity-based ensemble method, and its diversity comes from the various training sets with random samples and features. The randomness in RF comes from the random sample selection, where each base classifier has its own dataset through the bootstrap strategy. Further randomness comes from the random attribute selection, where different features are input to the learning process. The final result is voted for by a group of base classifiers in RF, which provides strong robustness.
There are two reasons for employing the RF classifier. Firstly, RF has an innate advantage in dealing with a small amount of data due to its bootstrap technique, which can enlarge the original database by sampling with replacement. Secondly, the ensemble strategy, i.e., the majority voting method, makes RF insensitive to noise. One forest includes a group of base classifiers, which are mostly classification and regression trees (CARTs) [54], and each tree is trained on a different dataset. Even if misclassification appears in some of the results, the final vote will be stable. Overall, RF is adequate as the base classifier when labels are scarce.

Selection of Candidates
The candidate selection consists of two steps: (1) collecting homogeneous segments and (2) selecting potential samples. At this stage, the prediction is superimposed on the segmentation mask, which places regional spatial constraints on the unreliable predictions, i.e., combines the labels with the input data. According to the first hypothesis introduced previously, i.e., neighboring pixels in homogeneous regions usually have the same label, a block in which most pixels share the same label is defined as a homogeneous region. The pixels carrying the most frequent label inside a homogeneous block then form the final candidates.
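The two candidate-selection steps can be sketched as below. The text does not state how many identical labels make a block homogeneous, so the `min_purity` threshold is a hypothetical parameter, as are the function name and the flat, index-aligned input lists.

```python
from collections import Counter, defaultdict


def select_candidates(segments, predictions, min_purity=0.8):
    """Return {pixel_index: label} for candidates in homogeneous segments.

    segments[i] is the segment id of pixel i and predictions[i] is its
    predicted label; min_purity is an assumed homogeneity threshold.
    """
    by_segment = defaultdict(list)
    for idx, seg in enumerate(segments):
        by_segment[seg].append(idx)

    candidates = {}
    for pixels in by_segment.values():
        # Step 1: a segment is homogeneous if one label dominates it.
        label, count = Counter(predictions[i] for i in pixels).most_common(1)[0]
        if count / len(pixels) >= min_purity:
            # Step 2: keep only the pixels carrying the dominant label.
            for i in pixels:
                if predictions[i] == label:
                    candidates[i] = label
    return candidates
```

Segments in which no label reaches the threshold contribute no candidates, so uncertain regions are left out of the subsequent pseudo-label selection.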

Selection of Pseudo-Labels
There are also two procedures used to identify the final pseudo-samples: (1) capturing the 8-neighbor information and (2) sorting the calculated entropy. According to the second assumption mentioned previously, candidates with a lower entropy are likely to be the most useful pseudo-labels.
To be specific, the neighborhood information is captured by a 3 × 3 moving window. If there are $k$ pixels having the same label among the surrounding 8-neighbors, then the label frequency for each class is defined as follows:

$$p_i^j = \frac{1}{|N|} \sum_{n \in N} k_{in}^{j}$$

where $p_i^j$ represents the label frequency of class $j$ for the $i$th pixel in the candidate set, $n$ is a neighbor of the $i$th pixel in the 8-neighbor region $N$, and $k_{in}^{j}$ indicates whether the neighbor $n$ of the $i$th pixel belongs to class $j$.
The label frequency $p_i^j$ is then used to calculate the entropy as follows:

$$e_i = -\sum_{j=1}^{|C|} p_i^j \log p_i^j$$

where $e_i$ is the entropy of the $i$th pixel in the candidate set and $|C|$ represents the number of classes. The entropy captures the certainty of the prediction, with a higher value denoting greater uncertainty.
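A minimal sketch of this entropy-based ranking is given below. It assumes that the label frequency is the share of the available 8-neighbors carrying each label, that the natural logarithm is used, and that border pixels simply have fewer neighbors; the function names and the `n_select` parameter are hypothetical.

```python
import math


def neighbor_entropy(label_map, row, col):
    """Entropy of the label frequencies among the 8-neighbors of (row, col)."""
    rows, cols = len(label_map), len(label_map[0])
    counts = {}
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < rows and 0 <= c < cols:
                counts[label_map[r][c]] = counts.get(label_map[r][c], 0) + 1
    total = sum(counts.values())
    # Classes absent from the neighborhood contribute nothing to the sum.
    return -sum((k / total) * math.log(k / total) for k in counts.values())


def pick_pseudo_labels(label_map, candidates, n_select):
    """Return the n_select candidate positions with the lowest entropy."""
    ranked = sorted(candidates, key=lambda rc: neighbor_entropy(label_map, *rc))
    return ranked[:n_select]
```

A pixel whose neighbors all share one label has zero entropy and is ranked first, which matches the assumption that such pixels are the most reliable.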

Update and Stopping Condition
The selected pseudo-labels are added to the original training set, which offers more information to the base classifier. The latest prediction is then fed back to the base classifier to assist the new training process. The classifier is gradually adjusted until a stable classification accuracy is achieved and the final prediction is produced.

Conditional Random Fields (CRF) for LCZ Classification
The majority of the existing LCZ mapping methods separate the classification and the consideration of the spatial-contextual information into independent steps, which causes the appearance of small, isolated noise regions. However, the concept of an LCZ covers hundreds to thousands of meters, which means that the local climate of a very small area is usually discarded. CRF [55] is able to solve this problem by directly bringing the spatial-contextual information into the classification, which also provides a unified theoretical basis for better exploring the spatial-contextual information.
The most common CRF for the image classification task, the pairwise CRF [56], is used to model the spatial dependencies among the 8-neighbors. The energy function $E(x)$ is defined as follows:

$$E(x) = \sum_{i \in V} \psi_i(x_i, y) + \lambda \sum_{i \in V} \sum_{j \in N_i} \psi_{ij}(x_i, x_j, y)$$

where $i$ is the location of a pixel in the image data $V = \{1, 2, \cdots, K\}$, $K$ is the total number of pixels, $x_i$ is the label corresponding to the $i$th pixel, and $y$ is the original input data. $\psi_i(x_i, y)$ and $\psi_{ij}(x_i, x_j, y)$ represent the unary potentials and pairwise potentials, respectively. $\lambda$ is a nonnegative constant used to trade off the unary and pairwise potentials. $N_i$ represents the local neighborhood of pixel $i$. The unary potentials $\psi_i(x_i, y)$ reflect the correlation between a single pixel and a particular label and are defined as follows:

$$\psi_i(x_i, y) = -\ln P(x_i = l_t)$$

where $P(x_i = l_t)$ denotes the probability of $x_i$ being labeled as $l_t$, for which the probabilistic classification map from the RF classifier is used. The pairwise potentials $\psi_{ij}(x_i, x_j, y)$ incorporate the spatial-contextual information with the image pixels and labels and model a smoothness prior as follows:

$$\psi_{ij}(x_i, x_j, y) = \begin{cases} 0, & x_i = x_j \\ g(i, j), & \text{otherwise} \end{cases}$$

where $g(i, j)$ is an edge feature function measuring the difference among neighbors and the pair $(i, j)$ represents the locations of neighboring pixels. Within $g(i, j)$, $\theta_v$ controls the degree of smoothing, and $\theta_w$ is designed as the mean-square difference between the spectral vectors of adjacent pixels over the whole image.
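To illustrate how the energy trades off the two potentials, the sketch below evaluates the energy of a given labeling on a small grid. It assumes a negative-log unary term and a Potts-style pairwise term that is zero for equal labels and equal to the edge feature function otherwise. Because the explicit form of the edge feature function is not reproduced here, it is passed in as a callable `g`, and the constant default is purely a placeholder assumption.

```python
import math


def crf_energy(labels, proba, lam=0.5, g=lambda i, j: 1.0):
    """Energy of a labeling under a pairwise CRF with an 8-neighborhood.

    labels[r][c] is the class index of a pixel, proba[r][c][k] is the
    classifier probability of class k, and g is a placeholder edge
    feature function taking two pixel positions (an assumption).
    """
    rows, cols = len(labels), len(labels[0])
    unary = 0.0
    pairwise = 0.0
    for r in range(rows):
        for c in range(cols):
            # Unary term: negative log-probability of the assigned label.
            unary += -math.log(proba[r][c][labels[r][c]])
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr, dc) == (0, 0) or not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    # Pairwise term: penalize neighbors with different labels.
                    if labels[r][c] != labels[nr][nc]:
                        pairwise += g((r, c), (nr, nc))
    # Each neighboring pair is visited from both ends, matching a double sum
    # over pixels and their neighborhoods.
    return unary + lam * pairwise
```

With uniform probabilities, a labeling that contains one isolated pixel has a higher energy than a homogeneous one, which is why minimizing the energy suppresses small, isolated regions.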

Experimental Description
To test the performance of the proposed SCSF method, three experiments were conducted, each using two Landsat 8 images and one ground truth provided as part of the 2017 GRSS Data Fusion Contest (2017DFC). The ability of the introduced CRF method to integrate spatial-contextual information was compared with that of two other LCZ mapping methods. RF classification was used as the baseline; the widely accepted LCZ mapping workflow proposed by WUDAPT, comprising RF classification and majority-filter (MJ)-based spatial smoothing, was used as another reference framework, denoted as RF+MJ(WUDAPT); and the CRF for LCZ classification was denoted as RF+CRF. In addition, to evaluate the capacity of the self-training-based probabilistic classification, the proposed improved self-training method, denoted as ST, was integrated into the above approaches. The main comparison experiments were thus divided into two parts: supervised methods (i.e., RF, RF+MJ(WUDAPT), and RF+CRF) and semi-supervised methods (i.e., RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)). Two other machine learning classifiers that have been widely used in LCZ classification [57,58], Naive Bayes (NB) and Support Vector Machine (SVM), were also tested. Finally, the improved WUDAPT method [28,45], denoted as CI-WUDAPT, which considers contextual characteristics including the mean, minimum, maximum, median, and 25th and 75th quantile values of all pixels in a 3 × 3 window, was included for comparison with the proposed SCSF.
All the experiments were implemented in Python, and the RF classifier was configured with 32 estimators and a maximum depth of 10. The majority filter in the WUDAPT method used a 3 × 3 square window, and the nonnegative constant λ in the CRF was set to 0.5. It is well known that the configurations of the training set and testing set play important roles in the assessment of LCZ classification [28,45,59]. Since ground truth data are usually limited, 10 labeled samples were randomly selected in each class to simulate the insufficiency of labeled samples, and the remaining samples were used as the testing set, a setup that is widely used in semi-supervised studies [60,61]. The final experimental results are the average over 10 runs, which makes them more representative.
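The sampling protocol described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `split_per_class` and the use of a run-specific seed are assumptions, while the value of 10 samples per class follows the text.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=10, seed=0):
    """Return (train_idx, test_idx) with n_train random samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        train.extend(idxs[:n_train])   # a few labeled samples for training
        test.extend(idxs[n_train:])    # all remaining samples for testing
    return sorted(train), sorted(test)


# Hypothetical usage: one split per run, so metrics can be averaged over runs.
example_labels = [0] * 50 + [1] * 30
splits = [split_per_class(example_labels, n_train=10, seed=s) for s in range(10)]
```

Changing the seed for each of the 10 runs gives 10 different training sets, and averaging the resulting accuracies reduces the influence of any single lucky or unlucky draw.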
The quantitative performance is assessed by three kinds of accuracy: (1) the accuracy of each class; (2) the overall accuracy (OA), which denotes the percentage of correctly classified samples; and (3) the kappa coefficient (Kappa). Moreover, in order to evaluate the statistical significance of the differences between the compared algorithms, McNemar's test [62] was applied under the same classification conditions. Given two classifiers $C_1$ and $C_2$, McNemar's test can be computed as follows:
$$X^2 = \frac{(M_{12} - M_{21})^2}{M_{12} + M_{21}}$$
where $M_{12}$ represents the number of pixels misclassified by $C_1$ but not by $C_2$ and $M_{21}$ represents the number of pixels misclassified by $C_2$ but not by $C_1$. If $M_{12} + M_{21} \geq 20$, this statistic can be considered to follow a chi-squared distribution with one degree of freedom, $\chi^2_1$. McNemar's test can thus evaluate whether the difference between the results of two classifiers is significant. At the common 5% level of significance, $\chi^2_{0.05,1} = 3.841459$, so if $X^2$ is greater than $\chi^2_{0.05,1}$, the performances of the two classifiers are significantly different.
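The three measures can be sketched in a few lines. The block below is an illustrative sketch only: the function names are hypothetical, and the McNemar statistic is assumed to take the uncorrected form (M12 - M21)^2 / (M12 + M21), that is, without a continuity correction.

```python
from collections import Counter

CHI2_CRIT_5PCT_1DF = 3.841459  # critical value quoted in the text


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    po = overall_accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if pe == 1.0:
        return 1.0  # degenerate case: a single class in both vectors
    return (po - pe) / (1 - pe)


def mcnemar(y_true, pred1, pred2):
    """Assumed uncorrected McNemar statistic for two classifiers."""
    # m12: wrong for classifier 1 but right for classifier 2, and vice versa.
    m12 = sum(a != t and b == t for t, a, b in zip(y_true, pred1, pred2))
    m21 = sum(b != t and a == t for t, a, b in zip(y_true, pred1, pred2))
    if m12 + m21 == 0:
        return 0.0
    return (m12 - m21) ** 2 / (m12 + m21)
```

For example, if one classifier is wrong on 25 pixels that the other gets right, and the reverse holds for 5 pixels, the statistic is 20² / 30 ≈ 13.33, which exceeds 3.841459 and would be read as a significant difference.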

Berlin Experiments
As a city that pays close attention to urban planning, Berlin, Germany, has a balanced urban spatial structure, which makes it a model for urban studies. The LCZ types in Berlin comprise six built types (i.e., LCZ 2 compact mid-rise, LCZ 4 open high-rise, LCZ 5 open mid-rise, LCZ 6 open low-rise, LCZ 8 large low-rise, and LCZ 9 sparsely built) and six natural types (i.e., LCZ A dense trees, LCZ B scattered trees, LCZ C bush or scrub, LCZ D low plants, LCZ F bare soil or sand, and LCZ G water). The first experiment was conducted using two down-sampled Landsat 8 images with a 100-m spatial resolution from 2017DFC, which were acquired on 25th March and 10th April 2014. The experimental images contained 666 × 643 pixels, with seven multi-spectral bands (1-7) and two thermal infrared bands (10-11). A false-color image consisting of three bands (red, green, and blue (RGB)) is shown in Figure 6a. The spatial distribution of the corresponding labels is presented in Figure 6b, and the number of labeled samples for each type is given in Table 1. In addition, the training data were randomly sampled from the ground truth for each class.
Table 1. Number of labeled samples for each LCZ type in Berlin (classes in the order listed above): LCZ 2, 1534; LCZ 4, 577; LCZ 5, 2448; LCZ 6, 4010; LCZ 8, 1654; LCZ 9, 761; LCZ A, 4960; LCZ B, 1028; LCZ C, 1050; LCZ D, 4424; LCZ F, 359; LCZ G, 1732.
The LCZ classification maps obtained by the different frameworks (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) for the Berlin images are displayed in Figure 7a-i, respectively.
As the figures show, classification without spatial-contextual information, i.e., RF, presents a lot of salt-and-pepper noise. After the spatial constraints from the majority-filter-based spatial smoothing are introduced, the RF+MJ(WUDAPT) workflow produces a smoother LCZ map with a better visual effect. However, the majority filter only considers the narrow neighborhood information given by the predictions, and small, isolated noise areas remain. The RF+CRF method performs much better in mitigating this problem. Nevertheless, misclassification arises from the insufficient training samples, as in area 1 of Figure 7d, where the region labeled LCZ F bare soil or sand (yellow) should be LCZ D low plants (green). This error also confuses the CRF modeling, causing area 1 of Figure 7f (i.e., RF+CRF) to be mislabeled as LCZ F bare soil or sand. Considering this, the semi-supervised self-training methods were then applied to improve the initial predictions. Supported by the training samples enriched with pseudo-labels, the RF+ST method provides a cleaner map than RF, and the comparison between RF+ST+MJ and RF+MJ shows the same trend. Moreover, area 1 of Figure 7g,h is corrected to LCZ D low plants (green). The RF+ST+CRF(SCSF) workflow performs competitively in removing the misclassified noise, owing to the use of unlabeled data through the improved self-training method and the unified spatial-contextual information-based classification through CRF. Furthermore, the results of NB and SVM shown in Figure 7a,b are relatively poor, with many fragmented segments.
In the prediction of CI-WUDAPT, the boundaries among different classes are smoother after the extraction of regional spatial-contextual features, but heavy misclassification appears with the limited training samples (10 samples per class), such as large areas of LCZ G water (blue) being misclassified as LCZ C bush or scrub (light blue). In terms of visual performance, the proposed SCSF handles the fragmented segments and misclassified areas better than the other methods, which demonstrates its capability.
In addition, Table 2 provides the pairwise comparison between the nine methods (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) using McNemar's test. The value of McNemar's test indicates the difference between the results of two classifiers; if the value is greater than $\chi^2_{0.05,1}$ (3.841459), the difference is considered significant. Furthermore, the greater the value, the more significant the difference. The data show that all the values are greater than $\chi^2_{0.05,1}$. The value between SCSF and SVM is especially large, at 2970.46, indicating a significant difference between the two approaches.
Moreover, every pair of methods differs significantly from a statistical standpoint in Berlin. To better assess the effectiveness of the proposed SCSF method, a quantitative comparison of the different methods (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) is provided in Table 3. It shows that the classification workflows considering spatial-contextual information (i.e., CI-WUDAPT, RF+MJ(WUDAPT), RF+CRF, RF+ST+MJ, and RF+ST+CRF(SCSF)) improve OA by nearly 5-18% and Kappa by 0.06-0.2 over the single classification results (i.e., NB, SVM, RF, and RF+ST), confirming the importance of spatial-contextual information for LCZ mapping. Furthermore, the CRF-based classification workflows (i.e., RF+CRF and RF+ST+CRF(SCSF)) deliver an enhancement of approximately 4% in OA and 0.05 in Kappa compared with the independent majority-filter-based approaches (i.e., RF+MJ and RF+ST+MJ), which means that simultaneously modeling the correlation between labels and samples, in addition to the spatial relationship among samples, is very helpful. In addition, the accuracies of the self-training-based semi-supervised methods (i.e., RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) are higher than those of the supervised methods (i.e., RF, RF+MJ(WUDAPT), and RF+CRF), with improvements of nearly 2% and 0.2 in OA and Kappa, respectively, demonstrating the effectiveness of the generated pseudo-labels. The SVM classifier yields the lowest accuracy, with an OA 18% lower than that of SCSF, although this might be improved through careful tuning of its parameters. CI-WUDAPT gives relatively high accuracy among the supervised methods; however, there is still a gap of 5% in OA and 0.06 in Kappa compared with SCSF.
The proposed RF+ST+CRF(SCSF) workflow shows the best quantitative performance among all the compared methods with only 10 labeled samples for each class, and its accuracies of 79.83% in OA and 0.77 in Kappa are acceptable. However, the scattered LCZ types (i.e., LCZ 4 open high-rise, LCZ B scattered trees, and LCZ C bush or scrub) present inferior accuracies with the semi-supervised methods, which suggests that insufficient spatial-contextual information is obtained for these LCZ types and that this reduces their accuracies.

São Paulo Experiments
The second experimental area is São Paulo, Brazil, a city in the southern hemisphere with a more diverse urban form. The LCZ types in São Paulo cover almost all the built and natural classes except for LCZ 7 lightweight low-rise and LCZ C bush or scrub. Two cloudless Landsat 8 images from 2017DFC, acquired on 8th February 2014 and 23rd September 2015, constituted the second experimental dataset. As in the first experiment, the images were down-sampled to a 100-m spatial resolution, giving a dimension of 1067 × 871 pixels. Nine bands (i.e., bands 1-7 and 10-11) covering the visible to infrared spectrum were prepared. The RGB (i.e., bands 4, 3, 2) false-color image and the spatial distribution of the labeled data are shown in Figure 8a,b, respectively. Information about the class numbers is provided in Table 4.

The LCZ maps produced by the different approaches (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) are shown in Figure 9a-i. Unlike in Berlin, much more salt-and-pepper noise appears in São Paulo with the RF-based classification (i.e., RF), and this noise is visibly reduced after the majority-filter-based spatial smoothing (i.e., RF+MJ) and the CRF-based classification (i.e., RF+CRF). Although the prediction of RF+CRF presents much smoother class boundaries than the first two maps, the low-quality probabilistic map it receives, which directly influences the modeling of the potential functions, still causes a large amount of misclassification.
In particular, area 1 (green) of Figure 9f should be water, but the mislabeled spatial context confuses the CRF model, and the blue water region becomes green vegetation. Area 2 of Figure 9f is also heavily influenced by its surrounding mislabeled data, where LCZ 3 compact low-rise (rose red) is misclassified as LCZ 2 compact mid-rise (dark red). To alleviate these problems, the improved self-training-based classification workflows (i.e., RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) were further adopted. The improvements are apparent in Figure 9g-i, where the scattered noise is much reduced compared with the supervised results (i.e., RF, RF+MJ(WUDAPT), and RF+CRF). In particular, areas 1-2 of Figure 9i are correctly predicted as their real LCZ types, i.e., LCZ G water (blue) and LCZ 3 compact low-rise (rose red). The LCZ map generated by the proposed SCSF method provides not only visibly better boundaries than the workflow without spatial-contextual information (i.e., RF+ST) but also a cleaner and more accurate prediction than the result without self-training (i.e., RF+CRF). Moreover, the classifications of NB, SVM, and CI-WUDAPT perform quite differently from that of SCSF. A lot of misclassification noise appears in Figure 9a,b, and large misclassified areas stand out in Figure 9c, which indicates that the visual performance of SCSF is better than that of these previous approaches.
Moreover, McNemar's test between the different methods (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) is shown in Table 5. All the values obtained for the pairwise comparisons of classification workflows are much greater than $\chi^2_{0.05,1}$ (3.841459), which means that significant differences were found between the compared methods. The value of McNemar's test between NB and SCSF is the largest, indicating that the proposed workflow gives a statistically significant improvement over the NB classification. A quantitative report of the accuracies of the different methods (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) is given in Table 6. To evaluate the contribution of the spatial-contextual information, the accuracies of RF and RF+ST serve as baselines in the main part of the experiments and are compared with those of the other workflows (i.e., RF+MJ(WUDAPT), RF+CRF, RF+ST+MJ, and RF+ST+CRF(SCSF)). The majority-filter-based spatial smoothing methods (i.e., RF+MJ(WUDAPT) and RF+ST+MJ) present an improvement of about 7% in OA and 0.07-0.08 in Kappa, while the CRF-based methods (i.e., RF+CRF and RF+ST+CRF(SCSF)) show a larger improvement of nearly 12% in OA and 0.13-0.14 in Kappa.
In addition, the pseudo-labels generated by the self-training method provide an improvement of 3-4% in OA and 0.03-0.04 in Kappa when the supervised workflows (i.e., RF, RF+MJ(WUDAPT), and RF+CRF) are compared with the corresponding semi-supervised approaches (i.e., RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)), which demonstrates the effectiveness of the proposed strategy. Moreover, the NB classifier presents the worst performance, with just 65.27% OA and 0.6 Kappa, which suggests that this kind of method is unsuitable for a city with a diverse urban form when samples are limited. By directly using the spatial-contextual information, CI-WUDAPT shows a 5-13% improvement in OA over the traditional approaches (i.e., NB and SVM); however, the proposed SCSF method delivers a superior performance, with 86.4% in OA and 0.84 in Kappa. Nevertheless, the LCZ types with relatively few testing samples (i.e., LCZ 2 compact mid-rise, LCZ 4 open high-rise, LCZ 5 open mid-rise, LCZ B scattered trees, LCZ D low plants, LCZ E bare rock or paved, and LCZ F bare soil or sand) present poor performance, which can be explained by the testing samples in São Paulo being clustered in the central areas and therefore unable to reflect the overall conditions well.

Paris Experiments
The city of Paris, France, was selected as the last experimental study area to assess the performance of the proposed SCSF method in a high-density city. The LCZ types comprise seven built types (i.e., LCZ 1 compact high-rise, LCZ 2 compact mid-rise, LCZ 4 open high-rise, LCZ 5 open mid-rise, LCZ 6 open low-rise, LCZ 8 large low-rise, and LCZ 9 sparsely built) and five natural types (i.e., LCZ A dense trees, LCZ B scattered trees, LCZ D low plants, LCZ E bare rock or paved, and LCZ G water), revealing the diversity of the urban region in Paris. The third experimental dataset comprised two Landsat 8 images of 1160 × 988 pixels provided by 2017DFC, acquired on 19th May 2014 and 27th September 2015. The spatial resolution was again 100 m after down-sampling, and the band information was the same as before, i.e., bands 1-7 and 10-11, amounting to nine channels. Figure 10a shows the false-color RGB image (i.e., bands 4, 3, 2), and an overview of the corresponding types is presented in Figure 10b. The sample numbers of each class are listed in Table 7.
Remote Sens. 2019, 11
The performance of the different methods (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) is shown in Figure 11a-i. The salt-and-pepper noise in Paris appears more serious than in Berlin and São Paulo, while the variation of the categories is moderate. Compared with the results in the second column (i.e., RF and RF+ST), the other classification workflows (i.e., RF+MJ(WUDAPT), RF+CRF, RF+ST+MJ, and RF+ST+CRF(SCSF)) alleviate the noise issue by using further information from the spatial context. However, as in the São Paulo experiments, confusion appears in the results of RF+CRF: the region labeled LCZ 4 open high-rise (rose red) in area 1 of Figure 11f should be LCZ 6 open low-rise (brown), and the region labeled LCZ C bush or scrub (light blue) in area 2 of Figure 11f should be LCZ A dense trees (green), which may have been caused by the insufficient training samples. The LCZ maps in the third row are produced by the improved approaches (i.e., RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)). Supported by the enriched training data, the number of confused pixels decreases considerably from the start of the classification compared with Figure 11d,f, which shows the importance of the training information. In particular, the misclassified areas 1-2 of Figure 11f are corrected to their real types, i.e., LCZ 6 open low-rise (brown) and LCZ A dense trees (green). The classifications of NB, SVM, and CI-WUDAPT in Paris are better than in the two cities above, with less evident misclassified segments. However, LCZ G water nearly disappears in Figure 11a, and fragmented noise or areas still stand out in Figure 11b,c. In addition, the developed SCSF workflow produces the best visual performance, eliminating the isolated pixels and the small, isolated areas.
Moreover, to provide statistical comparisons, McNemar's values between the abovementioned methods (i.e., NB, SVM, CI-WUDAPT, RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) are given in Table 8. As for the other two cities, all the values of McNemar's test in Paris are greater than $\chi^2_{0.05,1}$ (3.841459). The value between RF+CRF and SCSF (35.66) is relatively small compared with the values between RF and RF+CRF (1854.59) and between RF and SCSF (1854.51), showing that, although the semi-supervised approach gives some improvement in Paris, the spatial-contextual information-based CRF methods account for the more statistically significant gain. In terms of the quantitative performance shown in Table 9, the experimental accuracies for Paris are much better than expected and reach the highest level among all three experiments. As in the previous experiments, the accuracies improve to different degrees with the assistance of the spatial context and the use of unlabeled data. To be specific, the result of the proposed SCSF method shows an improvement of about 7% in OA and 0.1 in Kappa compared with RF+ST; however, the improvement is not apparent in the comparison with RF+ST+MJ and RF+CRF.
The explanation for this is that, when the initial classification is already acceptable, the performance improvement of the proposed approach may not be that significant, since SCSF is aimed at solving the unreliable probability issue. Moreover, the accuracies of SCSF exceed those of NB, SVM, and CI-WUDAPT by about 6-16% in OA and 0.08-0.2 in Kappa. In particular, LCZ types with an aggregation effect, such as LCZ D low plants, are clearly improved after fusing the spatial-contextual information. In contrast, the accuracies of some LCZ types that have a dispersed spatial distribution and relatively few testing samples (i.e., LCZ 9 sparsely built, LCZ B scattered trees, and LCZ E bare rock or paved) show a decreasing trend. The reasons are as follows: (1) SCSF is a spatial-contextual information-based method, and when the LCZ types are scattered or sparse, the spatial-contextual information is insufficient, which may degrade the accuracies of these LCZ types; (2) the small number of testing samples for these LCZ types cannot credibly reflect the real conditions. In summary, for the main experimental part, the proposed SCSF gives the best performance in all three study areas compared with the other five methods (i.e., RF, RF+MJ(WUDAPT), RF+CRF, RF+ST, and RF+ST+MJ) in terms of OA and Kappa, showing that the spatial-contextual information-based self-training classification framework for LCZs is effective. The values of McNemar's test between the proposed SCSF and the other methods are considerably large, which indicates statistically significant differences.
Moreover, the methods considering the spatial-contextual information (i.e., RF+MJ(WUDAPT), RF+CRF, RF+ST+MJ, and RF+ST+CRF(SCSF)) performed better than the other approaches (i.e., RF and RF+ST), especially the CRF-based methods (i.e., RF+CRF and RF+ST+CRF(SCSF)), which produce the best accuracies in OA and Kappa. In addition, the semi-supervised approaches (i.e., RF+ST, RF+ST+MJ, and RF+ST+CRF(SCSF)) provide improvements of varying size over the supervised methods (i.e., RF, RF+MJ(WUDAPT), and RF+CRF) in the three experiments.
Furthermore, the performances of three other methods (i.e., NB, SVM, and CI-WUDAPT) in the three study areas were compared with that of the proposed SCSF. NB and SVM perform relatively poorly, with a lot of salt-and-pepper noise and many misclassified areas, whereas the spatial-contextual information-based approaches (i.e., CI-WUDAPT and SCSF) generate smoother predictions, and SCSF delivers the best visual performance. Compared with the traditional machine learning classifiers NB and SVM, the proposed SCSF gives improvements of 10.65%-21.23% in OA with 0.13-0.24 in Kappa and 13.19%-18.58% in OA with 0.16-0.2 in Kappa, respectively. SCSF also outperforms CI-WUDAPT by 5.34%-7.75% in OA and 0.06-0.09 in Kappa, which supports the effectiveness of combining spatial-contextual information with the self-training method.

Effects of the Self-training Method
To test the effects of the improved self-training method, which selects pseudo-labels using spatial-contextual information from regional to local scales, it was compared with two reference strategies. One benchmark was the original self-training approach, named ST-1, which is based on high-probability labels and is usually unreliable, especially at the beginning. The other was the improved self-training method developed by Aydav and Minz, named ST-2 [44], which is based on the average probability in one granulation. The proposed self-training method based on regional and local information is denoted as STS. All of these methods were compared with the original RF classification results without the self-training step, denoted as RF. The OAs before and after the three different self-training approaches for the Berlin (BL), São Paulo (SP), and Paris (PA) datasets are listed in Table 10. The mean accuracy, denoted as MEAN, was also calculated to reflect the general performance. The accuracies in Table 10 demonstrate that the ST-1 and ST-2 self-training approaches usually degrade the classification performance in both OA and Kappa, except for São Paulo (SP), where the accuracy of ST-2 is slightly increased. The explanation is that, since the initial classification accuracies (i.e., RF) are usually poor, the predicted probabilities may be unreliable; ST-1 fully trusts high-probability pseudo-labels and ST-2 trusts the average probability within one granulation, which results in misclassification noise that confuses the classifier.
Given this fact, the proposed improved self-training method (i.e., STS) replaces the unreliable probability of the predictions with spatial-contextual information from regional to local scales, and the accuracy is enhanced by about 2-5% in OA and 0.02-0.07 in Kappa for the different cities. The general results in the MEAN column show the same trend: the proposed STS strategy presents the highest accuracies, which supports the effectiveness of the improved self-training method.
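The difference between probability-based and spatially constrained pseudo-label selection can be illustrated with a small sketch. Both functions below are simplified, hypothetical stand-ins: `select_confident` mimics an ST-1-style rule with an assumed threshold of 0.9, and `select_spatially_consistent` accepts a pixel only when its predicted label matches the majority label of its segment (regional scale) and is a majority label in its moving window (local scale). The exact selection rule of STS is not reproduced here.

```python
from collections import Counter


def select_confident(probs, threshold=0.9):
    """ST-1-style rule: keep pixels whose top class probability is high."""
    selected = {}
    for r, row in enumerate(probs):
        for c, p in enumerate(row):
            best = max(range(len(p)), key=p.__getitem__)
            if p[best] >= threshold:
                selected[(r, c)] = best
    return selected


def select_spatially_consistent(pred, segments, window=3):
    """Keep pixels whose label agrees with its segment and its local window."""
    rows, cols = len(pred), len(pred[0])
    half = window // 2
    # Regional vote: majority predicted label inside each segment.
    votes = {}
    for r in range(rows):
        for c in range(cols):
            votes.setdefault(segments[r][c], Counter())[pred[r][c]] += 1
    seg_label = {s: cnt.most_common(1)[0][0] for s, cnt in votes.items()}
    selected = {}
    for r in range(rows):
        for c in range(cols):
            lab = pred[r][c]
            if lab != seg_label[segments[r][c]]:
                continue  # disagrees with the regional majority
            # Local vote: labels inside the moving window, clipped at borders.
            counts = Counter(
                pred[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            )
            if counts[lab] == max(counts.values()):
                selected[(r, c)] = lab
    return selected
```

In this sketch, a confidently predicted but isolated pixel passes the first rule and fails the second, which reflects the idea of not relying on the classifier probability alone when the initial classification is weak.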
The LCZ maps produced with the above approaches (i.e., RF, ST-1, ST-2, and STS) are shown in Figure 12a-l. Compared with the initial classification (i.e., RF), the LCZ maps based on ST-1 in the second column of Figure 12 contain a large amount of obvious misclassification noise over the whole image. Although the ST-2-based results in the third column improve considerably on those of ST-1, noise still appears. Moreover, the classification results obtained using the STS strategy present more distinct class boundaries with less misclassification noise, which is evident in the visual performance. However, there are still many small, isolated areas in the predictions, meaning that further consideration of the spatial-contextual information, i.e., the proposed CRF-based LCZ classification, is necessary.

Effects of the Sample Number
To analyze the effect of the initial sample number on the proposed SCSF, extra experiments were conducted. Different sample numbers, i.e., 5, 10, 25, and 50 samples per class, were used to assess the proposed method, with the WUDAPT method as a reference. The generic results, calculated by averaging the accuracies over all three areas, are denoted as MEAN. As in the configuration of Section 3, the final experimental accuracies are the average of 10 runs; in each run, the training samples were randomly selected from the whole ground truth data and the remaining samples were used as the testing set.
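The sampling protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the helper names `split_per_class` and `mean_accuracy_over_runs`, the seed handling, and the `evaluate` callback (which stands in for training a classifier and returning its accuracy on the test indices) are hypothetical.

```python
import numpy as np

def split_per_class(labels, n_per_class, rng):
    """Randomly draw n_per_class training indices from each class; all
    remaining labeled samples form the testing set."""
    labels = np.asarray(labels)
    train_parts = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # If a class has fewer samples than requested, all of them are used
        # for training and none remain for testing.
        take = min(n_per_class, idx.size)
        train_parts.append(rng.choice(idx, size=take, replace=False))
    train_idx = np.sort(np.concatenate(train_parts))
    test_mask = np.ones(labels.size, dtype=bool)
    test_mask[train_idx] = False
    return train_idx, np.flatnonzero(test_mask)

def mean_accuracy_over_runs(labels, n_per_class, evaluate, n_runs=10, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over
    n_runs independent random splits."""
    rng = np.random.default_rng(seed)
    scores = [evaluate(*split_per_class(labels, n_per_class, rng))
              for _ in range(n_runs)]
    return float(np.mean(scores))

# Toy usage: three classes with 20 labeled samples each (hypothetical data).
labels = np.repeat([0, 1, 2], 20)
train_idx, test_idx = split_per_class(labels, 5, np.random.default_rng(0))
```

Repeating the split with different random draws and averaging reduces the dependence of the reported accuracy on any single choice of training samples, which matters most for the smallest settings of 5 and 10 samples per class.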
As shown in Figure 13a-d, all the classifications exhibit a similar trend as the number of training samples increases. The experiments show that the proposed SCSF outperforms WUDAPT in all the designed conditions. In terms of the performance in different areas, the biggest improvements of SCSF over WUDAPT are in Berlin, with 3-7% in OA. In the more complex areas, i.e., São Paulo (SP) and Paris (PA), SCSF achieves a 1-7% improvement in OA. For the generic comparisons shown in the last column of Table 11 and Figure 13, the SCSF also performs well. However, the LCZ maps that improve most with the SCSF are those with lower accuracies than the others, which suggests that the proposed SCSF is most useful for enhancing the accuracy when the initial classification is poor.

Furthermore, the proposed SCSF performs particularly well with a small number of training samples: the OA gain is larger with 5 or 10 samples per class than with 25 or 50. For instance, when the accuracy with 25 or 50 samples in Paris (PA) is already over 90%, the improvement is only about 1-2%. Moreover, the proposed SCSF method can reach an accuracy equivalent to that of the WUDAPT method with fewer training samples. In particular, the OA of SCSF with five samples per class in São Paulo (SP) is similar to that of the WUDAPT method using 10 samples per class. In terms of OA, the SCSF method improves on WUDAPT for all the tested sample numbers (i.e., 5, 10, 25, and 50 samples per class), which shows that the SCSF is helpful for LCZ classification.
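For reference, the OA and Kappa values reported in this section can be computed from a confusion matrix using the standard definitions of overall accuracy and Cohen's kappa. The sketch below is illustrative only; the function name `overall_accuracy_and_kappa` and the toy labels are assumptions, and class labels are assumed to be integers in the range 0 to n_classes - 1.

```python
import numpy as np

def overall_accuracy_and_kappa(y_true, y_pred, n_classes):
    """Return (OA, kappa) for integer class labels in [0, n_classes)."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    # Rows index the reference class, columns the predicted class.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    total = cm.sum()
    p_o = np.trace(cm) / total                                    # observed agreement (OA)
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2    # chance agreement
    # Kappa is undefined when p_e equals 1 (all samples in a single class).
    kappa = (p_o - p_e) / (1.0 - p_e)
    return float(p_o), float(kappa)

# Toy usage with hypothetical labels.
oa, kappa = overall_accuracy_and_kappa([0, 0, 1, 1], [0, 0, 1, 0], n_classes=2)
```

Kappa discounts the agreement expected by chance, so it is a useful complement to OA when the class distribution is unbalanced, as is common for LCZ reference data.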

Conclusions
In this paper, we have proposed a spatial-contextual information-based self-training classification framework (SCSF) for LCZs, which introduces CRF for LCZ classification to better utilize the spatial-contextual information, and probabilistic classification with self-training to provide more reliable inputs for CRF. Three experiments using Landsat 8 images from three diverse areas (Berlin, São Paulo, and Paris) confirmed the effectiveness of the proposed SCSF method compared with the widely used protocol developed by WUDAPT.
To be specific, the CRF provides a unified theoretical foundation for bringing the spatial-contextual information directly into classification, mitigating both salt-and-pepper noise and small, isolated areas. In addition, probabilistic classification with an improved self-training approach is adopted to address the lack of high-quality expert-labeled data. By enriching the limited training data with pseudo-labels, the predicted probability used for the potential function modeling becomes more reliable.
Moreover, the effects of the improved self-training method and the sensitivity to different sample numbers were also investigated. Overall, the proposed SCSF method showed strong quantitative and qualitative performance in LCZ mapping, not only in the multiform city but also in the high-density city, especially when the training data were limited. In future work, we will explore deep learning methods for LCZ classification, and other data and larger study areas will also be considered.