High-Resolution Remote Sensing Data Classiﬁcation over Urban Areas Using Random Forest Ensemble and Fully Connected Conditional Random Field

: As an intermediate step between raw remote sensing data and digital maps, remote sensing data classiﬁcation has been a challenging and long-standing problem in the remote sensing research community. In this work, an automated and effective supervised classiﬁcation framework is presented for classifying high-resolution remote sensing data. Speciﬁcally, the presented method proceeds in three main stages: feature extraction, classiﬁcation, and classiﬁed result reﬁnement. In the feature extraction stage, both multispectral images and 3D geometry data are used, which utilizes the complementary information from multisource data. In the classiﬁcation stage, to tackle the problems associated with too many training samples and take full advantage of the information in the large-scale dataset, a random forest (RF) ensemble learning strategy is proposed by combining several RF classiﬁers together. Finally, an improved fully connected conditional random ﬁeld (FCCRF) graph model is employed to derive the contextual information to reﬁne the classiﬁcation results. Experiments on the ISPRS Semantic Labeling Contest dataset show that the presented 3-stage method achieves 86.9% overall accuracy, which is a new state-of-the-art non-CNN (convolutional neural networks)-based classiﬁcation method.


Introduction
As one of the most challenging and important problems in the remote sensing community, high-resolution remote sensing data classification is very useful for many applications such as geographical database construction, digital map updating, 3D building reconstruction, land cover mapping and change detection.The objective of this kind of classification task is to assign an object class to each spatial position recorded by the given data.Although many different algorithms have been proposed in the past, many of the problems related to the classification task have not been solved [1].Based on different criteria, the existing classification methods can be categorized into several groups.A brief review of these existing methods is provided in the following.
According to the types of data sources employed, the existing methods can be categorized into image-based classification, 3D point cloud-based classification and data fusion-based classification.Image-based classification only makes use of the available multispectral or hyperspectral image as the sole data source in the classification, as done in previous studies [2,3].Three dimensional points acquired by light detection and ranging (LIDAR) and dense image matching techniques [4,5] are other effective data sources for classification.For example, Vosselman [6] used high-density point clouds of urban scenes to identify buildings, vegetation, vehicles, the ground, and water.Zhang et al. [7] used the geometry, radiometry, topology and echo characteristics of airborne LIDAR point cloud to perform an object-based classification.To exploit the complementary characteristics of multisource data, data fusion based methods are also popular and have been proven to be more reliable than the single-source data methods used by many researchers [8].For example, both images and 3D geometry data have been used in several previous studies [9][10][11][12].
In terms of the basic element employed in the classification process, the existing methods can be categorized as object-based and pixel/point-based.Object-based methods typically use a cascade of bottom-up data segmentation and regional classification, which makes the system commit to potential errors from the front-end segmentation system [13].For instance, Gerke [14] first segmented an image into small super-pixels and then extracted the features of each super-pixel to input to an AdaBoost classifier.Zhang et al. [7] first grouped points into segments using a surface growing algorithm then classified the segments using a support vector machine (SVM) classifier.Pixel/point-based methods leave out the segmentation process and directly classify each pixel or point.However, due to the lack of contextual information, the classified results usually seem noisy.As a remedial measure, a conditional random field (CRF) is usually used to smooth the classification result.For example, both Marmanis et al. [15] and Paisitkriangkrai et al. [1] used deep convolutional neural networks (CNN) to classify each pixel, then used a CRF to refine the results, whereas Niemeyer et al. [16] first classified each 3D point using a random forest (RF) classifier then smoothed them using a CRF.
Based on the classifiers used, the existing methods can be divided into two types: unsupervised and supervised.For the unsupervised methods, expert knowledge of each class is usually summarized and used to classify the data into different categories.For instance, a rule-based hierarchical classification scheme that utilizes spectral, geometry and topology knowledge of different classes was used by both Rau et al. [9] and Speldekamp et al. [17] to classify different data.For the supervised methods, samples with labeled ground truth data are first used to train a statistical classifier (e.g., AdaBoost, SVM and RF), then the samples without labels are classified by this learned classifier.Previously, samples from small areas have been used to train the classifier, and the features of these samples have all been designed manually [7,16,18].More recently, with the progress of sensor technology, an increasing amount of high quality remote sensing data are available for research.At the same time, progress in graphic processing unit (GPU) and parallel computing technology has significantly increased the computing capability, such that learning a more complicated classifier with a larger amount of training data has becomes accessible to more researchers.Specifically, one of the most successful practices in this direction was the launch of deep CNN (convolutional neural networks) [19][20][21][22] in the computer vision community, which has become the dominant method for visual recognition and semantic classification [13,[23][24][25][26]. Furthermore, one of the most distinct characteristics of CNN is its ability to automatically learn the most suitable features, which has made the manual feature extraction process that is used in the traditional supervised-based classification methods unnecessary.Although there exists great differences between the data used in the computer vision community and the data used in the remote sensing community, some researchers [1,27] have found that the CNN models trained by the computer vision community generalize the remote sensing data and some of the features learned by the models were more discriminative than the hand-crafted features.
To promote the scientific progress of remote sensing data classification, the international society for photogrammetry and remote sensing (ISPRS) launched a semantic labeling benchmark [28] in 2014.Using the datasets provided by the benchmark, different classification methods can be evaluated and compared conveniently.We have observed an interesting phenomenon from these evaluated classification methods.On the one hand, the performances of the CNN-based methods are generally better than the non-CNN-based methods.On the other hand, the non-CNN-based methods generally use fewer training samples than the CNN-based methods.As a result, we want to explore whether the notable gap between the non-CNN and CNN-based methods can be reduced by training a traditional supervised classifier with a larger training dataset.
It is widely known that for CNN-based methods, more benefits are gained when more training data are available.However, too many training samples may lead to disaster for some traditional supervised classifiers.For example, SVMs trained by a large-scale dataset often suffer from large memory storage requirements and extensive time consumption, since an SVM solves a complex dual quadratic optimization problem [29].In addition, the existence of too many support vectors makes the solving process extremely slow [30].Although the RF and AdaBoost classifiers can theoretically handle a large-scale dataset, the large memory storage and computational load still hamper their applications to big training datasets.To tackle this problem and take full advantage of the information in the large-scale dataset, an RF-based ensemble learning strategy is proposed by combining several RF classifiers together in the present study.Ensemble learning or a multiple classifier system (MCS) is well established in remote sensing and has shown great potential to improve the accuracy and reliability of remote sensing data classification over the last two decades [31][32][33].For example, Waske et al. [34] fused two SVM classifiers to classify both optical imagery and synthetic aperture radar data, and each data source was treated separately and classified by an independent SVM.Experiments have shown that their fusion method outperforms many approaches and significantly improves the results of a single SVM that was trained on the whole multisource dataset.Ceamanos et al. [35] designed a classifier ensemble to classify hyperspectral data.Spectral bands of the hyperspectral image were first divided into several subgroups according to their similarities.Then, each group was used to train an individual SVM classifier.Finally, an additional SVM classifier was used to combine these classifiers together.The results also demonstrated the effectiveness of their model fusion scheme.Recently, several fusion methods have investigated the dependence among detectors or classifiers.For example, Vergara et al. [36] derived the optimum fusion rule of N non-independent detectors in terms of the individual probabilities of detection and false alarms and defined the dependence factors.This could be a future line of research in the remote sensing community.In the present study, remote sensing data (both multispectral images and 3D geometry data) are first divided into tiles.Then, some of them are selected and labeled by a human operator.After that, each selected and labeled tile is used to train an individual RF model.Finally, a Bayesian weighted average method [31] is employed to combine these individual RF models into a global classifier.In addition, to take full advantage of the contextual information in the data, an effective fully connected conditional random field (FCCRF) model is constructed and optimized to refine the classified results.
In general, the present study describes a pixel-based, supervised and data fusion-based method.The main contributions of the current study are three-fold.First, a new RF ensemble learning strategy is introduced to explore the information in the large-scale training dataset.Second, through utilizing the contextual information, an improved FCCRF graph model is designed to refine the classification result.Third, an efficient pipeline is designed and parallelized to classify the multisource remote sensing data.By testing the method on the ISPRS Semantic Labeling Contest, we achieved the highest overall accuracy among the non-CNN-based methods, which provides a state-of-the-art method and reduces the gap between the non-CNN and CNN-based methods.
The rest of this paper is organized as follows.In Section 2, details of the presented high-resolution remote sensing data classification method are elaborated.Then, experimental evaluation and analysis are reported in Section 3, and followed by some concluding remarks and future work in Section 4.

Methodology
The present study is composed of three main parts.First, both the high-resolution multispectral image and the 3D geometry data are used for feature extraction, and a total of 24 different features are extracted from each pixel.Second, using these pixels (samples), an RF ensemble model (constructed by combining several individual RF models; denoted by RFE) is trained and used to classify the scene.Finally, the noisy classification results are input into a learned FCCRF model, and a long-range dependencies inference is used to refine the classification results.Figure 1 shows the pipeline of the proposed method.

Feature Extraction
Four types of features are employed in this work, spectral features from the multispectral image, texture features from the multispectral image, height-related features from the 3D geometry data and differential morphological profile features from the 3D geometry data.

Spectral Features
The spectral features used in this paper refer to two types, the features in different color spaces and the vegetation index.They are defined as follows.
(1) Features in different color spaces.Because each color space has its advantages, the HSV [37] and CIE Lab [38] color spaces commonly used in computer graphics and computer vision are employed in addition to the original RGB color space to provide additional information.The HSV color space decomposes colors into their hue, saturation and value components and is a more natural way to describe colors.The CIE Lab color space is designed to approximate human vision.The CIE Lab aspires to achieve perceptual uniformity and is handy to measure the distance of a given color to another color.In Section 3, we will see that both the HSV and CIE Lab are the more effective color spaces to classify remote sensing data compared to the original RGB color space.(2) Vegetation index.To discriminate vegetation from other classes effectively, one of the most popular vegetation indices in remote sensing, the Normalized Difference Vegetation Index , is considered.NDVI is based on the fact that green vegetation has low reflectance in the red (R) spectrum due to chlorophyll and much higher reflectance in the infrared (IR) spectrum because of its cell structure.

Image Texture Features
Image texture can quantify the intuitive qualities in terms of rough, smooth, silky, or bumpy as

Feature Extraction
Four types of features are employed in this work, spectral features from the multispectral image, texture features from the multispectral image, height-related features from the 3D geometry data and differential morphological profile features from the 3D geometry data.

Spectral Features
The spectral features used in this paper refer to two types, the features in different color spaces and the vegetation index.They are defined as follows.
(1) Features in different color spaces.Because each color space has its advantages, the HSV [37] and CIE Lab [38] color spaces commonly used in computer graphics and computer vision are employed in addition to the original RGB color space to provide additional information.The HSV color space decomposes colors into their hue, saturation and value components and is a more natural way to describe colors.The CIE Lab color space is designed to approximate human vision.
The CIE Lab aspires to achieve perceptual uniformity and is handy to measure the distance of a given color to another color.In Section 3, we will see that both the HSV and CIE Lab are the more effective color spaces to classify remote sensing data compared to the original RGB color space.(2) Vegetation index.To discriminate vegetation from other classes effectively, one of the most popular vegetation indices in remote sensing, the Normalized Difference Vegetation Index (NDVI) defined as NDV I = IR−R IR+R , is considered.NDVI is based on the fact that green vegetation has low reflectance in the red (R) spectrum due to chlorophyll and much higher reflectance in the infrared (IR) spectrum because of its cell structure.

Image Texture Features
Image texture can quantify the intuitive qualities in terms of rough, smooth, silky, or bumpy as a function of the spatial variation in the pixel intensities.The effectiveness of using image texture features for remote sensing data classification has been justified by several studies [39][40][41].Similar to [42], the following image texture features are calculated in this work: (1) Local range f range represents the range value (the maximum value minus the minimum value) of the neighborhood centered at the pixel.In areas with a smooth texture, the range value will be small; in areas with a rough texture, the range value will be larger.When computing the three image texture features, as done in [43], the multispectral color images are first converted to gray images, and then a 3-by-3 neighborhood centered at each pixel is used for the f std and f range calculations, and a 9-by-9 neighborhood is used for the f entr computation.

Height Related Features
The height related features can be divided into two types: height features and height-based texture features.They are detailed as follows.
(1) Height features.The corresponding digital surface model (DSM) value and the normalized digital surface model (NDSM) value for each pixel are directly used as the height features.The NDSM is defined as the difference between the DSM and the derived DEM, which describes an object's height above the ground and can be used to distinguish the high object classes from the low object classes.Note that the height features are also used in [1,14,17].(2) Height-based texture features.The height-based texture features or the elevation texture features used in this study are similar to the image texture features calculated in Section 2.1.2.The features include the local geometry range feature f g range , the local geometry standard deviation feature f g std and the local geometry entropy feature f g entr .In this study, we calculate these features from the DSM.

Differential Morphological Profile Features
Morphological profile (MP) or differential morphological profile (DMP) is an effective feature extraction method that is usually used for image classification in remote sensing [3,44].It is used to extract the shape and size of objects based on morphological operations (e.g., opening and closing) by reconstruction.As illustrated in Figure 2, for a certain image I, let γ * k and η * k be the morphological opening and closing operators with structuring element SE k , then ∏ γ(x) and ∏ η(x) are the opening and closing profiles at pixel x of image I, which can be obtained by Equations ( 1) and (2), respectively.The DMP is defined as a vector where the measure of the slope of the opening-closing profile is stored for every step of an increasing SE series Generally, the differential of the morphological profile ( ) or the DMP can be written as the vector, where n is the total number of iterations, − is the size of the morphological transform [45,46].
Here, we use the morphological opening operators with increasing square structuring element sizes by ) to continually process the DSM.The changes brought to the DSM by the different sized opening operators are then stacked, and the residuals between the adjacent levels are computed to form the 6 final DMP features, dmp-n (n = 2, 3, … 7).Finally, we list all 24 features and their abbreviations in Table 1.The contribution rate of each feature to the classification will be explored and compared in Section 3.  The DMP is defined as a vector where the measure of the slope of the opening-closing profile is stored for every step of an increasing SE series SE k = f (k) (k = 0, 1, 2, ...n).The differential of the opening profile ∆γ(x) and the closing profile ∆η(x) are defined as, Generally, the differential of the morphological profile ∆(x) or the DMP can be written as the vector, where n is the total number of iterations, c = 1, . . ., 2n, and |n − c| is the size of the morphological transform [45,46].Here, we use the morphological opening operators with increasing square structuring element sizes by SE k = 2 k + 1(k = 1, 2...7) to continually process the DSM.The changes brought to the DSM by the different sized opening operators are then stacked, and the residuals between the adjacent levels are computed to form the 6 final DMP features, dmp-n (n = 2, 3, . . .7).
Finally, we list all 24 features and their abbreviations in Table 1.The contribution rate of each feature to the classification will be explored and compared in Section 3. Random Forest (RF) is one of the most popular machine learning methods thanks to its relatively good accuracy, robustness and ease of use.It is an ensemble learning method for classification, that operates by constructing a multitude of decision trees during training and integrating the class probabilities of the individual trees at the testing stage [47,48].The training algorithm for RF applies the bootstrap aggregating (bagging) techniques to the tree learners.For a training set X = x 1 , x 2 , ..., x n with labels C = c 1 , c 2 , ..., c n , RF bagging repeatedly (B times) selects a random sample with replacement of the training set and fits the trees to these samples, as shown in Algorithm 1.After training, a tree set {T} can be obtained to predict the classes of the unseen samples by taking the majority vote from all individual classification trees.As shown in Figure 3, the unseen sample V is input into 3 individual decision trees.The class probabilities p t (c|v) of each tree are first computed, then the final classification result is obtained by averaging all 3 probability distributions using Equation (6) [49].
In addition to the predicted class probabilities, RF can also be used to rank the importance of features using the Gini index or the out-of-bag (oob) error [50].During the model training process, the oob error for each sample is recorded and averaged over the entire forest.To measure the importance of the j − th feature, after training, the values of the j − th feature are permuted among the training data and the oob error is again computed on this perturbed data set.The importance score for the j − th feature is computed by averaging the difference in the oob errors before and after the permutation over all trees.The features that produce large values for this score are ranked as more important than the features that produce small values.
Some researchers claim that the feature importance measures may provide misleading results when the variables are of different types, or the number of levels differs in different categorical variables [51,52].However, the effectiveness and robustness of this measure have been recognized by more researchers [50,51,53], especially the researchers in the remote sensing community [54,55].To take full advantage of the information in the large-scale dataset, we trained several RF models independently and fused them to form an RF ensemble to predict the final label of each pixel.In Figure 4, the flowchart for training this ensemble model is provided, and we detail each step as follows.To take full advantage of the information in the large-scale dataset, we trained several RF models independently and fused them to form an RF ensemble to predict the final label of each pixel.In Figure 4, the flowchart for training this ensemble model is provided, and we detail each step as follows.To take full advantage of the information in the large-scale dataset, we trained several RF models independently and fused them to form an RF ensemble to predict the final label of each pixel.In Figure 4, the flowchart for training this ensemble model is provided, and we detail each step as follows.In the sample selection and partition stage, a tile-based strategy is used to partition the samples into a training set, a validation set, and a testing set.Specifically, the large urban area covered by the samples (pixels) is first cut into several small tiles, as shown in Figure 5.Then, we partition samples by dividing the tiles into different groups.After this partition, each type of sample consists of several tiles, and they are input into the different stages for training or testing purposes.In detail, the training samples are used to train the individual RF models; the validation samples are used to both fuse the trained RF models and learn the hyper parameters of the FCCRF model; the effects of each strategy employed in the proposed method is validated using the testing samples.To avoid a serious imbalance of the classes in the training samples, we try to maintain the class balance of the training data in each tile by adjusting the tile size at the tile cutting stage.Furthermore, at the sample selection stage, only those samples that are not located in the object border areas within each tile are selected as valid, since the samples located in the border areas are likely to be mixed and incorrect pixels.Finally, at the model fusing stage, the validation set is used to fuse the trained RF models.Specifically, the validation set is classified by each RF model, and the corresponding classification accuracy is computed.By weighting the prediction result of each model in accordance with its classification accuracy, the class probabilities ( | ) p c v of the RFE model is obtained, as defined in Equation ( 7) where N is the number of RF models, i w is the weight of the i th − RF, and probabilities generated by the i th − RF.At the same time, the feature importance indicator of RFE, ( ) ϕ is obtained by fusing the feature importance indicator of each RF model using Equation ( 8) where N and i w are the same as the parameters defined in Equation (7), and ( ) is the feature importance score of the j th − feature generated by the i th − RF model.

Fully Connected CRF for Refinement
The RFE model can produce the probability of each class conveniently and classify the remote sensing data efficiently.However, this pixel-based classification strategy labels the image pixels independently, which does not take into account the interrelations among them.Therefore, in this section, we further improve the classification performance by employing an effective CRF graph model that can exploit the contextual information from the classified area.
Although using the classical contrast-sensitive potentials [56] in conjunction with the local-range Finally, at the model fusing stage, the validation set is used to fuse the trained RF models.Specifically, the validation set is classified by each RF model, and the corresponding classification accuracy is computed.By weighting the prediction result of each model in accordance with its classification accuracy, the class probabilities p(c|v) of the RFE model is obtained, as defined in Equation ( 7) where N is the number of RF models, w i is the weight of the i − th RF, and p i (c|v) is the class probabilities generated by the i − th RF.At the same time, the feature importance indicator of RFE, ϕ( f j ) is obtained by fusing the feature importance indicator of each RF model using Equation ( 8) where N and w i are the same as the parameters defined in Equation ( 7), and ϕ i ( f j ) is the feature importance score of the j − th feature generated by the i − th RF model.

Fully Connected CRF for Refinement
The RFE model can produce the probability of each class conveniently and classify the remote sensing data efficiently.However, this pixel-based classification strategy labels the image pixels independently, which does not take into account the interrelations among them.Therefore, in this section, we further improve the classification performance by employing an effective CRF graph model that can exploit the contextual information from the classified area.
Although using the classical contrast-sensitive potentials [56] in conjunction with the local-range CRF can potentially improve the results, we found that some thin-structures (tree, low vegetation or cars in the scene) may also be smoothed out in practice, which is harmful to the classification.To overcome these limitations of a short-range CRF, we adopt a long-range FCCRF model in this work; see Section 2.3 for details.In Figure 6, the results of two testing areas refined using short-range CRF and long-range FCCRF methods are compared.The FCCRF model was first proposed by Krahenbuhl [57] to solve the multiclass image segmentation and labeling problems.As shown in Figure 7, different from the commonly used shortrange CRF models [56,58], this complete graph model sees each pixel as a node, and every two nodes are connected by an edge.Then, the corresponding energy function is defined as where V is the node set, E is the edge set and X is the label configuration for the graph.The unary potential ( ) is defined as ( ) log ( ) , where ( ) i P x is the label assignment probability of pixel i generated by a certain classifier.The pairwise potential ( , ) where ( , ) f are the feature vectors of pixel i and j in a feature space, and they are usually built over the information from pixel positions i p and the spectral bands i I .Then, the commonly used combined kernels can be expressed as The first kernel encourages the nearby pixels with similar features to take the same label and the second kernel smooths the result by removing small isolated regions.The hyper parameters α σ , β σ and γ σ control the degree of the kernels' scales.Finally, to minimize this energy function, a mean- field approximate inference algorithm is usually used to refine the label configurations.The FCCRF model was first proposed by Krahenbuhl [57] to solve the multiclass image segmentation and labeling problems.As shown in Figure 7, different from the commonly used short-range CRF models [56,58], this complete graph model sees each pixel as a node, and every two nodes are connected by an edge.Then, the corresponding energy function is defined as where V is the node set, E is the edge set and X is the label configuration for the graph.The unary potential ϕ u (x i ) is defined as ϕ u (x i ) = − log P(x i ), where P(x i ) is the label assignment probability of pixel i generated by a certain classifier.The pairwise potential ϕ p (x i , x j ) is defined as where µ(x i , x j ) is a label compatibility function, which can be given by the simple Potts model, or by learning as done in a previous study [57].Each k (m) is a Gaussian kernel weighted by w (m) , f i and f j are the feature vectors of pixel i and j in a feature space, and they are usually built over the information from pixel positions p i and the spectral bands I i .Then, the commonly used combined kernels can be expressed as The first kernel encourages the nearby pixels with similar features to take the same label and the second kernel smooths the result by removing small isolated regions.The hyper parameters σ α , σ β and σ γ control the degree of the kernels' scales.Finally, to minimize this energy function, a mean-field approximate inference algorithm is usually used to refine the label configurations.The flowchart of the FCCRF-based refinement implemented in the current study is shown in Figure 8, where we can see that the class probabilities generated by the RFE are used to construct the unary potential ( ) . We keep the basic form of the pairwise potential ( , ) Equation (10), where the label compatibility function ( , )  11) is written as: where i S and j S are the 3 most important features selected by RFE model and all other parameters are the same as the parameters defined in Equation (11).To minimize this energy function, we use a mean-field approximate inference algorithm [60] to refine the configuration of the labels.This approximation is iteratively optimized through a series of message passing steps, each of which updates a single variable by aggregating information from all other variables; its efficiency for the FCCRF inference has been demonstrated by Krahenbuhl [57].The flowchart of the FCCRF-based refinement implemented in the current study is shown in Figure 8, where we can see that the class probabilities generated by the RFE are used to construct the unary potential ϕ u (x i ).We keep the basic form of the pairwise potential ϕ p (x i , x j ) unchanged.See Equation (10), where the label compatibility function µ(x i , x j ) is set to 1 if x i = x j , and 0 otherwise (i.e., Potts Model).Different from previous studies, after a feature importance analysis, the feature importance indicators generated by the trained RFE model are used to construct the feature spaces f i and f j , then Equation ( 11) is written as: where S i and S j are the 3 most important features selected by RFE model and all other parameters are the same as the parameters defined in Equation (11).To minimize this energy function, we use a mean-field approximate inference algorithm [60] to refine the configuration of the labels.This approximation is iteratively optimized through a series of message passing steps, each of which updates a single variable by aggregating information from all other variables; its efficiency for the FCCRF inference has been demonstrated by Krahenbuhl [57].11) is written as: where i S and j S are the 3 most important features selected by RFE model and all other parameters are the same as the parameters defined in Equation (11).To minimize this energy function, we use a mean-field approximate inference algorithm [60] to refine the configuration of the labels.This approximation is iteratively optimized through a series of message passing steps, each of which updates a single variable by aggregating information from all other variables; its efficiency for the FCCRF inference has been demonstrated by Krahenbuhl [57].Like the model fusion stage in Section 2.2, the validation set is also used in this stage to learn the best values of the hyper parameters w (1) , w (2) , σ α , σ β and σ γ in Equation (12).As described in Chen [13], a two level coarse-to-fine grid search scheme is used to search for the best values.

Experimental Evaluation
The proposed method in this paper is implemented in C++.Moreover, the OpenCV library [61] is used to supply the RF classifier, and the DenseCRF library [62] is used to optimize the FCCRF graph model.All experiments are performed on an Intel(R) Xeon(R) 8 core CPU @ 3.7 GHz processor with 32 GB RAM.To promote computation efficiency, the main steps of the proposed method are parallelized.Specifically, in the RFE model training stage, we parallelize the presented algorithm by training each single RF classifier with an individual thread.In the classification stage of the RFE model, we parallelize the presented algorithm at the sample level.In addition, the hyper parameters learning stage of the FCCRF model is also parallelized.

The Testing Data Set
The ISPRS Semantic Labeling Contest dataset from Vaihingen [28] is used to test the proposed method.The Vaihingen study site is approximately 25 km northwest of Stuttgart, Germany.As a typical European city, there are three main types of locations: "inner city", "high rise", and "residential areas".The center of this city is the "inner city", and it is characterized by dense development consisting of historic buildings with rather complex shapes and trees.Around the "inner city", the "high rise" areas are characterized by a few high-rising residential buildings that are surrounded by trees.The "residential areas" are purely residential areas with small detached houses.
The original data in this dataset were captured by the German Association of Photogrammetry and Remote Sensing (DGPF) in the summer of 2008 using a Digital Mapping Camera System (DMC).In 2012, the Trimble INPHO 5.3 software was used to generate the 3D point cloud and the DSM using dense image matching.The true orthophoto (TOP) mosaic, with 3 bands of near-infrared, red and green delivered by the camera, is generated by Trimble INPHO OrthoVista, as shown in Figure 9.Both the TOP and the DSM have 9 cm ground sampling distances (GSD).That is, the TOP and the DSM are defined on the same grid, so it is not necessary to consider the geocoding information in the processing.In 2014, the ISPRS benchmark for Semantic Labeling Contest was launched.For convenience, the large TOP and DSM are divided into 33 small tiles with different sizes according to the scene content; in total, there are over 168 million pixels (see Figure 9).At the same time, to evaluate the classification results of the different methods and provide enough training samples for the supervised machine learning algorithm, manually labeled ground truth data for each tile are added to the dataset (see Figure 10).(b) "no boundary" ground truth, the black areas in this reference will be ignored during evaluation.

Experiment Setup and Details
In the feature extraction stage, the true orthophoto is the source of the spectral features.Using this image, the HSV and CIE Lab color space features are first computed according to their definition, followed by the NDVI and the three image texture features defined in Section 2.1.For the geometry features, the DSM provided by the benchmark is directly used to generate other features.Like many other researchers [1,14,63], the results provided by Gerke [14] are used for the NDSM feature.Finally, all 24 features are normalized to [0, 255] for the subsequent classification.
To train the RFE classifier, the 16 tiles with labeled ground truth data are divided into the training, validation and testing sets.Specifically, the validation set consists of 3 tiles (areas: 26, 28, 30) and the testing set consists of 3 tiles (areas: 32, 34, 37).The remaining 10 tiles represent the training set, and each tile is used for training an individual RF classifier.In detail, there are approximately six million training samples on average for training each RF classifier.The number of trees and the number of prediction variables need to be provided for each RF.We find that the classification results are not sensitive to these parameters, consistent with the reports in previous studies [55,64].At last, the number of trees and the number of prediction variables are set to 100 and 4 respectively for all the 10 RF classifiers.Finally, the RFE classifier is obtained by fusing the 10 classifiers with the aid of the validation set, see Section 2.2.
In terms of the FCCRF-based refinement, as described by Krahenbuhl [57], we found that the kernel parameters (2)   w and γ σ do not significantly affect the classification accuracy but yield a small visual improvement.Therefore, these parameters are all set to 3 by default in the experiment.With respect to the hyper parameters (1)  w , α σ and β σ , we obtain them from a two-level grid search scheme.At the first level, we search for values of (1)   w in the range of (3, 10), with steps of 2; α σ and β σ are searched in the range of (5, 50) and (5, 100), respectively, with steps of 5.At the second level, we decrease all search steps to 1 around the best values of the first round.In the final mean-field approximate inference stage, we fix the number of mean field iterations to 10 for all experiments.Only part of the labeled ground truth data (16 of 33) is publicly available, and the remaining ground truth data are unavailable and remain with the benchmark test organizers for evaluating the submitted results.There are six common land cover categories, impervious surfaces, building, grass, tree, car, and clutter/background.In addition to the "full reference" ground truth, the "no boundary" ground truth is also provided to reduce the impact of uncertain border definitions where the boundaries of objects are eroded by a circular disc with a 3-pixel radius.Those eroded areas are then ignored during evaluation (see Figure 10).

Experiment Setup and Details
In the feature extraction stage, the true orthophoto is the source of the spectral features.Using this image, the HSV and CIE Lab color space features are first computed according to their definition, followed by the NDVI and the three image texture features defined in Section 2.1.For the geometry features, the DSM provided by the benchmark is directly used to generate other features.Like many other researchers [1,14,63], the results provided by Gerke [14] are used for the NDSM feature.Finally, all 24 features are normalized to [0, 255] for the subsequent classification.
To train the RFE classifier, the 16 tiles with labeled ground truth data are divided into the training, validation and testing sets.Specifically, the validation set consists of 3 tiles (areas: 26, 28, 30) and the testing set consists of 3 tiles (areas: 32, 34, 37).The remaining 10 tiles represent the training set, and each tile is used for training an individual RF classifier.In detail, there are approximately six million training samples on average for training each RF classifier.The number of trees and the number of prediction variables need to be provided for each RF.We find that the classification results are not sensitive to these parameters, consistent with the reports in previous studies [55,64].At last, the number of trees and the number of prediction variables are set to 100 and 4 respectively for all the 10 RF classifiers.Finally, the RFE classifier is obtained by fusing the 10 classifiers with the aid of the validation set, see Section 2.2.
In terms of the FCCRF-based refinement, as described by Krahenbuhl [57], we found that the kernel parameters w (2) and σ γ do not significantly affect the classification accuracy but yield a small visual improvement.Therefore, these parameters are all set to 3 by default in the experiment.With respect to the hyper parameters w (1) , σ α and σ β , we obtain them from a two-level grid search scheme.At the first level, we search for values of w (1) in the range of (3,10), with steps of 2; σ α and σ β are searched in the range of (5, 50) and (5, 100), respectively, with steps of 5.At the second level, we decrease all search steps to 1 around the best values of the first round.In the final mean-field approximate inference stage, we fix the number of mean field iterations to 10 for all experiments.

Feature Importance Analysis
The feature importance indicator is employed to validate the selected features, as shown in Figure 11.The feature importance indicator is the average value of the 10 RF classifiers.From this, we can see that the geometry feature NDSM derived from the 3D geometry data and the spectral feature NDVI derived from the multispectral image are the two most important features in the experiment.This result shows that both types of data sources are important and indispensable for accurate semantic classification.When comparing their importance by taking all features together, we found that the contribution rates derived from the multispectral image and the 3D geometry data were 69% and 31%, respectively.For the top-12 features, 7 are derived from the multi-spectral image and 5 are derived from the 3D geometry data.This finding reveals that the multispectral image plays a more important role than the 3D geometry data.Considering that there are three bands used to generate the multispectral image-based features but only one band used to compute the geometry-based features, we believe it is reasonable that more information is contained in the multispectral data.Certainly, the quality and number of features selected in this work also affect the comparison.
When comparing the contribution rates of the three different color space features, we see that both the CIE Lab (15.1%) and the HSV (13.8%) color spaces are larger than the RGB color space (11.0%).Furthermore, nearly all types (color space, vegetation index, texture, height, DMP) of features computed appeared in the top-12 feature set, and their orders seem reasonable, which proves the effectiveness of the feature extraction strategy used in this study.

Feature Importance Analysis
The feature importance indicator is employed to validate the selected features, as shown in Figure 11.The feature importance indicator is the average value of the 10 RF classifiers.From this, we can see that the geometry feature NDSM derived from the 3D geometry data and the spectral feature NDVI derived from the multispectral image are the two most important features in the experiment.This result shows that both types of data sources are important and indispensable for accurate semantic classification.When comparing their importance by taking all features together, we found that the contribution rates derived from the multispectral image and the 3D geometry data were 69% and 31%, respectively.For the top-12 features, 7 are derived from the multi-spectral image and 5 are derived from the 3D geometry data.This finding reveals that the multispectral image plays a more important role than the 3D geometry data.Considering that there are three bands used to generate the multispectral image-based features but only one band used to compute the geometrybased features, we believe it is reasonable that more information is contained in the multispectral data.Certainly, the quality and number of features selected in this work also affect the comparison.
When comparing the contribution rates of the three different color space features, we see that both the CIE Lab (15.1%) and the HSV (13.8%) color spaces are larger than the RGB color space (11.0%).Furthermore, nearly all types (color space, vegetation index, texture, height, DMP) of features computed appeared in the top-12 feature set, and their orders seem reasonable, which proves the effectiveness of the feature extraction strategy used in this study.

Effect of Random Forest Ensemble
One of the main contributions of this work is the model fusion strategy.To show the effectiveness of the strategy on classification, we provide the overall accuracies of 10 single RF classifiers and the RFE classifier tested on three different areas in Table 2. From these results we can see that for each area, the classification accuracies from 10 single RF classifiers change in a large range ([50.56%,87.77%] for area 32, [59.92%, 88.05%] for area 34 and [56.92%, 86.62%] for area 37).However, the training accuracies ([89.56%,94.73%]) of the 10 classifiers are rather close.This reveals that the over-fitting occurred in some individual RF classifiers.When we combine these individual RFs together, the highest classification accuracies in all the three areas are obtained.That is, using RF model fusion, we alleviate the over-fitting problem in single RFs and improve the discrimination ability of the final classifier.This demonstrates that training a more complicated model with a larger scale dataset is beneficial for classification, which is consistent with the conclusions drawn by Waske et al. [34] and Ceamanos et al. [35].The overall accuracies of different RF ensembles after combining different number of RF classifiers are given in Table 3.The accuracies in Table 3 are used to explore the robustness of the model fusion strategy with respect to different training set sizes.Specifically, RFE-10 represents the RF ensemble that combined 10 RF classifiers, and each RF classifier is labeled as RF-1, RF-2, . . .and RF-10, respectively.After that, we remove the two RF classifiers that had the highest and lowest accuracies, and then combined the rest of the RF classifiers together to form a new RF ensemble, namely RFE-8.Then, by removing the highest and lowest ones again among the 8 RF classifiers, RFE-6 is obtained.A similar procedure is used to obtain RFE-4 and RFE-2.Table 3 shows that the accuracies achieved by the RFE classifiers are generally higher than those achieved by component classifiers.That is, the proposed model fusion strategy is robust with respect to different training set sizes.

Effect of Fully Connected CRF
Although the classification performance of an RF can be promoted dramatically by model fusion, there are still some obvious artifacts in the classified maps, since the context information is not considered in the classification process, as shown in Figure 12a.To smooth this noisy classification map, CRF and its variants are commonly used techniques [1,13,16,57,65].As described in Section 2.3, an improved FCCRF is used in this study and the effect is shown in Figure 12b.By comparing it with Figure 12a, we found that the labels of some small misclassified regions are reassigned correctly, and the visual effect is improved significantly.In Figure 12c, the classification result refined by the classical short-range CRF is also shown.As in Figure 12b, its visual improvement is also evident.However, the superiority of the FCCRF can be observed in those areas inside the red ellipses.
Qualitatively speaking, our results also fall into the general paradigm that the CRF can generally improve the visual quality.However, from the quantitative perspective, there are different conclusions.For example, in a previous study [14], the authors stated that the CRF reduced the overall classification accuracy in their case.We think this is partially related to the classification strategies employed.For example, a CRF defined on the super-pixel level usually has a negative effect on the accuracy [14].On the other hand, the values assigned to the hyper parameters in the graph model have a great influence on the final accuracy.In this study, we found a maximum accuracy increment of 0.88% with the hyper parameters w (1) = 3, σ α = 20 and σ β = 31.However, only 0.57% of the increase is achieved by the classical short-range CRF.To investigate the influences of the hyper parameters on our improved pixel-level FCCRF model thoroughly, we show the relationship between the parameter values and the classification accuracy in Figure 13. Figure 13 suggests that when σ α ∈ [15,25] and σ β ∈ [20,40] a satisfactory result can be obtained.Otherwise, the accuracies declined sharply.In addition to the improved FCCRF, we also investigated the hyper parameters of the original FCCRF model [57], which obtained a maximum accuracy increment of 0.82% with the hyper parameters (1) 3 w = , 6 α σ = and 79 β σ = , as shown in Figure 14.From this figure we can see that the parameter α σ has a significant influence on the accuracy in this case, and if it is set too large, a negative effect will occur.Comparing Figure 14 with Figure 13, we can see that the accuracy Val-2 Val-3 Average  In addition to the improved FCCRF, we also investigated the hyper parameters of the original FCCRF model [57], which obtained a maximum accuracy increment of 0.82% with the hyper parameters (1) 3 w = , 6 α σ = and 79 β σ = , as shown in Figure 14.From this figure we can see that the parameter α σ has a significant influence on the accuracy in this case, and if it is set too large, a negative effect will occur.Comparing Figure 14 with Figure 13, we can see that the accuracy In addition to the improved FCCRF, we also investigated the hyper parameters of the original FCCRF model [57], which obtained a maximum accuracy increment of 0.82% with the hyper parameters w (1) = 3, σ α = 6 and σ β = 79, as shown in Figure 14.From this figure we can see that the parameter σ α has a significant influence on the accuracy in this case, and if it is set too large, a negative effect will occur.Comparing Figure 14 with Figure 13, we can see that the accuracy increment achieved by the improved FCCRF is superior to the original one if the hyper parameters are set reasonably.

Performance Analysis
To evaluate the proposed method deeply and objectively, we also ran our algorithm on all 17 areas with undisclosed ground truth data in addition to the 16 tiles with publicly available ground truth data.The classification results were submitted to the ISPRS Semantic Labeling Contest for evaluation, and three of the evaluation results are shown in Figure 15.From these results, we can see that most buildings, trees and grass areas are labeled correctly, although some small objects and pixels near the object boundaries are misclassified.

Performance Analysis
To evaluate the proposed method deeply and objectively, we also ran our algorithm on all 17 areas with undisclosed ground truth data in addition to the 16 tiles with publicly available ground truth data.The classification results were submitted to the ISPRS Semantic Labeling Contest for evaluation, and three of the evaluation results are shown in Figure 15.From these results, we can see that most buildings, trees and grass areas are labeled correctly, although some small objects and pixels near the object boundaries are misclassified.

Performance Analysis
To evaluate the proposed method deeply and objectively, we also ran our algorithm on all 17 areas with undisclosed ground truth data in addition to the 16 tiles with publicly available ground truth data.The classification results were submitted to the ISPRS Semantic Labeling Contest for evaluation, and three of the evaluation results are shown in Figure 15.From these results, we can see that most buildings, trees and grass areas are labeled correctly, although some small objects and pixels near the object boundaries are misclassified.To assess the results quantitatively, the accumulated confusion matrix and some derived measures (precision, recall and F1 score) for the whole unseen testing set are calculated and shown in Table 4. Mayer et al. [66] stated that in many cases if the classification correctness is approximately 85% and the completeness is approximately 70%, it can be used in practice.By this criterion, our classification results can be considered relevant and useful for practical applications, except for the 'car' class.As also reported by other related studies [14,17,67], compared to the other classes, the 'car' class is the most difficult category to classify, and its accuracy is usually around or even lower than 50% for most hand-crafted-features-based methods.After careful analysis, we think the following reasons may account for the low classification accuracies.First, because the 'car' usually has a low area size, a small height difference from the road and different NDVIs, it is widely recognized that the design of 'car' sensitive hand-crafted features is challenging.Second, as the 'car' samples only To assess the results quantitatively, the accumulated confusion matrix and some derived measures (precision, recall and F1 score) for the whole unseen testing set are calculated and shown in Table 4. Mayer et al. [66] stated that in many cases if the classification correctness is approximately 85% and the completeness is approximately 70%, it can be used in practice.By this criterion, our classification results can be considered relevant and useful for practical applications, except for the 'car' class.As also reported by other related studies [14,17,67], compared to the other classes, the 'car' class is the most difficult category to classify, and its accuracy is usually around or even lower than 50% for most hand-crafted-features-based methods.After careful analysis, we think the following reasons may account for the low classification accuracies.First, because the 'car' usually has a low area size, a small height difference from the road and different NDVIs, it is widely recognized that the design of 'car' sensitive hand-crafted features is challenging.Second, as the 'car' samples only account for 1% of all samples, a class imbalance problem exists.Third, all kinds of vehicles, including large trucks and small cars, are considered as 'car' despite the large inter-category difference in the dataset.In a word, the 'car' class needs to be further studied specifically as done in [68], and the CNN-based methods are expected to be more suitable for detecting different types of cars.Table 4 also shows that no samples are classified into the 'clutter' category in the testing set, and the recall rate of the 'clutter' category is 0%.This fact indicates that our model cannot learn the common characteristics of this category due to the overlarge inter-category difference and the too few samples of this category.Finally, we compared the current method with the other related methods submitted to the ISPRS Semantic Labeling Benchmark.In total, there were 42 different classification results from 17 different research groups submitted to the benchmark, 31 of which are deep CNN-based results.We did not change the names of the methods submitted to the Benchmark and the method presented in this study is denoted as "NLPR".For the sake of clarity and readability, the best result achieved by each research group was collected for comparison in Table 5.
As seen from Table 5, our "NLPR" performs the best among all of the non-CNN-based methods, and the overall accuracy of our method is 86.9%.As far as the five specific classes are concerned, our method ranks first in the "imp surf" and "tree" classes; and second in the "building", "grass" and "car" classes.The strongest competitors are IVFL (86.5%, an object-based method) and RIT [67] (86.3%, a structured path-based method).Both methods classify the data using a random forest model.In Table 6, we compare our results with the results of these two methods in three areas.We can see that the classified results from our method are cleaner than the results from the other two methods.Moreover, the boundaries segmented by our method are more accurate than those of the other two methods, which makes our method superior in real applications such as digital map generation and object contour detection.Table 7 shows the time required at the different stages of the proposed method for a tile with 3 Table 7 shows the time required at the different stages of the proposed method for a tile with 3 Table 7 shows the time required at the different stages of the proposed method for a tile with 3 Table 7 shows the time required at the different stages of the proposed method for a tile with Compared to the CNN-based methods, our "NLPR" can surpass several of them, such as the "ETH_C" and "UCal".Compared to the best CNN-based methods, namely "CASIA2" and "DLR_10", our "NLPR" still has a notable gap.However, to get a higher performance, an extremely deep and complicated network with 101 layers is used by "CASIA2", several deep networks with different structures are combined in "DLR_10" [69], and the quality of NDSM used in "DLR_10" [69] is also higher than ours.We think this gap is reasonable.
Table 7 shows the time required at the different stages of the proposed method for a tile with 2000 × 2500 pixels.We can see that most of the time is spent on the RFE classification, and the amount of time required for the RFE classification is related to the number of RFs.Specifically, it takes 138 s to classify a tile of 2000 × 2500 pixels with a single RF classifier, and it takes 1415 s (1380 s for RFs classification and 35 s for combining) to classify the same data with an RFE model with 10 RFs.In contrast, at the feature extraction and FCCRF refinement stages, the time costs are not very high and only account for approximately 10% of the whole time, which is approximately 25 min.These suggest that the total computational time of the proposed method mainly depends on the number of RF classifiers, which can be seen as a drawback of the ensemble-learning-based methods.Note that with the aid of high-performance GPU, Marmanis et al. [15] spent about 18 min (9 min for coarse classification, another 9 min for refinement) to classify a tile of the same size with a state-of-the-art CNN-based method.Considering that no GPU was used in our case, our proposed method seems comparable to theirs in terms of computational efficiency.Note also that in case the computational time is a primary concern, the classification efficiency can be promoted by reducing the number of individual RF classifiers, as indicated in Table 3.

Conclusions
The current study combines methods from RF and probabilistic graphical models to classify high-resolution remote sensing data.Using high-resolution multispectral images and 3D geometry data, the method proceeds in three main stages: feature extraction, classification and refinement.A total of 13 features (color, vegetation index and texture) from the multispectral image and 11 features (height, elevation texture and DMP) from the 3D geometry data are first extracted to form the feature space.Then, the random forest is selected as the basic classifier for semantic classification.Inspired by the big training data and ensemble learning strategy adopted in the machine learning and remote sensing communities, a tile-based scheme is used to train multiple RFs separately.The multiple RFs are then combined to jointly predict the category probabilities of each sample.Finally, the probabilities and the feature importance indicators are used to construct an FCCRF graph model, and a mean-field-based statistical inference is carried out to refine the above classification results.
Experiments on the ISPRS Semantic Labeling Contest data show that features from both the multispectral image and the 3D geometry data are important and indispensable for the accurate semantic classification.Moreover, multispectral image-derived features play a greater role than the 3D geometry features.When comparing the classification accuracy of the single RF classifier and the fused RF ensemble, we found that both the generalization capability and the discriminability were enhanced significantly.Consistent with the conclusions drawn by others, the smoothness effect of the CRF is also evident in the presented work.Moreover, by introducing the 3 most important features to the pairwise potential of CRF, the classification accuracy is improved by approximately 1% in the presented experiments.
Note that in addition to urban land cover mapping, the current study can also be extended and used for other activities, such as vegetation mapping, water body mapping, and change detection.
Among the non-CNN-based methods, we achieve the highest overall accuracy, 86.9%.When compared to the CNN-based approaches, the gap between the current method and the best CNN method in the contest is still notable.However, we found that the presented method indeed outperformed several CNN-based methods.Furthermore, CNN can be conveniently integrated into the present RFE framework, as a feature extraction submodule to further boost the classification performance of the presented method.Hopefully, the integrated method could outperform the current best CNN method, which will be one of our future directions.

26 Figure 1 .
Figure 1.Pipeline of the proposed classification method.

Figure 1 .
Figure 1.Pipeline of the proposed classification method.

( 2 )
Local standard deviation f std corresponds to the standard deviation of the neighborhood centered at the pixel.It can describe the degree of variability of a certain region.(3) Local entropy f entr = − n ∑ i=1 p i log(p i ), where p i (i = 1, 2, ..., n) represents the statistics of the local histogram distribution, measures the entropy value of the neighborhood centered at the pixel.It indicates the randomness of a certain region.

Figure 2 .
Figure 2. Illustration of the basic principle of the DMP.

1 ∏Figure 2 .
Figure 2. Illustration of the basic principle of the DMP.

Figure 4 .Figure 3 .
Figure 4. Flowchart for RFE model training.In the sample selection and partition stage, a tile-based strategy is used to partition the samples into a training set, a validation set, and a testing set.Specifically, the large urban area covered by the samples (pixels) is first cut into several small tiles, as shown in Figure 5.Then, we partition samples by dividing the tiles into different groups.After this partition, each type of sample consists of several tiles, and they are input into the different stages for training or testing purposes.In detail, the training samples are used to train the individual RF models; the validation samples are used to both fuse the trained RF models and learn the hyper parameters of the FCCRF model; the effects of each strategy employed in the proposed method is validated using the testing samples.To avoid a serious imbalance of the classes in the training samples, we try to maintain the class balance of the training

Figure 4 .Figure 4 .
Figure 4. Flowchart for RFE model training.In the sample selection and partition stage, a tile-based strategy is used to partition the samples into a training set, a validation set, and a testing set.Specifically, the large urban area covered by the samples (pixels) is first cut into several small tiles, as shown in Figure 5.Then, we partition samples by dividing the tiles into different groups.After this partition, each type of sample consists of several tiles, and they are input into the different stages for training or testing purposes.In detail, the training samples are used to train the individual RF models; the validation samples are used to both fuse the trained RF models and learn the hyper parameters of the FCCRF model; the effects of each strategy ISPRS Int.J. Geo-Inf.2017, 6, 245 9 of 26 stage, only those samples that are not located in the object border areas within each tile are selected as valid, since the samples located in the border areas are likely to be mixed and incorrect pixels.

Figure 5 .
Figure 5. Tile-based samples selection and partition, where each type of sample consists of several tiles.

Figure 5 .
Figure 5. Tile-based samples selection and partition, where each type of sample consists of several tiles.

Figure 6 .
Figure 6.The results of two testing areas refined using short-range CRF and long-range FCCRF methods respectively.(a) the multispectral image; (b) the classification result from the RFE model; (c) the refined result from the short-range CRF model; (d) the refined result from the long-range FCCRF model.
compatibility function, which can be given by the simple Potts model, or by learning as done in a previous study[57].Each( )

Figure 6 .
Figure 6.The results of two testing areas refined using short-range CRF and long-range FCCRF methods respectively.(a) the multispectral image; (b) the classification result from the RFE model; (c) the refined result from the short-range CRF model; (d) the refined result from the long-range FCCRF model.

Figure 7 .
Figure 7.Comparison of the short-range CRF and the fully connected CRF models.(a) In the shortrange CRF, only the 4 nearest neighbors are connected to a certain node; (b) in the fully connected CRF, every two nodes are connected by an edge (from [59]).
e., Potts Model).Different from previous studies, after a feature importance analysis, the feature importance indicators generated by the trained RFE model are used to construct the feature spaces i f and j f , then Equation (

Figure 8 .Figure 7 .
Figure 8. Flowchart of the FCCRF-based refinement.The class probabilities and the feature importance indicators generated by the trained RFE model are used to construct the unary potential and the pairwise potential, respectively.

Figure 7 .
Figure 7.Comparison of the short-range CRF and the fully connected CRF models.(a) In the shortrange CRF, only the 4 nearest neighbors are connected to a certain node; (b) in the fully connected CRF, every two nodes are connected by an edge (from [59]).The flowchart of the FCCRF-based refinement implemented in the current study is shown in Figure8, where we can see that the class probabilities generated by the RFE are used to construct the unary potential ( )u i x ϕ

Figure 8 .Figure 8 .
Figure 8. Flowchart of the FCCRF-based refinement.The class probabilities and the feature importance indicators generated by the trained RFE model are used to construct the unary potential and the pairwise potential, respectively.

Figure 9 .
Figure 9.The true orthophoto of the test area, which is divided into 33 tiles.Figure 9.The true orthophoto of the test area, which is divided into 33 tiles.

Figure 9 .Figure 10 .
Figure 9.The true orthophoto of the test area, which is divided into 33 tiles.Figure 9.The true orthophoto of the test area, which is divided into 33 tiles.

Figure 10 .
Figure 10.Ground truth data used for the training and evaluation: (a) "full reference" ground truth; (b) "no boundary" ground truth, the black areas in this reference will be ignored during evaluation.

Figure 12 .Figure 13 .
Figure 12.The classification results before and after the CRF refinement.(a) Classification map generated by the RFE; (b) The refined classification map produced by our long-range FCCRF; (c) The refined classification map produced by the classical short-range CRF.

Figure 12 .
Figure 12.The classification results before and after the CRF refinement.(a) Classification map generated by the RFE; (b) The refined classification map produced by our long-range FCCRF; (c) The refined classification map produced by the classical short-range CRF.

Figure 12 .Figure 13 .
Figure 12.The classification results before and after the CRF refinement.(a) Classification map generated by the RFE; (b) The refined classification map produced by our long-range FCCRF; (c) The refined classification map produced by the classical short-range CRF.

Figure 15 .
Figure 15.Classification results of 3 areas evaluated on the ISPRS unseen testing set.Right (a,d,g) The original TOP (for better comparison); middle (b,e,h) the classified results obtained by the proposed method; right (c,f,i) the red/green error map (red pixels indicate wrongly classified pixels).

Figure 15 .
Figure 15.Classification results of 3 areas evaluated on the ISPRS unseen testing set.Right (a,d,g) The original TOP (for better comparison); middle (b,e,h) the classified results obtained by the proposed method; right (c,f,i) the red/green error map (red pixels indicate wrongly classified pixels).

Table 1 .
Different types of features and their abbreviations used in this work.

Table 1 .
Different types of features and their abbreviations used in this work.

Table 2 .
Overall accuracies of 10 single RF classifiers (their training accuracies are shown in the second row) and the RF ensemble (RFE) tested on the testing set (area: 32, 34 and 37).The highest accuracy values for each area are shown in bold.

Table 3 .
Overall accuracies of the RF ensemble (RFE) with different numbers of RF classifiers.

Table 4 .
Accumulated confusion matrix and some derived measures (precision, recall and F1 score) of the ISPRS Semantic Labeling Contest benchmark on the unseen testing set.

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 5 .
Comparison with the best results achieved by each research group in the ISPRS Semantic Labeling Contest.Our method is denoted as "NLPR".

Table 7 .
Time costs at different stages of the proposed classification pipeline for a tile of 2000 × 2500 pixels.
n: the number of single RFs in RFE.