Mapping Heterogeneous Urban Landscapes from the Fusion of Digital Surface Model and Unmanned Aerial Vehicle-Based Images Using Adaptive Multiscale Image Segmentation and Classification

Considering the high-level details in an ultrahigh-spatial-resolution (UHSR) unmanned aerial vehicle (UAV) dataset, detailed mapping of heterogeneous urban landscapes is extremely challenging because of the spectral similarity between classes. In this study, adaptive hierarchical image segmentation optimization, multilevel feature selection, and multiscale (MS) supervised machine learning (ML) models were integrated to accurately generate detailed maps for heterogeneous urban areas from the fusion of the UHSR orthomosaic and digital surface model (DSM). The integrated approach commenced through a preliminary MS image segmentation parameter selection, followed by the application of three supervised ML models, namely, random forest (RF), support vector machine (SVM), and decision tree (DT). These models were implemented at the optimal MS levels to identify preliminary information, such as the optimal segmentation level(s) and relevant features, for extracting 12 land use/land cover (LULC) urban classes from the fused datasets. Using the information obtained from the first phase of the analysis, detailed MS classification was iteratively conducted to improve the classification accuracy and derive the final urban LULC maps. Two UAV-based datasets were used to develop and assess the effectiveness of the proposed framework. The hierarchical classification of the pilot study area showed that the RF was superior with an overall accuracy (OA) of 94.40% and a kappa coefficient (K) of 0.938, followed by SVM (OA = 92.50% and K = 0.917) and DT (OA = 91.60% and K = 0.908). The classification results of the second dataset revealed that SVM was superior with an OA of 94.45% and K of 0.938, followed by RF (OA = 92.46% and K = 0.916) and DT (OA = 90.46% and K = 0.893). The proposed framework exhibited an excellent potential for the detailed mapping of heterogeneous urban landscapes from the fusion of UHSR orthophoto and DSM images using various ML models.

and woodland in Deyang, China. RF exhibited a higher classification accuracy than SVM. Cao et al. [55] applied two classification algorithms, namely, SVM and k-nearest neighbors (KNN), in the GEOBIA domain to map mangrove species from a UAV hyperspectral image. The results showed that SVM performed better, with 89.55% accuracy compared with 81.70% for KNN. Akar [56] compared various ML algorithms for LULC classification using UAV images collected from urban and rural areas. The results showed that rotation forest (92.52%) outperformed RF (90.52%) and gentle AdaBoost (87.52%). Liu et al. [57] proposed an SVM-deep belief network (restricted Boltzmann machine) method to extract eight land cover classes, namely, tree, building 1, road, grass, river, building 2, building 3, and bare land, using the fusion of LiDAR data and UAV images. Their proposed technique achieved an overall accuracy (OA) of 92.16% and a kappa (K) value of 0.904.
In this study, an adaptive MS segmentation and classification approach was adopted to classify heterogeneous urban areas through the fusion of the UHSR orthophoto and digital surface model (DSM). The main objectives of the current study are to (1) develop an adaptive MS-optimized image object approach for detailed urban LULC mapping from UAV-based data, (2) investigate the effects of MS segmentation on FS computation (CFS and SVM) and its impact on classification accuracy, (3) compare the performance of three mature ML classification algorithms, namely, RF, SVM, and decision tree (DT), at MS levels, and (4) assess the transferability of the adopted framework. The remainder of this paper is organized as follows. Section 2 outlines the geographical location of the study area and describes the ground truth (GT) data. Section 3 presents a generic overview of the methodological framework and detailed information about image processing, image segmentation optimization, FS, MS classification, and evaluation metrics. Section 4 describes the results, and Section 5 discusses the experimental findings. Section 6 provides the conclusions.

Study Area
The study area is located on the University of Science, Malaysia (USM) campus in Penang, Malaysia. It represents an urban area of Penang Island with different LULCs, including vegetation, water bodies, buildings, roads, and bare soil. The RGB images were acquired in February 2018 (Figure 1) using a Canon PowerShot SX230 HS (4000 × 3000 resolution) mounted on a UAV flying at an altitude of 353 m. The ground resolution of the orthomosaic is approximately 10 cm, with an 8-bit radiometric resolution. The first dataset was a subset of 2.24 km² from the produced orthomosaic, located between 100°18′7.43″ E, 5°21′51.574″ N and 100°19′2″ E, 5°21′8.143″ N. A DSM with 0.8 m resolution was generated from 3500 points using Agisoft PhotoScan Professional (version 1.3.4, http://www.agisoft.com). The second subset (with the coordinates of 100°17′29.341″ E, 5°21′38.6″ N and 100°18′9.622″ E, 5°21′5.345″ N), covering an area of 1.27 km², was selected for investigating the transferability of the methodology.

GT Data
In the first study area, a total of 1177 GT samples for various urban LULC classes were prepared through field surveys with the aid of Google Earth images. Twelve different classes were identified, including water bodies, bare soil, grass, trees, clay tiles type 1, clay tiles type 2, metallic roofs type 1, metallic roofs type 2, concrete, dark concrete roofs, asbestos cement roofs, and asphalt. The training and testing GT samples were prepared as vector points and meticulously selected to ensure that all the available urban LULC classes are well represented. Table 1 presents several types of LULC classes available in the UAV-based images. Amongst the collected GT samples, 70% of each class was utilized in the training of ML models, and 30% was dedicated for testing them. For the second study area, sample statistics derived from the training samples of the first study area were used in the classification models, and 305 GT testing samples were used to evaluate the classification results.
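The per-class 70/30 division described above can be sketched as a stratified split. The following is a minimal illustration only, not the authors' procedure: the function name `stratified_split`, the random seed, and the toy sample list are assumptions introduced here.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_fraction=0.7, seed=0):
    """Split (sample_id, class_label) pairs so that each class contributes
    roughly `train_fraction` of its samples to the training set."""
    by_class = defaultdict(list)
    for sample_id, label in samples:
        by_class[label].append(sample_id)

    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    train, test = [], []
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        n_train = round(train_fraction * len(ids))
        train += [(i, label) for i in ids[:n_train]]
        test += [(i, label) for i in ids[n_train:]]
    return train, test


# Hypothetical toy data: 10 "water" points and 20 "asphalt" points.
samples = [(i, "water") for i in range(10)] + [(i, "asphalt") for i in range(10, 30)]
train, test = stratified_split(samples)
print(len(train), len(test))  # 21 9
```

Splitting within each class, rather than over the pooled samples, keeps small classes represented in both the training and the testing sets.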

Table 1. LULC classes available in the UAV-based images.

LULC Type | Description
Water bodies | Water bodies with light blue and green colors
Roads | Urban roads with grey color

Methodology

Overview
In this study, MS image segmentation optimization, MS feature computation and evaluations, and supervised hierarchical ML models were conducted for accurate detailed mapping of a heterogeneous urban landscape from UAV-based images. As depicted in Figure 2, the adopted methodology comprises five main phases. Firstly, drone-based images were acquired and preprocessed to generate the orthophoto and the DSM. Secondly, the optimum MS segmentation parameters were identified using unsupervised segmentation quality metrics. Thirdly, the most significant features were selected at MS levels on the basis of CFS and SVM wrapper approaches. Fourthly, adaptive MS segmentation optimization and classification were conducted for detailed urban LULC mapping using the RF, SVM, and DT algorithms. Finally, the transferability of the proposed methodology to a different study area was investigated.

Remote Sens. 2020, 12, x FOR PEER REVIEW 8 of 30

Image Preprocessing
Various photogrammetric steps, such as interior, relative, and absolute orientations, were conducted to establish the mathematical relationship between the image and the ground and subsequently generate the digital elevation model and the orthophoto (an image with the same characteristics as a map, where distortions caused by relief displacement are removed and the image has a uniform scale). Throughout this process, image matching, automatic aerial triangulation, geopositioning, orthorectification, and image mosaicking were performed to create the orthomosaic image and the DSM from the UAV data using Agisoft PhotoScan and ArcGIS 10.4.1. The process commenced by estimating the exterior and interior orientation parameters, which describe the position of the camera for each image and the camera calibration, respectively. The RGB images were geometrically corrected and geotagged to WGS 1984 (World Geodetic System 1984) using the files extracted from the Global Positioning System units in the drone and the ground reference station. The images were projected to a Universal Transverse Mercator coordinate system (zone 38 North) and converted from JPEG to GeoTIFF format. Subsequent steps, such as aligning images, building field geometry, and orthophoto generation, were implemented to create a DSM (a 3D polygon mesh representing surface ground) and an orthomosaic. The DSM was generated with the nearest-neighbor interpolation method and resampled to the same resolution as the orthomosaic. The spatial resolution of the final orthomosaic for the two study areas was 10 cm, and the spatial resolution of the DSM was 80 cm.

MS Image Segmentation Optimization
The optimal segmentation level is defined in most of the unsupervised methods as the level that maximizes the between-object heterogeneity (i.e., adjacent objects can be distinguished from their surroundings) and the within-object homogeneity (i.e., pixels belonging to the same object are similar) [40,41]. The likeness between each image object and its neighbors, known as the undersegmentation metric, is determined through spatial autocorrelation (Moran's I (MI)) [58], whereas the internal homogeneity of an image object, known as the oversegmentation metric, is determined through the area-weighted variance (WV) [41].
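The two metrics can be illustrated with a short sketch. It assumes the commonly used definitions, which the text does not spell out: WV as the mean of per-object variances weighted by object area, and Moran's I with binary weights (1 for adjacent objects, 0 otherwise). The function names and toy values are hypothetical.

```python
def area_weighted_variance(areas, variances):
    """Area-weighted mean of the per-object spectral variances."""
    return sum(a * v for a, v in zip(areas, variances)) / sum(areas)


def morans_i(object_means, adjacency):
    """Moran's I over object mean values.

    `adjacency` is a list of directed index pairs (i, j) for neighbouring
    objects, each with an implicit weight of 1 (assumption).
    """
    n = len(object_means)
    grand_mean = sum(object_means) / n
    dev = [m - grand_mean for m in object_means]
    cross = sum(dev[i] * dev[j] for i, j in adjacency)
    total_weight = len(adjacency)
    squared = sum(d * d for d in dev)
    return n * cross / (squared * total_weight)


# Toy example: four objects in a chain (0-1-2-3), listed in both directions.
means = [10.0, 10.0, 50.0, 50.0]
pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
print(area_weighted_variance([1.0, 3.0], [4.0, 8.0]))  # 7.0
print(round(morans_i(means, pairs), 4))  # 0.3333
```

A low WV indicates internally homogeneous objects, and a low MI indicates that neighbouring objects differ from one another.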
An adaptive segmentation optimization approach that integrates an unsupervised quality measure, namely, the F-measure, with a machine learning classification model was adopted in this study to identify the optimal scale(s) for each urban LULC class. The F-measure quality measure [39] was utilized to determine the hierarchical scale values from a set of given segmentation outputs. The F-measure value for estimating the optimum MS of an application can be computed using Equation (1):

$$F = (1 + \phi^2)\,\frac{MI_{norm} \times WV_{norm}}{\phi^2 \times MI_{norm} + WV_{norm}} \qquad (1)$$

where WV_norm and MI_norm represent the normalized area-WV (oversegmentation metric) and the normalized Moran's I (undersegmentation metric), respectively. The relative weights of WV_norm and MI_norm are controlled through a scene-independent factor (ϕ). The ϕ values are selected to ensure that the generated segmentation levels vary considerably in terms of the within-object homogeneity and between-object heterogeneity. For instance, ϕ = 3 signifies that triple weighting is assigned to WV_norm, ϕ = 0.5 indicates half weighting for WV_norm, and ϕ = 1 denotes that equal weighting is considered for WV_norm and MI_norm. Additional details about the unsupervised parameter optimization can be found in [39,59]. The levels defined by the F-measure are used in the second phase to perform a single-scale (SS) classification of each defined segmentation scale. The class-specific accuracy (F-measure) is used to evaluate the accuracy of each class at multiple levels, as shown in Section 4.3. Then, the optimal scale(s) for extracting each class is determined and used for subsequent analysis.
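The scale selection can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the F-measure takes the weighted harmonic-mean form F = (1 + ϕ²)(MI_norm × WV_norm)/(ϕ² × MI_norm + WV_norm), and that both metrics are min-max normalized and inverted so that lower raw values score higher. The WV and MI values per scale are invented toy numbers.

```python
def normalize_inverted(values):
    """Min-max normalize so that the lowest raw value maps to 1 (assumption)."""
    lo, hi = min(values), max(values)
    return [(hi - v) / (hi - lo) for v in values]


def f_measure(wv_norm, mi_norm, phi):
    """Weighted harmonic mean of the two normalized metrics; phi > 1 favours WV."""
    denom = phi ** 2 * mi_norm + wv_norm
    if denom == 0:
        return 0.0
    return (1 + phi ** 2) * mi_norm * wv_norm / denom


def best_scale(scales, wv, mi, phi):
    """Return the scale with the highest F-measure for a given phi."""
    wv_n = normalize_inverted(wv)
    mi_n = normalize_inverted(mi)
    scores = [f_measure(w, m, phi) for w, m in zip(wv_n, mi_n)]
    return scales[scores.index(max(scores))]


# Hypothetical metrics: WV grows and MI shrinks as the scale gets coarser.
scales = [50, 100, 150, 200]
wv = [10.0, 20.0, 35.0, 60.0]
mi = [0.8, 0.5, 0.3, 0.2]
print(best_scale(scales, wv, mi, phi=3))  # 100
print(best_scale(scales, wv, mi, phi=1))  # 150
```

With these toy numbers, the larger ϕ pulls the optimum toward the finer scale because it gives more weight to within-object homogeneity.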

Feature Computation and Selection
Considering the spectral similarity between the various urban LULC classes in the UHSR RGB images, various features, including spectral values, color invariants, and geometrical and textural features, were computed and assessed at multiple scales, as shown in Table 2. Selecting the significant features prior to classification is necessary to minimize the computational time by excluding the redundant attributes and enhance the accuracy of an ML classifier [47]. In this study, CFS and SVM as wrapper FS techniques were utilized to identify the most relevant MS features of image objects from UAV datasets.

Table 2. Detailed description of the evaluated attributes (features).

Feature Type | Tested Feature Name | Description | Reference
Spectral | Mean | The mean intensity values computed for an image segment of the RGB channels and the DSM. | [60]
Spectral | Standard deviation | The standard deviation values computed for an image segment of the RGB channels and the DSM. | [60]
Spectral | Max_difference | The maximum difference between the RGB channels. | [60]
Spectral | Brightness | The average of means of the RGB channels. |
Texture | Mean | The grey level co-occurrence matrix (GLCM) mean sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Texture | Homogeneity | The GLCM homogeneity sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Texture | Contrast | The GLCM contrast sum of all directions determined for each band from the RGB channels and the DSM. The grey level difference vector (GLDV) contrast sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Texture | Entropy | The GLCM and GLDV entropy sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Texture | Correlation | The GLCM correlation sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Texture | Standard deviation | The GLCM standard deviation sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Texture | Dissimilarity | The GLCM dissimilarity sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Texture | Angular second moment | The GLCM angular second-moment sum of all directions determined for each band from the RGB channels and the DSM. | [64]
Geometrical | Length/Width | The ratio between the length and width. | [60]
Geometrical | Rectangular Fit | A ratio that is based on how well an image object fits into a rectangle. | [60]
Geometrical | Shape_index | A ratio that defines border smoothness of image objects and can be computed by dividing the border length of an image object by four times the square root of its area. | [60]
Geometrical | Density | It can be computed by dividing the area covered by an image object by its radius. | [60]
Geometrical | Elliptic_fit | A ratio based on how well an image object can fit into an ellipse. | [60]
Geometrical | Compactness | It is expressed as the ratio of the area of an image object to the area of a circle with a similar perimeter. | [60]

The seventy features listed above were computed for three optimized MS image objects, and two efficient FS methods, namely, CFS and SVM, were used to find the relevant feature subset for each optimized image object level.

CFS
CFS performs fast processing to appropriately select the optimal feature subset [53,65,66]. It uses a search algorithm that heuristically assesses each attribute's predictive capability and the degree of intercorrelation between the attributes [67]. In other words, this evaluating mechanism calculates the correlations between the features and classes to identify features that are highly correlated with the target class whilst maintaining low correlation and a low level of redundancy amongst the features [68]. The correlation between a subset of attributes and the target classes is estimated using Equation (2):

$$Merit_S = \frac{s\,r_{ci}}{\sqrt{s + s(s-1)\,r_{ii}}} \qquad (2)$$

where Merit_S is the heuristic merit of a feature subset S, s denotes the number of features, r_ci represents the average correlation between the subset features and the class variable, and r_ii denotes the average intercorrelation between the subset features. Accordingly, features with high correlation coefficients with the target labels are considered to be relevant to the respective class characterization with a high level of association, whilst a lower intercorrelation (r_ii) is desired [68].
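The effect of the two correlation terms on the subset score can be checked numerically. The sketch below assumes the standard CFS merit formula, s·r_ci / sqrt(s + s(s-1)·r_ii); the function name `cfs_merit` and the example correlations are hypothetical.

```python
import math


def cfs_merit(s, r_ci, r_ii):
    """Heuristic merit of a feature subset with s features.

    r_ci: average feature-class correlation.
    r_ii: average feature-feature intercorrelation.
    """
    return s * r_ci / math.sqrt(s + s * (s - 1) * r_ii)


# Same relevance (r_ci = 0.6), different redundancy levels.
low_redundancy = cfs_merit(4, 0.6, 0.2)
high_redundancy = cfs_merit(4, 0.6, 0.8)
print(round(low_redundancy, 4), round(high_redundancy, 4))
```

For a fixed feature-class correlation, the subset whose features are less correlated with one another receives the higher merit, which is the behaviour the paragraph above describes.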

SVM
SVM is a widely applied nonparametric supervised statistical learning algorithm and is highly suitable for GEOBIA FS and classification tasks [51,69]. This algorithm seeks an optimal separating hyperplane, defined by a subset of the training samples known as support vectors, that can effectively separate the input features (datasets) into target classes with a minimum misclassification and a maximum margin amongst the target classes [70][71][72]. When the task is linearly separable, the hyperplane can be represented using Equation (3):

$$y_i\,(w \cdot x_i + b) \geq 1 - \delta_i \qquad (3)$$

where x_i is the feature vector of the ith training sample, y_i ∈ {-1, +1} is its class label, and w indicates the coefficient vector that determines the orientation of the hyperplane in the feature space. The offset of the hyperplane from the origin and the positive slack variables are represented by b and δ_i, respectively [73]. Equation (4) determines the optimized hyperplane, where many hyperplanes can be designed to distinguish between classes:

$$\max_{a}\ \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j\,(x_i \cdot x_j), \quad \text{subject to } \sum_{i=1}^{n} a_i y_i = 0,\ 0 \leq a_i \leq C \qquad (4)$$

where a_i denotes the Lagrange multipliers, n is the number of training samples, and C is the penalty.
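The role of the slack variables can be made concrete with a small sketch. It assumes the standard soft-margin constraint y_i(w·x_i + b) ≥ 1 - δ_i, so that the smallest admissible slack is max(0, 1 - y_i(w·x_i + b)); the weight vector, offset, and sample points are invented for illustration.

```python
def decision_value(w, b, x):
    """Signed score of sample x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def slack(w, b, x, y):
    """Smallest slack that satisfies y * (w.x + b) >= 1 - slack."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


w, b = [1.0, -1.0], 0.0  # hypothetical hyperplane

print(slack(w, b, [2.0, 0.0], +1))  # 0.0: correct side, outside the margin
print(slack(w, b, [0.5, 0.0], +1))  # 0.5: correct side, inside the margin
print(slack(w, b, [1.0, 0.0], -1))  # 2.0: misclassified
```

A slack of zero means the sample respects the margin, a value between 0 and 1 means it falls inside the margin but on the correct side, and a value above 1 means it is misclassified. The penalty C controls how strongly the sum of these slacks is punished during training.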

Supervised MS Image Object Classification
Image classification is the final phase in GEOBIA, and the common classification methods used in this phase are supervised ML models or rule-based methods. In this study, the MS image object classification was implemented using three supervised classification algorithms, namely, RF, SVM, and DT. The classification models were trained using the sample statistics derived from the GT dataset of the first study area. The object-based classification outcomes at different scales were used to quantitatively evaluate the MS segmentation results and select the optimum scale for each urban LULC class. The classification scheme started with a single classification of each optimized image level using the feature subset selected for that level. After acquiring the proper information about the optimal scale(s) for each class, ML models were used to initially classify large objects at a large scale parameter (SP). The classification results were then copied to a new level, where only the unclassified objects were resegmented to a finer segmentation level, and the ML models were then used to classify the resegmented objects on the basis of the significant features selected at that level. The process continued iteratively until all classes were accurately classified or no improvement was detected in the OA and class-specific accuracy (F-measure). Two of the aforementioned ML algorithms, DT and RF, are briefly described in the following paragraphs.
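The coarse-to-fine loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: `classify` and `resegment` are hypothetical callables supplied by the caller, `classes_at_scale` is a hypothetical mapping from each scale to the classes best extracted there, and the stopping rule based on OA improvement is omitted.

```python
def hierarchical_classify(objects, scales, classes_at_scale, resegment, classify):
    """Classify objects from the coarsest to the finest scale.

    A label is accepted at a scale only if that scale is optimal for the
    class; remaining objects are resegmented at the next finer scale.
    """
    labelled = []  # accepted (object, label, scale) triples
    pending = list(objects)
    for level, scale in enumerate(scales):  # scales ordered coarse -> fine
        last_level = level == len(scales) - 1
        unresolved = []
        for obj in pending:
            label = classify(obj, scale)
            if last_level or label in classes_at_scale[scale]:
                labelled.append((obj, label, scale))
            else:
                unresolved.append(obj)
        if last_level or not unresolved:
            break
        next_scale = scales[level + 1]
        pending = [part for obj in unresolved for part in resegment(obj, next_scale)]
    return labelled


# Toy stand-ins: an object is a string of per-pixel class codes.
def majority(obj, scale):
    return max(sorted(set(obj)), key=obj.count)


def halves(obj, scale):
    mid = len(obj) // 2
    return [obj[:mid], obj[mid:]]


result = hierarchical_classify(
    ["WWWW", "RRTT"], [200, 100, 50],
    {200: {"W"}, 100: {"R"}, 50: {"T"}}, halves, majority)
print(result)
```

In the toy run, the homogeneous object is accepted at scale 200, whereas the mixed object is split twice before its parts are labelled at scales 100 and 50.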
A DT is a supervised and nonparametric ML technique that operates without prior knowledge of the data distribution, is easy to interpret, and can reduce data complexity and model the relationships between variables [74][75][76][77][78][79]. It is a flexible, fast, and robust algorithm that can handle the nonlinearity between the input features and discrete classes [75]. A DT hierarchically applies IF-THEN rules to label the variables of each class, where the leaves (end nodes) of the tree structure represent the discrete class labels (decisions), and the branches assist in assigning the labels on the basis of the attributes and majority voting [76]. A heuristic DT recursively partitions a dataset into homogeneous subsets in conjunction with the attribute values at each branch or node in the single tree [77].
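One partitioning step of such a tree can be sketched as a search for the threshold that yields the most homogeneous subsets. The text does not state which homogeneity criterion was used; the Gini impurity below is an assumption, and the DSM heights and class labels are toy values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (weighted impurity, threshold) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates equal values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[0]:
            best = (score, threshold)
    return best


# Hypothetical DSM heights (m) for ground-level and roof objects.
heights = [0.2, 0.5, 6.0, 8.0]
classes = ["asphalt", "asphalt", "concrete roof", "concrete roof"]
print(best_threshold(heights, classes))  # (0.0, 3.25)
```

The resulting rule reads as "IF height > 3.25 THEN concrete roof ELSE asphalt", which is the IF-THEN form described above. A full tree repeats this search recursively within each subset.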
The RF algorithm is an ensemble of DT classifiers that classifies variables with high accuracy, and its robustness against overfitting the training dataset, along with its insensitivity to nonnormal and noisy data, makes it suitable for LULC classification [51,78,79]. RF exploits many DTs as a forest generated from bootstrap samples and utilizes each tree's vote to assign the most frequent class label to the input variables [78,80]. Each tree randomly selects the predictors and object features from the input vector at every tree node to reduce the generalization error [78,81]. The prediction of the samples is calculated on the basis of the majority votes amongst the trees [80,81]. The discrimination assignment is calculated using Equation (6):

$$H(X) = \arg\max_{Y} \sum_{k=1}^{n} I\big(h(X, \theta_k) = Y\big) \qquad (6)$$

where n is the number of trees, θ_k is a random vector for the kth tree, X is an input vector, I(·) is an indicator function, h(·) is a single DT, Y is an output variable, and argmax_Y denotes the Y value that maximizes the sum of the tree votes.
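The voting rule can be illustrated with a minimal sketch. The three "trees" below are hand-written stand-ins with invented thresholds and feature names, not trained models; only the majority vote itself reflects the description above.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class label that receives the most votes from the trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical single-rule trees acting on an object's feature dictionary.
trees = [
    lambda x: "tree" if x["dsm_height"] > 3.0 else "grass",
    lambda x: "tree" if x["mean_green"] > 120 else "grass",
    lambda x: "tree" if x["glcm_entropy"] > 0.5 else "grass",
]

sample = {"dsm_height": 5.0, "mean_green": 100, "glcm_entropy": 0.9}
print(forest_predict(trees, sample))  # tree
```

Two of the three trees vote for "tree", so the forest assigns that label even though one tree disagrees. Using an odd number of trees avoids ties in a two-class toy case such as this one.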

Evaluation Metrics
The evaluation metrics of the classified images were generated through the frequently applied confusion matrix and its derivatives, including the OA, K, precision, recall, and F-measure. The error matrix (confusion matrix) evaluates the classification results versus the reference data in two dimensions as actual classes in rows and predicted classes in columns.

OA
The OA, which is a percentage indicator of the classification performance, can be defined as the ratio of the number of variables correctly classified into discrete classes (true positives plus true negatives) to the total number of tested variables. OA can be computed from the confusion matrix by dividing the total number of correctly classified objects/pixels (the sum of the major diagonal entries D_ii) by the total number of objects/pixels (N), as shown in Equation (7):

$$OA = \frac{\sum_{i=1}^{m} D_{ii}}{N} \qquad (7)$$

where m denotes the number of classes in the confusion matrix.
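A minimal sketch of this computation follows. The 3 × 3 confusion matrix is an invented example, with actual classes in rows and predicted classes in columns.

```python
def overall_accuracy(cm):
    """Sum of the diagonal divided by the total number of samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


# Hypothetical confusion matrix (rows: actual, columns: predicted).
cm_oa = [
    [50, 2, 3],
    [4, 40, 1],
    [1, 2, 47],
]
print(round(100 * overall_accuracy(cm_oa), 2))  # 91.33
```

Here 137 of the 150 test objects lie on the diagonal, which gives an OA of 91.33%.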

K Statistics
The K statistic is another statistical measure that defines the observed level of agreement or accuracy between a classified map and reference data. The K value approaches +1 as the contribution of chance agreement diminishes. A K value of 0 indicates no agreement beyond that expected by chance, i.e., the classification is equivalent to a random assignment, and a negative K value signifies that the agreement is worse than that expected by chance. The K statistic is computed using Equation (8):

$$K = \frac{N \sum_{i=1}^{m} D_{ii} - \sum_{i=1}^{m} R_i C_i}{N^2 - \sum_{i=1}^{m} R_i C_i} \qquad (8)$$

where m denotes the number of urban LULC classes in the confusion matrix, D_ii denotes the number of observations (objects/pixels) that are correctly classified (row i and column i), R_i denotes the total number of objects/pixels in row i, C_i denotes the total number of observations in column i, and N denotes the total number of objects/pixels.
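The K computation can be sketched from the row and column totals of a confusion matrix. The matrix below is an invented example, and the formula assumed is the standard one: (N·ΣD_ii - ΣR_i·C_i) / (N² - ΣR_i·C_i).

```python
def kappa(cm):
    """Cohen's kappa from a square confusion matrix."""
    m = len(cm)
    n = sum(sum(row) for row in cm)
    diagonal = sum(cm[i][i] for i in range(m))
    row_totals = [sum(cm[i]) for i in range(m)]
    col_totals = [sum(cm[i][j] for i in range(m)) for j in range(m)]
    chance = sum(r * c for r, c in zip(row_totals, col_totals))
    return (n * diagonal - chance) / (n * n - chance)


# Hypothetical confusion matrix (rows: actual, columns: predicted).
cm_k = [
    [50, 2, 3],
    [4, 40, 1],
    [1, 2, 47],
]
print(round(kappa(cm_k), 3))  # 0.87
```

For this matrix the OA is 0.913, whereas K is about 0.870. K is lower because it discounts the agreement that would be expected from the class totals alone.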

Precision, Recall, and F-measure
The F-measure is the harmonic mean of two ratios known as precision (p) and recall (r) and is a performance measure used to assess the class-specific accuracy from retrieved information [82,83]. It can be computed from p and r using Equation (9):

$$F = \frac{2\,p\,r}{p + r} \qquad (9)$$

The F-measure value ranges from 0 (lowest) to 1 (highest).
The p or the confidence of a LULC class is determined by dividing the number of true positives (TP, the number of objects/pixels correctly assigned to the actual class) by the total number of objects categorized as the positive class (i.e., the sum of true positives and false positives (FP), which are objects/pixels incorrectly categorized as belonging to the class). The r or the sensitivity shows the proportion of positive objects/pixels that are correctly predicted and identified and can be defined as the number of true positives divided by the total number of objects/pixels that are members of the positive class (i.e., the sum of true positives and false negatives (FN)). p and r can be calculated using Equations (10) and (11), respectively:

$$p = \frac{TP}{TP + FP} \qquad (10)$$

$$r = \frac{TP}{TP + FN} \qquad (11)$$

A perfect predictor has p and r values of 1.
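The class-specific measures can be derived directly from the confusion matrix. The sketch below assumes actual classes in rows and predicted classes in columns, as stated for the error matrix above; the matrix values are invented.

```python
def class_metrics(cm, i):
    """Return (precision, recall, F-measure) for class index i."""
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(len(cm))) - tp  # column total minus TP
    fn = sum(cm[i]) - tp                             # row total minus TP
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical confusion matrix (rows: actual, columns: predicted).
cm_prf = [
    [50, 2, 3],
    [4, 40, 1],
    [1, 2, 47],
]
p, r, f = class_metrics(cm_prf, 1)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.909 0.889 0.899
```

For the second class, 40 of the 44 objects predicted as that class are correct (p), and 40 of its 45 reference objects are recovered (r), so the F-measure falls between the two values.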

Results
This section summarizes the various outcomes of this study, including the MS image segmentation optimization and parameter selection, FS, and classification results.

Results of MS Image Segmentation
In this study, the quantitative evaluation of image segmentation results at MS levels through unsupervised segmentation quality measures aims to determine the optimal SP that allows excellent delineation and extraction of urban LULC classes that may share a similar spectral response with each other and vary in structure, size, and their surrounding contrast. The oversegmentation (WV) and undersegmentation (MI) metrics were computed from the three RGB channels, and their mean values were normalized and used to compute the F-measure (for selecting the three optimum SPs), as shown in Table 3. Three values of the scene-independent factor (ϕ), namely, 3, 1, and 0.33, were selected to pinpoint the three SPs from the computation of Equation (1). These values were empirically selected and supported by the study of Johnson et al. [62] to ensure that the adopted segmentation levels vary remarkably from each other in terms of the within-object homogeneity and between-object heterogeneity. The highest values in the last three columns of Table 3 correspond to the optimal MS levels, and these scales are 200, 100, and 50. Figure 3a,b depict the image segmentation results of a small subset at the scale of 200, where large homogeneous objects, such as water bodies, grass, bare soil, and some clay tiles, are well delineated. Figure 3c,d show the image segmentation results of a small subset at the scale of 100, where medium objects, such as some types of roofing materials, are well identified. Figure 3e,f display the image segmentation results of a small subset at the scale of 50, where large and medium objects are oversegmented but small roofing materials and trees are well distinguished.

Figure 3. Image segmentation results of a small subset: (a,b) scale 200; (c,d) scale 100; (e,f) scale 50.

Results of FS
Following the optimization of segmentation SPs, several spectral, geometrical, and textural features were computed at MS levels for FS, as shown in Table 2. Two wrapper approaches, namely, CFS and SVM, combined with the KNN algorithm, were used to assess all features as a part of classification. Table 4 compares the OA, K, and the relevant features selected by SVM and CFS at scales of 50, 100, and 200. The results of CFS and SVM exhibited significant differences in terms of the number and type of selected features at each scale. However, the two methods eliminated 60% of the total number of features, whereas less than 40% of the features contributed to achieving high accuracy. CFS attained a slight improvement in terms of the OA and number of selected features, as presented in Table 4, and was selected for subsequent processing.

Classification Results
The detailed mapping of impervious surfaces in a heterogeneous urban area from UAV-based images is particularly challenging when only three spectral channels, RGB, are used because of the spectral similarity of various urban LULC classes. In such a case, a successful extraction of urban objects should consider the information of the variation in size and the surroundings of the different types of LULC that exist in the image. For instance, asbestos cement and dark concrete roofs or cemented pavements may share similar spectral responses because of the presence of cement in their contents. To minimize the confusion between different LULC classes, the information of the suitable scale(s) that provides the best accuracy and ensures the strong differentiation between classes is necessary to obtain a holistic view and to perform hierarchical classification.
The initial stage of classification in this study is to find the optimum level for extracting each class, which can be achieved using ML models, followed by a class-specific accuracy measure. Three standard classification algorithms, namely, RF, SVM, and DT, were used to classify the first study area at the selected optimal scales (SP 200, SP 100, and SP 50). The accuracy of each classification level was evaluated on the basis of OA, K, and F-measure. Table 5 shows the SS classification results for the first study area. The highest SS classification results were obtained by SS-RF at scale 50, with an OA of 92.2% and a K of 0.914, followed by SS-SVM at scale 100 with an OA of 90.5% and a K of 0.896 and SS-DT at scale 50 with an OA of 88.1% and a K of 0.87.

A comparison of the class-specific accuracy measures of the SS-RF, SS-SVM, and SS-DT classification results shows that the optimum scale for extracting the LULC classes in heterogeneous urban areas can vary with the adopted classification algorithm. For instance, the SS-RF classification results showed that SP 50 exhibited the highest accuracy for the extraction of water bodies, trees, grass, dark concrete, type 2 clay tiles, and type 2 metallic roofs, whereas SP 200 showed enhanced extraction of bare soil, asphalt, type 1 metallic roofs, concrete, type 1 clay tiles, and asbestos cement roofs. The classification results of SS-SVM showed better extraction for clay tiles (types 1 and 2) at the largest optimized SP, whereas the smallest optimized SP was optimal for extracting water bodies, bare soil, trees, and metallic roofs (types 1 and 2). This step was adopted prior to the hierarchical classification approach to provide a diagnostic result indicating which SP is suitable for extracting each of the 12 urban LULC classes and to ensure reasonable discrimination between the classes.

Table 4. Results of correlation-based feature selection (CFS) and support vector machine (SVM) at MS levels.
Utilizing the preliminary information acquired from the SS classification of the RF, SVM, and DT algorithms, the hierarchical classification was conducted for the first study area. The results are shown in Figure 5. Table 6 illustrates the OA, K, and class-specific accuracies of the first study area using the hierarchical RF, SVM, and DT classification algorithms. Similar to SS classification, the MS-RF classification was superior with an OA of 94.40% and a K of 0.938, followed by MS-SVM with an OA of 92.50% and a K of 0.917 and MS-DT with an OA of 91.60% and a K of 0.908.
Table 6. Class-specific accuracy measures for the MS RF, SVM, and DT of the first study area.
Compared with SS classification, the hierarchical classification noticeably improved the extraction of urban LULC classes. For instance, an improvement of 2.24% in the OA was observed for the MS-RF algorithm, along with a significant improvement in the differentiation and extraction of asbestos cement, concrete, and asphalt roofs. Similarly, the MS-SVM classification exhibited an enhancement in the OA, K, and class-specific accuracies of the tree, grass, and asphalt classes. The OA of MS-DT improved by 3.57%, with an overall improvement in the extraction of trees, grass, and asbestos cement roofs.
To validate the transferability of the hierarchical classification approach, the MS-RF, MS-SVM, and MS-DT classifications were applied in the second study area using the sample statistic file derived from the image of the first study area. Figure 6 and Table 7 show the classification results for the second study area. The results of the second dataset showed that the MS-SVM classification was superior with an OA of 94.45% and a K of 0.938, followed by MS-RF with an OA of 92.46% and a K of 0.916 and MS-DT with an OA of 90.46% and a K of 0.893.
The proposed hierarchical classification approach demonstrates excellent potential for the detailed mapping of heterogeneous urban areas from RGB-UAV images and DSM. The proposed methodology can be adopted for various areas with different LULCs.

Discussion
Considering that segmenting UHSR UAV-based images of a heterogeneous and complex urban landscape is a challenging task in GEOBIA, the selection of the optimum SP(s) is an imperative step to ensure that different landscapes are well delineated at different scales. This study conducted a detailed mapping of a heterogeneous urban area, that is, an area covered with various natural and impervious surfaces that vary in size and structure, from the fusion of the UAV-based orthophoto and DSM by improving the GEOBIA framework with solutions to some of the issues stated in the related studies section. An adaptive MS segmentation that combines an unsupervised image segmentation evaluation metric (i.e., F-measure) with ML algorithms was proposed to identify the optimal MS parameters for extracting the detailed urban LULC classes.
Although GEOBIA can compute and use a wide range of features in the classification process, adding many features can reduce the classification accuracy and increase the computational time. CFS and SVM were used in this study to select the most significant features computed for each of the three optimized scale levels. The spectral, geometrical, and textural feature values of an object differ across levels because the size of the generated image objects (e.g., roofing materials and roads) varies with the selected scale level. CFS obtained a maximum OA of 93.78% (K = 0.93) at the scale level of 200 by selecting 27 significant features, whereas SVM obtained a minimum accuracy at the scale level of 50, with an OA of 91.61% (K = 0.91), by selecting 21 features. CFS and SVM selected sets of features that vary in number and type at each segmentation level. However, various features, such as R, B, DSM, Ratio-G, Ratio-B, Vegetation, the normalized difference between the red and green channels (NDRG), the standard deviation of image objects derived from the DSM (SD-DSM), and the normalized difference between the blue and red channels (NDBR), were commonly selected at all levels. The selection and incorporation of DSM-derived features, along with other selected features, remarkably contributed to the differentiation of spectrally similar classes, such as asbestos cement roofs, dark concrete roofs, old pavements, and asphalt. In a complex landscape without height information, an ML model might erroneously categorize bare soil as a roofing material, or vice versa, because of their similar spectral and textural characteristics. Al-Najjar et al. [7] utilized the fusion of DSM and optical images to generate generic automatic LULC classes for a complex urban area.
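The idea behind CFS, favoring features that correlate with the class while penalizing features that are redundant with each other, can be sketched with the commonly cited CFS merit, k * r_cf / sqrt(k + k(k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The sketch below is a simplified illustration and not the implementation used in this study: it assumes Pearson correlation with a numeric target and a greedy forward search, and the feature names `x1`, `x1_dup`, and `x2` are hypothetical toy data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 for a constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(features: Dict[str, List[float]], target: List[float],
              subset: List[str]) -> float:
    """Merit of a feature subset: relevance discounted by redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(features: Dict[str, List[float]],
               target: List[float]) -> Tuple[List[str], float]:
    """Forward selection: add the feature that most increases the merit."""
    selected: List[str] = []
    best = 0.0
    remaining = list(features)
    while remaining:
        pick, pick_merit = None, best
        for f in remaining:
            m = cfs_merit(features, target, selected + [f])
            if m > pick_merit + 1e-12:  # require a strict improvement
                pick, pick_merit = f, m
        if pick is None:
            break
        selected.append(pick)
        remaining.remove(pick)
        best = pick_merit
    return selected, best


# Toy data: x1 and x2 are complementary, x1_dup is an exact copy of x1.
toy_features = {
    "x1": [0.0, 0.0, 1.0, 1.0],
    "x1_dup": [0.0, 0.0, 1.0, 1.0],
    "x2": [0.0, 1.0, 0.0, 1.0],
}
toy_target = [0.0, 1.0, 1.0, 2.0]
selected, merit = greedy_cfs(toy_features, toy_target)
```

In this toy case the duplicated feature adds no merit and is left out, which mirrors the motivation given above for limiting the number of features per scale level.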
As stated in Section 3, the SS RF, SVM, and DT classification models were initially examined in the first study area at the optimal scales identified through the F-measure, using the significant features selected by CFS at each optimal scale level. The SS classification results varied from one level to another when each SS model was applied, and the OA ranged from 79% to 92.2%. The comparison of SS classification maps showed that clear misclassifications were produced by the DT algorithm, especially for clay tiles, asphalt, grass, and trees.
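The diagnostic use of the SS results, that is, identifying the scale at which each class is best extracted before the hierarchical classification, can be sketched as a simple lookup over class-specific accuracies. This is a minimal illustration under stated assumptions: the function name `best_scale_per_class`, the tie rule (the smaller scale is kept), and all accuracy values are hypothetical and are not the values reported in Table 5.

```python
from typing import Dict


def best_scale_per_class(scores: Dict[int, Dict[str, float]]) -> Dict[str, int]:
    """Return, for each class, the scale with the highest class-specific accuracy.

    scores[scale][cls] holds the accuracy (e.g., F-measure) of class `cls`
    when the image is segmented and classified at `scale`.
    Ties are resolved in favor of the smaller scale.
    """
    best: Dict[str, int] = {}
    for scale in sorted(scores):
        for cls, value in scores[scale].items():
            if cls not in best or value > scores[best[cls]][cls]:
                best[cls] = scale
    return best


# Hypothetical class-specific accuracies for one classifier at three scales.
toy_scores = {
    50: {"water": 0.95, "trees": 0.93, "asphalt": 0.80},
    100: {"water": 0.90, "trees": 0.91, "asphalt": 0.84},
    200: {"water": 0.85, "trees": 0.88, "asphalt": 0.89},
}
best = best_scale_per_class(toy_scores)
```

Running the same lookup separately for each classifier would reflect the observation that the optimal scale of a class can differ between RF, SVM, and DT.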
Following the FS and SS classification, iterative adaptive MS classification models were developed in the first study area and were tested in the second study area. The adopted hierarchical classification scheme of the first study area using the RF algorithm showed outstanding performance (OA = 94.4%) in the extraction of urban LULC classes compared with SVM (OA = 92.5%) and DT (OA = 91.6%). However, slight confusion was observed between different classes, as shown in Figure 7. For instance, MS-SVM showed minor confusion between type 1 metallic roofs, asphalt, bare soil, and grass, as demonstrated in Figure 7b,k. Similarly, MS-DT showed considerable confusion among asphalt, grass, and type 1 metallic roofs, as shown in Figure 7c. In addition, remarkable confusion was found between some asbestos cement roofs, dark concrete roofs, asphalt, type 2 clay tiles, and bare soil classes in some areas when MS-DT was used, as shown in Figure 7f,i,l. MS-RF showed outstanding performance but misclassified some asphalt objects mixed with shadows as water bodies, as depicted in Figure 7d,j. The comparison of SS and MS approaches showed that the accuracy of some classes (i.e., trees, type 1 clay tiles, and asbestos classes) clearly improved with the use of the proposed approach.
The applicability of the adopted scheme in the second study area indicated that MS-RF and MS-SVM exhibited relatively similar classification results. However, the MS-SVM algorithm (OA = 94.45% and K = 0.938) was superior to MS-RF (OA = 92.46% and K = 0.916), with a slight improvement in the OA and K values. All MS algorithms showed some degree of confusion between some objects and the grass and water body classes, as shown in Figure 8j-l. This confusion may be attributed to the existence of new water objects with different spectral characteristics, given that the second study area was classified on the basis of the sample statistics derived from the first study area.
As represented in Figure 8j, the water body was poorly classified using MS-RF and showed confusion with the grass class. In this scenario, the water body in the second study area was a pond with extremely different reflectance from the training samples obtained in the first study area. The confusion was more obvious in the MS-RF classified map than in those of the other algorithms because of the sensitivity of the RF algorithm to the training data. RF is sensitive to the size of training samples and the selection of an accurate representative of each class for classification [84].
Moreover, utilizing MS-DT resulted in misclassification between the tree and dark concrete classes, and between the grass and tree classes (Figure 8c). MS-DT also demonstrated minor confusion among bare soil, type 2 clay tiles, dark concrete, and asphalt, as shown in Figure 8c,f,i,l. As shown in Figure 8d,e, most of the roof types were categorized in an extremely similar manner by RF and SVM. MS-SVM showed minor confusion in some areas between asphalt and dark concrete, as shown in Figure 8h, whereas MS-RF showed a relatively better differentiation between asphalt and dark concrete for the same area, as shown in Figure 8g.

Conclusions
Accurate and up-to-date urban LULC information is crucial for urban planning, management, and environmental applications. UAVs allow the acquisition of remotely sensed data with UHSR, as high as 1 cm, in a flexible and inexpensive manner, significantly contributing to the initiation of a wide spectrum of applications. This study aimed to achieve an accurate and detailed urban LULC classification in a heterogeneous landscape using GEOBIA and ML models from UHSR drone-based images. Given the high-level details of UAV images and the limited amount of spectral information, an MS GEOBIA approach that integrates MS image segmentation evaluation, MS FS, and hierarchical ML classification algorithms was used to generate detailed LULC urban maps from the fusion of orthophotos and DSMs. Two UAV-based images were used to implement and evaluate the efficiency of the proposed method. Three commonly used supervised ML models, namely, RF, SVM, and DT, were compared within the MS/hierarchical segmentation and classification approach. In the first study area, the MS-RF classification achieved the highest accuracy, with an OA of 94.40% and a K of 0.938, followed by MS-SVM with an OA of 92.50% and a K of 0.917 and MS-DT with an OA of 91.60% and a K of 0.908. The application of the proposed approach to the dataset of the second study area showed excellent performance when MS-SVM and MS-RF were used. The proposed framework exhibited enormous potential for the detailed mapping of heterogeneous urban areas from UHSR RGB and DSM images. The results obtained from this approach can serve as vital information and input for scientists, decision makers, and city planners.