1. Introduction
The world's urban population has grown significantly since the 1950s. In 1980, around 40% of the total world population lived in cities; this share increased to 50% in 2010. In the near future, the urban population is expected to grow by approximately 1.63% per year between 2020 and 2025 and by 1.44% per year between 2025 and 2030. Moreover, by 2050, the urban population is expected to almost double, increasing from approximately 3.4 billion (American billion = 10⁹) in 2009 to 6.4 billion [
1]. This tangible demographic transition has environmental impacts, such as the loss of agricultural land, the reduction in wildlife habitats, air pollution, and worsening water quality and accessibility, leading to a deterioration of regional hydrology. Furthermore, urbanization has the significant side effect of exacerbating the already uneven allocation of resources in urban areas. Resource disparity is a well-known driver of numerous societal issues, including a scarcity of affordable housing, poverty, constrained social services, and increased crime rates [
2,
3]. To bridge the gap between the available resources and the needs of an urban population, a thorough urban land-use classification becomes an urgent necessity for further urban assessment and for future city planning and management [
1].
High-resolution optical images provide a high level of detail along object boundaries exhibiting height variations. Therefore, they are a good source of high horizontal-accuracy data and a rich inventory of the spectral and textural information essential for segmentation and classification [
4]. However, these conventional sensors lack vertical accuracy due to their inability to fully represent spatial patterns in 3D [
5]. In some cases, they acquire information from a relatively restricted number of bands, each with large spectral intervals. Consequently, they are not suitable for detailed analysis of classes sharing similar characteristics in contrast to the new generation of passive hyperspectral sensors [
6]. On the other hand, LiDAR (Light Detection and Ranging) systems acquire high-density 3D point clouds describing the shape of the Earth and its surface characteristics. The penetration capability of the laser pulses reveals certain ground objects beneath cover, since a single pulse can yield multiple returns. Hence, LiDAR systems can effectively cover large regions and can collect data in steep terrain with no shadow effects or occlusion if sufficient swath overlap is achieved. Moreover, all obtained data are directly geo-referenced [
7,
8,
9]. Nevertheless, since the discontinuous laser pulses rarely hit breaklines, LiDAR technologies perform poorly in revealing surfaces with sudden elevation changes. Thus, they produce insufficient 3D point clouds at these positions, weakening the horizontal accuracy at such locations [
10].
Given the complementary properties of LiDAR and optical imaging, integrating the two modalities in the information extraction procedure becomes attractive as a means to amplify their strengths and to overcome their respective weaknesses. Such a fusion provides a practical alternative to the more expensive multispectral/hyperspectral LiDAR systems. The case for integrating these two modalities is especially compelling in applications that require high-density acquisitions for large regions [
11,
12]. In particular, the high cost associated with flights and radiometric corrections makes integrating LiDAR and optical data an excellent candidate when dealing with data obtained from airborne platforms [
11].
This integration opens the door to mapping LULCs (Land Use Land Cover) beyond urban settings. For example, LULC maps can be used in agricultural and forestry morphologies to observe crop biomass and to estimate production and environmental states. Such applications have recently turned to acquiring aerial images and LiDAR point clouds using sensors mounted on UAVs (Unmanned Aerial Vehicle); therefore, combining both data types reinforces the textural analysis of agricultural fields [
13]. In addition, infrastructural inspection and structural health-monitoring applications depend on information extracted from thermal images and LiDAR scans collected by UASs (Unmanned Aircraft System) for precise close-range inspection and trajectory control [
14]. Further applications include retrofitting tasks that align 3D laser scans with camera videos for complex industrial modeling, such as in pipeline plants [
15]; aerial cinematography that utilizes dynamic camera viewpoints for entertainment, sports, and security applications [
16]; and synchronized human–robot interaction applications such as the prediction of pedestrian attitude on roads for potential urban automated driving [
17].
Classification is the process of partitioning data into a finite number of individual categories based on the attributes stored in each data element. The purpose of classification is to identify homogeneous groups representing various urban classes of interest [
18]. The classification results are fundamental to further analysis in many environmental and socioeconomic applications, such as change detection and simulation/prediction models. Therefore, a classification system should be informative, exhaustive, and separable, with classes that are homogeneous within the range of variability of each respective class. The classification process itself requires consideration of different factors, including the needs of the end-user, the scale of the area studied, the complexity of the landscape within it, and the nature of the data being processed [
19,
20]. Machine learning is an AI (Artificial Intelligence) application that enables computers to learn automatically with minimal human assistance. Machine learning algorithms identify patterns in observations to make predictions or to improve existing predictions when trained on similar data [
21]. Machine learning approaches can be grouped into different categories: supervised, unsupervised, semi-supervised, and reinforcement based on the learning type; parametric versus nonparametric based on the assumption of the data distribution; hard versus soft/fuzzy based on the number of outputs for each spatial unit; and finally, pixel-based, object-based, and point-based classifications based on the nature of data at hand [
19,
21].
In supervised learning, analysts must select representative samples for each category for the classifier to determine the classes for each element of the entire dataset [
22]. In contrast, unsupervised classification does not require any prior class knowledge of the study region. Instead, concentrations of data with similar features (clusters) are detected within a possibly heterogeneous dataset. Unsupervised learning can be advantageous over supervised learning in cases where prior class knowledge is incomplete. However, the resulting clusters do not necessarily align with the classification criteria prescribed by the problem being investigated [
23,
24]. Semi-supervised learning falls in between supervised and unsupervised learning since it uses both labeled and unlabeled training data. It has become a hot topic in classifying remotely sensed data that contain large sets of unlabeled examples with very few labeled ones. Its algorithms are extensions of other approaches that make assumptions about modeling unlabeled data [
21,
25]. Reinforcement learning follows a trial-and-error search for the ideal behavior to boost its performance. The best action is detected by simple reward feedback known as the reinforcement signal [
21]. Parametric approaches assume the existence of an underlying data probability distribution (usually the normal distribution). The parameters of this distribution (i.e., the mean vector and covariance matrix) have to be estimated from the training samples to trigger the classification process. Nonparametric algorithms make no assumptions about the form of the data probability distribution. They have been proven to perform efficiently with various class distributions as long as the class signatures are adequately distinct. Moreover, nonparametric methods allow ancillary data to be incorporated into the classification procedure. In contrast, parametric algorithms are more susceptible to noise when dealing with complex landscapes and often present challenges in incorporating non-remote-sensing data into the classification [
22,
24,
26].
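As a minimal illustration of the parametric case, the sketch below estimates each class signature (mean vector and covariance matrix) from training samples and assigns a new sample to the class with the highest Gaussian density. The function names and array shapes are our own illustrative choices, not an implementation from the cited works.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_signatures(X, y):
    """Estimate a (mean vector, covariance matrix) signature per class."""
    signatures = {}
    for label in np.unique(y):
        samples = X[y == label]                # training samples of one class
        mu = samples.mean(axis=0)              # mean vector
        sigma = np.cov(samples, rowvar=False)  # variance-covariance matrix
        signatures[label] = (mu, sigma)
    return signatures

def classify(x, signatures):
    """Assign x to the class whose Gaussian density is highest (equal priors)."""
    return max(signatures, key=lambda c: multivariate_normal.pdf(
        x, mean=signatures[c][0], cov=signatures[c][1]))
```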
Each data element in hard thematic maps is assigned to a single class. Hard classifiers can therefore cause misclassification errors, especially when a coarse-resolution data element contains multiple classes. In contrast, soft classifiers determine the degree of similarity of the data to all classes. Consequently, each data element is assigned to every class with a varying probability, resulting in a more informative and potentially more accurate classification. However, soft classification lacks spatial information about the location of each class within each data element [
26,
27]. In pixel-based classification, the classifier gathers spectra of every pixel used in the training samples to develop a signature of each class [
26] while simultaneously assigning a single predefined category to each pixel based on the brightness value the pixel stores [
23]. Meanwhile, object-based approaches generate the fundamental units of image classification through object segmentation at various scale levels based on specific homogeneity criteria, instead of operating on individual raster elements, i.e., single pixels [
23,
28]. Pixel-based techniques best fit coarse data, while object-based approaches perform well on data acquired at fine spatial resolutions. Point-based classifiers are commonly used to process LiDAR point clouds. They are similar to pixel-based techniques in that they deal with each single data element—a single point—individually, assigning a particular predefined class to each point [
23].
Brownlee [
25] added another grouping of machine learning algorithms based on similarity and functionality. A summary of this taxonomy follows:
Regression algorithms model the relationship between variables and refine it iteratively by adjusting prediction errors (e.g., linear and logistic regression).
Instance-based algorithms build a database of observations, to which new data are compared using a similarity measure to find the best match for prediction (e.g., KNN (K-Nearest Neighbors) and SVM (Support Vector Machine)).
Regularization algorithms extend regression methods to prioritize simpler models over complex ones for better generalization (e.g., ridge and least-angle regression).
Decision tree algorithms build a model of decisions based on the observations’ attribute values. These decisions branch until a prediction is made for a given record (e.g., decision stump and conditional decision trees).
Bayesian algorithms apply Bayes’ Theorem (e.g., naive Bayes and Gaussian naive Bayes).
Clustering algorithms exploit the inherent structures in the data to produce groups of maximum commonality (e.g., k-means and k-medians).
Association rule learning algorithms derive rules that best explain the relationships between the variables of observations. These rules are generic enough to be applied to large multidimensional datasets (e.g., apriori and eclat).
Artificial neural network algorithms are pattern-matching methods inspired by biological neural networks (e.g., perceptron and MLP (Multilayer Perceptron Neural Network)).
Deep learning algorithms are deeper extensions of artificial neural networks (e.g., convolutional and recurrent neural networks).
Dimensionality reduction algorithms reduce data dimensionality in an unsupervised manner for better visualization of the feature space. The abstracted dimensions can be input into a subsequent supervised algorithm (e.g., PCA (Principal Component Analysis) and PCR (Principal Component Regression)).
Ensemble algorithms combine multiple weaker learners, trained independently or sequentially, whose predictions are aggregated toward the final prediction of a record (e.g., Bagging, AdaBoost, and RF (Random Forest)).
Most initial studies that addressed LiDAR data classification tended to convert point clouds into 2D images [
29] by applying interpolation techniques, which vary in complexity, ease of use, and computational expense; thus, no interpolation method is universally superior [
30]. Interpolation methods are either deterministic (e.g., IDW (Inverse Distance Weighted) and spline) or probabilistic (e.g., Kriging). Deterministic algorithms calculate the value of an unsampled point from the known values of its immediately surrounding sampled points. A predetermined mathematical formula assigns the weight with which each surrounding point contributes to the interpolated result, and the calculations yield a continuous spatial surface. Probabilistic algorithms utilize the spatial autocorrelation properties (e.g., distance and direction) of the surrounding values to determine their importance [
30,
31]. Cell size determination is the primary concern in generating LiDAR images via interpolation techniques. The chosen resolution has to capture sufficient detail while using the least amount of point data that meets the minimum accuracy assigned to the phenomenon being studied. Overly fine resolutions are overrepresentative and computationally inefficient. Hence, an optimal grid size should consider the data source density, the terrain complexity, and the nature of the application [
31]. For instance, DEM accuracy is affected by the spatial resolution as well as by the interpolation algorithm and its error, all of which have to be taken into consideration, since low DEM accuracy in turn degrades the accuracy of derived terrain attributes, such as slope [
30,
32].
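For concreteness, a minimal IDW sketch is given below; the power parameter and neighbor count are illustrative defaults, not values prescribed by the cited studies.

```python
import numpy as np

def idw_interpolate(xy_known, z_known, xy_query, power=2.0, k=8):
    """Inverse Distance Weighted value at one unsampled point.

    xy_known: (n, 2) coordinates of the sampled points
    z_known:  (n,)   values at the sampled points
    xy_query: (2,)   coordinate of the unsampled point
    """
    d = np.linalg.norm(xy_known - xy_query, axis=1)
    if np.any(d == 0):                  # query coincides with a sample
        return z_known[np.argmin(d)]
    nearest = np.argsort(d)[:k]         # immediate surrounding sampled points
    w = 1.0 / d[nearest] ** power       # predetermined weighting formula
    return np.sum(w * z_known[nearest]) / np.sum(w)
```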
The LiDAR capability of recording different point attributes (e.g., intensity, elevation, range, and number of returns) has encouraged analysts to resample point clouds into 2D grids at a predefined resolution. In doing so, analysts gain computational performance at the expense of sacrificing detail by representing the data in 2D instead of 3D [
23,
29]. These missing details can be vital in some applications, especially those oriented toward objects that are not easily separable from other features. For example, in evaluating the solar potential of roof planes, downsampling and interpolation adversely affect the estimation of their aspect, inclination angles, and areas. These defects significantly impact the assessment results [
33].
Figure 1 graphically illustrates how certain information is lost in resampling irregularly spaced classified points to a grid in order to produce a LULC map [
23]. The authors identified the extent of the study area (
Figure 1a) and constructed a grid, where each cell was assigned the same size as the laser beam footprint (
Figure 1b). Then, the resampled point of each cell had the most frequent class of classified points within the cell itself (
Figure 1c). Empty cells were assigned an unclassified grid point and were treated as gaps to be filled by NN (Nearest Neighbor) and IMMW (Iterative Majority Moving Window). In NN, unclassified resampled points were assigned the class of the nearest original classified LiDAR point. In IMMW, a moving window was centered at each unclassified resampled point, and the points were iteratively assigned the most frequent class of their neighbors until all gaps were filled. The study indicated no difference between the two filling approaches, as the overall classification accuracies using ML (Maximum Likelihood) did not exceed 58% using the points’ elevation and intensity. In another study, the same classification workflow was examined against the pixel-based classification of LiDAR images of two different datasets [34]. It yielded overall map accuracies of 44% and 50% with the pixel-based classification and of 53% and 65% with the point-based approach, respectively, for the two datasets. The higher map accuracy achieved with the classification of raw LiDAR points is attributable to the classification being carried out on the original data values.
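A simplified sketch of the majority resampling and NN gap filling described above might look as follows; the grid indexing and tie handling are our own simplifications of the published workflow.

```python
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def resample_to_grid(points_xy, classes, cell_size, extent):
    """Give each cell the most frequent class of the points falling in it."""
    xmin, ymin, xmax, ymax = extent
    nrows = int(np.ceil((ymax - ymin) / cell_size))
    ncols = int(np.ceil((xmax - xmin) / cell_size))
    grid = np.full((nrows, ncols), -1, dtype=int)  # -1 marks unclassified cells
    cell_points = {}
    for (x, y), c in zip(points_xy, classes):
        r = min(int((y - ymin) // cell_size), nrows - 1)
        col = min(int((x - xmin) // cell_size), ncols - 1)
        cell_points.setdefault((r, col), []).append(c)
    for (r, col), cs in cell_points.items():
        grid[r, col] = Counter(cs).most_common(1)[0][0]  # majority class
    return grid

def fill_gaps_nn(grid, points_xy, classes, cell_size, extent):
    """NN filling: give each empty cell the class of the nearest LiDAR point."""
    xmin, ymin = extent[0], extent[1]
    tree = cKDTree(points_xy)
    for r, col in zip(*np.where(grid == -1)):
        center = (xmin + (col + 0.5) * cell_size, ymin + (r + 0.5) * cell_size)
        _, i = tree.query(center)
        grid[r, col] = classes[i]
    return grid
```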
Lodha et al. [
35] resampled 3D LiDAR data into a regular 2D grid, which they geo-registered to a grey-scale aerial image of the same grid size. After fusing the height and intensity data, they generated a 5D feature space of height, height variation, normal variation, LiDAR intensity, and image intensity. The feature space was subsequently used to classify an urban layout into roads, grass, buildings, and trees, using the AdaBoost algorithm. Since AdaBoost was originally a binary classifier, the study applied the AdaBoost.M2 extension for multiclass categorization and achieved an accuracy of around 92%. Gao and Li [
36] applied the AdaBoost algorithm but followed a direct point-based classification on 3D LiDAR clouds, with only height and intensity included in the feature space. When tested on different LiDAR datasets, the highest accuracy of this scenario was close to what was achieved in [
35].
Wang et al. [
37] also rasterized airborne LiDAR point clouds into grids based on height information; however, they constructed the feature space differently. For each pixel, they randomly selected two neighboring pixels from a moving window of predefined size and calculated the height difference between them. To enlarge the dimension of the feature space, they repeated the random selection, comparing a different pair with each new selection in addition to the height value. Therefore, for each pixel, the total number of random selections of neighboring pixels equaled the number of features used in the classification. The study applied RF to classify the urban scene into ground, buildings, cars, and trees. It achieved an overall map accuracy of 87% after postprocessing using majority analysis and identified major problems in distinguishing cars and trees from the ground.
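Our reading of this feature construction can be sketched as follows; the window size, feature count, and edge handling (wrap-around via np.roll) are assumptions for illustration only.

```python
import numpy as np

def random_height_diff_features(height_grid, n_features=20, window=5, seed=0):
    """For each pixel, record height differences between randomly selected
    pairs of neighbors inside a moving window, one difference per feature."""
    rng = np.random.default_rng(seed)
    rows, cols = height_grid.shape
    half = window // 2
    feats = np.zeros((rows, cols, n_features))
    for k in range(n_features):
        # random offsets of two neighboring pixels within the window
        dr1, dc1, dr2, dc2 = rng.integers(-half, half + 1, size=4)
        shifted1 = np.roll(height_grid, (dr1, dc1), axis=(0, 1))
        shifted2 = np.roll(height_grid, (dr2, dc2), axis=(0, 1))
        feats[..., k] = shifted1 - shifted2   # height-difference feature
    return feats.reshape(rows * cols, n_features)
```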
In an attempt to enrich the feature space for better point-based classification results, Zaboli et al. [
38] considered eigenvalue, geometric, and spatial descriptors along with intensity records in the discrete point classification of mobile terrestrial LiDAR data. These contextual features were computed for each point within a neighborhood to explore further information about spatial dependency. The study compared the potential of the GNB (Gaussian Naive Bayes), KNN, SVM, MLP, and RF classifiers for classifying an urban scene containing cars, trees, poles, powerlines, asphalt roads, sidewalks, buildings, and billboards. The maximum overall accuracy was 90%, achieved by RF, in contrast to 53% achieved by GNB. Niemeyer et al. [39] also used waveform and geometry descriptors but with a hierarchical classifier in the point-based classification of airborne LiDAR data covering roads, soil, low vegetation, trees, buildings, and cars. The study implemented a two-layer CRF (Conditional Random Field), where a point-based classification runs in the first layer and a segment-based classification is performed in the second. RF was applied in both classifications. Each segment’s probability of belonging to a certain class was generated in the second layer and delivered to the first layer at each iteration via a cost term. In this way, regional context information between segments was transferred to the point layer, where misclassifications were reduced. Although this approach increased the overall accuracy by 2% on the point-based level, it still hovered around 80%.
Researchers have recently investigated multispectral LiDAR data as a promising option towards widening the radiometric feature space. Huo et al. [
40] constructed their feature space by converting the three intensities of Optech Titan LiDAR point clouds into images. Furthermore, they densified the space with some auxiliary features: NDSM (Normalized Digital Surface Model), pseudo NDVI (Normalized Difference Vegetation Index), MP (Morphological Profile), and HMP (Hierarchical Morphological Profile). The study applied SVM to classify buildings, grass, roads, and trees. The authors tried different combinations of feature sets and concluded that using only the intensity features resulted in the worst overall map accuracy, i.e., 75%, while adding the remaining features was the best scenario, achieving a 93% accuracy.
On the other hand, Morsy et al. [
41] processed 3D point clouds acquired by a multispectral LiDAR sensor, covering an urban scene consisting of four classes: buildings, trees, roads, and grass. The data were ground filtered. The Jenks natural breaks optimization method then clustered the points into the four predefined classes based on three NDVIs calculated for each point from its intensities in the three channels: NIR (Near Infrared), MIR (Mid Infrared), and Green. The clustering algorithm determined NDVI thresholds to segregate buildings from trees and roads from grass, relying on the fact that vegetation has higher reflectance in the abovementioned channels than built-up elements. A subset of the LiDAR point clouds from each class, selected by a set of polygons as reference points, was used together with a geo-referenced aerial image to assess the classification accuracy. The results showed that the NDVI produced from the NIR and Green channels had the highest overall accuracy, i.e., 92.7%. Alternatively, the authors adopted the Gaussian decomposition clustering approach for the determination of NDVI thresholds [42]. They used a peak detection algorithm to find the two highest peaks of the NDVI histograms. These peaks were taken as the means of the two Gaussian components of each histogram, as the target was to divide each histogram into two clusters: either buildings and trees or roads and grass. Each component’s mean, standard deviation, and relative weight were fitted in two steps: expectation and maximization. The first determined the probability of each bin belonging to either Gaussian component, starting from the assumption that both components have the same weight, and the latter computed the ML estimates of the components’ statistics. Both steps were repeated until either convergence or a maximum number of trials was reached. The intersection of the two fitted components was taken as the threshold. The algorithm resulted in an overall accuracy of 95.1%, with the same NDVI as in the Jenks natural breaks optimization method. Despite the higher accuracy achieved without auxiliary data (i.e., NDSM), the study reported overfitting problems, which forced the authors to apply different modeling functions, sacrificing a precise definition of the threshold value and consequently affecting the accuracy of some classes.
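The thresholding idea can be sketched with an off-the-shelf EM implementation; unlike the cited work, the snippet below initializes the mixture without histogram peak detection and scans for the intersection numerically, so it is an approximation of the published procedure rather than a reproduction of it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ndvi(nir, other_band):
    """Normalized difference index of two intensity channels (e.g., NIR/Green)."""
    return (nir - other_band) / (nir + other_band + 1e-9)

def gaussian_decomposition_threshold(ndvi_values):
    """Fit a two-component 1D Gaussian mixture via EM and return the point
    between the two means where the weighted components intersect."""
    gm = GaussianMixture(n_components=2).fit(ndvi_values.reshape(-1, 1))
    m1, m2 = sorted(gm.means_.ravel())
    xs = np.linspace(m1, m2, 1000).reshape(-1, 1)
    resp = gm.predict_proba(xs)               # posterior of each component
    crossing = np.argmin(np.abs(resp[:, 0] - resp[:, 1]))
    return float(xs[crossing, 0])             # class-separation threshold
```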
Imagery data represent a rich source of radiometric characteristics beyond multispectral LiDAR sensors. Several studies have introduced aerial and high-resolution satellite images as an ancillary data source in LiDAR point cloud classification; imagery has even served as a substantial source of 3D point clouds in some studies. Baker et al. [
43] processed aerial photogrammetric images to obtain their respective dense point clouds, which were classified using the RF and GBT (Gradient Boosted Trees) classifiers to detect buildings, terrain, high vegetation, roads, cars, and other human-made objects. The Pearson product–moment correlation coefficient and the Fisher information were computed on the training samples. Both metrics indicated that HSV (Hue Saturation Value) was the most informative portion of the color space used in color feature extraction. Different combinations of features and classifiers were run and evaluated, revealing a significant increase in the overall classification accuracy when color features were included, as well as the superiority of the GBT classifier over RF. Zhou et al. [
44] considered DIM (Dense Image Matching) point clouds as auxiliary data to classify airborne LiDAR clouds on a point-based level within a novel framework. For the DIM and LiDAR point clouds, they built a 13D feature space that included purely geometric descriptors: height, eigenvalue, normal vector, curvature, and roughness features. Then, they extracted useful samples from the DIM points as training data using the extended multi-class TrAdaboost and input the features of all DIM and LiDAR point clouds to a weighted RF model. In each following iteration, misclassified DIM samples were assigned lower weights. Consequently, poor DIM sample points were eliminated and useful ones were retained. The authors tested the method on different urban scenes with buildings, ground, and vegetation. The best overall accuracy they obtained was 87%. However, their approach saves computational time since it does not require DIM–LiDAR registration. Moreover, DIM points can be acquired from different scenes, as long as they capture similar classes.
This literature review concludes that the classification of LiDAR data on a point-based level is more accurate than pixel-based classification, since spatial and spectral information is partially lost in the latter. Moreover, it highlights the importance of expanding the feature space dimensions to include more descriptors for better mapping results. Most importantly, the referenced experimental work indicates that a slim set of color features can be more effective than a large number of geometric descriptors in enhancing the classification results. The overall accuracy did not exceed 90% in the vast majority of the mentioned studies, which focused on computing high-dimensional, geometric-oriented feature spaces. On the other hand, the mapping accuracy barely passed 95% in [42], and only with the addition of NDVI features derived from a multispectral LiDAR sensor. Therefore, in this paper, we focus on fusing spectral descriptors inherited from more affordable sensors into the point-based classification of airborne LiDAR clouds. In particular, we explore the effect of the gradual inclusion of the red, green, blue, NIR, and NDVI features integrated from corresponding aerial photos on the accuracy results using the ML, SVM, MLP, and Bagging supervised classifiers. The presented work gives thorough insights into each scenario in terms of training and validation data size and distribution, learning behavior, processing time, and per-class accuracy for each classifier.
The remainder of the paper is organized into the following sections: Methods, Experimental Work, Results, Discussion, and Conclusions.
4. Results
Early classification experiments revealed errors in the geo-registration performed in [
45] using the third-order polynomial model.
Figure 11 illustrates some of these errors featured by the displacement of sidewalks (
Figure 11a) as well as misalignment of buildings and grasses (
Figure 11c). The switch to the affine registration model at a spatial resolution of 40 cm along with the fusion of CED as described in [
46] resulted in 3074 and 3045 candidate control points on the LiDAR and aerial images, respectively, which were matched into a set of 975 pairs of final control points. The whole processing took 14.7 h and yielded 1.78 and 1.19 pixels as model development and validation errors, respectively.
Figure 11b,d display the effect of these modifications on correcting the displacement and misalignment of geo-registered points, respectively.
Figure 12 outlines the overall classification accuracy on the validation and test sets for the four investigated scenarios. For the same classifier, each combination of features shows close accuracy values on both datasets: validation and test. Additionally,
Figure 13 displays the results of the
k-fold cross-validation performed on the MLP neural network classification models. The bars represent the mean accuracy of the ten folds, calculated by averaging their results when each participated as the held-out testing set. The accuracy is shown separately for the ground and non-ground classifications, since the models operated independently on the two datasets after ground filtering. The figures above the bars are the standard deviations of the ten accuracies, assuming a normal distribution. They are minimal: apart from the ±7.47% recorded by the ground classification of the first scenario, the highest is ±0.26%, for the non-ground classification of the second scenario.
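The procedure corresponds to the following sketch, in which the classifier configuration and data placeholders are hypothetical rather than the settings used in this study:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

def kfold_accuracy(X, y, k=10, seed=0):
    """Mean and standard deviation of accuracy over k held-out folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = MLPClassifier(max_iter=500)   # illustrative configuration
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))  # held-out fold
    return np.mean(scores), np.std(scores)
```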
The skills achieved by the classifiers in each feature space group are illustrated in
Figure 14. The overall accuracy (
Figure 14a) of the first scenario, which involved only I and H, did not exceed 78%; it increased tangibly to cross 95% with the inclusion of the R, G, and B spectral properties in the feature space denoted by the second scenario. The overall accuracy increment looks uniform for the last three scenarios; however, the third feature combination, which cumulatively appends the NIR, provides the best scenario, with the largest overall accuracy surpassing 96%. The addition of the NDVI in the fourth feature set insignificantly lowered the overall accuracy, which remained above 96%.
The influence of adding the radiometric characteristics to the feature space is more apparent in the ground classification (
Figure 14b) than the non-ground (
Figure 14c). The latter shows a relatively confined range of accuracies in all feature combinations, contrary to the ground classification, where the inclusion of the R, G, and B in the second scenario considerably improved the accuracy to around 98% after barely passing 60% in the first scenario. In addition, the ground classification shows accuracies higher than the non-ground in all feature combinations except the first one.
The MLP shows superiority over the ML and SVM classifiers, as it scores the highest overall accuracy of 97.75% with the third scenario’s feature space. This superiority holds in all scenarios and is even more pronounced in the ground classification, for which the accuracy passes 98% in the third scenario. The ML and SVM results are very similar; nevertheless, the ML performs better in all scenarios except the first one. It achieves a 96.74% overall accuracy versus 96.36% for the SVM in the third scenario. In terms of running time, Figure 15 illustrates the processing time required for the models’ training, validation, and testing in the four considered scenarios. The ML classifiers are the fastest predictive models, and the time they take to classify the data scales directly with the feature space dimension: it gradually increases from 19 to 21, 24, and 25 min for the four feature combinations, respectively. The MLP networks show slightly longer processing times, as they run in 26–27 min. The SVM is the most computationally expensive classifier, processing the data noticeably slowly at 108 min for the first scenario. This time plunges and then remains relatively steady for the rest of the scenarios, which take 43, 32, and 39 min, respectively.
Figure 16 provides a profound look at the per-class accuracy of the best-scenario classification performed by the three classifiers. Each individual class accuracy is expressed by the producer and user accuracies. The former measures how well the ground-truth reference points of a class were classified (omission errors), while the latter measures how reliably the points assigned to a class actually belong to it (commission errors). Except for the vehicles and low vegetation classes, the bar charts display similar figures that are above 93%. The accuracies of the building, high vegetation, and light asphalt classes exceed 97%, and the sidewalk and grass accuracies even reach 100%. However, the vehicles and low vegetation classes have the lowest accuracies. The producer accuracies are 61, 63, and 70% for the vehicle class and 89, 80, and 96% for the low vegetation class, as recorded by the three classifiers ML, SVM, and MLP, respectively.
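Both measures follow directly from the confusion matrix, as in this short sketch (rows are assumed to hold the reference classes):

```python
import numpy as np

def per_class_accuracies(confusion):
    """Producer and user accuracies from a confusion matrix whose rows are
    reference (ground-truth) classes and whose columns are predictions."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    producer = correct / confusion.sum(axis=1)  # omission: per reference class
    user = correct / confusion.sum(axis=0)      # commission: per mapped class
    return producer, user
```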
Bagging classification results showed similar overall and per-class accuracies. Therefore, we moved directly to the Voting ensemble classification using the standalone predictions of the ML, SVM, and MLP neural networks. Since the latter demonstrated the best accuracy results, LiDAR point clouds were assigned the MLP predictions in the case of a tie. The overall classification accuracy is 97.36%. This figure is higher than the maximum urban mapping accuracy known to us so far, i.e., the 95% achieved by Morsy et al. [
42] using costly multispectral LiDAR sensors. Moreover, it surpasses other accuracy figures that barely exceeded 90% using computationally expensive geometric feature spaces, as mentioned previously at the end of the Introduction section.
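The voting rule reduces to the following sketch; with three voters, a draw can only occur when all three labels differ, in which case the MLP label is kept:

```python
import numpy as np
from collections import Counter

def vote(ml_pred, svm_pred, mlp_pred):
    """Majority vote over three classifiers; ties fall back to the MLP label."""
    fused = np.empty_like(mlp_pred)
    for i, votes in enumerate(zip(ml_pred, svm_pred, mlp_pred)):
        counts = Counter(votes).most_common()
        if len(counts) == 3:           # three different labels: a draw
            fused[i] = mlp_pred[i]     # defer to the best standalone classifier
        else:
            fused[i] = counts[0][0]    # majority label
    return fused
```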
Figure 17 displays the final map, for which a qualitative inspection reveals minor additional misclassification of high vegetation as buildings, light asphalt as dark asphalt, and buildings as high vegetation, marked as A, B, and C, respectively. The transmission towers were identified manually as non-ground scene features.
5. Discussion
As concluded in [
45,
46], it is better to apply the PC model for LiDAR and imagery data geo-registration on 2D images resampled at lower spatial resolutions to avoid computationally expensive data processing and high model development errors. However, including CED with the PC filter as a scene abstraction approach significantly reduces the model development errors (from 5.74 to 1.78 pixels using the affine model) when processing the data at high spatial resolutions (40 cm). The same applies to the model validation errors, although the improvement is less noticeable (from 1.29 to 1.19 pixels). The double filtering eliminates noise that impedes precise edge detection, and this multifilter fusion diminishes the gap between model development and validation errors. Consequently, it makes it worth reconsidering registration models applied to data processed at high spatial resolutions, which would otherwise be excluded under the clustering scene abstraction approach due to a false indication of poor registration performance. Hence, the filter combination draws attention to decreasing the execution time when dealing with high spatial-resolution data (especially if they cover large study regions) by dividing them into tiles and/or by considering multithreaded calculations.
On the other hand, the errors demonstrated in
Figure 11 highlight the importance of a thorough qualitative post-registration inspection when validating the chosen empirical registration model. A model’s ability to minimize its development and validation errors should not be the sole criterion for model selection. For example, the third-order polynomial model produced the lowest and closest development and validation errors [
45]; however, it caused overparameterization that resulted in displacement and misalignment errors. Therefore, examining feature edges (e.g., building roofs and grasses) after ground filtering, as well as preliminarily classifying some ground objects and matching them to a basemap, are recommended practices for a reliable qualitative geo-registration assessment.
The results described in
Figure 12 indicate the consistency of the classification models, since the same scenario exhibits almost the same accuracy when assessed separately on the validation and testing data. This proves the models’ consistency and precision and, consequently, the reliability of either set in demonstrating the classification accuracy. The
k-fold cross-validation results shown in
Figure 13 support this conclusion for the MLP neural networks, where the whole dataset not only contributed to the testing but also participated entirely in the training process, yet produced compatible models with consistent results. However, a slim feature space, such as that of the first scenario, increases the variance of the accuracies, reflected by a standard deviation as high as ±7.47% in the ground classification. Nevertheless, since height values are non-distinctive for ground objects, they are a poor choice as the sole feature in a feature space.
Generally, randomness is a characteristic of many machine learning algorithms due to stochastic domains that may include noisy data, random errors, or incomplete samples from a broader population (e.g., the train/test data split). Furthermore, some algorithms use probabilistic decisions to ensure smooth learning that is generic enough to accommodate unseen data, fitting a different model each time they run, even on the same data [86]. Neural networks have additional sources of randomness as optimization-based classifiers that stochastically select subsequent areas of the feature space to learn from, based on areas already searched. Moreover, randomness in the observation order and the initial weights highly affects their internal decisions [
87]. However, the results of the
k-fold cross-validation (
Figure 13) confirm robust MLP models thanks to adequate and representative training data.
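In practice, these sources of randomness can be pinned for repeatable experiments, as in the hypothetical snippet below; the data, sizes, and network settings are placeholders, not the configuration of this study.

```python
import random
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

random.seed(42)                          # Python-level randomness
np.random.seed(42)                       # NumPy-level randomness

X = np.random.rand(1000, 6)              # placeholder feature space
y = np.random.randint(0, 4, size=1000)   # placeholder class labels

# the train/test split is one stochastic element named above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# observation shuffling and initial weights are fixed via the classifier's seed
model = MLPClassifier(hidden_layer_sizes=(16,), random_state=42, max_iter=300)
model.fit(X_tr, y_tr)
```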
Expanding the poor LiDAR feature space to include the R, G, and B radiometric signatures, as in the second scenario, boosted the overall mapping accuracy (Figure 14a). Since the naked-eye scene visualization (Figure 6b) is discriminative, this addition to the H and I features enabled the classifiers to differentiate between buildings, sidewalks, and dark and light asphalt. Moreover, the consideration of the NIR spectral property in the third scenario provided better recognition of green objects such as grass and high and low vegetation, which improved the overall accuracy. On the other hand, the intangible contribution of the NDVI is justified, as it is an index derived from the already included NIR and R features, and this redundancy did not add significant value to the classification accuracy. As described previously in this section, the first scenario represents a 1D feature space in the ground classification. Therefore, its accuracy (
Figure 14b) was influenced by adding R, G, and B to the feature space in the second scenario more than the non-ground classification, where the H values do represent a discriminative feature (
Figure 14c). It also explains the plunge in accuracy of the ground classification compared to the non-ground in the same scenario.
The higher accuracy figures achieved by the ML over the SVM classifier, despite being very close, support the validity of assuming a Gaussian probability distribution of the data, which makes ML more advisable than SVM, since the latter requires substantial time for training and hyperparameter fine-tuning. However, this conclusion does not apply to 1D feature spaces, as in the ground classification of the first scenario, where ML dropped the accuracy to 38.09% compared to 59.47% by SVM (Figure 14b) and subsequently lowered the overall accuracy to 64.76% compared to 74.97% by SVM (Figure 14a). On the other hand, the outperformance of the MLP networks reflects their flexibility in addressing nonlinear predictive problems, owing to their capacity to approximate unknown functions, which justifies their outstanding accuracy figures even with 1D feature spaces. Their overall accuracy reached 77.59% in the first scenario, a considerably better performance than that of ML and SVM in the same situation.
Another distinguishing feature of the MLP neural networks is the updating of the parameters and the backpropagation of errors at each epoch, which eventually stabilizes the models by ensuring good learning without overfitting or underfitting.
Figure 18 shows the learning curves of the MLP network classification of the best scenario, i.e., the third feature combination. The plots relate the model error, evaluated on the training and validation sets, to the number of epochs used in the model configuration. An underfit model shows better performance (lower loss) on the training data than on the validation data, with a noticeably poor validation performance that either declines or levels off as the epochs increase. The performance decline can be addressed by increasing the number of epochs and/or the training data size, whereas expanding the model capacity to include more neurons and/or hidden layers may solve the leveling-off problem. On the other hand, an overfit network shows improving accuracies on the training and validation data, with the validation curve first overlapping the training curve and then deviating from it as the epochs increase. Terminating the model at the inflection point and/or introducing more training data may solve the issue. Therefore, the plots in Figure 18 confirm a good fit, since the models perform well on both sets, with the curves stabilizing around the same loss value as the epochs grow.
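Curves of this kind can be produced with an epoch-wise loop such as the sketch below; the network size and epoch count are illustrative, not the configuration used in this study.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss
from sklearn.neural_network import MLPClassifier

def plot_learning_curves(X_tr, y_tr, X_val, y_val, epochs=200):
    """Track training and validation loss per epoch using partial_fit."""
    model = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
    classes = sorted(set(y_tr))
    tr_loss, val_loss = [], []
    for _ in range(epochs):
        model.partial_fit(X_tr, y_tr, classes=classes)  # one pass over the data
        tr_loss.append(log_loss(y_tr, model.predict_proba(X_tr), labels=classes))
        val_loss.append(log_loss(y_val, model.predict_proba(X_val), labels=classes))
    plt.plot(tr_loss, label="training loss")
    plt.plot(val_loss, label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```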
ML is a systematic classifier that straightforwardly follows Equation (
1), where the feature space dimension directly affects the size of the mean vector and variance-covariance matrix (the standard form of the discriminant is recalled after this paragraph). Therefore, a progressively increasing processing time is observed across the four feature combinations (Figure 15). Despite the highest model skills being recorded by the MLP networks, they are only slightly more time-consuming than ML, with the training period occupying the vast majority of the total processing time regardless of the feature space dimension. Their run time barely changed across the given scenarios, as it is tied to the network structure, which remained the same in all feature combinations. Nevertheless, once created, the models carried out predictions quickly (e.g., 1.7 min for the classification using the best scenario). SVM, in general, spends a long time building hyperplanes to differentiate the various classes. This drawback is most pronounced when training the models on a narrow feature space, as in the ground classification of the first scenario, which took twice the time needed for the non-ground data.
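Equation (1) is given in the Methods section and is not reproduced here; for orientation, the standard Gaussian ML discriminant it corresponds to has the form below (equal priors assumed), where the mean vector and variance-covariance matrix of class i grow with the feature space dimension:

```latex
g_i(\mathbf{x}) = -\tfrac{1}{2}\ln\lvert\boldsymbol{\Sigma}_i\rvert
                  -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\mathsf{T}}
                   \boldsymbol{\Sigma}_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i),
\qquad \hat{c}(\mathbf{x}) = \arg\max_i \, g_i(\mathbf{x})
```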
The per-class accuracies obtained by the three classifiers (
Figure 16) indicate satisfying results, except for the vehicles, for which the producer accuracies were notably lower due to confusion with the low vegetation class. This consequently decreased the non-ground classification accuracies (Figure 14c) compared to the ground ones (Figure 14b). Vehicles are small, mobile urban features with no distinguishable spectral characteristics. They are unlikely to maintain their locations across scenes captured by different sensors at different times, as in this geo-registration case, where the LiDAR point clouds were obtained in 2015, independently of the aerial photo that was captured a year earlier.
Figure 19 shows how the vehicle points inherited the radiometric properties of the ground surfaces where vehicles usually park (e.g., grassy sidewalks). As non-ground objects, vehicles were closest to the low vegetation class both spectrally and geometrically, given that height was the only geometric feature considered in the classification. Considering additional geometric properties in the classification feature space could help eliminate this misclassification. The ML and MLP classifiers gave relatively close vehicle-class accuracies, notably higher than those of the SVM, which hardly creates discriminative hyperplanes between such similar classes [
29,
88].
Bagging classification results were similar, indicating a good training dataset; however, Bagging took a substantial amount of processing time (>4 h) to classify the LiDAR data 11 times. Hence, neither the Bagging classification nor the k-fold cross-validation (processing ten folds also exceeded four hours) is advisable when adequate training data, already proven reliable on the validation and testing sets, are available. The transmission towers in the final map (Figure 17) do not appear frequently enough in the study area to be considered an independent class. Their points inherited vegetation radiometric properties after geo-registration, and their height misled the classifiers into predicting them as high vegetation. Therefore, they were picked out manually and reclassified.
Figure 20 plots the locations of the extra misclassification examples denoted in
Figure 17 on the corresponding aerial photo that contributed to the geo-registration. The misclassification of some high vegetation points as buildings (the A examples) is due to the aerial and LiDAR data being acquired at different times. These high trees were not planted at the time the aerial image was obtained, so their points inherited the dark asphalt’s spectral characteristics and eventually resembled the buildings’ radiometric and geometric properties, as they also share similar height values. The B examples on the aerial photo mark spots of shadow that belong to light asphalt but are wrongly predicted as dark asphalt, as the latter radiometrically resembles shadows. Finally, the C examples address the misprediction of some building-edge points as high trees. This is a result of an inaccurate orthorectification of the aerial image, in which the building facades are still visible. The zoom-in in Figure 20 exhibits an overlay of the geo-referenced LiDAR and aerial images, where the inexact orthorectification shifted the roofs from their correct projection and revealed the facades. Consequently, the building points located on the edges acquired the spectral characteristics of the facades and the surrounding planted surfaces, leading to their being wrongly assigned to high vegetation, especially given the close height ranges.
The shadow effect appears in the per-class accuracy results (
Figure 16), where the dark asphalt figures are slightly below the rest of the classes. On the other hand, the impact of the high-vegetation/buildings misclassification due to different acquisition times and inaccurate orthorectification of aerial data does not show up in the accuracy results. That is because LiDAR point clouds at such locations were not included in the training data (
Figure 9b), of which the validation and testing sets are a part (
Figure 10). The abovementioned impacts are drawbacks associated with the geo-registration of airborne LiDAR data to aerial/satellite images. Nevertheless, such geo-registration provides an affordable yet accurate means of densifying classification feature spaces for better mapping.
In summary, the contributions of our study to the fusion of airborne LiDAR and imagery data in the classification of urban land uses are as follows:
Geo-registering both data types by the PC model and a scene abstraction algorithm that locates control points not limited to points on traditional linear elements and intersections. The approach does not require the data to be georeferenced, applies empirical registration models without knowledge of the physical sensor model, and overcomes the problem of airborne LiDAR data not being accompanied by onboard optical images collected during the same flight mission. Full details about the geo-registration approach are found in [
46].
Suggesting the addition of the CED to the PC filter for a more concise detection of geo-registration primitives. This study highlights the importance of a thorough qualitative inspection of the abovementioned geo-registration method’s results, as empirical models may cause overparameterization that cannot be revealed quantitatively.
Introducing a new data fusion approach to LiDAR point cloud classification that constructs radiometric feature spaces from spectral characteristics inherited from aerial images. With affordable sensors and computationally inexpensive feature spaces that include only I and H from the LiDAR data and RGB, NIR, and NDVI from the aerial photos, we achieved a mapping accuracy of over 95% with a conventional classifier such as ML and of over 97% with MLP.
Investigating different combinations of the abovementioned features with various commonly known classifiers (ML, SVM, and MLP) with regard to the logic behind each classifier, the overall and per-class accuracies, and the processing time. This comparison showed that SVM and derived features (i.e., NDVI) do not enhance the mapping accuracies.
For future work, we recommend applying shadow-removal and post-orthorectification strategies to the aerial data before geo-registration to prevent LiDAR point clouds from receiving incorrect radiometric properties and thus to minimize misclassification errors. In addition, we suggest investigating the expansion of the feature spaces through an object-based classification that allows the inclusion of geometric features and/or extra radiometric ones from hyperspectral imagery data.
6. Conclusions
This study examines the effect of including different radiometric properties along with the height values on the classification results of LiDAR point clouds of an urban morphology. The morphology comprises classical urban-residential objects: sidewalks, grass, dark and light asphalt, buildings, vehicles, and high and low vegetation. In contrast to traditional point-based classification work that relies on a high-dimensional, geometric-oriented feature space, our research uses spectral properties inherited from an aerial photo, after geo-registering it to the LiDAR point clouds, to form different feature combinations. This provides an affordable path towards rich feature spaces, instead of the computationally expensive conventional ones and the costly multispectral airborne LiDAR sensors. The geo-registration added R, G, B, and NIR to the LiDAR’s I and H feature space, as well as the calculated NDVI.
We considered four scenarios for the ground and non-ground classifications. The first scenario started with H and I; the remaining feature combinations then incrementally added RGB, NIR, and NDVI. We applied three supervised machine learning algorithms to classify the scene, i.e., ML, SVM, and MLP neural networks, and used three separate sets drawn from the original LiDAR data for model training, validation, and testing. The validation and testing results of the three classifiers and the k-fold cross-validation of the MLP networks showed consistent accuracies, indicating that the models were reliable for the prediction of unseen data.
The best classifying features were those of the third scenario. The NDVI is already derived from the NIR and R signatures, so it did not add to the results. The best scenario gave overall accuracies of 96.74, 96.36, and 97.75% for the three classifiers, respectively. The MLP performed best, followed by the ML algorithm, owing to the layered structure of MLP, which approximates nonlinear prediction problems, and to ML’s valid assumption about the normal distribution of the data. The SVM was not as promising as ML and MLP, especially with mixed classes and narrow feature spaces, where it struggled to build discriminative hyperplanes. The three classifiers processed the data in 24, 32, and 26 min, respectively. The run times of the ML, SVM, and MLP models depended on the feature dimension; the feature dimension and the complexity of the classes; and the network structure, respectively.
The per-class accuracies showed low-vegetation/vehicle and dark/light-asphalt misclassifications. The first is due to vehicles acquiring the properties of grassy parking surfaces, and the latter is a consequence of shadows in the aerial image. Moreover, the acquisition of the LiDAR and imagery data at different times and the inaccurate orthorectification of the aerial photo resulted in misclassification between the building and high-vegetation classes. These mispredictions highlight the shortcomings of the LiDAR/aerial data geo-registration as well as the importance of a thorough qualitative post-classification assessment. The inclusion of geometric indices could eliminate the abovementioned misclassifications. Nevertheless, the addition of traditional spectral signatures such as R, G, B, and NIR to the LiDAR’s height values boosted the mapping results to exceed 97%. Finally, the Bagging classification and the k-fold cross-validation took more than four hours to process the data; hence, they are not recommended practices in cases with sufficient training samples. The majority voting of the three classifiers produced an overall accuracy of 97.36%.