Hyperspectral and Lidar Data Applied to the Urban Land Cover Machine Learning and Neural-Network-Based Classiﬁcation: A Review

: Rapid technological advances in airborne hyperspectral and lidar systems paved the way for using machine learning algorithms to map urban environments. Both hyperspectral and lidar systems can discriminate among many signiﬁcant urban structures and materials properties, which are not recognizable by applying conventional RGB cameras. In most recent years, the fusion of hyperspectral and lidar sensors has overcome challenges related to the limits of active and passive remote sensing systems, providing promising results in urban land cover classiﬁcation. This paper presents principles and key features for airborne hyperspectral imaging, lidar, and the fusion of those, as well as applications of these for urban land cover classiﬁcation. In addition, machine learning and deep learning classiﬁcation algorithms suitable for classifying individual urban classes such as buildings, vegetation, and roads have been reviewed, focusing on extracted features critical for classiﬁcation of urban surfaces


Introduction
Over the last few decades, global urbanization has grown rapidly. By 2050, around 68% of the world's population will be living in urban areas [1]. This can cause environmental challenges, including ecological problems, poor air quality, deterioration of public health, microclimate changes leading to severe weather, higher temperatures, limited access to water, persistent vulnerability to natural hazards, and the release of toxic particles from fast industrialization into the atmosphere [2,3]. These challenges lead to difficulties in advanced urban analyses due to urban surfaces' spectral and structural diversity and complexity over a small area [4,5]. Therefore, constant monitoring of urban areas is often highly required. Systematic monitoring and updating of maps are critical in urban areas, where many objects are mobile (vehicles and temporary buildings), and the infrastructure, vegetation, and construction are constantly changing.
Spatiotemporal investigations of the urban regions are today provided by remote sensing technology advances [6]. Especially, airborne remote sensing is a powerful developing tool for urban analysis that offers time-efficient mapping of a city essential for diverse planning [7], management activities [8], and monitoring urban and suburban land uses [9]. It has been proven as a common technique for mapping urban land cover changes to investigate, e.g., social preferences, the regional ecosystem, urbanization change, and biodiversity [10]. Urban remote sensing, in particular, is widely used for the investigation of three-dimensional urban geometry that is crucial for modeling urban morphology [11], identifying various objects, heterogeneous material, and mixtures. However, the growing machine learning as deep learning (DL). DL has been proven as an effective technique for feature extraction of HS data on the spatio-spectral level [35][36][37][38][39][40]. Although ML and DL methods are considered relevant classification tools in remote sensing, different algorithms perform best, extracting different pixel-or object-based features. Choosing a classification algorithm for HS data requires knowledge about the features that can be extracted. Especially, DL has gained popularity, thanks to finding unique deep parameters in a pixelwise manner [41]. However, in the urban context, a per pixel classification can lead to noisy results considering high spatial distribution.
Moreover, classification results mainly depend on the number of training samples, limiting the performance and accuracy when the training dataset is insufficient for learning the network algorithm [42]. In order to improve the classification results and reduce the heterogeneity issue, the inclusion of contextual information around pixels and objectoriented classification [43,44] were considered, which allowed retrieving spatial information of HS data and extracting critical spatial patterns of urban land cover classes [45,46]. ML-and DL-based land cover classification in the urban environment from lidar is primarily directed to detect buildings or high vegetation [47]. This is due to the lidar`s ability to extract geometric features from objects, deriving their shape, elevation, and other properties that are useful for a classification purpose. Especially, lidar, in combination with HS, is a powerful tool for classifying urban materials. However, since the objects in the urban scene are complex, analyses with conventional classifiers achieve a low accuracy [48]. Commonly, the application of ML and DL algorithms for classification purposes in the urban environment outperforms traditional classifiers developing very quickly [49].
This review study presents the latest ML and DL urban mapping methods focusing on airborne HS and lidar data. The datasets cover the reflective spectral range of the electromagnetic spectrum (VNIR, SWIR). The paper focuses on ML and DL classification algorithms applied in the urban environment for land cover classes, such as buildings, roads, vegetation, and water analysis. We point out algorithms applicable for HS, lidar, and HL-Fusion and the challenges of applying each algorithm to hyperspectral and lidar data.
The structure of this review paper is arranged as follows ( Figure 1): in Section 2, typical urban land cover classes are described in terms of their complexity in HS and lidar data analysis. Section 3 synthesizes the general characteristics of HS and lidar data, highlighting the automated and handcrafted features extracted from both sensors. In Section 4, classification algorithms for urban mapping purposes are described. Section 5 shows the results and discussion of the presented algorithms in urban environment classification. Finally, we point out conclusive remarks on the mapping methods, HL-Fusion potential, perspectives for further research, and recommendations for new research fields.

Classified Urban Land Cover Classes
The urban land cover consists of very complex physical materials and surfaces that are constantly having anthropological impacts. The urban surface types are a mosaic of seminatural surfaces such as grass, trees, bare soil, water bodies, and human-made mate-

Classified Urban Land Cover Classes
The urban land cover consists of very complex physical materials and surfaces that are constantly having anthropological impacts. The urban surface types are a mosaic of seminatural surfaces such as grass, trees, bare soil, water bodies, and human-made materials of diverse age and composition, such as asphalt, concrete, roof tiles for energy conservation and fire danger [50], and generally impervious surfaces for urban flooding studies and pollution [51]. The complexity of urban analysis also depends on the scale chosen and its purpose. Many classifications refer to urban materials with fine spatial Remote Sens. 2021, 13, 3393 4 of 39 resolution deepening the heterogeneity, allowing a more detailed mapping result. The classification of urban objects, which consist of many different materials and variance within a class, although significant (e.g., in city map updates), becomes a challenge due to the highly nonlinear and heterogeneous composition of different objects surfaces and materials, and thus, there is the need to use more training data for classification purposes, which is time-consuming and computationally expensive.

Buildings
Buildings in an urban context can be recognized as shapes with planar surfaces and straight lines [52]. Building detection based on remote sensing methods plays a crucial role in many applications in the urban environment, such as in 3D monitoring of urban development in time [53], urban planning, telecommunication network planning, vehicle navigation [33], urban energy planning [53], city management, and damage assessment [54]. Many mapping techniques are based on shape identification, outlines, and preliminary model data [54]. Besides detecting buildings as objects, building roof extraction has recently been a hot topic within the remote sensing community. Building roofs are defined by planarity properties and height derivatives based on elevation. A 3D visualization of buildings is of great importance for infrastructure management and modeling, 3D city mapping, simulations, change detection, and more [55]. Both airborne-based optical and lidar data have been used recently to map buildings. A common way to detect buildings is to use a digital surface model (DSM) [56,57], a normalized DSM (nDSM) [58,59], or a point cloud extracted from lidar data [60][61][62][63]. Lidar is capable of extracting building heights and planar roof faces [33]. It is beneficial for spatiotemporal assessment and investigation of building density for sustainability study and residential development in cities [53].
By contrast, airborne-based HS data can better distinguish between materials at the roof surfaces due to their spectral differences [33]. However, not including the elevation information from the lidar scanner, the classification of buildings and their roofs can be too complex without human expertise. One example is a building surrounded by an arid lawn with open soil, a grass rooftop, a building with an asphaltic parking lot, or bitumen roofing surrounded by asphaltic parking at the building's ground-level high vegetation (trees) overhanging buildings [64]. Therefore, an HL-Fusion can improve the building classification results offering high accuracy on a spectral and spatial basis.

Vegetation
Vegetation is recognized by its geometrical complexity, defined by parameters such as the roughness, point density ratio measure [65], and chlorophyll spectral feature. In the last decade, active (Sentinel-1, LiDAR, and radar) and passive (Quickbird, Worldview, Sentinel-2, Landsat, and MODIS) remote sensing has been widely applied to vegetation detection. Lidar data are used to generate virtual 3D tree models [66], map low and high vegetation [67], and, using multispectral lidar, assess vegetation variety regarding its health and density [68], as well as extract vegetation indices, e.g., NDVI [69] for monitoring changes caused by urbanization, anthropogenetic activities, and harvesting applying wavelet transform [70,71]. However, vegetation detection is not a straightforward approach. The analysis is often complex and detailed due to the increasingly finer spatial resolution of remote sensing devices, such as distinguishing photosynthetic and nonphotosynthetic vegetation [72]. Vegetation is often not defined as a whole but as groups, for example, as low vegetation (grass), middle vegetation (shrubs), and high vegetation (trees). One of the more complex challenges is the similar morphology of low/young trees and shrubs, causing misclassification of shrubs as high trees [73]. HS data are also used to detect vegetation on a spectral basis (chlorophyll reflectance), differentiating between vegetation types and healthiness. More biophysical parameters can be defined due to more spectral bands than multispectral lidar (usually 2-3 wavelengths), such as the leaf area index, fractional cover, and foliage biochemistry [74]. Both sensors have been fused in many Remote Sens. 2021, 13, 3393 5 of 39 studies, e.g., for canopy characterization for biomass assessment and estimation of risk of natural hazards [75] and urban tree species mapping [76].

Roads
Road detection from airborne-based HS and lidar data is essential in remote sensing applications, e.g., a road navigation system, urban planning and management, and geographic information actualization [77,78]. The elevation feature derived from lidar data has been proven as a significant parameter to extract time-efficient road methods compared to optical methods [79]. DSM distinguishes more precise boundaries of surfaces, even in occluded regions [80]. However, only lidar-data-based classification is limited when roads are at the same elevation but made of different materials, such as asphalt, concrete, or other impervious materials [18]. Therefore, HS imaging can differentiate between different materials and their conditions to complement road classification purposes. It has already been proven by Herold et al. [81] for the following uses: map alteration, degradation, and structural damages of road surfaces based on spectral analysis. Usually, to detect roads, texture information is implemented [82]. In addition, lane marks can be used as an indicator for new roads; however, this approach is illumination sensitive [83]. HS data classification without topographic information is challenging when differentiating between two objects made from the same material: differentiation between a parking lot, parking at the ground level, cycleway, and a road [30].

Miscellaneous
Apart from the above-described land cover classes, the urban environment consists of more complex thematic classes. They commonly cannot be chemically or physically described by a single hyperspectral absorption feature or other single features, such as height or shape, which are, however, extracted from contextual information. Thus, spatial context is critical and necessary for identifying industrial areas, commercial or residential buildings, playgrounds, and harbors in coastal cities. The combination of spectral and spatial features from HS and lidar data shows potential, allowing identifying thematic class and assessing its condition in terms of quality and materials.

Key Characteristics of Hyperspectral and Lidar Data
In urban land cover classification, handcrafted feature engineering plays an important role in standard shallow ML algorithms, such as support vector machines (SVM) and random forest (RF). Features are manually derived from remotely sensed data and defined to describe an object of interest, starting from spectral bands through, for example, spectral indices and contextual information, which are generally very useful in defining important biophysical parameters, e.g., for vegetation [84]. However, manually derived features may not sufficiently represent the highly complex and unique urban environment [85]. Depending on the classification objective and classified objects, different features are required. However, in DL, the feature engineering process is simplified as features are extracted during the training step [86]. These automatic high-level features can represent complex spatial correlations and nonlinear relationships. Examples of handcrafted features for both HS and lidar data are described below in this section.

Hyperspectral (HS) Images
HS data retrieved from an imaging spectrometer are a three-dimensional cube that includes two-dimensional spatial information (x, y) with spectral information at each pixel position x i y j [87]. Each pixel in the obtained digital data contains a nearly continuous spectrum covering the reflective spectral range of the visible, near-infrared (VNIR: 400-1000 nm) and short-wave infrared (SWIR: 1000-2500 nm) [88,89]. HS as a passive system is dependent on the given lighting conditions resulting in high intraclass (within a class) spectral variability. In these wavelength ranges of the electromagnetic spectrum, particular absorption features and shapes make it possible to identify the material's chemical and physical properties [90]. For example, in urban land cover classification, the reflective spectral range is often used to map diverse soils [91], vegetation [92], rooftop materials [93,94], and other complex physical materials [12,[95][96][97].
A high spectral resolution characterizes airborne-based HS applications at the expense of spatial resolution since the HS sensor's spatial resolution linearly depends on the flight altitude and the instantaneous field of view (IFOW) [98]. However, due to technology development, the spatial resolution of HS is enhanced. Spectrometers with high spectral and spatial resolution have been used to identify detailed urban materials [12,13,94,99]. With a higher spatial resolution of the hyperspectral camera, it is more likely that the spectral signals are less mixed, producing pure pixels and thus detecting materials in the urban environment with high geometric detail and material accuracy. However, a high resolution can lead to difficulties, detecting more diverse materials within a single object, thus increasing heterogeneity and making object-based classification on a coarser scale more challenging. Especially in urban remote sensing, the spatial complexity of the objects and their heterogeneity have been an issue for limited spatial resolution in many studies [94,100]. When within a single pixel, the spectral mixture is very complex, the different spectral properties of individual urban materials are lost, making classification at the level of relevant urban materials challenging [101]. Therefore, a high spatial resolution of hyperspectral sensors has become a crucial parameter in urban mapping.
Land cover classification based on HS data is affected by spatial and spectral resolution, classification purposes (scale and defined land cover classes), mapping methods, and data acquisition and preprocessing. The latter can be the optical geometry, integration time, and other parameters during the acquisition [102]. Especially in airborne-based HS imaging, the sensor experiences altitude variation, which results in geometric distortions in the HS scene [103]. It is always a compromise between off-nadir distortion, spatial resolution, mixed pixels, and SNR (signal-to-noise ratio). Therefore, the strategy and flight scheme must be adapted to the level of the classification target in an advanced way. The flight line's swath width is reduced at a lower altitude, which requires more flight lines to be flown to cover the target area with changing light conditions due to long integration time [104] and leads to higher off-nadir distortions [105]. However, there are challenges for flying at higher altitudes, such as a high degree of mixed pixels due to a low spatial resolution [106]. In addition, the short integration time at lower altitudes results in lower SNR and decreased sensor sensitivity, producing a more elevated noise floor.

Spectral Features
Within one material, spectral features can vary due to color, coating, degradation, alteration, roughness, the illumination of material, data acquisition, location of the material, and preprocessing data ( Figure 2) [97,107,108]. These variations within a material are more and more investigated, generating spectral libraries of complex urban materials [12,109,110] and normalization based on advanced preprocessing. HS images result in high-dimensional data leading to computationally expensive analyses. For this reason, the first common step of the classification process of the HS data is very often a spectral dimensionality reduction to the relevant components applying linear spectral transformations without losing important spectral information [111]. Standard techniques for dimensionality reduction are often statistically based, such as principal component analysis (PCA) [112], linear discriminant analysis (LDA) [113], multivariate curve resolution (MCR) [114], and other unsupervised classification methods. Such data compression saves computing time, reduces noise, and retains needed information [115]. They are often based on the individual image statistic, and thus they are not directly transferable to other flight lines or flight campaigns. In addition, quantification procedures based on the spectral signature are no longer possible. Statistical calculations have been applied to the spectral features of the urban materials, such as continuum removal [116,117]. The continuum-removal algorithm is applied to identify spectral absorption features by their wavelength positions and shapes, removing the overall albedo of the reflectance curve and reducing the searched material's superimposition [118]. However, the general shape of an absorption feature is relevant for material identification and quantification. Continuum removal may prove effective only for limited studies, excluding the original shape of the spectra. Some handcrafted target-specific features can be calculated from optical remote sensing data, such as normalized difference vegetation index (NDVI) for vegetation detection [8,18,119,120], new impervious index, road detection index, new roof extraction index for the detection of built-up, roads, and roofs [121], normalized difference built-up index [122], visible red and green near-infrared built-up indices [123], road extraction index [124], and hyperspectral difference water index for the detection of urban water bodies [125].
based on the individual image statistic, and thus they are not directly transferable to other flight lines or flight campaigns. In addition, quantification procedures based on the spectral signature are no longer possible. Statistical calculations have been applied to the spectral features of the urban materials, such as continuum removal [116,117]. The continuumremoval algorithm is applied to identify spectral absorption features by their wavelength positions and shapes, removing the overall albedo of the reflectance curve and reducing the searched material's superimposition [118]. However, the general shape of an absorption feature is relevant for material identification and quantification. Continuum removal may prove effective only for limited studies, excluding the original shape of the spectra. Some handcrafted target-specific features can be calculated from optical remote sensing data, such as normalized difference vegetation index (NDVI) for vegetation detection [8,18,119,120], new impervious index, road detection index, new roof extraction index for the detection of built-up, roads, and roofs [121], normalized difference built-up index [122], visible red and green near-infrared built-up indices [123], road extraction index [124], and hyperspectral difference water index for the detection of urban water bodies [125].

Spatial Information
Spatial-context information is widely used to achieve robust and accurate classification maps considering the neighborhood in the target pixel. While spectral features are the most relevant features in material-based classification, adding spatial features to object classification makes it easier to group pixels with some spectral variance into one class representing an object or land cover type [126] (see Section 3.3). In addition, the spatial noise of the classification results can be reduced [127,128]. In [129], the authors proposed a context-sensitive semisupervised SVM classification technique using contextual information without assumptions about the labeling of contextual pixels. In [130,131], the authors also added the contextual features into hyperspectral image classification, including the information in the classification map generation step. Spatial information is often incorporated in hyperspectral classification problems applying Markov random field where a predefined neighborhood of a pixel assumes that the central pixel belongs to the same class [36,132,133]. Contextual features can also be extracted considering texture (see Section 3.3.1), morphological features (see Section 3.3.2), and image segmentation.

Spatial Information
Spatial-context information is widely used to achieve robust and accurate classification maps considering the neighborhood in the target pixel. While spectral features are the most relevant features in material-based classification, adding spatial features to object classification makes it easier to group pixels with some spectral variance into one class representing an object or land cover type [126] (see Section 3.3). In addition, the spatial noise of the classification results can be reduced [127,128]. In [129], the authors proposed a context-sensitive semisupervised SVM classification technique using contextual information without assumptions about the labeling of contextual pixels. In [130,131], the authors also added the contextual features into hyperspectral image classification, including the information in the classification map generation step. Spatial information is often incorporated in hyperspectral classification problems applying Markov random field where a predefined neighborhood of a pixel assumes that the central pixel belongs to the same class [36,132,133]. Contextual features can also be extracted considering texture (see Section 3.3.1), morphological features (see Section 3.3.2), and image segmentation.

Lidar Data
Lidar data is a three-dimensional point cloud (x, y, z) which delivers by default information about elevation, multiple-return, the reflected intensity, texture, and waveformderived feature spaces from the object hit by laser pulse [31,134]. As an active sensor, a lidar system emits radiation from one bandwidth (more in the case of multiwavelength lidar scanners) to the object surface at high repetition rates. Lidar scanners are whiskbroom-type instruments and typically use the monochromatic laser in visible-532 (bathymetric/coastal mapping)-and near-infrared-1064 and 1550 nm-for example, for vegetation detection and differentiation between asphaltic and nonasphaltic roads [135] which can be used as an additional intensity feature in land cover mapping in the reflective spectral range [31]. The advantage of using airborne lidar is insensitivity to relief displacement and illumination conditions [31], retaining full 3D geometry of data.

Height Features and Their Derivatives (HD)
The height feature is used to calculate the three-dimensional coordinates (x,y,z) that generate a gridded 2.5-dimensional topographical profile of the area of interest [31]. Especially in the urban environment, the z value height is crucial for precise contour generation of elevated objects [31]. In addition, the height difference between the lidar return and the lowest point in cylindrical volume has been investigated and proven as an important feature in discriminating ground and nonground points [136,137]. Moreover, a digital surface model (DSM) ( Figure 3A) is extracted from the height information applying interpolation of 3D points onto a 2D grid. From a DSM, a surface roughness layer [138] and a normalized DSM (nDSM) ( Figure 3C) are calculated, subtracting the digital terrain model (DTM) ( Figure 3B) from the DSM [31]. The overlapping of the building height information and the terrain height information is thus excluded. The object representation heterogeneity is therefore reduced, which helps the classification procedure.
form-derived feature spaces from the object hit by laser pulse [31,134]. As an active sensor, a lidar system emits radiation from one bandwidth (more in the case of multiwavelength lidar scanners) to the object surface at high repetition rates. Lidar scanners are whiskbroom-type instruments and typically use the monochromatic laser in visible-532 (bathymetric/coastal mapping)-and near-infrared-1064 and 1550 nm-for example, for vegetation detection and differentiation between asphaltic and nonasphaltic roads [135] which can be used as an additional intensity feature in land cover mapping in the reflective spectral range [31]. The advantage of using airborne lidar is insensitivity to relief displacement and illumination conditions [31], retaining full 3D geometry of data.

Height Features and Their Derivatives (HD)
The height feature is used to calculate the three-dimensional coordinates (x,y,z) that generate a gridded 2.5-dimensional topographical profile of the area of interest [31]. Especially in the urban environment, the z value height is crucial for precise contour generation of elevated objects [31]. In addition, the height difference between the lidar return and the lowest point in cylindrical volume has been investigated and proven as an important feature in discriminating ground and nonground points [136,137]. Moreover, a digital surface model (DSM) ( Figure 3A) is extracted from the height information applying interpolation of 3D points onto a 2D grid. From a DSM, a surface roughness layer [138] and a normalized DSM (nDSM) ( Figure 3C) are calculated, subtracting the digital terrain model (DTM) ( Figure 3B) from the DSM [31]. The overlapping of the building height information and the terrain height information is thus excluded. The object representation heterogeneity is therefore reduced, which helps the classification procedure. The nDSM represents the above-ground points that correspond to the actual heights of the object, omitting information about the objects which could complicate the classification, for example, the differentiation of buildings in lowland or hilly regions. The height information from lidar data helps differentiate between high and low vegetation [139], tree-level characterization applying the canopy height model (CHM) [140], and roads and buildings in the urban environment [8]. In addition, slope calculation (first derivative of any elevation product) and surface curvature (second derivative of the elevation surface) have been applied for detecting surface roughness [141,142] and changes in the normal vectors of the surface [143]. Moreover, calculated skewness and kurtosis models from the lidar elevation data were applied by Antonarakis et al. [144] to determine planted and natural riparian forests and their ages [32]. In the classification approaches, Charaniya et al. [145] included height variation, Bartels and Wei [146] calculated mean variance and standard derivation of the height in the first echo from lidar to measure the roughness, The nDSM represents the above-ground points that correspond to the actual heights of the object, omitting information about the objects which could complicate the classification, for example, the differentiation of buildings in lowland or hilly regions. The height information from lidar data helps differentiate between high and low vegetation [139], tree-level characterization applying the canopy height model (CHM) [140], and roads and buildings in the urban environment [8]. In addition, slope calculation (first derivative of any elevation product) and surface curvature (second derivative of the elevation surface) have been applied for detecting surface roughness [141,142] and changes in the normal vectors of the surface [143]. Moreover, calculated skewness and kurtosis models from the lidar elevation data were applied by Antonarakis et al. [144] to determine planted and natural riparian forests and their ages [32]. In the classification approaches, Charaniya et al. [145] included height variation, Bartels and Wei [146] calculated mean variance and standard derivation of the height in the first echo from lidar to measure the roughness, and Im et al. [147] added homogeneity, contrast, and entropy of height as feature spaces after image segmentation ( Figure 4).

Intensity Data
Intensity values extracted from lidar data correspond to the peak amplitudes from the illuminated object [31]. Applying intensity as a feature space, Song et al. [148] presented an approach to determine asphalt roads, grass, house roofs, and trees. However, trees' diverse intensity values undermine the classification due to the canopies' complex geometry [149]. Moreover, lidar-based intensity can differentiate between low vegetation and impervious surfaces, such as built-up areas. MacFacen et al. [150] applied the estimated mean intensity values from a lidar dataset in an object-based image classification approach. Intensity data are unstable and contain artifacts in the overlapping regions of single strips and eccentricity Remote Sens. 2021, 13, 3393 9 of 39 caused by the gain response, sensor scanning, and environmental factors [151][152][153]. To remove the noise from the intensity data, interpolation, filtering methods, and radiometric calibration are commonly used [148,154]. Additionally, the influence of flying altitude variations, topography, and atmospheric conditions can be corrected, adjusting intensity values, which is called range compensation [155].
sented an approach to determine asphalt roads, grass, house roofs, and trees. However, trees' diverse intensity values undermine the classification due to the canopies` complex geometry [149]. Moreover, lidar-based intensity can differentiate between low vegetation and impervious surfaces, such as built-up areas. MacFacen et al. [150] applied the estimated mean intensity values from a lidar dataset in an object-based image classification approach. Intensity data are unstable and contain artifacts in the overlapping regions of single strips and eccentricity caused by the gain response, sensor scanning, and environmental factors [151][152][153]. To remove the noise from the intensity data, interpolation, filtering methods, and radiometric calibration are commonly used [148,154]. Additionally, the influence of flying altitude variations, topography, and atmospheric conditions can be corrected, adjusting intensity values, which is called range compensation [155].

Multiple-Return
A lidar-based laser pulse can split into multiple laser returns if it hits a permeable object such as a tree canopy and obtains a response from, e.g., branches, leaves, stems, and the ground [31]. Multiple-return data has been recently used as an additional feature space in the urban mapping in the commercial building, small house, and tree determination [146]. Charaniya et al. [145] and Samadzadegan et al. [48] extracted the first, the last, and the normalized difference (NDI) between these returns to investigate roads and buildings. However, multiple returns occur as well if the laser pulse reaches building edges [156].

Multiple-Return
A lidar-based laser pulse can split into multiple laser returns if it hits a permeable object such as a tree canopy and obtains a response from, e.g., branches, leaves, stems, and the ground [31]. Multiple-return data has been recently used as an additional feature space in the urban mapping in the commercial building, small house, and tree determination [146]. Charaniya et al. [145] and Samadzadegan et al. [48] extracted the first, the last, and the normalized difference (NDI) between these returns to investigate roads and buildings. However, multiple returns occur as well if the laser pulse reaches building edges [156].

Waveform-Derived Features
Full-waveform lidar scanners can retrieve the entire signal of the backscattered laser pulse as a 1D signal profile in the chronological sequence [134,156,157]. A full-waveform lidar system can better correct the intensity values than the discrete systems, such as accurate estimation of the surface slope [158], eliminating the assumption of Lambertian reflectors [159]. However, before using any classification approach, proper radiometric calibration is needed to adjust waveform data from different flight campaigns. Such a radiometric calibration should include preflight, onboard, and vicarious calibration, as presented by Wagner [155]. The waveform-derived features extracted from the gaussian decomposition function have been tested for urban mapping purposes [47,136,160,161]. They include the waveform amplitude, (normalized) number of echoes, their width (Gaussian standard deviation), the difference between the first and the last return, echo shape, and echo cross-section. The latter provides high values for buildings, medium values for vegetation, and low values for roads [137]. For building facades and vegetation that meet multiple echoes, the normalized number of echoes feature is, therefore, relevant [137]. Jutzi and Stilla [162] extracted linear features on roofs based on full-waveform data. Chehata et al. [136] provided that by adding echo width as a feature, the classification results improved for low vegetation. Echo shape was investigated by [137,163], providing low values to roofs and high values to vegetation. It has been proven that the waveform geometry helps to differentiate between trees and built-up areas [136,156,164], determine tree species [165,166], and segment lidar point clouds in an urban area [167]. The waveform amplitude depends on the target. High amplitudes were observed by Chehata et al. [136] for rooftops, gravel, cars, bare soil, and grass, and low amplitudes for asphalt, tar street, and water. Mallet and Bretar [156] observed high amplitudes for grass and bare earth and found that the spread in the pulse and low amplitudes can be assigned to flat surfaces by increasing the incident angle. The echo waveform classification has been applied by Lin and Mills [168] and Doneus et al. [169]. The terrain echoes were separated from echoes from bushes and low vegetation. The echo pulse is wider on the canopy surface and plowed field than on the meadow and street [156]. High point density in full-waveform lidar data helps to detect vegetation types and states [170].

Eigenvalue-Based Features
The eigenvalues are calculated based on the covariance matrix of x, y, and z dimensions of the 3D point cloud as λ 1 , λ 2 , and λ 3 . Eigenvalues as features help detect geometrical parameters, such as plane, edge, and corner [171]. The following structure features have been applied to lidar data: omnivariance, anisotropy, planarity, sphericity, linearity, and eigenentropy for features for context-driven target detection [172] building detection [171]. Some of them are shown in Figure 5. The planarity feature is proven relevant for road classification or other flat surfaces and sphericity for building and natural ground (low vegetation) detection [136]. composition function have been tested for urban mapping purposes [47,136,160,161]. They include the waveform amplitude, (normalized) number of echoes, their width (Gaussian standard deviation), the difference between the first and the last return, echo shape, and echo cross-section. The latter provides high values for buildings, medium values for vegetation, and low values for roads [137]. For building facades and vegetation that meet multiple echoes, the normalized number of echoes feature is, therefore, relevant [137]. Jutzi and Stilla [162] extracted linear features on roofs based on full-waveform data. Chehata et al. [136] provided that by adding echo width as a feature, the classification results improved for low vegetation. Echo shape was investigated by [137,163], providing low values to roofs and high values to vegetation. It has been proven that the waveform geometry helps to differentiate between trees and built-up areas [136,156,164], determine tree species [165,166], and segment lidar point clouds in an urban area [167]. The waveform amplitude depends on the target. High amplitudes were observed by Chehata et al. [136] for rooftops, gravel, cars, bare soil, and grass, and low amplitudes for asphalt, tar street, and water. Mallet and Bretar [156] observed high amplitudes for grass and bare earth and found that the spread in the pulse and low amplitudes can be assigned to flat surfaces by increasing the incident angle. The echo waveform classification has been applied by Lin and Mills [168] and Doneus et al. [169]. The terrain echoes were separated from echoes from bushes and low vegetation. The echo pulse is wider on the canopy surface and plowed field than on the meadow and street [156]. High point density in fullwaveform lidar data helps to detect vegetation types and states [170].

Eigenvalue-Based Features
The eigenvalues are calculated based on the covariance matrix of x, y, and z dimensions of the 3D point cloud as λ1, λ2, and λ3. Eigenvalues as features help detect geometrical parameters, such as plane, edge, and corner [171]. The following structure features have been applied to lidar data: omnivariance, anisotropy, planarity, sphericity, linearity, and eigenentropy for features for context-driven target detection [172] building detection [171]. Some of them are shown in Figure 5. The planarity feature is proven relevant for road classification or other flat surfaces and sphericity for building and natural ground (low vegetation) detection [136].

Textural Features
Besides spectral information of hyperspectral sensors, pixel-wise spatial features are relevant for image content, such as textural features. The textural attributes in a hyperspectral scene can be extracted by the local binary patterns (LBP) operator proposed by [173], providing information about the surface granularity [174]. To include spatial information in the classification purposes, the textural operators are window based. Peng et al. [175] extracted them as rotation-invariant features for urban classification purposes except for spectral features and Gabor features [176]. The latter are frequential filters interpreting the texture of the hyperspectral bands used by [177,178]. The texture can be analyzed by applying the gray-level co-occurrence matrix (GLCM) measures [53,179]. GLCM measures, first proposed by Haralick et al. [180], consist of energy, contrast, correlation, entropy, and homogeneity. GLCM dissimilarity, entropy, homogeneity, and second-moment help to detect building edges and height differences. However, contrast, correlation, and variance do not improve building classification and temporal change [53]. Texture features have been used to classify urban materials for pattern recognition in lidar, satellite, and airborne data [48,[181][182][183][184]. Samadzadegan et al. [48] calculated four measures: mean, entropy, standard deviation, and homogeneity to classify trees, buildings, and grounds. Huang et al. [181] applied, except for homogeneity and entropy, the angular second moment and dissimilarity from the DSM in the classification approach.

Morphological Features
Mathematical morphology contains operators such as erosion, dilation, opening, closing, rank filters, top hat, and other derived transforms. Mainly, these operators are applied on panchromatic images from hyperspectral sensors, binary or greyscale images with isotropic and geodesic metrics with a structural element [185]. For example, the opening operator focuses on the bright spots, removing objects smaller than the structural element, whereas the closing operator acts on the dark objects ( Figure 6). Morphological features with a structural element contain information about the minimum size of the target being investigated [18]. They help reduce shape noise, enhance edges, interpret the texture and extract structures on images regarding their shapes, orientation, and sizes [185][186][187][188]. In image processing, morphological features are based on both spectral and spatial information involving pixels in the neighborhood. They are widely used in hyperspectral image classification [178,[187][188][189][190][191], noise reduction in lidar [192], building detection [193], and HL-Fusion-based classification [18]. It has been proven that the inclusion of morphological features improves the accuracy in differentiation between roads and buildings [8].
applying the gray-level co-occurrence matrix (GLCM) measures [53,179]. GLCM measures, first proposed by Haralick et al. [180], consist of energy, contrast, correlation, entropy, and homogeneity. GLCM dissimilarity, entropy, homogeneity, and second-moment help to detect building edges and height differences. However, contrast, correlation, and variance do not improve building classification and temporal change [53]. Texture features have been used to classify urban materials for pattern recognition in lidar, satellite, and airborne data [48,[181][182][183][184]. Samadzadegan et al. [48] calculated four measures: mean, entropy, standard deviation, and homogeneity to classify trees, buildings, and grounds. Huang et al. [181] applied, except for homogeneity and entropy, the angular second moment and dissimilarity from the DSM in the classification approach.

Morphological Features
Mathematical morphology contains operators such as erosion, dilation, opening, closing, rank filters, top hat, and other derived transforms. Mainly, these operators are applied on panchromatic images from hyperspectral sensors, binary or greyscale images with isotropic and geodesic metrics with a structural element [185]. For example, the opening operator focuses on the bright spots, removing objects smaller than the structural element, whereas the closing operator acts on the dark objects ( Figure 6). Morphological features with a structural element contain information about the minimum size of the target being investigated [18]. They help reduce shape noise, enhance edges, interpret the texture and extract structures on images regarding their shapes, orientation, and sizes [185][186][187][188]. In image processing, morphological features are based on both spectral and spatial information involving pixels in the neighborhood. They are widely used in hyperspectral image classification [178,[187][188][189][190][191], noise reduction in lidar [192], building detection [193], and HL-Fusion-based classification [18]. It has been proven that the inclusion of morphological features improves the accuracy in differentiation between roads and buildings [8].

Hyperspectral-Lidar Data Fusion
HL-Fusion combines spectral-contextual information obtained by an HS sensor and a lidar scanner's spectral-spatial-geometrical information. Even if the active and passive sensors characterize different physics, their features can be combined from both sensors. Both sensors cover the reflective spectral range intersecting either in the VIS (532 nm) or the SWIR (1064, 1550 nm) wavelength regions. More rarely, multi-spectral lidar systems are used, which overlap in several of the three common wavelengths, allowing the identification of materials or objects using spectral properties [194]. Under laboratory conditions, prototypical hyperspectral lidar systems are being developed [69,195,196]. The combination of HS and lidar sensors significantly impacts remote sensing, opening up possibilities for fully three-dimensional target analysis [196]. Examples include civil engineering, historical preservation, geomorphological studies, and material processing. However, it is not only the classification concerning 3D geometry determined by sensor fusion. Most rely on geometric simplification of high-dimensional data, reducing both lidar data and HS data to 2.5 grids, where geometrically aligned lidar and HS data are classified based on raster data.
HL-Fusion is usually conducted by adjusting the spatial resolution of one sensor to another (HL to lidar), empirically correcting for geometric errors. Such fusion does not consider the different sensor characteristics (e.g., scan, view, or incidence angles). This kind of fusion also fails when the scene has low-contrast areas, as it is very sensitive to illumination, losing information about details important in complex and heterogeneous urban environments. Despite the dimensional degradation, HL-Fusion has great potential for achieving enhanced results in land cover classification rather than using single sensors, especially when combining spectral and spatial features. In the last decade, fusion has been attempted in this way, for example, by adding to the spectral features extracted from HS data, elevation information, intensity, and other lidar-derived features, which allowed one to upgrade the level of the classification from pixel-to object-based analyses.
Spectral-spatial-based classification on fused data often improves the certainty of a pixel's belonging to a class. On the other hand, an increasing number of features extracted for classification purposes from different sensors can lead to a curse of dimensionality, especially when the training data are limited [197]. HL-Fusion can also be performed physically, taking into account sensor parameters, measuring principles, quantities, illumination sources, the position of the sensors, and attitude in the preprocessing phase [198,199]. Intensity values can describe the physical link between the spectral and spatial responses of both sensors' overlapping wavelengths [199]. However, single studies provided HL-Fusion based on fitting spectral data to the first return from lidar data, thus preserving full 3D geometry and structure, improving the scale of analysis and its performance and robustness [200].

Classification of Urban Land Cover Classes
Urban land cover classification based on remote sensing data has been carried out on a pixel or object-based classification. Pixel-by-pixel analysis assigns only one of the defined classes to each pixel without considering neighboring pixel decisions [201]. In remote sensing, pixel-based classification relies on the spectral properties of each pixel from the scene. However, pixel-based approaches for high-dimensional remotely sensed HS and lidar data were assumed to be inaccurate for reliable classification purposes [202,203]. Therefore, object-based classification has become relevant, reconstructing reality more truthfully, managing fine spatial resolution data, and suppressing noise. Object-based methods include spatial, textural, contextual, topological, and spectral information [204,205], where objects are defined as classification units [43]. Moreover, the object-based analysis consists of image segmentation, grouping spectrally homogeneous regions, and classification, assigning the segments to the corresponding classes with various properties [206]. Both pixel and object-based classification can be driven in the unsupervised, for example, deep belief networks (DBN) [207][208][209], and stacked autoencoder (SA) [41,[210][211][212]) or supervised (RF, SVM) matter.
Analyses on the unsupervised basis separate classification units relying on their common features without providing reference data. This kind of classification is helpful if the knowledge about the study area is limited. In addition, unique classes can be recognized that may have been overlooked applying supervised classification. However, the control over the generated classes is limited, or the final results do not present the analytics intentions, for example, if the desired class is not directly correlated. Supervised classification identifies unknown pixels/objects, validating the accuracy by reference classes assigned to known pixels/objects [213]. One of the advantages of using supervised classification is controlling the number and name of the class labels, which are then assigned to the classification units in the final step [214]. However, supervised classification requires human expertise and the preparation of such reference or ground-truth data adequate for selected area and classification purposes. Such ground-truth sampling includes the removal of outliers and remains representative samples for overall input [215]. This can be accomplished by applying active learning [208,216], random sampling, or stratified random sampling [217].
The ground-truth labeling often requires an equal number of instances assigned to a class. Therefore, a class imbalance issue leads in (multiclass) HS classification to decreased accuracy of many standard algorithms such as decision trees, k-nears neighbor, neural networks, and SVM [218]. Especially for high-dimensional data (HS) and ML/DL-based multiclass problems, the minority classes are neglected or misclassified [219]. Various strategies can be applied to overcome imbalance class issues partially: simplification of the network architecture [38], data augmentation for minority classes, and random sampling for equal class distribution [220].
Complex urban land cover mapping is mainly based on spectral and spatial features of remote sensing data, implemented in classification algorithms. Such an analysis is mainly limited to comparing classification approaches, a general classification scheme, or a small data set, which provides high-accuracy results on local space, excluding generalization and transferability aspects [221]. Often, the evaluation of the classification approach is complicated since the training data may not be representative enough for independent testing data set. In addition, urban land cover analysis usually depends on human expertise at a local scale [84].
Various ML and DL algorithms have been recently explored in feature extraction, pattern recognition, and image classification to deal with high-dimensional space [49,88]. Feature extraction in remote sensing analysis contains mainly shallow supervised and unsupervised and deep feature extraction [222]. In HS data, spectral feature extraction is applied to reduce the high dimensionality and to avoid redundant bands preserving only relevant spectral information. This strategy can also help in increasing separability between different classes. However, spatial feature extraction (texture and morphology) finds the contextual relationship of adjacent pixels improving the only spectral-based classification [132,133,179,209]. In DL, automated extraction of features is common and outperforms shallow ML if the training data fed to the algorithm are not limited.
Aiming to analyze the complexity and improve the DL algorithm learning process quality, a thorough understanding is required of the filter function in the DL architecture [223]. One way to do this is to visualize the parameters of the entire algorithm architecture. However, studies on urban land cover classification based on HS and lidar rarely focus on explaining how the DL algorithms work. As the limited amount of highdimensional remotely sensed data is fed as input to DL classifiers, there is a probability that the hyperparameter tuning causes overfitting. To avoid this issue, e.g., data augmentation, adding noise, model regularization methods (max-pooling and dropout [224]), and simplifying the model are used. Data augmentation helps diversify training data without new labeling costs, thus leading to more robust classification and adequate classification. In remotely sensed-based classification, training data have been flipped and rotated [225,226], mirrored across horizontal, vertical, and diagonal axes on HS [226,227] and lidar data [228], mixup strategy [229], and generation of virtual training samples through Generative Adversarial Networks (GANs) [230] on HS data. In addition, noise is proven to be suited as a data augmentation type. Haut et al. [231] added random occlusion data augmentation (rectangular figures of different sizes) in various HS image patches. Many studies applied Gaussian white noise during simulation to improve the robustness of the classification and reduce the model's dependence on local attributes in HL-Fusion [80] and HS data [232].
Apart from overfitting issues, the time-expensive DL algorithms deal with vanishing gradient problems where the learning is unstable and saturates the activations [233]. This problem can be solved by implementing data normalization between each network layer (e.g., local response normalization [234], batch normalization [235], and layer normalization [236]), choosing proper optimizers and nonlinear activation functions [45].
The following section describes the most common ML and DL algorithms for the classification in the urban environment, such as SVM, RF, CNN (convolutional neural network), and RNN (recurrent neural network) (Table 1). Nevertheless, there are many more ML and DL classification algorithms that are not included in this review. Starting with ML algorithms, over time and with technology development, they have become more advanced. Urban analysis with conventional learning-based classifiers was based on interpreting handcrafted low-level features, linear classifiers and nonlinear classifiers, and binary and multiclass classification [88]. Examples are statistical learning on HS data [237], logistic regression on HS data [133], and maximum likelihood classification on lidar data [146]. However, the DL algorithms evolve in classifying urban objects on a larger scale, automatically extracting high-level features. In addition, DL can handle the issue of the complex spatial distribution of spectral information. Automatically derived features in DL rely on a mathematical basis, tuning the model by changing the parameters and neglecting its standard implementations the physical aspect of remote sensing data. In addition to CNNs and RNNs, which have been included in this article, many different DL network frameworks show promising potential for further analysis and a deeper understanding of DL, primarily for HS data. Some of these algorithms are DBN [207][208][209] with SA [41,[210][211][212] and GAN [35]. However, these algorithms are in the initial phase of implementation and were not applied until 2019 to HL-Fusion data. Integration of spatial and spectral features (contextual SVM) [242]

Support Vector Machines (SVM)
SVM is a supervised ML algorithm that performs the classification of locating a hyperplane between two classes [241]. Such a hyperplane separates two groups in the training dataset, finding the largest margin between the support vectors from different groups [271]. The SVM approach is widely used in pattern recognition, regression, and solving linear equations [271]. It has been proven to be a classifier that can handle the high-dimensional HS data being insensitive to noisy samples [272][273][274][275][276]. Moreover, SVM can deal with smaller training datasets more efficiently than artificial neural networks and maximum likelihood classification algorithms [53]. The decision function of the SVM can be specified by different kernels such as radial basis function (RBF), spectral-based [277], and Gaussian function [19], which classify only in the spectral domain, and composite kernels that include contextual information to the classification [241,278]. The kernel-based methods define the segments by applying the nonlinear geometrical separators [272]. The spectral-based kernel uses the spectral angle of the support vectors to define the hyperplane between them, while for each pixel, spatial information is derived and combined with spectral features in kernel composition. Deep SVM has been implemented with exponential radial basis function, gaussian radial basis function, and neural and polynomial kernel functions, achieving better robustness than conventional classifiers [279].

Buildings
In recent years, a multiwavelength lidar scanner has become an interesting mapping device that can differentiate objects with the same height, such as buildings and trees, based on pseudonormalized difference vegetation index (pseudoNDVI) [68] and geometrical features, e.g., roughness (curvature) [120]. Teo and Wu [15] provided a case study where curvature, intensity, and nDSM were used on multispectral lidar. They applied these lidar features as input for image clustering and found that especially geometric features are suitable for building detection. Huo et al. [8] applied the SVM algorithm with RBF kernel on multispectral lidar data. In the paper, the authors focused, among other things, on building extraction using the combination of nDSM, morphological profiles, novel hierarchical morphological profiles (HMP) [186], pseudoNDVI, and intensity values. Intensity values only extracted from lidar can lead to misclassification of building asphalt roof (parking lot) and a road with similar spectral properties. Shirowzhan and Trinder [53] provided the SVM classification method for building extraction, including DSM, nDSM, and intensity map. However, the results provided a misclassification between roads and buildings in the hilly or vegetation-rich area. A pixel-based classification method is often not able to separate buildings and vegetation boundaries. Samadzadegan et al. [48] proposed a multiclass SVM on building extraction. The authors used first-and last-pulse intensities, first-and lastpulse ranges, entropy, standard deviation, homogeneity, and other geometrical features and showed that texture features improve the final results for building detection. In building analysis based on HS data, the spectral classifie's domain has limitations in the classification of building roofs (roofing tiles, bitumen, concrete, fiber cement, metals, and slates) [97,238]. To overcome the limitations of single sensor applications, HL-Fusion can complete robust building analysis using spatio-spectral-elevation information. Spectral features from HS data can exclude vegetation growing around and on buildings and differentiate between roof materials. By contrast, lidar data provide shape information that can help determine roof types and building types.

Vegetation
SVM classifier is a standard algorithm in vegetation detection in the urban environment. The authors of [48] suggested a multiclass framework for lidar data, analyzing the normalized difference between the first and the last laser pulse. High vegetation class (trees) was falsely classified as buildings due to limited training data. Teo et al. [120] stated that lidar penetration improves the overall accuracy of vegetation analysis. However, by splitting vegetation into high and low vegetation, lidar data cannot distinguish low height classes such as roads and grass. Huo et al. [8] applied SVM on multispectral lidar data calculating the NDVI and pseudoNDVI [178] and improving the overall classification accuracy, however, having challenges in distinguishing between low and high vegetation. Wang et al. [67] addressed a similar problem in the study and compared single-and dualwavelength lidar by applying, among others, full-waveform data that were not included in previous studies. The authors showed that dual-wavelength improves the accuracy of low and high vegetation and bare soil and low vegetation compared to single-wavelength lidar. In HS analysis, spectral features are still more accurate in chlorophyll detection than lidar, mainly when a class is characterized by low material variations [40,239]. In addition, HS has been proven to characterize fraction coverage of photosynthetic vegetation, nonphotosynthetic vegetation, and soil [72]. Furthermore, by adding spatial features to the hyperspectral analysis, vegetation detection becomes facilitated [241,242]. Spatial information is also used in HL-Fusion in object-based classification, being able to classify different types of vegetation (tree species) [32,76] and also, in the case of generating hyperspectral point clouds, maintain higher reality factors such as full 3D geometry, generic and robust characteristics [200].

Roads
Huo et al. [8] and Teo et al. [120] applied SVM on multispectral lidar data to detect roads. Achieving high accuracy classification, Huo et al. [8] referred to the misclassification of roads as lawn and bare soil, which can be easily solved by adding HS to the lidar data due to access to more detailed spectral information than lidar only. One of the causes can be similar spectral signatures and insufficient distinctive spectral-spatial features to differentiate between objects. Teo et al. [120] mentioned classification issues applying geometrical features among grass, road, and soil due to similar height. However, spectral features from multiwavelength lidar can overcome the challenge.
In contrast, spectral features in HS analysis applying SVM are often insufficient for achieving robust and accurate results of road classification [40,239]. This is due to considering only spectral information without contextual information and remarkable spectral similarity between physical material belonging to different classes. SVM has also been widely used in road classification on fused HS and lidar data. Brell et al. [200] generated an HS point cloud, where they classified different road materials such as concrete and asphalt. The challenge in distinguishing concrete and asphalt is the influence of shadow deteriorating discrimination between different road materials. The spectral properties of those materials can vary locally based on aging, deterioration, contamination, roughness properties, and other conditions [200].

Random Forest (RF)
RF is a nonparametric ensemble learning algorithm based on a combination of binary decision tree classifiers [280]. A decision tree in the ensemble is independent of other trees and is trained with random variables by bootstrap sampling [77]. For classification purposes, each tree gives a class prediction as an output. The class that most trees have chosen is considered to be the final result [281]. RF has become a widely used classification algorithm in HS imaging due to its high accuracy and high processing speed [282]. Moreover, RF can handle high-dimensional data selecting redundant spectral bands without overfitting [18,77]. RF has also been applied to airborne-based lidar data as a classifier solving multiclass problems and selecting the essential features for urban mapping [136].

Buildings
Niemeyer et al. [283] proposed a new building classification method based on the 3D point cloud from lidar data. The classification technique transforms the RF classifier into a conditional random field (CRF) framework [218] and provides high-accuracy results for large buildings over 50 m 2 . However, misclassification occurs at building facades and dormers. In addition, various features derived by lidar have been tested by Chehata et al. [136].
In the paper, multiecho, full-waveform, different height-based, local plane-based, and eigenvalue-based features have been applied to classify buildings. However, confusion errors occurred for transition points between buildings and the ground class.
Further, echo-based features did not have any influence on classification results. Debes et al. [18] presented a fusion framework consisting of unsupervised classification that supports the supervised classification on ensemble learning. They showed that lidar elevation information is required to differentiate between buildings and vegetation or different building types in addition to HS spectral data.

Vegetation
Niemeyer et al. [283] applied an RF classification framework with conditional random fields on lidar data to discriminate vegetation and buildings from each other. Chehata et al. [136] applied RF on lidar data experiencing issues in the classification between vegetation and artificial (roads) and natural ground (grass), respectively. However, applying intensity, height, and texture features of multispectral lidar is very promising for ground-level classes, for example, low vegetation [245]. In HS analysis, spectral features fed to RF classifier provide high vegetation accuracy, good robustness, and insensitivity to noise [244]. Debes et al. [18] chose an RF algorithm on HL-Fusion with elevation features from lidar and NDVI from HS data that outperformed urban area classification [18].

Roads
Niemeyer et al. [283] proposed an RF classification framework for lidar data described in Section 3.1, where one of the classes was asphalt considered a road. However, other objects apart from roads are also made of asphalt, such as roof parking lots, making the analysis difficult, e.g., using only HS data. Jackson et al. [284] mentioned this issue clarifying that the road class pixels are contaminated by other materials and objects such as gravel, puddles, and cars. In addition, vehicles appearing in the image usually have highly reflective properties, making road classification difficult for RF classifiers [8]. Recently, lidar point cloud intensity data have been proposed for road landmark inventory with active learning [285].

Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN)
CNN is a DL algorithm that has become an important HS, lidar, and HL-Fusion classification method. The network's deep convolutional architecture can effectively deal with complex remote sensing data solving nonlinear issues [286] with an example architecture in Figure 7.

Vegetation
Niemeyer et al. [283] applied an RF classification framework with conditional random fields on lidar data to discriminate vegetation and buildings from each other. Chehata et al. [136] applied RF on lidar data experiencing issues in the classification between vegetation and artificial (roads) and natural ground (grass), respectively. However, applying intensity, height, and texture features of multispectral lidar is very promising for ground-level classes, for example, low vegetation [245]. In HS analysis, spectral features fed to RF classifier provide high vegetation accuracy, good robustness, and insensitivity to noise [244]. Debes et al. [18] chose an RF algorithm on HL-Fusion with elevation features from lidar and NDVI from HS data that outperformed urban area classification [18].

Roads
Niemeyer et al. [283] proposed an RF classification framework for lidar data described in Section 3.1, where one of the classes was asphalt considered a road. However, other objects apart from roads are also made of asphalt, such as roof parking lots, making the analysis difficult, e.g., using only HS data. Jackson et al. [284] mentioned this issue clarifying that the road class pixels are contaminated by other materials and objects such as gravel, puddles, and cars. In addition, vehicles appearing in the image usually have highly reflective properties, making road classification difficult for RF classifiers [8]. Recently, lidar point cloud intensity data have been proposed for road landmark inventory with active learning [285].

Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN)
CNN is a DL algorithm that has become an important HS, lidar, and HL-Fusion classification method. The network's deep convolutional architecture can effectively deal with complex remote sensing data solving nonlinear issues [286] with an example architecture in Figure 7. CNN has two characteristics different from other DL algorithms, such as local connections and share weights. Local connections help find the data's spatial relationship, and share weights reduce the number of parameters needed for training purposes and generate a filter [16,286]. CNN architectures can be trained in an unsupervised or supervised way. The unsupervised method is the greedy layer-wise pretraining of hyperspectral data [287][288][289][290]. The supervised method is the standard backpropagation [234,286,291,292]. However, CNNs require a high number of model parameters. The high dimensionality and limited training samples of the remotely sensed data can lead to over- CNN has two characteristics different from other DL algorithms, such as local connections and share weights. Local connections help find the data's spatial relationship, and share weights reduce the number of parameters needed for training purposes and generate a filter [16,286]. CNN architectures can be trained in an unsupervised or supervised way. The unsupervised method is the greedy layer-wise pretraining of hyperspectral data [287][288][289][290]. The supervised method is the standard backpropagation [234,286,291,292]. However, CNNs require a high number of model parameters. The high dimensionality and limited training samples of the remotely sensed data can lead to overfitting and longer processing time than other classification techniques [37,39,293]. An advantage of applying CNN is that the input data must not be preprocessed. CNN is capable of automatically learning abstract features and detecting high-level objects [54]. CNN is used for land cover mapping based on HS data [35,[37][38][39] and HL-Fusion [16,30,264,293]. Moreover, CNN has been recently used in combination with other algorithms such as MRF [37] to extract HS pixel vectors on a spatial and spectral basis and extinction profiles [30], capable of effectively reducing noise and improving classification accuracy.
RNN is a DL algorithm widely used to work on sequential data [37]. RNN is a compound of successive recurrent layers capable of extracting contextual parameters at consecutive time steps (Figure 8) [266]. An advantage of RNN is that the input sequences may have different lengths [266]. However, RNN requires a longer processing time than standard ML algorithms, such as SVM or RF [37]. RNN has already been used to extract contextual information of HS data [37,266] and recognize temporal changes of objects measured by the lidar scanner [294].
Remote Sens. 2021, 13, 3393 24 of 41 fitting and longer processing time than other classification techniques [37,39,293]. An advantage of applying CNN is that the input data must not be preprocessed. CNN is capable of automatically learning abstract features and detecting high-level objects [54]. CNN is used for land cover mapping based on HS data [35,[37][38][39] and HL-Fusion [16,30,265,293]. Moreover, CNN has been recently used in combination with other algorithms such as MRF [37] to extract HS pixel vectors on a spatial and spectral basis and extinction profiles [30], capable of effectively reducing noise and improving classification accuracy. RNN is a DL algorithm widely used to work on sequential data [37]. RNN is a compound of successive recurrent layers capable of extracting contextual parameters at consecutive time steps (Figure 8) [266]. An advantage of RNN is that the input sequences may have different lengths [266]. However, RNN requires a longer processing time than standard ML algorithms, such as SVM or RF [37]. RNN has already been used to extract contextual information of HS data [37,266] and recognize temporal changes of objects measured by the lidar scanner [294].

Buildings
Zhou and Gong [54] focused on building detection in different conditions in damage assessment. Their approach relies on roof object extraction, challenging for lidar data due to sparse points in the boundaries between rooftops and the vertical facades of buildings damaged. In addition, very complex roof configurations cannot be distinguished using lidar data when the training data are too homogeneous. However, DL algorithms such as CNN provided accurate classification results of pre-and post-disaster data with minimal required preprocessing of lidar data and time consumption. Shahajan et al. [55] provided a DL approach that extracts the lidar points from different views assigned to roofs applying height-derived features. However, like the previous study, height-derived features are insufficient in roof type differentiation, and the CNN algorithm requires a large training data set. CNN classifier has also been used in HS data analysis due to its relevant spectralspatial domain [40]. However, CNN on HS data is time-consuming even if the preprocessing of the fed input is minimized, requires a more extensive data set than shallow ML classifiers, and is not transferable with the same model parameters to other independent test data. Li et al. [30] proposed a DL framework based on spatial and elevation features extracted from extinction profiles, spectral, spatial, and elevation features extracted from the CNN model to classify buildings, among others, on HL-Fusion. Extinction profiles were also used to derive spectral, spatial, and elevation features from HS and lidar data. These features were applied as input for CNN classification on buildings [16]. Morchhale et al. [265] have proven that CNN-based classification on HL-Fusion can distinguish between commercial buildings and highways and between residential buildings and parking lots, improving generalization capability.
Wu et al. [266] introduced deep RNN for HS data classification combining with CNN and creating a convolutional recurrent neural network (CRNN). This framework enabled

Buildings
Zhou and Gong [54] focused on building detection in different conditions in damage assessment. Their approach relies on roof object extraction, challenging for lidar data due to sparse points in the boundaries between rooftops and the vertical facades of buildings damaged. In addition, very complex roof configurations cannot be distinguished using lidar data when the training data are too homogeneous. However, DL algorithms such as CNN provided accurate classification results of pre-and post-disaster data with minimal required preprocessing of lidar data and time consumption. Shahajan et al. [55] provided a DL approach that extracts the lidar points from different views assigned to roofs applying height-derived features. However, like the previous study, height-derived features are insufficient in roof type differentiation, and the CNN algorithm requires a large training data set. CNN classifier has also been used in HS data analysis due to its relevant spectral-spatial domain [40]. However, CNN on HS data is time-consuming even if the preprocessing of the fed input is minimized, requires a more extensive data set than shallow ML classifiers, and is not transferable with the same model parameters to other independent test data. Li et al. [30] proposed a DL framework based on spatial and elevation features extracted from extinction profiles, spectral, spatial, and elevation features extracted from the CNN model to classify buildings, among others, on HL-Fusion. Extinction profiles were also used to derive spectral, spatial, and elevation features from HS and lidar data. These features were applied as input for CNN classification on buildings [16]. Morchhale et al. [264] have proven that CNN-based classification on HL-Fusion can distinguish between commercial buildings and highways and between residential buildings and parking lots, improving generalization capability.
Wu et al. [266] introduced deep RNN for HS data classification combining with CNN and creating a convolutional recurrent neural network (CRNN). This framework enabled the extraction of hidden feature representations and provided highly accurate results for building detection. For HS image classification, Mou et al. [37] provided an RNN framework with a GRU activation function that maintains a constant error, helping the network learn more effectively in a high-dimensional space. As a result, his classifier achieved very high accuracy in recognizing commercial and residential areas in the urban environment. Even though only spectral features without contextual information were considered, RNN outperformed standard ML algorithms and CNNs.

Vegetation
CNN is used in HS analysis for vegetation detection [36,39,42,225,239,[246][247][248]. Li et al. [246] simultaneously extracted spectro-spatial features of HS data benefitting structural properties needed for detailed vegetation interpretation. However, more and more algorithms for vegetation classification are based on HL-Fusion data. Ghamisi et al., Morchhale et al.,and Li et al. [16,30,264] proposed different frameworks based on CNNs. Chen et al. [265] created a CNN framework used to extract the spectral-spatial features of HS data and the elevation features of lidar data. He applied a fully connected DNN to fuse the derived features from both sensors, ending the classification approach with the logistic regression to generate the final classification map [265]. Deep RNN introduced by Mou et al. [37] has been used for vegetation classification. Although RNN resulted in high overall accuracy, the most significant challenges occurred in classifying different grass class types, such as healthy grass, stressed grass, and synthetic grass [37].

Roads
CNN algorithms have already been widely applied as an initial framework for road classification as objects or materials, e.g., gravel, concrete, and asphalt. Santara et al. [38] compared different ML and DL algorithms, including the CNN framework. CNN classified roads as asphalt and gravel with high accuracy only on HS data. Recently, much more often, CNN is used as a classifier for HL-Fusion. Morchhale et al. [264] compared CNN on HS data and HL-Fusion. The classification and differentiation accuracy between road, parking lot, and highway increased in the HL-Fusion. Li et al. [28] proposed that he focuses on classification challenges between similar spectral characteristics of road materials, e.g., asphalt and concrete, and the similar height of different objects, such as grass and asphalt road. Ghamisi et al. [25] applied the CNN classifier with logistic regression and mentioned the challenge of similar spectral signatures of roofs and roads for HS data classification. Wu et al. [266] and Yang et al. [39] proposed CRNNs for HS image classification. Mou et al. [37] presented a different framework-the deep RNN. In both network frameworks, the road was grouped into road and highway. The deep RNN outperformed other conventional classifiers in differentiating similar objects, such as road, highway, and railway [37].

Discussion
Airborne HS and lidar data-based classification in the urban environment over the last 20 years has increased significantly since 2016, as shown in the annual number of articles reviewed in this paper found up to 2021 ( Figure 9). Therefore, it can be assumed that the interest in HS and lidar remote sensing, advances in sensor technology, computing power, and easy access to remote sensing-based datasets are relevant factors paving the way for large-scale urban environment analysis. However, it has to be noted that the HS-based land cover classification far exceeds lidar and HL-Fusion analyses. Since 2016, the scientific production of urban classification methods based on ML and DL has significantly increased for HS, lidar, and HL-Fusion. This is due to the availability of more advanced computer infrastructures, less expensive sensors with higher resolution, and more accessible data for HS and lidar.
Nevertheless, the HS continued to be widely used. Firstly, this may have been due to the lack of data of the same study area from the two sensors acquired simultaneously. Secondly, most land cover classification approaches are based on physical material classification, which relies significantly on spectral analysis. Sometimes, therefore, it is not necessary to fuse two sensors for some purposes to improve classification by a small fraction with much more effort and time spent fusing the sensors. However, assuming that urban analysis is a highly complex task, one of the fusion application arguments may be that HS and lidar complement each other in spectral and spatial analysis with the addition of elevation information and active and passive sensor characteristics. Nevertheless, the HS continued to be widely used. Firstly, this may have been due to the lack of data of the same study area from the two sensors acquired simultaneously. Secondly, most land cover classification approaches are based on physical material classification, which relies significantly on spectral analysis. Sometimes, therefore, it is not necessary to fuse two sensors for some purposes to improve classification by a small fraction with much more effort and time spent fusing the sensors. However, assuming that urban analysis is a highly complex task, one of the fusion application arguments may be that HS and lidar complement each other in spectral and spatial analysis with the addition of elevation information and active and passive sensor characteristics.

HS-Based Classification
SVM and RF have been proven to be insensitive to noise in HS-based analysis, providing very high accuracy for material classification with limited training data. The only spectral-based SVM classifier can quickly identify object-based classification classes with low material variations. However, for complex urban land cover classification, the spectral domain of the SVM is not sufficient to capture the heterogeneity of the objects or land cover classes built from various materials, for example, identification of impervious and nonvegetated pervious surfaces [238]. For such an analysis, contextual information is necessary. The spatial features can be added to the SVM classifier by applying the composite kernel, improving the accuracy and generalization capabilities.
RF also applies spectral-domain only for HS data. As for SVM, the land cover classes having high material variations within a class are often misclassified, such as road materials (concrete, asphalt, and gravel). The difference and advantage over the SVM classifier is the capability to select important features. This aspect is also advantageous for the DLbased classifiers since shallow ML-based algorithms use handcrafted features controlled and transferred to other independent and unknown test areas.
On the other hand, when the classification objective is focused on a smaller study area, automated features of the DL algorithm may prove to be a better solution for highdimensional HS data. One factor is that the relationships between objects or land cover classes are not linear in a complex urban environment. By extracting the handcrafted features, we have control and knowledge about them. In contrast, the automated features can

HS-Based Classification
SVM and RF have been proven to be insensitive to noise in HS-based analysis, providing very high accuracy for material classification with limited training data. The only spectral-based SVM classifier can quickly identify object-based classification classes with low material variations. However, for complex urban land cover classification, the spectral domain of the SVM is not sufficient to capture the heterogeneity of the objects or land cover classes built from various materials, for example, identification of impervious and nonvegetated pervious surfaces [238]. For such an analysis, contextual information is necessary. The spatial features can be added to the SVM classifier by applying the composite kernel, improving the accuracy and generalization capabilities.
RF also applies spectral-domain only for HS data. As for SVM, the land cover classes having high material variations within a class are often misclassified, such as road materials (concrete, asphalt, and gravel). The difference and advantage over the SVM classifier is the capability to select important features. This aspect is also advantageous for the DL-based classifiers since shallow ML-based algorithms use handcrafted features controlled and transferred to other independent and unknown test areas.
On the other hand, when the classification objective is focused on a smaller study area, automated features of the DL algorithm may prove to be a better solution for highdimensional HS data. One factor is that the relationships between objects or land cover classes are not linear in a complex urban environment. By extracting the handcrafted features, we have control and knowledge about them. In contrast, the automated features can obtain high-level features that may allow a much better classification result by recognizing complex relationships that cannot be analyzed by applying shallow ML at the expense of generalizability and transferability. The advantage of the CNN is in its spectral-spatial domain, which searches for high-level features, e.g., by simultaneously extracting spectral and spatial features. As features in CNN are retrieved during the algorithm and the the original, however, normalized, HS data are fed into the model. This saves time for preprocessing, which is necessary for SVM or RF classifiers.
On the other hand, normalization of the extraction of high-level features is notably more time-consuming than classification with shallow ML algorithms. In totally, the increase in dimensions is an enormous challenge in DL classification. The hidden complex relationships are not universally and globally representative. The easiest way to influence this in DL is to ensure the global representativeness of the training and test data which may be hardly possible in remote sensing. Another way is the support of handcrafted features that underrepresent the local properties. In addition, assuming that the algorithm extracts the most important features for correct classification, these features may vary depending on the complexity and diversity of the training data. Therefore, the transferability and generalizability of a model, which is critical, e.g., automatic map updating, is limited. In addition, DL algorithms require a larger training data set, which may not be feasible due to a lack of data and computationally expensive DL algorithms, such as CNN and RNN. The RNN has been proven to outperform even in the spectral domain only compared to standard ML algorithms and CNNs [37]. However, for single classes such as asphalt or concrete-made objects (roads, parking lots, highway), RNN may not solve misclassification only in the spectral domain. The RNN requires more computative time than the CNN, and an extensive training data set is needed. Since RNN considers the temporal domain, this classifier shows greater robustness, transferability, and generalization.

Lidar-Based Classification
Lidar-based urban land cover classification is not a straightforward approach due to the complexity of the urban environment, where different classifiers with different derived features can identify different land cover classes. Nevertheless, the SVM is a common approach for lidar-based classification. In particular, those land cover classes are distinguished by their unique geometry and where the material composition is not essential. For example, building detection requires the capture of complex geometry, including roofs (planar surfaces) and facades (vertical surfaces). For this purpose, full-waveform data and geometric features are commonly used [243]. However, depending on the specific purpose of the building classification, different features may play an important role. For example, in the analysis of various roof types, the focus was on height-derived features that were insufficient when the roof had a very complex geometry [55]. However, the problem may lie in too low resolution, too sparse point density of the lidar system, or the CNN classifier, which needs a much larger training data set considering heterogeneity and complexity of the objects of interest. In addition, the transition from 3D point cloud to 2.5D representation is challenging to preserve inherent point cloud information.
Raster (2.5D) processing is more efficient in data handling as soon as it comes to spectral-spatial neighborhood analyses and is therefore preferred by most classifiers. The SVM is mainly used for building detection, differentiate between low and high vegetation, and distinguish trees and buildings. The differentiation between low and high vegetation is still a problem. It appears that using height-derived features and full-waveform data from single-wavelength lidar is not sufficient. However, using the same features with a dualwavelength lidar scanner significantly improved low and high vegetation classification results. Therefore, it can be concluded that spectral features play a significant role in the detailed classification of land cover classes [67]. This assumption of the importance of spectral features in lidar-based classification applying the SVM was also mentioned in a study where spectral features were more critical than geometrical features in classification on a multispectral lidar scanner [120]. In this study, it was found that geometrical features are not able to detect ground-level classes such as roads and grass, which on the other hand, is possible using the spectral features of the multispectral lidar.
Ground-level classes cause many problems also when using full-waveform data. For example, in one study, incorrect classification of grass and sand was caused because the training data contained no balanced classes in the SVM classification [243]. On the other hand, a similar problem of incorrect RF-based classification (this time of low vegetation and roads) also appeared when applying full-waveform data [136]. Therefore, it can be concluded that the last return from lidar is not sufficient for differentiating between ground-level classes. However, this problem has been solved by applying the RF classifier using multispectral lidar data but adding texture features to the elevation and spectral information [245]. However, neither SVM nor RF on single or multispectral lidar can differentiate very heterogeneous classes, such as asphaltic, concrete, or gravel road. This issue can only be solved by adding hyperspectral information on the material using the little available hyperspectral lidar, including reflected intensity information, or integrating lidar with hyperspectral imaging.

HL-Fusion Classification
HL-Fusion aims to combine the two different sensors with improving the classification result. In urban land cover classification based on HL-Fusion, DL turns out to be the most commonly used method (Table 1). One reason for the DL selection could be the intentional neglect of the more complex preprocessing. Thereby, however, there is a risk of losing transferability and generalizability. This is especially critical for optical data, e.g., if the atmosphere is not corrected according to physical models or the shadow has not been corrected, training data must cover all atmospheric conditions and represent the existing urban heterogeneities. However, both the enormous, rapid development of DL, combined with the progress of sensor technology and multisensory fusion, are becoming an interesting field for further scientific research in the near future. Especially in the analysis of complex urban environments, only a single sensor is usually insufficient for classifying urban land cover classes correctly. Besides material characteristics, spatial correlation is essential and full 3D geometry and topography information. Using context in a more spacious neighborhood for classification purposes, training time increases significantly, especially for DL algorithms (CNN). For shallow ML and DL algorithms, spectral-spatial classification with handcrafted features has been proven to always be more accurate, with the capability of transferability and generalization [16,18,29,264]. HL-Fusion with SVM classifier improved the classification result, but the limited studies did not include the variation between different features derived from HS and lidar. In this case, the application of spectral features from HS and height and derivatives and intensity data proved accurate. Unfortunately, fusing two different sensors also come with some challenges. Adding to the already high-dimensional HS data more dimensions, one can meet the curse of dimensionality problems. By limited training data, high-dimensional feature spaces are often insufficient to recognize desired patterns due to the low ratio of training data to the high dimensional features [295]. More dimensions in source data mean more necessary training and test data due to increased heterogeneity and the number of features, and the need for more computational power and storage.
Although object-based classification is much more comprehensive than pixel-based classification, objects or land cover have become important in classification because they reflect reality much more closely. Spectral features from HS data are reliable for material classification, even in complex urban environments. However, the lack of topographic and geometric information makes accurate results based on only one sensor impossible. Lidar for providing these needed features is very promising in complex urban land cover classification. Lidar complements HS data to add height information to vegetation detection enabling identification of individual trees (full 3D geometry), bushes, low vegetation. In addition, in the detection of the road (edges), lidar provides refined features providing precise boundaries [296]. Thanks to HL-Fusion, there is no need to limit oneself to classify land cover classes, monitor the urbanization processes, and study the urban environment.
The potential capabilities of the two sensors enable urban analysis in a holistic, multi-aspect and multidisciplinary way.

Conclusions and Future Perspectives
ML and DL revolutionize digital processing of remotely sensed data such as HS, lidar, and HL-Fusion. A significant factor influencing such a great advance in technology is the variety of information obtained time-efficiently by remote sensing systems. Both HS and lidar-based data are used for urban analysis by applying ML and DL algorithms. This review provides the latest information on advances in mapping techniques based on HS and lidar data in urban environments based on the reflective spectral range (400-2500 nm). This multidisciplinary research described in this article was intended to summarize urban land cover classification for ML and DL experts and remote sensing specialists. Particular attention should be paid to DL implementations in HL-Fusion, which may be the key to classifying land cover classes in a complex urban environment. DL is a promising tool for extracting spectral-spatial features and more complex features than classical ML algorithms, which usually improves the accuracy of the classification results. One of the main challenges related to DL's use is the need for a globally representative dataset for the model training purposes and the availability of annotated lidar data to make it generalizable and transferable: this might require extensive manual work that can be costly but may be overcome applying data augmentation strategy. The HL-Fusion-based classification opens up a new dimension of urban analysis, approximating ML and DL classification results to the reality and going beyond human expertise to discover and care for the urban environment.
The growing trend of using DL in classification will probably remain unchanged over the next few years, discovering new network algorithms, which are already implemented in single case studies. However, as the technology continues to improve, HL-Fusion, despite its high dimensionality, should be considered in analyzing complex urban environments. Crucial is the transferability and generalization aspect, one of the biggest concerns since DLs are usually valid only locally. Inferring from this, it does not allow, for example, the significant updates of city maps.