Geospatial Computer Vision Based on Multi-Modal Data—How Valuable Is Shape Information for the Extraction of Semantic Information?

Abstract: In this paper, we investigate the value of different modalities and their combination for the analysis of geospatial data of low spatial resolution. For this purpose, we present a framework that allows for the enrichment of geospatial data with additional semantics based on given color information, hyperspectral information, and shape information. While the different types of information are used to define a variety of features, classification based on these features is performed using a random forest classifier. To draw conclusions about the relevance of different modalities and their combination for scene analysis, we present and discuss results which have been achieved with our framework on the MUUFL Gulfport Hyperspectral and LiDAR Airborne Data Set.


Introduction
Geospatial computer vision deals with the acquisition, exploration, and analysis of our natural and/or man-made environments. This is of great importance for many applications such as land cover and land use classification, semantic reconstruction, or abstractions of scenes for city modeling. While different scenes may generally exhibit different levels of complexity, there are also various sensor types that allow for the capture of significantly different scene characteristics. As a result, the captured data may be represented in various forms such as imagery or point clouds, and acquired spatial (i.e., geometric), spectral, and radiometric data might be given at different resolutions. The use of these individual types of geospatial data as well as different combinations is of particular interest for the acquisition and analysis of urban scenes which provide a rich diversity of both natural and man-made objects.
A ground-based acquisition of urban scenes nowadays typically relies on the use of mobile laser scanning (MLS) systems [1–4] or terrestrial laser scanning (TLS) systems [5,6]. While this delivers a dense sampling of object surfaces, achieving a full coverage of the considered scene is challenging, as the acquisition system has to be moved through the scene either continuously (in the case of an MLS system) or with relatively small displacements of a few meters (in the case of a TLS system) to handle otherwise occluded parts of the scene. Consequently, the considered area tends to be rather small, i.e., only street sections or districts are covered. Furthermore, the dense sampling often results in a massive amount of geospatial data that has to be processed. In contrast, the use of airborne platforms equipped with airborne laser scanning (ALS) systems allows for the acquisition of geospatial data corresponding to large areas of many km². However, the sampling of object surfaces is not that dense, with only a few tens of measured points per m². The lower point density in turn makes semantic scene interpretation more challenging as the local geometric structure may be quite similar for different classes. To compensate for the lower spatial resolution, additional devices such as cameras or spectrometers are often involved to capture standard color imagery or hyperspectral imagery in addition to 3D point cloud data. So far, however, the relevance of the different modalities and their combination for scene analysis has not been sufficiently assessed.

Contribution
In this paper, we investigate the value of different modalities of geospatial data acquired from airborne platforms for scene analysis in terms of a semantic labeling with respect to user-defined classes. For this purpose, we use a benchmark dataset for which different types of information are available: color information, hyperspectral information, and shape information. Using these types of information separately and in different combinations, we define feature vectors and classify them with a random forest classifier. This allows us to reason about the relevance of each modality as well as the relevance of multi-modal data for the semantic labeling task. Thereby, we focus in particular on analyzing how valuable shape information is for the extraction of semantic information in challenging scenarios of low point density. Furthermore, we take into account that practical applications, e.g., land cover and land use classification, pose the additional challenge that only very little training data is available to train a classifier. This is due to the fact that expert knowledge might be required for an appropriate labeling (particularly when using hyperspectral data) and the annotation process may hence be costly and time-consuming. To address such issues, we focus on the use of sparse training data of only a few training examples per class. In summary, the key contributions of our work are:

• the robust extraction of semantic information from geospatial data of low spatial resolution;
• the investigation of the relevance of color information, hyperspectral information, and shape information for the extraction of semantic information;
• the investigation of the relevance of multi-modal data comprising hyperspectral information and shape information for the extraction of semantic information; and
• the consideration of a semantic labeling task given only very sparse training data.
A parallelized, but not fully optimized Matlab implementation for the extraction of all presented geometric features is available at http://www.ipf.kit.edu/code.php.
After briefly summarizing related work in Section 1.2, we present our framework for scene analysis based on multi-modal data in Section 2. Subsequently, in Section 3, we demonstrate the performance of our framework by evaluation on a benchmark dataset with a specific focus on how valuable shape information is for the considered classification task. This is followed by a discussion of the derived results in Section 4. Finally, we provide concluding remarks in Section 5.

Related Work
In recent years, the acquisition and analysis of geospatial data has been addressed by numerous investigations. Among a range of addressed research topics, particular attention has been paid to the semantic interpretation of 3D data acquired via MLS, TLS, or ALS systems within urban areas [1–4,6–11], which is an important prerequisite for a variety of high-level tasks such as city modeling and planning. In such a scenario, the acquired 3D data corresponds to a dense sampling of object surfaces preserving many details of the geometric structure. Thus, the main challenges for scene analysis are typically represented by the irregular point sampling and the complexity of the observed scene. Numerous investigations on interpreting such data focused on a data-driven extraction of local neighborhoods as the basis for feature extraction (Section 1.2.1), on the extraction of suitable features (Section 1.2.2), and on the classification process itself (Section 1.2.3).

Neighborhood Selection
When using geometric features for scene analysis, it has to be taken into account that such features are used to describe the local 3D structure and hence are extracted from the local arrangement of 3D points within a local neighborhood. For the latter, either a spherical neighborhood [12,13] or a cylindrical neighborhood [14] is typically selected. Such neighborhoods, in turn, can be parameterized with a single parameter. Assuming that a cylindrical neighborhood is aligned along the vertical direction, it is normally defined by two parameters: radius and height. When analyzing ALS data, however, the height is typically set as infinitely large. The remaining parameter is commonly referred to as the "scale" and in most cases represented by either a radius [12,14] or the number of nearest neighbors that are considered [13]. As a suitable value for the scale parameter may be different for different classes [7], it seems appropriate to involve a data-driven approach to select locally-adaptive neighborhoods of optimal size. Respective approaches for instance rely on the local surface variation [15] or the joint consideration of curvature, point density, and noise of normal estimation [16,17]. Further approaches are represented by dimensionality-based scale selection [18] and eigenentropy-based scale selection [7]. Both of these approaches evaluate a functional, which is defined in analogy to the Shannon entropy, across different values of the scale parameter and select the neighborhood size that delivers the minimum Shannon entropy.
Instead of selecting a single neighborhood as the basis for extracting geometric features [1,7,19], multi-scale neighborhoods may be involved to describe the local 3D geometry at different scales and thus also how the local 3D geometry changes across scales. In this regard, one may use multiple spherical neighborhoods [20], multiple cylindrical neighborhoods [10,21], or a multi-scale voxel representation [5]. Furthermore, multiple neighborhoods could be defined on the basis of different entities, e.g., in the form of both spherical and cylindrical neighborhoods [11], in the form of voxels, blocks, and pillars [22], or in the form of spatial bins, planar segments, and local neighborhoods [23].
In our work, we focus on the use of co-registered shape, color, and hyperspectral information corresponding to a discrete raster. This allows for data representations in the form of a height map, color imagery, and hyperspectral imagery. Consequently, we involve local 3 × 3 image neighborhoods as the basis for extracting 2.5D shape information. For comparison, we also assess the local 3D neighborhood of optimal size for each 3D point individually as the basis for extracting 3D shape information. The use of multiple neighborhoods is not considered in the scope of our preliminary work on geospatial computer vision based on multi-modal data, but it should definitely be the subject of future work.
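The eigenentropy-based selection mentioned above can be illustrated with a minimal sketch. It assumes a brute-force nearest-neighbor search and uses the sample covariance of the neighborhood as the 3D structure tensor; the function names `eigenentropy` and `optimal_k` are hypothetical, and the default bounds of 10 and 100 neighbors follow the values stated later in Section 2.1. This is not the authors' Matlab implementation.

```python
import numpy as np

def eigenentropy(neighbors):
    """Shannon-type entropy of the normalized structure-tensor eigenvalues."""
    cov = np.cov(np.asarray(neighbors, dtype=float).T)  # 3 x 3 structure tensor
    ev = np.linalg.eigvalsh(cov)
    ev = np.clip(ev, 1e-12, None)        # guard against zero or negative round-off
    lam = ev / ev.sum()                  # normalize eigenvalues by their sum
    return float(-(lam * np.log(lam)).sum())

def optimal_k(points, i, k_min=10, k_max=100):
    """Return the number of neighbors in [k_min, k_max] with minimum eigenentropy."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - points[i], axis=1)
    order = np.argsort(dist)             # order[0] is the query point itself
    k_max = min(k_max, len(points) - 1)
    # Each tested neighborhood contains the point and its k nearest neighbors.
    entropies = [eigenentropy(points[order[:k + 1]])
                 for k in range(k_min, k_max + 1)]
    return k_min + int(np.argmin(entropies))
```

For larger point clouds, a spatial index (e.g., a k-d tree) would replace the brute-force distance computation; the selection logic itself stays the same.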

Feature Extraction
Among a variety of handcrafted geometric features that have been presented in different investigations, the local 3D shape features, which are represented by linearity, planarity, sphericity, omnivariance, anisotropy, eigenentropy, sum of eigenvalues, and local surface variation [15,24], are most widely used, since each of them describes a rather intuitive quantity with a single value. However, using only these features is often not sufficient to obtain appropriate classification results (in particular for more complex scenes) and therefore further characteristics of the local 3D structure are encoded with complementary features such as angular statistics [1], height and plane characteristics [9,25], low-level 3D and 2D features [7], or moments and height features [5].
Depending on the system used for data acquisition, complementary types of data may be recorded in addition to the geometric data. Respective data representations suited for scene analysis are for instance given by echo-based features [9,26], full-waveform features [9,26], or radiometric features [10,21,27]. The latter can be extracted by evaluating the backscattered reflectance at the wavelength with which the involved LiDAR sensor emits laser light. However, particularly for a more detailed scene analysis as for instance given by a fine-grained land cover and land use classification, multi- or hyperspectral data offer great potential. In this regard, hyperspectral information in particular has been in the focus of recent research on environmental mapping [28–30]. Such information can, for example, allow for distinguishing very different types of vegetation and to a certain degree also different materials, which can be helpful if the corresponding shape is similar. The use of data acquired with complementary types of sensors has for instance been proposed for building detection in terms of fusing ALS data and multi-spectral imagery [31]. Besides the fusion of data acquired with complementary types of sensors, technological advancements meanwhile allow the use of multi- or hyperspectral LiDAR sensors [27]. Based on the concept of multi-wavelength airborne laser scanning [32], two different LiDAR sensors have been used to collect dual-wavelength LiDAR data for land cover classification [33]. Here, the involved sensors emit pulses of light in the near-infrared domain and in the middle-infrared domain. Further investigations on land cover and land use classification involved a multi-wavelength airborne LiDAR system delivering 3D data as well as three reflectance images corresponding to the green, near-infrared, and short-wave infrared bands, using either three independent sensors [34] or a single sensor such as the Optech Titan sensor, which carries three lasers of different wavelengths [34,35]. While the classification may also be based on spectral patterns [36] or different spectral indices [37,38], further work focused on the extraction of geometric and intensity features on the basis of segments for land cover classification and change detection [39–41]. Further improvements regarding scene analysis may be achieved via the use of multi-modal data in the form of co-registered hyperspectral imagery and 3D point cloud data. Such a combination has already proven to be beneficial for tree species classification [42] as well as for civil engineering and urban planning applications [43]. Furthermore, the consideration of multiple modalities allows for simultaneously addressing different tasks. In this regard, it has for instance been proposed to exploit color imagery, multispectral imagery, and thermal imagery acquired from an airborne platform for the mapping of moss beds in Antarctica [44]. On the one hand, the high-resolution color imagery allows an appropriate 3D reconstruction of the considered scene in the form of a high-resolution digital terrain model and the creation of a photo-realistic 3D model. On the other hand, the multispectral imagery and thermal imagery allow for drawing conclusions about the location and extent of healthy moss as well as about areas of potentially high water concentration.
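To make the eigenvalue-based shape features listed above concrete, the following sketch computes them for a single neighborhood using their commonly used definitions in terms of the normalized, descending eigenvalues of the 3D structure tensor. The function name `eigen_features` is hypothetical, and computing the sum of eigenvalues from the unnormalized eigenvalues is an assumption; the exact conventions of [7,15,24] may differ in detail.

```python
import numpy as np

def eigen_features(neighbors):
    """Eigenvalue-based shape features of one local neighborhood (N x 3 array)."""
    pts = np.asarray(neighbors, dtype=float)
    cov = np.cov(pts.T)                             # 3 x 3 structure tensor
    ev = np.sort(np.linalg.eigvalsh(cov))[::-1]     # descending eigenvalues
    ev = np.clip(ev, 1e-12, None)                   # avoid log(0) and division by zero
    lam = ev / ev.sum()                             # normalized eigenvalues
    l1, l2, l3 = lam
    return {
        "linearity": (l1 - l2) / l1,
        "planarity": (l2 - l3) / l1,
        "sphericity": l3 / l1,
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "anisotropy": (l1 - l3) / l1,
        "eigenentropy": float(-(lam * np.log(lam)).sum()),
        "sum_of_eigenvalues": float(ev.sum()),      # assumption: unnormalized eigenvalues
        "surface_variation": l3 / (l1 + l2 + l3),
    }
```

With these definitions, linearity, planarity, and sphericity sum to one, which is why they are often read as soft indicators of 1D, 2D, and 3D structure.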
Instead of relying on a set of handcrafted features, recent approaches for point cloud classification focus on the use of deep learning techniques with which a semantic labeling is achieved via learned features. This may for instance be achieved via the transfer of the considered point cloud to a regular 3D voxel grid and the direct adaptation of convolutional neural networks (CNNs) to 3D data. In this regard, the most promising approaches rely on classifying each 3D point of a point cloud based on a transformation of all points within its local neighborhood to a voxel-occupancy grid serving as input for a 3D-CNN [6,45–47]. Alternatively, 2D image projections may be used as input for a 2D-CNN designed for semantic segmentation, and a subsequent back-projection of predicted labels to 3D space delivers the semantically labeled 3D point cloud [48,49].
In our work, we focus on the separate and combined consideration of shape, color, and hyperspectral information. We extract a set of commonly used geometric features in 3D on the basis of a discrete image raster, and we extract spectral features in terms of Red-Green-Blue (RGB) color values, color invariants, raw hyperspectral signatures, and an encoding of hyperspectral information via the standard principal component analysis (PCA). Due to the limited amount of training data in the available benchmark data, only handcrafted features are considered, while the use of learned features will be subject of future work given larger benchmark datasets.
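As an illustration of the PCA-based encoding of hyperspectral signatures, the sketch below projects per-pixel spectra onto the leading principal components until a chosen fraction of the variance is covered. The function name `pca_encode` is hypothetical and the SVD-based formulation is one of several equivalent ways to compute a PCA; the default of 99.9% matches the variability threshold given in Section 2.1.

```python
import numpy as np

def pca_encode(spectra, retained=0.999):
    """Project spectra (pixels x bands) onto the components covering `retained` variance."""
    X = np.asarray(spectra, dtype=float)
    X = X - X.mean(axis=0)                        # center each spectral band
    # SVD of the centered data; squared singular values are proportional
    # to the variance along each principal axis (already sorted descending).
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    var_ratio = s ** 2 / np.sum(s ** 2)
    m = int(np.searchsorted(np.cumsum(var_ratio), retained)) + 1
    m = min(m, len(s))                            # guard against round-off at 100%
    return X @ vt[:m].T, m                        # meta-features and their count
```

The returned meta-features are linearly uncorrelated by construction, so highly redundant neighboring bands collapse into a few dimensions.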

Classification
To classify the derived feature vectors, the straightforward approach consists in the use of standard supervised classification techniques such as support vector machines or random forest classifiers [5,7], which are meanwhile available in a variety of software tools and can easily be applied by non-expert users. Due to the individual consideration of feature vectors, however, the derived labeling reveals a "noisy" behavior when visualized as a colored point cloud, while a higher spatial regularity would be desirable since the labels of neighboring 3D points tend to be correlated.
To impose spatial regularity on the labeling, it has for instance been proposed to use associative and non-associative Markov networks [1,50,51], conditional random fields (CRFs) [10,52,53], multi-stage inference procedures relying on point cloud statistics and relational information across different scales [19], spatial inference machines modeling mid- and long-range dependencies inherent in the data [54], 3D entangled forests [55], or structured regularization representing a more versatile alternative to the standard graphical model approach [8]. Some of these approaches focus on directly classifying the 3D points, while others focus on a consideration of point cloud segments. In this regard, however, it has to be taken into account that the performance of segment-based approaches strongly depends on the quality of the achieved segmentation results and that a generic, data-driven 3D segmentation typically imposes a high computational burden. Furthermore, classification approaches enforcing spatial regularity generally require additional effort for inferring interactions among neighboring 3D points from the training data which, in most cases, corresponds to a larger amount of data that is needed to train a respective classifier.
Instead of a point-based classification and subsequent efforts for imposing spatial regularity on the labeling, some approaches also focus on the interplay between classification and segmentation. In this regard, many approaches start with an over-segmentation of the scene, e.g., by deriving supervoxels [56,57]. Based on the segments, features are extracted and then used as input for classification. In contrast, an initial point-wise classification may serve as input for a subsequent segmentation in order to detect specific objects in the scene [4,58,59] or to improve the labeling [60]. The latter has also been addressed with a two-layer CRF [52,61]. Here, the results of a CRF-based classification on point-level are used as input for a region-growing algorithm connecting points which are close to each other and meet additional conditions such as having the same label from the point-based classification. Subsequently, a further CRF-based classification is carried out on the basis of segments corresponding to connected points. While the two CRF-based classifications may be performed successively [61], it may be taken into account that the CRF-based classification on the segment level delivers a belief for each segment to belong to a certain class [52]. The beliefs in turn may be used to improve the CRF-based classification on the point level. Hence, performing the classification in both layers several times in an iterative scheme allows improving the segments and transferring regional context to the point-based level [52]. A different strategy has been followed by integrating a non-parametric segmentation model (which partitions the scene into geometrically-homogeneous segments) into a CRF in order to capture the high-level structure of the scene [62]. This allows aggregating the noisy predictions of a classifier on a per-segment basis in order to produce a data term of higher confidence.
In our work, we conduct experiments on a benchmark dataset allowing the separate and combined consideration of shape, color, and hyperspectral information. As only a limited amount of training data is given in the available data, we focus on point-based classification via a standard classifier, while the use of more sophisticated classification/regularization techniques will be subject of future work given larger benchmark datasets.

Materials and Methods
In this section, we present our framework for scene analysis based on multi-modal data in detail. The input for our framework is represented by co-registered multi-modal data containing color information, hyperspectral information, and shape information. The desired output is a semantic labeling with respect to user-defined classes. To achieve such a labeling, our framework involves feature extraction (Section 2.1) and supervised classification (Section 2.2).

Feature Extraction
Using the given color, hyperspectral and shape information, we extract the following features:

• Color information: We take into account that semantic image classification or segmentation often involves color information corresponding to the red (R), green (G), and blue (B) channels in the visible spectrum. Consequently, we define the feature set $S_{\mathrm{RGB}}$ addressing the spectral reflectance $I$ with respect to the corresponding spectral bands:

$$S_{\mathrm{RGB}} = \{ I_R, I_G, I_B \}$$

Since RGB color representations are less robust with respect to changes in illumination, we additionally involve normalized colors, also known as chromaticity values, as a simple example of color invariants [63]:

$$S_{\mathrm{RGB,norm}} = \left\{ \frac{I_R}{I_R + I_G + I_B}, \frac{I_G}{I_R + I_G + I_B}, \frac{I_B}{I_R + I_G + I_B} \right\}$$

Furthermore, we use a color model which is invariant to viewing direction, object geometry, and shading under the assumptions of white illumination and dichromatic reflectance [63]:

$$S_{C_1,C_2,C_3} = \left\{ \arctan \frac{I_R}{\max(I_G, I_B)}, \arctan \frac{I_G}{\max(I_R, I_B)}, \arctan \frac{I_B}{\max(I_R, I_G)} \right\}$$

Among the more complex transformations of the RGB color space, we test the approaches represented by comprehensive color image normalization (CCIN) [64] resulting in $S_{\mathrm{CCIN}}$ and edge-based color constancy (EBCC) [65] resulting in $S_{\mathrm{EBCC}}$.

• Hyperspectral information: We also consider spectral information at a multitude of spectral bands which typically cover an interval reaching from the visible spectrum to the infrared spectrum. Assuming hyperspectral image (HSI) data across $n$ spectral bands $B_j$ with $j = 1, \ldots, n$, we define the feature set $S_{\mathrm{HSI,all}}$ addressing the spectral reflectance $I$ of a pixel for all spectral bands:

$$S_{\mathrm{HSI,all}} = \{ I_{B_1}, \ldots, I_{B_n} \}$$

• PCA-based encoding of hyperspectral information: Due to the fact that adjacent spectral bands typically reveal a high degree of redundancy, we transform the given hyperspectral data to a new space spanned by linearly uncorrelated meta-features using the standard principal component analysis (PCA). Thus, the most relevant information is preserved in those meta-features indicating the highest variability of the given data. For our work, we sort the meta-features with respect to the covered variability and then use the set $S_{\mathrm{HSI,PCA}}$ of the $m$ most relevant meta-features $M_j$ with $j = 1, \ldots, m$ which covers $p = 99.9\%$ of the variability of the given data:

$$S_{\mathrm{HSI,PCA}} = \{ M_1, \ldots, M_m \}$$

• 3D shape information: From the XYZ coordinates acquired via airborne laser scanning and transformed to a regular grid, we extract a set of intuitive geometric features for each 3D point whose behavior can easily be interpreted by the user [7]. As such features describe the spatial arrangement of points in a local neighborhood, a suitable neighborhood has to be selected first for each 3D point. To achieve this, we apply eigenentropy-based scale selection [7], which has proven to be favorable compared to other options for the task of point cloud classification. For each 3D point $X_i$, this algorithm derives the optimal number $k_{i,\mathrm{opt}}$ of nearest neighbors with respect to the Euclidean distance in 3D space. Thereby, for each case specified by the tested value of the scale parameter $k_i$, the algorithm uses the spatial coordinates of $X_i$ and its $k_i$ neighboring points to compute the 3D structure tensor and its eigenvalues. The eigenvalues are then normalized by their sum, and the normalized eigenvalues $\lambda_{i,j}$ with $j = 1, 2, 3$ are used to calculate the eigenentropy $E_i$ (i.e., the disorder of 3D points within a local neighborhood). The optimal scale parameter $k_{i,\mathrm{opt}}$ is finally derived by selecting the scale parameter that corresponds to the minimum eigenentropy across all cases:

$$k_{i,\mathrm{opt}} = \underset{k_i \in K}{\arg\min} \; E_i(k_i)$$

Thereby, $K$ contains all integer values in $[k_{i,\min}, k_{i,\max}]$ with $k_{i,\min} = 10$ as lower boundary for allowing meaningful statistics and $k_{i,\max} = 100$ as upper boundary for limiting the computational effort [7].
Based on the derived local neighborhood of each 3D point $X_i$, we extract a set comprising 18 rather intuitive features which are represented by a single value per feature [7]. Some of these features rely on the normalized eigenvalues of the 3D structure tensor and are represented by linearity $L_i$, planarity $P_i$, sphericity $S_i$, omnivariance $O_i$, anisotropy $A_i$, eigenentropy $E_i$, sum of eigenvalues $\Sigma_i$, and local surface variation $C_i$ [15,24]. Furthermore, the coordinate $Z_i$, indicating the height of $X_i$, is used as well as the distance $d_i$ between $X_i$ and the farthest point in the local neighborhood. Additional features are represented by the local point density $\rho_i$, the verticality $V_i$, and the maximum difference $\Delta_i$ and standard deviation $\sigma_i$ of the height values of those points within the local neighborhood. To account for the fact that urban areas in particular are characterized by an aggregation of many man-made objects with many (almost) vertical surfaces, we encode specific properties by projecting the 3D point $X_i$ and its $k_{i,\mathrm{opt}}$ nearest neighbors onto a horizontal plane. From the 2D projections, we derive the 2D structure tensor and its eigenvalues. Then, we define the sum $\Sigma_{2D,i}$ and the ratio $R_{2D,i}$ of these eigenvalues as features. Finally, we use the 2D projections of $X_i$ and its $k_{i,\mathrm{opt}}$ nearest neighbors to derive the distance $d_{2D,i}$ between the projection of $X_i$ and the farthest point in the local 2D neighborhood, and the local point density $\rho_{2D,i}$ in 2D space. For more details on these features, we refer to [7]. Using all these features, we define the feature set $S_{3D}$:

$$S_{3D} = \{ L_i, P_i, S_i, O_i, A_i, E_i, \Sigma_i, C_i, Z_i, d_i, \rho_i, V_i, \Delta_i, \sigma_i, \Sigma_{2D,i}, R_{2D,i}, d_{2D,i}, \rho_{2D,i} \}$$

• 2.5D shape information: Instead of the pure consideration of 3D point distributions and corresponding 2D projections, we also directly exploit the grid structure of the provided imagery to define local 3 × 3 image neighborhoods as the basis for extracting 2.5D shape features, which form the feature set $S_{2.5D}$.

• Multi-modal information: Instead of separately considering the different modalities, we also consider a meaningful combination, i.e., multi-modal data, with the expectation that the complementary types of information significantly ease the classification task. Regarding spectral information, the PCA-based encoding of hyperspectral information is favorable, because redundancy is removed and RGB information is already covered by the hyperspectral data. Regarding shape information, both 3D and 2.5D shape information can be used. Consequently, we use the features derived via PCA-based encoding of hyperspectral information, the features providing 3D shape information, and the features providing 2.5D shape information as feature set $S_{\mathrm{HSI,PCA}+3D+2.5D}$:

$$S_{\mathrm{HSI,PCA}+3D+2.5D} = \{ S_{\mathrm{HSI,PCA}}, S_{3D}, S_{2.5D} \}$$

For comparison only, we use the feature set $S_{\mathrm{RGB}+3D}$ as a straightforward combination of color and 3D shape information, and the feature set $S_{\mathrm{HSI,PCA}+3D}$ as a straightforward combination of hyperspectral and 3D shape information:

$$S_{\mathrm{RGB}+3D} = \{ S_{\mathrm{RGB}}, S_{3D} \}$$

$$S_{\mathrm{HSI,PCA}+3D} = \{ S_{\mathrm{HSI,PCA}}, S_{3D} \}$$

Furthermore, we involve the combination of color/hyperspectral information and 2.5D shape information as well as the combination of 3D and 2.5D shape information in our experiments:

$$S_{\mathrm{RGB}+2.5D} = \{ S_{\mathrm{RGB}}, S_{2.5D} \}$$

$$S_{\mathrm{HSI,PCA}+2.5D} = \{ S_{\mathrm{HSI,PCA}}, S_{2.5D} \}$$

$$S_{3D+2.5D} = \{ S_{3D}, S_{2.5D} \}$$

In total, we test 15 different feature sets for scene analysis. For each feature set, the corresponding features are concatenated to derive a feature vector per data point.
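The two simple color invariants from the list above (chromaticity values and the arctangent-based color model) and the concatenation of feature sets into one feature vector per data point can be sketched as follows. The function names and the small constant `eps`, added to avoid division by zero for black pixels, are assumptions for illustration; CCIN and EBCC are not covered here.

```python
import numpy as np

def normalized_rgb(rgb, eps=1e-12):
    """Chromaticity values: each channel divided by the channel sum."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb / (rgb.sum(axis=-1, keepdims=True) + eps)

def c1c2c3(rgb, eps=1e-12):
    """Arctangent of each channel over the maximum of the other two channels."""
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([
        np.arctan(r / (np.maximum(g, b) + eps)),
        np.arctan(g / (np.maximum(r, b) + eps)),
        np.arctan(b / (np.maximum(r, g) + eps)),
    ], axis=-1)

def concat_features(*feature_sets):
    """Concatenate per-point feature sets (each N x d_k) into one N x sum(d_k) matrix."""
    return np.hstack([np.asarray(f, dtype=float) for f in feature_sets])
```

Both transformations are unaffected by a uniform scaling of the RGB values, which is the sense in which they are less sensitive to illumination intensity than raw RGB.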

Classification
To classify the derived feature vectors, we use a random forest (RF) classifier [66] as a representative of modern discriminative methods [67]. The RF classifier is trained by selecting random subsets of the training data and training a decision tree for each of the subsets. For a new feature vector, each decision tree casts a vote for one of the defined classes so that the majority vote across all decision trees represents a robust assignment.
For our experiments, we use an open-source implementation of the RF classifier [68]. To derive appropriate settings of the classifier (which address the number of involved decision trees, the maximum tree depth, the minimum number of samples to allow a node to be split, etc.), we use the training data and conduct an optimization via grid search on a suitable subspace.
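A grid search over random forest settings of this kind could look like the following sketch. It assumes scikit-learn as the implementation, which the text does not confirm, and the candidate values in the grid as well as the function name `train_rf` are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def train_rf(X_train, y_train, seed=0):
    """Tune a random forest via cross-validated grid search on the training data."""
    # Illustrative search space: number of trees, maximum tree depth, and
    # minimum number of samples required to split a node.
    grid = {
        "n_estimators": [20, 50],
        "max_depth": [None, 10],
        "min_samples_split": [2, 5],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=seed), grid, cv=3)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```

Because only the training data enters the search, the test data remains untouched until the final evaluation.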

Results
In the following, we present the involved dataset (Section 3.1), the used evaluation metrics (Section 3.2), and the derived results (Section 3.3).

Dataset
We evaluate the performance of our framework using the MUUFL Gulfport Hyperspectral and LiDAR Airborne Data Set [69] and a corresponding labeling of the scene [70] shown in Figure 1. The dataset comprises co-registered hyperspectral and LiDAR data which were acquired in November 2010 over the University of Southern Mississippi Gulf Park Campus in Long Beach, Mississippi, USA. According to the specifications, the hyperspectral data were acquired with an ITRES CASI-1500 and correspond to 72 spectral bands covering the wavelength interval between 367.7 nm and 1043.4 nm with a varying spectral sampling (9.5 nm to 9.6 nm) [71]. Since the first four bands and the last four bands were characterized by noise, they were removed. The LiDAR data were acquired with an Optech Gemini ALTM relying on a laser with a wavelength of 1064 nm. The provided reference labeling addresses 11 semantic classes and a remaining class for unlabeled data. All data are provided on a discrete image grid of 325 × 220 pixels, where a pixel corresponds to an area of 1 m × 1 m.

Evaluation Metrics
To evaluate the performance of different configurations of our framework, we compare the respectively derived labeling to the reference labeling on a per-point basis. Thereby, we consider the evaluation metrics represented by the overall accuracy OA, the kappa value $\kappa$, and the unweighted average $\bar{F}_1$ of the $F_1$-scores across all classes. To reason about the performance for single classes, we furthermore consider the classwise evaluation metrics represented by recall $R$ and precision $P$.
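These metrics can all be derived from the confusion matrix, as in the following sketch. The function name `evaluate` and the convention of integer class indices starting at 0 are assumptions; kappa is computed in the usual way from the observed agreement (OA) and the chance agreement implied by the row and column sums.

```python
import numpy as np

def evaluate(y_true, y_pred, n_classes):
    """Overall accuracy, kappa, mean F1, and classwise recall/precision."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)        # rows: reference, columns: prediction
    n = cm.sum()
    tp = np.diag(cm).astype(float)
    oa = tp.sum() / n
    # Chance agreement from the marginal distributions of both labelings.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / float(n) ** 2
    kappa = (oa - pe) / (1.0 - pe)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return {"OA": oa, "kappa": kappa, "mean_F1": f1.mean(),
            "recall": recall, "precision": precision}
```

Because the mean F1-score is unweighted, small classes influence it as much as large ones, which makes it a useful complement to OA on unbalanced test data.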

Results
First, we focus on the behavior of derived features for different classes. Exploiting the hyperspectral signatures of all data points per class, we derive the mean spectra for the different classes as shown in the left part of Figure 2. The corresponding standard deviations per spectral band are shown in the right part of Figure 2 and reveal significant deviations for almost all classes. Regarding the shape information, a visualization of the derived 2.5D shape information is given in Figure 3, while a visualization of exemplary 3D features is given in Figure 4. The color encoding is in accordance with the color encoding defined in Figure 1. Regarding the color information, it can be noticed that each of the mentioned transformations of the RGB color space results in a new color model with three components. Accordingly, the result of each transformation can be visualized in the form of a color image as shown in Figure 5. Note that the applied transformations reveal quite different characteristics. The representation derived via edge-based color constancy (EBCC) [65] is visually quite similar to the original RGB color representation.
For classification, we first discard 17,813 data points that are labeled as C12 ("unlabeled"), because they might contain examples of different other classes (cf. black curves in Figure 2) and would hence not lead to an appropriate classification. Then, we randomly select 100 examples per remaining class to form balanced training data as suggested in [72] to avoid a negative impact of unbalanced data on the training process. The relatively small number of examples per class is realistic for practical applications focusing on scene analysis based on hyperspectral data [30]. All 52,587 remaining examples are used as test data. To classify these test data, we define different configurations of our framework by selecting different feature sets as input for classification. For each configuration, the derived classification results are visualized in Figure 6. The corresponding values for the classwise evaluation metrics of recall $R$, precision $P$, and $F_1$-score are provided in Tables 1-3, respectively, and the values for the overall accuracy OA, the kappa value $\kappa$, and the unweighted average $\bar{F}_1$ of the $F_1$-scores across all classes are provided in Table 4.
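The split into balanced training data and test data described above can be sketched as follows. The integer encoding of the classes (1 to 11 for the semantic classes, 12 for "unlabeled") and the function name `balanced_split` are assumptions for illustration. The stated numbers are consistent with the grid size: 325 × 220 = 71,500 pixels, minus 17,813 unlabeled pixels and 11 × 100 training examples, leaves 52,587 test examples.

```python
import numpy as np

def balanced_split(labels, n_per_class=100, unlabeled=12, seed=0):
    """Draw n_per_class random training indices per class; the rest is test data."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).ravel()
    valid = np.flatnonzero(labels != unlabeled)   # discard unlabeled data points
    train = []
    for c in np.unique(labels[valid]):
        idx = valid[labels[valid] == c]
        # Sampling without replacement; raises ValueError if a class has
        # fewer than n_per_class examples.
        train.append(rng.choice(idx, size=n_per_class, replace=False))
    train = np.concatenate(train)
    test = np.setdiff1d(valid, train)             # all remaining labeled points
    return train, test
```

Fixing the random seed keeps the split reproducible across the different feature-set configurations, so that differences in the results stem from the features rather than from the sampled training examples.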

Discussion
The derived results reveal that color and shape information alone are not sufficient to obtain appropriate classification results (cf. Table 4). This is due to the fact that several classes are characterized by a similar color representation (cf. Figure 1) or exhibit a similar geometric behavior when focusing on the local structure (cf. Figures 3 and 4). A closer look at the classwise evaluation metrics (cf. Tables 1-3) reveals poor classification results for several classes, yet the transfer to different color representations (cf. $S_{RGB,norm}$, $S_{C1,C2,C3}$, $S_{CCIN}$, and $S_{EBCC}$) seems to be beneficial in comparison to the use of RGB color representations. In contrast, hyperspectral information allows for a better differentiation of the defined classes. However, a principal component analysis (PCA) should be applied to the hyperspectral data to remove the redundancy that becomes visible in highly correlated neighboring spectral bands (cf. Figure 2) and that has a negative impact on classification (cf. Table 4).
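The redundancy argument can be checked directly by measuring how strongly neighboring spectral bands are correlated. The following sketch computes the Pearson correlation coefficient for each pair of adjacent bands; values close to 1 indicate that a band adds little information beyond its neighbor, which is what motivates a PCA-based encoding. This is an illustrative Python example under the assumption that the spectra are given as one list of band values per data point; the function names are hypothetical and the example is not part of the original implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant band
    return cov / math.sqrt(var_x * var_y)


def adjacent_band_correlations(spectra):
    """Correlation between band b and band b+1 for all b.

    spectra: list of per-point spectra, each a list of band values.
    """
    n_bands = len(spectra[0])
    # Reorganize the data so that each entry holds one band across all points.
    bands = [[s[b] for s in spectra] for b in range(n_bands)]
    return [pearson(bands[b], bands[b + 1]) for b in range(n_bands - 1)]


# Toy data with three bands (values are made up for illustration only).
toy_spectra = [[1.0, 2.0, 5.0], [2.0, 4.0, 3.0], [3.0, 6.0, 4.0]]
print(adjacent_band_correlations(toy_spectra))
```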
Furthermore, we observe a clear benefit of using multi-modal data in comparison to data of a single modality. The significant gain in OA, κ, and $\bar{F}_1$ (cf. Table 4) indicates both a better overall performance and a significantly better recognition of instances across all classes, which can be verified by considering the classwise recall and precision values and $F_1$-scores (cf. Tables 1-3). The best performance is obtained when using the feature set $S_{HSI,PCA+3D+2.5D}$, which represents a meaningful combination of hyperspectral information, 3D shape information, and 2.5D shape information.

In the following, we consider three exemplary parts of the scene in more detail. These parts are shown in Figure 7 and the corresponding classification results are visualized in Figure 8.

The classification results derived for Area 1 are shown in the top part of Figure 8 and reveal that the extracted 3D shape information in particular can contribute to the detection of the given building, while the extracted 2.5D shape information is less suitable. In both cases, however, the different types of ground surfaces surrounding the building can hardly be classified correctly, since other classes have a similar geometric behavior for the given grid resolution of 1 m. Using color and hyperspectral information, the surroundings of the building are better classified (particularly for the classes "mostly-grass ground surface" and "mixed ground surface", but also for the classes "road" and "sidewalk"), while the correct classification of the building remains challenging with the RGB and EBCC representations. Even the use of all hyperspectral information is less suitable for this area, while the PCA-based encoding of hyperspectral information with lower dimensionality seems to be favorable.

For Area 2, the derived classification results are shown in the center part of Figure 8. While the building can be correctly classified using 3D shape information, the use of 2.5D shape information hardly allows reasoning about the given building in this part of the scene. This might be due to the fact that the geometric properties in the respectively considered 3 × 3 image neighborhoods are not sufficiently distinctive for the given grid resolution of 1 m. The data samples belonging to the flat roof of the building then appear similar to the data samples obtained for different types of flat ground surfaces. For the surroundings of the building, a similar behavior can be observed as for Area 1, since shape information remains less suitable for differentiating between different types of ground surfaces at the given grid resolution unless their roughness varies significantly. In contrast, color and hyperspectral information deliver a better classification of the observed area, but tend to partially interpret the flat roof as "road". This could be due to the fact that the material of the roof exhibits similar characteristics in the color and hyperspectral domains as the material of the road.

For Area 3, the derived classification results are shown in the bottom part of Figure 8 and reveal that classes with a similar geometric behavior (e.g., the classes "mixed ground surface", "dirt and sand", "road", and "sidewalk") can hardly be distinguished using only shape information. Some data samples are even classified as "water", which is also characterized by a flat surface but is not present in this part of the scene. Using color information, the derived results seem to be better. With the RGB and EBCC representations, however, it remains challenging to distinguish the classes "mixed ground surface" and "dirt and sand", while the representation derived via simple normalization of the RGB components, the C1,C2,C3 representation, and the CCIN representation seem to perform much better in this regard. Involving hyperspectral information leads to a behavior similar to that observed for color information in Area 3, but the classification results appear to be less "noisy". For both color and hyperspectral information, some data samples are even classified as "roof". This might be due to the fact that materials used for construction purposes exhibit similar characteristics in the color and hyperspectral domains as the materials of the given types of ground surfaces.

Using data of different modalities for the classification of Areas 1-3, the derived classification results still reveal characteristics of the classification results derived with a separate consideration of the respective modalities. Yet, the classification results are less "noisy" and tend to be favorable in most cases.

In summary, the extracted types of information reveal complementary characteristics. Shape information does not allow separating classes with a similar geometric behavior (e.g., the classes "mixed ground surface", "dirt and sand", "road", and "sidewalk"), yet 3D shape information in particular has turned out to provide a strong indication for buildings. In contrast, color information allows for the separation of classes with a different appearance, even if they exhibit a similar geometric behavior. This, for instance, allows separating natural ground surfaces (e.g., represented by the classes "mixed ground surface" and "dirt and sand") from man-made ground surfaces (e.g., represented by the classes "road" and "sidewalk"). The used hyperspectral information covers the visible domain and the near-infrared domain for the given dataset. Accordingly, it should generally allow a better separation of different classes. This indeed becomes visible for the PCA-based encoding, but the use of all available hyperspectral information is not appropriate here due to the curse of dimensionality. Consequently, the combination of shape information and a PCA-based encoding of hyperspectral information is to be favored for the extraction of semantic information.
Regarding feature extraction, the RGB color information and the hyperspectral information across all spectral bands can be taken directly from the given data. Processing times required to transform RGB color representations using the different color models are not significant. For the remaining options, we observed processing times of 0.11 s for the PCA-based encoding of hyperspectral information, 51.10 s for eigenentropy-based scale selection, 4.20 s for extracting 3D shape information, and 1.55 s for extracting 2.5D shape information on a standard laptop computer (Intel Core i7-6820HK, 2.7 GHz, 4 cores, 16 GB RAM, Matlab implementation). In addition, 0.05 s was required for training the RF classifier and 0.06 s for classifying the test data.
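The comparatively high cost of eigenentropy-based scale selection is plausible, since an entropy value has to be evaluated for every candidate neighborhood size at every data point. The following sketch illustrates the principle under stated assumptions: the eigenentropy is taken as the Shannon entropy of the normalized eigenvalues of the local 3D covariance matrix, and the neighborhood size with minimal eigenentropy is selected. The eigenvalues are assumed to be precomputed per candidate size; the function names and the candidate sizes in the example are hypothetical, and the code is not the implementation used for the experiments.

```python
import math


def eigenentropy(eigenvalues):
    """Shannon entropy of the normalized eigenvalues (assumed definition)."""
    total = sum(eigenvalues)
    if total <= 0:
        return 0.0
    normalized = [v / total for v in eigenvalues]
    # Terms with a zero eigenvalue contribute nothing to the entropy.
    return -sum(e * math.log(e) for e in normalized if e > 0)


def select_scale(eigenvalues_per_scale):
    """Return the candidate neighborhood size with minimal eigenentropy.

    eigenvalues_per_scale: dict mapping a candidate neighborhood size k
    to the three eigenvalues obtained for the k nearest neighbors.
    """
    return min(eigenvalues_per_scale,
               key=lambda k: eigenentropy(eigenvalues_per_scale[k]))


# Toy eigenvalues for three candidate sizes (made up for illustration only).
candidates = {
    10: (1.0, 1.0, 1.0),   # isotropic scatter: maximal disorder
    20: (5.0, 0.1, 0.1),   # dominant direction: low disorder
    30: (2.0, 2.0, 0.1),   # planar structure: intermediate disorder
}
print(select_scale(candidates))
```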
We also want to point out that, in the scope of our experiments, we focused on rather intuitive geometric features that can easily be interpreted by the user. A straightforward extension would be the extraction of geometric features at multiple scales [5,10,20] and possibly for different neighborhood types [11,22]. Furthermore, more complex geometric features could be considered [11], or deep learning techniques could be applied to learn appropriate features from 3D data [6]. Instead of focusing on feature extraction, future efforts may also consider spatial regularization techniques [8,10,52,67], which exploit the fact that neighboring data points tend to be correlated and hence derive a smooth labeling. While all these issues are addressed in ongoing work, solutions for geospatial data with low spatial resolution and only a few training examples are still needed. Our work delivers important insights for further investigations in that regard.

Conclusions
In this paper, we have addressed scene analysis based on multi-modal data. Using color information, hyperspectral information, and shape information separately and in different combinations, we defined feature vectors and classified them with a random forest classifier. The derived results clearly reveal that shape information alone is of rather limited value for extracting semantic information if the spatial resolution is relatively low and if user-defined classes exhibit a similar geometric behavior of the local structure. In such scenarios, hyperspectral information in particular reveals a high potential for scene analysis, but typically still suffers from not considering the topography of the scene due to the use of a device operating in push-broom mode (e.g., the visible and near-infrared (VNIR) push-broom sensor ITRES CASI-1500, which involves a sensor array with 1500 pixels to scan a narrow across-track line on the ground). To address this issue, modern hyperspectral frame cameras could be used to directly acquire hyperspectral imagery and, if the geometric resolution is sufficiently high, 3D surfaces could even be reconstructed from the hyperspectral imagery, e.g., via stereophotogrammetric or structure-from-motion techniques [73]. Furthermore, the derived results indicate that, in contrast to the use of data of a single modality, the consideration of multi-modal data represented by hyperspectral information and shape information allows deriving appropriate classification results even for challenging scenarios with low spatial resolution.

Figure 2. Mean spectra of the considered classes across all 64 spectral bands (left) and standard deviations of the spectral reflectance for the considered classes across all 64 spectral bands (right). The color encoding is in accordance with the color encoding defined in Figure 1.

Figure 3. Visualization of the derived 2.5D shape information. The values for the maximum height difference and the standard deviation of height values are normalized to the interval [0, 1].
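The two 2.5D shape features named in the caption of Figure 3 can be sketched as follows for a height grid and 3 × 3 image neighborhoods. This is an illustrative Python example, not the Matlab implementation used for the experiments. The function names are hypothetical, and two details are assumptions made for this example: neighborhoods are clipped at the grid border, and the population standard deviation is used.

```python
from statistics import pstdev


def shape_features_2p5d(height, radius=1):
    """Maximum height difference and standard deviation of heights.

    height: 2D list of height values on a regular grid.
    radius: 1 yields 3 x 3 neighborhoods (clipped at the border, assumption).
    """
    rows, cols = len(height), len(height[0])
    max_diff = [[0.0] * cols for _ in range(rows)]
    std_dev = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect all valid cells in the local neighborhood.
            window = [height[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            max_diff[r][c] = max(window) - min(window)
            std_dev[r][c] = pstdev(window)  # population standard deviation
    return max_diff, std_dev


def normalize_01(grid):
    """Min-max normalization of a 2D grid to the interval [0, 1]."""
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


# Toy height grid in meters: flat ground with one elevated cell.
toy_height = [[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 3.0]]
max_diff, std_dev = shape_features_2p5d(toy_height)
print(normalize_01(max_diff))
```

The sketch also illustrates the limitation discussed for flat roofs: on a perfectly flat surface, both features are zero regardless of the absolute height, so a flat roof and flat ground are indistinguishable from these two values alone.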

Figure 5. Aerial image in different representations: the original image in the RGB color space and color-invariant representations derived via a simple normalization of the RGB components, the C1,C2,C3 color model proposed in [63], comprehensive color image normalization (CCIN) [64], and edge-based color constancy (EBCC) [65].

Figure 7. Selection of three exemplary parts of the scene: Area 1 and Area 2 contain a building and its surrounding characterized by trees and different types of ground surfaces, while Area 3 contains a few trees and several types of ground surfaces.
3 image neighborhoods. Based on the corresponding XYZ coordinates, we derive the features of linearity L *

Table 4. Classification results derived with different feature sets.