Urban Land-Cover Classification Using Side-View Information from Oblique Images

Land-cover classification on very high resolution data (decimetre-level) is a well-studied yet challenging problem in remote sensing data processing. Most existing works focus on using images with an orthographic view, or orthophotos with the associated digital surface models (DSMs). However, the use of nowadays widely available oblique images to support such a task has not been sufficiently investigated. In the effort of identifying different land-cover classes, it is intuitive that side-view information obtained from the oblique images can be of great help, yet how this can be technically achieved is challenging due to the complex geometric association between the side and top views. We aim to address these challenges in this paper by proposing a framework with enhanced classification results, leveraging the use of the orthophoto, digital surface models and oblique images. The proposed method contains a classic two-step pipeline of (1) feature extraction and (2) classification, in which the key contribution is a feature extraction algorithm that performs a simplified geometric association between top-view segments (from the orthophoto) and side-view planes (from projected oblique images), followed by joint statistical feature extraction. Our experiment on five test sites showed that the side-view information steadily improved the classification accuracy with both kinds of training samples (by 1.1% and 5.6% for evenly distributed and non-evenly distributed samples, respectively). Additionally, by testing the classifier at a large and untrained site, adding side-view information showed a total accuracy improvement of 26.2% for the above-ground objects, which demonstrates the strong generalization ability of the side-view features.


Introduction
Land-cover classification of high resolution data is an intensively investigated area of research in remote sensing [1][2][3]. The classification is often applied to top-view images (e.g., orthographic satellite images and orthophotos of photogrammetric products) or information of other modalities (e.g., digital surface models (DSMs)) [4][5][6]. Spectral and spatial features are two basic types of image features, which respectively record the optical reflections at different wavelengths and the texture information in a continuous spatial domain. Since different objects have different reflection characteristics in different spectral bands, many indexes have been proposed as classification clues, such as the normalized difference vegetation index (NDVI) [7], the normalized difference water index (NDWI) [8] and the normalized difference snow index (NDSI_snow) [9]. Based on these indexes, there are many variations, including the near surface moisture index (NSMI), which models the relative surface snow moisture [10], and the normalized difference soil index (NDSI_soil) [11]. Finding and attaching the vertical side-views to their hosts in the overview of orthographic images can be a challenging problem. There is no direct connection between a region in the orthographic image and its possible side textures in most remote sensing data, even though we schematically link them in Figure 1. Oblique images are not purposed for cartographic mapping, but their mapping products, such as the orthophoto and DSM, have been extensively analyzed for land-cover classification [24]. By observing the geometric constraints between the oblique images and the orthophoto and DSM, we found a way to incorporate the side-view information for land-cover classification. Firstly, from the DSM, the above-ground objects can be segmented out as individual regions that could have side-view information.
Then, for each above-ground region, a virtual polygon boundary would be calculated to map the side-view textures in the oblique images via a perspective transformation. Finally, from these textures, the side-view information of each above-ground segment can be extracted and incorporated in the land-cover classification with their top-view features.
Following this idea, in this study, we aimed to leverage the extra side-view information to improve the land-cover classification with the oblique imagery. In general, the main contributions of this work include: (1) to the authors' best knowledge, this is the first work which proposes using the side-view textures to support the top-view based land-cover classification; (2) a feasible framework is proposed to extract the side-view features that can be precisely incorporated into top-view segments and can improve the classification accuracy, especially when the training samples are very limited.

Materials and Methods
To incorporate side-view information in land-cover classification, we first segment the above-ground objects whose textures can be mapped to side-views. Then, based on the segmentation boundaries, their side-view textures are mapped and selected from the oblique images via a perspective transformation. Finally, side-view information, including color and texture features, is extracted for each above-ground segment.

Above-Ground Object Segmentation
Above-ground object segmentation is a complicated problem that has been studied for years [30,31] but still lacks a general solution. To simplify it, we assume all above-ground objects have flat roofs; for example, if a building has two conjoint parts with different heights, the two parts are treated as two objects. With this assumption, we can efficiently segment the above-ground objects at the individual level with a simple height clustering algorithm, in which connected pixels that share similar heights are grouped as one above-ground object. To implement this, we first use the DSM to calculate a gradient map that approximates the above-ground height with respect to surrounding areas. Then, from the highest to the lowest, the connected pixels with height differences within 1 m are sequentially grouped into individual segments. Finally, the segments with an average above-surroundings height of at least 2.5 m are classified as above-ground objects, as shown in Figure 2.
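The height-clustering step above can be sketched as follows. This is a minimal illustration under our reading of the method: the function name, the direct use of a precomputed ground-height grid (instead of the gradient map), and a 4-connected flood fill are our own assumptions.

```python
from collections import deque

def segment_above_ground(dsm, ground, merge_tol=1.0, min_height=2.5):
    """Group connected DSM cells into above-ground objects (sketch).

    dsm    : 2D list of surface heights
    ground : 2D list of approximate terrain heights (assumed given here;
             the paper approximates it with a gradient map of the DSM)
    Cells are grouped, highest first, when the height difference to a
    neighbor is within merge_tol (1 m in the paper); groups whose average
    height above ground is at least min_height (2.5 m) are kept.
    """
    rows, cols = len(dsm), len(dsm[0])
    labels = [[0] * cols for _ in range(rows)]
    segments, next_label = {}, 0
    # visit cells from the highest to the lowest, as in the paper
    order = sorted(((dsm[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)
    for _, r, c in order:
        if labels[r][c]:
            continue
        next_label += 1
        labels[r][c] = next_label
        queue, members = deque([(r, c)]), []
        while queue:
            cr, cc = queue.popleft()
            members.append((cr, cc))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = cr + dr, cc + dc
                if (0 <= nr < rows and 0 <= nc < cols and not labels[nr][nc]
                        and abs(dsm[nr][nc] - dsm[cr][cc]) <= merge_tol):
                    labels[nr][nc] = next_label
                    queue.append((nr, nc))
        avg_rel = sum(dsm[r2][c2] - ground[r2][c2] for r2, c2 in members) / len(members)
        if avg_rel >= min_height:
            segments[next_label] = members
    return labels, segments
```

On a toy grid with a flat 10 m block on flat 0 m ground, only the block is returned as an above-ground segment.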
The resulting clusters may contain errors, such as incomplete segments and incorrect above-ground heights, mainly in multi-layer objects (e.g., towers on a roof and gullies on the ground). To fix these errors, we post-process the segments with a simple neighbor-merging technique.

Side-View Texture Cropping and Selection
Similar to 3D building façade texture mapping [32], the vertical faces of above-ground objects can be mapped and cropped from oblique images. However, unlike buildings, which often have well-defined plane/multi-plane structures in their façades, many above-ground objects, for example, trees, do not possess a specified vertical face. To solve this problem, we convert the boundaries of above-ground segments into polygons with the Douglas-Peucker algorithm [33], thereby creating pseudo vertical faces by cascading the top edges of each object to the ground, as shown in Figure 3, images (a) and (b). In the experiment, only the three longest lines are used to extract side-view textures. As illustrated in Figure 3, image (b), the vertical face is defined as a rectangle with four space points (P1, P2, P3, P4). The upper points (P1, P2) are the two ending points of a polygon line at the object height, while the lower points (P3, P4) are at the same positions but at ground height. The georeferenced 3D coordinates (X, Y, Z) of the four points in the object space can be acquired from the orthophoto and DSM; thus, their corresponding oblique image coordinates can be calculated via a perspective transformation:

s [u, v, 1]^T = P_{3×4} [X, Y, Z, 1]^T,

where (u, v, 1) are the 2D homogeneous coordinates in the oblique image with s as a scale factor, and P_{3×4} is a perspective transform matrix which contains the intrinsic and extrinsic camera parameters calibrated in the photogrammetric 3D processing. The reader can find more details about the photogrammetry in [34]. As illustrated in Figure 3, images (c) and (d), after this perspective transform, the four points define a region of the side-view in each of the multi-view oblique images.
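The perspective projection of a corner point can be illustrated with a minimal sketch. The function name is hypothetical, and the matrix used in practice would come from the photogrammetric calibration, not from toy values.

```python
def project_to_oblique(P, X, Y, Z):
    """Project a georeferenced 3D point into an oblique image using the
    3x4 perspective matrix P, i.e. s*[u, v, 1]^T = P*[X, Y, Z, 1]^T.
    P is assumed to come from the photogrammetric calibration."""
    pt = (X, Y, Z, 1.0)
    su, sv, s = (sum(P[i][j] * pt[j] for j in range(4)) for i in range(3))
    if abs(s) < 1e-12:
        raise ValueError("point projects to infinity")
    return su / s, sv / s  # pixel coordinates (u, v)
```

With an identity-like toy matrix, the homogeneous scale s is divided out to recover the pixel coordinates.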
To get better side-views for the later feature extraction, we rectify the textures to the front view through a homography transform that maps the points in one image to the corresponding points in the other image (e.g., mapping P2, P1, P3 and P4 to the top-left, top-right, bottom-left and bottom-right corners of a rectangular image, respectively), as shown in Figure 3e. The reader can find more details about homography in [34]. In general, more than one oblique image can capture the side-view of an object. To select the best one, we consider three factors: (1) V(f), the quality of the angle between the normal of the face plane and the camera imaging plane; (2) N(f), the quality of the angle between the face normal and the line through the camera and face centers; (3) O(f), the proportion of the observable part. Based on these factors, the best side-view is selected by a texture quality measurement:

Q(f) = m_1 V(f) + m_2 N(f) + m_3 O(f),

where Q(f) measures the quality of side-view f, while m_1, m_2 and m_3 are the weights of the different quality factors. In the experiment, m_1, m_2 and m_3 were set as 0.25, 0.25 and 0.5, respectively, as we found the visibility to be more important. While the first two factors can be easily calculated, the visibility is complicated to measure because occlusions often exist in urban areas. Inspired by a Z-buffer based occlusion detection [29], we examine the visibility with a distance measurement, as illustrated in Figure 4: a texture point (e.g., p1) must be close to the side-view plane (yellow rectangle) in the object space; otherwise (e.g., p2, which points at a tree), it is treated as an occlusion point.
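The view-selection rule can be sketched directly from the quality measurement. The helper names and the illustrative factor values are our assumptions; the weights follow the experiment.

```python
def view_quality(V, N, O, m=(0.25, 0.25, 0.5)):
    """Texture quality Q(f) = m1*V(f) + m2*N(f) + m3*O(f); the default
    weights (0.25, 0.25, 0.5) are the values used in the experiment."""
    return m[0] * V + m[1] * N + m[2] * O

def select_best_view(candidates):
    """Pick the oblique view with the highest Q(f). `candidates` maps an
    image id to its (V, N, O) factors in [0, 1]; the factor values are
    assumed to be precomputed from the camera and face geometry."""
    return max(candidates, key=lambda k: view_quality(*candidates[k]))
```

Because O(f) carries half the weight, a mostly visible view wins over a better-angled but heavily occluded one.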
For each side-view region in the multi-view oblique images, we can simulate emitting rays from the camera center through the side-view texture to reach the DSM in the object space. If a pixel is not part of the plane (e.g., due to occlusion), as with p2 in Figure 4, we mark it as an invalid pixel for feature extraction. The resulting masked image is shown in Figure 3e.
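A much-simplified version of this validity test might look like the following, assuming the side-view plane is given by a point and a unit normal and that the ray-DSM intersection point has already been computed; the tolerance value and function name are our own.

```python
def is_valid_texture_point(plane_point, plane_normal, hit_point, tol=0.5):
    """Z-buffer-style validity check (our simplification): a ray from the
    camera through a texture pixel hits the DSM at hit_point; the pixel
    is kept only if that hit lies within tol metres of the side-view
    plane, otherwise it is treated as an occlusion point."""
    # signed distance of hit_point to the plane (plane_normal must be unit)
    d = sum(n * (p - q) for n, p, q in zip(plane_normal, hit_point, plane_point))
    return abs(d) <= tol
```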

Side-View Feature Description
To capture the side-view features, we compute the average color and the standard deviation in R, G, B channels. The histogram of oriented gradients (HOG) [35] and Haar-like features [36,37] are also adopted for the texture description.
The HOG descriptor counts occurrences of gradient orientations in different localized portions of an image with a histogram. By normalizing and concatenating all local HOGs, such as those of different parts of a human body, object boundaries can be effectively described. In our case, the entire side-view texture is treated as a single patch because there is no dominant or specified distribution. On the other hand, considering that the elements (e.g., windows) in building façades usually have a regular and repetitive layout, we apply rectangle Haar-like features, which have been shown to be highly descriptive, to the side-view images. The rectangle Haar-like feature is defined as the difference of the sums of the pixel intensities inside different rectangles. For the side-view textures, a triple-rectangle Haar-like pattern (e.g., black-white-black) is designed and applied in the vertical and horizontal directions at 3 different sizes (6 feature vectors in total). Finally, from pixels to blocks, the color, gradient and Haar-like features are combined to describe the side-view of each above-ground segment.
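The triple-rectangle Haar-like response can be computed in constant time per window with an integral image. The sketch below is a generic illustration of that standard technique, not the paper's exact feature layout; the rectangle arrangement and function names are assumptions.

```python
def integral_image(img):
    """Summed-area table: ii[r][c] = sum of img[0..r-1][0..c-1]."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in rows r0..r1-1, cols c0..c1-1, in O(1)."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]

def haar_triple_vertical(ii, r0, c0, h, w):
    """Black-white-black response for three stacked h-by-w rectangles
    starting at (r0, c0): outer sums minus the middle sum."""
    top = rect_sum(ii, r0, c0, r0 + h, c0 + w)
    mid = rect_sum(ii, r0 + h, c0, r0 + 2 * h, c0 + w)
    bot = rect_sum(ii, r0 + 2 * h, c0, r0 + 3 * h, c0 + w)
    return top + bot - mid
```

Sliding this window over a façade texture gives a strong response on horizontally banded patterns such as rows of windows.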

Classification with Side-View and Top-View
Following the idea of object-based image classification, we first segment the top-view image into small segments as basic classification units. Then, for each segment, top-view features are directly extracted from orthophoto and DSM, while the side-view features are assigned based on the overlaps between the segments and above-ground objects. Finally, with the top-view and side-view features, a random forest classifier is trained to perform the classification.

Image Segmentation with Superpixels
Several image segmentation algorithms have been used for remote sensing data, such as mean-shift [20,22] and superpixel segmentation [38,39]. Because it does not treat shape as a primary criterion, the mean-shift algorithm can generate well-articulated segments, but the size of the segments may vary and the result is sensitive to the algorithm parameters, leading to unpredictable segments. The superpixel algorithm generates compact segments with regular shapes and scales, which are more robust and better suited to association with the side-view features. Hence, in this study, we generated SLIC superpixel segments [40] and assigned each segment the side-view features based on its overlap with the above-ground objects, as illustrated in Figure 5. For superpixels that are not in the above-ground areas, the side-view features are set to zeros.
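The assignment of side-view features to superpixels might be sketched as follows, assuming label maps for the superpixels and above-ground objects; the majority-overlap rule and all names are our reading of the paper rather than its exact implementation.

```python
from collections import Counter, defaultdict

def assign_side_view_features(sp_labels, ag_labels, side_features, dim):
    """Assign each superpixel the side-view feature vector of the
    above-ground object it overlaps most; zeros if it overlaps none.

    sp_labels / ag_labels : 2D label maps of the same shape
                            (0 in ag_labels means ground)
    side_features         : object id -> side-view feature list
    dim                   : length of the side-view feature vector
    """
    overlap = defaultdict(Counter)
    for sp_row, ag_row in zip(sp_labels, ag_labels):
        for sp, ag in zip(sp_row, ag_row):
            if ag:  # count above-ground pixels shared with each object
                overlap[sp][ag] += 1
    features = {}
    for sp in {sp for row in sp_labels for sp in row}:
        if overlap[sp]:
            best_obj, _ = overlap[sp].most_common(1)[0]
            features[sp] = side_features[best_obj]
        else:
            features[sp] = [0.0] * dim  # ground superpixel: zero vector
    return features
```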

Classification Workflow
The side-view serves as a piece of complementary information and can be incorporated into any land-cover classification framework with top-view features. Hence, in this work, we directly adopted the framework introduced in [24], which uses a dual morphological top-hat profile (DMTHP) to extract the top-view features and a random forest to classify the segments. More specifically, the top-view features include the DMTHP features extracted from the DSM and from the brightness and darkness orthophoto images produced by principal component analysis (PCA) [41]. The DMTHP extracts the spatial features with class-dependent sizes which are adaptively estimated from the training data. This mechanism avoids exhaustive morphology computation over a set of sizes with regular intervals and greatly reduces the dimensionality of the feature space. On the other hand, the random forest classifier is widely used for hierarchical feature classification [42]. The voting strategy of multiple decision trees and the hierarchical examination of the feature elements give this method high accuracy. The entire classification workflow can be found in Figure 6, and more details about the top-view feature extraction and the random forest classifier can be found in [24].

Results
In the experiment, 306 aerial images were used as the study data, including 73 top-view, 64 forward-view, 47 backward-view, 62 left-view and 60 right-view images taken by a 5-head Leica RCD30 airborne camera. The size of all images is 10,336 × 7788 pixels; the four oblique cameras were mounted with a tilt angle of 35 degrees (see Table 1). These images were calibrated with the professional photogrammetric software Pix4DMapper (Pix4D SA, Switzerland), which was also used to produce the orthophoto and DSM. The georeferencing accuracy, computed from 9 ground control points, is 2.9 cm. The ground sampling distance (GSD) of the orthophoto and DSM is 7.8 cm.
The study area centers around the campus of the National University of Singapore (NUS), where the terrain contains a hilly ridge with tall and low buildings, dense vegetation, roads and man-made structures, as illustrated in Figure 7. To analyze the improvement brought by our method, six sites that each contain all the classes in different scenarios were selected. As shown in Figure 7, site A is a complex campus area which includes dormitories, dining halls, study rooms and multi-function buildings. Sites B and E are residential areas with different types of residential buildings. Site C is a secondary school containing challenging scenarios: the education buildings and a playground are on the roof. Site D is a parking site. Site F, with complicated land-cover classes, is a much larger area which is used to test the generalization capability of the method. In this study, the image was classified into (1) ground classes, including road, bare ground and impervious surfaces; (2) grassland; (3) trees; (4) rain-shed, including pedestrian overpasses; and (5) building. Other objects, such as cars, rivers/pools and lower vegetation, were not considered. The reference masks of the land-cover were manually drawn by an operator who visually identified the objects in the orthophoto, DSM and oblique images. For each test site (except site F), around 2% of the labeled superpixel segments were used to train the classifier, and more statistics about the experimental setup are listed in Table 2. For the random forest classifier [43], 500 decision trees were used for training, while the number of variables for classification was set as the square root of the feature dimension, which was 35 in the experiment.

Validation of Above-Ground Object Segmentation
The above-ground object segmentation is an initial and critical step for the side-view information extraction. To validate the above-ground segments, we compared them with the reference labels of tree, rain-shed and building. The above-ground segmentation accuracy for the five test sites is shown in Table 3. The evaluation metrics include the per-class accuracy, overall accuracy and commission error, corresponding to the percentage of correctly identified above-ground pixels per class and in total, and the percentage of misclassified above-ground pixels, respectively. As observed from Table 3, most of the above-ground pixels were successfully segmented (92.7% overall accuracy), and only a few were misclassified (2.42% commission error). For the per-class accuracy, most of the buildings were identified correctly, but with some fuzzy edges. This was mainly caused by the smoothing operation in the generation of the DSM, which also reduced the accuracy of the rain-shed class, as rain-sheds are low and close to buildings. On the other hand, some pixels of trees were not identified, mainly because complex structures, such as tree branches, are generally not reconstructed well by current 3D reconstruction approaches, as illustrated by the rectangles in Figure 8. In addition, different objects may be segmented as one single object if they are close and have similar heights. This kind of error may give objects the wrong side-views; for instance, the rain-shed would have the side-views of trees, as marked by the circles in Figure 8. However, this error does not significantly impact the final classification, because the side-view is just a piece of complementary information; the top-view features still play an important role in the final classification.

Classification with Different Samples
For supervised land-cover classification, the training samples are critical. In practice, depending on the distribution, there are two kinds of samples: (1) evenly collected samples over the entire test site, which we refer to as evenly distributed samples; (2) selectively collected samples covering part of the test site, which we refer to as non-evenly distributed samples. As illustrated in Figure 9a, the evenly distributed samples can offer abundant intra/inter-class information, but they need a considerable amount of labor with scrutiny over the entire image. On the other hand, using non-evenly distributed samples can reduce the manual work and is more efficient at larger scales, but they may not sufficiently represent the data distribution. Considering that these two sample concepts are both very common in practice, we experimented with both of them in our tests.

Classification with Evenly Distributed Samples
The evenly distributed training samples of each class were evenly picked from the reference data at certain intervals. Following the training and prediction process described in Section 2.4.2, we performed the classification with/without side-view features, and the results are shown in Table 4 with user accuracy (calculated by dividing the number of correct classifications for a particular class by the row total). As we can observe from Table 4, the results with side-view have higher overall accuracy and Kappa values (on average, our method improved them by 1.1% and 1.5%, respectively), which means the side-view information offers useful clues for the land-cover classification. The improvement appears limited because the abundant training samples already bring the classifier close to its full capacity, which is difficult to improve further. As shown by the experiment, the side-view can still improve the classification if we do not consider the ground objects (ground, grassland), which do not benefit from this extra information: the average per-class accuracy improvement is then 1.7%.
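The user-accuracy computation described above can be sketched as a small helper over a confusion matrix stored as nested dictionaries (rows = predicted classes, columns = reference classes); the representation is our choice for illustration.

```python
def user_accuracies(confusion):
    """User's accuracy per class from a confusion matrix whose rows are
    predicted classes and columns are reference classes: the diagonal
    entry divided by the row total, as described in the text."""
    accs = {}
    for cls, row in confusion.items():
        total = sum(row.values())
        accs[cls] = row.get(cls, 0) / total if total else 0.0
    return accs
```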
As shown in Figure 10, the classification without side-view incorrectly classified some trees as buildings (marked by circles). This misclassification is mainly caused by the fact that many vegetation-covered roofs have top-view features that are highly similar to those of trees. Conversely, some tropical trees with dense and flat crowns can have very similar top-view features to vegetation-covered roofs. In addition, low vegetation on the roof, as marked by the rectangles in Figure 10, could be misclassified as trees, since it has sufficient height. However, with the differences in side-views, for example, trees usually being greener and darker, the classifier identifies fewer trees as buildings, and vice versa.
It is possible for the side-view information to be incorrect and harm the classification, as shown in the circle in Figure 11, where the trees are better identified without side-view information. From the 3D visualization of this area, we observe that the trees are growing through a roof, giving the trees building side-views. This kind of error is mainly caused by incorrect above-ground segmentation, as discussed in Section 3.1: different objects are segmented together, leading to a mismatching of side-views. However, even though the superpixels of trees are assigned building side-views, their top-views still ensure that some of them are correctly classified.

Classification with Non-Evenly Distributed Samples
Usually, the non-evenly distributed samples are more common and practical in real applications. As illustrated in Figure 9b, in the experiment, the non-evenly distributed training samples were generated by selecting training samples from a sub-region of an image. With the same training and prediction process, the user accuracies of classification with/without side-view features are given in Table 5. Compared to the results of the evenly distributed training samples, the non-evenly distributed samples show degraded performance (around 10% and 16% lower overall accuracy for classification with/without side-view, respectively). It is well understood that such a training sample selection process may not sufficiently represent the data distribution. However, in such a situation, the side-view still improved the average overall accuracy by 5.6%, and the building class even improved by 9.3%. However, in sites B, D and E, the side-view information reduced the accuracy of the tree class. Trees close to buildings and rain-sheds can have unstable side-view features due to occlusion and to the 3D structure of some trees not being reconstructed, which may introduce errors in tree recognition. Thus, with only limited training samples, this instability may harm the training and lead to unreliable predictions. Nevertheless, with high quality training samples, or even limited-quality ones, the involvement of side-views can still greatly improve the land-cover classification, as demonstrated by the classification of the buildings.

Generalization Ability of Side-View Information
To further analyze the generalization ability of the side-view information, we tested the trained classifier on a much larger area (site F, which is 16 times larger than the other sites) at the center of the NUS campus. Due to the very high resolution of these data (the GSD is 7.8 cm; the area contains 8262 × 8721 pixels), we downsampled them to one-third of the original size. In this area, a total of 180,152 superpixel segments were generated, containing low and tall buildings, tropical trees with smooth canopies, interchanging roads and constructions along the ridge of a hill. In the experiment, the classifier was trained on the reference data of the previously mentioned five test sites, and the classification results (with/without side-view) are shown in Figure 12 and Table 6. In Figure 12, we can observe that the side-view has greatly improved the classification accuracy: (1) With the side-view, the overall accuracy and Kappa have been improved by 14.5% and 18.9%.
(2) For the above-ground objects, the overall accuracies of the building and rain-shed classes have been improved by 28.37% and 34.53%, leading to a large category-average improvement of 26.2% (including the tree class). If the classification is performed without side-view information, many buildings are identified as ground, while some trees are identified as buildings. Site F contains a complicated area with various man-made objects and dense trees crossing a hill with large topographic relief (more than 90 m). In Singapore, many buildings with green/playground roofs can be challenging and often mislead the algorithm into producing incorrect results, for example, by classifying green roofs as the tree class. In particular, if the buildings are surrounded by trees or are on a hillside, they can be classified as ground due to the relief of the DSM, as shown in Figure 13. Unlike the top-view or elevation features, which can be sensitive to the DSM relief changes, the side-view features are much more consistent and robust across varying scenarios. For the two examples illustrated in Figure 13, with the side-view information, the buildings on the hillside can be correctly classified, as can the one with a playground roof. As mentioned above, the classification can be sensitive to the training samples. To analyze this, we changed the training samples by alternately removing the samples of one site at a time. In other words, we alternately selected samples from four of the five test sites (A-E) and tested the robustness of the performance with varying training samples. The results with accuracy and Kappa values, and their average (Avg.) and standard deviation (Sd.) values, can be found in Table 7. From Table 7, we observe that the classification with side-view is more robust to the change of training data and has smaller standard deviations for both overall accuracy and Kappa.
We find that the training samples from sites A and D are crucial for the top-view features, as the classification accuracy decreases noticeably without their training data, indicating that the top-view features are sensitive to the training samples. On the contrary, with the side-view information, the performance is stable, indicating that the side-view features are steadier and more robust to the randomness of the training samples.

Discussion
As demonstrated in the experiments, the side-view information can steadily improve the classification performance. However, there are still some issues to discuss further. Firstly, as mentioned, the above-ground object segmentation, which determines the boundaries of each object and the corresponding textures, is critical for the side-view information extraction. In this study, we tried several methods to segment the above-ground objects [30,31]. However, this is quite a complicated problem, and we did not find an obviously better solution than the adopted height-grouping algorithm. There are two issues in the segmentation: incorrect boundaries and under-segmentation of multiple objects. We observed that the first issue does not damage the side-view information, because an incorrect boundary can still offer appropriate locations for the side-view texture. On the other hand, the under-segmentation of multiple objects cannot be ignored, as it confuses the side-views between different objects and classes. To solve this, color differences could be considered along with the height-grouping in the above-ground object segmentation. However, introducing color usually causes over-segmentation, fragmenting objects into pieces and hindering the side-view extraction. Deep learning neural networks [44][45][46] could be promising solutions, which we will explore in the future.
The selection of training samples is another important factor that decides the classification performance. As mentioned in the results, the evenly distributed samples achieve much better performance than the non-evenly distributed ones, because this kind of training sample supplies category-level features instead of object-level ones. The classifier can be well trained with complete data, leading to close-to-ideal performance that is hard to improve further. On the contrary, the non-evenly distributed samples underrepresent the dataset, and the classifier training will be partial, leading to poor classification. This is mainly caused by the high intra-class variability of the top-view features, which makes the classifier vulnerable to untrained data. As shown in the results, the side-view information is more robust and consistent. This also inspires us to consider features of multiple dimensions for object classification and recognition in future works.
In our experiment, we also observed a few misclassified areas; for example, many rain-sheds were not classified correctly. There are two main challenges for rain-shed identification: the rain-sheds are short in height and are close to buildings and trees, whose side-views might be misleading. On the other hand, we found that the ground objects have slightly worse classification results with the side-view. This is mainly caused by errors in the above-ground segmentation: many ground areas are wrongly segmented as above-ground objects due to the limited accuracy of the DSM. In particular, objects on slopes may be mixed with the sloped ground area. Hence, how to extract and use side-view information still needs further development.

Conclusions
In this study, we aimed to fully utilize the information acquired by oblique aerial images and analyze the potential of using side-view information for land-cover classification. To contribute the side-view information to the top-view segments, we proposed a side-view information extraction method, described in Section 2. More specifically, to get the side-view information, we first segment out the above-ground segments with a height-grouping algorithm. Then, based on the boundaries, which have been converted to polygons, their 3D vertical side-view planes are defined. With the perspective transformation, the side-view textures of above-ground objects can be cropped and selected from the oblique images. Finally, from these oblique textures, the side-view information, including color, HOG and Haar-like features, is extracted as extra information for the classification. Our experiments on different test sites show that the side-view can steadily improve the classification accuracy with either evenly distributed or non-evenly distributed training samples (by 1.1% and 5.6%, respectively). Also, the generalization ability of the side-view is evaluated and demonstrated by a 14.5% accuracy improvement when tested on a larger and untrained area.
Even though the side-view features show strong consistency and high robustness across different sites, the training samples are still critical to the classification. In our experiments, we observed some commission errors, primarily from incorrect segmentation results, which should be further improved.
Author Contributions: R.Q. and C.X. initiated this research. C.X. performed the experiment, and X.L. contributed to part of the data processing. C.X. wrote this manuscript; R.Q. and X.L. helped with the edit. All authors have read and agreed to the published version of the manuscript.
Funding: This material is based on research/work supported by the National Research Foundation under Virtual Singapore, award number NRF2015VSG-AA3DCM001-024.