Object classification generally consists of land-cover classification and land-use classification [3,4]. For urban areas, objects in remote sensing images are usually classified into diverse land-use categories, whereas objects in non-urban areas are classified into land-cover categories. In this paper, however, object classification refers to land-cover classification owing to the lack of social attribute data. Desirable object classification relies on solid segmentation results. Segmentation algorithms abound, and these methods can be divided into diverse types from different perspectives [5,6]. According to the nature of the segmentation unit, segmentation algorithms are classified as pixel-level (pixel-based) and object-level (object-based) segmentation [7,8,9,10]. The term geographic object-based image analysis (GeOBIA) was first proposed by Blaschke [11]. Corresponding segmentation algorithms are selected for different experimental purposes; for instance, a superpixel-based segmentation method has been proposed for the classification of very high spatial resolution images (VHSRIs) [12,13]. Despite obvious advantages over pixel-based segmentation algorithms, GeOBIA still requires improvement. To further improve segmentation results, the strategy of stratification was introduced. Zhou et al. and Xu et al. [14] adopted the grey-level co-occurrence matrix (GLCM) for a pre-segmentation that roughly divides the image into several homogeneous regions before multiresolution segmentation (MRS), which brought new possibilities for fine MRS. Additionally, an algorithm adopting a voting strategy, which further increases classification efficiency, was proposed for convolutional neural network (CNN)-based GeOBIA [15,16,17,18].
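The GLCM-based pre-segmentation mentioned above rests on a simple idea: co-occurrence statistics such as homogeneity score high on texturally uniform patches and low on cluttered ones, so thresholding them can coarsely delineate homogeneous regions before fine MRS. A minimal sketch of one GLCM offset and its homogeneity statistic (pure NumPy; quantization to 8 grey levels and the single-offset choice are illustrative assumptions, not the cited authors' exact configuration):

```python
import numpy as np

def glcm(patch, dx=1, dy=0, levels=8):
    """Grey-level co-occurrence matrix for one pixel offset, normalized to sum to 1."""
    # Quantize the 8-bit patch to a small number of grey levels.
    q = np.clip(patch.astype(float) / 256.0 * levels, 0, levels - 1).astype(int)
    m = np.zeros((levels, levels))
    h, w = q.shape
    # Count co-occurring grey-level pairs at the given offset.
    for i in range(h - dy):
        for j in range(w - dx):
            m[q[i, j], q[i + dy, j + dx]] += 1
    return m / m.sum()

def homogeneity(m):
    """GLCM homogeneity: close to 1 for uniform texture, lower for clutter."""
    i, j = np.indices(m.shape)
    return float((m / (1.0 + np.abs(i - j))).sum())

# A flat patch scores maximal homogeneity; a noisy patch scores lower, so
# thresholding this statistic can separate roughly homogeneous regions.
flat = np.full((16, 16), 120, dtype=np.uint8)
noisy = np.random.default_rng(0).integers(0, 256, size=(16, 16), dtype=np.uint8)
print(homogeneity(glcm(flat)), homogeneity(glcm(noisy)))
```

In practice such statistics are computed in a sliding window over the whole scene, and the resulting texture map, rather than single patches, drives the coarse stratification.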
Deep learning (DL) is a widely adopted technique in GeOBIA-based land-cover classification. Deep learning was reinterpreted in 2006 and has been booming ever since under the unremitting efforts of Hinton [19]. The very first application to adopt and realize DL came from LeCun [20], and DL has been compared with traditional machine learning methods, including random forest and support vector machine, ever since [21]. However, LeNet fails to recognize images as complex as VHSRIs and is inefficient for land-cover classification in remote sensing (RS). DL comprises supervised, semi-supervised and unsupervised learning [22]. Among all models, convolutional neural networks (CNNs) are among the most frequently adopted deep supervised learning models. AlexNet, which won the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012), successfully adopted rectified linear units (ReLU) as the activation function and proposed local response normalization (LRN) [23]. The merit of LRN is that it gives AlexNet a strong generalization capacity, enabling it to learn and extract features from complex images, including VHSRIs. Meanwhile, advances in GPUs also encouraged the revival of machine learning. CNNs have been applied in numerous domains, especially in computer-vision-related fields. In the RS field, the CNN has proven to be a reliable tool for feature extraction and classification [24]. Scott et al. [25] further improved land-cover classification accuracy by adopting a deep CNN model. Apart from AlexNet, other more recent CNNs, including VGGNet (Visual Geometry Group Network), ResNet (Residual Neural Network) and the FCN (Fully Convolutional Network), have also been successfully applied in GeOBIA [26]. Moreover, VGG, whose authors argue that CNNs with deeper architectures yield higher classification accuracy, has been applied in VHSRI classification [27]. Apart from CNN models, traditional machine learning (ML) algorithms and other DL methods also perform effectively. Hong et al. [28] and Lu et al. [29] introduced richer convolutional features (RCF) for road and building edge detection in VHSRIs and outperformed traditional methods. Patch-based CNNs and FCNs are two of the most frequently used models at present [30,31]. The FCN outputs a result of exactly the same size as the input through deconvolution (backward learning); however, its structure is tedious. Even though combining CNNs with RS images has produced good results, a significant type of information, i.e., height, is missing from classification. Therefore, the introduction of light detection and ranging (LiDAR) becomes inevitable. Point clouds are data generated by LiDAR and are usually utilized as supplementary data in geoscience-related fields. Multisource data fusion is commonly seen in RS and in its concrete applications such as object classification [
32]. Multisource data fusion has been an indispensable topic in RS since the beginning. Data sensed by diverse sensors, such as point clouds (PC), synthetic aperture radar (SAR), points of interest (POIs), social sensing data [33,34] and surveyed data, have been applied in RS for deeper analysis over the past two decades [35,36,37]. Different data reflect unique features of diverse objects. POIs refer to objects that attract specific researchers for a certain purpose and are promising for RS image classification; typical POIs are buildings, bus stops, railway stations, hospitals, etc. Compared to RS imagery, POIs contain social attributes [38]. However, the addition of POIs has its own drawback: the information reflected by POIs may not be correct or timely. Similarly, social sensing data also reflect social properties that traditional RS data hardly show [39]. Surveyed data reflect geometric information and may serve well as auxiliary data. Point clouds are generated by LiDAR laser sensors and mainly reveal the elevation of the study area [40,41,42]. A PC can be studied separately and can also serve as auxiliary data in RS imagery analysis. The combination of PCs and RS images demonstrates that the addition of elevation information further improves classification results [43,44].
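A common way to inject LiDAR elevation into image classification is to rasterize the point cloud into a normalized digital surface model (nDSM), i.e., the surface model minus the terrain model, so that only height above ground remains. A minimal sketch under simplifying assumptions (per-cell maximum as the surface and per-cell minimum as the ground; real pipelines use a proper ground-filtering step, and this is not the StdnDSM fusion proposed later in this paper):

```python
import numpy as np

def ndsm_from_points(xyz, cell=1.0):
    """Rasterize a LiDAR point cloud (N x 3 array of x, y, z) into a crude nDSM.

    DSM = highest return per cell; DTM is approximated by the lowest return
    per cell; nDSM = DSM - DTM gives height above ground.
    """
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    ix = ((x - x.min()) / cell).astype(int)
    iy = ((y - y.min()) / cell).astype(int)
    h, w = iy.max() + 1, ix.max() + 1
    dsm = np.full((h, w), -np.inf)
    dtm = np.full((h, w), np.inf)
    # Unbuffered per-cell maximum (surface) and minimum (approximate ground).
    np.maximum.at(dsm, (iy, ix), z)
    np.minimum.at(dtm, (iy, ix), z)
    return np.where(np.isfinite(dsm), dsm - dtm, 0.0)

# Two returns in one cell: ground at 10 m, roof at 16 m -> nDSM of 6 m there.
pts = np.array([[0.2, 0.3, 10.0], [0.4, 0.6, 16.0], [1.5, 0.5, 10.0]])
grid = ndsm_from_points(pts, cell=1.0)
print(grid)  # [[6. 0.]]
```

The resulting raster can then be stacked with the spectral bands of the VHSRI as an additional channel before segmentation and classification.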
However, although research adopting GeOBIA and DL for VHSRI classification abounds, as does work fusing point clouds into RS land-cover classification, further efforts are still required. Firstly, owing to the complexity of scale dependence in GeOBIA segmentation, fine segmentation is the key to GeOBIA classification. Secondly, from the perspective of classification features, the main purpose of data fusion is to effectively enhance the feature differences between object classes or to significantly increase the information entropy. As a result, the performance of point cloud fusion and the quality of image segmentation should both be further improved.
To address these requirements, this paper explored a CNN-based land-cover classification method combining stratified segmentation with the fusion of point cloud and VHSRI data. First, a new fusion product of PC and VHSRIs, named the standard normalized digital surface model (StdnDSM), was proposed; then a stratification strategy combining the GLCM and MRS was applied to segment the fused data; and finally a finely tuned CNN model was utilized to train samples and classify land-cover objects. Image entropy was introduced to evaluate image quality for the StdnDSM. For CNN-based GeOBIA classification, a region majority voting strategy was applied to accelerate the procedure and to avoid extreme situations that former methods fail to handle. A scene in Helsinki was chosen as the study area, and the corresponding data were collected for the study.
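The image-entropy criterion mentioned above is commonly taken as the Shannon entropy of the grey-level histogram: the more evenly an image uses its grey levels, the more information it carries, which is the sense in which a fused band can be compared against its inputs. A minimal sketch (assuming 8-bit imagery; the exact formulation used for the StdnDSM may differ):

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy (in bits) of an 8-bit image's grey-level histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

# A constant image has zero entropy; an image using all 256 grey levels
# uniformly reaches the maximum of 8 bits.
flat = np.zeros((64, 64), dtype=np.uint8)
ramp = np.tile(np.arange(256, dtype=np.uint8), 16).reshape(64, 64)
print(image_entropy(flat), image_entropy(ramp))  # 0.0 8.0
```

Under this criterion, a fusion that raises entropy relative to the original bands is interpreted as adding usable information rather than redundancy.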