Incorporating Deep Features into GEOBIA Paradigm for Remote Sensing Imagery Classiﬁcation: A Patch-Based Approach

: The fast and accurate creation of land use / land cover maps from very-high-resolution (VHR) remote sensing imagery is crucial for urban planning and environmental monitoring. Geographic object-based image analysis methods (GEOBIA) provide an e ﬀ ective solution using image objects instead of individual pixels in VHR remote sensing imagery analysis. Simultaneously, convolutional neural networks (CNN) have been widely used in the image processing ﬁeld because of their powerful feature extraction capabilities. This study presents a patch-based strategy for integrating deep features into GEOBIA for VHR remote sensing imagery classiﬁcation. To extract deep features from irregular image objects through CNN, a patch-based approach is proposed for representing image objects and learning patch-based deep features, and a deep features aggregation method is proposed for aggregating patch-based deep features into object-based deep features. Finally, both object and deep features are integrated into a GEOBIA paradigm for classifying image objects. We explored the inﬂuences of segmentation scales and patch sizes in our method and explored the e ﬀ ectiveness of deep and object features in classiﬁcation. Moreover, we performed 5-fold stratiﬁed cross validations 50 times to explore the uncertainty of our method. Additionally, we explored the importance of deep feature aggregation, and we evaluated our method by comparing it with three state-of-the-art methods in a Beijing dataset and Zurich dataset. The results indicate that smaller segmentation scales were more conducive to VHR remote sensing imagery classiﬁcation, and it was not appropriate to select too large or too small patches as the patch size should be determined by imagery and its resolution. Moreover, we found that deep features are more e ﬀ ective than object features, while object features still matter for image classiﬁcation, and deep feature aggregation is a critical step in our method. Finally, our method can achieve the highest overall accuracies compared with the state-of-the-art methods, and the overall accuracies are 91.21% for the Beijing dataset and 99.05% for the Zurich dataset.


Introduction
Due to its real-time and low-cost, land use/land cover (LULC) mapping using remote sensing images has received widespread attention in recent decades [1]. With the developments in remote sensing technologies, the spatial resolution of remote sensing images becomes increasingly finer, which not only provides new opportunities for mapping LULC with more detailed resolution [2] but also brings some new challenges. Ground entities in very-high-resolution (VHR) imagery often appear to have complex structures, demonstrating the increased heterogeneities within ground entities, interclass similarities and intraclass differences, thus hindering LULC mapping [3,4]. Therefore, there is a high demand for an effective VHR image classification strategy to resolve these challenges.

Methodology
The framework (Figure 1) of the proposed approach includes the following three steps:

•
Object features extracting and patches generating. The VHR image was segmented into image objects using the multiresolution segmentation method [35] with eCognition, and object features were extracted simultaneously. Additionally, image objects are irregular polygons, while the inputs of the CNN are regular image patches. To extract deep features for image objects, a patches representation strategy was presented to represent every image object as a set of regular image patches. • CNN training and deep features learning. The reference map of VHR images was obtained by careful visual interpretation, and labelled patches were obtained by random sampling from the reference map. A CNN model was trained through the labelled image patches for obtaining the deep features of patches. Then, a deep feature aggregation approach was performed to obtain the deep features of image objects.

•
Features concatenating and image objects classification. Labelled image objects were selected through the reference map, and a random forest classifier (RFC) was trained. Finally, object and deep features of image objects were concatenated together to obtain classification results by the trained RFC.

Patches Representations of Image Objects
As shown in Figure 2, image objects are irregular polygons, while the inputs of the CNN are regular patches. Thus, patch-based learning must be adapted for analyzing image objects. To resolve this issue, a patch-based strategy was presented for representing an image object as a set of patches, with each patch being a part of the image object. The method of determining the patche representation has great influences on the final classification results. In this study, the center pixel of an image object was considered as the center of the first patch. Then, the patches with fixed sizes were placed to cover the entire image object, and any two patches could not overlap. Therefore, a set of patches were obtained to represent an irregular image object, and then deep features could be extracted from these patches.

Convolutional Neural Networks
CNN is a multilayer feed-forward neural network commonly composed of convolutional layers, pooling layers, and fully connected layers [14]. The convolutional layers and pooling layers are used to extract deep features from the input image patches, while the fully connected layers are exploited to classify image patches with these deep features. The operations performed in the convolutional layers and the pooling layers can be summarized as: where the denotes the input feature maps of the layer; the and the denote the weights and biases of convolutional layer, respectively; the (•) represents the nonlinearity function (e.g., sigmoid, tanh, or rectified linear unit (ReLU)); and the represents a pooling operation (max pooling or mean pooling) with a kernel size of . The output feature maps of the layer can be obtained through those operations [36].
Many CNN models have been proposed in recent years, such as GoogLeNet [37], VGG [38], ResNet [39], and DenseNet [40]. Since DenseNet outperforms the others in deep learning tasks, it was chosen to extract deep features of image objects in this study. DenseNet is mainly composed of dense

Patches Representations of Image Objects
As shown in Figure 2, image objects are irregular polygons, while the inputs of the CNN are regular patches. Thus, patch-based learning must be adapted for analyzing image objects. To resolve this issue, a patch-based strategy was presented for representing an image object as a set of patches, with each patch being a part of the image object. The method of determining the patche representation has great influences on the final classification results. In this study, the center pixel of an image object was considered as the center of the first patch. Then, the patches with fixed sizes were placed to cover the entire image object, and any two patches could not overlap. Therefore, a set of patches were obtained to represent an irregular image object, and then deep features could be extracted from these patches.

Patches Representations of Image Objects
As shown in Figure 2, image objects are irregular polygons, while the inputs of the CNN are regular patches. Thus, patch-based learning must be adapted for analyzing image objects. To resolve this issue, a patch-based strategy was presented for representing an image object as a set of patches, with each patch being a part of the image object. The method of determining the patche representation has great influences on the final classification results. In this study, the center pixel of an image object was considered as the center of the first patch. Then, the patches with fixed sizes were placed to cover the entire image object, and any two patches could not overlap. Therefore, a set of patches were obtained to represent an irregular image object, and then deep features could be extracted from these patches.

Convolutional Neural Networks
CNN is a multilayer feed-forward neural network commonly composed of convolutional layers, pooling layers, and fully connected layers [14]. The convolutional layers and pooling layers are used to extract deep features from the input image patches, while the fully connected layers are exploited to classify image patches with these deep features. The operations performed in the convolutional layers and the pooling layers can be summarized as: where the denotes the input feature maps of the layer; the and the denote the weights and biases of convolutional layer, respectively; the (•) represents the nonlinearity function (e.g., sigmoid, tanh, or rectified linear unit (ReLU)); and the represents a pooling operation (max pooling or mean pooling) with a kernel size of . The output feature maps of the layer can be obtained through those operations [36].
Many CNN models have been proposed in recent years, such as GoogLeNet [37], VGG [38], ResNet [39], and DenseNet [40]. Since DenseNet outperforms the others in deep learning tasks, it was chosen to extract deep features of image objects in this study. DenseNet is mainly composed of dense

Convolutional Neural Networks
CNN is a multilayer feed-forward neural network commonly composed of convolutional layers, pooling layers, and fully connected layers [14]. The convolutional layers and pooling layers are used to extract deep features from the input image patches, while the fully connected layers are exploited to classify image patches with these deep features. The operations performed in the convolutional layers and the pooling layers can be summarized as: where the X l−1 denotes the input feature maps of the l th layer; the W l and the B l denote the weights and biases of convolutional layer, respectively; the f (·) represents the nonlinearity function (e.g., sigmoid, tanh, or rectified linear unit (ReLU)); and the pool p represents a pooling operation (max pooling or mean pooling) with a kernel size of p. The output feature maps of the l th layer X l can be obtained through those operations [36].

of 24
Many CNN models have been proposed in recent years, such as GoogLeNet [37], VGG [38], ResNet [39], and DenseNet [40]. Since DenseNet outperforms the others in deep learning tasks, it was chosen to extract deep features of image objects in this study. DenseNet is mainly composed of dense blocks (DBs) and transition layers (TLs). Each DB further contains several convolutional blocks (CBs). Unlike the other CNN structures, there is a connection between every two CBs in a DB. Consider a DB composed of L CBs. x 0 is the input of the first CB and x l is the output of the l th CB. F l is a composite function of operations of the l th CB such as batch normalization (BN), ReLU, pooling, and convolution (Conv.) (Figure 3). The l th CB receives the feature maps of all preceding CBs, x 0 , · · · , x l−1 , as input ( Figure 4): where the operation [·] concatenates different feature maps together [40]. Because feature maps with different sizes cannot be concatenated together, a TL ( Figure 5) was needed to connect two DBs. The size of feature maps was reduced by the TL due to the pooling operation. Finally, DenseNet ( Figure 6) is composed of DBs and TLs alternately connected, whose inputs are image patches and outputs are the classes of center pixels of these image patches, and it was trained through labelled image patches which were obtained from the reference map. Hu et al. [41] demonstrated that the outputs of the last convolutional layer are effective for scene classification. For standardizing deep features, the outputs of BN layer connected after the last convolutional layer were considered as deep features in this paper.  (Figure 3). The CB receives the feature maps of all preceding CBs, , ⋯ , , as input ( Figure 4): where the operation • concatenates different feature maps together [40]. Because feature maps with different sizes cannot be concatenated together, a TL ( Figure 5) was needed to connect two DBs. The size of feature maps was reduced by the TL due to the pooling operation. Finally, DenseNet ( Figure  6) is composed of DBs and TLs alternately connected, whose inputs are image patches and outputs are the classes of center pixels of these image patches, and it was trained through labelled image patches which were obtained from the reference map. Hu et al. [41] demonstrated that the outputs of the last convolutional layer are effective for scene classification. For standardizing deep features, the outputs of BN layer connected after the last convolutional layer were considered as deep features in this paper.      (Figure 3). The CB receives the feature maps of all preceding CBs, , ⋯ , , as input ( Figure 4): where the operation • concatenates different feature maps together [40]. Because feature maps with different sizes cannot be concatenated together, a TL ( Figure 5) was needed to connect two DBs. The size of feature maps was reduced by the TL due to the pooling operation. Finally, DenseNet ( Figure  6) is composed of DBs and TLs alternately connected, whose inputs are image patches and outputs are the classes of center pixels of these image patches, and it was trained through labelled image patches which were obtained from the reference map. Hu et al. [41] demonstrated that the outputs of the last convolutional layer are effective for scene classification. For standardizing deep features, the outputs of BN layer connected after the last convolutional layer were considered as deep features in this paper.      (Figure 3). The CB receives the feature maps of all preceding CBs, , ⋯ , , as input ( Figure 4): where the operation • concatenates different feature maps together [40]. Because feature maps with different sizes cannot be concatenated together, a TL ( Figure 5) was needed to connect two DBs. The size of feature maps was reduced by the TL due to the pooling operation. Finally, DenseNet ( Figure  6) is composed of DBs and TLs alternately connected, whose inputs are image patches and outputs are the classes of center pixels of these image patches, and it was trained through labelled image patches which were obtained from the reference map. Hu et al. [41] demonstrated that the outputs of the last convolutional layer are effective for scene classification. For standardizing deep features, the outputs of BN layer connected after the last convolutional layer were considered as deep features in this paper.

Deep Feature Aggregation of Image Objects
Through the patches of an image object and the trained CNN model, deep features (the outputs of the last BN layer) of each patch can be learned. The feature map of each patch is composed of matrices with size × . However, these deep features are patch-based, object-based deep features which are needed for classifying image objects. Additionally, the numbers of patches of image objects vary over the sizes and the shapes of image objects. Therefore, different numbers of deep features can be learned for different image objects due to the differences in the number of patches. However, to classify image objects, deep features with identical dimension for all image objects are needed. To resolve this issue, a deep feature aggregation method is proposed in this paper to aggregate the various dimension features of different image objects into identical dimension features for all image objects.
As shown in Figure 7, considering an image object represented by patches, a total of × feature maps with size × can be learned, and these feature maps are arranged by their spatial position in Figure 7b. The feature map set 1 in Figure 7b contains the first feature map of all patches: feature map set 2, feature map set 3, and feature map set . Since the × dimension of each feature map was too high, each feature map needed to be compressed into a single value, so that each feature map set contained values. In other words, there are -dimensional vectors in Figure 7c. The feature compression can be achieved by one of the following operators: mean, variance, maximum, and minimum, for example. The mean operator was used in this paper. To assign these -dimensional deep features of patches to one -dimensional deep features of the image object, an aggregation process needs to be performed. Since in some cases, only parts of patches fall inside an image object, a weighted summation was performed to achieve the aggregation and solve the boundary problem. The weight of each patch was equal to the number of pixels belonging to the image object in the patch divided by the total number of pixels of the image object or the number of pixels of the patch. The result of the aggregation was a -dimensional vector as shown in Figure 7e. Finally, image objects could be classified with these aggregated features and object features.

Deep Feature Aggregation of Image Objects
Through the patches of an image object and the trained CNN model, deep features (the outputs of the last BN layer) of each patch can be learned. The feature map of each patch is composed of m matrices with size s × s. However, these deep features are patch-based, object-based deep features which are needed for classifying image objects. Additionally, the numbers of patches of image objects vary over the sizes and the shapes of image objects. Therefore, different numbers of deep features can be learned for different image objects due to the differences in the number of patches. However, to classify image objects, deep features with identical dimension for all image objects are needed. To resolve this issue, a deep feature aggregation method is proposed in this paper to aggregate the various dimension features of different image objects into identical dimension features for all image objects.
As shown in Figure 7, considering an image object represented by n patches, a total of n × m feature maps with size s × s can be learned, and these feature maps are arranged by their spatial position in Figure 7b. The feature map set 1 in Figure 7b contains the first feature map of all n patches: feature map set 2, feature map set 3, and feature map set m. Since the s × s dimension of each feature map was too high, each feature map needed to be compressed into a single value, so that each feature map set contained n values. In other words, there are n m-dimensional vectors in Figure 7c. The feature compression can be achieved by one of the following operators: mean, variance, maximum, and minimum, for example. The mean operator was used in this paper. To assign these n m-dimensional deep features of patches to one m-dimensional deep features of the image object, an aggregation process needs to be performed. Since in some cases, only parts of patches fall inside an image object, a weighted summation was performed to achieve the aggregation and solve the boundary problem. The weight of each patch was equal to the number of pixels belonging to the image object in the patch divided by the total number of pixels of the image object or the number of pixels of the patch. The result of the aggregation was a m-dimensional vector as shown in Figure 7e. Finally, image objects could be classified with these aggregated features and object features.

Deep Feature Aggregation of Image Objects
Through the patches of an image object and the trained CNN model, deep features (the outputs of the last BN layer) of each patch can be learned. The feature map of each patch is composed of matrices with size × . However, these deep features are patch-based, object-based deep features which are needed for classifying image objects. Additionally, the numbers of patches of image objects vary over the sizes and the shapes of image objects. Therefore, different numbers of deep features can be learned for different image objects due to the differences in the number of patches. However, to classify image objects, deep features with identical dimension for all image objects are needed. To resolve this issue, a deep feature aggregation method is proposed in this paper to aggregate the various dimension features of different image objects into identical dimension features for all image objects.
As shown in Figure 7, considering an image object represented by patches, a total of × feature maps with size × can be learned, and these feature maps are arranged by their spatial position in Figure 7b. The feature map set 1 in Figure 7b contains the first feature map of all patches: feature map set 2, feature map set 3, and feature map set . Since the × dimension of each feature map was too high, each feature map needed to be compressed into a single value, so that each feature map set contained values. In other words, there are -dimensional vectors in Figure 7c. The feature compression can be achieved by one of the following operators: mean, variance, maximum, and minimum, for example. The mean operator was used in this paper. To assign these -dimensional deep features of patches to one -dimensional deep features of the image object, an aggregation process needs to be performed. Since in some cases, only parts of patches fall inside an image object, a weighted summation was performed to achieve the aggregation and solve the boundary problem. The weight of each patch was equal to the number of pixels belonging to the image object in the patch divided by the total number of pixels of the image object or the number of pixels of the patch. The result of the aggregation was a -dimensional vector as shown in Figure 7e. Finally, image objects could be classified with these aggregated features and object features.

Image Object Classification Using Random Forest Classifier
Random forest (RF) is composed of a multiple classification and regression tree (CART), where each tree is generated using a bootstrap sampling from the input vector and casts a unit vote for the most popular class to classify the input vector [42]. The RF does not overfit because of the Law of Large Numbers and requires two user-defined parameters: the number of trees and the number of random split variables [43]. In [43], it was also demonstrated that once the number of trees reaches a state (100 trees), the number of random split variables only alters the classifier's accuracy slightly. Additionally, RF is relatively robust to reduce the training set size and noise and can handle categorical data and data with missing values [44]. Therefore, RF was used for image classification in this paper. A m-dimensional deep feature of image objects can be obtained through Section 2.1, Section 2.2, and Section 2.3, and some object features, such as mean values, standard deviations, normalized difference vegetation index (NDVI), shape index, and eight metrics of the gray-level co-occurrence matrix, were also selected for classification. As a result, deep features and object features were concatenated together to train an, RFC and categories of image objects could be obtained (Figure 1, features concatenating and image objects classification).

Accuracy Assessment
There are lots of methods for the accuracy assessment of remote sensing imagery classification. The Kappa index [45] is widely used in evaluating classification results, but it will introduce problems in calculation and interpretation because the Kappa index is a ratio [46]. In [47], the Bradley-Terry model was used to quantify association in remotely sensed images. In this research, not only the classification results needed to be evaluated, but also the segmentation results needed to be evaluated. Therefore, the segmentation evaluation method proposed in [48] was used; classification results were evaluated by computing the confusion matrix based on the unit of segmentation accuracy assessment.
For segmentation, assume that X = {x i } n i=1 is a set of n image objects, and Y = y k m k=1 is a set of m reference polygons. For each x ∈ X and y ∈ Y, if the overlapping degree between x and y is larger than a threshold, x will be regarded as a correspondence of y.
where area(x ∩ y) denotes the overlapping area between x and y, the area(x) denotes the area of x, and the area(y) denotes the area of y. Based on Equation (3), the correspondences between image objects and reference polygons were determined. For an image object x l and its corresponding reference polygon y l , the following three indices can be defined to evaluate their consistent degree [48]: where oversegmentation index OSeg l signifies to what degree reference polygon y l is oversegmented by image object x l ; undersegmentation index USeg l implies to what degree reference polygon y l is undersegmented by image object x l ; while index RMS l refers to the root mean square. Both OSeg l and USeg l measure how image object x l fits with its corresponding reference polygon y l , and RMS l integrates these two indices into one single value. OSeg, USeg, and RMS are the averages of all OSeg l , USeg l , and RMS l of all objects. The values of OSeg, USeg, and RMS range from 0 to 1. The smaller the three values, the better the segmentation results; thus, OSeg = 0, USeg = 0, and RMS = 0 signify the best segmentation which was hardly achieved. For classification, the same as the segmentation, is a set of n image objects, and Y = y k m k=1 is a set of m reference polygons. Considering there are N classes in classification results, and the semantic label of an image object or a reference polygon is c C = {1, 2, . . . , N}. Therefore, the confusion matrix (CM) can be computed as: CM i∈C, j∈C = x∈X and y∈Y and SL(x)==i and SL(y)== j area(x ∩ y) where SL( * ) denotes the label of an image object or a reference polygon. Therefore, the overall accuracy (OA), producer accuracy (PA), and user accuracy (UA) were computed as:

Image Datasets
For the experiments, two datasets were used. One was a WorldView-2 imagery of Beijing composed of near infrared, red, green band in 2010 ( Figure 8a). The imagery size was 10, 000 × 10, 000, and the spatial resolution was 0.5 m. The other one was a QuickBird imagery of Zurich composed of near infrared, red, green, blue band in 2002 ( Figure 8c). The imagery size was 1195 × 1264, and the spatial resolution was 0.6 m.

Training and Validation Datasets
Due to the inconsistencies between image objects and the inputs of CNN, it is hard to make an end-to-end training for image object classification. The approach proposed in this paper contains two training processes. The first is the CNN training, and the other is the RF training. All training samples were obtained from the reference map.
The training samples for CNN were labelled image patches which were obtained by extending The image range of Beijing dataset is located between Beijing North Third Ring Road and North Fifth Ring Road. The land cover types in the experimental area include buildings, roads, vegetation, water, and bare soils. Due to the large number of buildings, there are also lots of shadows. Different buildings and roads vary greatly in spectral and spatial structures, which makes it difficult to obtain an accurate land cover map. There are regular residential buildings, irregular commercial buildings, wide and long roads, and narrow and discontinuous residential lanes. Moreover, water and vegetation are mainly distributed in the northwest of the experimental area, and the area of bare soils is the smallest. The reference map was obtained by careful visual interpretation (Figure 8b).
The Zurich dataset [49] is a public dataset, which can be downloaded at https://sites.google.com/ site/michelevolpiresearch/data/zurich-dataset. There are 20 chips in the Zurich dataset, and the first chip was used in this paper. Six classes including buildings, roads, railways, trees, grass, and bare soils are presented in the reference map ( Figure 8d). Because the image range of the Zurich dataset is small, this dataset is not suitable for parameter analysis and was only used to compare with the state-of-the-art methods.

Training and Validation Datasets
Due to the inconsistencies between image objects and the inputs of CNN, it is hard to make an end-to-end training for image object classification. The approach proposed in this paper contains two training processes. The first is the CNN training, and the other is the RF training. All training samples were obtained from the reference map.
The training samples for CNN were labelled image patches which were obtained by extending the pixels in the reference map into image patches. For each class, there were 1000 labelled image patches for training and 500 labelled image patches for validation. The 500 labelled image patches were used to validate if the CNN could extract representative features.
RFC was used to classify image objects by considering both deep and object features. The training samples were labelled image objects, and the number of training samples depended on the segmentation scale, which is a parameter of the multiresolution segmentation algorithm. For image objects at a certain segmentation scale, object-based samples were selected based on the reference map. If more than half of the pixels of an image object belonged to one class c, the image object would be assigned to class c as a sample. Finally, approximately four percent of image object samples were selected as training samples of RFC at each segmentation scale.

Parameter Settings
Segmentation scales determine whether image objects can describe geographic objects exactly. If segmentation scales are too small, geographic objects will be segmented into fragmentary objects, leading to oversegmentation. On the contrary, if segmentation scales are too large, an image object will be related to several geographic objects, leading to undersegmentation. Generally, it is difficult to find a suitable segmentation scale because both over-and undersegmentation always exist simultaneously [50]. Additionally, patch size greatly influences the extraction of deep features. Therefore, nine segmentation scales range from 50 to 210 with an interval of 20 and patch size range from 16 to 64 with an interval of eight were used to explore the influences of segmentation scales on the Beijing dataset. Moreover, to verify the effectiveness of the approach proposed in this paper, comparisons were conducted between our approach with some state-of-the-art methods in both the Beijing dataset and Zurich dataset. For the Beijing dataset, the segmentation scales of all methods are 110, and the patch size of our approach was 48. For the Zurich dataset, the segmentation scales of all methods are 10, and the patch size of our approach was 32. Since seven patch sizes were adopted in the experiments, corresponding to CNN models with seven input sizes, the network structures of CNN in these experiments were the same ( Figure 6) except for the input sizes. In this structure, the first DB contained three CBs, the second DB contained six CBs, and the last DB contained four CBs. In the CNN training process, an Adam optimizer was used, and the learning rates, beta1 and beta2, were 0.0001, 0.9, and 0.999, respectively.
The batch size was 32. In addition, to prevent overfitting, the method of early stopping was used; that is, if the accuracy of the validation set does not increase for five epochs, then training is stopped. Tensorflow was used in CNN training and deep feature extraction, and scikit-learn was used in image object classification using RFC.

Land Cover Classification Results
After CNN training, deep feature extraction and aggregation, and RF classification with deep and object features, the classification results of the Beijing dataset and Zurich dataset were obtained and are shown in Figure 9. Tables 1 and 2 present the PA and the UA of each class in the Beijing dataset and Zurich dataset, respectively.
Remote Sens. 2020, 12, x; doi: FOR PEER REVIEW www.mdpi.com/journal/remotesensing Beijing dataset and Zurich dataset. For the Beijing dataset, the segmentation scales of all methods are 110, and the patch size of our approach was 48. For the Zurich dataset, the segmentation scales of all methods are 10, and the patch size of our approach was 32. Since seven patch sizes were adopted in the experiments, corresponding to CNN models with seven input sizes, the network structures of CNN in these experiments were the same ( Figure 6) except for the input sizes. In this structure, the first DB contained three CBs, the second DB contained six CBs, and the last DB contained four CBs.
In the CNN training process, an Adam optimizer was used, and the learning rates, beta1 and beta2, were 0.0001, 0.9, and 0.999, respectively. The batch size was 32. In addition, to prevent overfitting, the method of early stopping was used; that is, if the accuracy of the validation set does not increase for five epochs, then training is stopped. Tensorflow was used in CNN training and deep feature extraction, and scikit-learn was used in image object classification using RFC.

Land Cover Classification Results
After CNN training, deep feature extraction and aggregation, and RF classification with deep and object features, the classification results of the Beijing dataset and Zurich dataset were obtained and are shown in Figure 9. Tables 1 and 2 present the and the of each class in the Beijing dataset and Zurich dataset, respectively.   For the Beijing dataset, the OA was 91.21%, which explains the feasibility of using our method for land cover classification. From the perspective of PA, Water obtained the highest accuracy, and Buildings obtained the lowest accuracy. From the perspective of UA, Buildings obtained the highest accuracy, and Bare soils obtained the lowest accuracy. After comprehensive consideration, Roads obtained the highest classification accuracy, which is also consistent with the classification map.
For the Zurich dataset, the OA was 99.05%. Such a high overall accuracy is mainly because the scene of the Zurich dataset is more single than the Beijing dataset. For example, Buildings in the Beijing dataset are diverse. There are residential buildings, commercial building, schools, and museums in the Beijing dataset. On the contrary, buildings are mainly residential buildings in the Zurich dataset.
From the perspective of PA and UA, Bare soils obtained the highest classification accuracy, and Trees obtained the lowest classification accuracy.
To further explore the uncertainty of our method, 5-fold stratified cross validations were performed 50 times in RF classification with deep and object features. The box plots of classification accuracies are shown in Figure 10. It can be seen from Figure 10a,c that the range of overall accuracies of the Beijing dataset and Zurich dataset was very small, which demonstrates that overall accuracy is less affected by the random selection of samples. Therefore, our method is quite robust for different samples. For the Beijing dataset (Figure 10b), the producer accuracies of Buildings, Vegetation, and Shadows were higher than Roads, Water, and Bare soils. At the same time, the range of producer accuracy of Buildings. Vegetation and Shadows are lower than Roads, Water and Bare soils. For Zurich dataset (Figure 10d), Trees gets the lowest producer accuracy and get the highest range of producer accuracy.

Influences of Segmentation Scales
Nine segmentation scales were used to explore the influences of segmentation scales on the Beijing dataset in this section. Indices , , and of different ground entities at different scales are shown in Figure 11. When segmentation scales became larger, the values of indices and of all ground entities became smaller, while the values of became larger. At the segmentation scale of 210, the values of were still much higher than that of for all ground entities except for shadows, meaning that these ground entities were still oversegmented. Buildings were also oversegmented because buildings with windows and roofs were heterogeneous; vegetation was oversegmented because their distribution were uneven: some places are dense, while some are sparse. Bare soils were also heterogeneous. Only shadows were relatively homogeneous; thus, the oversegmentation phenomenon was not significant. This means that a larger segmentation scale should be utilized to segment shadows. However, the conclusion may be different from the perspective of classification results.

Influences of Segmentation Scales
Nine segmentation scales were used to explore the influences of segmentation scales on the Beijing dataset in this section. Indices OSeg, USeg, and RMS of different ground entities at different scales are shown in Figure 11. When segmentation scales became larger, the values of indices OSeg and RMS of all ground entities became smaller, while the values of USeg became larger. At the segmentation scale of 210, the values of OSeg were still much higher than that of USeg for all ground entities except for shadows, meaning that these ground entities were still oversegmented. Buildings were also oversegmented because buildings with windows and roofs were heterogeneous; vegetation was oversegmented because their distribution were uneven: some places are dense, while some are sparse. Bare soils were also heterogeneous. Only shadows were relatively homogeneous; thus, the oversegmentation phenomenon was not significant. This means that a larger segmentation scale should be utilized to segment shadows. However, the conclusion may be different from the perspective of classification results. Remote Sens. 2020, 12,   For image objects generated by the above nine segmentation scales, the proposed approach was used for classification. The overall classification accuracies varied over segmentation scales ( Figure  12), and so do the producer accuracies of different classes ( Figure 13). As segmentation scales become larger, the overall accuracies remain unchanged first, and then decrease slightly. There were similar trends for different ground entities. Therefore, a smaller segmentation scale should be used to generate image objects in terms of classification results. This was exactly the opposite conclusion to the segmentation evaluation results.  For image objects generated by the above nine segmentation scales, the proposed approach was used for classification. The overall classification accuracies varied over segmentation scales (Figure 12), and so do the producer accuracies of different classes ( Figure 13). As segmentation scales become larger, the overall accuracies remain unchanged first, and then decrease slightly. There were similar trends for different ground entities. Therefore, a smaller segmentation scale should be used to generate image objects in terms of classification results. This was exactly the opposite conclusion to the segmentation evaluation results. For image objects generated by the above nine segmentation scales, the proposed approach for classification. The overall classification accuracies varied over segmentation scales (F nd so do the producer accuracies of different classes ( Figure 13). As segmentation scales be r, the overall accuracies remain unchanged first, and then decrease slightly. There were si s for different ground entities. Therefore, a smaller segmentation scale should be us rate image objects in terms of classification results. This was exactly the opposite conclusi egmentation evaluation results.  To explore the reasons for the segmentation scale selection, Figure 14 shows the frequency distribution histograms of and at the two segmentation scales of 50 and 210. Apparently, when the segmentation scale was 50, over 90% image objects had values greater than 0.8, while less than 1% image objects had values greater than 0.5. That is, most image objects were oversegmented at the scale 50. When the segmentation scale was 210, both the over-and undersegmentation coexisted. However, how do over-and undersegmentation impact classification? As shown in Figure 15, A-G represent geographic objects with different categories, a-f represent oversegmented image objects of G, and g represents undersegmented image objects of G. For oversegmented image objects a-f, whether they were classified correctly or not, they affected the classification results of adjacent geographic objects A-F slightly. However, for undersegmented image object g, it was mixed by diverse classes of pixels. If g was classified correctly, the pixels of adjacent geographic objects A-F would be misclassified, because A-F are different categories with G. If g was misclassified, geographic object G would be misclassified, too. Therefore, whether undersegmented image objects are classified correctly or not will lead to incorrect classification results. All in all, if the segmentation cannot completely coincide with image objects and geographic objects, a smaller segmentation scale is recommended in this paper, but the scale should not be too small; otherwise, the phenomenon of the "salt and pepper effect" will occur.  To explore the reasons for the segmentation scale selection, Figure 14 shows the frequency distribution histograms of OSeg l and USeg l at the two segmentation scales of 50 and 210. Apparently, when the segmentation scale was 50, over 90% image objects had OSeg values greater than 0.8, while less than 1% image objects had USeg values greater than 0.5. That is, most image objects were oversegmented at the scale 50. When the segmentation scale was 210, both the over-and undersegmentation coexisted. However, how do over-and undersegmentation impact classification? As shown in Figure 15, A-G represent geographic objects with different categories, a-f represent oversegmented image objects of G, and g represents undersegmented image objects of G. For oversegmented image objects a-f, whether they were classified correctly or not, they affected the classification results of adjacent geographic objects A-F slightly. However, for undersegmented image object g, it was mixed by diverse classes of pixels. If g was classified correctly, the pixels of adjacent geographic objects A-F would be misclassified, because A-F are different categories with G. If g was misclassified, geographic object G would be misclassified, too. Therefore, whether undersegmented image objects are classified correctly or not will lead to incorrect classification results. All in all, if the segmentation cannot completely coincide with image objects and geographic objects, a smaller segmentation scale is recommended in this paper, but the scale should not be too small; otherwise, the phenomenon of the "salt and pepper effect" will occur.  To explore the reasons for the segmentation scale selection, Figure 14 shows the frequency distribution histograms of and at the two segmentation scales of 50 and 210. Apparently, when the segmentation scale was 50, over 90% image objects had values greater than 0.8, while less than 1% image objects had values greater than 0.5. That is, most image objects were oversegmented at the scale 50. When the segmentation scale was 210, both the over-and undersegmentation coexisted. However, how do over-and undersegmentation impact classification? As shown in Figure 15, A-G represent geographic objects with different categories, a-f represent oversegmented image objects of G, and g represents undersegmented image objects of G. For oversegmented image objects a-f, whether they were classified correctly or not, they affected the classification results of adjacent geographic objects A-F slightly. However, for undersegmented image object g, it was mixed by diverse classes of pixels. If g was classified correctly, the pixels of adjacent geographic objects A-F would be misclassified, because A-F are different categories with G. If g was misclassified, geographic object G would be misclassified, too. Therefore, whether undersegmented image objects are classified correctly or not will lead to incorrect classification results. All in all, if the segmentation cannot completely coincide with image objects and geographic objects, a smaller segmentation scale is recommended in this paper, but the scale should not be too small; otherwise, the phenomenon of the "salt and pepper effect" will occur.

Influences of Patch Sizes on Classification
The unit of patch size was pixel, and it greatly influenced the extraction of deep features. To explore how patch size affects deep feature extraction, seven patch sizes were chosen. Figure 16 shows the variation of overall accuracies over patch sizes, while Figure 17 illustrates that of the producer accuracies over patch sizes. In Figure 16, overall accuracies increased firstly and decreased lastly, and the overall accuracy reached its highest when the patch size was close to 48. Additionally, in Figure 17, buildings were most sensitive to the change of patch sizes, while the producer accuracies of other ground entities changed slightly with the change of patch sizes.

Influences of Patch Sizes on Classification
The unit of patch size was pixel, and it greatly influenced the extraction of deep features. To explore how patch size affects deep feature extraction, seven patch sizes were chosen. Figure 16 shows the variation of overall accuracies over patch sizes, while Figure 17 illustrates that of the producer accuracies over patch sizes. In Figure 16, overall accuracies increased firstly and decreased lastly, and the overall accuracy reached its highest when the patch size was close to 48. Additionally, in Figure 17, buildings were most sensitive to the change of patch sizes, while the producer accuracies of other ground entities changed slightly with the change of patch sizes.

Influences of Patch Sizes on Classification
The unit of patch size was pixel, and it greatly influenced the extraction of deep features. To explore how patch size affects deep feature extraction, seven patch sizes were chosen. Figure 16 shows the variation of overall accuracies over patch sizes, while Figure 17 illustrates that of the producer accuracies over patch sizes. In Figure 16, overall accuracies increased firstly and decreased lastly, and the overall accuracy reached its highest when the patch size was close to 48. Additionally, in Figure 17, buildings were most sensitive to the change of patch sizes, while the producer accuracies of other ground entities changed slightly with the change of patch sizes.

Influences of Patch Sizes on Classification
The unit of patch size was pixel, and it greatly influenced the extraction of deep features. To explore how patch size affects deep feature extraction, seven patch sizes were chosen. Figure 16 shows the variation of overall accuracies over patch sizes, while Figure 17 illustrates that of the producer accuracies over patch sizes. In Figure 16, overall accuracies increased firstly and decreased lastly, and the overall accuracy reached its highest when the patch size was close to 48. Additionally, in Figure 17, buildings were most sensitive to the change of patch sizes, while the producer accuracies of other ground entities changed slightly with the change of patch sizes.  To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48.  Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48. Table 3. Patch representations of image objects (yellow lines represent geographic objects and red lines represent patches) and the learned deep features (red represents high value and blue represent low value) with different patch sizes. Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).

Patch
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48. Table 3. Patch representations of image objects (yellow lines represent geographic objects and red lines represent patches) and the learned deep features (red represents high value and blue represent low value) with different patch sizes. Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).

Patch
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48. Table 3. Patch representations of image objects (yellow lines represent geographic objects and red lines represent patches) and the learned deep features (red represents high value and blue represent low value) with different patch sizes. Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).

Patch
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48. Table 3. Patch representations of image objects (yellow lines represent geographic objects and red lines represent patches) and the learned deep features (red represents high value and blue represent low value) with different patch sizes. Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).

Patch
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48. Table 3. Patch representations of image objects (yellow lines represent geographic objects and red lines represent patches) and the learned deep features (red represents high value and blue represent low value) with different patch sizes. Bare soils (f) over patch sizes (the horizontal axis refers to patch size, the vertical axis refers to accuracies, while different curves represent different segmentation scales).

Patch
To explore the influence of patch sizes on deep feature extraction, an image object of building class was chosen as an example to illustrate the patch representation and deep feature extraction, because buildings are most sensitive to the change of patch size (Table 3). Results show that when patch size was relatively small or relatively large, the outlines of buildings in deep features were not obvious; but when patch size was 48, the outline of the building in deep features was obvious and the classification accuracy was the highest, simultaneously. Therefore, it is important to choose an appropriate patch size, which is related to the data resolution, sizes of image objects, and the method to place the patch. The appropriate patch size in this paper was 48. Table 3. Patch representations of image objects (yellow lines represent geographic objects and red lines represent patches) and the learned deep features (red represents high value and blue represent low value) with different patch sizes.

Object Features vs. Deep Features
Both deep and object features can be used to classify VHR images, but which one is more important for classification? A comparative experiment composed of three different features combinations was designed to answer this question in the Beijing dataset. The first experiment explored the effectiveness of object features on classification; the second explored the effectiveness of deep features; while the third considered both object and deep features. The classification results are shown in Figure 18. The largest difference between the first and the second experiments relied on the results of roads and buildings. The differences between the latter two experiments appeared to be small.

Object Features vs. Deep Features
Both deep and object features can be used to classify VHR images, but which one is more important for classification? A comparative experiment composed of three different features combinations was designed to answer this question in the Beijing dataset. The first experiment explored the effectiveness of object features on classification; the second explored the effectiveness of deep features; while the third considered both object and deep features. The classification results are shown in Figure 18. The largest difference between the first and the second experiments relied on the results of roads and buildings. The differences between the latter two experiments appeared to be small. The overall accuracies of the three experiments are shown in Figure 19, and the producer accuracies are shown in Figure 20. From the perspective of overall accuracy, no matter which segmentation scale was used, the classification result with both object and deep features produced the highest classification accuracy, while classification result with object features led to the lowest accuracy. Therefore, deep features are more important than object features to classify VHR images, although object features are also essential in general. At the same time, the conclusion is not always correct for all ground entities. Most ground entities are in line with the above conclusion, but for vegetation and shadows at smaller scales, object features can produce better classification results than deep features. This is because image objects are relatively homogeneous at a small scale, while NDVI and spectral mean values can distinguish vegetation and shadows well from other ground entities, respectively. The larger the segmentation scales become, the more complicated image objects are, and the more robust features are needed. Additionally, in Figure 19, the effects of segmentation scales on overall accuracies of the experiments two and three were much lower than the impacts on the overall accuracies of the first experiment. In other words, deep features can reduce the impact of segmentation scales on classification results. Therefore, it is recommended that both deep and object features should be considered in classification.  The overall accuracies of the three experiments are shown in Figure 19, and the producer accuracies are shown in Figure 20. From the perspective of overall accuracy, no matter which segmentation scale was used, the classification result with both object and deep features produced the highest classification accuracy, while classification result with object features led to the lowest accuracy. Therefore, deep features are more important than object features to classify VHR images, although object features are also essential in general. At the same time, the conclusion is not always correct for all ground entities. Most ground entities are in line with the above conclusion, but for vegetation and shadows at smaller scales, object features can produce better classification results than deep features. This is because image objects are relatively homogeneous at a small scale, while NDVI and spectral mean values can distinguish vegetation and shadows well from other ground entities, respectively. The larger the segmentation scales become, the more complicated image objects are, and the more robust features are needed. Additionally, in Figure 19, the effects of segmentation scales on overall accuracies of the experiments two and three were much lower than the impacts on the overall accuracies of the first experiment. In other words, deep features can reduce the impact of segmentation scales on classification results. Therefore, it is recommended that both deep and object features should be considered in classification. , 12, x; doi: FOR PEER REVIEW www.mdpi.com/journal/ cies of the experiments two and three were much lower than the impacts on the first experiment. In other words, deep features can reduce the scales on classification results. Therefore, it is recommended that both dee ld be considered in classification.

The Importance of Deep Feature Aggregation
Deep feature aggregation is an important operator in our approach. Classification results can be also obtained without deep feature aggregation. For example, deep features of patches can be used to classify these patches and a decision-level fusion through patches-based voting can be performed to obtain the classification results of image objects. However, many useful spectral features such as NDVI cannot be accurately learned through CNN. In addition, deep and object features are obtained through different units. That is, deep features are related to patches, while object features are concerned with image objects. Therefore, to combine these two types of features, aggregating patchbased deep features to object-based deep features is necessary.
To verify the importance of deep feature aggregation, deep features of patches are used to classify these patches and a decision-level fusion by patches-based voting is performed to obtain the classification results of image objects without deep feature aggregation on Beijing dataset. The patch size is 48 in this experiment. As shown in Figure 21, larger classification accuracy is obtained with deep feature aggregation whether which scale is selected. Additionally, compared with Figure 19, it can be found that deep features and deep feature aggregation can produce larger classification accuracy than deep features alone. Therefore, deep feature aggregation is important.

The Importance of Deep Feature Aggregation
Deep feature aggregation is an important operator in our approach. Classification results can be also obtained without deep feature aggregation. For example, deep features of patches can be used to classify these patches and a decision-level fusion through patches-based voting can be performed to obtain the classification results of image objects. However, many useful spectral features such as NDVI cannot be accurately learned through CNN. In addition, deep and object features are obtained through different units. That is, deep features are related to patches, while object features are concerned with image objects. Therefore, to combine these two types of features, aggregating patch-based deep features to object-based deep features is necessary.
To verify the importance of deep feature aggregation, deep features of patches are used to classify these patches and a decision-level fusion by patches-based voting is performed to obtain the classification results of image objects without deep feature aggregation on Beijing dataset. The patch size is 48 in this experiment. As shown in Figure 21, larger classification accuracy is obtained with deep feature aggregation whether which scale is selected. Additionally, compared with Figure 19, it can be found that deep features and deep feature aggregation can produce larger classification accuracy than deep features alone. Therefore, deep feature aggregation is important. To verify the importance of deep feature aggregation, deep features of patches are used to classify these patches and a decision-level fusion by patches-based voting is performed to obtain the classification results of image objects without deep feature aggregation on Beijing dataset. The patch size is 48 in this experiment. As shown in Figure 21, larger classification accuracy is obtained with deep feature aggregation whether which scale is selected. Additionally, compared with Figure 19, it can be found that deep features and deep feature aggregation can produce larger classification accuracy than deep features alone. Therefore, deep feature aggregation is important.

Comparison with the State-Of-The-Art Methods
There are three main object-based CNN methods [32][33][34] in current literature. How does the proposed approach differ from these existing methods? Which method can produce the best

Comparison with the State-Of-The-Art Methods
There are three main object-based CNN methods [32][33][34] in current literature. How does the proposed approach differ from these existing methods? Which method can produce the best classification result for VHR imagery? The main differences between the proposed approach and those existing methods are summarized as follows: (1) Different methods to place patches. Different methods adopt different strategies to place patches. As shown in Figure 22a, Fu et al. [34] applied only one patch to one image object. The center of the patch was located at the center of the image object. However, for linearly shaped image objects, this method may be inappropriate to place patches. Zhao et al. [32] placed one patch on each pixel of an image object (Figure 22b). However, the patches were too dense and overlapped largely, leading to extremely massive computation and a large correlation among neighboring pixels. Zhang et al. [33] divided image objects into two types: general image objects and linearly shaped image objects. Large patches were used for general image objects, while several small patches were adopted for linearly shaped image objects. The positions of patches were determined by convolutional position analysis. For a general image object, the method to place patches is similar to Figure 22a. For a linearly shaped image object, the method to place patches is shown in Figure 22c. However, in this way, the patches cannot represent image objects exactly and take advantage of all the information. In our method, a set of patches was used to represent an image object, and all patches covered the entire object together (Figure 22d). The number of patches for an image object was determined by patch sizes and the shape of the image object. Compared with the former three methods, deep features of different patches were weighted, and deep features of image objects were obtained through deep feature aggregation in our method.
Remote Sens. 2020, 12, x FOR PEER REVIEW 19 of 25 classification result for VHR imagery? The main differences between the proposed approach and those existing methods are summarized as follows: (1) Different methods to place patches. Different methods adopt different strategies to place patches. As shown in Figure 22a, Fu et al. [34] applied only one patch to one image object. The center of the patch was located at the center of the image object. However, for linearly shaped image objects, this method may be inappropriate to place patches. Zhao et al. [32] placed one patch on each pixel of an image object (Figure 22b). However, the patches were too dense and overlapped largely, leading to extremely massive computation and a large correlation among neighboring pixels. Zhang et al. [33] divided image objects into two types: general image objects and linearly shaped image objects. Large patches were used for general image objects, while several small patches were adopted for linearly shaped image objects. The positions of patches were determined by convolutional position analysis. For a general image object, the method to place patches is similar to Figure 22a. For a linearly shaped image object, the method to place patches is shown in Figure 22c. However, in this way, the patches cannot represent image objects exactly and take advantage of all the information. In our method, a set of patches was used to represent an image object, and all patches covered the entire object together (Figure 22d). The number of patches for an image object was determined by patch sizes and the shape of the image object. Compared with the former three methods, deep features of different patches were weighted, and deep features of image objects were obtained through deep feature aggregation in our method. (2) Different methods to obtain object class from patches. Different deep features can be learned from different patches of an image object; that is, deep features are patch-based. As a result, an aggregation operator is needed to aggregate these deep features to classify image objects or to generate a fixed dimension feature for further classification. Fu et al. [34] utilized only one patch to represent an image object; thus, no aggregation operator is used. Zhao et al. [32] and Zhang et al. [33] utilized the majority voting operator to obtain classification results from these patch-based deep features. In other words, Zhao et al. and Zhang et al. used a decision-level fusion, and a feature-level fusion was used in this paper.
(3) Different roles of object and deep features playing in classification. As demonstrated in (2) Different methods to obtain object class from patches. Different deep features can be learned from different patches of an image object; that is, deep features are patch-based. As a result, an aggregation operator is needed to aggregate these deep features to classify image objects or to generate a fixed dimension feature for further classification. Fu et al. [34] utilized only one patch to represent an image object; thus, no aggregation operator is used. Zhao et al. [32] and Zhang et al. [33] utilized the majority voting operator to obtain classification results from these patch-based deep features. In other words, Zhao et al. and Zhang et al. used a decision-level fusion, and a feature-level fusion was used in this paper.
(3) Different roles of object and deep features playing in classification. As demonstrated in Section 5.1, although deep features can help to classify image objects directly, they cannot completely replace object features, including spectral, shape, texture, and other features. Existing work [33,34] only considered deep features and ignored object features. Zhao et al. [32] combined object features with the deep features of pixels to classify pixels. Unfortunately, both object and deep features were extracted from different units: object features from image objects, while deep features from pixels. As a result, it is hard to combine the two kinds of features. However, this study resolved this issue by aggregating deep features of patches into deep features of objects through mean and weighted sum operators. Finally, deep and object features of image objects were concatenated together to classify image objects.
The classification results of the four methods on Beijing dataset and Zurich dataset are shown in Figures 23 and 24, their classification accuracies are reported in Tables 4 and 5. These two tables demonstrate that our method can obtain the best classification accuracy. For Beijing dataset, the main advantage was that better classification results of buildings and water for our method, while the misclassification of buildings and water by other three methods was more significant. Buildings and roads, water, and shadows are more likely to be confused in the other three methods while they were very well classified in our method. For the Zurich dataset, whether from the perspective of PA or from the perspective of UA, our method can obtain very good results for each class. At the same time, the other three methods cannot achieve good results for every class. Therefore, compared with the three existing object-based CNN methods, our method has more advantages for the classification of VHR imagery.
Remote Sens. 2020, 12, x FOR PEER REVIEW 20 of 25 demonstrate that our method can obtain the best classification accuracy. For Beijing dataset, the main advantage was that better classification results of buildings and water for our method, while the misclassification of buildings and water by other three methods was more significant. Buildings and roads, water, and shadows are more likely to be confused in the other three methods while they were very well classified in our method. For the Zurich dataset, whether from the perspective of PA or from the perspective of UA, our method can obtain very good results for each class. At the same time, the other three methods cannot achieve good results for every class. Therefore, compared with the three existing object-based CNN methods, our method has more advantages for the classification of VHR imagery.

Limitations and Future Work
A novel method was proposed to integrate deep features into the GEOBIA paradigm in this study and was used for VHR imagery classification. Although the proposed method can achieve better classification results than existing object-based CNN methods, there were still some limitations.
The first is the determination of the method of patch division and the selection of patch size. The results of deep feature extraction depend on patch sizes and the method of patch division. For simplicity, this paper adopted the fixed patch size. However, different ground entities may vary in sizes and structures; thus, different patch sizes should be considered. As a result, adaptive patch sizes will be addressed in future.
The second is the method of aggregating deep features, which directly affects how the deep features extracted through patches are aggregated into image objects with less information loss. The mean operator was used to compress the deep features of each patch, and weighted summation was used to aggregate deep features extracted from different patches in this paper. However, the mean function may lose some information, and the weighted summation may not be the best way to aggregate deep features of different patches. Therefore, other methods of deep feature aggregation will be the focus of future work.

Conclusions
In this study, we showed an effective method of integrating deep features into GEOBIA for VHR remote sensing imagery classification. We proposed a patch-based approach for representing image objects using patches and learning patch-based deep features and a deep feature aggregation method for aggregating patch-based deep features into object-based deep features, in order to extract deep features from irregular image objects through CNN. Results show that smaller segmentation scales were more conducive to VHR remote sensing imagery classification, and it was not appropriate to select too large or too small patches, as the patch size should be determined by imagery and its resolution. Moreover, we performed 5-fold stratified cross validations 50 times to demonstrate the stability of our method. Additionally, we found that although deep features are better than object features for classification in most cases, object features still matter for improving classification results. We demonstrated the deep feature aggregation is a critical step in our method. In the comparison with existing object-based CNN methods, our method achieved the highest overall accuracies, and the overall accuracies were 91.21% for the Beijing dataset and 99.05% for the Zurich dataset. Therefore, the method presented in this paper has great application prospects in LULC mapping.
Though our method achieves better classification results than the state-of-the-art methods, there still are some points that can be improved. The compression method with less information lost can be developed, and other methods of deep feature aggregation can be used. A better classification result can be achieved through these methods.

Conflicts of Interest:
There is no conflict of interest.