Gradient Boosting Machine and Object-Based CNN for Land Cover Classification

In regular convolutional neural networks (CNN), fully connected layers act as classifiers to estimate the probabilities for each instance in classification tasks. The accuracy of CNNs can be improved by replacing fully connected layers with gradient boosting algorithms. In this regard, this study investigates three robust classifiers, namely XGBoost, LightGBM


Introduction
Machine learning methods have been developed to automate the analysis and enhance remote sensing observations by introducing new classifiers, segmentation, or optimization algorithms. These methods are efficient when applied to high spatial resolution data, including satellite, airborne, and Unmanned Aerial Vehicle data. Among conventional methods, the ensemble classifier random forest (RF), neural network, and support vector machine (SVM) techniques are regularly employed for image classification and other tasks (e.g., change detection) with considerable success. These methods have received much attention due to their ability to handle multi-dimensional data and perform well with limited training samples [1][2][3][4][5][6][7]. Typically, these conventional machine learning approaches have been applied using shallow classification techniques. However, the massive increase in the size of datasets (velocity, volume, variety) has resulted in a bottleneck in efficient data processing [8].
In more recent years, the advent of deep learning (DL) has led to renewed interest in neural networks. DL has demonstrated astounding capabilities, primarily attributed to the automated extraction of essential features, removing the need to identify case-specific features. The driving force behind the success of DL in image analysis can be traced to three key factors: (1) more data available for training deep neural networks (DNNs), especially in cases of supervised learning such as classification, where users typically provide annotations, for example in [9,10]; (2) more processing power, especially the explosion in the availability of Graphics Processing Units; and (3) more algorithms [5,11,12,13]. Among different deep learning structures, the convolutional neural network (CNN) is a widely used method that has been successfully applied to pattern recognition, natural language processing, land cover classification, and point cloud processing. Because CNNs process large datasets efficiently, they are particularly relevant for tasks involving remotely sensed imagery and other spatial data. CNNs have been used in a wide range of applications, particularly in the classification of high spatial resolution datasets [14][15][16], including land use/land cover (LULC) classification, scene classification, and object detection [15][16][17][18][19][20], and for the annotation of point cloud datasets [21]. More recently, region-based CNNs (R-CNNs) have been used in the analysis of very high spatial resolution datasets with considerable success. For example, R-CNNs have been used to overcome the scarcity of labeled training data when detecting old and new buildings in a scalable manner [22], to enable the regularization of building footprints [23], and to facilitate the rapid extraction of buildings through the iterative inclusion of validated samples [24].
The structure of a CNN for land cover classification depends on several factors, such as the number of convolutional layers, activation functions, and loss functions [7,25,26], or the shape of the input data, such as patch-based [27,28] or graph-based [29]. Moreover, the type of output, either scene-based [9] or pixel-based classes [12,30], influences the selection of classification methods. Some studies [13,31] have discussed how 1D and 2D graphs and line thickness improve the classification accuracy compared to several standard CNN-based methods, whereas others have attempted to integrate machine learning classifiers into CNNs, as in [32]. Object-based image analysis (OBIA) has also been combined with CNNs to take advantage of the boundary delimitation of the former and the spatial feature extraction of the latter [33][34][35][36][37][38]. In these studies, a dense layer is used to classify image objects, although this classifier can be replaced with other algorithms that are potentially valuable in land cover analysis.
Another consideration in the use of CNNs for remote sensing applications is the importance of band combinations for improving classification accuracy. Numerous studies have applied the recently established object-based CNN deep learning technique to high-resolution images, using optimal band combinations (e.g., three-band combinations) and reporting high accuracies [39][40][41][42]. Furthermore, these works developed an automatic extraction framework for large-scale remote sensing applications from high spatial resolution optical images, using a CNN architecture based on multispectral band combinations. These approaches offer potential choices of bands for multispectral satellite images but offer none for datasets with limited bands, such as RGB images from common UAVs or several high-resolution images (SPOT 7). In these cases, the spatial arrangement of objects contributes significantly to the overall performance of land pattern classification.
Many recent studies have discussed the robustness of CNNs and ensemble algorithms in land cover analysis. In brief, CNNs show their strength with unstructured data (images), while ensemble methods seem to be better suited to tabulated datasets. In general, gradient boosting [43][44][45] is an iterative learning process that learns from the errors made in the previous step to improve the selection of weights in subsequent iterations. This process develops a more complete picture of the dataset, and the classification results are statistically reliable. A recent study on land cover [46] compared CNN and gradient boosting for urban land classification and found small differences between the two methods, although greater effectiveness of tree-based methods in terms of classification accuracy has been demonstrated elsewhere [47]. The combination of CNN and gradient boosting algorithms may, indeed, improve the classification of satellite datasets. Moreover, complex human-made objects are replacing natural physical surfaces and, therefore, more efficient and effective models are increasingly needed. This study extends previous works [13,26,31,32] to test the potential use of gradient boosting algorithms as classifiers for the final layer of a CNN to improve the classification performance in terms of overall accuracy (OA) and error. The aim is to use SPOT 7 imagery as input data, prepared through a sequence of image segmentation, feature extraction, and graph generation, followed by training of the proposed model and comparison with several benchmark methods. In summary, the main contributions of this manuscript are the construction of 2D input graphs from object features extracted during image segmentation and the application of gradient boosting algorithms as a replacement for dense layers in a CNN for more accurate land cover classification.

Study Area and Training Data Preparation
Hanoi, the capital city of Vietnam, was selected as a case study because of its complex surface morphology and spatial mixture of various land cover types (Figure 1). The city boundary was extended in 2018 through a decision to merge neighboring provinces with different landscapes or historically distinctive morphological zones. The central-western part is a dynamic area because the residential areas are subject to ongoing development with a mix of high-rise buildings surrounded by open green spaces. The old French area is distinguishable by its house styles interspersed with small gardens, and it is also home to government buildings and affluent residential neighborhoods. On the other hand, the historical center remains unchanged, with its dense population concentration and small-facade houses. The automated classification of such areas can prove difficult because of the complex mixture of different land patterns, and accuracies are subject to the choice of the spatial and spectral resolution of the input data. In this study, SPOT 7 imagery with 1.5 m spatial resolution in the panchromatic band and 6 m in the multispectral bands was used as the input dataset. For pre-processing, the combination of multispectral and high-resolution panchromatic images with complementary characteristics often serves as an integral component of remote sensing mapping workflows [48,49]. Here, the fusion technique was applied to generate a higher resolution image with spectral information for multiple bands. Even though concerns have been raised about artifacts in fused images, several studies have reported positive effects on the classification accuracy, although accuracy is subject to the detailed application of the fusion method. Fusion techniques, such as the wavelet transform, Brovey transform, Intensity-Hue-Saturation fusion, principal component transform, and high-pass filtering, are increasingly being used in remote sensing applications. In this study, the Gram-Schmidt method [50] was used, simulating the panchromatic band by averaging the multispectral bands. This method was found to be efficient in producing more natural colors. A more detailed specification of the dataset is presented in Table 1.
Segmentation was then carried out on the fused images using PCI Geomatics (evaluation version). This software uses a region-growing algorithm that selects initial seed points and searches for similar neighboring pixels to form a larger region. The iteration continues until the desired output is achieved, as governed by three parameters: scale, shape, and compactness (30, 0.75, and 0.5, respectively). The scale value influences how large the objects should be and is defined according to the spatial resolution of the input image. A sample area of the city covering all typical types of land patterns was chosen for verifying the proposed model. This subset of 54,234 segmented objects represents various land cover patterns in six classes: house, impervious surface, water, bare land, vegetation, and shadow areas (Figure 1). All 54,234 objects in the study area were visually allocated to the six classes, using higher resolution images as references together with ground-truthing and ancillary documents, such as cultivation plans or current land-use/land-cover maps.
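The region-growing idea described above can be illustrated with a minimal sketch. It is not the algorithm implemented in PCI Geomatics, which additionally weighs scale, shape, and compactness; here a single intensity `tolerance` relative to the seed value is assumed as the similarity criterion, and the function name `region_grow` and the toy `image` are hypothetical.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a region from `seed` over 4-connected neighbours.

    A neighbour joins the region when its value differs from the seed
    value by at most `tolerance` (an assumed, simplified criterion).
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if (inside and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy single-band image: low values on the left, high values on the right.
image = [
    [1, 1, 9],
    [1, 9, 9],
    [1, 1, 9],
]
dark_region = region_grow(image, seed=(0, 0), tolerance=1)
bright_region = region_grow(image, seed=(0, 2), tolerance=1)
```

Each returned set of pixel coordinates plays the role of one image object, from which spectral and shape features could then be derived.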
Grouping homogeneous pixels can produce informative geometric shapes (i.e., square, round, and rectangle) that are useful for detecting specific human-made objects. With six input spectral bands, the algorithms produce 52 features, as shown in Table 2, including (i) spectral statistics of pixel values (min, max, mean, and standard deviations) and (ii) spatial metrics, such as Circularity, Compactness, Elongation, and Rectangular. These values can be used in tabulated form with traditional machine learning algorithms or hybrid models, as reported in [11,12]. In this study, however, an alternative approach is proposed in which plots are generated from these values before they are fed to the proposed model. Figure 1 shows the classes of the segmented objects, which are set partly transparent for visualization and overlaid upon the natural-composite SPOT 7 image.
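As an illustration of the two feature groups, the sketch below computes per-band spectral statistics for one object and one shape metric. The band names, the pixel values, and the circularity definition (the isoperimetric quotient 4πA/P²) are assumptions for illustration; the segmentation software may define its spatial metrics differently.

```python
import math
from statistics import mean, pstdev


def spectral_stats(pixels_by_band):
    """Return min, max, mean, and standard deviation per band for one object."""
    features = {}
    for band, values in pixels_by_band.items():
        features[f"{band}_min"] = min(values)
        features[f"{band}_max"] = max(values)
        features[f"{band}_mean"] = mean(values)
        features[f"{band}_std"] = pstdev(values)  # population standard deviation
    return features


def circularity(area, perimeter):
    """Isoperimetric quotient 4*pi*A / P**2; equals 1.0 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2


# Hypothetical pixel values of a single image object in two bands.
object_pixels = {"red": [10, 12, 14], "nir": [40, 40, 43]}
object_features = spectral_stats(object_pixels)
square_circularity = circularity(area=1.0, perimeter=4.0)  # unit square
```

Rows of such feature dictionaries, one per object, form the tabulated dataset referred to in the text.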

Gradient Boosting Classifiers
Gradient Boosting Machines (GBM) are powerful ensemble machine learning algorithms that employ decision trees to build up the classifiers. Technically, the algorithms proceed iteratively, adding models that correct weaknesses in prior models and improve the overall accuracy. Among gradient boosting algorithms, XGBoost, LightGBM, and CatBoost are often considered successful classifiers for various applications [44,45,51]. XGBoost uses pre-sorted and histogram-based algorithms for estimating the best splits and employs parallel processing, with the capacity to handle missing values and to reduce over-fitting. In addition, by default this algorithm grows trees depth-wise up to a maximum depth and prunes splits that do not reduce the loss sufficiently, while gradient descent minimizes the errors.
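The iterative error-correcting principle can be shown with a minimal sketch of gradient boosting for a squared loss, in which each new one-split tree (a stump) is fitted to the residuals of the current ensemble. This is a didactic example on hypothetical one-dimensional data, not the implementation used by XGBoost, LightGBM, or CatBoost; all names and parameter values are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes the squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For the squared loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return {"base": base, "rate": learning_rate, "stumps": stumps}


def predict(model, x):
    total = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        total += model["rate"] * (left_value if x <= threshold else right_value)
    return total


# Hypothetical data with a step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosting(xs, ys)
mse = sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

The small learning rate means each stump removes only a fraction of the remaining error, which is why many rounds are needed; production libraries add regularization, second-order information, and efficient split finding on top of this basic loop.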
LightGBM, proposed by Microsoft, is a more recently developed tree-based gradient boosting algorithm. It was developed to improve predictive efficiency, handle large datasets, and reduce training time, and is typically recommended for tabular datasets. LightGBM differs from other tree-based methods by implementing leaf-wise splits (Figure 2), which create more complex trees that reduce the loss more efficiently and result in higher accuracy. The split is based on a novel sampling method named Gradient-Based One-Side Sampling [52], in which most data with small gradients are excluded by random sampling, and the remaining data are used for estimating the information gain and growing the tree. This algorithm is controlled by several groups of parameters, including (1) boosting parameters, such as Max_depth, Learning_rate, Max_leaf_node, and gamma, and (2) learning task parameters, such as the loss function type, evaluation metric, and number of iterations. These parameters control how leaves grow, as briefly shown in Figure 2. As the tree grows, the model becomes more complex, the loss is reduced, and the algorithm learns faster. One limitation of such an algorithm is that over-fitting may occur if the dataset is small, so a proper set of model parameters is required to avoid it.
CatBoost, developed by Yandex, is a challenger to the previous two algorithms and is currently receiving close attention from data science communities. It has been reported to be more effective than the others without pre-processing requirements, and over-fitting is mitigated by the ordered boosting approach used in CatBoost. During training, consecutive symmetric decision trees are built with reduced loss in comparison to the others. Symmetric trees are not a feature of the other gradient boosting methods, and they also enable faster training.
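The Gradient-Based One-Side Sampling step described for LightGBM can be sketched as follows: instances with the largest absolute gradients are all kept, a random subset of the remaining instances is drawn, and the sampled small-gradient instances are re-weighted so that the gain estimate stays approximately unbiased. The function name, the rates `top_rate` and `other_rate`, and the toy gradients are assumptions for illustration rather than LightGBM's actual code.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance index: weight} for the instances kept by GOSS."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)
    # Sampled small-gradient instances are amplified by (1 - a) / b.
    amplification = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplification for i in sampled})
    return weights


# Hypothetical gradients for 100 training instances.
gradients = [0.1 * i for i in range(-50, 50)]
sample_weights = goss_sample(gradients)
```

With these rates, 30 of the 100 instances are used to evaluate candidate splits, which is the source of the reduced training time.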

Object-Based CNN with GBM Algorithms
The proposed CNN structure follows several successful models in land cover classification with respect to the number of hidden layers, feature maps, and activation and loss functions [13,31]. The proposed hybrid model is illustrated in Figure 3 as a sequence of steps. Step 1: A high-resolution image is segmented with various parameters, namely scale, compactness, and shape. These parameters should be tuned through a trial-and-error process to generate the best boundaries around potentially similar pixels. After segmentation, spectral and spatial metric features (Table 2) are associated with each object.
A more detailed structure of the CNN is presented in Figure 4. It consists of a sequence of layer stacks, in which the first two convolutional layers map similar grids over the input images and then map sequentially smaller grids. The leaky ReLU activation function is used during training to transform the feature spaces. Dropout is also applied to avoid over-fitting, in which neurons are turned off based on their assigned probabilities during the forward pass. The output is flattened before being fed to the fully connected (FC) layer. This layer acts as a hidden layer in neural networks and outputs the probability for each class. After training the model with the FC layer, the last layer is replaced with gradient boosting algorithms to improve the prediction accuracy.
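The two regularizing ingredients named above, the leaky ReLU activation and dropout, can be sketched in a few lines. The slope `alpha`, the dropout `rate`, and the use of inverted dropout (rescaling the kept activations) are assumptions for illustration; the values used in the study are not stated here.

```python
import random


def leaky_relu(x, alpha=0.01):
    """Pass positive inputs unchanged; scale negative inputs by `alpha`."""
    return x if x > 0 else alpha * x


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and rescale the kept ones so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical pre-activation values of one layer.
pre_activations = [-1.0, 0.5, 2.0, -0.2]
activated = [leaky_relu(v) for v in pre_activations]
dropped = dropout(activated, rate=0.5, rng=random.Random(0))
```

Dropout is applied only during training; at inference time all neurons stay active, which is also the state in which the dense-layer outputs would be extracted for the gradient boosting classifiers.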

Accuracy Assessment
For multiclass classification tasks, several statistical indicators are used to validate classifier performance. Several error measures and the overall accuracy are used to validate the model, and the categorical log-loss function is used for training. This loss function is the default option among alternatives (such as Sparse Multiclass Cross-Entropy Loss and Kullback-Leibler Divergence Loss, among others), is preferred for multiclass classification problems, and can be expressed in its standard form as follows:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{6} y_{ij} \log(p_{ij})$$

where $Y_i = (y_{i1}, y_{i2}, \ldots, y_{i6})$ is a one-hot encoded target vector representing six land cover classes, $N$ is the number of samples, and $p_{ij}$ is the predicted probability that the ith element belongs to class j. The value $y_{ij} = 1$ shows that the ith element is in class j. This function estimates the average difference between predicted and observed classes, and a score is calculated. Moreover, the study also compares the performance of the CNN-based gradient boosting algorithms with traditional classifiers, such as Random Forest and Support Vector Machine; therefore, the Root Mean Square Error (RMSE) and Overall Accuracy (OA) are also used. In addition, the model was interpreted using the saliency map method, which estimates the contribution of each input feature to the prediction of specific classes (via the gradient of the loss function).
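The indicators named in this section can be computed as in the following minimal sketch. The categorical log-loss follows the standard cross-entropy definition for one-hot targets; computing the RMSE on integer class indices is an assumption, since the exact encoding used for the RMSE is not specified in the text. All function names and sample values are hypothetical.

```python
import math


def categorical_log_loss(y_true, y_prob, eps=1e-15):
    """Average cross-entropy between one-hot targets and predicted probabilities."""
    total = 0.0
    for target, probs in zip(y_true, y_prob):
        for t, p in zip(target, probs):
            clipped = min(max(p, eps), 1.0)  # avoid log(0)
            total -= t * math.log(clipped)
    return total / len(y_true)


def overall_accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the reference class."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def rmse(labels, predictions):
    """Root mean square error on class indices (assumed encoding)."""
    squared = sum((l - p) ** 2 for l, p in zip(labels, predictions))
    return math.sqrt(squared / len(labels))


# Hypothetical two-class example with uninformative predictions.
loss = categorical_log_loss([[1, 0], [0, 1]], [[0.5, 0.5], [0.5, 0.5]])
oa = overall_accuracy([0, 1, 2, 2], [0, 1, 1, 2])
error = rmse([0, 1, 2, 2], [0, 1, 1, 2])
```

A perfectly confident and correct prediction gives a log-loss of zero, whereas uniform predictions over two classes give log 2, which makes the loss a convenient measure of both correctness and confidence.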

Results and Discussion
The input images take the form of a graph representing the object's features, as presented in Table 2. The input data were normalized to a common value range [0-1] before being plotted and saved to single-band images. During the preparation of the training dataset, the weight of the plot lines affects the recognition of edges in the plots, as discussed in [13]. In this regard, we defined the line weight = 2, image size = 76 × 76, line color = black, and background = white to generate 54,234 plots in total. An illustration of the conversion from tabulated data to plots is shown in Figure 5. Among the plots, 50,234 were used for training, and 4000 were kept out of the training stage for visualization. The numbers of training images per class are bare soil = 890, impervious = 4716, shadows = 7786, vegetation = 6502, water = 330, and house = 30,010 (Table 3). The training dataset is therefore unbalanced among classes because of the dominance of houses in the urban area. Applications of CNNs typically take pre-trained networks and retrain them with new land cover datasets [30]. However, this study did not follow this strategy because the training data have a different form, representing feature variation in a single gray-band image. The training process learns edge differences, which arise from changes in the spectral and spatial information of image objects (Figure 5). In this regard, the proposed model is trained from scratch with the proportion of samples shown in Table 3.
Before the gradient boosting algorithms were substituted for the classification task, the CNN with fully connected layers was trained with the categorical log-loss as the objective function. The variation of the log-loss is presented in Figure 6, in which, after the 120th epoch, the log-loss value varies within a smaller range. At the 300th epoch, the log-loss fluctuation is so slight that we could consider terminating the training process and using the trained model for the next step. In Table 4, the left side lists values obtained from the CNN with a fully connected layer and from CNNs in which SVM, XGBoost, LightGBM, and CatBoost replace the fully connected layer. These models were trained with plotted images, as explained in the previous section. Moreover, we verified these algorithms with a dataset containing the original 52 features, as illustrated in Figure 5. The results are shown on the right side of Table 4. It can be seen that CNN-CatBoost (OA = 0.8956) and CNN-LightGBM (OA = 0.8956) achieve the highest overall accuracies and the smallest errors. Thus, the CNN-based classifiers show improvements compared to traditional methods, which were run with the tabulated dataset. Furthermore, the larger number of features in the CNN-based gradient boosting (128) might explain the higher accuracy of these models over the tabulated dataset (52 features). For a more detailed analysis of the CNN-based methods, the confusion matrix is also shown in Tables 4 and 5. For example, the Producer's Accuracy indicates a high probability of misclassification between Bare land and Impervious and between Impervious and House. These misclassifications might be caused by the similar spectral values in all bands, so that spatial information might be the distinguishing factor between these classes. Moreover, about 20% of the water objects are misclassified as shadows because shadowed areas have low reflectance and are sometimes confused with water (high absorption in the visible range). The study area is complex, with houses and vegetation mixed within small areas, and it is difficult and impractical to assign such a small area to more than one class. In this regard, object-based image classification proves to be more accurate since it generates boundaries around the mixed area based on average spectral variations. Moreover, in comparison to pixel-based analysis, OBIA takes spatial metrics into consideration, which helps to separate long-shaped objects, such as roads, from round-shaped objects, such as lakes and shadows.
Figure 7 shows the classification results of four CNN-based models in several subset polygons of the study area. It can be seen that water objects were correctly classified, as they have a typical spatial structure and low reflectance in all bands. These objects are more likely to be misclassified as shadows when pixel-based methods are used because of similar spectral information. Impervious surfaces, which are mostly roads, also achieve good classification results because of their spatial structure.
In machine learning, the imbalance of training classes affects the performance of classification models, and several techniques have been proposed to cope with this issue. Some of them are also applied in the experiments, such as generating more data (by adjusting the scale, compactness, and shape parameters of the image segmentation process to generate smaller homogeneous objects) and data generation before the training step. For example, water bodies cover a small space in this study area and account for a small portion of the total training data (large groups of homogeneous water pixels merge into a single large polygon that forms one object). However, these objects can be accurately separated from other classes because of their typical reflectance values and spatial metrics (elongation, circularity). The spatial metrics are considered a strength of OBIA with high-resolution images.
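The producer's and user's accuracies referred to in the discussion of the confusion matrix can be derived as in the sketch below. It assumes that rows hold the reference classes and columns the predicted classes; the two-class counts are hypothetical and are not taken from Table 5.

```python
def accuracy_report(matrix):
    """Return (producer's accuracies, user's accuracies, overall accuracy).

    Assumes matrix[i][j] counts objects of reference class i predicted as class j.
    """
    n = len(matrix)
    diagonal = [matrix[i][i] for i in range(n)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    producers = [d / t if t else 0.0 for d, t in zip(diagonal, row_totals)]
    users = [d / t if t else 0.0 for d, t in zip(diagonal, col_totals)]
    overall = sum(diagonal) / sum(row_totals)
    return producers, users, overall


# Hypothetical counts for two classes: water (index 0) and shadow (index 1).
confusion = [
    [80, 20],  # reference water: 80 predicted as water, 20 as shadow
    [5, 95],   # reference shadow: 5 predicted as water, 95 as shadow
]
producers, users, overall = accuracy_report(confusion)
```

In this toy matrix the producer's accuracy of water is 0.8, meaning that one fifth of the reference water objects were assigned to another class, which is the kind of omission error discussed above.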
In machine learning, proposed methods are always expected to generalize to different datasets and applications. For example, gradient boosting algorithms have been found to be efficient in many works [44,45,51]. They also improve the accuracy in this case study in Vietnam. However, due to the limited access to benchmark datasets (most open datasets are for scene-based classification, and no data are available for pixel-based classification), it was not easy to verify the performance of this hybrid network on different data.
On another note, model interpretation plays a significant role in understanding the impact of specific features on the general performance of the classification task or on individual instances, for example by using SHAP (SHapley Additive exPlanations). This interpretation can be implemented with object features, representing spectral and spatial information. For CNN models with pixel-based input, the sensitivity can be analyzed using several methods, such as perturbation-based visualization, randomized mask sampling, and backpropagation-based visualization. In this study, saliency mapping was used to visualize the model during the training process. Figure 8 shows both color and grayscale saliency maps for the six classes to highlight the most important pixels. For the 'Water' class, the spatial arrangement strongly influences the determination of this class since water bodies in the study area are mainly open canals and rivers. Spatial information can also be seen as important, as pixels (in circles) corresponding to "Elongation," "Circular," "Compactness," and "Rectangular" display high values.
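To illustrate what a saliency value measures, the following minimal sketch approximates the gradient of a class score with respect to each input by central finite differences. The linear `toy_score` function and its `weights` are hypothetical stand-ins for a trained CNN, which in practice would supply these gradients by backpropagation.

```python
def saliency(score_fn, inputs, step=1e-4):
    """Absolute central-difference gradient of `score_fn` for each input."""
    values = []
    for i in range(len(inputs)):
        up = list(inputs)
        down = list(inputs)
        up[i] += step
        down[i] -= step
        values.append(abs(score_fn(up) - score_fn(down)) / (2 * step))
    return values


# Hypothetical class score: a weighted sum of four normalized inputs.
weights = [0.5, -2.0, 0.0, 1.0]


def toy_score(x):
    return sum(w * v for w, v in zip(weights, x))


saliency_values = saliency(toy_score, [0.2, 0.4, 0.6, 0.8])
most_important = saliency_values.index(max(saliency_values))
```

Inputs with large saliency values are those for which a small change would alter the class score the most, which is how the highlighted pixels in a saliency map should be read.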

Future Remarks
The shape of the curves has a significant impact on the mapping of convolutional layers. In this regard, [13] discussed an alternative solution for generating 1D or 2D graphs that would also bring more diversity to the input patches. In the tabulated dataset, as illustrated in Figure 5, the order of columns does not affect the performance of machine learning classifiers. However, the order might have a significant impact when these datasets are plotted and saved to figures before being fed to deep learning models. Other researchers [13,31] plotted the graphs with the registered orders of spectral bands and for four seasons, respectively. In these studies, graphs were generated using the associated features from image segments, in which features were ordered by spectral min, mean, standard deviations, and spatial information (Elongation, Circular, and Rectangular). Different orders generate different graph shapes, so that deep convolutional layers learn the edges differently. Feature re-ordering is not examined in this study, but it is worth trying in future works, particularly with feature-rich datasets.
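The effect of column order on the plotted curve can be seen in a small sketch: the same feature values produce different sequences of rises and falls when reordered, even though the set of values is unchanged. The feature values and the `edge_profile` helper are hypothetical.

```python
def edge_profile(values):
    """Differences between consecutive values, i.e., the slopes of the plotted line."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical normalized features of one object, in two column orders.
features = [0.1, 0.9, 0.2, 0.8]
reordered = [features[i] for i in (0, 2, 3, 1)]

original_edges = edge_profile(features)    # alternating steep rises and falls
reordered_edges = edge_profile(reordered)  # monotonic, gentler slopes
total_variation = sum(abs(e) for e in original_edges)
reordered_variation = sum(abs(e) for e in reordered_edges)
```

A tabular classifier sees identical information in both cases, whereas a convolutional layer would be presented with a zigzag line in the first case and a smooth ramp in the second.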
The input images of a CNN can have different formats, such as image patches from multispectral satellite data or spectral graphs displaying spectral variations across all bands. The first is a typical form in numerous land cover classification studies [8,14,29,53]. The second approach was investigated in [13,29,31] with multi-temporal satellite images. Only a single SPOT 7 image was used in this study, and the spectral bands are limited to four. The inclusion of images with more bands, such as Sentinel-2A, and the combination of multiple spectral bands for segmentation might improve classification accuracies. This is another direction for future work.
Object-based image analysis has proved efficient in land cover classification with high-resolution data [33,35,38], with accurate detection of the boundaries of land cover types from the segmentation process. The researchers in [13,35] proposed an approach to generate 1D and 2D graphs from spectral bands for pixel-based image classification. This study investigates the potential to extend previous works by taking advantage of the ability of CNNs to learn from unstructured data (plots generated from 52 features) and of tree-based algorithms to handle tabular data (128 features from the dense layer) for land cover classification. The two types of methods are considered best-in-class for the respective data types mentioned above, and their combination has high potential for accurate land monitoring.

Conclusions
This study investigates the combination of object-based image analysis, convolutional neural networks, and gradient boosting classifiers for land cover classification with a case study in Vietnam. The experiments show an improvement in the overall accuracies with the use of XGBoost (OA = 0.8905), LightGBM (OA = 0.8956), and CatBoost (OA = 0.8956) as replacements for the fully connected layers in a CNN. The proposed hybrid model takes advantage of OBIA in defining boundaries of homogeneous pixels or classes, while the CNN contributes by recognizing edges in plots of the associated object attributes. The features extracted from the last layer form a tabulated dataset for the classification task, which is the strength of the gradient boosting algorithms, as discussed in this study.
Deep learning has been applied predominantly to the classification of satellite images, aerial photos, and unmanned aerial vehicle data, with considerable achievements. Since SPOT 7 was used in this study, only four spectral bands (R, G, B, and NIR) were used to generate object attributes and plots before feeding them to the CNN. Therefore, the inclusion of more spectral bands is worth pursuing. Moreover, free access to such datasets would make it possible to generate seasonally varying features, detect surface classes better, and improve land monitoring accuracy.

Figure 1. A subset of the study area. The image objects are set partly transparent and overlaid on the SPOT image for visualization.
A normalization step is required to ensure all features have a similar scale. A simple min-max method, (x - min)/(max - min), is applied to keep the original data distribution. Then, the normalized data are plotted in two-dimensional space. The plots are used as input patches that are fed into the CNN during the training stages. The full set of image objects is randomly split into training/validation and test sets. This hold-out method is commonly used with CNN-based methods rather than cross-validation because the large amount of training data represents the entire study area well. Step 3: The model is trained with the categorical log-loss function, with fully connected layers as the classifier. Step 4: The training data are fed again into the trained model from the previous step. This time, however, the output of the last dense layer is extracted to build another training dataset, which is then used to train the three gradient boosting algorithms.
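The normalization and hold-out steps can be sketched as follows. The min-max formula is the one given in the text; the 20% test fraction, the random seed, and the helper names are assumptions for illustration.

```python
import random


def min_max_scale(column):
    """Rescale one feature column to [0, 1] with (x - min) / (max - min)."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]  # constant column: no spread to rescale
    return [(x - low) / (high - low) for x in column]


def hold_out_split(n_samples, test_fraction=0.2, seed=42):
    """Randomly split sample indices into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# Hypothetical feature column and a dataset of 100 objects.
scaled = min_max_scale([2.0, 4.0, 6.0])
train_indices, test_indices = hold_out_split(100)
```

Because min-max scaling is a linear map, it preserves the shape of each feature's distribution, which matters here since the scaled values directly determine the shape of the plotted curves.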

Figure 3. Object-based convolutional neural network with gradient boosting algorithms.

Figure 5. Examples of graphs representing variations of the object's attributes. The graphs in (b) were used for CNN-based gradient boosting algorithms. The original tabulated data in (a) were used to train these algorithms for comparison.

Figure 6. Variation of the loss value after 300 epochs.
The training data are again fed to the trained model, but the output of the dense layer (before the fully connected layer) is extracted to form a new training set (40,188 instances, 128 features) and test set (10,046 instances, 128 features). These data were used to train the gradient boosting algorithms, and the results are shown in Table 4.

Figure 7. Classification results from different CNN-based methods.

Table 2 )
are associated with each object.Step 2: Because features are measured in different units and scales, a normalization

Table 3. Samples for training, validation, and testing of the proposed model, and samples for visualization.

Table 4. Statistical indicators of CNN-based and benchmark methods. FC: Fully connected layer.

Table 5. Confusion matrix of the CNN-based gradient boosting methods. PA: Producer's Accuracy, UA: User's Accuracy, OA: Overall Accuracy.