Tree, Shrub, and Grass Classification Using Only RGB Images

Abstract: In this work, a semantic segmentation-based deep learning method, DeepLabV3+, is applied to classify three vegetation land covers, namely tree, shrub, and grass, using only three-band color (RGB) images. DeepLabV3+'s detection performance has been studied on low- and high-resolution datasets that both contain tree, shrub, and grass as well as some other land cover types. The two datasets are heavily imbalanced, with far fewer shrub pixels than tree and grass pixels. A simple weighting strategy known as median frequency weighting was incorporated into DeepLabV3+, which originally used uniform weights, to mitigate the data imbalance issue. The tree, shrub, and grass classification performances are compared when all land cover types are included in the classification and when classification is limited to the three vegetation classes, with both uniform and median frequency weights. Among the three vegetation types, shrub is found to be the most challenging one to classify correctly, whereas correct classification accuracy was highest for tree. It is observed that even though median frequency weighting did not improve the overall accuracy, it resulted in better classification accuracy for underrepresented classes, such as shrub in our case, and it significantly increased the average class accuracy. The classification performance and computation time comparison of DeepLabV3+ with two other pixel-based classification methods on sampled pixels of the three vegetation classes showed that DeepLabV3+ achieves significantly higher accuracy than these methods, with a trade-off of longer model training time.

Tree, shrub, and grass are three vegetation-type land covers, and their classification using remote sensing data has several important applications. For example, shrub information has been utilized to assess the condition of grassland and determine whether it has become unusable because of shrub encroachment [3]. In emergency landings of unmanned air vehicles (UAVs), it is critical to land on grassland rather than on trees or shrubs [5,6]. Removing tall vegetation such as trees and shrubs from the digital surface model (DSM) is an important step in developing an accurate digital terrain model (DTM) [2]. Traditionally, the normalized difference vegetation index (NDVI) has been used for vegetation detection. However, NDVI cannot differentiate tree, shrub, and grass because of their similar spectral characteristics. Moreover, NDVI requires the near infrared (NIR) band, which may not always be available.
For accurate classification of these three vegetation land covers, the use of light detection and ranging (LiDAR) data, with height information via the extracted digital terrain model (DTM), is highly beneficial to the classification process [3], since these three vegetation types differ in height. Nonetheless, although LiDAR may help in detecting tall trees, it is still challenging to distinguish some shrubs from grass [10]. Moreover, NIR and LiDAR data may be expensive to acquire.
Other than the use of LiDAR for extracting height information in the form of a DTM, there is also considerable interest in the remote sensing community in estimating DTMs using stereo images [11][12][13][14]. DTM estimations from stereo images can, however, be noisy at lower heights. Auxiliary methodologies that utilize the spatial information of land covers together with their spectral information could help make these DTM estimations more accurate. In contrast to NIR and LiDAR data, RGB images can be easily obtained with low-cost color cameras. The cost issue is especially important for farmers, who may have a limited budget. In many agricultural monitoring applications, farmers prefer to simply fly a low-cost drone with an onboard low-cost color camera over farmlands for agricultural condition monitoring.
There is increasing interest in adapting deep learning methods for land cover classification after several breakthroughs have been achieved in a variety of computer vision tasks, including image classification, object detection and tracking, and semantic segmentation. In [7], a comparison of convolutional neural network (CNN)-based methods with state-of-the-art object-based image analysis methods is provided for the detection of a protected plant from a shrub family, Ziziphus lotus shrubs, using high-resolution Google Earth TM images. The authors reported higher accuracies with the CNN detectors compared to the other investigated object-based image analysis methods. In [15], progressive cascaded convolutional neural networks are used for single tree detection with Google Earth imagery. In [16], Basu et al. investigated deep belief networks, basic CNNs, and stacked denoising autoencoders on the SAT-6 remote sensing dataset, which includes barren land, trees, grassland, roads, buildings, and water bodies as land cover types. In [17], low-color descriptors and deep CNNs are evaluated on the University of California Merced Land Use dataset (UCM) with 21 classes. In [18], a comprehensive review of land cover classification and object detection approaches using high-resolution imagery is provided. The authors evaluated the performance of deep learning models against traditional approaches and concluded that deep learning-based methods provide an end-to-end solution and perform better than traditional pixel-based methods by utilizing both spatial and spectral information. A number of other works have also shown that semantic segmentation with deep learning methods at the pixel level is quite promising for land cover classification [19][20][21][22].
In this paper, we focus on the classification of three vegetation land covers (tree, shrub, and grass) using only RGB images. We use a semantic segmentation deep learning method, DeepLabV3+ [23], which has been shown to perform better than other deep learning methods such as SegNet [24], the Pyramid Scene Parsing Network (PSP) [25], and Fully Convolutional Networks (FCN) [26]. DeepLabV3+ uses the color image as its only input and does not need any feature extraction process such as texture extraction. In our experiments, we used the Slovenia dataset [27], which is a low-resolution dataset (10 m per pixel), and a custom dataset from the Oregon, US area. The land cover map of this area, which has 1 m per pixel resolution, is in the public domain [28], and we obtained the color image (~0.25 m/pixel) from Google Maps. Both the Slovenia and Oregon datasets include the three vegetation types in addition to some other land cover types.
DeepLabV3+ is first applied to both the low- and high-resolution datasets using all land covers. In both datasets, the number of pixels representing some of the land covers is much smaller than that of others, making the two datasets heavily imbalanced. Following suggestions from the developers of DeepLabV3+ posted on their GitHub page [29], we extracted the pixel counts for each land cover, computed the median frequency weights [30], and assigned these weights to the land cover classes when training the DeepLabV3+ models. For comparison purposes, we trained with both uniform weights and median frequency weights.
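The median frequency weight computation can be sketched as follows; this is a minimal illustration following the definition in [30] (the function and variable names are our own, not taken from the DeepLabV3+ code):

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes, ignore_label=255):
    """Median frequency balancing [30]:
    freq(c)   = pixels of class c / total labeled pixels of images containing c
    weight(c) = median(freq) / freq(c)
    `label_maps` is a list of 2-D ground-truth arrays (one per image)."""
    pixel_count = np.zeros(num_classes)
    image_pixels = np.zeros(num_classes)  # labeled pixels of images where c appears
    for gt in label_maps:
        valid = gt != ignore_label
        for c in np.unique(gt[valid]):
            pixel_count[c] += np.count_nonzero(gt == c)
            image_pixels[c] += np.count_nonzero(valid)
    freq = pixel_count / np.maximum(image_pixels, 1)
    median = np.median(freq[freq > 0])
    return np.where(freq > 0, median / np.maximum(freq, 1e-12), 0.0)
```

Classes rarer than the median frequency receive weights above 1, and common classes receive weights below 1, which is what rebalances the training loss.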
With uniform weights, we noticed that the underrepresented classes, such as shrub, had quite low classification accuracies. With median frequency weights [30], the classification accuracies of the underrepresented classes improved considerably. The trade-off was a degradation in the accuracies of overrepresented classes such as tree. We then applied the same classification investigation to the two datasets, but this time including only the three vegetation classes (tree, shrub, and grass) and excluding all other land cover classes. In doing so, it is assumed that the three vegetation classes can be separated from the other land covers. The objective of this investigation was to create a pure classification scenario that focuses only on these three vegetation classes, eliminating the impact of the misclassification of all other land covers on the three vegetation classes' accuracy, and thus to better assess DeepLabV3+'s classification performance. This investigation showed similar trends with respect to median frequency weights: with uniform weights, shrub detection was very poor, and it improved significantly with median frequency weights. Moreover, when the vegetation-only classification results are compared with the classification results for all land covers, a considerable accuracy improvement was observed for all three vegetation types. This analysis also indicated that the highest correct classification accuracy corresponded to tree, whereas shrub was the most difficult to classify correctly.
The classification performance and computation time comparison of DeepLabV3+ with two other pixel-based machine learning classification methods, support vector machine [31] and random forest [32], showed that DeepLabV3+ generates more accurate classification results, with a trade-off of longer model training time.
It should be emphasized that we only used the RGB bands, without any help from LiDAR, NIR bands, or stereo images, and we still obtained 78% average classification accuracy on the Slovenia dataset and 79% on the Oregon dataset for trees, shrubs, and grass (vegetation-only classification). For comparison, the results in [3] (even though the dataset used in that work was different from ours) attained only 53% for the combined class of trees and shrubs when Red+Green+NIR bands were used. This clearly shows that the standalone use of DeepLabV3+ with only RGB images for classifying trees, shrubs, and grass is effective to some extent. It is also low cost, since low-resolution color cameras can be used. Moreover, it can be considered an auxiliary methodology to help make LiDAR-extracted or stereo-image-extracted DTM estimations more accurate. The contributions of this paper are:

1. Provided a comprehensive evaluation of a deep learning-based semantic segmentation method, DeepLabV3+, for the classification of three similar-looking vegetation types (tree, shrub, and grass) using color images only, at both low and high resolution, and outlined classification performance and computation time comparisons of DeepLabV3+ with two pixel-based classifiers.
2. Discussed the data imbalance issue with DeepLabV3+ and demonstrated that the average class accuracy can be increased considerably by using median frequency weights instead of uniform weights during model training.
3. Demonstrated that a higher classification accuracy can be achieved for each of the three vegetation types (tree, shrub, and grass) with DeepLabV3+ if the classification is limited to the three green vegetation classes only, rather than including all land covers present in the image datasets.
4. Provided insights into which of these three vegetation types are more challenging to classify.
Our paper is organized as follows. Section 2 provides technical information about DeepLabV3+ and the datasets used in our experiments. Section 3 contains two case studies (8-class and 3-vegetation-only class) for the Slovenia dataset, another two case studies (6-class and 3-vegetation-only class) for the Oregon dataset, and a performance and computation time comparison of DeepLabV3+ with two pixel-based classifiers. Finally, Section 4 concludes the paper with some remarks.

Method
DeepLabV3+ [33] is a semantic segmentation method that provided very promising results in the PASCAL VOC-2012 data challenge [34]. For the PASCAL VOC-2012 dataset, DeepLabV3+ currently has the best ranking among several methods, including SegNet [24], PSP [25], and FCN [26]. In a very recent study [35], which involves land cover type classification, it was reported that DeepLabV3+ performed better than PSP and SegNet.
DeepLabV3+ uses the Atrous Spatial Pyramid Pooling (ASPP) mechanism, which exploits multi-scale contextual information to improve segmentation [23]. Atrous convolution (atrous meaning "with holes") has an advantage over standard convolution: it provides responses at all image positions while the number of filter parameters and the number of operations stay constant [23]. DeepLabV3+ has an encoder-decoder network structure. The encoder consists of a set of processes that reduce the feature maps and capture semantic information, and the decoder recovers the spatial information and produces sharper segmentations. The block diagram of DeepLabV3+ can be seen in Figure 1.
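As a concrete 1-D illustration of this property (our own sketch, not from the DeepLabV3+ implementation), spacing the taps of a size-k kernel `rate` samples apart enlarges the receptive field from k to k + (k - 1)(rate - 1) while the parameter count stays at k:

```python
import numpy as np

def atrous_conv1d(x, kernel, rate=1):
    """1-D atrous (dilated) convolution: kernel taps are spaced `rate`
    samples apart, enlarging the receptive field without adding
    parameters. `rate=1` reduces to standard convolution."""
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective receptive field of the kernel
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out
```

A 3-tap kernel at rate 2 covers 5 input samples per output, yet still has only 3 parameters; ASPP applies several such rates in parallel to capture multi-scale context.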

Training with DeepLabV3+
A Windows 10 machine with a GPU card (RTX2070) and 16 GB of memory is used for DeepLabV3+ model training and testing, running on the TensorFlow framework. For training a DeepLabV3+ model on either of the two datasets, the weights of a pre-trained model, with the exception of the logit layer weights, are used for initialization, and these weights are fine-tuned with further training. These initial weights belong to a model pre-trained on the PASCAL VOC 2012 dataset ("deeplabv3_pascal_train_aug_2018_01_04.tar.gz"). Because the number of land covers in the two investigated training datasets differs from the number of classes in the PASCAL VOC-2012 dataset, the logit weights in the pre-trained model are excluded. The DeepLabV3+ training parameters used in this work can be seen in Table 1. The number of training steps was set to 100,000 for both datasets.
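The initialization step can be pictured as filtering the pre-trained checkpoint by variable name. The sketch below is a simplified stand-in for TensorFlow's checkpoint restoring, not the actual API; the scope name "logits" and the dict-based interface are assumptions for illustration:

```python
def init_from_pretrained(pretrained, model_shapes, exclude_scope="logits"):
    """Select pre-trained weights for initialization, skipping the
    class-count-dependent logit layer and any shape mismatches;
    skipped variables keep their fresh random initialization."""
    init = {}
    for name, value in pretrained.items():
        if exclude_scope in name:
            continue  # logit weights depend on the number of classes
        if model_shapes.get(name) != value.shape:
            continue  # shape changed between datasets: cannot reuse
        init[name] = value
    return init
```

All backbone and ASPP weights transfer unchanged; only the final per-class logits are trained from scratch for the new set of land cover classes.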

Slovenia Dataset
The Slovenia dataset [27] was collected by the Sentinel-2 satellite. It has a resolution of 10 m and 6 bands (L1C bands). Among these 6 bands, only the three color image bands (RGB) are used in this investigation. The dataset originally contains 293 images of size 1010 × 999. After excluding 91 images, which mostly consist of "no data" labels in their ground truth annotations, the remaining 202 images are partitioned into four non-overlapping images of size 505 × 499 each, bringing the total number of images in the modified dataset to 808. Among these 808 images, 708 are randomly selected for training a DeepLabV3+ model, and 100 are left for testing. The eight land covers in the Slovenia dataset are: cultivated land, forest, grassland, shrub land, water, wetlands, artificial surface, and barren land. The satellite images were captured over the European country of Slovenia in 2017. The Slovenia dataset contains all three vegetation types we are interested in (forest, shrub land, and grassland).
An example color image from the Slovenia dataset and its ground truth annotation can be seen in Figure 2. White pixels correspond to unlabeled samples. In Figure 2b, red is used to annotate forest, green is used for grassland, and mustard yellow is used for shrub land.
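The partitioning of each 1010 × 999 image into four non-overlapping 505 × 499 tiles can be sketched as below (any remainder at the right/bottom edge, here one column of pixels, is discarded; the function name is our own):

```python
import numpy as np

def partition(image, patch_h, patch_w):
    """Split an image into non-overlapping patch_h x patch_w tiles,
    discarding any remainder at the right/bottom edges."""
    h, w = image.shape[:2]
    return [image[r:r + patch_h, c:c + patch_w]
            for r in range(0, h - patch_h + 1, patch_h)
            for c in range(0, w - patch_w + 1, patch_w)]
```

For a 1010 × 999 RGB image, `partition(img, 505, 499)` yields exactly four tiles of size 505 × 499 × 3.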

Oregon Dataset
The commercial company EarthDefine [36] provides sample land cover maps that are publicly accessible. One of these sample land cover maps, containing the three vegetation types (tree, shrub, and grass), is used as the second dataset in this work. In addition to tree, shrub, and grass, there are three other land covers: bare land, impervious surface, and water. The land cover map belongs to an area in Gleneden Beach, Oregon. The land cover map, together with its color image reconstructed from the image tiles downloaded via the Google Maps API [37], can be seen in Figure 3. In the land cover map, yellow is used for shrub, dark blue for grass, and orange for tree.
The land cover map is a single-band image in GeoTIFF format. According to the information that accompanied the sample land cover map [28], the Oregon land cover map is a high-resolution (1 m) land cover data product derived from 1 m, 4-band color infrared imagery flown between 5 June 2016 and 11 August 2016 as part of the National Agriculture Imagery Program (NAIP). The website also mentions that LiDAR data flown between 2009 and 2012 was used to aid the classification process [28]. The Google Maps API is used to retrieve the corresponding high-resolution color image tiles, at an image resolution close to 25 cm, for the same area as the land cover map. The procedure in [4] is used to retrieve the color image tiles from Google Maps and to reconstruct the corresponding color image. To register the reconstructed color image to the land cover map, a GDAL tool [4] is used that warps the land cover map into the same WGS84 model as the reconstructed color image. For the DeepLabV3+ application, the color image and the land cover map are partitioned into 404 image patches of size 512 × 512. All 404 image patches contain at least one vegetation type. Among these 404 image patches, 304 are randomly selected for training and 100 for testing. Because the land cover map was collected in 2016 and the color image in 2019, there could be some discrepancies between the land cover map and the color image.

Forest-Grassland-Shrub Classification in Slovenia Dataset with Eight Classes
In this investigation, all eight land covers in the Slovenia dataset are used, three of which are forest (tree), shrub land (shrub), and grassland (grass). Two DeepLabV3+ models are trained: (a) with uniform weights for the 8 classes, and (b) with median frequency weights for the 8 classes.
The number of pixels for each land cover (denoted "pixel count"), its frequency, and the median frequency balancing weights for the Slovenia dataset can be seen in Table 2. To relate pixel count values to physical area, 10,000 pixels correspond to an area of 1 km² in the Slovenia dataset. In Table 2, the term frequency represents the number of pixels of the class divided by the total number of pixels in images that had an instance of that class [30]. The median frequency balancing weight of class C is then computed as the median frequency divided by the frequency of class C [30]. In Table 2, the median frequency balancing weights are denoted by the "weights" row. From Table 2, it can be noticed that the pixel counts and corresponding frequencies for some classes, such as wetlands and water, are quite low. A significant imbalance can also be seen between shrub land and the other vegetation classes (grassland and forest). After median frequency weighting, water and wetlands receive heavier weights. The confusion matrices for the uniform weight and median frequency weight cases can be seen in Tables 3 and 4, respectively. The class accuracy for each land cover, the average class accuracy for the eight classes, the average class accuracy for the three vegetation classes, the overall accuracy, and the kappa metric values [38] can be seen in Table 5. Intersection over union (IoU) values for each land cover and the mean IoU (mIoU) can be seen in Table 6. IoU is defined as the intersection area over the union area [39]. With respect to the three vegetation classes, using uniform weights, shrub land classification accuracy is extremely poor, with a value of 0.0945, and from the confusion matrix in Table 3, it is noticed that the shrub land pixels are misclassified mostly as forest or grassland. This observation indicates the challenges of using only RGB color images, especially when the training data is heavily imbalanced.
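For reference, all of the measures reported in Tables 3-6 can be derived from a confusion matrix; the sketch below follows the standard definitions of per-class accuracy, overall accuracy, kappa [38], and IoU [39] (the function name is our own):

```python
import numpy as np

def segmentation_metrics(cm):
    """Per-class accuracy, average class accuracy, overall accuracy,
    kappa, per-class IoU, and mIoU from a confusion matrix
    (rows = ground truth, columns = prediction)."""
    cm = np.asarray(cm, dtype=float)
    total, tp = cm.sum(), np.diag(cm)
    class_acc = tp / np.maximum(cm.sum(axis=1), 1)       # recall per class
    overall = tp.sum() / total
    # Kappa: agreement beyond what row/column marginals predict by chance.
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (overall - expected) / (1 - expected)
    # IoU: true positives over (ground truth + predicted - true positives).
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1)
    return class_acc, class_acc.mean(), overall, kappa, iou, iou.mean()
```

Average class accuracy weights every class equally, which is why it responds strongly to rare-class improvements, while overall accuracy is dominated by the most frequent classes.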
In many classifiers, including deep learning ones, when the dataset is heavily imbalanced, the error from the overrepresented classes contributes much more to the loss value than the error from the underrepresented classes. This biases the loss function toward the overrepresented classes, resulting in poor classification performance for the underrepresented classes, such as shrub in this case. With the use of median frequency weights when training a DeepLabV3+ model, the shrub land class accuracy increased significantly from 0.0945 to 0.4782, and the average eight-class accuracy also increased considerably, from 0.4830 to 0.5634. The overall accuracy (eight classes) was 0.8082 with uniform weights and was reduced to 0.7044 with median frequency weights. The decrease in overall accuracy with median frequency weights is understandable: although median frequency weighting improves the classification accuracy of underrepresented classes, the trade-off is a reduction in the accuracy of the overrepresented classes, which in turn reduces the overall accuracy. With respect to the kappa and mIoU measures, both have lower values when median frequency weights are used. Even though the average accuracy for the eight classes improved significantly, the average accuracy for the three vegetation classes stayed the same with median frequency weights compared to uniform weights.
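The effect of class weights on the loss can be illustrated with a minimal weighted cross-entropy sketch (our own simplification over a flat list of pixels; the actual DeepLabV3+ loss operates on logits and spatial maps and handles the ignore label):

```python
import numpy as np

def weighted_pixel_cross_entropy(probs, labels, class_weights):
    """Weighted cross-entropy over N pixels: each pixel's loss term is
    scaled by the weight of its ground-truth class, so errors on
    heavily weighted (rare) classes contribute more to the gradient."""
    p_true = probs[np.arange(len(labels)), labels]  # predicted prob of true class
    w = class_weights[labels]
    return np.sum(-w * np.log(p_true + 1e-12)) / np.sum(w)
```

With uniform weights, a few poorly predicted rare-class pixels are drowned out by the many well-predicted common-class pixels; upweighting the rare class increases its share of the loss and hence of the gradient.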

Forest-Grassland-Shrub Classification in Slovenia Dataset with Three Vegetation Classes
The same investigation is repeated by including only the three vegetation classes (forest, shrub, and grass) and excluding all other land cover types from the classification. This investigation assumes that the three vegetation classes can be separated from the other land covers. The objective was to create a pure classification scenario that includes the three vegetation classes only, eliminating the impact of the misclassification of all other land covers on the three vegetation classes' accuracy. As before, two DeepLabV3+ models are trained, with uniform and median frequency weights. All non-vegetation classes (cultivated land, water, wetlands, artificial surface, and barren land) are excluded by labeling them as ignore during training of the DeepLabV3+ models. The pixel counts for each of the three vegetation classes, the class frequencies, and the median frequency balancing weights can be seen in Table 7. It can be noticed that forest has the highest pixel count, followed by grassland; shrub land has the lowest number of pixels among the three vegetation classes. Tables 8-10 show the confusion matrices, accuracy-related measures, and IoU-related measures for the three-vegetation-class results with uniform and median frequency weights. It can be noticed from Table 9 that with median frequency weights, the correct classification accuracy of shrub land, which was very poor with uniform weights, significantly improved from 0.1597 to 0.6915. When uniform weights were used, most of the shrub land pixels were misclassified as forest according to the confusion matrix. The average classification accuracy also improved from 0.6702 to 0.7802 with median frequency weights. The overall accuracy was reduced from 0.9099 to 0.8310, since forest and grassland accuracy values dropped with the use of median frequency weights as a trade-off for the significant accuracy improvement in shrub land classification.
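Excluding the non-vegetation classes can be implemented by relabeling them with the ignore label before training; a minimal sketch follows (the class ids and the ignore value 255 are assumptions for illustration, not the actual Slovenia label encoding):

```python
import numpy as np

IGNORE = 255                 # assumed ignore-label value
VEGETATION_IDS = [1, 2, 3]   # assumed ids for forest, grassland, shrub land

def keep_vegetation_only(gt):
    """Relabel every non-vegetation pixel as IGNORE so it is excluded
    from the loss and the accuracy computation during training."""
    out = gt.copy()
    out[~np.isin(gt, VEGETATION_IDS)] = IGNORE
    return out
```

Pixels carrying the ignore label contribute neither to the training loss nor to the evaluation, so the models are optimized and scored on the three vegetation classes alone.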
Similar patterns are observed with the IoU measure. Table 11 shows the three-vegetation classification accuracy comparisons for the three- and eight-class DeepLabV3+ models. From Table 11, it can be noticed that the classification accuracies of all three vegetation types (forest, grass, and shrub) improve with the three-vegetation-class classification in comparison to including all land covers in an 8-class classification. Even though using median frequency weights somewhat reduces the classification accuracies of forest and grassland, in return the classification accuracy of shrub land gets a significant boost. It can also be noticed from Table 11 that, among the three vegetation classes, shrub land has the lowest correct classification accuracy whereas forest has the highest. This ranking of classification difficulty makes sense from a visual perspective: forest and grassland have more easily distinguishable spatial features and form the two opposite ends of the visual appearance range, while shrub falls somewhere between them. The confusion matrices support this by revealing that shrub land is misclassified as forest or grassland in large amounts, whereas the misclassification ratio is smallest for forest, followed by grassland. We provide screenshots of two images from the Slovenia test dataset (three-vegetation-class DeepLabV3+ model trained using median frequency weights), including the color images together with the estimated and ground truth land cover maps. In the land cover maps in Figure 4, black corresponds to forest, red to grassland, green to shrub land, and white to the ignore class, i.e., the pixel locations excluded from DeepLabV3+ model training.
Even though the color images look very challenging for classification due to the low resolution, DeepLabV3+ is found to perform considerably well.

Tree-Grass-Shrub Classification in Oregon Dataset with Six Land Cover Types
In this investigation, all six land covers of the Oregon dataset are used in the classification, three of which are tree, grass, and shrub. Table 12 shows the pixel counts for each land cover and the corresponding median frequency weight values used for DeepLabV3+ model training. Considering the ~25 cm image resolution, an area of 1 km² corresponds to about 16 million pixels in the Oregon dataset. Tables 13 and 14 show the resultant confusion matrices with uniform and median frequency weights, respectively. Tables 15 and 16 show the accuracy- and IoU-related measures. With the use of median frequency weights, the shrub classification accuracy increased significantly from 0.4951 to 0.6279, and the average classification accuracy increased from 0.7156 to 0.7688. Similar trends to those observed in the Slovenia dataset with respect to shrub and average classification accuracy were also observed in this dataset. Unlike the Slovenia dataset results, there is almost no change in the overall accuracy and kappa values when switching from uniform weights to median frequency weights. An increase in the mIoU value is also observed with median frequency weights.

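As an illustration of the median frequency weighting discussed above, the weights can be derived from per-class pixel counts as in the following sketch. This is not the exact implementation used in this work, and the example counts are hypothetical placeholders, not the actual Table 12 values.

```python
import numpy as np

def median_frequency_weights(pixel_counts):
    """Compute median frequency balancing weights.

    weight_c = median(freq) / freq_c, where freq_c is the fraction of
    pixels belonging to class c. Underrepresented classes receive
    weights > 1 and overrepresented classes receive weights < 1.
    """
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freq = counts / counts.sum()
    return np.median(freq) / freq

# Hypothetical pixel counts for an imbalanced three-class case
# (tree, grass, shrub) -- not the actual dataset counts.
weights = median_frequency_weights([8_000_000, 5_000_000, 500_000])
```

With uniform weights every class contributes equally to the training loss per pixel, so the scarce shrub pixels are easily overwhelmed; the weighting above scales up the loss contribution of underrepresented classes.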

Tree-Grass-Shrub Classification in Oregon Dataset with Three Land Cover Types
Only the three vegetation classes (tree, shrub, and grass) are included in the classification. Table 17 shows the pixel counts for the three vegetation classes and the median frequency weights. Tables 18-20 correspond to the confusion matrices and the accuracy- and IoU-related measures with uniform and median frequency weights. Here, tree is the overrepresented class and shrub is the underrepresented class. It is worth mentioning that there are relatively more shrub pixels in the Oregon dataset than in the Slovenia dataset. With median frequency weights, considerable improvements can be seen mainly in the shrub classification accuracy, which increases from 0.5456 to 0.5918, followed by an improvement in the grass classification accuracy, from 0.8216 to 0.8470. Tree classification accuracy, however, drops from 0.9523 to 0.9284, as expected since tree is the overrepresented class. Overall, the average classification accuracy improves by about 1.6%, from 0.7732 to 0.7890. The improvement in average classification accuracy after switching to median frequency weights is not as large as the one observed in the Slovenia dataset: the Oregon dataset has relatively more shrub pixels and a higher resolution, so shrub is classified considerably better with uniform weights than in the Slovenia dataset, where shrub is severely underrepresented. Table 21 shows the three vegetation classification accuracy comparisons with both the three- and six-class DeepLabV3+ models. From Table 21, it can be seen that the classification accuracies of all three vegetation types (tree, grass, and shrub) improve with the three-vegetation-only classification in comparison to including all land covers; the only exception is shrub with median frequency weights.
Even though using median frequency weights results in some reduction in the tree classification accuracy, in return the classification accuracies of shrub and grass improve. Similar to the Slovenia dataset, among the three vegetation classes, shrub has the lowest correct classification accuracy whereas tree has the highest. Sample screenshots from two images of the Oregon test dataset (three-vegetation-class DeepLabV3+ model trained using median frequency weights) can be seen in Figure 5.
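The accuracy- and IoU-related measures reported throughout (per-class accuracy, average class accuracy, overall accuracy, per-class IoU, mIoU, and kappa) can all be derived from a confusion matrix. The following is a minimal sketch of those derivations; the example matrix is made up for illustration, not taken from the reported tables.

```python
import numpy as np

def segmentation_metrics(cm):
    """Derive the reported measures from a confusion matrix.

    cm[i, j] = number of pixels of true class i predicted as class j.
    Returns per-class accuracy, average class accuracy, overall
    accuracy, per-class IoU, mIoU, and Cohen's kappa.
    """
    cm = np.asarray(cm, dtype=np.float64)
    total = cm.sum()
    tp = np.diag(cm)
    per_class_acc = tp / cm.sum(axis=1)          # recall per class
    overall_acc = tp.sum() / total
    # IoU_c = TP / (TP + FP + FN)
    iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)
    # Cohen's kappa: agreement beyond chance
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (overall_acc - expected) / (1.0 - expected)
    return per_class_acc, per_class_acc.mean(), overall_acc, iou, iou.mean(), kappa
```

Note that the average class accuracy weights every class equally regardless of pixel count, which is why median frequency weighting can raise it substantially even when the pixel-weighted overall accuracy barely moves.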

Sampled Pixels Investigation for Comparison of DeepLabv3+ with Pixel-Based Classifiers
The classification performance and computation time of DeepLabV3+ are compared with two pixel-based classification methods on sampled pixel sets from the three vegetation classes. The two classifiers are support vector machine (SVM) [31] and random forest (RF) [32]. The features used with the two classifiers are the RGB values (as a baseline), GLCM and Gabor texture features extracted from image patches of size 21 × 21, and the combined set of GLCM and Gabor texture features. This investigation assesses DeepLabV3+'s performance with respect to two well-known pixel-based classification methods. Both the Slovenia and Oregon datasets are used in the investigation.

Using the ground truth land cover maps, separate maps for each of the three vegetation types are generated. An erosion morphology operator is applied to these maps with a square structuring element of size 21. From each of the eroded individual land cover maps of the training dataset, ~100,000 pixels per vegetation type (~300,000 total) are randomly selected from the Slovenia dataset. Using these pixel locations, the image patches of size 21 × 21 centered on each selected pixel are identified. This process enables selecting homogeneous land cover pixels for the three vegetation types, which can then be used for training the pixel-based classifier models. By using an equal number of pixels from each vegetation type when forming the training data, we aim to exclude data imbalance effects from the classification analyses. In addition to training separate models using GLCM features, Gabor features, and combined GLCM/Gabor texture features, we also trained SVM and RF models using the RGB values of the selected pixels as a baseline.
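The erosion-based sampling described above can be sketched as follows. This is an illustrative reimplementation, not the code used in this work: a pixel survives the erosion only if its entire 21 × 21 neighborhood belongs to the class, so every sampled location is the center of a fully homogeneous patch.

```python
import numpy as np

def erode_square(mask, size=21):
    """Binary erosion with a size x size square structuring element,
    implemented with an integral image: a pixel survives only if
    every pixel in its size x size neighborhood belongs to the mask."""
    m = mask.astype(np.int64)
    # Integral image with a zero border for O(1) window sums
    ii = np.zeros((m.shape[0] + 1, m.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = m.cumsum(0).cumsum(1)
    r = size // 2
    h, w = m.shape
    out = np.zeros_like(mask, dtype=bool)
    ys, xs = np.mgrid[r:h - r, r:w - r]
    # Window sum via four integral-image lookups
    win = (ii[ys + r + 1, xs + r + 1] - ii[ys - r, xs + r + 1]
           - ii[ys + r + 1, xs - r] + ii[ys - r, xs - r])
    out[r:h - r, r:w - r] = win == size * size
    return out

def sample_homogeneous_pixels(mask, n, rng):
    """Randomly pick up to n pixel locations from the eroded mask,
    i.e. centers of fully homogeneous size x size patches."""
    ys, xs = np.nonzero(erode_square(mask))
    idx = rng.choice(len(ys), size=min(n, len(ys)), replace=False)
    return ys[idx], xs[idx]
```

Repeating this per class map and capping each class at the same number of samples yields the balanced training sets described above.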
GLCM texture features (17 in total) and Gabor texture features (28 in total) are extracted from the patches at the ~300,000 pixel locations [5]. Using these extracted training features, ~100,000 per vegetation type (~300,000 total), SVM and RF models are trained for the Slovenia and Oregon datasets separately. For the test data, we randomly selected ~26,000 pixel locations from each of the three vegetation types (~78,000 total) in the Slovenia dataset and 400,000 pixel locations from each type (1,200,000 total) in the Oregon dataset. We had to use fewer pixels in the Slovenia test data since there were not as many homogeneous shrub image patches in the Slovenia test data.
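As a hedged sketch of GLCM-style texture extraction (the actual work uses 17 GLCM features; only four common ones are shown here, and the quantization level and offset choices below are illustrative assumptions):

```python
import numpy as np

def glcm_features(patch, levels=8, offset=(0, 1)):
    """A few illustrative GLCM texture features from one gray patch.

    The patch is quantized to `levels` gray levels, a symmetric
    normalized co-occurrence matrix is built for one pixel offset,
    and contrast, homogeneity, energy, and entropy are returned.
    """
    q = np.minimum((patch.astype(np.float64) / 256.0 * levels).astype(int),
                   levels - 1)
    dy, dx = offset
    h, w = q.shape
    # Pair each pixel with its neighbor at the given offset
    a = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    b = q[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    glcm = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(glcm, (a.ravel(), b.ravel()), 1.0)
    glcm += glcm.T            # make symmetric
    glcm /= glcm.sum()        # normalize to probabilities
    i, j = np.mgrid[0:levels, 0:levels]
    contrast = ((i - j) ** 2 * glcm).sum()
    homogeneity = (glcm / (1.0 + np.abs(i - j))).sum()
    energy = (glcm ** 2).sum()
    entropy = -(glcm[glcm > 0] * np.log2(glcm[glcm > 0])).sum()
    return contrast, homogeneity, energy, entropy
```

A perfectly uniform patch yields zero contrast and entropy and unit homogeneity and energy, while textured patches such as tree canopies spread probability mass off the GLCM diagonal.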
For SVM, we used the LibSVM tool [40] with C-SVM classification and the RBF (radial basis function) kernel. For the optimal SVM parameters (g and c), LibSVM's parameter selection tool is used. This tool uses cross-validation (CV) to estimate the accuracy of each parameter combination in a specified range; we applied five-fold cross-validation. For the c parameter, we scanned the range c ∈ 2^{-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15}, and for the g parameter, we scanned the range g ∈ 2^{3, 1, -1, -3, -5, -7, -9, -11, -13, -15}. For RF, we used the Matlab source code in [41]. We set the number of trees, ntree, to 500 after trial and error to find the highest classification accuracy, and let the other RF parameter, mtry, be identified automatically based on the total number of features. For technical information about RF and its parameters (ntree and mtry), one can refer to [41].
Table 22 shows the three vegetation class accuracies, average classification accuracy, and kappa values for DeepLabV3+ and the two pixel-based classifiers for the Slovenia dataset. The two pixel-based classifiers use RGB pixel values (as a baseline), GLCM features, Gabor features, and combined GLCM/Gabor features. For DeepLabV3+, the segmentation estimates at the randomly selected pixel locations are simply retrieved from the previously generated results with median frequency weights and used in generating the performance measures. Table 23 shows the IoU measures for each of the three vegetation classes and the mIoU. Similarly, Tables 24 and 25 give the accuracy- and IoU-related measures for the Oregon dataset. From the results, it can be seen that in both datasets DeepLabV3+ performs significantly better than the two pixel-based classifiers. Table 26 compares DeepLabV3+ with these two classifiers with respect to computation time (model training and testing).
The Slovenia dataset is used for the computation time comparison, and GLCM features (17 in total) are used in the two pixel-based classifiers. DeepLabV3+ has the longest model training time, but its test time is less than SVM's. RF is the fastest method in both training and testing while providing classification accuracy close to SVM's. Overall, the results show that DeepLabV3+ provides more accurate classification results than these two pixel-based classifiers, with the trade-off of a longer model training time.
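The five-fold cross-validation grid search over the c and g ranges can be sketched as follows. This is an illustrative skeleton, not LibSVM's actual parameter selection tool: `cv_score` is a hypothetical stand-in for training and evaluating an RBF C-SVM on one fold, and the toy scoring function at the end exists only so the sketch runs.

```python
import itertools
import numpy as np

# Parameter grids matching the scanned exponent ranges
c_range = [2.0 ** e for e in (-5, -3, -1, 1, 3, 5, 7, 9, 11, 13, 15)]
g_range = [2.0 ** e for e in (3, 1, -1, -3, -5, -7, -9, -11, -13, -15)]

def grid_search(cv_score, n_folds=5):
    """Exhaustive search over (c, g): return the pair with the best
    mean cross-validation score. `cv_score(c, g, fold)` stands in
    for training/evaluating an RBF C-SVM on one CV fold."""
    best, best_score = None, -np.inf
    for c, g in itertools.product(c_range, g_range):
        score = np.mean([cv_score(c, g, f) for f in range(n_folds)])
        if score > best_score:
            best, best_score = (c, g), score
    return best, best_score

# Toy stand-in score peaking at c = 2**3, g = 2**-5 (hypothetical)
toy = lambda c, g, f: -abs(np.log2(c) - 3) - abs(np.log2(g) + 5)
```

Scanning exponents in steps of two, as above, is the usual coarse grid; LibSVM's guide suggests refining with a finer grid around the coarse optimum.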

Discussion
Even though height data in the form of a DTM could add significant capability for classifying the three similar-looking vegetation land covers (tree, shrub, and grass), obtaining height data via LiDAR can be costly, and DTM estimates from LiDAR may not be highly accurate at lower heights. DTM estimation via stereo images could be an alternative to LiDAR, but it has its own challenges in terms of noisy DTM estimates, especially at lower heights. The NIR band helps detect vegetation when used with the Red band via the NDVI index, but faces setbacks when it comes to classifying vegetation land covers with similar spectral characteristics such as tree, shrub, and grass. As an example, in [3], LiDAR data were used together with Red, Green, and NIR band images for land cover classification and, differently from our work, the authors combined trees and shrubs into a single class, which is a relatively less challenging problem than ours since in our case tree, shrub, and grass are set as three separate classes. In [3], it is reported that if only the Red, Green, and NIR data were used, the classification accuracy of "trees and shrubs" was only 52.9%; with LiDAR, the authors stated that the classification performance improved to 89.7%. We achieved ~59.0% average classification accuracy for the three vegetation classes in the 8-class case and 78.0% in the three-class vegetation-only case for the low-resolution Slovenia dataset. For the high-resolution Oregon dataset, we achieved ~74.8% average classification accuracy for the three vegetation classes in the 6-class case and ~78.9% in the three-class vegetation-only case. Considering that only the RGB color bands were used, without LiDAR or NIR bands, and that each of the three vegetation types has its own class, the classification results with DeepLabV3+ using median frequency weights are quite remarkable.

Conclusions
Without using NIR and LiDAR, it is challenging to correctly classify trees, shrubs, and grass. In some cases, even the use of NIR and LiDAR may not provide highly accurate results, so it is important to have auxiliary methods that can supply supportive information and increase the confidence of classification decisions made using LiDAR data. In this paper, we report new results using a semantic segmentation-based deep learning method to tackle this challenging problem using only RGB images.
We provided a comprehensive evaluation of DeepLabV3+ for classification of three similar-looking vegetation types, which are tree, shrub, and grass, using color images only, with both a low-resolution and a high-resolution dataset. The data imbalance issue with DeepLabV3+ is discussed, and it is demonstrated that the average class accuracy can be increased considerably by using median frequency weights during model training instead of uniform weights. It is observed in both datasets that higher tree, grass, and shrub classification accuracy can be achieved with DeepLabV3+ if the classification is limited to these three vegetation classes only rather than including all other land cover types present in the color image datasets. In both the Slovenia and Oregon datasets, the highest classification accuracy corresponds to the "tree" type, whereas the "shrub" type is found to be the most challenging to classify accurately. In addition, the performance of DeepLabV3+ is compared with two state-of-the-art machine learning classification algorithms (SVM and random forest) that use RGB pixel values, GLCM and Gabor texture features, and the combination of the two sets of texture features. It is observed that DeepLabV3+ outperforms both SVM and random forest. Being a semantic segmentation-based method, DeepLabV3+ has an advantage over pixel-based classifiers in that it utilizes both spectral (via RGB bands only) and spatial information.
Future research directions include customizing the DeepLabV3+ framework to accept more than three channels (adding the NIR band to the three color bands) and utilizing a digital terrain model (DTM), either obtained from LiDAR sensor data or estimated from stereo satellite images, to further improve the classification accuracy of tree, grass, and shrub.

Conflicts of Interest:
The authors declare no conflict of interest.