Vegetation Detection Using Deep Learning and Conventional Methods

Land cover classification with a focus on chlorophyll-rich vegetation detection plays an important role in urban growth monitoring and planning, autonomous navigation, drone mapping, biodiversity conservation, etc. Conventional approaches usually apply the normalized difference vegetation index (NDVI) for vegetation detection. In this paper, we investigate the performance of deep learning and conventional methods for vegetation detection. Two deep learning methods, DeepLabV3+ and our customized convolutional neural network (CNN), were evaluated with respect to their detection performance when the training and testing datasets originated from different geographical sites with different image resolutions. A novel object-based vegetation detection approach, which utilizes NDVI, computer vision, and machine learning (ML) techniques, is also proposed. The vegetation detection methods were applied to high-resolution airborne color images consisting of RGB and near-infrared (NIR) bands. RGB color images alone were also used with the two deep learning methods to examine their detection performance without the NIR band. The detection performances of the deep learning methods with respect to the object-based detection approach are discussed, and sample images from the datasets are used for demonstrations.


Introduction
Land cover classification [1] has been widely used in change detection monitoring [2], construction surveying [3], agricultural management [4], green vegetation classification [5], identifying emergency landing sites for UAVs [6,7], biodiversity conservation [8], land-use [9], and urban planning [10]. One important application of land cover classification is vegetation detection. In Skarlatos et al. [3], chlorophyll-rich vegetation detection was a crucial stepping stone to improving the accuracy of the estimated digital terrain model (DTM). Upon detection, vegetation areas were automatically removed from the digital surface model (DSM) to obtain better DTM estimates. Bradley et al. [11] conducted chlorophyll-rich vegetation detection to improve autonomous navigation in natural environments for autonomous mobile robots operating in off-road terrain. Zare et al. [12] used vegetation detection to minimize false alarms in mine detection, since some vegetation, such as round bushes, was mistakenly identified as mines by mine detection algorithms, and demonstrated that vegetation detection improves mine detection results. Miura et al. [13] used vegetation detection techniques to monitor the temporal and spatial changes of vegetation areas in the Amazon.

• We introduced a novel object-based vegetation detection method, NDVI-ML, which utilizes NDVI, computer vision, and machine learning techniques with no need for training. The method is simple and outperformed the two investigated deep learning methods in detection performance.

• We demonstrated the potential use of the NDVI band image as a replacement for the red (R) band of the color image in DeepLabV3+ model training, taking advantage of the NIR band while fulfilling the three-input-channel restriction in DeepLabV3+ and enabling transfer learning from a pre-trained RGB model.

• We compared the detection performances of DeepLabV3+ (RGB and NDVI-GB bands), our CNN-based deep learning method (RGB and RGB-NIR bands), and NDVI-ML (RGB-NIR). This demonstrated that DeepLabV3+ detection results using the RGB color bands alone were better than those obtained by conventional methods using the NDVI index only, and were also quite close to the results of NDVI-ML, which used the NIR band and several sophisticated machine learning and computer vision techniques.

• We discussed the underlying reasons why NDVI-ML could be performing better than the deep learning methods and potential strategies to further boost the deep learning methods' performances.
This paper is organized as follows. In Section 2, we describe the dataset and the vegetation detection methods used for training and testing. Section 3 summarizes the results using various methods. Section 4 contains some discussions about the results. A few concluding remarks are provided in Section 5.

Materials and Methods
We first introduce the dataset in Section 2.1, followed by the two deep learning methods and our object-based vegetation detection method, NDVI-ML, in Sections 2.2-2.4. A block diagram of the dataset and the methods applied in this paper is shown in Figure 1.

Dataset Used for Training and Testing
The dataset used in this work was originally used in Skarlatos and Vlachos [3] and belongs to two studied sites known as Vasiliko in Cyprus and Kimisala in Rhodes Island. UAV photography and a modified non-calibrated near-infrared camera were used to acquire the data with two separate UAV flights. Both flights were performed with SwingletCam UAV on different days. In the first flight, a Canon IXUS 220HS camera was flown at an average flight height of 78 m. In the second flight, a modified near-infrared Canon PowerShot ELPH 300HS camera was used with a flight height of 100 m. Both cameras were provided by SensFly, which is a UAV manufacturer. Agisoft's Photoscan was used to process the captured UAV photography and to create two orthophotos. Using the extracted Digital Surface Model of the two sites, color RGB and NIR orthophotos were generated. For the overlapping and co-registration of the orthophotos, a common bundle adjustment was performed with all the RGB and NIR photos [3].

Vasiliko Site Data Used for Training
The height and width dimensions of the Vasiliko image (RGB and NIR) used in the investigations as training data are 3450 × 3645. The image resolution of the Vasiliko image is 20 cm per pixel; the image area used in the investigations thus corresponds to an area of ~0.5 km². For investigations with DeepLabV3+, this image is partitioned into 1764 overlapping image tiles of size 512 × 512. The number of overlapping rows for two consecutive tiles along the column direction is set to 440, and the number of overlapping columns for two consecutive tiles along the row direction is set to 435. This partitioning is conducted to increase the number of image patches in the Vasiliko training dataset and can be considered as data augmentation by introducing shifted versions of the image patches. The numbers of overlapping pixels in the row and column directions are set to high values to increase the number of image patches in the training dataset as much as possible. This image is annotated with respect to four land covers. Two color images from the Vasiliko dataset and their land cover annotations can be seen in Figure 2 (in the land cover map annotations, silver corresponds to barren land, green to tree/shrub/grass, red to urban land, and blue to water).
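The tile count above can be reproduced with a short sketch. This is not the authors' code; it assumes the final tile in each direction is clamped to the image edge so the whole image is covered, with strides of 512 − 440 = 72 rows and 512 − 435 = 77 columns:

```python
def tile_starts(length, tile, stride):
    """Top-left offsets of overlapping tiles; the last tile is clamped to the edge."""
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # clamp the final tile so coverage is complete
    return starts

# Vasiliko image: 3450 rows x 3645 columns, 512 x 512 tiles
row_starts = tile_starts(3450, 512, 512 - 440)  # 440 overlapping rows -> stride 72
col_starts = tile_starts(3645, 512, 512 - 435)  # 435 overlapping columns -> stride 77
n_tiles = len(row_starts) * len(col_starts)
print(n_tiles)  # 42 * 42 = 1764 tiles
```

Under this edge-clamping assumption, 42 tile positions per direction reproduce the stated 1764 tiles.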




Kimisala Site Data Used for Testing
This site is part of the Kimisala area in the southwestern part of the island of Rhodes and contains many scattered archaeological sites. The Kimisala image used in the investigations corresponds to an area of ~0.05 km². There are two images of the same site with two different resolutions (10 cm per pixel and 20 cm per pixel). In the Kimisala test images, land covers in the form of trees, shrubs, barren land, and archaeological sites are present. These land covers are categorized into two major land cover classes, vegetation (tree/shrub) and non-vegetation (barren land and archaeological site). The 10 cm resolution Kimisala test image is annotated with respect to the two land covers (vegetation and non-vegetation). The land cover map for the Kimisala-20 test image is generated by resizing the annotated land cover map generated for the Kimisala-10 test image. The color Kimisala images for the 10 and 20 cm resolutions and their land cover annotations can be seen in Figure 3. It is worth mentioning that the Kimisala-10 test image is 1920 × 2680 in size and the Kimisala-20 test image is 960 × 1340 in size. Both test images are split into non-overlapping 512 × 512 image tiles when testing with the trained DeepLabV3+ models, with the exception of the tiles that form the last portion of the rows and columns of the image, in which there is a slight overlap. In the ground truth and estimated land cover maps for the Kimisala test images, a yellow color is used to annotate vegetation and a blue color is used to annotate non-vegetation land covers.

DeepLabV3+
DeepLabV3+ uses the Atrous Spatial Pyramid Pooling (ASPP) mechanism, which exploits multi-scale contextual information to improve segmentation [42]. Atrous ("with holes") convolution has an advantage over standard convolution: it provides responses at all image positions while the number of filter parameters and the number of operations stay constant [42]. DeepLabV3+ has an encoder-decoder network structure. The encoder part consists of a set of processes that reduce the feature maps and capture semantic information, and the decoder part recovers the spatial information, resulting in sharper segmentations. The block diagram of DeepLabV3+ can be seen in Figure 4.
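The constant-parameter, growing-receptive-field property of atrous convolution can be illustrated with a minimal one-dimensional NumPy sketch (our own illustration, not DeepLabV3+ code): the same three weights are applied, but at rate 2 the taps are spaced two samples apart, doubling the receptive field without adding parameters.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """'Same'-padded 1-D atrous convolution: the k weights are spaced `rate`
    samples apart, so the receptive field is (k - 1) * rate + 1, while the
    number of weights (and multiply-adds per output) stays constant."""
    k = len(w)
    span = (k - 1) * rate          # receptive field minus one
    pad = span // 2
    xp = np.pad(x, (pad, span - pad))
    return np.array([np.dot(xp[i:i + span + 1:rate], w) for i in range(len(x))])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])      # 3 weights, regardless of rate
y1 = atrous_conv1d(x, w, rate=1)   # receptive field 3 (standard convolution)
y2 = atrous_conv1d(x, w, rate=2)   # receptive field 5, still only 3 weights
```

At rate 1 this reduces to standard convolution; increasing the rate is what ASPP exploits to pool context at multiple scales.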

A PC with the Windows 10 operating system, a GPU card (RTX 2070), and 16 GB of memory is used for DeepLabV3+ model training and testing, which runs on the TensorFlow framework.
For training a land cover model using the training datasets, the weights of a pre-trained model with the exception of the logits are used as the starting point and these weights are fine-tuned with further training. These initial weights belong to a pre-trained model for the PASCAL VOC 2012 dataset ("deeplabv3_pascal_train_aug_2018_01_04.tar.gz"). This model was based on the Xception-65 backbone [42]. Because the number of land covers in the Vasiliko and Kimisala dataset is different from the number of classes in the PASCAL VOC-2012 dataset, the logit weights in the pre-trained model are excluded. The training parameters used for training a model with DeepLabv3+ in this work can be seen in Table 1.
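The weight-initialization step can be sketched schematically. The snippet below is not tied to the actual TensorFlow checkpoint API; it simply treats the pre-trained model as a name-to-weight dictionary (the variable names are hypothetical) and drops the logit weights before restoring, mirroring the exclusion described above.

```python
def restorable_weights(pretrained, exclude_keyword="logits"):
    """Keep pre-trained weights except the class-specific logit layer,
    which must be re-learned when the number of land covers differs
    from the 21 classes of PASCAL VOC 2012."""
    return {name: w for name, w in pretrained.items()
            if exclude_keyword not in name}

# Hypothetical checkpoint contents, for illustration only
pretrained = {
    "xception_65/entry_flow/conv1_1/weights": "...",
    "decoder/decoder_conv0/weights": "...",
    "logits/semantic/weights": "...",   # shaped for the PASCAL VOC classes
    "logits/semantic/biases": "...",
}
init = restorable_weights(pretrained)
```

All backbone and decoder weights survive the filter and are fine-tuned; the logit layer is trained from scratch for the new land cover classes.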

CNN Model
Since the used DeepLabV3+ pre-trained model ("deeplabv3_pascal_train_aug_2018_01_04.tar.gz") utilized thousands of RGB images in PASCAL VOC 2012 dataset for training [26], it is difficult to retrain it from scratch to accommodate four bands (RGB+NIR) because we do not have many NIR images that are co-registered with those RGB bands. Our own customized CNN model, on the other hand, can handle up to four bands. In addition to using our CNN model with RGB-NIR bands, we also used it with three bands (RGB) to compare with the DeepLabV3+ results (RGB).
We used the same structure for the CNN model as in our previous work [38] for soil detection, except that the filter size in the first convolution layer was changed to be consistent with the input patch sizes. The input image with N bands is extracted into 7 × 7 image patches, each with a size of 7 × 7 × N. These image patches are input to the CNN model. When only the RGB image bands are used, N is 3; when the NIR band is included, N becomes 4. The CNN model has four convolutional layers and a fully connected layer with 100 hidden units. The CNN model structure is shown in Figure 5. The 3D convolution filters in the convolutional layers are set to 3 × 3 × N in the first convolutional layer, 3 × 3 × 20 in the second and third layers, and 1 × 1 × 100 in the fourth layer. The naming convention used for the convolutional layers in Figure 5 indicates the number of filters and the filter size. As an example, 20 @ 3 × 3 × N in the first layer indicates 20 convolutional filters with a size of 3 × 3 × N. The stride in all four convolution layers is set to 1 (shown as 1 × 1 × 1), meaning the convolution filter is moved one pixel at a time in each dimension. When we designed the network, we tried different configurations for the number of layers and the size of each layer, for both the convolutional and fully connected layers, and selected the configuration that provided the best results. The choice of 100 hidden units in the fully connected layer was the outcome of these design studies.
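The per-pixel patch extraction can be sketched with NumPy. This is an illustrative reimplementation, not the authors' code; it assumes reflection padding so that every pixel, including those near the borders, receives a full 7 × 7 × N patch (the paper does not state its border handling).

```python
import numpy as np

def extract_patches(img, size=7):
    """Extract one size x size x N patch centered on every pixel of an H x W x N image."""
    h, w, n = img.shape
    pad = size // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size), axis=(0, 1))
    # windows has shape (H, W, N, size, size); reorder to (H*W, size, size, N)
    return windows.transpose(0, 1, 3, 4, 2).reshape(h * w, size, size, n)

img = np.random.rand(32, 32, 4)   # RGB-NIR input, so N = 4
patches = extract_patches(img)
print(patches.shape)              # (1024, 7, 7, 4): one patch per pixel
```

Each patch is centered on its source pixel, so the CNN's per-patch class prediction maps directly back to a per-pixel label map.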
Each convolutional layer utilizes the Rectified Linear Unit (ReLU) as its activation function, and the last fully connected layer uses the softmax function for classification. We added a dropout layer after each convolutional layer with a dropout rate of 0.1 to mitigate overfitting [43], after observing that the 0.1 dropout value performed better than two other dropout values, 0.05 and 0.2.



NDVI-ML
We developed an object-based vegetation detection method, NDVI-ML, which utilizes NDVI [44], machine learning (ML) techniques for classification and computer vision techniques for segmentation. The block diagram of NDVI-ML can be found in Figure 6.
The NDVI-ML method identifies the potential vegetation candidates using an NDVI threshold of zero. The candidate vegetation pixels are split into connected components using the Dulmage-Mendelsohn decomposition [45] of the node pairs' adjacency matrix [46] by assigning each candidate vegetation pixel as a node and forming the neighboring node pairs using the 8-neighborhood connectivity. Each connected component is considered as a separate vegetation object entity with its own vegetation map, sub-vegetation map.
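The candidate-generation and segmentation steps above can be sketched as follows. For brevity, the sketch substitutes a simple stack-based 8-neighborhood flood fill for the Dulmage-Mendelsohn decomposition used in the paper; it is illustrative only.

```python
import numpy as np

def ndvi_candidates(nir, red, thr=0.0):
    """NDVI = (NIR - R) / (NIR + R); pixels above thr are vegetation candidates."""
    ndvi = (nir - red) / np.maximum(nir + red, 1e-9)
    return ndvi > thr

def connected_components(mask):
    """Label 8-connected components of a boolean mask via stack-based flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for i, j in zip(*np.nonzero(mask)):
        if labels[i, j]:
            continue
        current += 1                      # start a new vegetation object
        stack = [(i, j)]
        labels[i, j] = current
        while stack:
            r, c = stack.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):     # all 8 neighbors (plus self, a no-op)
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < mask.shape[0] and 0 <= cc < mask.shape[1]
                            and mask[rr, cc] and not labels[rr, cc]):
                        labels[rr, cc] = current
                        stack.append((rr, cc))
    return labels, current

nir = np.array([[0.8, 0.8, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.9]])
red = np.array([[0.2, 0.2, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.2]])
mask = ndvi_candidates(nir, red)
labels, n = connected_components(mask)
print(n)  # two separate vegetation objects
```

Each labeled component then plays the role of one vegetation object entity with its own sub-vegetation map.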
For each vegetation object, a number of rules with respect to the size of the vegetation object and the amplitude of its RGB content are applied. If the connected component vegetation object contains only a few pixels, these pixels are labeled as 'non-vegetation', since such a small vegetation object candidate is not of interest for detection. For this, the number of pixels of the object is compared with a threshold, minVegObj. If the number of pixels in the vegetation object is bigger than minVegObj but smaller than another set threshold, medVegObj, the pixels among them that have red or blue content larger than green content are labeled as 'non-vegetation'. If, on the other hand, the number of pixels is larger than medVegObj, a more sophisticated process is applied to classify these pixels. In this process, a two-class Gaussian Mixture Model (GMM) [47] is fit to the RGB values of the connected component object pixels, splitting them into 'vegetation' and 'non-vegetation' classes. The class with the higher green content value is considered 'vegetation' and the other 'non-vegetation'. If the difference between the mean green content values of the two classes exceeds a set threshold, thrGMMGreen, the spatial information of the identified non-vegetation class pixels is then checked to decide whether they are truly 'non-vegetation' pixels (such as shadow) or dark-toned vegetation pixels which happen to be located in the inner sections of the vegetation object. For extracting the spatial information, an average filter is used, with its size chosen in consideration of the image resolution (set to 5 × 5 in our investigations).
When applying this average filter, if the pixel of interest is a dark-toned vegetation object that happens to fall in inner parts of the vegetation object, the averaged filtered value is expected to have higher green content since it is assumed that there will be several vegetation pixels around it with green content being dominant and the average filtering would thus increase the green content value of this pixel. Similarly, if it is a shadow pixel that happens to fall on the boundary sections of the vegetation object, then because the neighborhood of the pixel would have more 'non-vegetation' pixels, the average filtering would result in a decrease in the green content value of this pixel. Thus, the averaging filter helps to extract spatial information which is utilized to separate the shadow-like non-vegetation pixels from the dark-toned vegetation pixels that happen to fall in inner parts of the vegetation object.
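The spatial check can be sketched as a mean filter over the green band. The version below is our own minimal illustration; edge-replicated borders are an assumption, as the paper does not state its border handling.

```python
import numpy as np

def mean_filter(band, size=5):
    """size x size average filter (edge-replicated borders; an assumption,
    since the paper does not specify how borders are handled)."""
    pad = size // 2
    p = np.pad(band, pad, mode="edge")
    out = np.zeros_like(band, dtype=float)
    for dr in range(size):
        for dc in range(size):        # accumulate all size*size shifted views
            out += p[dr:dr + band.shape[0], dc:dc + band.shape[1]]
    return out / (size * size)

# A dark-toned pixel surrounded by bright-green vegetation gains green content
green = np.full((9, 9), 200.0)
green[4, 4] = 20.0                    # dark pixel inside the vegetation object
smoothed = mean_filter(green)
print(smoothed[4, 4])                 # pulled up toward its green neighbors: 192.8
```

The same filter would drag down the green value of a shadow pixel sitting on the object boundary, since most of its neighbors are non-vegetation, which is exactly the separation the method relies on.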
The pseudocode of the NDVI-ML processing steps can be seen in Table 2. Other than these processing steps, NDVI-ML has one final estimated vegetation map cleaning process, whose pseudocode can be seen in Table 3. In this cleaning process, it is checked whether any connected components with a very small number of pixels remain after the applied processing steps and whether there are any connected components whose green content is lower than their red or blue content. If so, the pixels of these connected components are also labeled as 'non-vegetation' in the final estimated vegetation map.

Table 2. Pseudocode of the NDVI-ML processing steps.
1: for each connected component vegetation object (ccvo)
2: if number of pixels in ccvo < minVegObj,
3: Assign ccvo pixels as "Non-vegetation" in its sub-vegetation map
4: end
5: if minVegObj < number of pixels in ccvo < medVegObj,
6: Find the pixels in ccvo with red content (R) > green content (G), or blue content (B) > green content (G)
7: Remove these identified pixels from ccvo
8: Assign all the remaining pixels in ccvo as "Vegetation" in its sub-vegetation map
9: end
10: if number of pixels in ccvo > medVegObj,
11: Identify the pixels in ccvo with red content (R) > green content (G), or blue content (B) > green content (G)
12: Label the identified pixels as "Non-vegetation" in its sub-vegetation map
13: Exclude the identified pixels from ccvo
14: if the number of pixels in ccvo > minGMM
15: Fit a two-class GMM to split ccvo pixels into "Vegetation" and "Non-vegetation" classes
16: Assign the class with lower green content (G) as "Non-vegetation", higher (G) content as "Vegetation"
17: if (averaged green content difference in Vegetation and Non-vegetation class) > thrGMMGreen
18: Extract spatial information of the identified Non-vegetation pixels to decide whether they are shadow-related Non-vegetation pixels or dark-toned Vegetation pixels located inside the vegetation object
19: - Apply a 5 × 5 average filter to the Non-vegetation class pixels to form spatial statistical features
20: - Apply a two-class GMM to the spatial features to split them into two classes, Dark-toned vegetation and Shadow
21: - Among the two GMM classes, assign the one with the lower green content as Non-vegetation
22: Exclude the identified Non-vegetation pixels from ccvo
23: Apply a closing morphology operation to ccvo
24: Assign the remaining pixels in ccvo after the closing operation as "Vegetation" in its sub-veg. map
25: else
26: Apply a closing morphology operation to ccvo
27: Assign the remaining pixels in ccvo after the closing operation as "Vegetation" in its sub-veg. map
28: end
29: else
30: Apply a closing morphology operation to ccvo pixels
31: Assign the remaining pixels in ccvo after the closing operation as "Vegetation" in its sub-veg. map
32: end
33: end
34: Generate final vegetation map using all sub-veg. maps

Table 3. Final vegetation map cleaning process in the NDVI-ML method.
...
Label the pixels of the connected component as "Non-vegetation"
9: else
10: Label the pixels of the connected component as "Vegetation"
11: end
12: end
13: end

Performance Comparison Metrics
Accuracy and mean-intersection-over-union (mIoU) measures [48] are used to assess the performance of the DeepLabV3+, custom CNN, and NDVI-ML methods on the two Kimisala test images. Suppose TP corresponds to the true positives, FP to the false positives, FN to the false negatives, and TN to the true negatives. The accuracy measure is computed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The intersection-over-union (IoU) measure, also known as the Jaccard similarity coefficient [48], can be expressed for a class in a two-class problem using the same notations as:

IoU = TP / (TP + FP + FN)

The mean-intersection-over-union (mIoU) measure simply takes the average of the IoU over all classes.
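These metrics can be computed directly from a ground-truth and a predicted binary map; a minimal sketch (our own illustration, not the evaluation code used in the paper):

```python
import numpy as np

def accuracy_and_miou(gt, pred):
    """Pixel accuracy and mean IoU for a two-class (0 = non-veg, 1 = veg) problem."""
    tp = np.sum((gt == 1) & (pred == 1))
    tn = np.sum((gt == 0) & (pred == 0))
    fp = np.sum((gt == 0) & (pred == 1))
    fn = np.sum((gt == 1) & (pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    iou_veg = tp / (tp + fp + fn)     # IoU of the vegetation class
    iou_bg = tn / (tn + fn + fp)      # IoU of the non-vegetation class (roles swapped)
    return acc, (iou_veg + iou_bg) / 2.0

gt = np.array([[1, 1, 0], [0, 0, 0]])
pred = np.array([[1, 0, 0], [0, 0, 1]])
acc, miou = accuracy_and_miou(gt, pred)
print(acc, miou)
```

Note that mIoU penalizes false positives and false negatives of both classes symmetrically, which is why it is generally a stricter measure than pixel accuracy.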

DeepLabV3+
Results Using DeepLabV3+ Model Trained with Vasiliko Dataset
In our recent work [5], DeepLabV3+ models were trained using RGB color images of three datasets, the Slovenia [36], DeepGlobe [37], and Vasiliko datasets, and the trained models were all tested on the Kimisala test data. For completeness of this paper, the DeepLabV3+ results with these three datasets are shown in Table 4. Due to significant resolution differences between the training data (Slovenia data: 10 m resolution; DeepGlobe: 0.5 m resolution) and the testing data (10 cm and 20 cm), the results of these two DeepLabV3+ models on the Kimisala test dataset were very poor. Table 4 shows the performance metrics obtained for the Kimisala-10 test image with the three DeepLabV3+ models, which were trained using three different datasets with different image resolutions and image capturing hardware. It can be noticed that the model trained with the Vasiliko dataset has the highest detection scores. The DeepLabV3+ model trained with the Vasiliko dataset also provided the best performance on the Kimisala-20 test image. The resultant performance metrics for the Kimisala-20 test image with the three DeepLabV3+ models can be seen in Table 5. The detection results for the two Kimisala test images using the DeepLabV3+ model trained with the Vasiliko dataset and the estimated vegetation maps can be seen in Figure 7. In the DeepLabV3+ results shown in Figure 7a,b, green pixels correspond to tree/shrub/grass, silver to barren land, and red to urban land.
As mentioned earlier, the DeepLabV3+ pre-trained model that was used for the initialization of our own training model's weights was trained with thousands of RGB images. It is, however, challenging to extend the DeepLabV3+ model to incorporate more than three input channels, such as including a NIR band in addition to the RGB bands. This is because, in order to build a satisfactory model, not only are a lot of training data needed that include the NIR band in addition to the RGB color images, but also significant GPU power that can conduct proper training with batch sizes larger than 16, since larger batch sizes are recommended for efficient DeepLabV3+ model training [49]. Moreover, the DeepLabV3+ architecture, which is originally designed for three input channels (RGB), needs to be adjusted accordingly to accommodate four-channel input images. With four-channel input images, the existing pre-trained models, which are for RGB, cannot be used directly, and there is a need to train a model from scratch, or at least to modify the DeepLabV3+ architecture such that only the weights for the newly added image bands are learned while the weights of the RGB input channels are initialized from pre-trained model weights via transfer learning [41].
In this work, we used the default DeepLabV3+ architecture and left the use of more than three input channels with DeepLabV3+ as future work. However, we conducted an interesting investigation in which we replaced the red (R) band with the NDVI band and kept the green (G) and blue (B) bands (NDVI-GB) in the training data when training a DeepLabV3+ model. Since the NDVI band is computed from the red (R) and NIR bands, all four bands are involved in model training to some extent, while DeepLabV3+'s three-input-channel restriction is still satisfied. The NDVI values, which originally lie between −1 and 1, are scaled to the range 0 to 255 to match the value range of the color bands.
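The NDVI computation and the rescaling to 8-bit range described above can be sketched as follows; the function names and the small epsilon guard against division by zero are our own implementation choices:

```python
import numpy as np

def ndvi_band(nir, red, eps=1e-8):
    """NDVI = (NIR - R) / (NIR + R); values lie in [-1, 1]."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

def scale_to_uint8(ndvi):
    """Linearly map [-1, 1] to [0, 255] so the NDVI band can stand in
    for the red band next to the 8-bit G and B bands."""
    return np.rint(np.clip((ndvi + 1.0) * 127.5, 0, 255)).astype(np.uint8)

# A vegetated pixel (high NIR, low R) and a bare pixel (low NIR, high R).
nir = np.array([200, 50], dtype=np.uint8)
red = np.array([50, 200], dtype=np.uint8)
ndvi = ndvi_band(nir, red)            # approx [0.6, -0.6]
ndvi_gb_band = scale_to_uint8(ndvi)   # approx [204, 51]
```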
When we trained a DeepLabV3+ model on the NDVI-GB bands from scratch with a higher learning rate of 0.1, the total loss dropped nicely after 200K training steps, as can be seen in Figure 8; nevertheless, the final DeepLabV3+ predictions for the test dataset, and even for the training set, were all black, indicating that the model trained from scratch was not reliable.
Next, we used the same NDVI-GB training dataset and initialized our training model's weights with the pre-trained model's weights (the pre-trained model for the PASCAL VOC 2012 dataset, for RGB images). Even though the training dataset has the NDVI band instead of the red (R) band, the total loss converged nicely during training, as can be seen in Figure 9, and the detection results also improved slightly for both Kimisala test datasets in comparison to the results using the DeepLabV3+ model trained on the RGB bands. The results for the two Kimisala test datasets can be seen in Table 6. Considering that an RGB-based pre-trained model is used as the initial model whereas the training dataset contains NDVI instead of the R band, it is highly interesting that slightly better detection results can still be obtained.

Figure 9. The total loss plot for DeepLabV3+ model training with the NDVI band replacing the R band and with the pre-trained RGB model as the initial model.


CNN Results

The first investigation with the CNN model for the Vasiliko training and two Kimisala test datasets was to examine its detection performance when RGB (color) and RGB-NIR bands are used. Using a subset of 30,000 training samples (7 × 7 RGB patches) per class out of the complete training data (Class 1-Barren: 1,469,178 samples; Class 2-Trees: 935,340 samples), the overall classification rate was 76.2% (Class 0, vegetation, had 90.1% accuracy and Class 1, no-vegetation, had 65.8% accuracy). Table 7 shows the classification accuracy with the CNN for the RGB-only and RGB-NIR cases. It can be noticed that with the addition of the NIR band, the overall accuracy improves by ~5% with our custom CNN method. Next, we conducted a full investigation with the CNN model on the Vasiliko/Kimisala datasets to further improve the classification accuracy. Due to memory constraints, we could not feed the whole Vasiliko image for training; instead, we divided the training image into four quadrants. The performance metrics were generated by sequentially training the model on each of the four quadrants of the Vasiliko dataset: we trained an initial model on the first quadrant and then updated its weights on each of the remaining three quadrants in turn. For the sequential training, we used a patch size of 7, a learning rate of 0.01, and all possible training samples in each quadrant. Table 8 summarizes the results of our sequential approach. We only generated results for the Kimisala-20 test image because Kimisala-10 is very large. The average overall accuracy of the CNN model reached 0.8298, better than the earlier results in Table 7, in which only 30,000 samples per class were used to train the CNN model; here, all the training samples in the Vasiliko image were used. The resultant vegetation map is shown in Figure 10.
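The quadrant-by-quadrant sequential training scheme described above can be sketched as follows. The helper names and the commented-out update step are our own illustrative assumptions, not the exact training code:

```python
import numpy as np

def quadrants(image):
    """Split an (H, W, C) image into its four quadrants."""
    H, W = image.shape[:2]
    h, w = H // 2, W // 2
    return [image[:h, :w], image[:h, w:], image[h:, :w], image[h:, w:]]

def extract_patches(image, patch=7):
    """Collect patch x patch neighborhoods around every interior pixel;
    each patch is one training sample for the pixel at its center."""
    r = patch // 2
    H, W = image.shape[:2]
    return np.stack([image[i - r:i + r + 1, j - r:j + r + 1]
                     for i in range(r, H - r)
                     for j in range(r, W - r)])

# Sequential training sketch: fit on the first quadrant, then keep
# updating the same weights on each following quadrant.
img = np.zeros((20, 20, 4))           # a tiny RGB+NIR stand-in image
model_state = None
for q in quadrants(img):
    samples = extract_patches(q)      # (N, 7, 7, 4) patches
    # model_state = update_cnn(model_state, samples)  # hypothetical step
```

Training on one quadrant at a time keeps the per-step memory footprint to roughly a quarter of the full image while still exposing the model to all training samples.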
Table 9 summarizes the results of the NDVI-ML method for the Kimisala-10 test image. As a baseline benchmark for NDVI-ML, the NDVI band image alone is thresholded into binary detection maps using two different NDVI thresholds, 0.0 and 0.09; these NDVI-only detection results are also included in Table 9. It can be seen that the proposed NDVI-ML improves on the NDVI-only results significantly. We observed the same trend for the Kimisala-20 image, as can be seen in Table 10. The vegetation detection results for the two Kimisala test images using NDVI-only and NDVI-ML can be seen in Figures 11 and 12, respectively. When applying the NDVI-ML approach, the parameter minVegObj is set to 70 for the Kimisala-10 test image and to 35 for the Kimisala-20 test image; the other NDVI-ML parameters, medVegObj and minGMM, are set to 250 and 200, respectively, for both test images. In comparison to the ground truth land cover map, the NDVI-ML results are highly accurate and better than the detection results of the two deep learning methods.

Table 9. Accuracy and mIoU measures for Kimisala-10 vegetation detection.
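The NDVI-only baseline and a minVegObj-style area filter can be sketched as below. The flood-fill connected-component filter is our own minimal stand-in, not the routine the actual NDVI-ML pipeline uses:

```python
import numpy as np
from collections import deque

def ndvi_threshold_map(ndvi, thresh=0.09):
    """NDVI-only baseline: a pixel is vegetation if its NDVI exceeds thresh."""
    return (ndvi > thresh).astype(np.uint8)

def remove_small_objects(mask, min_size):
    """Drop 4-connected vegetation blobs smaller than min_size pixels,
    mimicking a minVegObj-style area filter (illustrative flood fill)."""
    H, W = mask.shape
    seen = np.zeros((H, W), dtype=bool)
    out = mask.copy()
    for si in range(H):
        for sj in range(W):
            if mask[si, sj] and not seen[si, sj]:
                seen[si, sj] = True
                queue, blob = deque([(si, sj)]), [(si, sj)]
                while queue:
                    i, j = queue.popleft()
                    for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                        if 0 <= ni < H and 0 <= nj < W \
                                and mask[ni, nj] and not seen[ni, nj]:
                            seen[ni, nj] = True
                            queue.append((ni, nj))
                            blob.append((ni, nj))
                if len(blob) < min_size:  # too small to be a vegetation object
                    for i, j in blob:
                        out[i, j] = 0
    return out

ndvi = np.array([[0.50, 0.50, -0.20],
                 [-0.20, -0.20, 0.30],
                 [-0.20, -0.20, -0.20]])
veg = remove_small_objects(ndvi_threshold_map(ndvi), min_size=2)
```

Here the isolated single-pixel detection is discarded while the two-pixel blob survives, which is the intended effect of an object-size threshold such as minVegObj.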


Performance Comparisons
Tables 11 and 12 summarize the resultant performance metrics of NDVI-ML and the deep learning-based methods for the two Kimisala test datasets. In both metrics, NDVI-ML performs better than the deep learning methods. The two DeepLabV3+ models trained with the Vasiliko dataset (RGB and NDVI-GB input channels) follow the NDVI-ML approach closely in terms of accuracy. The DeepLabV3+ results with NDVI-GB input channels are slightly better than the DeepLabV3+ results with RGB input channels only. Among the three DeepLabV3+ models, the Vasiliko model performs best. Because the image resolutions of the Slovenia (10 m) and DeepGlobe (50 cm) training datasets differ considerably from the resolutions of the Kimisala test images (10 cm and 20 cm), and because the image characteristics of the Kimisala test images are quite different from those in the Slovenia and DeepGlobe datasets, these two models performed poorly. Another aspect favoring the Vasiliko DeepLabV3+ model over the other two is that both the Kimisala test images and the Vasiliko training images were collected with the same camera system. Our customized CNN model (RGB+NIR) was evaluated only on the Kimisala-20 test image; its results were better than DeepLabV3+ but still worse than NDVI-ML.
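For reference, the two reported metrics, overall accuracy and mean intersection-over-union, can be computed as follows; skipping classes with an empty union is our own assumption:

```python
import numpy as np

def overall_accuracy(pred, gt):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())

def mean_iou(pred, gt, classes=(0, 1)):
    """Mean intersection-over-union over the given classes; classes that
    appear in neither map are skipped."""
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny worked example: four pixels, one false positive for class 1.
pred = np.array([1, 1, 0, 0])
gt = np.array([1, 0, 0, 0])
acc = overall_accuracy(pred, gt)   # 3/4 = 0.75
miou = mean_iou(pred, gt)          # (2/3 + 1/2) / 2
```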


Discussion
On the Kimisala-20 (20 cm resolution) test dataset, DeepLabV3+ (RGB only) had an overall accuracy of 0.8015, whereas our customized CNN model achieved an accuracy of 0.7620 with the RGB bands and 0.8298 when all four bands (RGB and NIR) were used. The DeepLabV3+ model with NDVI-GB input channels provided an overall accuracy of 0.8089, slightly better than the DeepLabV3+ model with RGB channels but lower than our CNN model's accuracy. Similar trends were observed on the second Kimisala test dataset, which has a 10 cm resolution (Kimisala-10). The detection results of the NDVI-ML method were better than those of DeepLabV3+ and our customized CNN model for both Kimisala test datasets. For the Kimisala-20 test dataset, NDVI-ML provided an accuracy of 0.8578, considerably higher than the two deep learning methods. The NDVI-ML method performed considerably better than the investigated deep learning methods for vegetation detection and does not need any training data. However, NDVI-ML relies on several rules and thresholds that must be selected properly by the user, and these parameters would most likely need to be revisited for test images other than the Kimisala data. NDVI-ML also treats vegetation detection as a binary classification problem (vegetation vs. non-vegetation), since it depends on NDVI for detecting candidate vegetation pixels in its first step, whereas the deep learning-based methods have the flexibility to classify different vegetation types (such as tree, shrub, and grass). Of the two deep learning methods, DeepLabV3+ provided very good detection performance using only RGB images without the NIR band, showing that for low-budget land cover classification applications using drones with low-cost onboard RGB cameras, DeepLabV3+ could certainly be a viable method.
Comparing the deep learning and NDVI-based approaches, we observe that the NDVI-ML method provided significantly better results than the two deep learning methods. This may look surprising, because deep learning methods are usually expected to outperform conventional techniques. However, a close look at the results and images reveals that these findings are reasonable from two perspectives. First, deep learning methods need a large amount of training data to work well; otherwise, performance suffers. Second, for deep learning methods to work decently, the training and testing images should closely resemble each other. In our case, the training and testing images are somewhat different even though they were captured by the same camera system, as can be seen in Figure 13, making vegetation classification challenging for the deep learning methods. In our recent study [5], we observed more serious detection performance drops with DeepLabV3+ when the training and testing datasets had different image resolutions and were captured by different camera systems.
Another limitation of DeepLabV3+ is that it accepts only three input channels and requires architecture modifications when more than three channels are to be used. Even if these modifications are done properly, the training would have to start from scratch, since there are no pre-trained DeepLabV3+ models other than for RGB input channels. Moreover, one would need a significant number of training images that contain all the additional input channels, which may not be practical since the existing RGB pre-trained model utilized thousands, if not millions, of RGB images in its training. Our customized CNN method, on the other hand, can handle more than three channels; however, its training must start from scratch, since there are no pre-trained models available for the NIR band.

One other challenge for deep learning methods arises when the dataset is imbalanced. With heavily imbalanced datasets, the error from the overrepresented classes contributes much more to the loss value than the error from the underrepresented classes. This biases the loss function toward the overrepresented classes, resulting in poor classification performance for the underrepresented classes [50]. One should also pay attention when applying deep learning methods to new applications, because one requirement for deep learning is the availability of a vast amount of training data. Moreover, the training data need to have characteristics similar to the testing data; otherwise, deep learning methods may not yield good performance. Augmenting the training dataset with different brightness levels, vertically and horizontally flipped versions, and shifted, rotated, or noisy versions of the training images could be a potential strategy to mitigate these issues when the test data characteristics differ from the training data.
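The augmentation strategies just listed can be sketched with simple array operations. The function below is purely illustrative; a real pipeline would typically use a library's own transforms rather than hand-rolled ones:

```python
import numpy as np

def augment(img, rng):
    """Yield simple augmented variants of an (H, W, C) training image:
    horizontal/vertical flips, a 90-degree rotation, brightness scaling,
    and additive Gaussian noise."""
    yield np.fliplr(img)                                    # horizontal flip
    yield np.flipud(img)                                    # vertical flip
    yield np.rot90(img)                                     # rotation
    scale = rng.uniform(0.7, 1.3)                           # brightness change
    yield np.clip(img.astype(np.float64) * scale, 0, 255).astype(img.dtype)
    noise = rng.normal(0.0, 5.0, img.shape)                 # sensor-like noise
    yield np.clip(img.astype(np.float64) + noise, 0, 255).astype(img.dtype)

rng = np.random.default_rng(0)
img = np.full((32, 32, 3), 128, dtype=np.uint8)
variants = list(augment(img, rng))   # five augmented copies per image
```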


Conclusions
In this paper, we investigated the performance of three methods for vegetation detection. Two of these methods are based on deep learning, and the third is an object-based method that utilizes NDVI, computer vision, and machine learning techniques. Experimental results showed that the DeepLabV3+ model using the RGB bands performed reasonably well. However, it is challenging to extend that model to include the NIR band in addition to the three RGB bands. When the red band is replaced with the NDVI band to involve all four input channels to some extent while satisfying DeepLabV3+'s three-input-channel restriction, we noticed slight detection improvements; yet this is not fully equivalent to using all four bands at once. In contrast to DeepLabV3+, our customized CNN model can easily be adapted to use RGB+NIR bands. With our customized CNN model, slightly better results than DeepLabV3+ were obtained for the Kimisala-20 dataset when the RGB and NIR bands were used. Overall, we found that the NDVI-ML approach performed better than both deep learning models. We anticipate that the reason is that the training and testing data differ in appearance, making the task challenging for deep learning methods. In contrast, NDVI-ML does not require any training data and may be more practical in real-world applications. However, NDVI-ML is not applicable when the NIR band is unavailable and might need special care in choosing optimal parameters, which may vary for test images with different resolutions. Even though vegetation detection with a reasonable level of accuracy is possible with DeepLabV3+ using the RGB bands only, one future research direction would be the customization of the DeepLabV3+ framework to accept more than three channels so that the NIR band can be used together with the three color channels.
Another direction would be using augmentation techniques with deep learning methods to diversify the training data so that more robust responses can be obtained when the test data characteristics considerably differ from training data.

Conflicts of Interest:
The authors declare no conflict of interest.