Deep Learning Based Oil Palm Tree Detection and Counting for High-Resolution Remote Sensing Images

Oil palm trees are important economic crops in Malaysia and other tropical areas. The number of oil palm trees in a plantation area is important information for predicting the yield of palm oil, monitoring the growing situation of palm trees and maximizing their productivity, etc. In this paper, we propose a deep learning based framework for oil palm tree detection and counting using high-resolution remote sensing images for Malaysia. Unlike previous palm tree detection studies, the trees in our study area are more crowded and their crowns often overlap. We use a number of manually interpreted samples to train and optimize the convolutional neural network (CNN), and predict labels for all the samples in an image dataset collected through the sliding window technique. Then, we merge the predicted palm coordinates corresponding to the same palm tree into one palm coordinate and obtain the final palm tree detection results. Based on our proposed method, more than 96% of the oil palm trees in our study area can be detected correctly when compared with the manually interpreted ground truth, and this is higher than the accuracies of the other three tree detection methods used in this study.


Introduction
Oil palm trees are important economic crops. In addition to their main use to produce palm oil, oil palms are also used to generate a variety of products such as plywood, paper, furniture, etc. [1]. Information about the locations and the number of oil palm trees in a plantation area is important in many aspects. First, it is essential for predicting the yield of palm oil, which is the most widely used vegetable oil in the world. Second, it provides vital information to understand the growing situation of palm trees after plantation, such as the age or the survival rate of the palm trees. Moreover, it informs the development of irrigation processes and maximizes productivity [2].
Remote sensing has played an important role in various studies on oil palm productivity, the age of oil palm trees and oil palm mapping, etc. [3][4][5][6][7][8]. In recent years, high-resolution remote sensing images have become increasingly popular and important for many applications including automatic palm tree detection. Previous palm tree or tree crown detection research has usually been based on traditional methods in the computer vision domain. For instance, a tree detection-delineation algorithm was designed for high-resolution digital imagery tree crown detection, which is based on the local maximum filter and the analysis of local transects extending outward from a potential tree apex [9]. Shafri et al. [10] presented an approach for oil palm tree extraction and counting from high spatial resolution airborne imagery data, which is composed of many parts including spectral analysis, texture analysis, edge enhancement, segmentation process, morphological analysis and blob analysis. Ke et al. [11] reviewed various methods for automatic individual tree-crown detection and delineation from passive remote sensing, including local maximum filtering, image binarization, scale analysis, and template matching, etc. Srestasathiern et al. [12] used semi-variogram computation and non-maximal suppression for palm tree detection from high-resolution multi-spectral satellite images.
Moreover, some researchers have also applied machine learning-based methods to palm tree detection studies. Malek et al. [2] used a scale-invariant feature transform (SIFT) and a supervised extreme learning machine classifier to detect palm trees from unmanned aerial vehicle (UAV) images. Manandhar et al. [13] used circular autocorrelation of the polar shape matrix representation of an image as the shape feature and a linear support vector machine to standardize and reduce dimensions of the feature. This study also used a local maximum detection algorithm on the spatial distribution of standardized features to detect palm trees. Previous palm tree or tree crown detection studies have focused on detecting trees that are not very crowded and have achieved good detection results for their study areas. However, the performance of some of these methods would deteriorate when detecting palm trees in some of the regions of our study area. For instance, the local maximum filter based method [9] cannot detect palm trees correctly in regions where the trees are very young and small, as the local maximum of each filter is not located near the apex of young palm trees. The template matching method [10] is not suitable for regions where palm trees are very crowded and where their crowns overlap.
The convolutional neural network (CNN), a widely used deep learning model, has achieved great performance in many studies in the computer vision field, such as image classification [14,15], face recognition [16,17], and pedestrian detection [18,19], etc. In recent years, deep learning based methods have also been applied to hyperspectral image classification [20,21], large-scale land cover classification [22], scene classification [23][24][25], and object detection [26,27], etc. in the remote sensing domain and achieved better performance than traditional methods. For instance, Chen et al. [20] introduced the concept of deep learning and applied the stacked autoencoder method to hyperspectral remote sensing image classification for the first time. Li et al. [22] built a classification framework for large-scale remote sensing image processing and African land cover mapping based on the stacked autoencoder. Zou et al. [24] proposed a deep belief network based feature selection method for remote sensing scene classification. Chen et al. [26] proposed a hybrid deep convolutional neural network for vehicle detection in high-resolution satellite images. Vakalopoulou et al. [27] proposed an automated building detection framework from very high-resolution remote sensing data based on deep convolutional neural networks.
In this paper, we introduce the deep learning based method to oil palm tree detection for the first time. We propose a CNN based framework for the detection and counting of oil palm trees using high-resolution remote sensing images from Malaysia. The detection and counting of oil palm trees in our study area is more difficult than in the previous palm detection research mentioned above, as the trees are very crowded and their crowns often overlap. In our proposed method, we first collect a number of manually interpreted training and test samples for training the convolutional neural network and calculating the classification accuracy. Secondly, we optimize the convolutional neural network through tuning its main parameters to obtain the best CNN model. Then, we use the best CNN model obtained previously to predict the labels for all the samples in an image dataset that are collected through the sliding window technique. Finally, we merge the predicted palm tree coordinates corresponding to the same palm tree (spatial distance less than a certain threshold) into one coordinate, and obtain the final palm tree detection results. Compared with the manually interpreted ground truth, more than 96% of the oil palm trees in our study area can be detected correctly, which is higher than the accuracies of the other three tree detection methods used in this study. The detection accuracy of our proposed method is affected, to some extent, by the limited number of our manually interpreted samples. In our future work, more manually interpreted samples will be collected to further improve the overall performance of our proposed method.
The rest of this paper is organized as follows. Section 2 presents the study area and the datasets of this research; Section 3 describes the flowchart and the details of our proposed method; Section 4 provides the detection results of our proposed method and the performance comparison with other methods; and Section 5 presents some important conclusions of this research.

Study Area and Datasets
In this research, a QuickBird image acquired on 21 November 2006 is used. The QuickBird satellite has one panchromatic (Pan) band with 0.6-m spatial resolution and four multi-spectral (MS) bands with 2.4-m spatial resolution. The Gram-Schmidt (GS) spectral sharpening fusion method [28], which is implemented in the ENVI software (version 5.3, Exelis Visual Information Solutions, Boulder, CO, USA), was employed to integrate Pan and MS bands to obtain a higher sharpness and spectral quality (0.6-m spatial resolution, four bands) dataset for further image processing and analysis.
The study area of this research is located in the south of Malaysia, as shown in Figure 1. The manually interpreted samples used in this study were collected from two typical regions of our study area (denoted by the blue rectangles in Figure 1). To evaluate the performance of our proposed method, we selected another three representative regions in our study area (denoted by the red squares in Figure 1) and compared the detected images of these regions with the ground truth collected by manual interpretation.

Overview
The flowchart of our proposed method is shown in Figure 2. First, the convolutional neural network [14] was implemented based on the TensorFlow framework [29]. We used a number of training samples collected previously by manual interpretation to train the CNN, and calculated the classification accuracy based on a number of test samples collected independently of training samples.
The main parameters of the CNN (e.g., the number of kernels in the first convolutional layer, the number of kernels in the second convolutional layer and the number of hidden units in the fully connected layer) were adjusted continuously until we found the combination of parameters for which the overall accuracy was the highest on our test samples. By tuning the parameters, we achieved the best CNN model and saved it for further use. Secondly, the image dataset for palm tree detection was collected through the sliding window technique (the window size is 17 × 17 and the sliding step is three pixels). Then, we used the best CNN model obtained previously to predict the label for each sample in the image dataset. Thirdly, for all samples that were predicted as "palm tree" class, we merged the coordinates corresponding to the same palm tree sample (spatial distance less than a certain threshold) into one coordinate, and obtained the final palm tree detection results.
Remote Sens. 2017, 9, 22

CNN Training and Parameter Optimization
The LeNet convolutional neural network used in this study consists of two convolutional layers, two pooling layers and a fully connected layer, as shown in Figure 3. The input to the fully connected layer is the set of all feature maps at the layer below. The fully connected layer corresponds to a traditional multilayer perceptron constructed by a hidden layer and a logistic regression layer. We use the Rectified Linear Unit (ReLU) as the activation function of the CNN. In this research, we manually interpreted 5000 palm tree samples and 4000 background samples from two regions of our study area (denoted by the blue rectangles in Figure 1). Then, we randomly select 7200 of these samples as the training dataset of the convolutional neural network, and the other 1800 samples as its test dataset. Only a sample with a palm located at its center will be labeled as "palm tree". Each sample corresponds to 17 × 17 pixels with three bands (Red, Green and Blue) selected from the original four bands. The main parameters of the CNN are adjusted continuously until we find the combination of parameters for which the overall accuracy on the 1800 test samples is the highest. After parameter tuning, we achieve the best CNN model that will be used in the subsequent process of image dataset label prediction.
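The random division of the 9000 manually interpreted samples into 7200 training and 1800 test samples can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_samples`, the fixed seed, and the placeholder sample tuples are hypothetical, and the text does not say how the random selection was actually implemented.

```python
import random


def split_samples(samples, n_train, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    # A fixed seed is an assumption made here so the split is repeatable.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder samples: (label, index) stands in for a 17 x 17 x 3 image patch.
samples = [("palm tree", i) for i in range(5000)]
samples += [("background", i) for i in range(4000)]

train, test = split_samples(samples, n_train=7200)
print(len(train), len(test))
```

Because the split is random over the pooled samples, the class proportions in the two parts are only approximately equal to the 5000:4000 ratio of the full set.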


Label Prediction
The image dataset for label prediction is collected through the sliding window technique, as shown in Figure 4. The size of the sliding window is 17 × 17 pixels, which is consistent with the feature size of our training and test samples. In addition, the sliding step (the moving distance of the sliding window in each step) has a great influence on the final palm tree detection results. If the sliding step is too large, many palm samples will be missed and will not be detected. On the other hand, if the sliding step is too small, one palm sample might be detected repeatedly. Moreover, the process of label prediction becomes slower as the number of samples in the image dataset increases, which is an unnecessary cost. In this study, the sliding step is set as three pixels through experimental tests. After collecting all samples of the image dataset through the sliding window technique, we use the best CNN model obtained in Section 3.2 to predict the label for each sample in the image dataset.
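A minimal sketch of the sliding window enumeration described above, using the 17 × 17 window and three-pixel step from the text. The function names and the nested-list image representation are assumptions made for illustration; windows that would extend past the image border are simply skipped here, which the text does not specify.

```python
def sliding_windows(height, width, window=17, step=3):
    """Yield (top, left, center) for every window that fits in the image."""
    for top in range(0, height - window + 1, step):
        for left in range(0, width - window + 1, step):
            center = (top + window // 2, left + window // 2)
            yield top, left, center


def extract_patch(image, top, left, window=17):
    """Cut a window x window patch out of an image stored as nested lists."""
    return [row[left:left + window] for row in image[top:top + window]]


# A 23 x 23 image admits window origins 0, 3 and 6 along each axis.
positions = list(sliding_windows(23, 23))
print(len(positions), positions[0], positions[-1])
```

Each window center is the coordinate that would be kept for a sample predicted as "palm tree", which is why a small step produces several nearby coordinates for one tree.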

Sample Merging
After the labels of all samples in the image dataset are predicted, we collect the spatial coordinates of all the samples that are predicted as "palm tree" class. At this point, the number of predicted palm tree coordinates could be larger than the actual number of palm trees because one palm tree might correspond to several predicted palm tree coordinates. To avoid this problem, the coordinates corresponding to the same palm tree sample will be merged into one coordinate iteratively, as shown in Figure 5. Assuming that, in our study area, the spatial distance between two palm trees cannot be less than 8 pixels, the merging process will take six iterations. In each iteration, all groups of coordinates with the Euclidean distance less than a certain threshold (3, 4, 5, 6, 7, 8 pixels) will be merged into one coordinate. That is, the original group of coordinates will be replaced by their average coordinate. The remaining palm tree coordinates after the merging process represent the actual coordinates of detected palm trees.
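The iterative merging can be sketched as below. The text does not define exactly how a "group" is formed, so this sketch assumes that two coordinates belong to the same group whenever they are linked by a chain of pairwise distances below the current threshold (single-linkage grouping via union-find). The function names are hypothetical.

```python
from itertools import combinations
from math import hypot


def merge_once(points, threshold):
    """Replace each group of points closer than `threshold` by its mean."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        (xi, yi), (xj, yj) = points[i], points[j]
        if hypot(xi - xj, yi - yj) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for index, point in enumerate(points):
        groups.setdefault(find(index), []).append(point)
    return [
        (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
        for g in groups.values()
    ]


def merge_detections(points, thresholds=(3, 4, 5, 6, 7, 8)):
    """Run the six merging iterations with thresholds of 3 to 8 pixels."""
    merged = list(points)
    for threshold in thresholds:
        merged = merge_once(merged, threshold)
    return merged


# Three nearby predictions of one tree plus one isolated prediction.
result = merge_detections([(10, 10), (12, 10), (11, 12), (30, 30)])
print(sorted(result))
```

Growing the threshold gradually means that tight clusters are averaged first, so a later, looser iteration compares cluster centers rather than raw predictions.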

Classification Accuracy and Parameter Optimization
In this study, the classification accuracy of our CNN model was assessed by 1800 test samples collected independently from 7200 training samples. The classification accuracy can be affected by many parameters, such as the size of the convolutional kernel and the max-pooling kernel, the number of kernels in each convolutional layer and the number of hidden units in the fully connected layer, etc. For our CNN model, the size of the convolutional kernel is five, the size of the max-pooling kernel is two, the size of the mini-batch is 10 and the maximum number of iterations is 8000. We adjusted three important parameters to optimize the model: the number of kernels in the first convolutional layer, the number of kernels in the second convolutional layer and the number of hidden units in the fully connected layer. Experimental results showed that we could obtain the highest overall accuracy of 95% after 7500 iterations when the numbers of kernels in the two convolutional layers are set as 30 and 55 and the number of hidden units in the fully connected layer is set as 600.
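To make the reported configuration concrete, the following sketch traces the feature-map size through the network and counts trainable parameters. Several details are assumptions not stated in the text: unpadded ("valid") convolutions with stride one, non-overlapping pooling that discards incomplete border cells, one bias per kernel or unit, and a two-class output layer. Under different padding or pooling conventions the numbers would differ.

```python
def conv_out(size, kernel):
    """Output width of an unpadded, stride-1 convolution (assumed)."""
    return size - kernel + 1


def pool_out(size, kernel):
    """Output width of non-overlapping pooling, dropping partial cells (assumed)."""
    return size // kernel


def lenet_summary(input_size=17, bands=3, k1=30, k2=55, hidden=600,
                  classes=2, conv_kernel=5, pool_kernel=2):
    """Return final map width, flattened feature count and parameter counts."""
    size = pool_out(conv_out(input_size, conv_kernel), pool_kernel)
    size = pool_out(conv_out(size, conv_kernel), pool_kernel)
    flat = k2 * size * size
    params = {
        "conv1": k1 * (conv_kernel * conv_kernel * bands + 1),
        "conv2": k2 * (conv_kernel * conv_kernel * k1 + 1),
        "hidden": flat * hidden + hidden,
        "output": hidden * classes + classes,
    }
    return size, flat, params


size, flat, params = lenet_summary()
print(size, flat, params, sum(params.values()))
```

Under these assumptions the 17 × 17 input shrinks to a single spatial position per kernel before the fully connected layer, so the second convolutional layer holds most of the convolutional parameters.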

Detection Results Evaluation
To evaluate the performance of our proposed oil palm tree detection method quantitatively, we calculate the precision, recall and overall accuracy of the palm tree detection results through comparison with the ground truth. The precision is the probability that a detected oil palm tree is valid, as described in Formula (1); the recall is the probability that an oil palm tree in ground truth is detected, as described in Formula (2); the overall accuracy is the average of precision and recall, as described in Formula (3). A palm is regarded as detected correctly only if the distance between the center of a detected palm and the center of a palm in ground truth is less than or equal to five pixels:

$$\text{Precision} = \frac{\text{The number of correctly detected palm trees}}{\text{The number of all detected objects}}, \tag{1}$$

$$\text{Recall} = \frac{\text{The number of correctly detected palm trees}}{\text{The number of palm trees in ground truth}}, \tag{2}$$

$$\text{Overall Accuracy} = \frac{\text{Precision} + \text{Recall}}{2}. \tag{3}$$

Table 1 shows that the overall accuracies of regions 1, 2 and 3 are 96.05%, 96.34% and 98.77%, respectively. In addition, for each of the three regions, the difference between the predicted number of palm trees (the number of all detected objects) and the true number of palm trees (the number of palm trees in ground truth) is less than 4%. These evaluation results show that our proposed method is effective for both palm tree detection and counting.
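Formulas (1) to (3) and the five-pixel criterion can be sketched as follows. The text does not say how ties are resolved when several detections fall near the same ground-truth palm, so this sketch assumes a greedy one-to-one matching in which each ground-truth palm can validate at most one detection. The function name and the example coordinates are hypothetical.

```python
from math import hypot


def evaluate(detected, truth, max_dist=5.0):
    """Return (precision, recall, overall accuracy) for point detections."""
    unmatched = list(truth)
    correct = 0
    for dx, dy in detected:
        best = None
        for k, (tx, ty) in enumerate(unmatched):
            d = hypot(dx - tx, dy - ty)
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, k)
        if best is not None:
            # Assumed rule: a ground-truth palm is consumed once matched.
            unmatched.pop(best[1])
            correct += 1
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall, (precision + recall) / 2


# Made-up coordinates: two of three detections match two of four true palms.
detected = [(0, 0), (10, 10), (50, 50)]
truth = [(1, 1), (12, 10), (100, 100), (200, 200)]
precision, recall, overall = evaluate(detected, truth)
print(precision, recall, overall)
```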

Discussion
To further evaluate our proposed palm tree detection method, we implemented three other representative palm tree or tree crown detection methods (i.e., Artificial Neural Network (ANN), template matching, and local maximum filter) and compared their detection results with those of our proposed method. The procedure of the ANN based method is the same as that of our proposed method, including the ANN training, parameter optimization, image dataset label prediction, and sample merging.
The local maximum filter based method [9] and the template matching based method [11] are two traditional tree crown detection methods. For the template matching based method, we used 5000 manually labeled palm tree samples as the template dataset, and a 17 × 17 window to slide through the whole image. We chose the CV_TM_SQDIFF_NORMED provided by OpenCV [30] as our matching method. A sliding window is detected as a palm tree if it matches any sample in the template dataset, that is, if the difference between the sliding window and the template calculated by the CV_TM_SQDIFF_NORMED method is less than a threshold. In this study, the threshold is set as five through experimental tests.
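For a window and a template of the same size, the normalized squared difference reduces to a single score. The sketch below re-implements that formula in plain Python for single-band patches rather than calling OpenCV, so it is an illustration of the measure and not the code used in the study. The function names are hypothetical, the threshold is left as a parameter, and treating an all-zero patch as a non-match is an assumption.

```python
from math import sqrt


def sqdiff_normed(patch, template):
    """Sum of squared differences divided by sqrt(sum(T^2) * sum(I^2))."""
    num = sum_t2 = sum_p2 = 0.0
    for patch_row, template_row in zip(patch, template):
        for p, t in zip(patch_row, template_row):
            num += (t - p) ** 2
            sum_t2 += t * t
            sum_p2 += p * p
    denom = sqrt(sum_t2 * sum_p2)
    # Assumed convention: an all-zero patch or template never matches.
    return num / denom if denom > 0 else float("inf")


def matches_any(patch, templates, threshold):
    """True if the patch is closer than `threshold` to at least one template."""
    return any(sqdiff_normed(patch, t) < threshold for t in templates)


template = [[1, 2], [3, 4]]
patch = [[2, 2], [3, 4]]
print(sqdiff_normed(template, template), sqdiff_normed(patch, template))
```

A score of zero means the window and the template are identical, and larger scores mean a poorer match, which is why the rule in the text accepts a window when the score falls below the threshold.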
For the local maximum filter based method, we first applied a non-overlapping 10 × 10 local maximum filter to the absolute difference image of the NIR and red spectral bands. Then, we conducted transect sampling and a scaling scheme to obtain potential tree apexes, and adjusted the locations of tree apexes to the new local maximum positions.
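The first step of this baseline, the non-overlapping local maximum filter on the absolute NIR-red difference, can be sketched as follows. Only that first step is shown; the transect sampling, the scaling scheme and the apex adjustment are not reproduced. The function name, the nested-list band representation and the handling of partial blocks at the image border are assumptions.

```python
def block_local_maxima(band_nir, band_red, block=10):
    """Return the (row, col) of the largest |NIR - red| in each block."""
    height, width = len(band_nir), len(band_nir[0])
    peaks = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            best = None
            # Partial blocks at the border are searched as they are (assumed).
            for r in range(top, min(top + block, height)):
                for c in range(left, min(left + block, width)):
                    value = abs(band_nir[r][c] - band_red[r][c])
                    if best is None or value > best[0]:
                        best = (value, r, c)
            peaks.append((best[1], best[2]))
    return peaks


# Tiny example with 2 x 2 blocks instead of the 10 x 10 blocks in the text.
nir = [[1, 2, 0, 0], [3, 9, 0, 7], [0, 0, 5, 0], [4, 0, 0, 0]]
red = [[0] * 4 for _ in range(4)]
peaks = block_local_maxima(nir, red, block=2)
print(peaks)
```

Because exactly one candidate is produced per block, small young palms whose crowns are not the brightest response in their block are easily missed.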
Finally, the outputs of the template matching based method and the local maximum filter based method are post-processed (described in Section 3.4) to obtain the final palm tree detection results. Figures 6-8 show the detection images of each method for extracted areas of regions 1, 2 and 3, respectively. Each red circle denotes a detected palm tree. Each green square denotes a palm tree in ground truth that cannot be detected correctly. Each blue square denotes a background sample that is detected as a palm tree by mistake.
Tables 2-4 show the detection results of ANN, template matching (TMPL), and local maximum filter (LMF), respectively. Table 5 summarizes the performance of all four methods in terms of the number of correctly detected palm trees. Table 6 summarizes the performance of all four methods in terms of precision, recall and overall accuracy (OA). The proposed method (CNN) performs better than any of the other three methods in the number of correctly detected palm trees and in OA. Generally, machine learning based approaches (i.e., CNN and ANN) perform better than traditional tree crown detection methods (i.e., TMPL and LMF) in our study area, especially in region 1 and region 2. For example, the local maximum filter based method cannot detect palm trees correctly for regions where palm trees are very young and small (see Figure 7d), as the local maximum of each filter is not located near the apex of young palm trees. The template matching method is not suitable for regions where the palm trees are very crowded and the canopies often overlap (see Figure 6c).

Conclusions
In this paper, we designed and implemented a deep learning based framework for oil palm tree detection and counting from high-resolution remote sensing images. Three representative regions in our study area are selected for assessment of our proposed method. Experimental results show the effectiveness of our proposed method for palm tree detection and counting. First, the palm tree detection results are very similar to the manually labeled ground truth in general. Secondly, the overall accuracies of region 1, region 2 and region 3 are 96%, 96% and 99%, respectively, which are higher than the accuracies of the three other methods used in this research. Moreover, the difference between the predicted number of palm trees and the true number of palm trees is less than 4% for each region of the study area. In our future work, the palm tree detection results should be further improved through enlarging the number of manually interpreted samples and optimizing our proposed CNN based detection framework. We also want to take the computation time of different detection methods into consideration, and explore the deep learning based detection framework for larger scale palm tree detection studies.


Figure 1. The study area of this research in the south of Peninsular Malaysia. The blue rectangles show the two regions from which the manually interpreted samples are collected. The red squares show the three selected regions for evaluating the performance of our proposed method.

Figure 2. The flowchart of our proposed method.

Figure 3. The structure of the convolutional neural network (CNN).

Figure 6. Detection image of each method for region 1 (extracted area). Each red circle denotes a detected palm tree. Each green square denotes a palm tree in ground truth that cannot be detected correctly. Each blue square denotes a background sample that is detected as a palm tree by mistake.

Figure 7. Detection image of each method for region 2 (extracted area).

Figure 8. Detection image of each method for region 3 (extracted area).

Table 1. Detection results of convolutional neural network (CNN).

Table 2. Detection results of artificial neural network (ANN).

Table 4. Detection results of local maximum filter (LMF).

Table 5. Summary of the number of correctly detected palm trees for all four methods.

Table 6. Summary of the precision, recall and overall accuracy (OA) of all four methods.
Figure panel labels: (a) Convolutional neural network; (b) Artificial neural network; (c) Template matching; (d) Local maximum filter.