Large-Scale Oil Palm Trees Detection from High-Resolution Remote Sensing Images Using Deep Learning

: Tree counting is an important plantation practice for biological asset inventories, etc. The application of precision agriculture in counting oil palm trees can be implemented by detecting oil palm trees from aerial imagery. This research uses the deep learning approach using YOLOv3, YOLOv4, and YOLOv5m in detecting oil palm trees. The dataset consists of drone images of an oil palm plantation acquired using a Fixed Wing VTOL drone with a resolution of 5cm/pixel, covering an area of 730 ha labeled with an oil palm class of 56,614 labels. The test dataset covers an area of 180 ha with ﬂat and hilly conditions with sparse, dense, and overlapping canopy and oil palm trees intersecting with other vegetations. Model testing using images from 24 regions, each of which covering 12 ha with up to 1000 trees (for a total of 17,343 oil palm trees), yielded F1-scores of 97.28%, 97.74%, and 94.94%, with an average detection time of 43 s, 45 s, and 21 s for models trained with YOLOv3, YOLOv4, and YOLOv5m, respectively. This result shows that the method is sufﬁciently accurate and efﬁcient in detecting oil palm trees and has the potential to be implemented in commercial applications for plantation companies.


Introduction
Oil palm is an essential agricultural economic crop in many tropical countries such as Indonesia, Malaysia, Thailand, and Colombia. The primary use of oil palm is to produce palm oil, which is not only used to make vegetable oil but also as raw material for cosmetics, biodiesel, and others. Palm oil is the main source of vegetable oil due to its high yield compared to other vegetable oils [1], and it is the most consumed vegetable oil in the world [2]. Palm oil is the most important vegetable oil globally in production and trade [3].
Tree counting is an important plantation practice for biological asset inventories, fresh fruit bunch production estimation, fertilization and maintenance budgeting, plant growth/health monitoring, replanting, plant layout planning, etc. Counting trees manually is expensive, labor-intensive, and prone to errors. Most plantations are forced to estimate the amount of fresh fruit bunch production by multiplying the total area by the number of oil palm trees per hectare, which often results in significant inaccuracy due to the heterogeneity of the land surface, which is hilly and undulating, as well as the presence of rivers, wasteland, and forests. Remote sensing is a solution to these problems because of the broad view of the plantation area, and it is a means of counting oil palm trees [4].
Yin et al. [5] stated that drone remote sensing is cheaper and more flexible than satellite imagery on an industrial scale. The rapid development of drone technology, information technology, and sensor technology allows drones to also be applied to various fields of agriculture and forestry. Drone remote sensing technology can monitor large areas of plantations with high spatial resolution, so it is widely used for oil palm research such as tree counting [6][7][8], oil palm harvest prediction [9], plant nutrition monitoring [10], and plant health monitoring [11].
YOLOv4, and YOLOv5m due to its real-time object detection capability, which typically has a higher accuracy in object detection and a faster computation time than other deep learning algorithms [28][29][30].
The rest of this paper is organized as follows. Section 2 presents the research plan, the study area and datasets, the data preprocessing and model building steps, and the evaluation metrics; Section 3 describes the training and testing results of our proposed method; Section 4 describes the performance and limitation methods; Section 5 presents some important conclusions of this research.

Overview
The research plan in this study is presented in the flowchart in Figure 1. Preprocessing the data on the drone images produces training data, validation data, and testing data. The training and validation process uses YOLO pre-trained weights for convolutional layers to be more accurate and to avoid lengthy model training steps. Hyperparameter tuning is also performed in order to find the optimal model during training, which will then be used during model testing. The detection results will be evaluated by comparing the accuracies and detection times of YOLOv3, YOLOv4, and YOLOv5m. large areas. The research location consists of an oil palm plantation area with a flat and hilly topography. In a flat area, the distance between the plant canopy is sparse and close together, whereas, in a hilly area, the plant canopy distance overlaps when viewed from the drone image. In addition, oil palm tree leaf areas can also intersect with other vegetations with leaf colors similar to oil palm. Hilly conditions with overlapping canopies intersecting with other vegetations are challenging for object detection algorithms to identify oil palm trees. The method used in this study is a deep learning approach based on YOLOv3, YOLOv4, and YOLOv5m due to its real-time object detection capability, which typically has a higher accuracy in object detection and a faster computation time than other deep learning algorithms [28][29][30].
The rest of this paper is organized as follows. Section 2 presents the research plan, the study area and datasets, the data preprocessing and model building steps, and the evaluation metrics; Section 3 describes the training and testing results of our proposed method; Section 4 describes the performance and limitation methods; Section 5 presents some important conclusions of this research.

Overview
The research plan in this study is presented in the flowchart in Figure 1. Preprocessing the data on the drone images produces training data, validation data, and testing data. The training and validation process uses YOLO pre-trained weights for convolutional layers to be more accurate and to avoid lengthy model training steps. Hyperparameter tuning is also performed in order to find the optimal model during training, which will then be used during model testing. The detection results will be evaluated by comparing the accuracies and detection times of YOLOv3, YOLOv4, and YOLOv5m.

Study Area and Datasets
The research location in this study is an oil palm plantation located in Jambi province, Indonesia, as shown in Figure 2. The image was acquired from 2-3 May 2021 using a Fixed Wing VTOL drone, shown in Figure 3, at an altitude of 200 m above ground level with a resolution of 5 cm/pixel. The drone specifications can be seen in Table 1. Each of the sample dataset areas (training, validation, testing) includes oil palm trees under the criteria of young plants (the planting year 2013-2015) and mature plants (the planting year  2009-2011) in oil palm plantation areas with flat and hilly contours with sparse, dense,  and overlapping canopy spacing conditions, as well as oil palm trees intersecting with  other vegetations. The training and validation area covers 730 ha, while the testing area  covers 180 ha. The distribution of datasets based on the blocks area in regions in oil palm plantations can be seen in Table 2.

Study Area and Datasets
The research location in this study is an oil palm plantation located in Jambi province, Indonesia, as shown in Figure 2. The image was acquired from 2-3 May 2021 using a Fixed Wing VTOL drone, shown in Figure 3, at an altitude of 200 m above ground level with a resolution of 5 cm/pixel. The drone specifications can be seen in Table 1. Each of the sample dataset areas (training, validation, testing) includes oil palm trees under the criteria of  young plants (the planting year 2013-2015) and mature plants (the planting year 2009-2011) in oil palm plantation areas with flat and hilly contours with sparse, dense, and  overlapping canopy spacing conditions, as well as oil palm trees intersecting with other  vegetations. The training and validation area covers 730 ha, while the testing area covers  180 ha. The distribution of datasets based on the blocks area in regions in oil palm plantations can be seen in Table 2.

Data Preprocessing
Before performing input data processing in the YOLO architecture, we first preprocess the data with the following steps: 1. Cropping the drone images for the training (311 images) and validation (66 images) datasets into grids with a size of 3943 × 3943 pixels, which corresponds to 200 m × 200 m (4 ha) using QGIS. An example of the image can be seen in Figure 4. 2. Manually identifying and labeling 56,614 oil palm trees in the training and validation datasets using LabelImg. The composition of the training data is 80% (45,290), and that of the validation data is 20% (11,324). An example of image labeling can be seen in Figure 5. 3. Cropping 24 drone images on the testing data block into grids with an image size of 7886 × 5914 pixels, which corresponds to 400 m × 300 m (12 ha) using QGIS. An example of the image can be seen in Figure 6.

Data Preprocessing
Before performing input data processing in the YOLO architecture, we first preprocess the data with the following steps:

1.
Cropping the drone images for the training (311 images) and validation (66 images) datasets into grids with a size of 3943 × 3943 pixels, which corresponds to 200 m × 200 m (4 ha) using QGIS. An example of the image can be seen in Figure 4.

2.
Manually identifying and labeling 56,614 oil palm trees in the training and validation datasets using LabelImg. The composition of the training data is 80% (45,290), and that of the validation data is 20% (11,324). An example of image labeling can be seen in Figure 5.

3.
Cropping 24 drone images on the testing data block into grids with an image size of 7886 × 5914 pixels, which corresponds to 400 m × 300 m (12 ha) using QGIS. An example of the image can be seen in Figure 6.

Model Development
The model development uses the Darknet framework on YOLOv3 [28] and YOLOv4 [29], while YOLOv5m [30] uses the PyTorch framework. We tried several input sizes of the network (width × height)-416 × 416, 608 × 608, 832 × 832, and 1024 × 1024 for YOLOv3, YOLOv4, and YOLOv5m-to obtain the best alternative model. The hyperparameters used are batch size, subdivision, momentum, decay, and learning rate. The larger the network input size, the greater the computational process. Therefore, some adjustments to the batch size and subdivision values were made, while the momentum, decay, and learning rate were set to their default values. The network input size and hyperparameter scenarios for YOLOv3, YOLOv4, and YOLOv5m are given, respectively, in Tables 3-5. We used YOLO pre-trained weights for convolutional layers to increase the accuracy and to avoid longer training times. Hyperparameter tunings were also performed to optimize the training process.

Evaluation Metrics
The measurement of model performance in this study used Recall, Precision, and F1-score [39], as shown in Equations (1)-(3). Recall measures how well the model can detect oil palm trees, precision measures how accurately the model predicts oil palm trees, and F1-score is the harmonic mean of recall and precision. In addition, detection time is also used as a metric in model evaluation in order to be able to compare model efficiency during detection. Average IoU is also needed to assess the accuracy of the bounding box location for detection [40], as shown in Equation (4); the illustration can be seen in Figure 7.
TP (True Positive) = The objects of oil palm trees were detected as oil palm trees. FP (False Positive) = Objects other than oil palm trees were detected as oil palm trees. FN (False Negative) = The objects of oil palm trees were not detected as oil palm trees.

IoU = Area of Overlap
Area of Union (4) Figure 7. Illustration of IoU (Intersect over Union). The ground truth bounding box from the manual labeling of validation data determines where our object is in the image, while the predicted bounding box is from the trained model.

Training Results
Training and validation with four network input size scenarios in YOLOv3 were carried out for 8000 iterations with a saved model every 1000 iterations, while the training and validation of YOLOv4 and YOLOv5m were carried out for 6000 iterations with a saved model every 500 iterations. Each saved model iteration is validated with a threshold variation of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 for the Precision, Recall, F1-score, and Average IoU values. The selection of the threshold of each saved model iteration prioritizes the Recall value-in this case, the model's ability to detect oil palm tree objects. The determination of the best model as a result of training from the saved model iteration is based on the highest Precision, Recall, and F1-score values for YOLOv3 and YOLOv4, while YOLOv5m is based on 6000 iterations because the training tends to be stable overall. The results of the training evaluation for each network input size for YOLOv3, YOLOv4, and YOLOv5m can be seen in Supplementary Spreadsheet 1. The best models obtained from each network input size on YOLOv3, YOLOv4, and YOLOv5m are shown in Tables 6-8, respectively. These models will be used for testing the test data.

Training Results
Training and validation with four network input size scenarios in YOLOv3 were carried out for 8000 iterations with a saved model every 1000 iterations, while the training and validation of YOLOv4 and YOLOv5m were carried out for 6000 iterations with a saved model every 500 iterations. Each saved model iteration is validated with a threshold variation of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 for the Precision, Recall, F1-score, and Average IoU values. The selection of the threshold of each saved model iteration prioritizes the Recall value-in this case, the model's ability to detect oil palm tree objects. The determination of the best model as a result of training from the saved model iteration is based on the highest Precision, Recall, and F1-score values for YOLOv3 and YOLOv4, while YOLOv5m is based on 6000 iterations because the training tends to be stable overall. The results of the training evaluation for each network input size for YOLOv3, YOLOv4, and YOLOv5m can be seen in Supplementary Spreadsheet S1. The best models obtained from each network input size on YOLOv3, YOLOv4, and YOLOv5m are shown in Tables 6-8, respectively. These models will be used for testing the test data.

Testing Results
The oil palm detection test is carried out on the testing data block with image grids measuring 7886 × 5914 pixels, which corresponds to 400 m × 300 m (12 ha), with as many as 24 images (regions) and as many as 17,343 oil palm trees objects (ground truth), as shown in Table 9. The oil palm tree detection test results using the models from YOLOv3, YOLOv4, and YOLOv5m, referred to in Tables 6-8, are given in Tables 10-12, respectively. The testing results per region based on the network input size for YOLOv3, YOLOv4, and YOLOv5m can be seen in Supplementary Spreadsheet S2. Table 9. Oil palm tree detection testing area.

Region Ground Truth
Region  The comparison based on the evaluation results of the YOLOv3, YOLOv4, and YOLOv5m models refers to Tables 10-12. The object detection tests on 17,343 oil palm trees in 24 images (regions), in general, showed satisfactory results on YOLOv3, YOLOv4, and YOLOv5m. The precision values reached above 98%, as shown in Figure 8. This shows that the three models are sufficiently good at predicting the oil palm class correctly, with minor incorrect detections (false positives).  The recall value on YOLOv3, in general, is above 90%, indicating that the model is sufficiently good at detecting oil palm tree objects, with the highest value of 94.75% at the network input size 1024 × 1024. The highest recall value of YOLOv4 on the network input size 832 × 832 is 95.72%, but the network input size 1024 × 1024 only yields 61.69%. This is likely due to the limited computing resources, causing poor training results when using the subdivision parameter value of 64. Based on the experimental results, where a larger network input size requires greater computational resources, we used larger subdivision parameter values for YOLOv3 and YOLOv4, with the consequence of reduced accuracies for both. The recall value on YOLOv5m shows an upward trend for the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024, but there is a large gap between the network input size 416 × 416 and the others. This happens because the default value of the YOLOv5m network input size is 640 × 640; so, for a smaller network input size, the accuracy is lower. The comparison of the recall values of YOLOv3, YOLOv4, and YOLOv5m is shown in Figure 9.  The poor training on the network input size 1024 × 1024 YOLOv4 is shown in Table  S2 Supplementary Spreadsheet 1; there is a downward trend in the precision, F1-score, and average IoU values in the 2500 to 6000 iterations. This can also be seen in the comparison of the average IoU values in the training validation, as shown in Figure 10. YOLOv3 and YOLOv5m have an upward trend for the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024, while YOLOv4 has an uptrend for the network input sizes 416 × 416, 608 × 608, and 832 × 832 but a downward trend at 1024 × 1024. This is also in line with the F1-score values. YOLOv4 has an upward trend with values above 90% for the network input sizes 416 × 416, 608 × 608, and 832 × 832 but a downward trend in the network input size 1024 × 1024, with a value of 76.29%. YOLOv3 has a stable value at over 90% for the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024, and YOLOv5m has an uptrend for the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024. The low value of the F1-score on the network input sizes 1024 × 1024 YOLOv4 and 416 × 416 YOLOv5m is due to the low recall value. The comparison of the F1-score values of YOLOv3, YOLOv4, and YOLOv5m can be seen in Figure 11. The poor training on the network input size 1024 × 1024 YOLOv4 is shown in Table S2 Supplementary Spreadsheet S1; there is a downward trend in the precision, F1-score, and average IoU values in the 2500 to 6000 iterations. This can also be seen in the comparison of the average IoU values in the training validation, as shown in Figure 10. YOLOv3 and YOLOv5m have an upward trend for the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024, while YOLOv4 has an uptrend for the network input sizes 416 × 416, 608 × 608, and 832 × 832 but a downward trend at 1024 × 1024.  This is also in line with the F1-score values. YOLOv4 has an upward trend with values above 90% for the network input sizes 416 × 416, 608 × 608, and 832 × 832 but a downward trend in the network input size 1024 × 1024, with a value of 76.29%. YOLOv3 has a stable value at over 90% for the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024, and YOLOv5m has an uptrend for the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024. The low value of the F1-score on the network input sizes 1024 × 1024 YOLOv4 and 416 × 416 YOLOv5m is due to the low recall value. The comparison of the F1-score values of YOLOv3, YOLOv4, and YOLOv5m can be seen in Figure 11.  Based on the experiments and visual detection results, we concluded that determining the best network input size model from each of the YOLOv3, YOLOv4, and YOLOv5m models is not based only on the criteria for the F1-score value and the detection time but also on the average IoU value. The F1-score is used to determine the overall accuracy, the detection time is used to find out how fast the model detects objects, and the average IoU is used to assess the accuracy of the bounding box location for detection. The accuracy of the bounding box location is important because the number of oil palm trees in one image file is quite large-up to hundreds to thousands of oil palm trees. The higher the IoU average value, the better the accuracy of the location bounding box to the detected object, making for better visuals for large amounts of data, as shown in the comparison in Figure  13.  Based on the experiments and visual detection results, we concluded that determining the best network input size model from each of the YOLOv3, YOLOv4, and YOLOv5m models is not based only on the criteria for the F1-score value and the detection time but also on the average IoU value. The F1-score is used to determine the overall accuracy, the detection time is used to find out how fast the model detects objects, and the average IoU is used to assess the accuracy of the bounding box location for detection. The accuracy of the bounding box location is important because the number of oil palm trees in one image file is quite large-up to hundreds to thousands of oil palm trees. The higher the IoU average value, the better the accuracy of the location bounding box to the detected object, making for better visuals for large amounts of data, as shown in the comparison in Figure 13. The best model for the network input size on YOLOv3 is 1024 × 1024, with an F1score of 97.28%, a detection time of 43 s, and an average IoU of 76.42%. On YOLOv4, the best model for the network input size is 832 × 832, with an F1-score of 97.74%, a detection time of 45 s, and an average IoU of 76.76%. On YOLOv5m, the best model for the network input size is 1024 × 1024, with an F1-score of 94.94%, a detection time of 21 s, and an average IoU of 75.1%. The image detection results per region based on the best model for YOLOv3, YOLOv4, and YOLOv5m can be seen in Supplementary Spreadsheet 3.

Discussion
YOLOv3, YOLOv4, and YOLOv5m have high accuracy and are fast in detecting oil palm trees (purple boxes), with only a few palm trees not detected (red circles). The models can detect oil palm trees with sparse and dense canopy conditions, as shown in Figures  14 and 15; besides that, the models are also able to detect oil palm trees with a hilly topography when viewed from the drone imagery of overlapping oil palm trees canopy, as shown in Figure 16. YOLOv5m was quite good at detecting oil palm trees in areas with hundreds of trees, but for the number of trees reaching thousands, there were many false negatives-namely, oil palm trees cannot be detected. YOLOv3 and YOLOv4 were quite good at detecting oil palm trees in areas with hundreds or even thousands of trees.

Discussion
YOLOv3, YOLOv4, and YOLOv5m have high accuracy and are fast in detecting oil palm trees (purple boxes), with only a few palm trees not detected (red circles). The models can detect oil palm trees with sparse and dense canopy conditions, as shown in Figures 14 and 15; besides that, the models are also able to detect oil palm trees with a hilly topography when viewed from the drone imagery of overlapping oil palm trees canopy, as shown in Figure 16. YOLOv5m was quite good at detecting oil palm trees in areas with hundreds of trees, but for the number of trees reaching thousands, there were many false negatives-namely, oil palm trees cannot be detected. YOLOv3 and YOLOv4 were quite good at detecting oil palm trees in areas with hundreds or even thousands of trees.
There were several shortcomings of the models during testing. It was quite challenging to detect oil palm trees with unhealthy oil palm conditions (stressed growth or nutrient deficiency), as shown by the red circles in Figures 17 and 18. In addition, the models had difficulty detecting (red circles) or incorrectly detecting (blue circles) for the condition where the oil palm trees intersect with other vegetations, so it is disguised, as shown in Figure 19. This weakness occurs because, at the time of labeling the oil palm class dataset, the majority of oil palm tree conditions are healthy/normal, while, for the unhealthy oil palm tree conditions and the oil palm trees intersecting with other vegetations (disguised), the labeling samples are few, so it is quite difficult to detect when testing the data.  There were several shortcomings of the models during testing. It was quite challenging to detect oil palm trees with unhealthy oil palm conditions (stressed growth or nutrient deficiency), as shown by the red circles in Figures 17 and 18. In addition, the models had difficulty detecting (red circles) or incorrectly detecting (blue circles) for the condition where the oil palm trees intersect with other vegetations, so it is disguised, as shown in   The three models can be applied to oil palm tree inventories in plantation companies because of their high accuracy, reaching 94-97%, while the fast detection time helps make the work more efficient for large-scale oil palm fields. Further development can add datasets of unhealthy oil palm trees (stressed growth/nutrient deficiency) and oil palm trees that are covered/camouflaged by other vegetations so that the model can recognize these The three models can be applied to oil palm tree inventories in plantation companies because of their high accuracy, reaching 94-97%, while the fast detection time helps make the work more efficient for large-scale oil palm fields. Further development can add datasets of unhealthy oil palm trees (stressed growth/nutrient deficiency) and oil palm trees that are covered/camouflaged by other vegetations so that the model can recognize these The three models can be applied to oil palm tree inventories in plantation companies because of their high accuracy, reaching 94-97%, while the fast detection time helps make the work more efficient for large-scale oil palm fields. Further development can add datasets of unhealthy oil palm trees (stressed growth/nutrient deficiency) and oil palm trees that are covered/camouflaged by other vegetations so that the model can recognize these objects during testing/detection. These datasets are not only used for an inventory of oil palm trees but can also detect/differentiate classes of healthy and unhealthy oil palm trees. Developing models with a larger network input size using hardware with higher resources is expected to increase accuracy. It is also possible to develop the model into a desktop/web-based application to make it easier for end-users to operate.
Compared to the previous oil palm tree counting research using deep learning [31,[34][35][36][37][38], our method also provides accurate results, but what is different is that we present the results of the detection time on testing the test data and the accuracy of the bounding box. This information is important for the implementation of the large-scale detection of oil palm trees in areas reaching thousands to tens of thousands of hectares of oil palm land. With this large oil palm area, it requires a fast model in the detection and also needs the accuracy of the location of the bounding box for detection objects with a lot of data.

Conclusions
In this research, we have proposed an oil palm trees detection model using YOLOv3, YOLOv4, and YOLOv5m. The results of testing the YOLOv3, YOLOv4, and YOLOv5m models in four scenarios of the network input sizes 416 × 416, 608 × 608, 832 × 832, and 1024 × 1024 obtained the best model for YOLOv3 and YOLOv5m using the network input size of 1024 × 1024, while for YOLOv4, the best model used the network input size of 832 × 832. The test was carried out on 24 images/regions of 17,343 oil palm trees, with an average detection time of 43 s for YOLOv3, 45 s for YOLOv4, and 21 s for YOLOv5m. YOLOv3 obtained an F1 score of 97.28% and an average IoU of 76.42%, YOLOv4 obtained an F1 score of 97.74% and an average IoU of 76.76%, and YOLOV5m obtained an F1 score of 94.94% and an average IoU 75.1%. In terms of large-scale oil palm tree detection, accuracy is the top priority, so the recommendation for the best model is YOLOv3 and YOLOv4 because they have the highest accuracy, with a time difference of just under 25 s compared to YOLOv5m.