Peer-Review Record

Identification of Citrus Trees from Unmanned Aerial Vehicle Imagery Using Convolutional Neural Networks

Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Received: 24 October 2018 / Revised: 7 November 2018 / Accepted: 14 November 2018 / Published: 20 November 2018

Round 1

Reviewer 1 Report

I propose some minor revisions. Suggestions were written directly in the authors' PDF, which is attached. I would like to see more details about the CNN training process (training times, computer specs, accuracy/validation results, etc.). Also, regarding the sample patches of 40x40 pixels, it is not clear how the authors arrived at these values. Was it only by naked-eye analysis, or was there some preliminary calculation/processing?


A few format tips are provided, as well.

Comments for author File: Comments.pdf

Author Response

R1:


I propose some minor revisions. Suggestions were written directly in the authors' PDF, which is attached. I would like to see more details about the CNN training process (training times, computer specs, accuracy/validation results, etc.). Also, regarding the sample patches of 40x40 pixels, it is not clear how the authors arrived at these values. Was it only by naked-eye analysis, or was there some preliminary calculation/processing?


A few format tips are provided, as well.

Response: These comments are below, and each explanation is provided.


Line 91: I wouldn't use "early stage" because it is already relatively easy to find CNN-based works to identify trees and plants in general. In spite of that, there is space for progress in the field. So, I would go for something like (if you agree):  The exploitation of deep learning techniques for tree identification from UAV imagery is still under development and (...)

Response: Good point. Change made as suggested. Line 93.


Line 147: Can you please provide more information about the application used? Is it web-based, or does it need a local installation? In the second case, what were the specs of the machine that handled the training? With the indicated dataset, how long did the machine take to train? Can you tell us more about the training/validation accuracy results? In Keras, all of this data is retrievable and can be used as success indicators.

Response: In response to the request for more details on the application, we have added more detail to the Methods section and rewritten the text as:

Line 149: “The analysis was developed and tested using a 64-bit operating system, with Intel® Core™ i5-6500 CPU @ 3.20 GHz and 16 GB RAM. We applied the CNN workflow using Trimble’s eCognition Developer 9.3 [55], which is based on the Google TensorFlow API [56]. Trimble’s eCognition Developer is one of the most popular software packages for object-based image analysis, and applying the CNN on this platform gave the opportunity to integrate the CNN approach with object-based post-processing of the results, thus performing the entire analysis in one software package.”


In answer to the second set of comments, we have included these details in the text at Line 154:

Application of the CNN in eCognition consisted of three steps: (1) derivation of the 4,000 training samples of 40x40 pixels for the three classes used (5 minutes), (2) training the CNN model (13 minutes), and (3) applying the trained CNN model to the validation area not used in training (2 minutes).


Finally, thank you for pointing out Keras. Although Keras is a high-level neural network API, written in Python and capable of running on top of TensorFlow, we are not as familiar with it as we are with eCognition, a dedicated remote sensing software package. As implemented in eCognition, with its user-friendly interface, the CNN gives the user little flexibility and no ability to view intermediate results or statistics. We will consider Keras and other platforms based on TensorFlow for our future research.


Line 157: Can you provide more detail about how this compromise was established? Did you run some preliminary tests?

Response: We ran initial trial-and-error tests with sizes smaller and bigger than 40x40 pixels. Smaller sizes increased the multiple-crown detection errors, while bigger sizes missed some of the small trees. A size of 40x40 was also in line with most of the tree sizes in our study areas, as depicted in Figure 3. We have changed the text to read:

“Training samples were patches of 40x40 pixels because this size best matched the size of the targeted trees. Trial-and-error approaches with sample sizes smaller or bigger than 40x40 pixels were tried; values smaller than this increased the multiple-crown detection errors while values bigger than 40x40 missed some of the small trees. A size of 40x40 was also in line with most of the tree sizes in our study areas (Figure 3).” Line 169.


Line 159: Instead of "max pooling set to 'Yes'", it would be more relevant to inform about its parameters (stride, patch size, etc.)...

Response: We agree and have changed the text to read:

“Max pooling was applied to reduce the resolution of the feature maps using a 2x2 filter with a stride of 2, both in horizontal and vertical directions.” Line 182.
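The pooling step quoted above can be illustrated with a minimal sketch. This is not the eCognition implementation; the function name `max_pool_2x2` and the small example grid are assumptions chosen only to show what a 2x2 window with a stride of 2 does to a feature map.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2-D feature map with a 2x2 window and a stride of 2.

    Hypothetical helper for illustration only. Rows or columns that do not
    fill a complete 2x2 window are dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):          # step 2 vertically
        pooled_row = []
        for c in range(0, cols - 1, 2):      # step 2 horizontally
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(window))   # keep the strongest response
        pooled.append(pooled_row)
    return pooled


# Made-up 4x4 feature map; pooling halves each dimension.
example = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool_2x2(example)
print(pooled)  # [[4, 5], [6, 7]]
```

Each output value is the maximum of one non-overlapping 2x2 block, so the resolution is halved in both directions.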


Line 192: Unnecessary white space…

Response: Space deleted.

Reviewer 2 Report

The CNN model and SLIC algorithm are used in this paper for UAV images. I believe that the paper can be accepted for publication once some minor revisions are made:

(1) The description of the methods used in this paper, such as CNN and SLIC, should include more detail on their application to the detection of citrus and other crop trees. For example, the features of the UAV data, and how the model’s parameters were chosen or obtained.

(2) The authors should include more research areas to validate the performance of the presented methods, and more quantitative analysis of the applied results.

Author Response

R2:


The CNN model and SLIC algorithm are used in this paper for UAV images. I believe that the paper can be accepted for publication once some minor revisions are made:

(1) The description of the methods used in this paper, such as CNN and SLIC, should include more detail on their application to the detection of citrus and other crop trees. For example, the features of the UAV data, and how the model’s parameters were chosen or obtained.

Response: We have not found other meaningful papers that use CNN for tree identification besides those mentioned in the literature review, and none using the SLIC algorithm for tree mapping. However, the point is a good one, and in response we have added considerable detail to the Methods section of the paper to help answer this. The text now reads (starting Line 159):

“Finding the best architecture for the CNN is an ongoing debate in the world of deep learning. It usually starts from a simple model and the hyper-parameters are tuned iteratively until a good model is found for a specific application. Our study area was first divided into a training area in the north and a validation area in the south. An initial CNN was trained with three classes - trees, bare soil, and weeds - with 4,000 samples per class. The samples were derived based on the dataset of manually identified individual trees (trees in the north used for training and trees in the south used for validation) and randomly generated samples in areas without trees, which were classified as bare soil or weeds. To increase the number of training samples and the robustness of the CNN, we derived a 3x3 pixel buffer area around our training point samples. In this way, every tree is now represented by 9 pixels around the center of the tree and the algorithm will randomly choose locations out of these 9 pixels. Training samples were patches of 40x40 pixels because this size best matched the size of the targeted trees. Trial-and-error approaches with sample sizes smaller or bigger than 40x40 pixels were tried; values smaller than this increased the multiple-crown detection errors while values bigger than 40x40 missed some of the small trees. A size of 40x40 was also in line with most of the tree sizes in our study areas (Figure 3).
All 4 spectral bands from the UAV dataset were used in the training step, namely green, red, near infrared and red edge. Examples of the training samples are shown in Figure 3. We chose a simple CNN model with one hidden layer that convolves the input layers using different kernels and generates different feature maps. For this hidden layer, a kernel size of 4x11x11 (4 bands and 11x11 pixels) was used for convolution and 40 distinct feature maps were generated. Max pooling was applied to reduce the resolution of the feature maps using a 2x2 filter with a stride of 2, both in horizontal and vertical directions.
During the training of the CNN, the learning rate was set to 0.0015 after trial-and-error tests. This parameter dictates the amount by which weights are adjusted during the statistical gradient descent optimization. Lower values of the learning rate can slow down training or lead to suboptimal weights by getting stuck in local minima, while higher values will improve speed but increase the risk of missing the optimal minimum. We used 5,000 training steps with 50 training samples used at each training step.”
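The figures quoted above (40x40 patches, 4 bands, an 11x11 kernel, 40 feature maps, 2x2 pooling with a stride of 2, and 5,000 steps of 50 samples) can be turned into a short arithmetic sketch. The convolution stride of 1 and the absence of padding are assumptions that the quoted text does not state, so the resulting sizes are illustrative rather than a description of eCognition's internals; the helper names are hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution (standard formula)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def pool_output_size(input_size, window=2, stride=2):
    """Spatial size after pooling with the given window and stride."""
    return (input_size - window) // stride + 1


# Values taken from the quoted Methods text.
PATCH, BANDS, KERNEL, FEATURE_MAPS = 40, 4, 11, 40
STEPS, SAMPLES_PER_STEP = 5000, 50
CLASSES, SAMPLES_PER_CLASS = 3, 4000

# ASSUMPTION: stride 1 and no padding for the convolution.
conv_size = conv_output_size(PATCH, KERNEL)          # 30
pooled_size = pool_output_size(conv_size)            # 15

# Kernel weights in the hidden layer (biases not counted).
conv_weights = BANDS * KERNEL * KERNEL * FEATURE_MAPS

# Total samples drawn during training, and roughly how many
# passes over the 12,000-sample pool that corresponds to.
samples_drawn = STEPS * SAMPLES_PER_STEP
approx_passes = samples_drawn / (CLASSES * SAMPLES_PER_CLASS)

print(conv_size, pooled_size, conv_weights, samples_drawn, round(approx_passes, 1))
```

Under these assumptions each 40x40 patch yields 40 feature maps of 30x30 pixels, reduced to 15x15 by pooling.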


(2) The authors should give more research areas to validate the performance of presented methods and more quantitative analysis for applied results.

Response:  This is a good comment and very similar to one from Reviewer 3 and Reviewer 4. In addition to adding more detail on the methods (see response above), we have clarified our intention for this case study as follows:

Line 97: “Our intent was not to test this method’s performance in all conditions or with numerous targets: this is a case study using imagery collected in ideal conditions, as is common practice with agricultural managers, to examine the performance of a novel algorithm, to provide the impetus for further exploration.”

Line 127: “The imagery was collected as a proof-of-concept, as an alternative to NAIP as a high resolution image source for subsequent management and research planning.”

Line 323: “Individual case studies are critical in a new and growing field; however, integration of multiple missions over larger study areas and with different targets can provide elements for larger projects focused on: generalizable results across conditions and targets [76]; camera, platform and protocol benchmarking (e.g. as has been done for airborne LiDAR [77]); data fusion and scaling [78]; and multi-temporal analysis.”

Reviewer 3 Report

1, There are many instance-aware segmentation CNNs that can segment and differentiate trees end-to-end, but none of these methods are reviewed. Some papers have already implemented instance-aware segmentation in UAV images for trees. Even just for counting trees, the most popular method, Faster R-CNN, is not in the review either. In addition, these libraries are now released in either TensorFlow or PyTorch, so it is very straightforward to play with, run and compare them.

2, For the applied method, it would be better to see how the performance improves with parameter tuning or added refinement, in terms of confusion metrics, rather than just showing the image.

3, It would be better to add more detail on how this algorithm is implemented. The current description is too simple for readers to follow up and experiment. What is the setting after the hidden layer, for example? Also, how fast does this method run? It is always good to see accuracy vs. speed when evaluating a method.

Author Response

R3:

1, There are many instance-aware segmentation CNNs that can segment and differentiate trees end-to-end, but none of these methods are reviewed. Some papers have already implemented instance-aware segmentation in UAV images for trees. Even just for counting trees, the most popular method, Faster R-CNN, is not in the review either. In addition, these libraries are now released in either TensorFlow or PyTorch, so it is very straightforward to play with, run and compare them.

Response:  This is an excellent comment, and we use it to build upon our literature review to include some critical background and discussion. We have included new papers that use Faster R-CNN in agricultural settings (e.g. Wang et al. 2018 and Sa et al. 2016), but we could not find any papers outside of conference proceedings that dealt with agricultural trees and Faster R-CNN. We have added references in the Introduction and the Discussion. Additionally, our goal is not to compare multiple CNN algorithms, but to conduct one case study, using imagery flown for general management purposes, that utilizes a novel CNN algorithm. We have clarified that in the introduction (text below), the Methods section (text below), and added more discussion about R-CNNs in the Discussion (text below).

Line 90: “and Wang et al. [49] and Sa et al. [50] implemented Faster Region-based CNN (Faster R-CNN) algorithms for mango fruit flower detection and fruit detection (sweet pepper and melon), respectively.”

Line 97: “Our intent was not to test this method’s performance in all conditions or with numerous targets: this is a case study using imagery collected in ideal conditions, as is common practice with agricultural managers, to examine the performance of a novel algorithm, to provide the impetus for further exploration.”

Line 127: “The imagery was collected as a proof-of-concept, as an alternative to NAIP as a high resolution image source for subsequent management and research planning.”

Line 316: “While this work highlighted one case study using a simple convolutional neural network, deeper CNNs with multiple hidden layers or ability to regionalize data should be tested to understand improvements. For example, newer CNN algorithms called Region Based CNNs (e.g. Fast R-CNN, Faster R-CNN) that detect objects by identifying regions that are most likely to contain the object to be identified [77,78] are proving to be fast and efficient in detecting objects. These have most often been used in detection of cars (e.g. [79]), but there is a growing literature around detection of agricultural features such as flowers and fruit [49,50].”

Line 323: “Individual case studies are critical in a new and growing field; however, integration of multiple missions over larger study areas and with different targets can provide elements for larger projects focused on: generalizable results across conditions and targets [76]; camera, platform and protocol benchmarking (e.g. as has been done for airborne LiDAR [77]); data fusion and scaling [78]; and multi-temporal analysis.”


2, For the applied method, it would be better to see how the performance improves with parameter tuning or added refinement, in terms of confusion metrics, rather than just showing the image.

Response: This is a valid point. However, we again wish to emphasize that this is not our goal in this paper. We present a case study highlighting the novel use of CNN and SLIC on UAV imagery for a common task in orchard management: the counting of trees. Confusion metrics or a confusion matrix would be more useful in a classification sense if we were evaluating a number of classification algorithms, but since we are presenting our best tree detection result, we chose to validate with Precision, Recall and F-score, which are more commonly used for binary class evaluations.
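For readers unfamiliar with the three measures named in this response, the following minimal sketch shows how they follow from detection counts. The function name and the counts are made up for illustration; they are not the results reported in the paper.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f_score) for a detection task.

    true_positives:  reference trees that were correctly detected
    false_positives: detections that match no reference tree
    false_negatives: reference trees that were missed
    """
    detected = true_positives + false_positives
    reference = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / reference if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts, NOT taken from the paper.
precision, recall, f_score = detection_scores(
    true_positives=90, false_positives=10, false_negatives=30
)
print(f"precision={precision:.2f} recall={recall:.2f} f_score={f_score:.3f}")
```

None of the three measures needs a count of true negatives, which is why they suit tree counting, where "correctly detected nothing" has no natural unit.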


3, It would be better to add more detail on how this algorithm is implemented. The current description is too simple for readers to follow up and experiment. What is the setting after the hidden layer, for example? Also, how fast does this method run? It is always good to see accuracy vs. speed when evaluating a method.

Response: We agree. We have rewritten the Methods section to include more detail. It now reads:

Line 149: “The analysis was developed and tested using a 64-bit operating system, with Intel® Core™ i5-6500 CPU @ 3.20 GHz and 16 GB RAM. We applied the CNN workflow using Trimble’s eCognition Developer 9.3 [55], which is based on the Google TensorFlow API [56]. Trimble’s eCognition Developer is one of the most popular software packages for object-based image analysis, and applying the CNN on this platform gave the opportunity to integrate the CNN approach with object-based post-processing of the results, thus performing the entire analysis in one software package. Application of the CNN in eCognition consisted of three steps: (1) derivation of 4,000 training samples of 40x40 pixels for the three classes used (5 minutes), (2) training the CNN model (13 minutes), and (3) applying the trained CNN model to the validation area not used in training (2 minutes).”
Line 159: “Finding the best architecture for the CNN is an ongoing debate in the world of deep learning. It usually starts from a simple model and the hyper-parameters are tuned iteratively until a good model is found for a specific application. Our study area was first divided into a training area in the north and a validation area in the south. An initial CNN was trained with three classes - trees, bare soil, and weeds - with 4,000 samples per class. The samples were derived based on the dataset of manually identified individual trees (trees in the north used for training and trees in the south used for validation) and randomly generated samples in areas without trees, which were classified as bare soil or weeds. To increase the number of training samples and the robustness of the CNN, we derived a 3x3 pixel buffer area around our training point samples. In this way, every tree is now represented by 9 pixels around the center of the tree and the algorithm will randomly choose locations out of these 9 pixels. Training samples were patches of 40x40 pixels because this size best matched the size of the targeted trees. Trial-and-error approaches with sample sizes smaller or bigger than 40x40 pixels were tried; values smaller than this increased the multiple-crown detection errors while values bigger than 40x40 missed some of the small trees. A size of 40x40 was also in line with most of the tree sizes in our study areas, as depicted in Figure 3.
All 4 spectral bands from the UAV dataset were used in the training step, namely green, red, near infrared and red edge. Examples of the training samples are shown in Figure 3. We chose a simple CNN model with one hidden layer that convolves the input layers using different kernels and generates different feature maps. For this hidden layer, a kernel size of 4x11x11 (4 bands and 11x11 pixels) was used for convolution and 40 distinct feature maps were generated. Max pooling was applied to reduce the resolution of the feature maps using a 2x2 filter with a stride of 2, both in horizontal and vertical directions.
During the training of the CNN, the learning rate was set to 0.0015 after trial-and-error tests. This parameter dictates the amount by which weights are adjusted during the statistical gradient descent optimization. Lower values of the learning rate can slow down training or lead to suboptimal weights by getting stuck in local minima, while higher values will improve speed but increase the risk of missing the optimal minimum. We used 5,000 training steps with 50 training samples used at each training step.”
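The speed-versus-overshoot trade-off described for the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on a one-parameter quadratic loss; it is an assumption-laden illustration, not the optimizer used in eCognition, and the three learning rates are arbitrary values chosen to show the three regimes. Because the toy loss has a single minimum, it does not show the local-minima effect mentioned in the text.

```python
def gradient_descent(learning_rate, steps=50, start=0.0, target=3.0):
    """Minimise the toy loss (w - target)**2 with plain gradient descent.

    Hypothetical example: the loss, start value and step count are
    illustrative and unrelated to the CNN in the paper.
    """
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - target)    # derivative of the toy loss
        w -= learning_rate * gradient    # weight update scaled by the rate
    return w


for rate in (0.01, 0.4, 1.1):
    final_w = gradient_descent(rate)
    print(f"learning rate {rate}: distance from minimum = {abs(final_w - 3.0):.3g}")
```

With the smallest rate the weight is still far from the minimum after 50 steps, the middle rate converges, and the largest rate overshoots further on every step and diverges.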

Reviewer 4 Report

The presented method is evaluated in a specific geographical area. This does not prove that the method generalizes to plantations of the same variety of tree in other geographical areas (due to the bias introduced by the type of soil, for example). In addition, it is only evaluated in ideal climatic conditions, so it does not test the robustness to images taken on cloudy days, for example.


On the other hand, the architecture of CNN and the training process used are poorly described, making it difficult to implement the same architecture to reproduce the results.


The problem shown is the detection and location of objects, not binary classification of images (although the training method presented is focused in this way, which is totally valid). This is why the metrics chosen do not represent the effectiveness of the method to solve the problem.

Finally, no metrics are reported on the detection of weeds, when they were present at the training stage.

I think a redesign of the evaluation is necessary, and if it is not possible to obtain new images in different conditions and locations, this should be clearly reflected in the conclusions.

Author Response

R4:


The presented method is evaluated in a specific geographical area. This does not prove that the method generalizes to plantations of the same variety of tree in other geographical areas (due to the bias introduced by the type of soil, for example). In addition, it is only evaluated in ideal climatic conditions, so it does not test the robustness to images taken on cloudy days, for example.

Response: This is a great point: we are reporting on one case study and have not provided exhaustive comparisons with other climatic conditions or geographic areas. However, our intention was not to test a method’s performance in all conditions or with numerous targets. This is a case study using imagery collected in ideal conditions, which is how most managers acquire UAV imagery for their sites. Our intention was to provide a case study of how a relatively underused algorithm in this context performed, to provide the impetus for further exploration. The point is a valid and useful one and needs to be better addressed in our paper. In these early years of UAV research, there are numerous individual missions that might, when integrated, provide elements for larger projects focused on, for example, generalizable results across conditions and targets, camera and platform benchmarking, multi-temporal analysis and data fusion.

We have added the following text:  

Line 97: “Our intent was not to test this method’s performance in all conditions or with numerous targets: this is a case study using imagery collected in ideal conditions, as is common practice with agricultural managers, to examine the performance of a novel algorithm, to provide the impetus for further exploration.”

Line 323: “Individual case studies are critical in a new and growing field; however, integration of multiple missions over larger study areas and with different targets can provide elements for larger projects focused on: generalizable results across conditions and targets [76]; camera, platform and protocol benchmarking (e.g. as has been done for airborne LiDAR [77]); data fusion and scaling [78]; and multi-temporal analysis.”


On the other hand, the architecture of CNN and the training process used are poorly described, making it difficult to implement the same architecture to reproduce the results.

Response: We agree. We have rewritten the Methods section to include more detail. It now reads:

Line 149: “The analysis was developed and tested using a 64-bit operating system, with Intel® Core™ i5-6500 CPU @ 3.20 GHz and 16 GB RAM. We applied the CNN workflow using Trimble’s eCognition Developer 9.3 [55], which is based on the Google TensorFlow API [56]. Trimble’s eCognition Developer is one of the most popular software packages for object-based image analysis, and applying the CNN on this platform gave the opportunity to integrate the CNN approach with object-based post-processing of the results, thus performing the entire analysis in one software package. Application of the CNN in eCognition consisted of three steps: (1) derivation of 4,000 training samples of 40x40 pixels for the three classes used (5 minutes), (2) training the CNN model (13 minutes), and (3) applying the trained CNN model to the validation area not used in training (2 minutes).”
Line 159: “Finding the best architecture for the CNN is an ongoing debate in the world of deep learning. It usually starts from a simple model and the hyper-parameters are tuned iteratively until a good model is found for a specific application. Our study area was first divided into a training area in the north and a validation area in the south. An initial CNN was trained with three classes - trees, bare soil, and weeds - with 4,000 samples per class. The samples were derived based on the dataset of manually identified individual trees (trees in the north used for training and trees in the south used for validation) and randomly generated samples in areas without trees, which were classified as bare soil or weeds. To increase the number of training samples and the robustness of the CNN, we derived a 3x3 pixel buffer area around our training point samples. In this way, every tree is now represented by 9 pixels around the center of the tree and the algorithm will randomly choose locations out of these 9 pixels. Training samples were patches of 40x40 pixels because this size best matched the size of the targeted trees. Trial-and-error approaches with sample sizes smaller or bigger than 40x40 pixels were tried; values smaller than this increased the multiple-crown detection errors while values bigger than 40x40 missed some of the small trees. A size of 40x40 was also in line with most of the tree sizes in our study areas, as depicted in Figure 3.
All 4 spectral bands from the UAV dataset were used in the training step, namely green, red, near infrared and red edge. Examples of the training samples are shown in Figure 3. We chose a simple CNN model with one hidden layer that convolves the input layers using different kernels and generates different feature maps. For this hidden layer, a kernel size of 4x11x11 (4 bands and 11x11 pixels) was used for convolution and 40 distinct feature maps were generated. Max pooling was applied to reduce the resolution of the feature maps using a 2x2 filter with a stride of 2, both in horizontal and vertical directions.
During the training of the CNN, the learning rate was set to 0.0015 after trial-and-error tests. This parameter dictates the amount by which weights are adjusted during the statistical gradient descent optimization. Lower values of the learning rate can slow down training or lead to suboptimal weights by getting stuck in local minima, while higher values will improve speed but increase the risk of missing the optimal minimum. We used 5,000 training steps with 50 training samples used at each training step.”
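As an aid to reproduction, the sample-generation idea in the quoted text (a 3x3 pixel buffer around each tree point, then a 40x40 patch) could be sketched as follows. This is a guess at one reasonable reading of the description, not the eCognition procedure: the function names, the single-band list-of-rows image, and the choice to centre the patch on the jittered pixel are all assumptions.

```python
import random


def jittered_patch_origin(center_row, center_col, patch_size=40, rng=random):
    """Pick one of the 9 pixels in the 3x3 buffer around a tree centre and
    return the top-left corner of a patch centred on that pixel.

    ASSUMPTION: patches are centred on the jittered location.
    """
    d_row = rng.choice((-1, 0, 1))
    d_col = rng.choice((-1, 0, 1))
    half = patch_size // 2
    return center_row + d_row - half, center_col + d_col - half


def extract_patch(image, top, left, patch_size=40):
    """Cut a square patch from a single-band image given as a list of rows.

    Returns None when the patch would fall outside the image.
    """
    rows, cols = len(image), len(image[0])
    if top < 0 or left < 0 or top + patch_size > rows or left + patch_size > cols:
        return None
    return [row[left:left + patch_size] for row in image[top:top + patch_size]]


# Synthetic 100x100 single-band image and a hypothetical tree at (50, 50).
image = [[r * 100 + c for c in range(100)] for r in range(100)]
rng = random.Random(0)  # seeded so the sketch is repeatable
top, left = jittered_patch_origin(50, 50, rng=rng)
patch = extract_patch(image, top, left)
print(len(patch), len(patch[0]))  # 40 40
```

Drawing the offset anew each time a tree is sampled gives up to 9 slightly shifted patches per tree, which is the augmentation effect the text describes.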

The problem shown is the detection and location of objects, not binary classification of images (although the training method presented is focused in this way, which is totally valid). This is why the metrics chosen do not represent the effectiveness of the method to solve the problem.

Response: This comment is somewhat confusing to us. The problem presented in the paper was to detect and count trees, not the binary classification of images. The metrics chosen to validate the method (e.g. number of trees successfully counted, number missed and number falsely counted) do indeed match the problem at hand. Our method (e.g. calculating Precision, Recall and f-score) is used throughout the literature for this kind of problem. However, if we misunderstood this comment, and this question has to do with the training (e.g. trees, bare ground, weeds) vs. the reporting (only trees) we have made that clearer in the text, as detailed in the response immediately below.


Finally, no metrics are reported on the detection of weeds, when they were present at the training stage.

Response: This is an excellent point, and we have clarified our reason for using weeds as training samples, while not reporting on their validation in the text. We used weeds as part of the training process to minimize confusion with trees. These were selected visually? Our validation dataset (previously created) does not include weeds, and thus we did not report accuracies. We have added the following text:

“We used the manually delimited reference tree dataset (n=2,912) for validation. As the reference dataset included only trees, our validation process focused on trees alone, and ignored weeds and bare ground.” Line 223.


I think a redesign of the evaluation is necessary, and if it is not possible to obtain new images in different conditions and locations, this should be clearly reflected in the conclusions.

Response: Again, this is an excellent point, similar to the one above. We have made changes to the Introduction, outlining our intent more clearly, and to the Discussion section, outlining the need to integrate case studies.

We have added the following text in three places:  

Line 97: “Our intent was not to test this method’s performance in all conditions or with numerous targets: this is a case study using imagery collected in ideal conditions, as is common practice with agricultural managers, to examine the performance of a novel algorithm, to provide the impetus for further exploration.”

Line 127: “The imagery was collected as a proof-of-concept, as an alternative to NAIP as a high resolution image source for subsequent management and research planning.”

Line 323: “Individual case studies are critical in a new and growing field; however, integration of multiple missions over larger study areas and with different targets can provide elements for larger projects focused on: generalizable results across conditions and targets [76]; camera, platform and protocol benchmarking (e.g. as has been done for airborne LiDAR [77]); data fusion and scaling [78]; and multi-temporal analysis.”


Round 2

Reviewer 3 Report

N/A

Reviewer 4 Report

I agree with the modifications
