Article
Peer-Review Record

Tree Crown Delineation Algorithm Based on a Convolutional Neural Network

Remote Sens. 2020, 12(8), 1288; https://doi.org/10.3390/rs12081288
by José R. G. Braga 1,*, Vinícius Peripato 1, Ricardo Dalagnol 1, Matheus P. Ferreira 2, Yuliya Tarabalka 3,4, Luiz E. O. C. Aragão 1,5, Haroldo F. de Campos Velho 6, Elcio H. Shiguemori 7 and Fabien H. Wagner 1,8
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 19 February 2020 / Revised: 8 April 2020 / Accepted: 14 April 2020 / Published: 18 April 2020
(This article belongs to the Special Issue Machine Learning Methods for Environmental Monitoring)

Round 1

Reviewer 1 Report

The manuscript presents an interesting idea for creating additional training data in deep learning applications. In addition, the use case at hand is both very challenging and of high importance. However, in my opinion, the manuscript needs considerable revision before it can be considered for further processing and publication. The main issues, in my opinion, are the following:

The authors argue that the main novelties in this work are a) the algorithm for creating synthetic images to increase the amount of training data and b) the application of the well-established Mask R-CNN to the presented application area, i.e. creating forest inventories in tropical rainforests.

However, the results regarding both of these aspects are not discussed sufficiently in the paper:

Creating synthetic images for the training of deep learning algorithms is, in general, a very interesting idea. However, it does not become clear to me to what degree the presented algorithm is better than applying extensive image augmentation, which basically does the same thing: creating synthetic images from existing images. The difference seems to be that augmentation transforms whole images in various ways, while the presented algorithm randomly distributes the objects of interest (tree crowns) within images. Since this strategy is one of the main aspects of the paper, I believe it would be important to discuss its effects in more detail. It would be very interesting to see how this algorithm compares to image augmentation as it is currently being applied. This would mean comparing results of Mask R-CNN applied on the dataset created by this algorithm (without additional augmentation; currently both strategies are mixed) to the results obtained from a dataset of the same size, but created by established augmentation methods. Here, not only flipping, rotation and brightness changes should be applied, but some of the many more available augmentations (shearing, changing hue and saturation, and combinations of these techniques).

The second main novelty is the application field that is being explored. Since this paper is very application oriented, the evaluation of results should also be application oriented. For example, the authors state that a tree crown is considered as correctly identified by the algorithm when 50% of its pixels are identified. How was the threshold of 50% determined? If, as the authors state, metrics such as diameter and area of crowns are the important tree metrics in forest inventories, isn't it problematic to have trees inventoried with an area possibly as low as 50% of their real area?

Another question related to this aspect is to what degree the example forest at hand represents tropical rainforests globally (e.g., seasonality, leaf loss).

In addition, the evaluation of results needs to be described better:

  • The authors prominently state the overall detection accuracy of 96%, which in this application is the least meaningful metric. The metrics shown in Table 7, especially the F1 score, should be discussed more prominently, compared to other studies, and mentioned in the abstract instead of the detection accuracy.
  • In order to get a better idea of the algorithm's performance, there should be a figure showing the location of the training areas/samples and the test areas in relation to the whole study area.
  • How are different tree species represented in the training dataset? Does this have an impact on the results?
  • Fig. 7: the loss function seems to stop improving after a bit more than 90 epochs. Why continue as far as epoch 120?
  • Fig. 7: there is a strange phase of the loss function being stable around epochs 40-45. What could be the reason?

 

 

Further remarks:

- The text, specifically in chapters 1 and 2, needs to be streamlined and focused on the main aspects of the paper. Just a few examples:

  • Page 2, paragraph 3: "Remote Sensing is the source of ...". Since this journal is called Remote Sensing, it is not necessary to explain the advantages and application areas of remote sensing in a whole paragraph.
  • Page 3, paragraph 5: this paragraph contains 3 consecutive sentences saying the same thing: that deep learning is receiving growing attention in remote sensing.
  • Page 5/6: pan-sharpening has been used in remote sensing for ages. It is not necessary to write a whole paragraph about it; one sentence describing the reason for choosing this specific algorithm is enough.

In total, the text can be significantly shortened by reducing the number of generic sentences and redundant explanations.

 

- With regards to the presentation quality, I believe the manuscript should be carefully revised to minimize minor flaws, which are currently quite frequent. Examples are spelling errors such as "cont" instead of "count" in the pseudocode, "crow" instead of "crown" in Fig. 5, and "individually tree information" instead of "individual tree information" in the abstract.

Author Response

(José Braga) Dear Reviewer #1, thank you very much for your positive review, which I believe has improved the paper. In the following text you will find your comments (in black) and our answers (in blue). We have answered all your questions and made all the changes you suggested. Thank you very much for your attention.

Reviewer 1 (Q1) - Creating synthetic images for the training of deep learning algorithms is, in general, a very interesting idea. However, it does not become clear to me to what degree the presented algorithm is better than applying extensive image augmentation, which basically does the same thing: creating synthetic images from existing images. The difference seems to be that augmentation transforms whole images in various ways, while the presented algorithm randomly distributes the objects of interest (tree crowns) within images. Since this strategy is one of the main aspects of the paper, I believe it would be important to discuss its effects in more detail. It would be very interesting to see how this algorithm compares to image augmentation as it is currently being applied. This would mean comparing results of Mask R-CNN applied on the dataset created by this algorithm (without additional augmentation; currently both strategies are mixed) to the results obtained from a dataset of the same size, but created by established augmentation methods. Here, not only flipping, rotation and brightness changes should be applied, but some of the many more available augmentations (shearing, changing hue and saturation, and combinations of these techniques).

(José Braga) - The creation of synthetic forests is a very important step for the training of Mask R-CNN. The synthetic forests are created with different numbers of trees, ranging from 4 to 150. All synthetic forest images are 128 × 128 pixels with a spatial resolution of 0.5 meters. In synthetic images with a large number of trees (> 70), there is a significant amount of overlap between tree crowns. Each new overlap is a new tree crown pattern to be presented to the neural network during the training process. In addition, these overlaps between tree crowns mimic what happens in a real tropical forest, where crowns overlap considerably in the canopy. For this reason, the synthetic image creation algorithm is different from the data augmentation process applied before an image is presented to the neural network. Another problem for Mask R-CNN training is that all training images must have all objects of interest outlined. In a 128 × 128 pixel patch from the WV-2 Santa Genebra image, the number of tree crowns is very large (greater than 100), and if even one tree crown is not delineated, this can hinder the training of the neural network. It is simply very difficult, if not impossible, to outline all the tree crowns in a sufficient number of 128 × 128 images to train a deep learning algorithm; it could take months of manual sampling. For this reason, the algorithm for creating synthetic forests is very important, since it produces 128 × 128 pixel images with all tree crowns outlined. We also use data augmentation to decrease the chance that the neural network overfits during training. We combined several data augmentation techniques, and those that produced the best values for the training evaluation metrics (total loss and validation loss) were flipping, rotation and brightness changes, and combinations among them.
This is because we train the network with examples from the same image on which we evaluate the results. If we used other images, from other regions or from different sensors, more data augmentation operations could be necessary. Data augmentation with flipping, rotation and brightness changes further increases the variability of the patterns. These two methods combined present a great variability of images similar to real forests and could also help prevent overtraining of the neural network.
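The core idea described above, pasting delineated crown cutouts at random positions on a blank canvas so that later crowns occlude earlier ones, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the crown inputs, the occlusion rule and the assumption that crowns are smaller than the canvas are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_synthetic_forest(crowns, n_trees, size=128):
    """Paste randomly chosen crown cutouts onto a blank canvas.

    crowns: list of (patch, mask) pairs, where patch is an (h, w, 3)
    RGB array and mask is an (h, w) boolean array of crown pixels.
    Crowns are assumed to be smaller than the canvas.
    Returns the composite image and one instance mask per pasted crown.
    """
    image = np.zeros((size, size, 3), dtype=np.uint8)
    instance_masks = []
    for _ in range(n_trees):
        patch, mask = crowns[rng.integers(len(crowns))]
        h, w = mask.shape
        y = rng.integers(0, size - h)
        x = rng.integers(0, size - w)
        # Later crowns overwrite earlier ones, producing the overlaps
        # that mimic a closed tropical canopy.
        image[y:y+h, x:x+w][mask] = patch[mask]
        full = np.zeros((size, size), dtype=bool)
        full[y:y+h, x:x+w] = mask
        # Remove the occluded pixels from previously placed crowns.
        for m in instance_masks:
            m &= ~full
        instance_masks.append(full)
    return image, instance_masks
```

Because each crown's mask is known at paste time, every generated image comes with a complete set of instance masks, which is exactly the fully annotated training input that Mask R-CNN requires.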

Reviewer 1 (Q2) - The second main novelty is the application field that is being explored. Since this paper is very application oriented, the evaluation of results should also be application oriented. For example, the authors state that a tree crown is considered as correctly identified by the algorithm when 50% of its pixels are identified. How was the threshold of 50% determined? If -as the authors state- metrics such as diameter and area of crowns are the important tree metrics in forest inventories, isn´t it problematic to have trees inventoried with an area possibly as low as 50% of their real area?

(José Braga) - The value of 50% relates to the evaluation set only, not to all tree crowns detected in the Santa Genebra forest region. The algorithm determined that 395 tree crowns had at least 50% of their pixels correctly detected. Of these 395, 345 (or 81%) tree crowns had more than 70% of their pixels correctly detected. Other metrics were applied to assess the delineation, such as the IoU and the F1 score. For the IoU metric, 82% (349) of the detected tree crowns obtained a value greater than 0.5; according to Weinstein et al., this threshold is more stringent. For the F1 score, the average value obtained was 0.77 considering all the tree crowns detected, which is quite significant. Weinstein et al. calculated Precision and Recall only for tree crowns with IoU over 0.5; for this group, the values obtained in our research for Recall, Precision and F1 score were 0.81, 0.91 and 0.86, respectively. We added this information in Section 3.2 (Delineation accuracy) and conducted a more in-depth discussion in the text using the metrics cited in this response; see Section 4.2 (TCDD delineation performance).
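For a single pair of predicted and reference crown masks, the IoU and the pixel-wise Precision, Recall and F1 discussed above follow the standard definitions. A minimal sketch (an illustration of those definitions, not the authors' evaluation code):

```python
import numpy as np

def iou(pred, ref):
    """Intersection over Union between two boolean crown masks."""
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return inter / union if union else 0.0

def precision_recall_f1(pred, ref):
    """Pixel-wise precision, recall and F1 for one delineated crown."""
    tp = np.logical_and(pred, ref).sum()          # correctly detected pixels
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / ref.sum() if ref.sum() else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A crown counted as detected under the 50% criterion is one whose recall is at least 0.5; the stricter Weinstein et al. criterion instead requires `iou(pred, ref) > 0.5`.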

You are correct: if most tree crowns had a fraction of correctly detected pixels close to 50%, this could negatively impact the forest inventory. However, this does not happen, because 81% of tree crowns had at least 70% of their pixels correctly detected, and the average fraction of correctly detected pixels was 84%. We added this information in the text; see Section 3.1 (Detection accuracy).

Following your guidance, we performed a more in-depth discussion of the detection accuracy and compared the detection rate with other TCDD studies; please check Section 4.1 (TCDD detection performance).

We also applied the algorithm to the entire forest, and our methodology detected 59,062 tree crowns. Within this set, we calculated the average tree crown size and identified the 5th and 95th percentiles to show the most common crown sizes.

We believe that with your recommendations, which have been incorporated, the analysis in the text has become more application oriented.

Thank you for your suggestion.

Changes in the text:

________________________________________________________

Section 3: Results

Subsection 3.1: Detection accuracy

In the paragraph starting at line 398, lines 399 to 401 were added

Subsection 3.2: Delineation accuracy

In the paragraph starting at line 459, information was added in lines 460 and 461

Table 7 was modified

____________________________________________________

Section 4: Discussion

Subsection 4.1: TCDD detection performance

The paragraph starting at line 490 was modified

The paragraph starting at line 503 was modified

Table 8 was added

Subsection 4.2: TCDD delineation performance

The paragraph starting at line 526 was modified

The paragraph starting at line 539 was modified

Table 9 was added

______________________________________________

Reference:  Weinstein, B.G.;  Marconi, S.;  Bohlman, S.;  Zare, A.;  White, E. Individual tree-crown detection in RGB imagery using semi-supervised deep learning neural networks. Remote Sensing 2019, 11, 1-13.

Reviewer 1 (Q3) - Another question related to this aspect is to what degree the example forest at hand represents tropical rainforests globally (e.g., seasonality, leaf loss).

(Matheus Ferreira) - Thank you very much for your question. Seasonal semi-deciduous forests (SSF) are tropical forest formations that are subjected to a well-defined dry season, in which a portion of the trees defoliate. The percentage of overstory trees that present total leaf loss varies from 20% to 50% during the dry season (Veloso et al., 1991). In Brazil, SSF originally covered some 500,000 km², mainly in the southern and southeastern parts of the country. Seasonal semi-deciduous forests feature a highly complex canopy in terms of density, structure and species richness. For example, floristic assessments performed in the study area showed that there are more than 100 tree species per hectare, which is similar to other tropical forest sites. Moreover, one can note in the satellite images that the forest canopy is composed of overlapping individuals with varying sizes and shapes, which is typical of humid tropical forests. Thus, we believe that our proposed approach can be applied to other tropical forest sites for which high-resolution images are available.

(José Braga) – Thank you very much for your question. According to Ferreira et al., in seasonal semi-deciduous forests (for example, the Santa Genebra Reserve), the percentage of overstorey trees that present total leaf loss varies from 20% to 50% during the dry season. The image used in our research was taken in December, during the wet season, when all the crowns are likely foliated. If a tree crown is not foliated, more shade is present within the crown and its delineation becomes more difficult. (This information is in Section 4.3.2.)

Surveys performed in the Santa Genebra forest found nearly 100 woody species within one hectare (Farah et al., 2014; Guaratini et al., 2008). In the Santa Genebra Reserve, the forest canopy is quite heterogeneous and is composed of deciduous and evergreen species (Farah et al., 2014; Ferreira et al., 2016).

References

Veloso, H.P.; Rangel Filho, A.L.R.; Lima, J.C.A. Classificação da Vegetação Brasileira Adaptada a um Sistema Universal. Rio de Janeiro: IBGE, 1991; 123p.

Ferreira, M.P.; Zortea, M.; Zanotta, D.C.; Shimabukuro, Y.E.; de Souza Filho, C.R. Mapping tree species in tropical seasonal semi-deciduous forests with hyperspectral and multispectral data. Remote Sensing of Environment 2016, 179, 66-78.

Farah, T.; Rodrigues, R.R.; Santos, F.A.M.; Tamashiro, J.Y.; Shepherd, G.J.; Siqueira, T.; Batista, J.L.F.; Manly, B.J.F. Forest destructuring as revealed by the temporal dynamics of fundamental species: case study of Santa Genebra Forest in Brazil. Ecological Indicators 2014, 37, 40-44.

Guaratini; Gomes, E.C.P.; Tamashiro, J.Y.; Rodrigues, R.R. Composição florística da Reserva Municipal de Santa Genebra, Campinas, SP. Revista Brasileira de Botânica 2008, 31, 323-337.

Changes in the text

_______________________________________________

Section 2: Materials and Methods

Subsection 2.1: Study site

The paragraph starting at line 187 was modified

Reviewer 1 (Q4) - the authors prominently state the overall detection accuracy of 96%, which in this application is the least meaningful metric. The metrics shown in Table 7, especially the F1 score, should be discussed more prominently, compared to other studies, and mentioned in the abstract instead of the detection accuracy.

(José Braga) - Thank you very much for your suggestion. You are correct; we changed the abstract to include more meaningful metrics.

Following your guidance, in the analysis of the results we described the Recall, Precision and F1 score metrics in more detail and compared our results with other studies that applied these metrics; see Tables 7 and 9. Please check Sections 4.1 and 4.2 to evaluate our changes in the text.

Following your guidance, in the conclusion section we made changes to present the results of these more meaningful metrics to the reader.

The change suggested in the abstract was carried out.

Changes in the text:

__________________________________________________

Lines changed in the Abstract.

Information was added to and removed from lines 11 and 12.

____________________________________________________

Subsection 4.2: TCDD delineation performance

The paragraph starting at line 526 was modified

The paragraph starting at line 539 was modified

Table 9 was added

___________________________________________________

Section 5: Conclusions

The paragraph starting at line 643 was modified

___________________________________________________

Reviewer 1 (Q5) - in order to get a better idea of the algorithm's performance, there should be a figure showing the location of the training areas/samples and the test areas in relation to the whole study area.

(José Braga) - Thank you very much for your suggestion. We added a figure to the supplementary material showing the trees belonging to the training set, the trees from the evaluation set, and the Mask R-CNN response (patternsAndResponse.jpg). The supplementary material also includes an image (santagenebra.jpg) showing the result of the delineation over the entire Santa Genebra forest.

Reviewer 1 (Q6) - how are different tree species represented in the training data set? Does this have an impact on the results?

(José Braga) – Thank you very much for your question. We took care to collect the largest possible variety of tree crowns, so as to train the algorithm to delineate as many different tree crowns as possible. There was no specific concern with tree species when constructing the training set. Tree crowns of different sizes, shapes and colors were collected so that the neural network could delineate the tree crowns of any species from the Santa Genebra reserve.

Reviewer 1 (Q7) - Fig.7: the loss function seems to stop improving after a bit more than 90 epochs. Why continue as far as epoch 120.

(José Braga) - Thank you very much for your question. Yes, you are correct: the improvement in the metrics slows down. Nevertheless, the values of both metrics at the end of epoch 120 were the best among all epochs. The drop is less pronounced, but the decline in training and validation loss continues until the end of the training. For this reason, we decided to use the weights obtained at epoch 120 to produce the results.

Reviewer 1 (Q8) - Fig. 7: there is a strange phase of the loss function being stable around epochs 40-45. what could be the reason?

(José Braga) - Thank you very much for your question. This behavior was likely caused by the learning rate decay. During the training, the learning rate was divided by 10 every 40 epochs. This was done because, without this decay, the model overfitted; that is, the total training loss decreased but the total validation loss remained constant or worsened. With this decay, the metrics briefly stabilized between epochs 40 and 45, but soon afterwards they started to fall again.

Another learning rate decay was applied after the 80th epoch, but in this case the improvement was smooth until the 120th epoch. The training phase is an iterative step of neural network configuration, and a standard scenario in an iterative process is a "fake stabilization" (no change in the cost function), which is canceled as the iterative process continues, as shown in Fig. 7. Therefore, only a stabilization that persists over many more iterations should be considered; see Fig. 7 after iteration 95.
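The schedule described here, dividing the learning rate by 10 every 40 epochs, is a standard step decay. A minimal sketch (the base learning rate of 1e-3 is an illustrative assumption, not a value from the paper):

```python
def step_decay(base_lr, epoch, drop_every=40, factor=10.0):
    """Divide the learning rate by `factor` every `drop_every` epochs.

    With drop_every=40 this reproduces the schedule described above:
    decays take effect at epochs 40 and 80 within a 120-epoch run.
    """
    return base_lr / (factor ** (epoch // drop_every))

# Example: base learning rate of 1e-3 (illustrative)
schedule = [step_decay(1e-3, e) for e in range(120)]
```

Each decay shrinks the optimizer's step size, which is what produces the brief plateau followed by a renewed drop in the loss curves around the decay epochs.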

Reviewer 1 (Q9) - The text, specifically in chapters 1 and 2, needs to be streamlined and focused on the main aspects of the paper.

(José Braga) - Thank you very much for your suggestion; several changes were made to the text following your comments. We believe that, thanks to your suggestions, the text is more concise and clear. The main changes were made in Sections 1 and 2.

 Changes in the text:

____________________________________________

Section 1: Introduction

The paragraph starting at line 19 was modified in lines 20 and 22 and between lines 26 and 32

The paragraph starting at line 33 was modified in lines 37 and 38; lines 39 to 42 were removed

In the paragraph starting at line 43, information was removed from lines 43 to 45

The paragraph starting at line 55 was modified

The paragraph starting at line 72 was modified

The paragraph starting at line 78 was removed (it was joined to the paragraph starting at line 55)

The paragraph starting at line 87 was removed

The paragraph starting at line 96 was modified in lines 100 and 101

The paragraph starting at line 120 was modified in line 120; lines 123 to 128 were removed and lines 129 to 133 were added

The paragraph starting at line 134 was removed

The paragraph starting at line 141 was modified in lines 141, 142, 143, 146 and 147

The paragraph starting at line 150 was modified in lines 157, 158, 159, 161, 162, 163 and 165

In the paragraph starting at line 168, lines 178 to 181 were added

________________________________________________

Section 2: Materials and Methods

Subsection 2.2: Worldview-2 satellite image

Lines 200 to 209 were summarized into lines 210 to 215

_________________________________________________

Reviewer 1 (Q10) - With regards to the presentation quality, I believe the manuscript should be carefully revised to minimize minor flaws, which are currently quite frequent. Examples are spelling errors such as "cont" instead of "count" in the pseudocode, "crow" instead of "crown" in Fig. 5, and "individually tree information" instead of "individual tree information" in the abstract.

(José Braga) - Thank you very much for your attention. The errors have been corrected.

 

Changes in the text:

_____________________________________________________

Abstract

In line 5, one word was corrected (individual)

____________________________________________________

Subsection 2.5 Synthetic forest images for training

Line 270: "cont" was corrected to "count"

Line 275: "cont" was corrected to "count"

Line 295: the word "crow" in Fig. 5 was corrected to "crown"

Author Response File: Author Response.docx

Reviewer 2 Report

This paper applies instance segmentation based on the Mask R-CNN architecture to RGB satellite imagery of forests with a spatial resolution of 0.5 meters to perform tree crown detection and delineation, a process that can ease the extraction of forest inventories over large areas. The case study was developed in the Santa Genebra Forest Reserve in Brazil. The authors start by applying the LMVM pan-sharpening process to WV-2 images provided by DigitalGlobe, consisting of panchromatic and RGB images with spatial resolutions of 0.5 and 2 meters respectively, after which they obtain RGB images with a spatial resolution of 0.5 meters. In the paper, the authors manually label the contours of tree crowns and split them into training and validation datasets. To emulate different tree densities and enrich the training dataset, the authors created an algorithm that produces images of synthetic forests from a set of delineated tree crowns. Each image in the dataset covers 128 x 128 pixels, which corresponds to a 64-meter x 64-meter area. Then, the authors use an available implementation of the Mask R-CNN architecture and perform the training using a data augmentation process with horizontal and vertical flips, geometric rotations of 90, 180, and 270 degrees, and pixel brightness changes between 50% and 150%. After training and validating the model using different metrics (such as class loss, bounding box loss, and mask loss), the authors perform the inference on a third randomly-sampled and manually-labeled evaluation set. As the inference is performed on sub-images or patches of 128 x 128 pixels, the results over the entire geographic extent of the study are merged at the end. Finally, the authors report an overall detection accuracy of 96%, a Kappa index value of 0.919, and an F1 score of 0.77.
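The augmentation policy summarized above (horizontal/vertical flips, right-angle rotations, and a brightness factor between 50% and 150%) can be sketched in a few lines of NumPy. This is an illustrative version, not the authors' code; in practice the same geometric transforms must also be applied to the instance masks so image and annotation stay aligned.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Apply one random combination of the augmentations described above."""
    if rng.random() < 0.5:
        image = image[:, ::-1]          # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]          # vertical flip
    k = rng.integers(4)                 # rotate by 0, 90, 180 or 270 degrees
    image = np.rot90(image, k)
    factor = rng.uniform(0.5, 1.5)      # brightness change, 50% to 150%
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)
```

Only the brightness step alters pixel values; the flips and rotations are label-preserving reorderings, which is why they combine cheaply with the synthetic forest generation described in the paper.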

In my opinion, the paper is well written, it is easy to follow, and it provides a useful solution to an important problem using well-known algorithms in remote sensing (e.g., pan-sharpening) and computer vision (e.g., instance segmentation through Mask R-CNN, data augmentation, etc.). The paper includes multiple performance metrics and graphs to assess the method's performance in a complex case study with a high tree density. As the proposed methodology only requires satellite imagery, it can be extended to other geographies, enabling a more efficient way of tracking forests over time. The paper's code is already available on GitHub: https://github.com/jgarciabraga/MASK_RCNN_TCDD


After reading the paper, I have the following questions:

* In Figure 3, why are the hand-labeled boundaries for training and validation not following the exact contour of the tree crowns? How subjective and error-prone is this labeling process?

* In Table 4, the performance metrics in the validation set are worse than those in the training set, at least by a factor of two. By looking at Figure 7, it also seems that the gap between the training and validation losses increases after about 45 epochs. Are you sure that the model is not overfitting?

* By looking at Figure 14.B, it looks like the algorithm is overestimating the number of tree crowns, i.e.: it is producing a larger number of small tree crowns that are just not there. If that is the case, how can you address this issue? (future work).

* In Section 4.3.3 you explain that, as part of the processing pipeline, you have to merge the predictions of the algorithm over the different patches. However, this process is not explained elsewhere in the paper and should be included. I believe that it plays an important role in achieving the overall accuracy during the delineation process, considering that the Mask R-CNN will likely provide cropped delineation predictions near the patch boundaries.

Author Response

(José Braga) Dear Reviewer #2, thank you very much for your positive review, which I believe has improved the paper. In the following text you will find your comments (in black) and our answers (in blue). We have answered all your questions and made all the changes you suggested. Thank you very much for your attention.

Reviewer 2 (Q1) - In Figure 3, why are the hand-labeled boundaries for training and validation not following the exact contour of the tree crowns? How subjective and error-prone is this labeling process?

(José Braga) - One of the most difficult steps of our research was to collect tree crowns for training the neural network. The gathering of patterns was done manually over the WV-2 image with a resolution of 0.5 meters per pixel. Despite the high spatial resolution of the image, delineating the limits of each tree crown is quite complex, especially within a tropical forest; being absolutely sure where one tree crown ends and another starts was the most difficult part. Thus, in order not to add too much noise to the neural network training, we outlined the crowns only as far as the specialist was sure, and for this reason some tree crown patterns may not be completely outlined. This is the most error-prone part of the work, despite the collection being done by a specialist. For future work on the Santa Genebra forest region, data from a lidar sensor or field data could be used to improve the collection of patterns; the results obtained could be used to validate this research and even improve its results.

Changes in the text:

_________________________________________

Section 5: Conclusions

In the paragraph starting at line 654, lines 659 to 661 were added.

Reviewer 2 (Q2) -  the performance metrics in the validation set are worse than those in the training set, at least by a factor of two. By looking at Figure 7, it also seems that the gap between the training and validation losses increases after about 45 epochs. Are you sure that the model is not overfitting?

(José Braga) - During the training of an artificial neural network there will always be a gap between the values of training loss and validation loss, because validation is performed with patterns that were not presented to the neural network during training. There is overfitting when the validation error (validation loss) starts to increase while the training error (training loss) still decreases [2].

In our research, the difference between the training and validation metrics reached a factor of 2. There are studies that apply deep learning with factors between validation and training error greater than 10 which were not considered, up to that point in training, to be overfitting [1]. Overfitting occurs when the validation loss becomes worse while the training loss keeps improving [2,3].

Especially in deep learning, overfitting is considered to occur when the validation loss starts to get worse (its curve in the graph is ascending) while the training loss continues to improve (its curve is descending) [2,3]. Throughout our training, the validation loss improved, even if at a smoother rate, as it has done since the 85th epoch.

In addition, to avoid overfitting, we used the synthetic forest construction algorithm, which increases the variability during training, and we applied data augmentation techniques. The Mask R-CNN algorithm also incorporates regularization techniques to avoid overfitting [4].

For the reasons presented, we believe that there is no overfitting in our training, mainly because both the validation loss and the training loss improved throughout the training. If overfitting occurred, the validation loss would increase while the training loss decreased.
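The criterion described above, watching whether the validation loss keeps improving rather than turning upward, is commonly automated as early stopping with a patience window. A minimal sketch (the patience value is an illustrative assumption, not a setting from the paper):

```python
def detect_overfitting(val_losses, patience=5):
    """Return the epoch at which training should stop, or None.

    Flags overfitting when the validation loss has not improved for
    `patience` consecutive epochs, the criterion discussed above.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0     # new best validation loss
        else:
            stale += 1                # no improvement this epoch
            if stale >= patience:
                return epoch
    return None
```

Applied to the curves in Fig. 7, a monotonically (even if slowly) improving validation loss never trips this check, which is consistent with the authors' argument that training to epoch 120 did not overfit.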

Reference: [1] - El-Khatib, H.; Popescu, D.; Ichim, L. Deep Learning-Based Methods for Automatic Diagnosis of Skin Lesions. Sensors 2020, 1-25.

Reference: [2] - Le, T.; Huynh, D. T.; Pham, H. Efficient Human-Robot Interaction using Deep Learning with Mask R-CNN: Detection, Recognition, Tracking and Segmentation. In: 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV 2018).

Reference: [3] - Khan, S.; Rahmani, H.; Shah, S. A. A.; Bennamoun, M. A Guide to Convolutional Neural Networks for Computer Vision. Morgan & Claypool Publishers.

Reference: [4] - He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. B. Mask R-CNN. arXiv (Computing Research Repository).

Reviewer 2 (Q3) - it looks like the algorithm is overestimating the number of tree crowns, i.e.: it is producing a larger number of small tree crowns that are just not there. If that is the case, how can you address this issue?

The Mask R-CNN was effective in delineating tree crowns. Of course, there is an error associated with this segmentation procedure, such as overestimating the number of tree crowns. The Santa Genebra Reserve is a secondary forest and therefore contains a large number of small tree crowns, and the construction of the training set followed this trend. The average tree crown size in the training data is 49.54 m², which is considered small for the Santa Genebra forest. Considering the training and validation sets together, 75% of the tree crowns have an area less than or equal to 55 m². To reduce these errors, we can add larger tree crowns to the training set, which could decrease the excessive segmentation.

Reviewer 2 (Q4) - you explain that as part of the processing pipeline, you have to merge the predictions of the algorithm over the different patches. However, this process is not explained elsewhere in the paper and it should be included. I believe that it plays an important role in achieving the overall accuracy during the delineation process, considering that the Mask-R CNN will likely provide cropped-delineation predictions near the patch boundaries.

(José Braga) - Thank you very much for your suggestion. You are correct; we made the change in the text and added information about the algorithm that merges two tree crowns belonging to two different patches.
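The merging step discussed here can be illustrated with a simplified sketch (not the exact procedure from the paper): each predicted crown is represented as a set of global (row, col) pixel coordinates, and crowns from different patches are merged whenever their masks overlap, which handles a crown cut in two at a patch boundary. The function name and `min_overlap` threshold are hypothetical.

```python
def merge_patch_crowns(crowns, min_overlap=1):
    """Merge crown masks predicted on different patches.

    `crowns` is a list of sets of (row, col) pixels in the global image
    frame. Two crowns are merged (unioned) when they share at least
    `min_overlap` pixels; merging repeats until no pair overlaps.
    Illustrative sketch only, not the paper's exact implementation.
    """
    merged = [set(c) for c in crowns]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if len(merged[i] & merged[j]) >= min_overlap:
                    merged[i] |= merged[j]  # union of the two masks
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

For example, two half-crowns sharing boundary pixels collapse into one mask, while an isolated crown in another patch is left untouched.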

Changes in the text:

_________________________________________

Section 2: Materials and methods

Subsection 2.6: Training a Mask R-CNN for TCDD

The paragraph starting at line 335, line 337, and lines 338 to 343 were added

_____________________________________________

Subsection 4.3: Algorithm requirements

Subsubsection 4.3.3: Algorithm limitations

The paragraph starting at line 591 was modified

The paragraph starting at line 598 was added

Author Response File: Author Response.docx

Reviewer 3 Report

The authors propose a crown delineation method based on a CNN. CNN is a well-known technology for pattern recognition and detection compared with traditional image-based methods. Since this study compares the proposed method with [17], it should be explained whether [17] is a CNN-based or a traditional method. If [17] is CNN-based, the theoretical advancement should be clearly described. Otherwise, what is the significance of using a CNN compared to a traditional method? In either case, the theoretical advancement and the reason for comparing with [17] should be explained in more detail.

Author Response

(José Braga) Dear Reviewer #3, thank you very much for your positive review, which I believe has improved the paper. In the following text you will find your comments (in black) and our answers in blue. We have answered all your questions and made all the changes you suggested. Thank you very much for your attention.

Reviewer 3 (Q1) - The authors propose a crown delineation method based on a CNN. CNN is a well-known technology for pattern recognition and detection compared with traditional image-based methods. Since this study compares the proposed method with [17], it should be explained whether [17] is a CNN-based or a traditional method. If [17] is CNN-based, the theoretical advancement should be clearly described. Otherwise, what is the significance of using a CNN compared to a traditional method? In either case, the theoretical advancement and the reason for comparing with [17] should be explained in more detail.

(José Braga) - Thank you very much for your question. We compared our approach with the crown delineation model developed by Wagner et al. at the same study site. Wagner et al.'s research applies an algorithm based on region growing and edge detection, which can be defined as a traditional method. The main comparison we made with Wagner et al. is the number of tree crowns delineated across the whole Santa Genebra forest: our approach found 59,062 tree crowns, whereas Wagner et al. found 23,278. In addition, we use a set of established metrics (number of correctly delineated pixels, IoU, recall, precision, and F1 score) that can serve as guidelines for future research, since, as Weinstein et al. point out, there is a lack of standardization in the evaluation of algorithms that perform TCDD.

We made changes to the text to clarify these points. For example, Wagner et al.'s research applies an algorithm based on region growing and edge detection; in lines 501 and 502 we added the sentence: “The research developed by Wagner et al proposed an algorithm based on edge detection and region growing to perform the TCDD.”

In Section 4 (Discussion), we made changes to the text to provide an in-depth discussion and comparison with other studies.
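The evaluation metrics mentioned above follow their standard pixel-wise definitions. As a minimal sketch (the function name and the set-of-pixels representation are assumptions for illustration, not the paper's code), they can be computed from a predicted and a reference crown mask as follows:

```python
def delineation_metrics(pred, truth):
    """Standard pixel-wise metrics for tree crown delineation, computed
    on two sets of (row, col) crown pixels (illustrative sketch)."""
    tp = len(pred & truth)  # pixels correctly delineated
    fp = len(pred - truth)  # predicted pixels not in the reference
    fn = len(truth - pred)  # reference pixels that were missed
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"IoU": iou, "precision": precision, "recall": recall, "F1": f1}
```

Reporting all four values, rather than a single score, is what allows the comparison across studies discussed in Section 4.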

Changes in the text:

_____________________________________________

Subsection 2.7: Independent algorithm assessment

The paragraph starting at line 352 was modified

_____________________________________________

Section 4: Discussion

Subsection 4.1: TCDD detection performance

The paragraph starting at line 490 was modified

The paragraph starting at line 503 was modified

Table 8 was added

Subsection 4.2: TCDD delineation performance

The paragraph starting at line 526 was modified

The paragraph starting at line 539 was modified

Table 9 was added

______________________________________________

Reference: Weinstein, B. G.; Marconi, S.; Bohlman, S.; Zare, A.; White, E. Individual tree-crown detection in RGB imagery using semi-supervised deep learning neural networks. Remote Sensing 2019, 11, 1-13.

Wagner, F. H.; Ferreira, M. P.; Sanchez, A.; Hirye, M. C.; Zortea, M.; Gloor, E.; Phillips, O. L.; de S. Filho, C. R.; Shimabukuro, Y. E.; Aragão, L. E. Individual tree crown delineation in a highly diverse tropical forest using very high resolution satellite images. ISPRS Journal of Photogrammetry and Remote Sensing 2018, 145, 326-327.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Thank you for the detailed answers to my questions.

One final remark: I could not find the figure patternsAndResponse.jpg in the supplementary files. As mentioned earlier, I think it is important to include this figure to get a better impression of the performance by inspecting the location of the training areas/samples and the test areas in relation to the whole study area, i.e. what the algorithm was presented for learning in relation to where it was used for prediction.

Author Response

(José Braga) Dear Reviewer #1, thank you very much for your positive response to the major revision; I believe your review has improved the paper. In the following text you will find your question (in black) and the answer in blue. Thank you very much for your attention.

Reviewer 1 (Q1) - One final remark: I could not find the figure patternsAndResponse.jpg in the supplementary files. As mentioned earlier, I think it is important to include this figure to get a better impression of the performance by inspecting the location of training areas/sample and the test areas in relation to the whole study area, i.e. what the algorithm was presented for learning in relation to where it was used for prediction.

(José Braga) - Due to the size of the files, when I uploaded the files for the major revision I placed the figure patternsAndResponse.jpg in a folder called supplementary. This folder is within the package (the zip file) that contains all the files (template.tex, references.bib, the figures folder, the Definitions folder, and the supplementary folder). I had to do this because the submission site returned an error when I sent the files in separate folders.

Please check the main zip file containing all the files; inside it there is a folder called supplementary where the figure patternsAndResponse.jpg is located.

To avoid this problem, I am also including the image on the next page of this response (see the PDF file).

Thank you very much.

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors reinforced the comparison and the explanation of performance. However, the theoretical advancement is still not clearly described. Since CNN is a well-known technology for pattern recognition and detection compared with traditional image-based methods, not only the performance should be studied, but the theoretical advancement also needs a clearer discussion.


Author Response

(José Braga) Dear Reviewer #3, thank you very much for your review, which I believe has improved the paper. In the following text we answer your question and describe the changes you suggested. You will find your comment (in black) and our answer in blue.

Reviewer 3 (Q1) - The authors reinforced the comparison and the explanation of performance. However, the theoretical advancement is still not clearly described. Since CNN is a well-known technology for pattern recognition and detection compared with traditional image-based methods, not only the performance should be studied, but the theoretical advancement also needs a clearer discussion.

(José Braga) - Thank you very much for your question. Our study proposes the use of a CNN-based algorithm (Mask R-CNN) to perform TCDD over a tropical forest. Our results are very good: the detection accuracy was over 90%, and for delineation accuracy the F1 score and IoU values were 0.86 and 0.73, respectively.

Most algorithms developed for TCDD target temperate forests and use techniques such as region growing, edge detection, local maxima, and template matching (with a specific geometric form, such as an ellipse) [1, 2]. Applying these algorithms over tropical forests, where tree crowns are far less homogeneous, can be difficult because they may require several configuration steps. In contrast, our technique can be extended to any type of region (for example, temperate forests or other tropical forests), because it depends only on a set of hand-annotated ITCs to feed the synthetic image creation algorithm.

Tropical forests can contain a great number of tree crowns with a wide variety of colors, sizes, and shapes, and even with very high-resolution images it is difficult or impossible to hand-annotate all ITCs within a given region. We believe that one major theoretical advancement of our work is that the model can be trained on synthetic data and still perform very well, as shown by the different accuracy values.
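The core idea of the synthetic forest construction, pasting hand-annotated crown cutouts at random positions onto a background while recording per-crown instance labels, can be sketched as below. This is a simplified illustration under assumed array shapes, not the paper's actual algorithm; `make_synthetic_forest` and its arguments are hypothetical names.

```python
import numpy as np

def make_synthetic_forest(background, crowns, rng=None):
    """Build a synthetic training image by pasting hand-annotated crown
    cutouts at random positions on a background (illustrative sketch).

    `background` is an (H, W, 3) array; each crown is a pair
    (patch, mask) where `patch` is (h, w, 3) and `mask` is a boolean
    (h, w) array marking the crown pixels inside the cutout."""
    rng = np.random.default_rng(rng)
    image = background.copy()
    labels = np.zeros(background.shape[:2], dtype=np.int32)
    H, W = labels.shape
    for idx, (patch, mask) in enumerate(crowns, start=1):
        h, w = mask.shape
        r = rng.integers(0, H - h + 1)  # random top-left corner
        c = rng.integers(0, W - w + 1)
        region = image[r:r + h, c:c + w]
        region[mask] = patch[mask]            # paste only crown pixels
        labels[r:r + h, c:c + w][mask] = idx  # per-crown instance label
    return image, labels
```

The returned `labels` array plays the role of the instance masks Mask R-CNN needs for training, which is why a modest set of hand-annotated ITCs can be recombined into a much larger training set.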

We added this information to the text to make the advance of our study over other techniques clear. In Section 4 (Discussion), we made changes to the text to provide this information.

Changes in the text:

_____________________________________________

Section 4: Discussion

We created a new subsection at line 533:

Subsection 4.4: Algorithm's advance

We added a paragraph starting at line 534.

We added a paragraph starting at line 539.

We added a paragraph starting at line 548.

 

References 

[1] - Ke, Y.; Quackenbush, L. J. A review of methods for automatic individual tree-crown detection and delineation from passive remote sensing. International Journal of Remote Sensing 2011, 32, 4725-4747.

[2] - Gomes, M. F.; Maillard, P.; Deng, H. Individual tree crown detection in sub-meter satellite imagery using Marked Point Processes and a geometrical-optical model. Remote Sensing of Environment 2018, 211, 184-195.

Author Response File: Author Response.pdf
