Forest monitoring provides essential information to support public policies related to protection, control, climate change mitigation, and sustainable development. Continuous monitoring of forest trends through remote sensing enables cost-efficient measurement of vegetated ecosystems. In this context, satellite observations constitute a suitable platform to cover large areas at regular intervals [1].
In remote sensing, object detection is a common and challenging problem aiming to locate instances of a given object class in a specific image [2
]. In the forest monitoring context, single tree detection is an essential task for many applications, including resource inventories, wildlife habitat mapping, biodiversity assessment, and hazard and stress management [3
]. Over the years, researchers have worked in this field, mapping single tree species based on different satellite imagery and achieving moderate results [4
]. In the last decade, new approaches emerged to take advantage of the characteristics of active sensors, especially light detection and ranging (LiDAR) systems, which became a trend for tree crown detection [9
]. More recently, the authors in [10
] concluded that combining LiDAR data with optical imagery generally leads to better classification accuracy. Although this conclusion might be generalized, the authors focused on classifying tree species in urban environments.
In fact, urban forests are a particular case of forests with singular attributes and peculiarities. Urban forests are commonly defined as woody vegetation located in an urban area and usually limited to single and/or groups of trees distributed in parking places, gardens, small parks, and along roads in the city. They may be associated with flower beds or be in contrast to grass and herbaceous shrubs [11
]. Thus, the heterogeneity of urban environments makes the accurate classification of tree species more challenging than in natural forests. Firstly, a high spatial resolution image is required to differentiate trees as individual objects. Secondly, with the progress of urbanization, urban trees are heavily influenced by surrounding urban patterns such as streets, communities, and factories [12].
More recently, unmanned aerial vehicles (UAVs) have made it possible to acquire images with appropriate temporal and spatial resolution to produce suitable datasets for mapping forested areas at the individual tree level [13]. This may allow better detection of single trees in urban scenarios.
The flexibility, versatility, and low cost, as well as the recent advances in high spatial resolution cameras [13
] have spread the use of UAVs in a wide range of applications, like precision agriculture [14
] and ecological, environmental, and conservation monitoring [16
]. Following this trend, Feng and Li [19
] proposed a method for mapping tree species in urban areas based on histograms and thresholding using UAV observations. Similarly, Baena et al. [20
] used object-based image analysis on high spatial resolution UAV images to identify and quantify tree species across different landscapes. Meanwhile, computer vision has evolved substantially in the last decade, mainly due to the introduction of deep learning methods. In this context, convolutional neural networks (CNNs) have become the most common approach for different image analysis tasks such as automatic classification, object detection, and semantic segmentation [3
]. Recently, CNNs have been widely applied to remote sensing problems, achieving state-of-the-art results in many applications [26
]. Some deep learning based approaches for tree species detection have been proposed in recent years. Li et al. [27
] presented a deep learning based framework for oil palm tree detection and counting, using high spatial resolution satellite images. Weinstein et al. [28
] used RGB images from an airborne observation platform along with airborne LiDAR data to detect tree crowns through a deep learning network.
Considering UAV platforms, Natesan et al. [29
] proposed a deep learning framework for tree species classification. In this approach, images of pre-delineated tree crowns were the inputs to a CNN that classified each delineated tree into one of three classes: red pine, white pine, and non-pine. Similarly, Masanori et al. [30
] used UAVs to acquire RGB images of individual tree crowns and carried out a multiresolution segmentation algorithm [31
] to classify seven different types of trees. Overall accuracy up to 89% was reported in this study. In [21
], Santos et al. proposed different deep learning methods for detecting law protected tree species using high resolution RGB imagery. These methods delivered a bounding box that enclosed each object instance, but did not delineate the shape or contour of the target. In contrast, semantic segmentation is the task of assigning a class label to each pixel in the image [32
]. Thus, semantic segmentation has the potential to capture object form and size more accurately than object detection, which may be essential in many applications.
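The difference between the two output types can be illustrated with a toy example. The following minimal sketch, which assumes a small hand-made binary mask (the mask and the helper name `bounding_box` are hypothetical and not taken from any of the cited works), derives the tightest axis-aligned box around a segmented object and compares the box area with the number of object pixels.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right) of the tightest box around the
    non-zero pixels of a 2D binary mask, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


# Hypothetical 4 x 4 mask of a roughly cross-shaped crown.
crown_mask = [
    [0, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

top, left, bottom, right = bounding_box(crown_mask)
box_area = (bottom - top + 1) * (right - left + 1)   # pixels enclosed by the box
mask_area = sum(sum(row) for row in crown_mask)      # pixels labeled as crown
print(box_area, mask_area)
```

In this toy case the box encloses more pixels than the object occupies, which is the kind of overestimate of size and shape that a per-pixel labeling avoids.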
The first deep semantic segmentation methods were built on patch-based CNNs. This approach consists of splitting the image into patches and classifying the central pixel of each patch using a traditional CNN. A critical drawback of this method is its high computational cost, which stems from redundant operations on overlapping patches. To overcome this difficulty, fully convolutional neural networks (FCNs) were first proposed in [33
]. The network uses convolutional and pooling layers to build an end-to-end network able to manage different spatial resolutions and predict class labels for all pixels, exploiting context and location information of the objects in the scene. Later on, with U-Net [34
], a technique to improve the spatial accuracy of the segmentation outcome was proposed. Typically, in this approach, the input image is first processed by an encoder path consisting of convolutional and pooling layers that reduces the spatial resolution. It is then followed by a decoder path that recovers the original spatial image resolution by using upsampling layers followed by convolutional layers (“up-convolution”). In addition, the network uses the so-called skip connections appending the output of the corresponding layers in the encoder path to the inputs of the decoder path. The SegNet architecture [24
], like U-Net, employs the same principle of encoder and decoder paths. However, instead of using skip connections, the decoder makes use of the pooling indices computed in the pooling operation of the corresponding encoder layers to upsample the result up to the original image resolution. Recently, Mask R-CNN [35
] combined both detection and segmentation in an end-to-end fashion. Beyond predicting the class and the object bounding box, as required by the detection task, the network also outputs the binary object mask. Mask R-CNN was designed for instance segmentation. Strictly speaking, this application is different from the one addressed by the present study. In fact, the Mask R-CNN also uses an FCN which, however, only segments the region within the predicted bounding boxes.
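The index-based upsampling used by SegNet-style decoders can be sketched in a few lines. The code below is a simplified, pure-Python illustration under stated assumptions: a single-channel 2D input, non-overlapping 2 x 2 pooling, and hypothetical function names (`max_pool_with_indices`, `max_unpool`). It omits the trainable convolutions that follow the unpooling step in the actual architecture.

```python
def max_pool_with_indices(x, k=2):
    """Non-overlapping k x k max pooling over a 2D list.
    Returns the pooled map and, for each pooled value, its (row, col) in x."""
    h, w = len(x), len(x[0])
    pooled, indices = [], []
    for i in range(0, h - h % k, k):
        p_row, i_row = [], []
        for j in range(0, w - w % k, k):
            best = (i, j)
            for di in range(k):
                for dj in range(k):
                    if x[i + di][j + dj] > x[best[0]][best[1]]:
                        best = (i + di, j + dj)
            p_row.append(x[best[0]][best[1]])
            i_row.append(best)
        pooled.append(p_row)
        indices.append(i_row)
    return pooled, indices


def max_unpool(pooled, indices, shape):
    """Place each pooled value back at its stored position; all else is zero."""
    h, w = shape
    out = [[0] * w for _ in range(h)]
    for p_row, i_row in zip(pooled, indices):
        for value, (r, c) in zip(p_row, i_row):
            out[r][c] = value
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 1],
]
pooled, indices = max_pool_with_indices(feature_map)   # encoder side
restored = max_unpool(pooled, indices, (4, 4))         # decoder side
```

The restored map is sparse: each maximum returns to its original location, so boundary positions are preserved without storing the full encoder feature maps, which is the contrast with the skip connections described above.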
Some authors proposed the use of a conditional random fields (CRF) based post-processing to further improve the spatial and semantic accuracy of the FCN outcome (e.g., [25
]). Notwithstanding the reported improvements brought about by CRF, these methods have a significant drawback: FCN and CRF need to be trained separately, so such methods do not constitute an end-to-end solution. In the last few years, truly end-to-end FCN architectures for semantic segmentation were published, which reportedly performed at least as well as prior solutions that included CRF post-processing (e.g., [37
]). This was achieved through innovative techniques to capture multi-scale context within the FCN, such as global-to-local context aggregation as in ScasNet [37
] and atrous spatial pyramid pooling in DeepLabv3+ [38].
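The idea behind atrous (dilated) convolution, on which atrous spatial pyramid pooling is built, can be shown with a one-dimensional toy example. The sketch below is an illustration under simplifying assumptions rather than the DeepLabv3+ implementation: it uses a single fixed kernel of odd length, zero padding, and hypothetical function names, whereas the actual module applies separately learned 2D filters in each branch and fuses the branches with further convolutions.

```python
def atrous_conv1d(signal, kernel, rate):
    """1D correlation with zero padding in which kernel taps are spaced
    `rate` samples apart. Assumes an odd kernel length."""
    half = (len(kernel) // 2) * rate
    padded = [0] * half + list(signal) + [0] * half
    return [
        sum(w * padded[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def pyramid(signal, kernel, rates):
    """Apply the same kernel at several rates and stack the responses
    per position, mimicking the parallel branches of a pyramid module."""
    branches = [atrous_conv1d(signal, kernel, r) for r in rates]
    return [list(column) for column in zip(*branches)]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
narrow = atrous_conv1d(signal, kernel, rate=1)   # each output sees 3 adjacent samples
wide = atrous_conv1d(signal, kernel, rate=2)     # same 3 weights, spread over 5 samples
multi_scale = pyramid(signal, kernel, rates=(1, 2))
```

With the same three weights, a larger rate widens the span of input each output depends on, which is how context at several scales can be gathered without adding parameters or reducing resolution.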
In recent years, a few studies have already evaluated the potential of the FCN architectures, specifically U-Net, for forest mapping from optical images [39
]. In [39
], the authors used a U-Net to identify instances of a given tree species from WorldView-3 images. Similarly, in [40
], the U-Net was trained with the RGB bands and the digital elevation models (DEM) from high resolution UAV imagery. The importance of monitoring urban forests and the lack of studies on using FCNs’ capabilities for this purpose motivated the present study. In this paper, we propose and evaluate the use of five state-of-the-art deep learning methods for the semantic segmentation of an individual tree species in an urban context using RGB images derived from UAVs.
Specifically, we focus on identifying the canopy of the threatened species Dipteryx alata Vogel, also known as cumbaru. It occurs in midwestern Brazil and, due to its particular shade and architecture, is used for afforestation in urban areas. This species has tremendous social and economic relevance for the development of some areas of the Brazilian Cerrado [41]. It is threatened with extinction according to the IUCN (2020) (The International Union for Conservation of Nature’s Red List of Threatened Species, https://www.iucnredlist.org/species/32984/9741012), which makes its preservation a very important issue, since this particular species provides fruits for a large number of bird species.
The main contributions of this work are threefold: (I) to evaluate the capability of deep learning methods to segment individual trees on high spatial resolution RGB/UAV images; (II) to compare five state-of-the-art deep learning semantic segmentation methods, namely U-Net, SegNet, FC-DenseNet, and DeepLabv3+ with the Xception and MobileNetV2 backbones, for the segmentation of cumbaru trees on the aforementioned RGB/UAV imagery; and (III) to assess the improvements of using CRFs as a post-processing step for individual tree level semantic segmentation.
The remainder of this paper is organized as follows: Section 2 describes the study areas and introduces the fundamentals of FCNs, specifically the approaches used in this work. It further presents the protocol followed in our experimental analysis. Section 3 presents and discusses the experimental results. Finally, Section 4 summarizes the main conclusions of this work and points to future directions.
4. Conclusions and Research Perspective
In this work, we proposed and evaluated the use of state-of-the-art fully convolutional networks for semantic segmentation of a threatened tree species using high spatial resolution RGB images acquired by UAV platforms. Five architectures were tested: SegNet, U-Net, FC-DenseNet, and two DeepLabv3+ variants with the Xception and MobileNetV2 backbones. The analysis was conducted on a dataset that represented an urban context. The experiments demonstrated that the networks could learn the distinguishing features of the target tree species in a supervised way. This result suggests that the tested FCN designs could delineate other tree species, provided that enough representative labeled samples are available for training.
Among the tested networks, FC-DenseNet attained the best performance, achieving 96.7%, 96.1%, and 92.5% in terms of overall accuracy, F1-score, and IoU, respectively. Ranked second and third were U-Net and DeepLabv3+ MobileNetV2, respectively, with a difference of 1.4%, 1.7%, and 3.1% for overall accuracy, F1-score, and IoU, followed by SegNet. The lowest accuracy scores were achieved by DeepLabv3+ Xception with 88.9% for overall accuracy, 87.1% for the F1-score, and 77.1% for IoU. Notably, this was the most complex of all evaluated networks. It contained about 100 times more learnable parameters than FC-DenseNet, the best performing network.
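For reference, the three metrics can be computed from the pixel-wise confusion counts of a binary segmentation. The sketch below is a minimal illustration with hypothetical function names and toy masks, and it is not the evaluation code used in this study. It also includes the identity IoU = F1 / (2 − F1), which holds when both metrics are computed for the same class from the same counts. The reported pairs (96.1% and 92.5% for FC-DenseNet, 87.1% and 77.1% for DeepLabv3+ Xception) are consistent with this identity.

```python
def binary_metrics(pred, truth):
    """Return (overall accuracy, F1-score, IoU) of the positive class
    for two equally sized 2D binary masks."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return accuracy, f1, iou


def iou_from_f1(f1):
    """Convert a per-class F1-score into the IoU of the same class."""
    return f1 / (2.0 - f1)


# Toy masks: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
accuracy, f1, iou = binary_metrics(pred, truth)

print(round(iou_from_f1(0.961), 3))  # F1 reported for FC-DenseNet
print(round(iou_from_f1(0.871), 3))  # F1 reported for DeepLabv3+ Xception
```

Because IoU is a monotonic function of F1 for a single class, the two metrics rank the networks identically. IoU penalizes the same errors more strongly, which explains its wider spread across architectures.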
As for the computational efficiency, FC-DenseNet and DeepLabv3+ Xception were again the best and the worst performing networks, respectively, in terms of inference times.
We also observed in our study that post-processing the networks’ outcomes by a fully connected CRF was beneficial in nearly all cases. However, the impact on overall accuracy metrics was often numerically modest, because CRF generally fixed errors in small image regions. Yet, the improvement in segmentation quality was usually significant, as evidenced by visual inspection. The price for such accuracy gain was the comparatively long CRF processing time, about 30 times the FC-DenseNet’s inference time.
DeepLabv3+ Xception was by far the most complex of the tested networks. Though it is regarded in the literature as one of the top-performing FCNs, DeepLabv3+ Xception achieved the worst accuracy of all tested networks. This finding suggests that the training data were insufficient to estimate DeepLabv3+ Xception’s parameters properly. Even the simpler MobileNetV2 version, which involved just 1/20 of the learnable parameters, surpassed the Xception version in all experiments.
We also noticed that DeepLabv3+ Xception benefited from CRF more than all other architectures. In the continuation of this research, we intend to verify if CRF is generally able to mitigate the problem of scarce training data for FCN based semantic segmentation. Additionally, we aim to investigate the application of morphological operations as a post-processing alternative. Another issue that deserves further analysis concerns the generalizability of these methods. Unfortunately, the number and diversity of annotated databases available for this purpose are still limited. We are currently working on building a more diverse database in terms of sensors, tree species, and climate characteristics.