1. Introduction
Remote sensing applications in the precision agriculture field have diversified to include satellite, aerial, and hand-held or tractor-mounted sensors [1]. Remote sensing using unmanned aerial vehicles (UAVs) has become an important new technology to assist farmers with precision agriculture, providing easier crop nutrient management [2], better diagnosis of crop diseases, and control of pests and weeds, at a lower cost compared with satellite remote sensing [3].
Among the tasks of precise orchard management, instance segmentation of fruit tree canopies using UAV-acquired images, also known as identification or information extraction of individual trees, is of critical importance since it provides the basic information for plant breeding evaluation [4], differentiated analysis, and decision-making, as well as information on plantation cover area and location [5].
Deep learning represents a powerful tool for big data processing, especially image data. By training with a large amount of data, deep learning-based models can achieve good prediction results for complex phenomena. Recently, deep learning-based methods have been increasingly used in agricultural and horticultural research [6]. A series of studies have demonstrated that the convolutional neural network (CNN), a representative deep learning model, is effective in spatial pattern recognition, enabling the extraction of vegetation properties from remote sensing imagery [7]. Csillik et al. [8] detected citrus and other crop trees from UAV images using a simple CNN model, followed by a classification refinement using super-pixels derived with a simple linear iterative clustering (SLIC) algorithm [9]. Mubin et al. [10] utilized two different CNNs to detect young and mature oil palms separately and used geographic information systems (GIS) for data processing and result storage.
Compared with simple image classification followed by complex post-processing, such as the functions in GIS software [11] or extra image processing algorithms [8], object detection, an incremental step in the progression from coarse to fine image inference, provides not only the classes of objects but also their locations [12], which means the algorithm can extract both the classes and the locations of trees in a unified way. Zamboni et al. [13] evaluated 21 object detection algorithms, including anchor-based and anchor-free methods, for single tree crown detection. Semantic segmentation, different from object detection, gives fine inference by predicting a class for each pixel of the input image [12]. Morales et al. [14] proposed a semantic segmentation method for the Mauritia flexuosa palm using an end-to-end trainable CNN based on the DeepLab v3+ architecture [15]. Furthermore, instance segmentation, which represents a mixture of object detection and semantic segmentation, gives different labels to separate instances of objects belonging to the same class [12]. The introduction of Mask R-CNN [16] started a new era of instance segmentation based on deep learning, and many new methods have since been proposed, including the YOLACT [17], SOLO [18], and BlendMask [19]. Among them, the YOLACT is considered the first real-time instance segmentation model. Instance segmentation methods have been widely applied to the task of tree [20] or fruit [21] extraction.
Data collected by unmanned aerial systems combined with photogrammetric processing enable the generation of different data products, such as digital orthophoto maps (DOMs), digital surface models (DSMs), digital terrain models (DTMs), digital elevation models (DEMs), and three-dimensional (3D) point clouds [22]. In previous studies, more than one type of data product has been required for the extraction task. For instance, Dong et al. [23] designed digital height models (DHMs) by subtracting the DTM from the DSM, which was the key data product for avoiding confusion between treetop and soil areas. Similarly, Timilsina et al. [24] developed a canopy height model (CHM) by subtracting the DEM from the DSM using a tool in ENVI for the identification of tree coverage.
Previous studies on the identification of individual trees have focused on several species, including citrus [4,5,8,25,26,27,28], apple [23], palm [10,14,29], cranberry [21], and urban trees [13,24]. However, although there are studies on the semantic segmentation of litchi flowers [30] and branches [31], to the best of our knowledge, no study on litchi canopy segmentation based on remote sensing has been reported.
In this paper, instance segmentation of the litchi canopy, i.e., the identification of individual litchi trees, is proposed. The segmentation task is performed using the deep learning-based instance segmentation method YOLACT [17]. The YOLACT method achieves good performance by recognizing the pixels of the tree canopy in the input image and separating instances individually without external algorithm processing, that is, performing inference in a unified way. Unlike the above-mentioned studies, which use plural data products as the input, the proposed method uses only the DOM as the input.
Annotating canopy areas in input images with boxes or polygons is a key step in the data pre-processing for the training of deep learning models for tree identification [13,20,21]. Since the amount of data needed for model training is large [32], it is inefficient to annotate all data manually. As a large number of images in the custom dataset in this paper were collected at the same place at different flight heights and on different dates, a labor-friendly semi-auto annotation method based on the invariance of objects' geographical locations is introduced, which can significantly reduce the time of data pre-processing.
Due to the high demand for computing resources, it is common and necessary to divide the original DOM, whose side length reaches thousands or tens of thousands of pixels, into image patches with side lengths of only a few hundred pixels, which are used as input data when training deep learning models for the identification of individual trees [13,14,20]. However, the corresponding reverse operation, that is, integrating the inference results of image patches into an inference result for the whole DOM, has seldom been considered in previous studies. In this paper, a partition-based method for high-resolution instance segmentation of DOMs is presented, with two main differences compared with previously proposed methods. First, the DOM is split into patches, and the position of each patch is saved separately during data pre-processing. Second, the inference results of the image patches are integrated into a unified result based on the position information stored during data pre-processing, followed by non-maximum suppression (NMS).
Although the data were collected on different dates and at different flight heights, the original litchi images still lack diversity. To solve this problem, a large amount of citrus data was annotated and added to the training set. The comparative experiment results show that the addition of citrus data can improve model performance in litchi tree identification.
In this paper, the average precision (AP) is chosen as an evaluation metric of the proposed model. This metric has been commonly used to examine the performance of models in detection tasks. A series of comparative experiments are performed using different settings of the backbone network, model structure, spectral type, data augmentation method, and training data source. According to the experimental results, when trained with the litchi-citrus datasets, the AP on the test set reaches 96.25%, achieving the best performance among all experiment groups.
The main contributions of this paper can be summarized as follows:
The YOLACT model is used to develop a method for litchi canopy instance segmentation from UAV imagery;
A labor-friendly semi-auto annotation method for data pre-processing is developed;
A partition-based method for high-resolution instance segmentation of DOMs, including the division of input images and integration of inference results, is proposed.
The paper is organized as follows. Section 2 describes the study areas, data collection and processing, the proposed method, and the validation method. Section 3 shows the experimental results obtained using the proposed method. Section 4 is devoted to a discussion, and Section 5 presents the conclusions.
2. Materials and Methods
2.1. Study Areas
The study area of this work is located in Guangdong Province, China. The experiment was conducted in three orchards containing litchi trees and citrus trees. The orchards were denoted as Areas A, B, and C. Area A was located in Conghua District, Guangzhou City (23°35′11.98″ N–113°36′48.49″ E), and contained 141 litchi trees. Area B was located in Tianhe District, Guangzhou City (23°9′40.75″ N–113°21′10.75″ E), and contained 246 litchi trees. Area C was located in Boluo County, Huizhou City (23°29′56.74″ N–114°28′4.11″ E), and contained 324 citrus trees. There were significant differences in lighting conditions and canopy shapes between the three areas. An overview of the study areas is shown in Figure 1.
2.2. UAV Image Collection
Images of the three study areas were obtained using a DJI P4 Multispectral UAV. An example of a UAV image is shown in Figure 2. The UAV was equipped with six 1/2.9″ complementary metal-oxide semiconductor (CMOS) sensors, including one RGB sensor for visible light imaging and five monochrome sensors for multispectral imaging: blue (B): (450 ± 16) nm; green (G): (560 ± 16) nm; red (R): (650 ± 16) nm; red edge (RE): (730 ± 16) nm; near-infrared (NIR): (840 ± 26) nm. The flight heights and flight dates for the three areas are shown in Table 1. Flight planning and mission control were managed by the DJI GO Pro software.
2.3. Photogrammetric and Data Format Processing
The imagery was photogrammetrically processed to generate the RGB DOM using DJI Terra software. The corresponding normalized difference vegetation index (NDVI) image was obtained from the red and near-infrared bands using the formula (NIR − Red)/(NIR + Red). As the input format of the YOLACT network is three-band, additional data processing was performed to allow the single-band NDVI image to be input in the same format as the RGB image. The workflow of this process is shown in Figure 3.
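The details of this conversion are given in Figure 3; the following is a minimal sketch of one plausible realization, assuming the single-band NDVI is linearly rescaled from [−1, 1] to 8-bit and replicated into three identical channels. The file names and the rescaling choice are our assumptions, not details from the paper.

```python
import numpy as np
import cv2

def ndvi_to_three_band(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Compute NDVI = (NIR - Red) / (NIR + Red) and pack it as a 3-band image."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    ndvi = (nir - red) / (nir + red + 1e-8)                # small epsilon avoids /0
    gray = ((ndvi + 1.0) / 2.0 * 255.0).astype(np.uint8)   # [-1, 1] -> [0, 255]
    return cv2.merge([gray, gray, gray])                   # replicate into 3 channels

# Hypothetical band rasters exported from the photogrammetric processing.
nir = cv2.imread("nir_band.tif", cv2.IMREAD_UNCHANGED)
red = cv2.imread("red_band.tif", cv2.IMREAD_UNCHANGED)
cv2.imwrite("ndvi_3band.png", ndvi_to_three_band(nir, red))
```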
2.4. Annotation
A labor-friendly annotation method based on coordinate system conversion is introduced since it is time-consuming to manually annotate the canopy areas of the same litchi tree in images collected on different days. The positioning information of the same place based on different coordinate systems can be converted back and forth through a series of calculations [33]. Suppose a point's positions in the WGS 84 geographic coordinate system and the image coordinate system are denoted as (lon, lat) and (x, y), respectively. The values needed for the conversion between the image coordinate system and the WGS 84 geographic coordinate system, including the longitude and latitude of the image's upper left corner, denoted as lon_0 and lat_0, and the horizontal and vertical spacings of raster pixels, denoted as Δlon and Δlat, were extracted from the DOM using the Pillow library in Python. The coordinate system conversion is given by (1)–(4):

x = (lon − lon_0)/Δlon, (1)
y = (lat_0 − lat)/Δlat, (2)
lon = lon_0 + x·Δlon, (3)
lat = lat_0 − y·Δlat. (4)
Theoretically, the actual geo-coordinates of the trees in the experimental area can be considered fixed. The image coordinates of the canopy annotations in new shots can be easily calculated if the actual geo-coordinates of the trees and the conversion values of the other DOM are known. In practice, the canopy areas of trees in DOMs acquired on different days can be automatically annotated by the above-mentioned method based on a single manual DOM annotation. The principle of the annotation method is shown in Figure 4.
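A minimal sketch of Equations (1)–(4) and of the annotation transfer they enable is given below. The parameter tuple (lon_0, lat_0, Δlon, Δlat) is assumed to have been read from each DOM's metadata, and all function names are illustrative, not from the original implementation.

```python
def pixel_to_geo(x, y, lon0, lat0, dlon, dlat):
    # Equations (3)-(4): image coordinates -> WGS 84 coordinates
    return lon0 + x * dlon, lat0 - y * dlat

def geo_to_pixel(lon, lat, lon0, lat0, dlon, dlat):
    # Equations (1)-(2): WGS 84 coordinates -> image coordinates
    return (lon - lon0) / dlon, (lat0 - lat) / dlat

def transfer_polygon(polygon_px, src_params, dst_params):
    """Re-project a polygon annotated on one DOM onto a DOM acquired on
    another date, exploiting the fixed geo-locations of the trees."""
    transferred = []
    for x, y in polygon_px:
        lon, lat = pixel_to_geo(x, y, *src_params)            # source pixel -> geo
        transferred.append(geo_to_pixel(lon, lat, *dst_params))  # geo -> target pixel
    return transferred
```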
2.5. Crop Sampling and Datasets Construction
Random cropping was performed for sampling. The cropping size was set to 1100 × 1100 pixels. An object was chosen for sampling only if it lay wholly inside the frame. An illustration of the crop sampling process is shown in Figure 5.
In addition, a crop sampling image was not accepted if all objects in the frame had already appeared in a previous sampling image. The NDVI image sampling was performed in parallel with the RGB image sampling. The sample numbers for the three areas are given in Table 2, and the distribution of the original sizes of instances in the samples is shown in Figure 6. Since the largest instance had a side length of almost two times the default input size of the YOLACT, each cropped image was down-sampled at a ratio of 0.5.
After the crop sampling, four datasets were constructed for the experiments. The compositions of the training, validation, and test sets in each dataset are shown in Table 3.
2.6. YOLACT Network
The YOLACT [17] is a simple, fully convolutional model for real-time instance segmentation. The ResNet [34] with a feature pyramid network (FPN) [35] was used as the default feature backbone, while the base image size was set to 550 × 550 pixels. Each layer of the FPN included three anchors with aspect ratios of 1, 0.5, and 2.
The YOLACT divides the segmentation task into two parallel subtasks: generation of the prototype mask set and prediction of per-instance mask coefficients. Instance masks can be produced by linearly combining prototypes with the mask coefficients.
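A minimal sketch of this mask-assembly step is shown below, assuming the proto-net output and the per-instance coefficients are available as NumPy arrays; the shapes follow the 138 × 138 proto-net output described in the next paragraph.

```python
import numpy as np

def assemble_masks(prototypes: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Linearly combine prototype masks with per-instance coefficients.

    prototypes: (138, 138, k) proto-net output; coeffs: (n_instances, k).
    Returns per-instance masks of shape (138, 138, n_instances).
    """
    h, w, k = prototypes.shape
    lin = prototypes.reshape(-1, k) @ coeffs.T   # (h*w, n_instances) linear combination
    masks = 1.0 / (1.0 + np.exp(-lin))           # sigmoid to get soft masks
    return masks.reshape(h, w, -1)
```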
In this paper, several modifications in the YOLACT model are introduced to reduce computational complexity while achieving high-precision instance segmentation.
The output of the proto-net has a size of 138 × 138 pixels, which is smaller than the final output size of the whole model of 550 × 550 pixels. In the original implementation, up-sampling by interpolation is performed to enlarge the per-instance mask. This approach provides a good match between the masks and the margins of detected objects. However, in the canopy segmentation task from remote sensing images, the shape of the tree canopy is generally round, without obvious protruding corners, so the interpolation for mask production brings only a subtle difference in the contours, which is not worth the computational cost. In this paper, polygon contours of the masks are obtained directly from the output of the proto-net using OpenCV functions, and the coordinate values of the contour points are simply multiplied by the zoom ratio for enlargement. This approach reduces computation while still achieving proper segmentation of canopies. The difference between the two workflows is shown in Figure 7.
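A minimal sketch of the modified workflow, using OpenCV's findContours on the low-resolution mask and scaling the contour coordinates by the zoom ratio; the 0.5 binarization threshold is our assumption.

```python
import cv2
import numpy as np

def mask_to_scaled_contours(mask_138: np.ndarray, out_size: int = 550):
    """Extract polygon contours from a 138 x 138 soft mask and scale the
    contour point coordinates instead of up-sampling the mask itself."""
    ratio = out_size / mask_138.shape[0]                     # e.g., 550 / 138
    binary = (mask_138 > 0.5).astype(np.uint8)               # assumed threshold
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Multiply each contour point by the zoom ratio for enlargement.
    return [(c.astype(np.float32) * ratio).astype(np.int32) for c in contours]
```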
As introduced above, three anchors with different aspect ratios are used for each layer of the FPN. Unlike the varied width-to-height ratios of objects in public datasets, such as MS COCO [36], in this work, the circumscribed rectangles of the litchi tree canopies are approximately square in most cases, so the multi-ratio anchors can be replaced by a single anchor for the instance segmentation of the litchi canopy. In this study, experiments were performed using two anchor ratio settings: the original ratio setting and a single-ratio setting with the value of one.
In the default configuration of the YOLACT, the number of prototypes k is set to 32. Considering the reduced variety of litchi canopy shapes, a smaller k with a value of four or eight is used in this study, and the different k values were compared in the experiments.
2.7. Instance Segmentation of High-Resolution Image by Partition
The training and inference of high-resolution images have not been considered in most studies on instance segmentation. In addition, it is not advisable to roughly down-sample large images to match their sizes with the input size of the model [37] since such an approach can cause a great loss of detail, which is important for the detection and segmentation processes. Furthermore, object shapes can be distorted during down-sampling if the aspect ratios of the input images and the model input differ. Both of these situations can significantly degrade the precision of inference.
Similar to the YOLT method proposed in [38], a partition-based method for high-resolution instance segmentation of DOMs is presented in this work. The DOM is divided into patches, and the position of each patch is saved during data pre-processing. The inference results of the image patches are integrated into a unified result based on the position information stored during data pre-processing, followed by the NMS.

Let W and H denote the width and height of a DOM, and let n_d denote the lower bound of the number of samplings obtained by sliding a window of size s along direction d ∈ {w, h}:

n_w = ⌈W/s⌉, n_h = ⌈H/s⌉, (5)

where l_d is the window sliding distance and o_d is the overlap length in direction d, respectively calculated by:

l_w = (W − s)/(n_w − 1), l_h = (H − s)/(n_h − 1), (6)
o_d = s − l_d. (7)

In practice, s is set to be equal to the input size of the YOLACT, and n_d is multiplied by the gain ratio g (g ≥ 1) to enlarge n_d, which can be expressed as:

n_d ← ⌈g·n_d⌉. (8)
Once the partition is completed, the n_w × n_h image patches obtained from the original DOM are subjected to instance segmentation sequentially. This approach can infer high-resolution DOMs while avoiding the shortcomings of rough down-sampling mentioned above. The partition and integration workflow is shown in Figure 8.
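A minimal sketch of the partition step under the notation of Equations (5)–(8): deterministic window origins are computed so that the enlarged number of patches of size s covers each direction, and the saved origins later shift patch-level predictions back into DOM coordinates before the NMS. The rounding details are our assumptions.

```python
import math

def window_origins(length: int, s: int, g: float = 1.0):
    """Evenly spaced window origins along one direction (Equations (5)-(8))."""
    if length <= s:
        return [0]
    n = max(math.ceil(g * math.ceil(length / s)), 1)  # enlarged window count
    if n == 1:
        return [0]
    step = (length - s) / (n - 1)                     # sliding distance l_d
    return [round(i * step) for i in range(n)]

def partition(W: int, H: int, s: int = 550, g: float = 1.0):
    """Return the (x, y) upper-left corner of every patch; these positions
    are stored so that patch-level predictions can be shifted back into
    DOM coordinates and merged with NMS afterwards."""
    return [(x, y) for y in window_origins(H, s, g)
                   for x in window_origins(W, s, g)]
```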
In this paper, the partition-based method was applied only to inference, while the image patches for training were generated by random cropping. Unlike the randomly generated sliding window positions used for sampling in Section 2.5, in the partitioning described in this section, the sliding window position for cropping each image patch is determined by the size of the original image, the sliding window size s, and the gain ratio g, without any randomness.
2.8. Training Details
The original YOLACT model was trained on the COCO dataset using the stochastic gradient descent (SGD) algorithm for 800,000 iterations, starting at an initial learning rate of 1 × 10^−3, which was decreased by a factor of 10 after 280,000, 600,000, 700,000, and 750,000 iterations; the weight decay was 5 × 10^−4, and the momentum was set to 0.9.
The learning rate decay strategy was applied to the training process with two modifications. First, the iteration nodes for learning rate changes were multiplied by a ratio. Suppose the numbers of samples in the custom training set and the MS COCO dataset are denoted as N_t and N_c, respectively; then, the ratio was set to N_t/N_c. Second, the training was stopped when half of the maximum number of iterations was reached since, after that, the accuracy could not be further improved.
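A minimal sketch of the rescaled schedule is given below; the MS COCO training-set size used as the default N_c is our assumption, and the variable names are illustrative.

```python
# Original YOLACT schedule: decay at 280k/600k/700k/750k of 800k iterations.
COCO_STEPS = [280_000, 600_000, 700_000, 750_000]
COCO_MAX_ITER = 800_000

def scaled_schedule(n_train: int, n_coco: int = 118_287):
    """Scale the decay steps by r = N_t / N_c and stop at half the scaled
    maximum number of iterations (n_coco defaults to the assumed size of
    the MS COCO train2017 set)."""
    r = n_train / n_coco
    steps = [int(s * r) for s in COCO_STEPS]
    max_iter = int(COCO_MAX_ITER * r) // 2   # training stops at half of max
    return steps, max_iter
```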
The ResNet [34] is the default backbone of the YOLACT. The same backbone settings were applied to the experiments in Section 3. The models in the original implementation [17] were trained on the MS COCO dataset, while the models in this paper were trained on the custom datasets described in Section 2.5.
All models were trained with a batch size of eight on a single NVIDIA Titan X using ImageNet [39] pre-trained weights, the same as in the original implementation.
2.9. Model Validation
The Intersection over Union (IoU) used in the validation was defined as the quotient of the overlapping area and the union area between the prediction and the ground truth. The Box IoU and Mask IoU denote the IoU of the objects' circumscribed rectangle areas and the IoU of the objects' own areas, respectively. The predictions were classified into three groups: (1) True Positive (TP), which represented predictions with an IoU larger than the threshold; (2) False Positive (FP), which represented predictions with an IoU below the threshold; (3) False Negative (FN), which indicated that a ground-truth area was not detected by any prediction.
Further, the precision and recall were respectively calculated by:

Precision = TP/(TP + FP), (9)
Recall = TP/(TP + FN). (10)
The average precision (AP), which corresponds to the area under the precision-recall curve, was used to validate the performance of the models. The Box AP and Mask AP were calculated based on the Box IoU and Mask IoU, respectively. In this paper, the IoU threshold was set to 0.5, and the AP based on this threshold is denoted as AP50.
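A minimal sketch of how AP50 can be computed from confidence-sorted predictions; the greedy matching that produces each prediction's best IoU with an unmatched ground truth is assumed to have been done already, and the integration uses a simple Δrecall-weighted sum rather than an interpolated variant.

```python
import numpy as np

def average_precision(ious, confidences, n_gt, thr=0.5):
    """ious[i]: best IoU of prediction i with an unmatched ground truth;
    confidences[i]: its score; n_gt: total number of ground-truth objects."""
    order = np.argsort(-np.asarray(confidences))          # sort by confidence, descending
    tp = (np.asarray(ious, dtype=np.float64)[order] >= thr).astype(np.float64)
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (cum_tp + np.cumsum(fp))         # precision at each rank
    recall = cum_tp / n_gt                                # recall at each rank
    # Area under the precision-recall curve: sum of precision * delta(recall).
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))
```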