Article

Deep Learning-Based Segmentation of Intertwined Fruit Trees for Agricultural Tasks

Department of Computer Science and Artificial Intelligence/CAIIT, Jeonbuk National University, Jeonju 54896, Republic of Korea
* Author to whom correspondence should be addressed.
Agriculture 2023, 13(11), 2097; https://doi.org/10.3390/agriculture13112097
Submission received: 24 September 2023 / Revised: 25 October 2023 / Accepted: 2 November 2023 / Published: 4 November 2023
(This article belongs to the Special Issue Applications of Data Analysis in Agriculture)

Abstract

Fruit trees in orchards are typically placed at equal distances in rows; therefore, their branches are intertwined. The precise segmentation of a target tree in this situation is very important for many agricultural tasks, such as yield estimation, phenotyping, spraying, and pruning. However, our survey on tree segmentation revealed that no study has explicitly addressed this intertwining situation. This paper presents a novel dataset in which a precise tree region is labeled carefully by a human annotator by delineating the branches and trunk of a target apple tree. Because traditional rule-based image segmentation methods neglect semantic considerations, we employed cutting-edge deep learning models. Five recently pre-trained deep learning models for segmentation were modified to suit tree segmentation and were fine-tuned using our dataset. The experimental results show that YOLOv8 produces the best average precision (AP), 93.7 box AP@0.5:0.95 and 84.2 mask AP@0.5:0.95. We believe that our model can be successfully applied to various agricultural tasks.

1. Introduction

Trees in forests have a substantial influence on human life by improving human health, reducing air pollution, and providing green spaces [1]. Fruit trees in orchards are important for human nutrition. To maintain tree health and productivity, various tasks must be performed regularly. For forest trees, typical tasks include habitat identification, population estimation, health monitoring, growth monitoring, and logging planning. In orchards, finer observations of individual trees are required for tasks such as fruit detection, yield estimation, spraying, phenotyping, and pruning.
An important preliminary problem for successfully accomplishing these tasks is tree segmentation [2]. Depending on the final goal of the task, different problems must be formulated with respect to image acquisition and segmentation algorithms. The cost of image acquisition ranges from low to high: on the cheap side, a farmer acquires RGB images using a smartphone held in front of the trees; the most costly method is acquiring satellite images over a large field. Unmanned aerial vehicles (UAVs), such as drones, are most commonly used because of their high performance and reasonable cost. They can carry a diverse range of sensors, including LiDAR, thermal, ultrasound, multispectral, hyperspectral, RGB-D, and RGB sensors. Segmentation algorithms range from traditional computer vision algorithms, such as thresholding and watersheds, to modern deep-learning-based models. Deep learning models include convolutional neural network (CNN)-based models, such as you only look once (YOLO) and regions with CNN features (R-CNN), and transformer-based models, such as Swin and the detection transformer (DETR) [3]. The correct choice of data acquisition and segmentation algorithm is critical to successfully accomplishing the specific goal of a given task.
Our survey of tree segmentation research revealed that most studies use top-view images captured using UAVs; Figure 1a shows an example of this situation. Despite the importance of managing fruit trees in orchards, only a few papers have focused on fruit tree segmentation, and these have assumed that the trees in the images are isolated; that is, they do not allow intertwining with neighboring trees. Most of them use RGB-D cameras, which are more expensive than RGB cameras [4,5].
These observations motivated this study. The aim of this study was to use a low-cost RGB camera and to enable tree segmentation when the target tree is intertwined with neighboring trees and/or facilities such as poles or trellises. A typical application scenario is estimating the yield of an apple tree by taking an RGB image in front of the tree with a smartphone and counting the number of apples on that tree. As apple trees are planted in rows, the branches of neighboring trees are likely to be intertwined. Figure 1d shows the importance of our study by comparing it with the conventional settings shown in Figure 1b,c. The system must precisely segment the region of the target tree to detect and count the apples on that tree; this application is further discussed in Section 4.3. Precise segmentation is crucial for excluding the ground and adjacent trees. The application of segmentation results is not limited to yield estimation but can be extended to many tasks, such as automating fruit harvesting, spraying, phenotyping, pruning, and health and growth monitoring.
To precisely segment the tree regions of intertwined trees in an RGB image, we employed high-performance pre-trained deep learning models. The contributions of this study are as follows:
-
We built a tree-segmentation dataset of apple trees in which the target tree was intertwined with neighboring trees. A human annotator accurately delineated the outlines of the branches and the trunk using the LabelMe tool. To the best of our knowledge, this dataset is the first to consider intertwined trees. The URL for downloading our dataset is at the end of the main text.
-
The CNN and transformer models were trained to segment the target tree using this dataset. The models used were the mask R-CNN with different backbones, YOLACT, and YOLOv8. A comparative study showed that YOLOv8 was the best, with a large margin in terms of average precision (AP). A qualitative analysis also showed the superiority of YOLOv8.
Section 2 reviews research on conventional tree segmentation that focuses on the agricultural domain. Section 3.1 presents our dataset by describing how to obtain tree images and annotate the tree regions. Section 3.2 briefly describes the five pre-trained deep learning models and how to fine-tune them using our dataset. Section 4 presents the experimental results and compares performances quantitatively and qualitatively. Finally, Section 5 concludes the paper.

2. Related Works on Tree Segmentation

In Section 2.1, related works are divided into the agricultural and digital forestry domains, and the papers are summarized in tabular form with respect to various aspects. Section 2.2 provides a detailed description of the related studies and public datasets in the agricultural domain that are closest to our approach. Section 2.3 discusses the characteristics of the related studies.

2.1. Summary of Related Studies on Agricultural and Digital Forestry Domains

Table 1 presents an overview of the literature on tree segmentation published since 2010. For earlier papers, we refer readers to survey papers [2,9]. Our review grouped the papers into two domains: agriculture and digital forestry. Most applications in the agricultural domain require finer segmentation of tree regions to manage individual trees. Typical tasks related to precision farming include fruit harvesting, yield estimation, spraying, health and growth monitoring, phenotyping, and pruning. Digital forestry requires a comparatively coarse segmentation of trees covering a large area, such as 1 km² of forest. Typical tasks include species and habitat identification, population estimation, spraying, health and growth monitoring, and logging planning. In agriculture, farmers or field robots take pictures in front of trees (1–5 m distance), whereas in digital forestry, UAVs carrying various image sensors take top-view pictures at 10 m to 1 km distances.
We are particularly interested in image types and whether segmentation algorithms allow the intertwining of neighboring trees. In digital forestry, the primary purposes are population estimation and spraying far from the crowns; tree intertwining is generally not a concern. In contrast, many agricultural tasks, such as phenotyping, yield estimation, and pruning, require closer views of trees, and precise delineation of intertwined trees is important. The row labeled 'Ours' describes our approach, wherein images were obtained using smartphones, the cheapest image acquisition method, and the segmentation algorithm allows severely intertwined trees. Our approach is multipurpose, enabling spraying, phenotyping, monitoring, harvesting, and pruning, because it segments the entire tree region while accounting for branch intertwining.

2.2. Detailed Description of Agricultural Domain

We divided the papers in this group into two subgroups: those using RGB-D images and those using RGB images. RGB-D images provide an additional depth map; therefore, the target tree can be segmented more accurately by excluding neighboring trees and facilities. However, RGB-D sensors are more expensive, and the segmentation algorithm becomes more complex. In addition, when trees are intertwined, depth is of little use because adjacent trees have similar depth values. RGB images are the cheapest to acquire; farmers can capture them with nothing more than a smartphone, which makes RGB the most popular choice. Because many pre-trained deep learning models are publicly available for RGB images, the approach using RGB images is also the most effective.
Papers adopting RGB-D images are reviewed first. Xiao used a Kinect RGB-D sensor attached to a spraying robot [10]. This method applied a threshold to the green channel and excluded distant trees based on a depth map; thresholding is limited in that it cannot exclude neighboring trees with similar depths. Milella proposed a system that used RGB and depth to segment individual grapevine trees, and RGB to segment a tree into leaf, wood, bunch, pole, and background [11]. It relied on both rule-based and deep learning methods, which resulted in a complex procedure. Dong presented a system that reconstructed a 3D scene of a tree row in an apple orchard using RGB-D images [12]; it did not attempt to separate neighboring trees. Gao et al. proposed a segmentation algorithm that applied a threshold to the green channel, applied k-means to the depth map, and fused the results for spraying-robot path planning [13]; the thresholding could not exclude neighboring trees of similar depths. Häni removed the ground plane and neighboring trees using depth information for apple yield estimation [5]. This method first constructed a 3D scene model from multiple depth maps and used a mask to exclude the ground plane and far trees so that the apples could be counted more accurately. However, the 3D reconstruction process was complex, and the mask could not exclude facilities and neighboring trees with similar depths. Chen proposed a similar approach for branch segmentation of apple trees [4]; the method segmented only the branches and not the entire tree. Lin developed a system that segmented guava fruits and branches using a mask R-CNN for the path planning of a harvesting robot [14]; this method segmented only the branches and fruits, and not the entire tree. Cong presented a tree segmentation method for the precise spraying of citrus trees [8]. They used simple thresholding of the depth map to exclude other trees and, to further segment the target tree, proposed a modified U-net employing a squeeze-and-excite attention layer; the thresholding could not exclude neighboring trees of similar depths. Seol described a deep learning segmentation model for spraying robots [15]. The model first segmented the tree regions using SegNet and then performed postprocessing by thresholding the depth map; again, the thresholding could not exclude neighboring trees of similar depths.
Now we review studies that use RGB images. Zhang proposed a deep learning method to segment trees, focusing on occluded branch segmentation for pruning and apple thinning [16]; it did not attempt to separate neighboring trees. Asaei proposed a technique that thresholds the green channel [17], where the information of the segmented region was used to control the opening of the nozzle of a spraying robot. This technique is severely limited because it can only be used for trees that satisfy the threshold value ranges. Majeed proposed a deep learning segmentation of apple trees into trunks, branches, and trellis wires [18]. The study compared the accuracies of various models using RGB and RGB-D and concluded that the RGB-D model was better; its application is limited in that it relied on depth thresholding to remove background objects. Song proposed a system that segmented kiwifruits and branches using mask R-CNN [19]; tree segmentation was not performed. Lin proposed deep learning models to segment branches and fruits for the path planning of a harvesting robot [20]. The authors modified MobileNetV3 with channel and spatial attention. The model processed only close views of trees and not the entire tree. Cao proposed a method for segmenting trees using YOLOv7 equipped with a squeeze-and-excite attention layer [7]; however, it assumed that the image contained an isolated tree.
The row of Table 1 labeled 'Ours' describes our approach, which is multipurpose because it segments the entire tree while considering intertwining.
Chen proposed a support vector machine (SVM) segmentation algorithm for top-view citrus tree images captured using a UAV [21]. Previous studies [22,23] also processed top-view images captured by UAVs. Gibril proposed a transformer-based segmentation model for top-view palm tree images [24] and concluded that transformer models are superior to CNN models. However, approaches that process top-view images are limited to tasks such as spraying and population estimation.
Few public datasets are available for tree segmentation in the agricultural domain. Yang released an aerial point-cloud dataset for tree detection [46]. The dataset contains top-view RGB images and is available through a paid subscription. In contrast, our dataset provides frontal-view apple tree images in which the target tree is intertwined with neighboring trees, and it is freely available on GitHub. A detailed description of the dataset is provided in Section 3.

2.3. Discussion

This review led to several important conclusions, listed below. They motivated our research and directed our approach: using an RGB camera, allowing tree intertwining, and making our dataset publicly available.
-
Research on digital forestry is more plentiful than that in the agricultural domain. In the agricultural domain, the number of cases using RGB-D was higher than that using RGB. Considering several factors such as the increasing importance of agricultural tasks, high-quality smartphone cameras, and high-performance deep learning models pre-trained with RGB images, more active research is required in the agricultural domain using RGB cameras.
-
Many studies have used traditional rule-based segmentation algorithms rather than modern deep learning models. Tree segmentation research needs to keep pace with the rapidly evolving deep-learning-based segmentation models.
-
Only a few public datasets are available. More datasets should be constructed and publicly released to stimulate research and enable objective performance comparisons.
-
No research has been conducted on the segmentation of intertwined trees. Because many orchards have intertwined fruit trees, research on this situation should be actively conducted.

3. Materials and Methods

Figure 2 illustrates the overall procedure of our study. It consists of three stages: dataset construction, model training, and testing for performance evaluation. Each stage is explained in the following subsections.

3.1. Dataset Construction

To train a deep learning model, the quality of the training dataset is crucial. This section describes our approach for collecting tree images and annotating the target tree regions.

3.1.1. Collection of Apple Tree Images

The original images used in our dataset were collected from an orchard of Hongro apple trees cultivated at the National Institute of Horticultural and Herbal Science in Jeonju, South Korea. Thirty Hongro apple trees were photographed during the harvest season, from August to September 2021. Each tree was imaged once a week for five consecutive weeks during daylight using an iPhone 6 with a resolution of 2448 × 3264 pixels. As shown in Figure 3, the target tree was placed at the center of the image. A total of 150 images of 30 different trees were included in the dataset.

3.1.2. Labeling of Tree Regions

We used the LabelMe tool to label the tree regions. The labeling results were stored in the COCO data format. Only one tree region occupying the image center was labeled, as shown in Figure 4. The tree region included the crown and trunk. The ground, sky, facilities such as poles and trellises, and neighboring trees were excluded. We labeled 150 tree images in total. Because a tree has a variety of shapes and may be heavily intertwined with facilities and neighboring trees, labeling jobs are labor intensive and costly. Labeling one image required approximately 30 min.
We established several guidelines for consistent and effective labeling, illustrated in Figure 5, Figure 6 and Figure 7. As shown in Figure 5, the empty spaces between leaves were included in the region. This significantly reduced the labeling cost without sacrificing performance; for example, this labeling policy does not cause any errors in yield estimation. For tasks such as monitoring leaf health, successive finer segmentation of the leaves within the tree region is required, which is a separate segmentation problem. When a part of the tree was hidden by a pole or trellis, as shown in Figure 6, the hidden part was included in the tree region. This condition is important for preventing the tree from being split into two separate regions. Parts of the pole enclosed by the tree were also included in the region, which matters for tasks such as tree volume estimation. Figure 7 illustrates a case in which the branches of neighboring trees are heavily intertwined. Labeling this case precisely required careful observation of the tree morphology. When it was unclear which tree a branch belongs to, the branch was excluded from the tree region.
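To show how the released annotations can be consumed, the following minimal sketch loads a COCO-format annotation file with pycocotools and rasterizes the labeled tree polygon into a binary mask; the file name tree_coco.json is a placeholder rather than the actual name in our release.

# Sketch: read one labeled tree region from a COCO-format annotation file
# ("tree_coco.json" is a hypothetical file name) and rasterize it into a mask.
from pycocotools.coco import COCO

coco = COCO("tree_coco.json")            # annotations exported from LabelMe
img_id = coco.getImgIds()[0]             # pick the first labeled image
img_info = coco.loadImgs(img_id)[0]      # file_name, width, height
ann_ids = coco.getAnnIds(imgIds=img_id)
anns = coco.loadAnns(ann_ids)            # one 'tree' polygon per image in our dataset
tree_mask = coco.annToMask(anns[0])      # (H, W) array of 0/1 covering the tree region
print(img_info["file_name"], tree_mask.shape, int(tree_mask.sum()), "tree pixels")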

3.2. Deep Learning Models for Tree Segmentation

Many excellent pre-trained segmentation models are publicly available [3]. The two-stage R-CNN and one-stage YOLO series are the most popular approaches. Recently, transformer models have been developed for image segmentation, and several pre-trained models are publicly available. We therefore adopted pre-trained models and performed transfer learning on our tree-segmentation dataset through fine-tuning. Our approach was twofold: selecting pre-trained models suitable for our problem and fine-tuning them using our dataset.

3.2.1. Selection of Pre-Trained Models

Table 2 presents the pre-trained models used in our study. We chose widely used state-of-the-art models for natural image segmentation whose pre-trained weight files are available. In the column labeled pre-trained models, the segmentation model is a network with a complete pipeline from accepting the original input image to outputting the final segmentation map, and the backbone is the part of the network that extracts a feature map. For the R-CNN series, the mask R-CNN was chosen [47], with three different backbone networks; the Swin-T and Swin-S backbones are transformer-based neural networks [48]. For the YOLO series, the recent YOLACT and YOLOv8 models were selected. Brief descriptions of each component are provided below.
Mask R-CNN: The R-CNN was originally developed as an object detection model and has evolved into fast R-CNN and faster R-CNN [49]. As shown in Figure 8, removing the mask branch (orange arrow) leaves faster R-CNN: the class and box branches output the bounding box together with the class information and a confidence score. Adding the mask branch transforms the faster R-CNN detection model into the mask R-CNN segmentation model [47]. The complete pipeline of the mask R-CNN consists of a backbone, a region proposal stage, and a head. The backbone extracts a rich feature map with the potential to produce the correct segmentation. The region proposal stage generates several candidate patches that are likely to contain meaningful objects. The head evaluates each patch, estimating the confidence that an object is present and computing the object's location and mask.
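As an illustration of how a pre-trained mask R-CNN can be re-headed for our single 'tree' class, the sketch below follows the standard torchvision fine-tuning recipe; it is an assumed setup for illustration, not our exact training pipeline (which also used Swin backbones).

# Sketch (torchvision-style, assumed setup): re-head a pre-trained Mask R-CNN
# for one foreground class ('tree') plus background.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + tree
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box head, which outputs class scores and box offsets per proposal.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask head, which outputs a per-pixel mask for each detected object.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)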
ResNet50 and ResNet101: ResNet is the same as an ordinary CNN except for the skip connection shown in Figure 9 [50]. The skip connection adds the input feature map x to the processed feature map f(x) and passes the result forward. By combining both feature maps, the resulting feature map becomes richer. ResNet50 and ResNet101 contain 50 and 101 layers, respectively.
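A minimal PyTorch sketch of the skip connection in Figure 9: the block outputs ReLU(f(x) + x), so the convolutional path only has to learn the residual.

# Sketch of a basic residual block: output = ReLU(f(x) + x).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                  # processed path f(x)
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)          # skip connection adds the input back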
Swin-T and Swin-S: The transformer is an innovative model that extracts profound feature maps by measuring the self-attention among words in an input sentence [51]. This has revolutionized natural language processing. The vision community modified the transformer to be suitable for processing images by considering the grid patches as words and succeeded in obtaining superior performance compared to a CNN. The Swin transformer was developed to serve as a backbone for various tasks, such as classification, detection, and segmentation [48]. It employs hierarchical multiscale feature maps, starting from small patches and gradually merging with neighboring patches in deeper layers. The algorithm employs the shifted-window concept, allowing for different window partitions.
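To make the shifted-window idea concrete, the toy sketch below partitions a feature map into non-overlapping windows and applies the cyclic shift used between consecutive Swin blocks; the tensor sizes are illustrative, and the window attention computation itself is omitted.

# Sketch: non-overlapping window partition and the cyclic shift used by Swin.
import torch

def window_partition(x, win):
    # x: (B, H, W, C) feature map -> (num_windows * B, win, win, C)
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win, win, C)

feat = torch.randn(1, 56, 56, 96)                          # example early-stage feature map
windows = window_partition(feat, win=7)                    # attention runs inside each window
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))   # shift by win // 2 before the next block
shifted_windows = window_partition(shifted, win=7)
print(windows.shape, shifted_windows.shape)                # torch.Size([64, 7, 7, 96]) each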
YOLOv8: In contrast to the slower two-stage R-CNN series models, which generate many region proposals and evaluate each one in turn, YOLO-series models perform a one-stage process that directly regresses the object locations from each cell of the grid-partitioned image [52]. Owing to its simpler architecture, YOLO achieves real-time processing at some cost in accuracy. The original YOLO, YOLOv1, has been continually improved by the authors and other research groups. The latest version, YOLOv8, expands the detection-specific model into a versatile model that supports detection, segmentation, pose estimation, and tracking [53].
YOLACT: YOLACT resembles YOLO in that it uses one-stage processing and guarantees real-time operation. However, it takes a completely different approach [54]: two processing modules generate the prototypes and compute the mask coefficients, respectively, and they run in parallel, resulting in high speed. The prototypes are feature maps of the same size as the input image, each of which tends to represent the appearance of certain objects. A linear combination of the prototypes weighted by the mask coefficients yields the final segmentation map.
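The mask assembly step can be written in a few lines; the sketch below uses toy tensor shapes (assumed, not YOLACT's exact sizes) to show the linear combination of prototypes with mask coefficients followed by a sigmoid.

# Sketch of YOLACT-style mask assembly: masks = sigmoid(prototypes x coefficients).
import torch

k, H, W, n = 32, 138, 138, 5             # k prototypes, n detected instances (toy sizes)
prototypes = torch.randn(H, W, k)        # image-sized prototype feature maps
coeffs = torch.randn(n, k)               # one k-dimensional coefficient vector per instance

masks = torch.sigmoid(prototypes @ coeffs.t())   # (H, W, n): one soft mask per instance
binary_masks = masks > 0.5                       # threshold to obtain the final masks
print(binary_masks.shape)                        # torch.Size([138, 138, 5])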

3.2.2. Fine-Tuning Using Our Dataset

Our tree segmentation problem is easier than general image segmentation in that it deals with only one class, 'tree', and a single region is sufficient. It is more difficult in that the object boundaries are more ambiguous and variable owing to the intertwining with neighboring trees.
Our learning strategy was to transfer the pre-trained segmentation models described in Table 2 to our dataset through fine-tuning. The decoding head of each pre-trained model was modified to output a single class. Because our dataset is small, data augmentation is important. During training, various random geometric and photometric transformations were applied to augment the data: a random horizontal flip with a 50% chance and a brightness change of up to ±10%. The 30 trees were randomly split into 24 and 6 trees; the 120 images acquired from the 24 trees were used as the training set, and the 30 images from the 6 trees were used as the test set. The hyperparameters for fine-tuning are listed in Table 2.
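For the YOLOv8 entry, the fine-tuning loop can be expressed with the ultralytics package roughly as follows; the dataset YAML name, image size, and augmentation arguments (fliplr for the 50% horizontal flip, hsv_v standing in for the ±10% brightness change) are assumptions for illustration, not our exact configuration.

# Sketch (assumed configuration): fine-tune a pre-trained YOLOv8 segmentation
# model on the tree dataset described by "apple_tree.yaml" (hypothetical name).
from ultralytics import YOLO

model = YOLO("yolov8x-seg.pt")          # pre-trained segmentation weights
model.train(
    data="apple_tree.yaml",             # single class: 'tree'
    epochs=300,                         # matches the learning curves in Figure 10
    batch=4,                            # batch size from Table 2
    lr0=0.01,                           # initial learning rate from Table 2
    imgsz=640,                          # assumed input resolution
    fliplr=0.5,                         # random horizontal flip with 50% chance
    hsv_v=0.1,                          # brightness jitter of roughly +/-10%
)
metrics = model.val()                   # reports box and mask AP on the held-out split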

4. Results

PyTorch 1.12.0 and CUDA 11.3.1, under Ubuntu 20.04.4 LTS, were used to implement the segmentation models listed in Table 2. A GeForce RTX 2080 Ti graphics processing unit was used.
The de facto standard segmentation metric, average precision (AP), was used to evaluate and compare the model accuracies [55]. The AP was calculated by first collecting precision–recall value pairs while fixing the intersection over union (IoU) threshold and varying the confidence threshold. The area under the resulting curve, where the x- and y-axes represent recall and precision, respectively, is the AP. The AP computed at the box level is the box AP, and the AP computed at the pixel level is the mask AP, as shown in Table 3. When there are multiple classes, the mean AP (mAP) over all classes should be used; because we have only one class, 'tree', AP is sufficient.
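As a minimal numeric sketch of this definition, the code below integrates a toy precision–recall curve (the values are made up) with the monotone interpolation commonly used for COCO-style AP.

# Sketch: AP as the area under a precision-recall curve (toy values).
import numpy as np

recall = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])        # hypothetical PR pairs at one IoU
precision = np.array([1.0, 0.95, 0.9, 0.85, 0.7, 0.5])

# Make precision monotonically non-increasing, then integrate over recall.
precision = np.maximum.accumulate(precision[::-1])[::-1]
ap = np.sum(np.diff(recall) * precision[1:])
print(f"AP at this IoU threshold: {ap:.3f}")
# AP@0.5:0.95 averages such values over IoU thresholds 0.50, 0.55, ..., 0.95.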

4.1. AP Analysis

Table 3 compares the five models in terms of AP. The analysis used the de facto standard metrics AP@0.5 and AP@0.75, which were measured by fixing the IoU threshold at 0.5 and 0.75, respectively. The AP@0.5:0.95 metric was obtained by averaging the APs as the IoU threshold varied from 0.5 to 0.95 in steps of 0.05. In box AP, the IoU was calculated box-by-box; in mask AP, the IoU was calculated region-by-region at the pixel level.
Because our segmentation problem is to identify one target tree region that covers roughly half of the image, AP@0.5 has little meaning, as it accepts any overlap above 0.5 as correct: the box AP@0.5 and mask AP@0.5 are 100.0, or close to it, for all models. Increasing the threshold to 0.75 yields a meaningful comparison. In terms of mask AP, which measures the overlap pixel-by-pixel, YOLOv8 produced a very high value of 99.5 mask AP@0.75, 4.1 higher than the second best, mask R-CNN with Swin-S. In terms of mask AP@0.5:0.95, YOLOv8 is 10.8 higher than the second best.
Figure 10a,b shows the learning curves of box AP and mask AP over 300 epochs. These curves also demonstrate the superiority of YOLOv8. Figure 10c,d illustrates the focal loss curves for the training and validation sets, respectively; they show that learning converged smoothly. Figure 11 presents the precision–recall and F1-score curves, which enable a more objective and visual assessment of the YOLOv8 model.

4.2. Qualitative Analysis

Figure 12 illustrates four sample tree images segmented using the five models listed in Table 2. All models successfully captured the overall shape of the target tree. However, careful observation of tree branches and trunks revealed different performances among the models. The delineation of the intertwined branches from other trees and facilities also differed. The following figures show certain parts of the tree that make these differences more evident.
The windows in Figure 13 show the branches. For the upper-left window, YOLACT and YOLOv8 segmented the branches well, whereas mask R-CNN missed some parts. The lower-right window contains a faint branch that an inattentive observer may miss. YOLOv8 successfully segmented the faint branch, whereas the other models failed to find it. For example, a pruning robot employing YOLOv8 would be able to cut this branch, whereas a robot employing the other models would miss it.
Figure 14 shows a comparison of trunk detections. YOLOv8 successfully segmented the trunk, whereas the other models failed to find it. Successful segmentation of trunks is important for several tasks such as accurate phenotyping and growth monitoring.
Figure 15 shows a case in which the branches of neighboring trees are heavily intertwined. YOLOv8 produced results closest to the ground truth. Accurate segmentation is important for several tasks such as phenotyping, pruning, and yield estimation.

4.3. Discussion

The technical novelty of this paper lies in extending the tree segmentation problem to the natural condition in which the branches of neighboring trees are intertwined. The new segmentation problem is challenging because branches grow randomly, resulting in very complex and often unclear boundaries of a tree region. To make the problem concrete, this paper constructed and publicly released a dataset.
The primary contribution of this paper is to show that state-of-the-art deep learning models can be successfully applied to the segmentation of fruit tree images captured under the natural conditions of orchards with low-cost RGB cameras. The quantitative and qualitative evaluation of the most recent deep learning models showed the potential for automating various agricultural tasks. In particular, YOLOv8 produced the best performance: 93.7 box AP@0.5:0.95 and 84.2 mask AP@0.5:0.95.
The most important future work is to apply the segmentation results to actual agricultural tasks. Count-based yield estimation for individual apple trees is a good application. Because the tree region corresponds exactly to the region in which the fruits should be counted, the tree segmentation model can be used as a preprocessing stage for yield estimation. However, since some apples may be hidden in an image viewed from one angle, further processing is required. A simple method is to extrapolate the total count by multiplying by a scaling factor. A more elaborate way is to use a video captured with a camera moving at a near-constant speed in front of the target tree; a fruit tracking algorithm is applied, and the number of tracks is taken as the number of fruits on the target tree. Since a fruit hidden in one frame may appear in other frames, the tracking-based approach can lead to a more precise yield estimate. In this video processing, the tree segmentation algorithm proposed in this paper is essential because the fruits to be tracked should be confined to the target tree region.
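A schematic sketch of this counting step is shown below; segment_tree and detect_apples are hypothetical helpers standing in for the segmentation model and a fruit detector, and only apples whose centers fall inside the predicted tree mask are counted for the target tree.

# Schematic sketch with hypothetical helpers segment_tree() and detect_apples():
# count only the apples whose box centers lie inside the target tree mask.
def count_apples_on_target_tree(image, segment_tree, detect_apples):
    tree_mask = segment_tree(image)            # (H, W) boolean mask of the target tree
    boxes = detect_apples(image)               # list of (x1, y1, x2, y2) apple boxes
    count = 0
    for x1, y1, x2, y2 in boxes:
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        if tree_mask[cy, cx]:                  # keep apples on the segmented tree only
            count += 1
    return count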
It is well known that deep learning has developed rapidly and innovatively thanks to high-quality public datasets. Public datasets allow the objective evaluation of a variety of deep learning models and motivate competition among research teams worldwide. Typical datasets in computer vision are ImageNet and COCO, which serve as de facto standards in the classification, detection, and segmentation of natural images. The medical imaging community also has excellent datasets, such as ADNI and CheXpert, which have been actively used in brain MR and chest X-ray processing. Compared with these, datasets in the agricultural field are scarce. For a recent survey of public agricultural datasets, we refer the readers to [56]. To accelerate research on automating agricultural tasks using artificial intelligence, constructing large, high-quality datasets and making them publicly available are the most important steps.
In this regard, our dataset matters. The tree images were acquired in an actual apple orchard under the natural condition in which neighboring trees are intertwined. This dataset is expected to accelerate tree segmentation research, and the accuracies in Table 3 and the qualitative evaluation in Section 4.2 can serve as a baseline performance.
Our dataset has several limitations. It contains only 150 labeled images; since labeling one image required about 30 min, the total labeling cost was 75 man-hours. Another limitation is the lack of diversity: the images were captured in a single apple orchard. Extending the dataset requires diversification in terms of fruit type, tree age, season, weather, and tree training. For example, trees in orchards are commonly trained on trellis wires. Considering the five factors above and assuming four different values per factor, the number of images expands by 1024 times, and the labeling cost becomes 76,800 man-hours, or 8.76 man-years. Because labeling such a large dataset is impractical, another approach, such as few-shot learning, should be considered. Few-shot learning has become practical in modern computer vision, and adapting it to tree segmentation is another important direction for future research.

5. Conclusions

This paper presents a new apple tree-segmentation dataset in which a target tree at the center of an image is intertwined with neighboring trees and facilities. Recent deep-learning-based pre-trained models for object segmentation were adapted and fine-tuned using our dataset. YOLOv8 exhibited the highest accuracy. We believe that the model can be successfully applied to various tasks such as yield estimation, spraying, phenotyping, and pruning. One of the most important future works will be to enlarge the dataset in terms of both the number of labeled images and the diversity of trees from different orchards. Another future work will be to apply the segmentation results to actual tasks, such as yield estimation.

Author Contributions

Methodology, Y.-J.L., M.K., and I.-S.O.; Software, D.S. and J.K.; Investigation, T.-W.Y.; Writing—original draft, Y.-J.L.; Writing—review & editing, I.-S.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was carried out with the support of “Cooperative Research Program for Agriculture Science and Technology Development (Project No. PJ015618)” Rural Development Administration, Republic of Korea.

Data Availability Statement

Our dataset is freely available at http://data.mendeley.com/datasets/t7jk2mspcy/1, accessed on 23 September 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Turner-Skoff, J.B.; Cavender, N. The Benefits of Trees for Livable and Sustainable Communities. Plants People Planet 2019, 1, 323–335. [Google Scholar] [CrossRef]
  2. Chehreh, B.; Moutinho, A.; Viegas, C. Latest Trends on Tree Classification and Segmentation Using UAV Data—A Review of Agroforestry Applications. Remote Sens. 2023, 15, 2263. [Google Scholar] [CrossRef]
  3. Minaee, S.; Boykov, Y.Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  4. Chen, Z.; Ting, D.; Newbury, R.; Chen, C. Semantic Segmentation for Partially Occluded Apple Trees Based on Deep Learning. Comput. Electron. Agric. 2021, 181, 105952. [Google Scholar] [CrossRef]
  5. Häni, N.; Roy, P.; Isler, V. A Comparative Study of Fruit Detection and Counting Methods for Yield Mapping in Apple Orchards. J. Field Robot. 2019, 37, 263–282. [Google Scholar] [CrossRef]
  6. Mo, J.; Lan, Y.; Yang, D.; Wen, F.; Qiu, H.; Chen, X.; Deng, X. Deep Learning-Based Instance Segmentation Method of Litchi Canopy from UAV-Acquired Images. Remote Sens. 2021, 13, 3919. [Google Scholar] [CrossRef]
  7. Cao, L.; Zheng, X.; Fang, L. The Semantic Segmentation of Standing Tree Images Based on the Yolov7 Deep Learning Algorithm. Electronics 2023, 12, 929. [Google Scholar] [CrossRef]
  8. Cong, P.; Zhou, J.; Li, S.; Lv, K.; Feng, H. Citrus Tree Crown Segmentation of Orchard Spraying Robot Based on RGB-D Image and Improved Mask R-CNN. Appl. Sci. 2022, 13, 164. [Google Scholar] [CrossRef]
  9. Diez, Y.; Kentsch, S.; Fukuda, M.; Caceres, M.L.L.; Moritake, K.; Cabezas, M. Deep Learning in Forestry Using UAV-Acquired RGB Data: A Practical Review. Remote Sens. 2021, 13, 2837. [Google Scholar] [CrossRef]
  10. Xiao, K.; Ma, Y.; Gao, G. An Intelligent Precision Orchard Pesticide Spray Technique Based on the Depth-of-Field Extraction Algorithm. Comput. Electron. Agric. 2017, 133, 30–36. [Google Scholar] [CrossRef]
  11. Milella, A.; Marani, R.; Petitti, A.; Reina, G. In-Field High Throughput Grapevine Phenotyping with a Consumer-Grade Depth Camera. Comput. Electron. Agric. 2019, 156, 293–306. [Google Scholar] [CrossRef]
  12. Dong, W.; Roy, P.; Isler, V. Semantic Mapping for Orchard Environments by Merging Two-Sides Reconstructions of Tree Rows. J. Field Robot. 2019, 37, 97–121. [Google Scholar] [CrossRef]
  13. Gao, G.; Xiao, K.; Jun, Y. A Spraying Path Planning Algorithm Based on Colour-Depth Fusion Segmentation in Peach Orchards. Comput. Electron. Agric. 2020, 173, 105412. [Google Scholar] [CrossRef]
  14. Lin, G.; Tang, Y.; Zou, X.; Wang, C. Three-Dimensional Reconstruction of Guava Fruits and Branches Using Instance Segmentation and Geometry Analysis. Comput. Electron. Agric. 2021, 184, 106107. [Google Scholar] [CrossRef]
  15. Seol, J.; Kim, J.; Son, H.I. Field Evaluations of a Deep Learning-Based Intelligent Spraying Robot with Flow Control for Pear Orchards. Precis. Agric. 2022, 23, 712–732. [Google Scholar] [CrossRef]
  16. Zhang, J.; He, L.; Karkee, M.; Zhang, Q.; Zhang, X.; Gao, Z. Branch Detection for Apple Trees Trained in Fruiting Wall Architecture Using Depth Features and Regions-Convolutional Neural Network (R-CNN). Comput. Electron. Agric. 2018, 155, 386–393. [Google Scholar] [CrossRef]
  17. Asaei, H.; Jafari, A.; Loghavi, M. Site-Specific Orchard Sprayer Equipped with Machine Vision for Chemical Usage Management. Comput. Electron. Agric. 2019, 162, 431–439. [Google Scholar] [CrossRef]
  18. Majeed, Y.; Zhang, J.; Zhang, X.; Fu, L.; Karkee, M.; Zhang, Q.; Whiting, M.D. Deep Learning Based Segmentation for Automated Training of Apple Trees on Trellis Wires. Comput. Electron. Agric. 2020, 170, 105277. [Google Scholar] [CrossRef]
  19. Song, Z.; Zhou, Z.; Wang, W.; Gao, F.; Fu, L.; Li, R.; Cui, Y. Canopy Segmentation and Wire Reconstruction for Kiwifruit Robotic Harvesting. Comput. Electron. Agric. 2021, 181, 105933. [Google Scholar] [CrossRef]
  20. Lin, G.; Chen, Z.; Xu, Y.; Wang, M.; Zhang, Z.; Zhu, L. Real-Time Guava Tree-Part Segmentation Using Fully Convolutional Network with Channel and Spatial Attention. Front. Plant Sci. 2022, 13, 991487. [Google Scholar] [CrossRef]
  21. Chen, Y.; Hou, C.; Tang, Y.; Zhuang, J.; Lin, J.; He, Y.; Guo, Q.; Zhong, Z.; Lei, H.; Luo, S. Citrus Tree Segmentation from UAV Images Based on Monocular Machine Vision in a Natural Orchard Environment. Sensors 2019, 19, 5558. [Google Scholar] [CrossRef]
  22. Safonova, A.; Guirado, E.; Maglinets, Y.; Alcaraz-Segura, D.; Tabik, S. Olive Tree Biovolume from UAV Multi-Resolution Image Segmentation with Mask R-CNN. Sensors 2021, 21, 1617. [Google Scholar] [CrossRef] [PubMed]
  23. Lu, Z.; Qi, L.; Zhang, H.; Wan, J.; Zhou, J. Image Segmentation of UAV Fruit Tree Canopy in a Natural Illumination Environment. Agriculture 2022, 12, 1039. [Google Scholar] [CrossRef]
  24. Gibril, M.B.A.; Shafri, H.Z.M.; Al-Ruzouq, R.; Shanableh, A.; Nahas, F.; Al Mansoori, S. Large-Scale Date Palm Tree Segmentation from Multiscale UAV-Based and Aerial Images Using Deep Vision Transformers. Drones 2023, 7, 93. [Google Scholar] [CrossRef]
  25. Wallace, L.; Lucieer, A.; Watson, C.S. Evaluating Tree Detection and Segmentation Routines on Very High Resolution UAV LiDAR Data. IEEE Trans. Geosci. Remote Sens. 2014, 52, 7619–7628. [Google Scholar] [CrossRef]
  26. Hu, X.; Li, D. Research on a Single-Tree Point Cloud Segmentation Method Based on UAV Tilt Photography and Deep Learning Algorithm. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4111–4120. [Google Scholar] [CrossRef]
  27. Hu, T.; Sun, X.; Su, Y.; Guan, H.; Sun, Q.; Kelly, M.; Guo, Q. Development and Performance Evaluation of a Very Low-Cost UAV-Lidar System for Forestry Applications. Remote Sens. 2020, 13, 77. [Google Scholar] [CrossRef]
  28. Krůček, M.; Král, K.; Cushman, K.; Missarov, A.; Kellner, J.R. Supervised Segmentation of Ultra-High-Density Drone Lidar for Large-Area Mapping of Individual Trees. Remote Sens. 2020, 12, 3260. [Google Scholar] [CrossRef]
  29. Kuželka, K.; Slavík, M.; Surový, P. Very High Density Point Clouds from UAV Laser Scanning for Automatic Tree Stem Detection and Direct Diameter Measurement. Remote Sens. 2020, 12, 1236. [Google Scholar] [CrossRef]
  30. Torresan, C.; Carotenuto, F.; Chiavetta, U.; Miglietta, F.; Zaldei, A.; Gioli, B. Individual Tree Crown Segmentation in Two-Layered Dense Mixed Forests from UAV LiDAR Data. Drones 2020, 4, 10. [Google Scholar] [CrossRef]
  31. Yan, W.; Guan, H.; Cao, L.; Yu, Y.; Li, C.; Lu, J.G. A Self-Adaptive Mean Shift Tree-Segmentation Method Using UAV LiDAR Data. Remote Sens. 2020, 12, 515. [Google Scholar] [CrossRef]
  32. Chen, X.; Jiang, K.; Zhu, Y.; Wang, X.; Yun, T. Individual Tree Crown Segmentation Directly from UAV-Borne LiDAR Data Using the PointNet of Deep Learning. Forests 2021, 12, 131. [Google Scholar] [CrossRef]
  33. Chen, Q.; Wang, X.; Hang, M.; Li, J. Research on the Improvement of Single Tree Segmentation Algorithm Based on Airborne LiDAR Point Cloud. Open Geosci. 2021, 13, 705–716. [Google Scholar] [CrossRef]
  34. Neuville, R.; Bates, J.S.; Jonard, F. Estimating Forest Structure from UAV-Mounted LiDAR Point Cloud Using Machine Learning. Remote Sens. 2021, 13, 352. [Google Scholar] [CrossRef]
  35. Chen, Q.; Gao, T.; Zhu, J.; Wu, F.; Li, X.; Lu, D.; Yu, F. Individual Tree Segmentation and Tree Height Estimation Using Leaf-off and Leaf-on UAV-LiDAR Data in Dense Deciduous Forests. Remote Sens. 2022, 14, 2787. [Google Scholar] [CrossRef]
  36. Li, Y.; Chai, G.; Wang, Y.; Lei, L.; Zhang, X. ACE R-CNN: An Attention Complementary and Edge Detection-Based Instance Segmentation Algorithm for Individual Tree Species Identification Using UAV RGB Images and LiDAR Data. Remote Sens. 2022, 14, 3035. [Google Scholar] [CrossRef]
  37. Ma, K.; Chen, Z.; Fu, L.; Tian, W.; Jiang, F.; Yi, J.; Du, Z.; Sun, H. Performance and Sensitivity of Individual Tree Segmentation Methods for UAV-LiDAR in Multiple Forest Types. Remote Sens. 2022, 14, 298. [Google Scholar] [CrossRef]
  38. Qin, H.; Zhou, W.; Yao, Y.; Wang, W. Individual Tree Segmentation and Tree Species Classification in Subtropical Broadleaf Forests Using UAV-Based LiDAR, Hyperspectral, and Ultrahigh-Resolution RGB Data. Remote Sens. Environ. 2022, 280, 113143. [Google Scholar] [CrossRef]
  39. Terryn, L.; Calders, K.; Bartholomeus, H.; Bartolo, R.E.; Brede, B.; D’hont, B.; Disney, M.; Herold, M.; Lau, A.; Shenkin, A.; et al. Quantifying Tropical Forest Structure through Terrestrial and UAV Laser Scanning Fusion in Australian Rainforests. Remote Sens. Environ. 2022, 271, 112912. [Google Scholar] [CrossRef]
  40. Deng, S.; Katoh, M.; Yu, X.; Hyyppä, J.; Gao, T. Comparison of Tree Species Classifications at the Individual Tree Level by Combining ALS Data and RGB Images Using Different Algorithms. Remote Sens. 2016, 8, 1034. [Google Scholar] [CrossRef]
  41. Puliti, S.; Talbot, B.; Astrup, R. Tree-Stump Detection, Segmentation, Classification, and Measurement Using Unmanned Aerial Vehicle (UAV) Imagery. Forests 2018, 9, 102. [Google Scholar] [CrossRef]
  42. Kattenborn, T.; Eichel, J.; Fassnacht, F.E. Convolutional Neural Networks Enable Efficient, Accurate and Fine-Grained Segmentation of Plant Species and Communities from High-Resolution UAV Imagery. Sci. Rep. 2019, 9, 17656. [Google Scholar] [CrossRef]
  43. Torres, D.; Queiroz Feitosa, R.; Nigri Happ, P.; Elena Cué La Rosa, L.; Marcato Junior, J.; Martins, J.; Olã Bressan, P.; Gonçalves, W.N.; Liesenberg, V. Applying Fully Convolutional Architectures for Semantic Segmentation of a Single Tree Species in Urban Environment on High Resolution UAV Optical Imagery. Sensors 2020, 20, 563. [Google Scholar] [CrossRef] [PubMed]
  44. Zhang, C.; Zhou, J.; Wang, H.; Tan, T.; Cui, M.; Huang, Z.; Wang, P.; Zhang, L. Multi-Species Individual Tree Segmentation and Identification Based on Improved Mask R-CNN and UAV Imagery in Mixed Forests. Remote Sens. 2022, 14, 874. [Google Scholar] [CrossRef]
  45. Firoze, A.; Wingren, C.; Yeh, R.; Benes, B.; Aliaga, D. Tree Instance Segmentation with Temporal Contour Graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2193–2202. [Google Scholar]
  46. Yang, R.; Fang, W.; Sun, X.; Jing, X.; Fu, L.; Wei, X.; Li, R. An Aerial Point Cloud Dataset of Apple Tree Detection and Segmentation with Integrating RGB Information and Coordinate Information. IEEE Dataport 2023. [Google Scholar] [CrossRef]
  47. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 2961–2969. [Google Scholar] [CrossRef]
  48. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; Available online: https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper (accessed on 23 September 2023).
  49. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. Available online: https://proceedings.neurips.cc/paper_files/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html (accessed on 23 September 2023).
  50. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  51. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 23 September 2023).
  52. Terven, J.; Cordova-Esparza, D. A Comprehensive Review of YOLO: From YOLOv1 to YOLOv8 and Beyond. arXiv 2023, arXiv:2304.00501. [Google Scholar] [CrossRef]
  53. Jocher, G. YOLO by Ultralytics. Available online: https://github.com/ultralytics/ultralytics (accessed on 23 September 2023).
  54. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT++ Better Real-Time Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1108–1121. [Google Scholar] [CrossRef]
  55. Padilla, R.; Netto, S.L.; da Silva, E.A.B. A Survey on Performance Metrics for Object-Detection Algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020. [Google Scholar] [CrossRef]
  56. Lu, Y.; Young, S. A Survey of Public Datasets for Computer Vision Tasks in Precision Agriculture. Comput. Electron. Agric. 2020, 178, 105760. [Google Scholar] [CrossRef]
Figure 1. Various situations of tree image segmentation: (a) top-view RGB image of a forest [6], (b) RGB image of an isolated tree [7], (c) RGB-D image in which trees were removed by thresholding [8], and (d) RGB image obtained from our dataset where branches are severely intertwined.
Figure 2. Overall procedure of our study.
Figure 3. An apple tree image in our dataset.
Figure 4. Example of tree region labeling.
Figure 5. Guidelines for labeling the leaves.
Figure 6. Labeling guidelines for cases in which trees are intertwined with poles or trellises.
Figure 7. Labeling guidelines for cases in which neighboring trees are heavily intertwined.
Figure 8. Processing pipeline of mask R-CNN.
Figure 9. Skip connection.
Figure 10. Learning curves over 300 epochs.
Figure 11. Precision–recall curves and F1 score curves.
Figure 12. Example of tree segmentation: (a) original image, (b) ground truth, (c) mask R-CNN with ResNet50, (d) mask R-CNN with Swin-T, (e) mask R-CNN with Swin-S, (f) YOLACT, and (g) YOLOv8.
Figure 13. Comparison of branch segmentation: (a) original image, (b) ground truth, (c) mask R-CNN with ResNet50, (d) mask R-CNN with Swin-T, (e) mask R-CNN with Swin-S, (f) YOLACT, and (g) YOLOv8.
Figure 14. Comparison of trunk segmentation: (a) original image, (b) ground truth, (c) mask R-CNN with ResNet50, (d) mask R-CNN with Swin-T, (e) mask R-CNN with Swin-S, (f) YOLACT, and (g) YOLOv8.
Figure 15. Comparison of segmentations of heavily intertwined branches: (a) original image, (b) ground truth, (c) mask R-CNN with ResNet50, (d) mask R-CNN with Swin-T, (e) mask R-CNN with Swin-S, (f) YOLACT, and (g) YOLOv8.
Table 1. Summary of tree segmentation studies.

Domain | Papers | Sensor Type | Image Type | View * | Segmentation Target | Segmentation Algorithm | Potential Application
Agriculture | [10] | RGB-D camera | RGB-D | front | single tree | thresholding point cloud | spraying
 | [11] | RGB-D camera | RGB-D | front | individual trees | thresholding point cloud | phenotyping
 | [12] | RGB-D camera | RGB-D | front | single tree | deep learning (mask R-CNN) | phenotyping
 | [13] | RGB-D camera | RGB-D | front | individual trees | color–depth fusion | spraying
 | [5] | RGB-D camera | RGB-D | front | single tree | thresholding depth map | yield estimation
 | [4] | RGB-D camera | RGB-D | front | single tree | deep learning (U-Net) | pruning
 | [14] | RGB-D camera | RGB-D | front | single tree | deep learning (mask R-CNN) | harvesting
 | [8] | RGB-D camera | RGB-D | front | single tree | deep learning (mask R-CNN) | spraying
 | [15] | RGB-D camera | RGB-D | front | single tree | deep learning (SegNet) | spraying
 | [16] | RGB camera | RGB | front | single tree | deep learning (R-CNN) | harvesting
 | [17] | RGB camera | RGB | front | single tree | thresholding green channel | spraying
 | [18] | RGB camera | RGB | front | single tree | deep learning (SegNet) | tree training
 | [19] | RGB camera | RGB | front | single tree | deep learning (DeepLabV3++) | harvesting
 | [20] | RGB camera | RGB | front | single tree | deep learning (MobileNet) | harvesting
 | [7] | RGB camera | RGB | front | single tree | deep learning (YOLOv7) | phenotyping
 | Ours | RGB camera | RGB | front | single tree | deep learning (YOLOv8) | multipurpose
 | [21] | RGB camera | RGB | top | individual trees | SVM | tree population
 | [22] | RGB camera | RGB | top | individual trees | deep learning (mask R-CNN) | phenotyping
 | [23] | RGB camera | RGB | top | individual trees | naïve Bayes | spraying
 | [24] | RGB camera | RGB | top | individual trees | deep learning (Segformer) | phenotyping
Digital forestry | [25] | LiDAR | point cloud | front | individual trees | region growing | quantifying forest structure
 | [26] | LiDAR | point cloud | top | individual trees | deep learning (T-Net) | quantifying forest structure
 | [27] | LiDAR | point cloud | top | single tree | tree growing process | quantifying forest structure
 | [28] | LiDAR | point cloud | top | individual trees | random forest | quantifying forest structure
 | [29] | LiDAR | point cloud | top | individual trees | Hough transform | quantifying forest structure
 | [30] | LiDAR | point cloud | top | individual trees | region growing | quantifying forest structure
 | [31] | LiDAR | point cloud | top | individual trees | mean shift | quantifying forest structure
 | [32] | LiDAR | point cloud | top | individual trees | deep learning (PointNet) | quantifying forest structure
 | [33] | LiDAR | point cloud | top | single tree | DBSCAN and k-means | quantifying forest structure
 | [34] | LiDAR | point cloud | front | individual trees | HDBSCAN clustering | quantifying forest structure
 | [35] | LiDAR | point cloud | top | individual trees | watershed | quantifying forest structure
 | [36] | LiDAR | RGB | top | individual trees | deep learning (mask R-CNN) | tree species classification
 | [37] | LiDAR | point cloud | top | individual trees | watershed | quantifying forest structure
 | [38] | LiDAR | RGB | top | individual trees | watershed and random forest | tree species classification
 | [39] | LiDAR | point cloud | front and top | individual trees | region growing | quantifying forest structure
 | [40] | RGB camera | RGB | top | individual trees | watershed and SVM | tree species classification
 | [41] | RGB camera | RGB | top | tree stumps | region growing | tree population
 | [42] | RGB camera | RGB | top | individual trees | deep learning (U-Net) | tree species classification
 | [43] | RGB camera | RGB | top | single tree | deep learning (FC-DenseNet) | quantifying forest structure
 | [6] | RGB camera | RGB | top | individual trees | deep learning (YOLACT) | quantifying forest structure
 | [44] | RGB camera | RGB | top | individual trees | deep learning (mask R-CNN) | quantifying forest structure
 | [45] | RGB camera | RGB | top | individual trees | temporal contour graph | tree population
* Top view corresponds to images captured by the UAV overhead.
Table 2. Pre-trained segmentation models. (The first two columns describe the pre-trained models; the remaining columns list the hyper-parameters used for fine-tuning.)

Segmentation Model | Backbone | Loss Function | Optimizer | Batch Size | Learning Rate
Mask R-CNN | ResNet50 | CrossEntropyLoss | SGD | 16 | 0.02
Mask R-CNN | Swin-T | CrossEntropyLoss | AdamW | 16 | 0.0001
Mask R-CNN | Swin-S | CrossEntropyLoss | AdamW | 8 | 0.0001
YOLACT | ResNet101 | CrossEntropyLoss | SGD | 8 | 0.001
YOLOv8 | CSPDarknet53 | Focal Loss | SGD | 4 | 0.01
Table 3. AP of the five models.

Segmentation Model | Backbone | Box AP@0.5 | Box AP@0.75 | Box AP@0.5:0.95 | Mask AP@0.5 | Mask AP@0.75 | Mask AP@0.5:0.95
Mask R-CNN | ResNet50 | 100.0 | 95.7 | 77.1 | 100.0 | 93.3 | 69.0
Mask R-CNN | Swin-T | 100.0 | 91.5 | 78.4 | 100.0 | 95.3 | 71.0
Mask R-CNN | Swin-S | 100.0 | 100.0 | 81.8 | 100.0 | 95.4 | 73.4
YOLACT | ResNet101 | 100.0 | 100.0 | 80.2 | 100.0 | 93.2 | 72.1
YOLOv8 | CSPDarknet53 | 99.5 | 99.5 | 93.7 | 99.5 | 99.5 | 84.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

La, Y.-J.; Seo, D.; Kang, J.; Kim, M.; Yoo, T.-W.; Oh, I.-S. Deep Learning-Based Segmentation of Intertwined Fruit Trees for Agricultural Tasks. Agriculture 2023, 13, 2097. https://doi.org/10.3390/agriculture13112097