1. Introduction
Accurate fruit yield prediction allows farmers to optimize harvesting, storage, marketing, transportation, and sales, thereby improving resource management and facilitating informed decision-making [1,2,3,4]. Traditionally, yield estimation has been performed manually by fruit growers and horticultural scientists, who count and weigh samples from a few randomly selected areas to extrapolate the total yield for entire orchards or larger regions [3,5]. However, these manual processes are labor-intensive, time-consuming, and costly. They are also prone to significant errors and variability due to differences among individual trees and diverse orchard environmental conditions [5,6,7]. Consequently, manual methods often lack accuracy and reliability, complicating farmers’ ability to effectively predict yields and plan subsequent actions [1,3]. These challenges highlight the need for intelligent fruit yield estimation systems that enhance both the accuracy and efficiency of yield predictions in agricultural practices [1,2,3,6].
Recent advances in deep learning (DL) have shifted yield estimation from labor-intensive manual methods to automated, DL-based approaches [1,2,3,5,6,8,9,10,11,12]. DL-based fruit yield estimation systems offer significant advantages, including nondestructive, image-based measurements, high accuracy, and robust performance across diverse environmental conditions. These capabilities can substantially enhance agricultural productivity and resource management efficiency. As a result, farmers can plan harvesting, storage, transportation, and marketing activities more systematically, address labor shortages, and improve the sustainability and profitability of the agricultural industry by integrating automation technologies such as robotic harvesting [2,3,10,13].
The fundamental concept of orchard yield estimation is to count all visible fruits in a single image and estimate the final yield based on the total number of fruits detected across all captured images [5]. In this process, we focused on fruit-load estimation, which involves counting the fruits visible in tree images. The core technology for fruit-load estimation is the accurate detection of individual fruit locations for precise counting. Typically, the fruit-load estimation process using RGB images involves capturing images of one or both sides of the tree rows, accurately detecting the fruits, and counting them. The total fruit count is obtained by summing all individual detections.
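To make this counting scheme concrete, the sketch below (illustrative only, not the system’s actual pipeline) sums the per-image detections returned by any fruit detector; the function and parameter names are hypothetical.

```python
from typing import Callable, Iterable, List


def estimate_fruit_load(images: Iterable, detect: Callable[[object], List]) -> int:
    """Sum visible-fruit detections over all captured images.

    `detect` is any detector that returns a list of fruit detections
    (e.g., bounding boxes or center points) for a single image.
    """
    return sum(len(detect(image)) for image in images)
```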
Maheswari et al. [10] employed a semantic segmentation model called U-Net to detect and localize guava fruits in orchards for yield estimation. Koirala et al. [6] used images collected at night with LED lighting panels and an RGB camera, applying the MangoYOLO model for fruit detection and the Xception_count model for direct fruit count prediction via CNN-based regression. Koirala et al. [11] developed the MangoYOLO model, a real-time fruit detection system based on the YOLOv3 architecture that utilizes nighttime data. The motivation for using nighttime data is that artificial lighting ensures consistent illumination, stabilizes image quality, and enhances the distinction between fruits and background, thus improving object detection performance. While both studies employed nighttime imaging with artificial lighting to enhance fruit-background contrast and stabilize illumination, they share limitations such as increased operational cost, limited scalability for large-scale deployment, and persistent challenges in accurately estimating heavily occluded fruits. Behera et al. [14] employed a Faster R-CNN with a modified Intersection over Union (MIoU) to improve detection accuracy for mangoes, pomegranates, tomatoes, apples, and oranges. Gao et al. [12] proposed a Faster R-CNN-based method for detecting fruits on multiclass plant structures in orchards with fruit-wall tree architectures; in that study, fruits were categorized into four classes according to occlusion conditions relevant to robotic harvesting: non-occluded, leaf-occluded, branch/stem-occluded, and fruit-occluded. Kestur et al. [7] introduced MangoNet, a CNN-based semantic segmentation model that detects mangoes in RGB images. MacEachern et al. [15] utilized six YOLO-based models to classify the ripening stages of wild blueberries into three color stages (green, red, and blue) and two maturity levels (unripe and ripe). Zhang et al. [16] developed a real-time strawberry detection system based on YOLOv4-tiny and a customized variant of it deployed on an embedded device (Jetson Nano). Recently, Xiao et al. [17] proposed a YOLOv8-based model for detecting the location and ripeness stages of apples and pears. Similarly, Yang et al. [18] proposed a tomato detection method based on an enhanced YOLOv8 architecture that incorporates depthwise separable convolutions and a dual-path attention gate module.
When using image sequences, fruit tracking across frames can be performed to merge fruit counts from multiple views and avoid double counting [2,6,9,13]. Häni et al. [13] employed U-Net for fruit detection and counting, combining affine tracking with incremental structure-from-motion (SfM) to track apples and prevent duplicate counting. Xia et al. [19] applied CenterNet for fruit detection in sequential images and utilized a patch-matching model based on the Kuhn-Munkres algorithm to eliminate duplicate detections for oranges and apples, thereby estimating fruit yield. Villacrés et al. [9] experimented with Fast R-CNN and YOLOv5 as fruit detection methods, evaluating various tracking and counting techniques, including the Kalman filter, the kernelized correlation filter, multi-hypothesis tracking, simple online real-time tracking (SORT), and DeepSORT.
Steinbrener et al. [20] applied CNNs pretrained on RGB images to classify hyperspectral images of fruits and vegetables. In real-world outdoor farm environments, a single sensor modality often fails to provide sufficient information for detecting target fruits because of extensive illumination variation, partial occlusion, and diverse appearances. This underscores the need for multimodal fruit detection systems, in which different types of sensors offer complementary information about various aspects of the fruits [21]. It is also possible to count fruits by utilizing auxiliary channels beyond RGB, although this approach requires additional, expensive devices. Zhang et al. [4] proposed a method for fruit counting and yield-estimation mapping by converting RGB images into the HSV and Lab color spaces, splitting them into their H, S, V and L*, a*, b* components, and applying the Hough transform algorithm.
Although previous DL approaches for fruit detection have demonstrated significant success, they share a common limitation: models trained and developed for specific fruit types and conditions cannot be readily extended to different orchards or various fruit species [1]. These DL models are typically trained on datasets containing a single fruit species. Consequently, they face limitations when applied to estimate the fruit load in different orchard environments or with other fruit species. While such specialized models perform exceptionally well in their designated tasks, they often encounter challenges when deployed in new scenarios involving diverse orchard conditions or fruit types, highlighting their limited generalization ability [1].
We focused on building a single model using datasets from different orchards and various fruit species to perform multi-species fruit-load estimation. Tree fruit-load estimation relies not on evaluating the total number of fruits per tree but on counting the number of visible fruits in the captured images. To evaluate multi-species fruit-load estimation, we employed YOLOv8, RT-DETR [22], Faster R-CNN, and U-Net-based heatmap regression (HR) methods. The primary objective of this study was to develop a generalized DL framework for multi-species fruit-load estimation using datasets collected from diverse orchard environments. The secondary objective was to train four representative models—YOLOv8, RT-DETR, Faster R-CNN, and a U-Net-based HR model—on the multi-species fruit dataset MetaFruit. The trained models were evaluated using external test datasets to assess their robustness and generalizability across different fruit species and orchards.
3. Experimental Results
3.1. Dataset and Training Settings
We conducted experiments by splitting the MetaFruit dataset, which consists of 4248 images, into 2718 training images, 680 validation images, and 850 test images. We trained separate models for each fruit category in the MetaFruit dataset and performed comparative experiments. Supplementary testing was conducted using the NIHHS-JBNU and Peach datasets. The U-Net-based RGBH heatmap regression (HR) model was trained using PyTorch 2.5.1 on four NVIDIA GeForce RTX 2080 Ti GPUs for a total of 50 epochs, with a batch size of 4 and mean squared error loss (MSELoss) as the loss function. The AdamW optimizer was employed with an initial learning rate of 1 × 10−4 and a weight decay of 1 × 10−5. Training was carried out using four-channel input images (R, G, B, and H) with a shape of (4, height, width). The data augmentation pipeline included Gaussian blur, color jittering (adjustments to hue, saturation, brightness, and contrast), and random geometric transformations such as rotation, horizontal flipping, and vertical flipping. All images were normalized using the ImageNet mean and standard deviation values.
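For clarity, the following is a minimal sketch of this training configuration. The `UNetHR` model class and `FruitHeatmapDataset` loader are hypothetical stand-ins for the implementation, which is not reproduced here; the optimizer, loss, batch size, and epoch count follow the values reported above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Hypothetical modules standing in for the actual implementation:
# UNetHR is a U-Net with a single-channel heatmap head; FruitHeatmapDataset
# yields (RGBH image, target heatmap) pairs after the augmentations described above.
from model import UNetHR
from dataset import FruitHeatmapDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 4-channel input (R, G, B, H); 1-channel output heatmap of fruit centers.
model = nn.DataParallel(UNetHR(in_channels=4, out_channels=1)).to(device)

train_loader = DataLoader(FruitHeatmapDataset("train"), batch_size=4, shuffle=True)

criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

for epoch in range(50):
    model.train()
    running_loss = 0.0
    for images, heatmaps in train_loader:   # images: (B, 4, H, W), heatmaps: (B, 1, H, W)
        images, heatmaps = images.to(device), heatmaps.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), heatmaps)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: loss = {running_loss / len(train_loader):.4f}")
```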
For the YOLOv8 model, we used pretrained weights (yolov8s.pt) trained on the COCO dataset and fine-tuned the model on the MetaFruit dataset. The model was trained using the default YOLOv8 configuration with an SGD optimizer, an initial learning rate of 0.01, momentum of 0.937, weight decay of 5 × 10−4, and a batch size of 16.
Training was conducted for 50 epochs with the input images resized to 640 × 640 pixels. The default YOLOv8 data augmentation pipeline, which includes mosaic augmentation, HSV shifts, scaling, and flipping, was applied. Automatic mixed precision (AMP) was used to accelerate training and reduce memory usage.
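Under these settings, the fine-tuning step can be approximated with the Ultralytics Python API as sketched below; `metafruit.yaml` is a placeholder dataset configuration file (image paths and class names) not provided here, and the remaining arguments mirror the hyperparameters reported above.

```python
from ultralytics import YOLO

# Fine-tune the COCO-pretrained YOLOv8-small checkpoint on MetaFruit.
model = YOLO("yolov8s.pt")
model.train(
    data="metafruit.yaml",   # assumed dataset config file
    epochs=50,
    imgsz=640,
    batch=16,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=5e-4,
    amp=True,                # automatic mixed precision
)
```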
Figure 5 shows the learning curves of the YOLOv8 model trained on the MetaFruit dataset. The top row illustrates the training losses, including box regression loss, classification loss, and distribution focal loss, along with the precision and recall trends. The bottom row presents the corresponding validation losses and evaluation metrics, including mAP@0.5 and mAP@[0.5:0.95]. All loss values steadily decreased over the course of training, while the precision, recall, and mAP scores consistently increased, indicating stable convergence and improved detection performance without signs of overfitting.
3.2. Results
Overall, the MetaFruit dataset exhibited a balanced distribution of apples, oranges, lemons, and tangerines, each with a similar number of images, while grapefruits were represented by 490 images. Tangerines were particularly well represented, with 1062 images and 85,785 labeled instances, resulting in an average of 81 bounding boxes per image. The average number of bounding boxes per image reflects the fruit density captured within the images, while the average instance size provides insights into the physical dimensions of the objects. In particular, smaller instance sizes are associated with increased difficulty in accurate detection. Although the lemon class did not have the highest average number of bounding boxes per image, it had the smallest average instance size.
We used the YOLOv8 model with pretrained weights (yolov8s.pt) from the COCO dataset and fine-tuned it on the MetaFruit dataset. The Faster R-CNN model, built on a ResNet-50 backbone with a feature pyramid network (FPN), was fine-tuned from COCO-pretrained weights. Similarly, RT-DETR was fine-tuned from COCO-pretrained weights. The U-Net-based heatmap regression model did not use any pretrained weights.
Table 4 presents a comparison of the center-point detection performance of the YOLOv8, RT-DETR, Faster R-CNN, and U-Net-based HR models. Additionally, we experimented with U-Net-based RGB(H, S, HS) heatmap regression models by concatenating the hue (H), saturation (S), or both hue and saturation (HS) channels from the HSV color space with the RGB channels. Specifically, the U-Net-based RGBH HR model uses the hue channel concatenated with the RGB channels as input, the RGBS model incorporates the saturation channel, and the RGBHS model uses both the hue and saturation channels. The motivation for extending the RGB input in this way is to enhance robustness under varying lighting and background conditions: HSV is known for its resilience to illumination changes and its effectiveness in color-based object detection, and the hue channel in particular captures color-specific features that are independent of brightness, enabling better fruit-background separation. Although the average numerical difference was small, the RGBH model demonstrated more consistent and stable detection performance across diverse scenarios, especially in low-light or cluttered environments. The experimental results showed that YOLOv8 achieved the highest F1-score (0.8366), demonstrating a good balance between precision and recall, followed by the U-Net-based RGBH HR model.
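As an illustration of how the extra channel can be constructed, the sketch below converts an RGB image to HSV and appends the hue channel to form a four-channel RGBH input. The function name is illustrative, and per-channel normalization with ImageNet statistics (as described in Section 3.1) is omitted for brevity.

```python
import cv2
import numpy as np


def make_rgbh_input(image_path: str) -> np.ndarray:
    """Build a 4-channel RGBH array (channels-first) from an image file."""
    bgr = cv2.imread(image_path)                          # OpenCV loads images as BGR
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)

    rgb_norm = rgb.astype(np.float32) / 255.0             # RGB scaled to [0, 1]
    hue_norm = hsv[..., 0:1].astype(np.float32) / 179.0   # 8-bit OpenCV hue range is [0, 179]

    rgbh = np.concatenate([rgb_norm, hue_norm], axis=-1)  # (H, W, 4)
    return np.transpose(rgbh, (2, 0, 1))                  # (4, H, W), channels-first for PyTorch
```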
Qualitative examples of fruit center localization results for the YOLOv8, RT-DETR, Faster R-CNN, and U-Net-based HR models on the MetaFruit dataset are shown in Figure 6, Figure 7, Figure 8 and Figure 9.
The orange boxes in Figure 6d, Figure 7d, Figure 8d and Figure 9d illustrate representative center-point detection failure cases for each model. These highlight model-specific limitations, such as YOLOv8’s poor performance in handling occluded instances, RT-DETR’s sensitivity to small fruits in low-contrast environments, the U-Net-based HR model’s tendency to mislocalize centers in complex scenes, and Faster R-CNN’s difficulty in accurately detecting small or densely clustered fruits due to occluded region proposals.
We created a separate single-species model for each of the five fruits in the MetaFruit dataset—apple, orange, lemon, grapefruit, and tangerine—and evaluated their individual performances.
Table 5 presents the performance of the single-species models trained separately for each fruit in the MetaFruit dataset. The YOLOv8 model consistently achieved high F1-scores across all fruit categories, particularly for tangerines and grapefruits, indicating strong generalization and robustness. In contrast, the U-Net-based RGBH HR models exhibited more variable performance, showing strong results for tangerines but relatively low F1-scores for grapefruits and lemons. A comparison with the dataset statistics in Table 1 reveals that model performance is significantly affected by both the number of training images and the object density (i.e., the average number of boxes per image). Tangerines, with the highest number of instances and image counts, yielded the best F1-scores for both models. Conversely, grapefruits, with the lowest image and instance counts, showed a sharp drop in performance for the U-Net-based model, whereas YOLOv8 maintained high accuracy. Although lemons had a sufficient number of images, their smaller average instance size likely contributed to lower F1-scores, particularly for the U-Net-based model. Overall, these results suggest that YOLOv8 is more robust to class imbalances and sparse data, whereas the U-Net-based model performs better when trained on dense and abundant data.
The NIHHS-JBNU dataset annotates apples according to their visibility in the images. Across 199 orchard images, 13,260 apples were annotated with bounding boxes. Only 21% of the apples were classified as having good visibility, 33% as fair, and 46% as poor, indicating that most apples exhibited limited visibility. The average size of the apple instances was approximately 98 × 97 pixels in images captured with Cam1 and about 74 × 74 pixels in images captured with Cam2. The area of a single instance relative to the total image area was extremely small, accounting on average for only 0.048% of a Cam1 image and 0.069% of a Cam2 image. This suggests that apples occupy a very small area within the images, which increases the difficulty of accurate detection.
The Peach dataset comprises 125 RGB images depicting peach trees bearing fruit, each accompanied by corresponding ground-truth annotations in the form of instance segmentation masks. It includes 1077 peach fruit objects, with an average of eight peaches per image. There are 80 tree-focused and 45 fruit-bunch-focused images, with averages of 12.0 and 2.7 peach objects per image, respectively. As shown in Table 6, the center-point detection performance is lowest for the Peach dataset. This dataset consists of both tree-centered and fruit-bunch-focused images, as illustrated by the peach image example in Figure 3. Notably, the detection performance was lower for the fruit-bunch-focused images.
The YOLOv8 model was trained on a subset of the MetaFruit dataset, which comprises five fruit types. Consequently, its performance decreased when evaluated on external datasets such as NIHHS-JBNU and Peach, which exhibit differences in fruit appearance, background complexity, and environmental conditions. Similar decreases in the F1-score were observed across other models, indicating a general domain gap. We believe this issue can be addressed in future work by training models on more diverse and species-rich datasets to improve generalization to unseen conditions.
Qualitative examples of fruit center localization results for the YOLOv8, RT-DETR, Faster R-CNN, and U-Net-based HR models on the NIHHS-JBNU and Peach datasets are shown in Figure 10 and Figure 11. The TP, FP, FN, and F1-score values are shown below each figure.
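For reference, the reported precision, recall, and F1-score values follow the standard definitions derived from the TP, FP, and FN counts; a minimal helper (illustrative, not taken from the paper's code) is sketched below.

```python
def detection_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from matched-detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```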
4. Discussion
In this study, we evaluated the performance of four DL approaches—YOLOv8, RT-DETR, Faster R-CNN, and U-Net-based heatmap regression—for multi-species fruit-load estimation across diverse orchard environments. The experimental results demonstrated that all models successfully detected and counted fruits with reasonable accuracy. However, each model exhibited distinct strengths and weaknesses depending on the complexity of the orchard environment and the diversity of fruit species. YOLOv8, which features an anchor-free and lightweight architecture, achieved competitive precision and recall while maintaining a high inference speed, making it highly suitable for real-time applications such as robotic harvesting. Specifically, YOLOv8 achieved an average inference time of 0.0119 s per image, compared to 0.0363 s for RT-DETR and 0.0826 s for Faster R-CNN. This performance difference reflects the lightweight, one-stage design of YOLOv8 in contrast to the two-stage detection pipeline of Faster R-CNN, which, while more accurate in complex scenes, is computationally more expensive.
RT-DETR, a real-time transformer-based detector, also achieved strong detection performance with efficient computation due to its efficient hybrid encoder and lightweight transformer decoder. Although slightly slower than YOLOv8, RT-DETR provided a good balance between accuracy and speed and demonstrated strong generalization across various orchard settings. Faster R-CNN, with its two-stage detection mechanism and robust region proposal strategy, exhibited superior detection accuracy in complex environments characterized by occlusion and high fruit density. U-Net-based HR offers a flexible alternative for localizing fruit centers without requiring traditional bounding-box annotations; however, it shows slightly lower accuracy than detection-based methods, particularly in highly cluttered scenes. This indicates that further research is needed to improve methods for identifying center points from predicted heatmaps.
In cases of heavy occlusion or severe fruit clustering, the RGBH HR and RT-DETR models exhibited superior robustness. HR does not require explicit bounding-box boundaries, enabling better center-point detection of overlapping fruits. Similarly, RT-DETR’s query-based transformer architecture effectively handles partial occlusions and spatial ambiguity by leveraging global context. When fruits are small or densely packed (e.g., tangerines and cherries), RGBH HR offers high spatial precision due to its pixel-level prediction, whereas RT-DETR maintains detection consistency without relying on rigid anchor-based proposals. In contrast, YOLOv8 and Faster R-CNN are more suitable for daytime orchard scenes where the fruits are larger, well-separated, and clearly visible. These models provide fast inference and accurate bounding boxes but are more susceptible to performance degradation in clustered or occluded settings.
Evaluation of the MetaFruit, NIHHS-JBNU, and Peach datasets confirmed that models trained on multi-species datasets can generalize effectively across different fruit types and orchard conditions, validating the potential of integrated DL models for diverse agricultural scenarios. This study had several limitations. First, challenges remain in improving the model’s ability to accurately detect heavily occluded fruits and densely clustered fruit instances while maintaining consistent detection across various fruit growth stages. Second, the model was primarily trained and evaluated using high-resolution RGB images without incorporating additional modalities, such as depth or hyperspectral data, which could potentially enhance performance under occlusion and varying illumination conditions. Third, the evaluation was limited to still images, whereas real-world agricultural environments often require robust detection performance in dynamic or video-based settings.