Artificial Intelligence-Based Detection of On-Ground Chestnuts Toward Automated Picking

Fang, Kaixuan; Lu, Yuzhen; Mu, Xinyang

doi:10.3390/agriengineering8030116


by Kaixuan Fang ¹, Yuzhen Lu ²,* and Xinyang Mu ²

¹ School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

² Department of Biosystems and Agricultural Engineering, Michigan State University, East Lansing, MI 48824, USA

* Author to whom correspondence should be addressed.

AgriEngineering 2026, 8(3), 116; https://doi.org/10.3390/agriengineering8030116

Submission received: 14 February 2026 / Revised: 12 March 2026 / Accepted: 16 March 2026 / Published: 19 March 2026

(This article belongs to the Special Issue Applications of Computer Vision in Agriculture)


Abstract

Traditional mechanized chestnut harvesting is too costly for small producers, non-selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low-cost, vision-guided automated harvesting technology. However, reliable chestnut detection is challenged by complex orchard environments with shading, varying natural light, and interference from weeds, fallen leaves, stones, and other on-ground objects, and these challenges have remained unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state-of-the-art real-time object detectors, including 14 in the YOLO (v11–v13) and 15 in the RT-DETR (v1–v4) families at various model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieved the best mAP@0.5 of 95.1% among all the evaluated models, while RT-DETRv2-R101 was the most accurate variant among the RT-DETR models, with an mAP@0.5 of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrated significant potential for real-time chestnut detection, and YOLO models outperformed RT-DETR models in terms of both detection accuracy and inference speed, making them better suited for on-board deployment. This work lays a foundation for developing AI-based, vision-guided intelligent chestnut harvest systems.

Keywords: artificial intelligence; chestnut; harvest automation; machine vision

1. Introduction

Chestnut (Castanea spp.) is a nutritious and popular nut crop, rich in vitamins and minerals, low in fat, and high in dietary fiber. Its sweet flavor is particularly favored by American consumers. According to the 2022 USDA Census of Agriculture [1], 2845 growers in the United States (U.S.) manage a total of 10,049 acres of fruiting and non-fruiting chestnut orchards, with an average farm size of 3.5 acres. Chestnut farms with a size of 2 hectares (4.94 acres) or less and fewer than 600 chestnut trees are considered small-scale farms by Kang and Guyer [2]. Currently, the USDA defines small farms as operations with an annual gross income of less than $350,000 (https://www.ers.usda.gov/topics/farm-economy/farm-structure-and-organization, accessed on 15 March 2026). A recent study on chestnut production costs in Michigan reported an average total farm revenue of $12,500 per acre [3]. Given the small acreage and revenue levels, chestnut production in the U.S. can be characterized as small-scale farming under the USDA definition.

Michigan leads U.S. chestnut production, accounting for approximately 13% of the national planted area. However, about 35% of this area comprises non-fruiting chestnut orchards. This indicates that a significant portion of the planted land has not yet reached productive capacity, suggesting that the industry has considerable potential for further growth and development.

Chestnuts are highly seasonal fruits that can only maintain their peak commercial quality, size, and health for a relatively short period of time [4]. One of the primary challenges in chestnut production is its susceptibility to pest damage and quality degradation, which often leads to significant post-harvest loss. During the harvest period, chestnuts naturally fall from protective shells (known as burrs) to the orchard floor. Once in direct contact with the ground, the nuts are exposed to adverse environmental conditions, including soil, fallen branches and leaves, surface microbial communities, precipitation, fluctuations in temperature and humidity, etc. Fungi have been identified as the main cause of postharvest chestnut decay [5]. Fungal infection typically occurs after nuts fall to the ground and come into contact with soil, plant residues, and/or dirty water [6]. Additionally, fallen chestnuts are often subjected to vibration or friction, which can result in micro-cracks or abrasions on the shell or pericarp. These micro-wounds serve as the entry points for fungal spores, mycelium, or small insects. Damage caused by wildlife foraging further exacerbates yield and quality losses. Consequently, chestnuts must be harvested promptly after falling [7] to minimize ground exposure and risk of nut deterioration and loss, and ensure optimal product quality.

Despite these risks, on-ground chestnut harvesting still relies primarily on manual picking, which is highly labor-intensive, time-consuming, and increasingly unsustainable as orchard acreage expands. As the chestnut industry continues to grow—particularly among small and family-operated farms—producers are often faced with harvest volumes that exceed what a limited workforce can reasonably manage. To reduce labor demands, various mechanical harvesting systems have been introduced; however, these solutions only partially address the challenges of efficiency, labor dependence, and nut quality preservation. Currently available mechanical harvesting equipment mainly includes vacuum-based harvesters and mechanical sweepers, which may be trailed, mounted, or self-propelled depending on orchard conditions and operational requirements [8]. Vacuum harvesters, commonly used on uneven or sloping terrain, collect chestnuts from the orchard floor through suction pipes and deposit them into collection bags, with efficiency reaching up to approximately 900 kg/h. Mechanical sweepers gather windfallen nuts using rotating brushes and conveyor belts, with an efficiency of up to 1500 kg/h. However, if self-propelled machines are used, they are typically suitable for medium-to-large farms, usually requiring around 15–20 hectares of harvest area. If trailed or mounted machines are used, human labor is still necessary for operation.

Although these machines can reduce some manual effort, they still require continuous human involvement for driving, monitoring, and post-harvest handling, and their high acquisition costs (e.g., $50,000–$100,000 per unit) limit adoption by small-scale producers. Moreover, compared with manual picking, mechanical harvesting has been shown to increase the risk of physical damage to chestnuts, including bruising, abrasions, kernel darkening, and off-odor development [9,10]. In vacuum-based systems, for example, internal kernel damage caused by suction forces may not be immediately visible at harvest and often becomes apparent during storage, leading to increased postharvest losses and the need for additional handling or treatments [11]. While such systems are often described as “automated,” they do not eliminate labor-intensive steps such as separation, quality inspection, and rehandling, nor do they substantially reduce overall labor costs. These limitations underscore the need not merely for improved mechanization, but for a genuinely autonomous, low-cost harvesting solution that minimizes labor input while preserving nut quality.

In recent years, several studies have explored small- to medium-scale, cost-effective harvesting assistance systems for chestnut production. Kang and Guyer (2008) [2] developed and evaluated three chestnut harvester prototypes and proposed a venturi-based separation device to distinguish chestnuts from empty shells. Among them, the airlock blade system successfully picked up all the scattered material at a pickup rate of about 56 kg/h. However, the harvesting performance was inconsistent, and the system remained relatively bulky and energy-intensive. De Kleine and Guyer (2013) [12] introduced an airflow-adjustable harvesting system capable of collecting and separating chestnuts from orchard debris, thereby improving the operational efficiency of small orchards. In tests, the highest chestnut harvesting efficiency reached 88.44%, although the performance of the system was strongly affected by the nut-to-debris ratio and material feed rate. More recently, Greg Peck and colleagues at Cornell AgriTech demonstrated the Silverfox Harvester, which can collect up to 800 pounds of chestnuts per hour at a cost of less than $4000, offering an affordable option for small producers [13]. Nevertheless, this system still relies on manual operation and does not support autonomous harvesting. Overall, existing small-scale mechanical harvesting systems remain constrained by labor dependence, incomplete automation, unreliable performance, and the need for additional separation and processing steps.

To address these challenges, there is a clear need to develop a low-cost, high-precision autonomous chestnut harvesting system that incorporates vision capabilities to identify chestnuts on the orchard floor and accurately collect them without causing damage. Such a system would minimize downstream separation processes, significantly reduce labor requirements, and enable efficient, high-quality harvesting tailored for small-scale chestnut producers.

As the core component of an intelligent harvesting platform, reliable machine vision is fundamental to accurate perception, decision-making, and robotic operation. Advances in machine learning (ML) and artificial intelligence (AI)-based vision technologies have significantly accelerated progress in agricultural automation. Early applications of ML focused on crop monitoring and yield prediction [14], and have since expanded to a wide range of tasks. AI-based vision systems have greatly improved the efficiency of pest and disease identification, crop health monitoring, and phenotypic analysis, enabling rapid, data-driven management decisions without human intervention [15]. In harvesting automation, Alaaudeen et al. (2024) [16], for example, combined computer vision with robot harvesting to realize autonomous apple picking, reporting recognition success rates exceeding 95% and retry rates (the rate of retrying after a failed grasp) below 12%. Recent studies have also demonstrated the effectiveness of deep learning in chestnut-related vision tasks. Adão et al. (2019) [17] successfully classified and segmented chestnuts using convolutional neural networks (CNNs), achieving a classification accuracy of 91%. Sun et al. (2023) [18] applied semantic segmentation to aerial images for chestnut tree cover detection, obtaining an average F1 score of 86.13%. However, traditional ML and early deep learning approaches often exhibit limited generalization under the complex visual conditions encountered in real chestnut orchard environments. Challenges such as variable illumination, occlusion, orchard floor clutter, and dynamic environmental conditions frequently degrade detection accuracy as well as real-time performance, limiting their practical deployment in autonomous harvesting systems.

The first step toward automated chestnut harvesting is the development of a robust detection system capable of reliably identifying on-ground chestnuts. Such a system must effectively deal with challenges such as occlusion [19], illumination variations [20], and visually complex backgrounds, where objects such as leaves, stones, and soil share similar color and texture characteristics with chestnuts and can lead to false positives or missed detections. In recent years, YOLO (You Only Look Once) and RT-DETR (Real-Time DEtection TRansformer) models have demonstrated strong performance in agricultural target detection tasks. Mamdouh & Khattab (2021) [21] employed an improved YOLOv4-based algorithm for olive fruit fly detection, achieving a precision of 0.84, a recall of 0.97, and a mean Average Precision (mAP) of 96.68%. Liao et al. (2025) [22] proposed the YOLO-MECD model based on YOLOv11, achieving a precision of 84.4% and an mAP of 81.6%. Allmendinger et al. (2025) [23] applied the RT-DETR-l model to weed detection, achieving an average precision of 82.44% and an average recall of 66.02%. Mu et al. (2025) [24] conducted a comparative benchmark of YOLO (v8–v12) and RT-DETR (v1–v2) models for blueberry detection using a curated bush canopy dataset, achieving a maximum mAP@0.5 of 93.6% with the RT-DETRv2-X model. Despite these advances, a systematic comparative evaluation of state-of-the-art real-time detection models for on-ground chestnut detection under real orchard conditions is still lacking.

The overall objective of this study was therefore to systematically evaluate the applicability of state-of-the-art real-time object detection models for on-ground chestnut detection in orchard environments. Specifically, the objectives were to: (1) construct a labeled chestnut detection dataset that reflects real orchard conditions; (2) conduct a comprehensive quantitative comparison of representative real-time object detection models, including YOLOv11, YOLOv12, YOLOv13, as well as RT-DETRv1, RT-DETRv2, RT-DETRv3, and RT-DETRv4, in terms of detection accuracy, robustness under complex field conditions, and real-time inference performance; and (3) analyze the implications of the comparative results for the design and deployment of vision-based real-time automated chestnut harvesting systems. Both the dataset and software programs developed in this study have been made publicly available at https://github.com/AgFood-Sensing-and-Intelligence-Lab/ChestnutDetection (accessed on 15 March 2026).

2. Materials and Methods

2.1. Chestnut Dataset

The chestnut image dataset used in this study was collected in a commercial orchard (Owosso, MI, USA) during the 2024 harvest season on 5 October. Image acquisition was conducted between 9:50 am and 11:20 pm under sunny weather conditions. Images were captured by walking through the orchard while recording orchard ground scenes using a hand-held smartphone (iPhone 12, Apple Inc., Cupertino, CA, USA). Although the overall weather condition was sunny, natural illumination variability occurred due to the canopy shadows cast by chestnut trees. The data collection process also encompassed a wide range of ground conditions, including varying grass coverage and soil backgrounds. These variations in illumination and ground appearance provide a diverse and representative dataset for robust modeling.

The acquired images were manually annotated using VGG Image Annotator (VIA) (Dutta and Zisserman, 2019) [25] by drawing bounding boxes for exposed chestnuts in each image. All images were carefully labeled to capture individual chestnuts, including partially visible and occluded ones. Such meticulous annotation was critical for ensuring model robustness under challenging conditions such as occlusion, shadows, and non-uniform backgrounds.

Figure 1 shows representative examples of original images alongside the corresponding images with bounding box annotations. The final annotations were saved in JavaScript Object Notation (JSON) file format, and the consistency and accuracy of the annotations were verified through multiple rounds of review to ensure their quality. The resulting dataset consists of 319 high-resolution (4032 × 3024 pixels) color images with a total of 6524 bounding box annotations. Figure 2 shows the distribution of the number of annotated chestnut instances per image. The chestnut counts per image ranged from 2 to 155, with an average of 23; in particular, approximately 76% of the dataset contained no more than 30 chestnut instances. The chestnut dataset has been made publicly available [26].

2.2. Chestnut Detection Models

2.2.1. Real-Time Object Detectors

Deep-learning-based object detectors are typically categorized into two types: two-stage and one-stage detectors [27]. One-stage models in particular offer faster and real-time detection capabilities by eliminating the region proposal step in two-stage detectors and performing detection in a single unified process. For automated chestnut harvesting, rapid and reliable real-time detection is crucial to achieve efficient harvest operations, for which one-stage detectors are more practically appropriate. Among the most widely used real-time object detection frameworks are the YOLO and RT-DETR model families, which are briefly presented below.

YOLO Series

YOLO is a real-time object detection framework designed to perform detection in a single forward pass of a neural network. The first version, YOLOv1 [28], divides the input image into a grid and makes each grid cell responsible for predicting bounding boxes and class probabilities for objects whose centers fall within that cell. However, early versions struggled to detect small objects because of the coarse grid and had lower localization accuracy than region-based methods. Since then, YOLO has undergone substantial architectural and methodological improvements to address these limitations.

YOLOv11, which was introduced in 2024 by Ultralytics [29], is a comprehensive update to the classic backbone–neck–head paradigm. The incorporation of the Cross-Stage Partial with Spatial Attention (C2PSA) module enables the model to more effectively capture contextual information across multiple layers, thereby improving the accuracy of object detection, especially for small objects. YOLOv12 [30] introduces the Area Attention (A2) module, which maintains a large receptive field while drastically reducing computational complexity, allowing the model to enhance speed without compromising accuracy. YOLOv13 [31] proposes a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism, which adaptively exploits latent high-order correlations through hypergraph computation, overcoming the limitation of pairwise correlation modeling and enabling efficient global cross-location and cross-scale feature fusion and enhancement.

RT-DETR Series

DETR (DEtection TRansformer) represents a paradigm shift in object detection by introducing transformer architectures to model global spatial relationships across an image. Unlike traditional detectors, DETR eliminates the need for anchor boxes and non-maximum suppression, simplifying the detection pipeline. DETR’s ability to capture long-range dependencies makes it particularly effective in detecting objects in challenging environments. Recent advancements have improved its convergence speed and performance on small-object detection tasks [32], further expanding its applicability to real-world tasks.

RT-DETRv1 [33] is the first real-time variant of DETR specifically designed to overcome the high computation cost and low inference speed of the original architecture. It achieves real-time end-to-end object detection by replacing the vanilla Transformer encoder with an efficient hybrid encoder. RT-DETR introduces attention-based intra-scale feature interaction (AIFI) to focus on refining features within the same scale, and CNN-based cross-scale feature fusion to integrate multi-scale information.

RT-DETRv2 [34] improves the flexibility and practicality of RT-DETR by introducing scale-adaptive sampling in the deformable attention module, enabling features at varying scales to be extracted more efficiently. The model also introduces a mask generator to enhance query diversity through perturbation masks applied in the self-attention step. This reduces redundancies and enhances query–ground truth matching. In terms of optimization, RT-DETRv2 combines L1 loss, generalized IoU (GIoU) loss, and varifocal loss (VFL) to balance regression and classification tasks. These strategies jointly improve detection accuracy and computational efficiency.

RT-DETRv3 [35] further improves detection performance by introducing a CNN-based one-to-many label assignment auxiliary head, jointly optimized with the main detection branch to strengthen the encoder’s representations. To effectively address the problem of sparse supervision in object detection, RT-DETRv3 proposes a learning strategy with self-attention perturbations to enhance decoder supervision by diversifying label assignments across multiple query groups. These methods significantly improve model performance and accelerate convergence without increasing inference latency.

Very recently, RT-DETRv4 [36] was proposed to advance real-time detection by addressing the representational bottleneck of lightweight architectures while fully preserving inference efficiency. It introduces a training-time semantic distillation framework that leverages the representational power of advanced Vision Foundation Models (VFMs) without modifying the inference-time architecture. To enable stable and task-aligned semantic transfer, a Deep Semantic Injector (DSI) that integrates high-level semantic representations from VFMs is incorporated into the deep layers of the detector, enriching encoder features. In addition, a Gradient-guided Adaptive Modulation (GAM) strategy is designed to dynamically regulate the strength of semantic injection based on gradient norm ratios, harmonizing semantic distillation with detection optimization. These innovations allow RT-DETRv4 to achieve substantial performance gains with no additional inference latency or deployment cost, offering a new state-of-the-art balance between accuracy and speed.

2.2.2. Model Selection and Configuration

This study selected YOLOv11, YOLOv12, and YOLOv13, as well as RT-DETRv1, RT-DETRv2, RT-DETRv3, and RT-DETRv4, based on their demonstrated performance across different real-time vision tasks [21,24,37], to develop chestnut detection models using the constructed orchard dataset. All detectors were implemented using open-source software packages provided by their respective developers, as summarized in Table 1, and were adapted and retrained for the chestnut detection task in this study. To comprehensively evaluate model performance across capacity levels and to provide deployment flexibility under varying onboard computing resource constraints, multiple architecture variants were considered. Specifically, five variants of YOLOv11 and YOLOv12 (nano, small, medium, large, and extra-large) and four variants of YOLOv13 (nano, small, large, and extra-large) were tested, resulting in a total of 14 YOLO model variants evaluated in this study. Similarly, four backbone variants (R18, R34, R50, and R101) were evaluated for each of RT-DETRv1, RT-DETRv2, and RT-DETRv4, along with three backbone variants (R18, R34, and R50) for RT-DETRv3, yielding a total of 15 RT-DETR model variants.

2.2.3. Experimentation

Figure 3 illustrates the overall modeling process for chestnut detection. The original image annotation data (JSON format) provided by the VIA tool was first converted into YOLO-format labels for training YOLOv11, YOLOv12, and YOLOv13. The YOLO-format annotations were then converted to the corresponding COCO data format to ensure compatibility with RT-DETR models. Following the format conversion, the dataset was randomly partitioned into training, validation, and test subsets with a split ratio of 70%, 10%, and 20%, corresponding to 223, 31, and 65 images, respectively.
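The box-format conversions described above can be sketched as follows. This is a minimal illustration and not the conversion script used in the study: the function names are hypothetical, a single class index of 0 is assumed for chestnuts, and the input is assumed to be a VIA rectangle given by its top-left corner, width, and height in pixels.

```python
def via_rect_to_yolo(rect, img_w, img_h, class_id=0):
    """Convert a VIA rectangle (top-left x, y, width, height in pixels)
    to a YOLO label tuple (class, cx, cy, w, h) normalized to [0, 1]."""
    cx = (rect["x"] + rect["width"] / 2) / img_w
    cy = (rect["y"] + rect["height"] / 2) / img_h
    w = rect["width"] / img_w
    h = rect["height"] / img_h
    return (class_id, cx, cy, w, h)


def yolo_to_coco_bbox(yolo_box, img_w, img_h):
    """Convert a YOLO label tuple back to a COCO bbox
    [x_min, y_min, width, height] in pixels."""
    _, cx, cy, w, h = yolo_box
    box_w, box_h = w * img_w, h * img_h
    return [cx * img_w - box_w / 2, cy * img_h - box_h / 2, box_w, box_h]


# Example with the image resolution reported for the dataset (4032 x 3024).
rect = {"x": 1008, "y": 756, "width": 2016, "height": 1512}
yolo_box = via_rect_to_yolo(rect, 4032, 3024)
coco_box = yolo_to_coco_bbox(yolo_box, 4032, 3024)
```

Because YOLO labels are normalized by image size, they stay valid after resizing as long as the aspect-ratio handling is uniform, whereas COCO boxes are stored in absolute pixels.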

Model training was conducted using the official implementations released on GitHub, with Python 3.11 on a workstation equipped with eight NVIDIA RTX 4090 GPUs (24 GB each) and a 32-core Intel Xeon Platinum 8481C CPU. YOLOv11 used version v8.3.235, YOLOv12 used version v1.0, and all other models used the latest version from the main branches of their respective repositories. For the YOLO-based experiments, five variants of YOLOv11 and YOLOv12 (nano, small, medium, large, and extra-large) and four variants of YOLOv13 (nano, small, large, and extra-large) were trained for 200 epochs using the stochastic gradient descent (SGD) optimizer. The initial learning rate was set to 0.01 and adjusted through cosine learning rate scheduling, with a weight decay of 0.0005. Training included a warmup phase of 3 epochs, and the intersection-over-union (IoU) threshold was set to 0.7 for bounding box predictions. The batch size was 8, with 12 data-loading workers, and Automatic Mixed Precision (AMP) training was employed to improve computational efficiency and reduce GPU memory consumption.
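To make the learning-rate settings concrete, the sketch below computes a per-epoch learning rate from the reported values (initial rate 0.01, 200 epochs, 3 warmup epochs, cosine scheduling). It is an illustration under stated assumptions and not the scheduler of any particular framework: the linear per-epoch warmup and the final learning-rate fraction `final_frac` are assumed here, and the actual implementations may warm up per iteration or decay to a different floor.

```python
import math


def lr_at_epoch(epoch, total_epochs=200, lr0=0.01,
                warmup_epochs=3.0, final_frac=0.01):
    """Learning rate at a given epoch under cosine decay with linear warmup.

    final_frac (assumed, not from the paper) is the fraction of lr0 that
    remains at the end of training.
    """
    # Cosine factor goes smoothly from 1 (epoch 0) to 0 (last epoch).
    cosine = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    lr = lr0 * (final_frac + (1 - final_frac) * cosine)
    # Assumed linear warmup: ramp the rate up over the first warmup epochs.
    warmup_scale = min(1.0, (epoch + 1) / warmup_epochs)
    return lr * warmup_scale


schedule = [lr_at_epoch(e) for e in range(200)]
```

Under these assumptions the rate rises during the first three epochs, stays close to 0.01 early in training, and falls to roughly half of its initial value by epoch 100.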

Since the original dataset images have a high resolution of 4032 × 3024 pixels, the choice of input size requires balancing detection accuracy and computational efficiency. A larger input size preserves more image details, leading to better performance in small-object detection; however, it also significantly increases computation, slows down training and inference, and consumes more GPU memory. Considering both accuracy requirements and available hardware resources, all images were empirically resized to 1024 × 1024 pixels. Data augmentation followed the models’ default automatic augmentation strategies, such as random flipping and scaling. Model outputs, consisting of confidence scores and bounding boxes, were stored in JSON format for subsequent evaluation. Standard metrics in terms of detection accuracy, model complexity, and inference time (Section 2.3) were employed for model performance assessment, consistent with common evaluation practices in object detection studies [37,38].

For the RT-DETR-based experiments, four backbone variants (R18, R34, R50, and R101) were evaluated for RT-DETRv1, RT-DETRv2, and RT-DETRv4, along with three backbone variants (R18, R34, and R50) for RT-DETRv3. All models were fine-tuned from the official pretrained weights and trained for 200 epochs with the SGD optimizer. The image size and number of workers were kept consistent with the YOLO experiments. The batch size was set to 4, the initial learning rate was set to 0.0005, the weight decay was set to 0.0001, and the maximum gradient clipping norm was set to 0.5, while all other settings followed the default configurations. The same evaluation metrics as described in Section 2.3 were used for performance comparison.

Because the dataset is small and partitioned at random, a model trained and tested on a single partition can give a biased performance estimate; if a particular partition happens to be favorable, the results may be overly optimistic. To improve the credibility of model performance evaluation, this study employed Monte Carlo cross-validation (CV) (also known as repeated holdout CV) with five replicates, as done in [38]. Specifically, the dataset was randomly split into the three subsets five times using different random seeds, and model performance was evaluated on the test data for each replicate. Final performance metrics were computed as the average across all five replicates (Figure 3). Unlike traditional K-fold cross-validation, which divides the dataset only once, this Monte Carlo approach draws a new random split for each replicate and thus accounts for the variability introduced by different data splits.
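The repeated-holdout procedure can be sketched as follows. This is a minimal illustration, not the study's code: `split_indices`, `monte_carlo_cv`, and the `train_and_score` callback are hypothetical names, and the seeds are arbitrary. Truncating the 70% and 10% fractions of 319 images and assigning the remainder to the test subset reproduces the reported 223/31/65 partition.

```python
import random
from statistics import mean, stdev


def split_indices(n_images, seed, train_frac=0.7, val_frac=0.1):
    """Randomly partition image indices into train/val/test subsets."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)   # reproducible shuffle per seed
    n_train = int(n_images * train_frac)
    n_val = int(n_images * val_frac)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])     # remainder goes to the test subset


def monte_carlo_cv(n_images, seeds, train_and_score):
    """Run one train/evaluate cycle per seed and summarize the test scores.

    train_and_score(train, val, test) is a user-supplied (hypothetical)
    callback returning a single test metric such as mAP@0.5.
    """
    scores = []
    for seed in seeds:
        train, val, test = split_indices(n_images, seed)
        scores.append(train_and_score(train, val, test))
    return mean(scores), stdev(scores)


train, val, test = split_indices(319, seed=0)
print(len(train), len(val), len(test))
```

Reporting the standard deviation alongside the mean indicates how sensitive a model is to the particular split.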

2.3. Performance Evaluation Metrics

The performance of YOLO and RT-DETR object detectors in chestnut detection was evaluated in terms of detection accuracy, model complexity, and inference times [37], which are described below.

2.3.1. Detection Accuracy

The detection accuracy of trained models was assessed on test data using standard object detection metrics, including precision, recall, mAP@0.5, and mAP@[0.5:0.95]. Among the metrics, mAP@0.5 is commonly used as the primary indicator of object detection performance, while mAP@[0.5:0.95] provides a more comprehensive assessment across varying localization strictness. The IoU quantifies the overlap between a predicted bounding box and the corresponding ground-truth box and is defined as the ratio of the area of their intersection to the area of their union. In this study, precision and recall were computed using an IoU threshold of 0.7 to reflect stricter localization requirements, whereas mAP@0.5 and mAP@[0.5:0.95] were calculated following the standard COCO protocol with IoU thresholds of 0.5 and 0.5:0.05:0.95.
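The IoU definition above translates directly into code. The sketch below assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max); the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two equal boxes shifted by half their width overlap with IoU = 1/3,
# which would fail both the 0.5 and the 0.7 thresholds.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```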

For a given object class, detection outcomes are defined as follows:

True Positive (TP): a predicted bounding box that correctly matches a ground-truth object of the same class with an IoU greater than or equal to 0.7.

False Positive (FP): a predicted bounding box that does not correspond to any ground-truth object, is assigned to an incorrect class, or has an IoU with the matched ground-truth object below the threshold.

False Negative (FN): a ground-truth object that is not detected by the model.

True Negative (TN): background regions correctly identified as non-object areas (typically not explicitly used in object detection metric calculations).

Based on these definitions, precision (P) and recall (R) are computed as:

P = \frac{TP}{TP + FP} \qquad (1)

R = \frac{TP}{TP + FN} \qquad (2)

The average precision (AP) for a given class is defined as the area under the precision–recall (P–R) curve:

AP = \int_{0}^{1} P(R) \, dR \qquad (3)

The mean average precision (mAP) is then computed as the mean of AP values across all object classes:

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \qquad (4)
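Equations (1)–(4) can be illustrated with a small single-class, single-image sketch. It is a simplified stand-in for the COCO evaluation tooling, not the evaluation code used in the study: the greedy score-ordered matching, the all-point interpolation of the P–R curve, and all function names are assumptions made here for illustration. With a single class (N = 1), mAP equals AP; mAP@[0.5:0.95] would average the AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_thr=0.7):
    """Greedy matching: visit predictions by descending score and mark each
    as TP (True) if it claims an unmatched ground truth with IoU >= iou_thr."""
    flags, matched = [], set()
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            value = iou(box, gt)
            if j not in matched and value > best_iou:
                best_iou, best_j = value, j
        is_tp = best_j >= 0 and best_iou >= iou_thr
        if is_tp:
            matched.add(best_j)
        flags.append(is_tp)
    return flags


def precision_recall(flags, n_gt):
    """Equations (1) and (2): P = TP/(TP+FP), R = TP/(TP+FN)."""
    tp = sum(flags)
    fp, fn = len(flags) - tp, n_gt - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(flags, n_gt):
    """Equation (3): area under the P-R curve, using the monotone
    precision envelope (all-point interpolation)."""
    recalls, precisions, tp = [0.0], [1.0], 0
    for k, is_tp in enumerate(flags, start=1):
        tp += is_tp
        recalls.append(tp / n_gt)
        precisions.append(tp / k)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum((recalls[i] - recalls[i - 1]) * precisions[i]
               for i in range(1, len(recalls)))


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
flags = match_detections(preds, gts, iou_thr=0.7)
print(precision_recall(flags, len(gts)), average_precision(flags, len(gts)))
```

For a full test set, the TP/FP flags from all images would be pooled and sorted by confidence before computing AP.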

2.3.2. Number of Model Parameters

The number of model parameters directly correlates with model complexity, which is a critical factor for real-world deployment, especially in resource-constrained environments. Larger models, with more parameters, tend to require more memory and higher computational power during inference, resulting in increased processing times. While these models may offer improved accuracy, the trade-off is often greater resource consumption, leading to slower inference times. This can significantly impact their suitability for real-time applications and embedded devices, where quick decision-making is crucial. Therefore, for embedded systems with limited resources, it is important to balance accuracy with computational efficiency.

2.3.3. Computation Cost and Inference Time

Floating-point operations (FLOPs) are commonly used to measure the computational complexity of a model, representing the number of arithmetic operations required to process a single input. In this study, computational complexity is reported in terms of GFLOPs (GigaFLOPs), which is a standard unit for large-scale deep learning models. Additionally, inference time, defined as the time required for a trained model to generate predictions for an input image, is a critical metric for real-time detection applications. The inference time for each YOLO and RT-DETR detector was computed as the average prediction time across all images in the test set using the same hardware described above.
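The inference-time measurement can be sketched as a simple wall-clock average over the test images. This is an illustration under stated assumptions and not the study's benchmarking code: `predict` stands for any callable that runs one image through a trained detector, and the untimed warmup passes are an assumption added here to exclude one-off initialization costs.

```python
import time


def mean_inference_time_ms(predict, images, warmup=3):
    """Average per-image prediction time in milliseconds.

    predict: hypothetical callable running one image through a detector.
    warmup:  number of untimed passes (assumed, not from the paper).
    """
    for img in images[:warmup]:
        predict(img)
    start = time.perf_counter()
    for img in images:
        predict(img)
    # Note: with GPU inference, the device would need to be synchronized
    # before reading the clock so that queued work is included.
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


# Usage with a stand-in "model" that just sums pixel values.
dummy_images = [[0.0] * 1000 for _ in range(65)]
print(mean_inference_time_ms(sum, dummy_images))
```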

3. Results

3.1. YOLO Results

Figure 4 presents the training curves for mAP@0.5 and mAP@[0.5:0.95] across all evaluated YOLO model variants. All architectures exhibited rapid feature learning capabilities during the early training stages. Specifically, within the first 100 epochs, YOLOv11 achieved over 90% mAP@0.5, whereas YOLOv12 and YOLOv13 reached approximately 75% and 65%, respectively. Across all three model series, mAP@[0.5:0.95] exceeded 65% within the same training window. After approximately 150 training epochs, the detection accuracy across all variants converged and stabilized, indicating that the 200-epoch training schedule adopted was sufficient to achieve stable convergence. This rapid convergence demonstrates the ability of these models to effectively adapt to the dataset for chestnut detection, despite challenging ground-level orchard conditions involving shadows, occlusions, and background clutter.

To further examine training dynamics and potential overfitting, Figure 5 shows the training curves for the training box loss and validation box loss of the YOLO models. As shown in the figure, the loss functions of all models decreased sharply within the first 20 epochs, indicating that the models could quickly learn representative features. Thereafter, the loss values gradually decreased and stabilized, indicating that the models gradually converged during training. Importantly, the validation set loss and training loss showed similar trends with no significant deviation, indicating that overfitting did not occur in the later stages of training. We continuously monitored the validation performance of the models using validation loss and mAP metrics. The stabilization of the loss curves was consistent with the improvement in mAP performance, confirming that the models maintained good generalization ability. Although the loss curves began to stabilize around 100–150 epochs, we continued training for 200 epochs to ensure that all models fully converged and achieved stable performance.

Table 2 summarizes the detection performance of all YOLO models on the test dataset. Overall, all three YOLO families achieved competitive performance, with the accuracy generally improving as the model scale increased. Across all variants, mAP@0.5 ranged from 89.8% (YOLOv13-n) to 95.1% (YOLOv12-m), while mAP@[0.5:0.95] ranged from 60.5% (YOLOv13-n) to 80.1% (YOLOv11-x).

Although the results in Table 2 show performance differences among YOLOv11, YOLOv12, and YOLOv13, the standard deviations of some metrics overlap. To determine whether these differences are statistically significant rather than the result of random fluctuations, a one-way analysis of variance (ANOVA) followed by Fisher’s Least Significant Difference (LSD) multiple comparison tests at the significance level of α = 0.05 was performed. The ANOVA results revealed significant differences among the three model series for both evaluation metrics, including mAP@0.5 (F = 7.59, p = 0.008) and mAP@[0.5:0.95] (F = 21.79, p < 0.001). Subsequent multiple comparison analysis further showed that all pairwise comparisons among YOLOv11, YOLOv12, and YOLOv13 exhibited statistically significant differences (p < 0.05). These results indicate that the observed performance differences among the three model families are statistically meaningful and unlikely to be caused by random variation. Table 3 presents the results of multiple comparisons of mAP performance for YOLOv11, YOLOv12, and YOLOv13 based on one-way ANOVA and LSD tests. The models are labeled with different letters (a, b, c). These letters indicate the statistical significance of pairwise comparisons: models with the same letter show no significant difference, while models with different letters show statistically significant differences.
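For readers unfamiliar with the procedure, the following sketch shows how a one-way ANOVA F statistic and Fisher's LSD threshold can be computed. It is an illustration under stated assumptions, not the authors' analysis script: the function names are introduced here, the sample values are placeholders rather than the measurements in Table 2, and the critical t value must be supplied from a t table for the within-group degrees of freedom. In practice a statistics library (for example, `scipy.stats.f_oneway`) would typically be used to obtain the F statistic and its p-value.

```python
import math


def one_way_anova(groups):
    """Return (F, MS_within, df_between, df_within) for a list of sample groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_within = ss_within / df_within
    f_stat = (ss_between / df_between) / ms_within
    return f_stat, ms_within, df_between, df_within


def lsd_threshold(ms_within, n_i, n_j, t_crit):
    """Fisher's LSD: two group means differ if their gap exceeds this value."""
    return t_crit * math.sqrt(ms_within * (1.0 / n_i + 1.0 / n_j))


# Placeholder scores for three hypothetical groups (NOT the paper's data).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, ms_within, df_b, df_w = one_way_anova(groups)
# 2.447 is the two-sided critical t value at alpha = 0.05 for 6 degrees of freedom.
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, LSD = {lsd_threshold(ms_within, 3, 3, 2.447):.3f}")
```

Groups whose mean difference stays below the LSD threshold would share a letter in a table such as Table 3.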

The YOLOv12-m model achieved the highest mAP@0.5 (95.1%) and the highest recall (89.3%) while maintaining a high precision of 92.9%. These results demonstrate that YOLOv12-m has strong detection accuracy and achieves a good balance between precision and recall. Figure 6 shows an example of YOLOv12-m detection results under complex lighting and severe occlusion. In this image, the model correctly detected 47 out of 49 chestnuts with only one false positive, corresponding to a precision (P) of 97.9% and a recall (R) of 95.9%. In contrast, YOLOv11-x attained the highest mAP@[0.5:0.95] (80.1%) together with a high precision of 95.3% and a recall of 88.9%, suggesting superior bounding-box localization accuracy and robustness across varying IoU thresholds, which are important for reducing false positives in practical applications. Among the three model families, the YOLOv11 models consistently performed well in terms of precision and recall, with all variants achieving a precision exceeding 94% and a mean precision of 95.3%. Compared to YOLOv11, YOLOv12 showed a slight decrease in mean precision, but its recall remained similar (87.6%). Notably, YOLOv12 exhibited significant improvements in recall for the medium, large, and extra-large variants, indicating that its attention-based architectural enhancements, such as efficient attention and improved feature aggregation, help recover more true positives under challenging visual conditions. These architectural refinements in YOLOv12 were designed to better capture salient features without sacrificing real-time performance.
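The per-image precision and recall quoted for Figure 6 follow directly from the detection counts. A short worked check is given below; the helper name is introduced here for illustration only.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Figure 6 example from the text: 47 of 49 chestnuts found, one false positive.
p, r = precision_recall(tp=47, fp=1, fn=49 - 47)
print(f"P = {p:.1%}, R = {r:.1%}")  # prints "P = 97.9%, R = 95.9%"
```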

To examine the performance differences between YOLOv11-x and YOLOv12-m, we conducted a confusion matrix-based analysis at different IoU thresholds. As shown in Table 4, both models exhibited high true-positive (TP) rates at IoU = 0.50. YOLOv11-x had a TP rate of 0.85, a false-positive (FP) rate of 0.05, and a false-negative (FN) rate of 0.10, while YOLOv12-m had a TP rate of 0.84, an FP rate of 0.06, and an FN rate of 0.10. These results indicate that the detection capabilities of the two models are comparable when the IoU threshold is relaxed, which is consistent with YOLOv12-m achieving the highest mAP@0.5. However, the differences became more pronounced when the IoU threshold was increased to 0.75. YOLOv11-x maintained a relatively high TP rate (0.73), with moderate FP (0.08) and FN (0.19) rates. In contrast, YOLOv12-m showed a marked decrease in TP rate (0.65) and marked increases in FP rate (0.14) and FN rate (0.21). This indicates that under stricter IoU criteria, YOLOv12-m is more prone to localization-related errors and background-induced false detections, especially in cluttered scenes or when parts of the chestnut are occluded.
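The effect of the IoU threshold on TP, FP, and FN counts can be illustrated with a small matching routine. The sketch below assumes greedy one-to-one matching in descending confidence order, a common convention that is not confirmed as the exact procedure behind Table 4; the function names and toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truth, iou_threshold):
    """Return (TP, FP, FN); predictions are (box, confidence) pairs."""
    unmatched = list(ground_truth)
    tp = fp = 0
    # Higher-confidence predictions claim ground-truth boxes first.
    for box, _conf in sorted(predictions, key=lambda p: p[1], reverse=True):
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


# Toy scene: one exact box, one loosely localized box (IoU about 0.68), one spurious box.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.8), ((50, 50, 60, 60), 0.7)]
for threshold in (0.50, 0.75):
    print(threshold, count_matches(preds, truth, threshold))
```

In this toy case the loosely localized box counts as a true positive at IoU = 0.50 but becomes both a false positive and a false negative at IoU = 0.75, which is why stricter thresholds raise both error rates at once.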

Although YOLOv12 slightly outperformed YOLOv11 in terms of mAP@0.5, its mean mAP@[0.5:0.95] (71.5%) was significantly lower than that of YOLOv11 (78.4%), indicating weaker performance under stricter localization criteria. In contrast, YOLOv13 performed poorly across all scales, particularly on the mAP@[0.5:0.95] metric, where its average score was only 64.86%. Even its best-performing variant (YOLOv13-s: precision = 92.1%, recall = 84.0%, mAP@0.5 = 92.3%, mAP@[0.5:0.95] = 66.4%) lagged behind both YOLOv11 and YOLOv12. From an architectural perspective, this discrepancy may be related to YOLOv13’s emphasis on global feature correlation mechanisms such as hypergraph-based adaptive correlation enhancement and full-pipeline feature distribution, which are designed to capture high-order relationships across the entire image space. While such mechanisms have shown benefits on certain benchmarks, they may be less effective for densely packed, small single-class target detection under complex lighting and severe occlusion, where the preservation of local fine-grained spatial details is critical for stringent bounding-box localization, especially at higher IoU thresholds.

Figure 7 shows the precision–recall (PR) curves for different YOLO model variants on the dataset. Overall, the PR curves for the YOLOv11 series consistently lie near the upper right corner of the figure. YOLOv11 maintains high precision even at lower recall levels, resulting in a relatively gentle initial decline in the curve. As recall increases further, precision begins to decline more rapidly, leading to a steeper slope in the later stages of the curve. In contrast, the YOLOv12 series exhibits a different trend. Its PR curve shows a more pronounced decrease in precision at lower recall levels, indicating that precision declines earlier as the confidence threshold is relaxed. However, at higher recall levels, the decline in precision becomes slower, suggesting improved stability of precision when detecting more targets. The YOLOv13 models perform relatively weakly: their PR curves typically fall below the YOLOv11 and YOLOv12 curves for most of the recall range. This indicates that, at similar recall rates, YOLOv13 has lower precision, resulting in a smaller area under its PR curve. The trends of the curves are consistent with the data in Table 2.

Figure 8 illustrates the relationship between model complexity and computational performance, showing the trends of GFLOPs and inference time versus the number of model parameters. As the model scale increases, both inference time and GFLOPs exhibit an upward trend. Across corresponding scales, the three YOLO families demonstrated comparable computational complexity, with YOLOv11-x, YOLOv12-x, and YOLOv13-x all approaching 200 GFLOPs. Although these larger models incurred higher computational costs, their inference time remained below 50 ms, corresponding to frame rates exceeding 20 FPS (frames per second). This performance still meets the basic real-time detection requirements of ground-based chestnut harvesting systems.

YOLOv11 models exhibited the most favorable balance between accuracy and computational efficiency. YOLOv11-n achieved the fastest inference time of 5.6 ms, followed by YOLOv11-s at 8.3 ms. Combined with its precision of 95.5% and mAP@0.5 of 93.45%, this speed makes YOLOv11-s particularly attractive for embedded and edge device-based applications. YOLOv12-m achieved an inference time of 11.7 ms while delivering the highest mAP@0.5 of 95.1%, supporting its feasibility for real-time deployment. In contrast, YOLOv13-x exhibited the highest computational load (198.7 GFLOPs) and the longest inference time (47.8 ms), rendering it the least suitable for real-time deployment.

3.2. RT-DETR Results

Figure 9 presents the training curves of mAP@0.5 and mAP@[0.5:0.95] for all the RT-DETR (v1–v4) models. All the model variants achieved mAP@0.5 values exceeding 80% and mAP@[0.5:0.95] values above 65% within the first 60 training epochs, followed by performance stabilization after approximately 100 epochs. These trends suggest effective learning and convergence across all models and confirm that the 200-epoch training schedule adopted in this study was sufficient to achieve stable convergence.

To further analyze the training dynamics and assess potential overfitting, Figure 10 shows the training box loss and validation box loss curves for the RT-DETRv1, RT-DETRv2, RT-DETRv3, and RT-DETRv4 models. As shown, both the training loss and validation loss decreased rapidly within the first 10 epochs, indicating that the models could quickly learn representative features. Thereafter, the rate of decrease slowed and the losses eventually stabilized as training progressed. The validation loss followed a trend similar to that of the training loss, with no significant deviation, indicating no evident overfitting.

Table 5 summarizes the detection performance of the evaluated RT-DETR variants. Similar to the YOLO results, detection accuracy generally improves with increasing model size. RT-DETRv2 significantly outperforms RT-DETRv1 across all evaluation metrics, demonstrating the effectiveness of its architectural improvements. In contrast, the precision and recall of RT-DETRv3 are slightly lower than those of RT-DETRv1, despite a minor improvement in mAP.

Under the same training conditions and dataset, RT-DETRv4 performed worse than the earlier variants. RT-DETRv4 leverages a vision foundation model (VFM)-based semantic distillation framework, which enriches feature representations by injecting high-level semantics from large pre-trained models into the detector during training. While this strategy has been shown to improve overall performance on large, diverse benchmark datasets such as COCO, existing research on dense object detection indicates that semantic knowledge alone is often insufficient to achieve optimal localization performance. In particular, Zheng et al. (2022) [39] pointed out that methods emphasizing global semantic alignment may not fully capture the fine-grained spatial cues required for accurate bounding box regression, and that for dense object detection on benchmark datasets such as COCO, localization knowledge distillation usually brings larger improvements than semantic feature imitation. The chestnut dataset mainly consists of small, dense, and partially occluded targets, with limited diversity of training samples. Localization-related features are crucial for stringent localization metrics such as mAP@[0.5:0.95], and in this experiment the mAP@[0.5:0.95] of RT-DETRv4 was significantly lower than that of the other models. A plausible explanation is that the semantic distillation design of RT-DETRv4 prioritizes global semantic alignment over localization-related features, so that the additional semantic supervision shifts the learning focus away from the precise localization features required for small-object detection, resulting in performance inferior to earlier RT-DETR variants.

Among all RT-DETR models, RT-DETRv2-R101 achieved the best overall performance (precision = 95.1%, recall = 86.3%, mAP@0.5 = 91.1%, mAP@[0.5:0.95] = 71.9%). Its precision and recall are comparable to those of the best-performing YOLOv11-series models, highlighting its strong detection capability.

Figure 11 illustrates the relationship among the model size (number of parameters), computational complexity (GFLOPs), and inference time for all evaluated RT-DETR models. Consistent with the trends observed in the YOLO experiments, both inference time and computational cost generally increase with model size. Compared to RT-DETRv1, v2, and v3, the RT-DETRv4 architecture substantially reduces the number of parameters across all backbone scales. Specifically, RT-DETRv4-R18 contains only 10 million parameters, representing a 50% reduction relative to the 20 million parameters used in the corresponding R18 variants of v1–v3. At larger scales, RT-DETRv4-R34 and RT-DETRv4-R50 reduce parameter counts from 31 million and 42 million to 19 million and 31 million, respectively. Even at its largest scale, RT-DETRv4-R101 uses only 62 million parameters, approximately 18% fewer than the 76 million parameters required by the v1 to v3 versions. These reductions in model size translated directly to improved inference efficiency.
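The relative reductions quoted above can be verified from the stated parameter counts. The snippet below is a simple arithmetic check using only the figures given in the text; the variable and function names are introduced here for illustration.

```python
# Parameter counts in millions, as quoted in the text: (v1-v3, v4).
params_millions = {
    "R18": (20, 10),
    "R34": (31, 19),
    "R50": (42, 31),
    "R101": (76, 62),
}


def relative_reduction(before, after):
    """Percentage of parameters removed when going from `before` to `after`."""
    return (before - after) / before * 100.0


for backbone, (before, after) in params_millions.items():
    pct = relative_reduction(before, after)
    print(f"{backbone}: {before}M -> {after}M ({pct:.1f}% fewer)")
```

The check gives reductions of 50.0%, 38.7%, 26.2%, and 18.4% for the R18, R34, R50, and R101 backbones, respectively, so the relative saving shrinks as the backbone grows.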

Among all RT-DETR variants, RT-DETRv4-R18 achieved the fastest inference time (23.7 ms), substantially outperforming other models using the same backbone network, including RT-DETRv1-R18 (72.1 ms), RT-DETRv2-R18 (47.4 ms), and RT-DETRv3-R18 (52.7 ms). In contrast, the model achieving the highest mAP@0.5, RT-DETRv2-R101, exhibited an inference time of 66.3 ms. Although all RT-DETR models satisfied basic real-time requirements, their overall inference speed remained slower than that of the YOLO-based detectors evaluated in this study.

3.3. YOLO vs. RT-DETR

Figure 12 illustrates the trade-off between detection accuracy (mAP@0.5 and mAP@[0.5:0.95]) and inference time for the evaluated YOLO and RT-DETR models. As noted above, the RT-DETR models, especially v1–v3, exhibited substantially slower inference speeds than the YOLO-based detectors. Although RT-DETR variants achieved mAP@0.5 values above 85%, their overall performance across both mAP metrics remained inferior to that of the best-performing YOLO models. When inference time and precision are jointly considered, the YOLOv11 family demonstrates the most favorable balance for real-time chestnut detection, providing consistently high accuracy with significantly lower latency, and thus representing the most practical choice for deployment in harvesting systems. Figure 13 presents representative detection results from RT-DETRv2-R101, the strongest RT-DETR variant, applied to the same test image shown in Figure 6. In this case, RT-DETRv2-R101 detected 44 out of 49 chestnuts with a precision of 95.1%, which is lower than the 97.9% achieved by YOLOv12m under identical conditions.

Across multiple evaluation metrics, the performance gap between RT-DETR and YOLOv11/YOLOv12 likely reflects architectural differences between the two frameworks. RT-DETR models are designed to capture global contextual information and long-range spatial dependencies, which can be advantageous for complex scene understanding. However, this design can limit their ability to preserve the fine-grained spatial features required for the precise localization of small, densely distributed, and partially occluded chestnuts. In contrast, YOLO architectures emphasize hierarchical local feature extraction and multi-scale feature fusion, enabling more robust performance in complex, ground-level orchard environments. Overall, these results suggest that while RT-DETR may be advantageous for certain complex tasks, YOLO models are more appropriate for chestnut detection applications that demand both high accuracy and real-time processing capability.

3.4. Dynamic Video Stream Detection Results

Figure 14 shows the detection results of YOLOv12-m on a video stream, displaying two frames extracted from the video footage. The video was filmed during the same orchard visit as the chestnut image collection at a commercial orchard (Owosso, Michigan) using the same handheld smartphone (iPhone 12, Apple Inc., Cupertino, CA, USA). These two frames demonstrate the detection performance under dynamic conditions (with camera shake and varying lighting).

In the left image, with stable lighting conditions, the model correctly detected all 10 chestnuts in the scene with no false positives, achieving a precision and recall of 1.0. This indicates excellent detection accuracy under stable and interference-free lighting conditions. In the right image, with complex lighting conditions, the model correctly detected 8 chestnuts, missed 1 (false negative), and produced 4 false positives, giving a precision of 0.67 and a recall of 0.89. Despite camera shake and changing lighting conditions, the model still detected most chestnuts, but the false positives and false negatives highlight the challenges posed by dynamic lighting conditions.

The detection process involved an average of 4.2 ms for preprocessing, 10.7 ms for inference, and 1.1 ms for postprocessing per frame. These times contribute to the overall processing time per frame and reflect the computational efficiency of the model in real-time video stream detection.
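Summing the three stages gives the end-to-end latency per frame and the corresponding throughput. The short calculation below assumes the stages run sequentially without overlap and excludes video decoding and display, neither of which is specified in the text.

```python
# Per-frame stage timings reported above, in milliseconds.
stage_ms = {"preprocess": 4.2, "inference": 10.7, "postprocess": 1.1}


def frames_per_second(latency_ms):
    """Throughput implied by a per-frame latency, assuming sequential processing."""
    return 1000.0 / latency_ms


total_ms = sum(stage_ms.values())
print(f"total = {total_ms:.1f} ms per frame, about {frames_per_second(total_ms):.1f} FPS")
```

Under these assumptions the pipeline takes about 16.0 ms per frame, or roughly 62.5 FPS, of which inference accounts for about two-thirds.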

Although based on only two representative frames, these results suggest that lighting conditions strongly affect video stream detection performance. Under fluctuating lighting conditions, precision dropped markedly due to false positives, while recall remained relatively high. Detection accuracy also appeared more sensitive to lighting changes when combined with interference such as camera shake, indicating that future model improvements should focus on enhancing robustness to such environmental variations, thereby ensuring high detection performance in practical applications.

4. Discussion

Due to the complexity of orchard floor environments, research on ground-level chestnut detection in commercial orchard conditions remains limited. The results of this study demonstrate that both the emerging YOLO and RT-DETR models can effectively identify chestnuts in realistic orchard settings. Within the YOLO family, YOLOv11 consistently achieved the best overall detection performance, outperforming YOLOv12 and YOLOv13. These findings are consistent with those reported by Sapkota et al. (2024) [40], who evaluated multiple YOLO variants (v8–v12) for in-orchard detection of green apples before thinning and found that YOLOv11 demonstrated excellent precision, while YOLOv12-l achieved the highest recall. Together, these results reinforce the robustness and suitability of YOLO models for chestnut detection. The YOLOv11 architecture preserves fine spatial details, enabling tight bounding box localization, while its relatively small model size also supports fast inference, making it promising for embedded deployment on harvesting equipment.

In contrast, RT-DETR models exhibited a significantly longer inference time than the YOLO series models. This observation aligns with findings from Saltık et al. (2024) [41], who reported that RT-DETRv1 can achieve competitive mAP at larger image sizes, but at the expense of increased inference time. These characteristics suggest that RT-DETR may be better suited for offline or batch processing scenarios, where global contextual modeling is beneficial and real-time constraints are less stringent.

Several limitations of this study warrant further investigation. First, the chestnut dataset is relatively small, which may limit the generalizability of the findings to orchards with different cultivars, soil types, ground terrains, or harvesting conditions. In future work, we plan to expand the dataset by collecting images from multiple orchards, cultivars, and seasonal conditions. This will enable cross-orchard and cross-season validation experiments, allowing a more rigorous assessment of the robustness and generalization capability of the evaluated models under diverse real-world agricultural environments. Second, although both CNN-based YOLO models and Transformer-based RT-DETR models were evaluated, the overall training configuration, including data augmentation strategies and hyperparameter settings, was primarily developed based on the YOLO family. YOLO detectors are well-suited to aggressive geometric and photometric data augmentation, whereas RT-DETR models, due to their query-based Transformer architecture, are generally more sensitive to augmentation strength, query configurations, and training schedules. While RT-DETR was also tuned through adjustments of key hyperparameters such as learning rate and training schedule, further refinement of augmentation pipelines and other model-specific training strategies may still be required to fully exploit its potential performance. Third, the modeling experiments were conducted using static images; real-world deployment will require validation under continuous video streams, varying illumination throughout the day, and mechanical vibrations from harvesting equipment. Motion blur has been shown to degrade image information acquisition and negatively affect object detection tasks in precision agriculture scenarios [42], indicating the need for models robust against motion for reliable field performance. Finally, this study focused exclusively on chestnut detection and did not explore downstream tasks such as vision-mechanism integration, vision-guided chestnut picking, and harvesting platform locomotion, which are necessary for developing an autonomous chestnut harvesting system. Future work will therefore also focus on further optimizing RT-DETR-specific training configurations and validating the performance of models deployed on harvesting platforms in dynamic conditions.

The observed performance differences among YOLO variants are closely related to their ability to preserve and exploit high-resolution features for small-object representation under complex backgrounds. Ground-level chestnuts are typically small, densely distributed, and frequently occluded by grass, leaves, or soil, which makes the retention of shallow spatial details and local structural cues particularly critical. Excessive spatial down-sampling in deeper network layers can suppress weak target responses, leading to missed detections or imprecise localization. Models that more effectively leverage multi-scale feature fusion, therefore, tend to achieve higher recall and better localization consistency, particularly under stricter IoU thresholds where accurate boundary regression is essential. This observation suggests that fine-grained spatial information plays a dominant role in distinguishing chestnuts from visually similar background elements.

These findings further suggest that future improvements should focus on specific architectural modifications aimed at enhancing small-target perception. One promising direction is to integrate attention mechanisms, such as the Convolutional Block Attention Module (CBAM), which combines channel and spatial attention, into the backbone or neck network to highlight information-rich spatial regions and suppress background interference. Previous research has shown that attention modules can significantly improve a detector’s ability to localize small targets by reallocating feature weights and enhancing weak target responses. Another potential strategy is to optimize the feature pyramid structure. For example, traditional PAN (path aggregation network) or FPN (feature pyramid network) structures can be replaced by enhanced multi-scale fusion mechanisms, including improved feature pyramid networks, bidirectional feature pyramid networks, or attention-guided pyramid structures. These methods strengthen the interaction between shallow high-resolution features and deep semantic features. Such multi-scale fusion strategies are widely used to mitigate the spatial information loss caused by repeated downsampling in deeper layers and to improve the detection performance of small targets. Furthermore, introducing an additional detection head specifically designed for small targets could further improve the detection performance of densely distributed chestnuts. By adding a prediction head to an earlier feature layer corresponding to a higher-resolution feature map, the detector can better capture subtle spatial cues and object boundaries that are typically lost in deeper layers. This architectural improvement can enhance the recall and localization consistency of targets in complex orchard environments, especially when the target is small, partially occluded, and visually similar to background elements.

Research is ongoing to develop a vision-guided chestnut harvesting system by integrating the detection model with robotic manipulation and harvesting mechanisms. In such systems, accurate chestnut localization requires combining 2D detection with depth or stereo information to achieve reliable 3D positioning. Previous studies have demonstrated the feasibility of this approach in agricultural robotics for specialty crops. For example, Zhou et al. (2024) [43] replaced complex 3D CNN architectures with a lightweight combination of 2D detection and stereo vision for the localization of Camellia oleifera fruit, while Ge et al. (2023) [44] showed that bounding-box-based depth estimation can achieve faster and more accurate localization than full 3D clustering approaches. These methods provide useful references for ground-level chestnut localization. From a systems perspective, effective deployment will require coordination between key components, including vision modules, robotic manipulators, and harvesting end-effectors, as emphasized by Chen et al. [45]. Considering that chestnut production in the U.S. is dominated by small-scale orchards, future harvesting solutions must balance detection performance with system cost and operational simplicity.
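As a rough illustration of combining a 2D detection with depth information, the sketch below back-projects the center of a bounding box into camera coordinates using a pinhole camera model. It is not the method of the cited studies: the pinhole model, the use of the median depth inside the box, the function name, and the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are all assumptions made here for illustration.

```python
from statistics import median


def bbox_to_3d(bbox, depth_map, fx, fy, cx, cy):
    """Back-project a box center to camera coordinates (same unit as the depth map).

    bbox is (x1, y1, x2, y2) in integer pixel coordinates; depth_map[v][u] holds
    the depth at row v, column u, with 0 marking missing measurements.
    """
    x1, y1, x2, y2 = bbox
    depths = [
        depth_map[v][u]
        for v in range(y1, y2)
        for u in range(x1, x2)
        if depth_map[v][u] > 0
    ]
    if not depths:
        return None  # no valid depth inside the box
    z = median(depths)  # median is less sensitive to stray background pixels
    u_c, v_c = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    return ((u_c - cx) * z / fx, (v_c - cy) * z / fy, z)


# Toy example: a flat scene 0.5 m from a hypothetical camera.
depth = [[0.5] * 8 for _ in range(8)]
print(bbox_to_3d((4, 2, 8, 6), depth, fx=100.0, fy=100.0, cx=4.0, cy=4.0))
```

A field system would additionally need a calibrated camera and a transform from camera coordinates to the frame of the picking mechanism.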

5. Conclusions

Ground chestnut detection is crucial for enabling vision-based automated chestnut picking. In this study, a new dataset of 319 ground-level images acquired in commercial orchard environments was created, comprising a total of 6524 annotated chestnuts. The state-of-the-art real-time object detectors, including YOLO (v11–v13) and RT-DETR (v1–v4) families, were systematically evaluated for ground chestnut detection. Experiments across multiple model scales demonstrated that YOLOv11 consistently achieved superior localization accuracy at stricter IoU thresholds, achieving a maximum mAP@[0.5:0.95] of 80.1%. Among all evaluated models, YOLOv11s provided the most favorable balance between detection accuracy and inference speed, making it well-suited for real-time, embedded deployment. YOLOv12m achieved the highest mAP@0.5 of 95.1% and recall of 89.3%, highlighting its effectiveness, while YOLOv13 showed no clear performance advantage on this dataset. Although RT-DETR models demonstrated competitive performance, they were overall inferior to YOLO detectors in both detection accuracy and efficiency. Future work will focus on integrating the high-performing detection models into a fully automated chestnut picking system and validating performance under more diverse orchard conditions. Both the curated dataset and models in this study are publicly available to support further research in AI model innovations for chestnut detection. The methods and insights from this study are broadly applicable to other small-target crop detection tasks and support the advancement of vision-based intelligent agricultural automation.

Author Contributions

K.F.: writing—original draft, investigation, formal analysis, software; Y.L.: writing—original draft, review and editing, conceptualization, supervision; X.M.: data curation, investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset created in this study is publicly available at https://zenodo.org/records/19052604.

Acknowledgments

The authors thank Krishna Chaitanya Ginne for annotating the chestnut dataset used in the study.

Conflicts of Interest

The authors declare there is no conflict of interest in this manuscript.

References

USDA-NASS. 2022 Census of Agriculture, National Agricultural Statistics Service. Available online: https://www.nass.usda.gov/Publications/AgCensus/2022/index.php (accessed on 1 February 2026).
Kang, W.S.; Guyer, D. Development of chestnut harvesters for small farms. J. Biosyst. Eng. 2008, 33, 384–389.
Bardenhagen, C.; Lizotte, E. Michigan Chestnut Cost of Production, 2025; Bulletin E3538; Michigan State University Extension: East Lansing, MI, USA, 2026; Available online: https://www.canr.msu.edu/resources/michigan-chestnut-cost-of-production-2025 (accessed on 12 March 2026).
Massantini, R.; Moscetti, R.; Frangipane, M.T. Evaluating progress of chestnut quality: A review of recent developments. Trends Food Sci. Technol. 2021, 113, 245–254.
Jermini, M.; Conedera, M.; Sieber, T.N.; Sassella, A.; Schärer, H.; Jelmini, G.; Höhn, E. Influence of fruit treatments on perishability during cold storage of sweet chestnuts. J. Sci. Food Agric. 2006, 86, 877–885.
Lee, U.K.; Joo, S.; Klopfenstein, N.B.; Kim, M.S. Efficacy of washing treatments in the reduction of post-harvest decay of chestnuts (Castanea crenata ‘Tsukuba’) during storage. Can. J. Plant Sci. 2016, 96, 1–5.
Sieber, T.N.; Jermini, M.; Conedera, M. Effects of the harvest method on the infestation of chestnuts (Castanea sativa) by insects and moulds. J. Phytopathol. 2007, 155, 497–504.
Monarca, D.; Cecchini, M.; Colantoni, A.; Menghini, G.; Moscetti, R.; Massantini, R. The evolution of the chestnut harvesting technique. In II European Congress on Chestnut 1043; ISHS: Leuven, Belgium, 2013; pp. 219–225.
Guyer, D.; Donis-Gonzalez, I.; Burns, J.; DeKleine, M. Is internal quality of chestnuts influenced by harvest methods and physical stresses? In V International Chestnut Symposium 1019; ISHS: Leuven, Belgium, 2012.
Monarca, D.; Moscetti, R.; Carletti, L.; Cecchini, M.; Colantoni, A.; Stella, E.; Menghini, C.; Speranza, S.; Massantini, R.; Contini, M.; et al. Quality maintenance and storability of chestnuts manually and mechanically harvested. In II European Congress on Chestnut 1043; ISHS: Leuven, Belgium, 2014.
Ekman, J. Improved Postharvest Management of Chestnuts; Horticulture Innovation Australia: Sydney, Australia, 2014.
De Kleine, M.E.; Guyer, D.E. Design, Development, and Evaluation of a Single-Stage Combined Chestnut Harvesting and Material Separation Concept. Appl. Eng. Agric. 2013, 29, 823–829.
ProduceTech. SilverFox Fruit Harvester. Available online: https://producetech.com/en/products/silverfox-fruit-harvester/ (accessed on 1 February 2026).
Rashid, M.; Bari, B.S.; Yusup, Y.; Kamaruddin, M.A.; Khan, N. A comprehensive review of crop yield prediction using machine learning approaches with special emphasis on palm oil yield prediction. IEEE Access 2021, 9, 63406–63439.
Batz, P.; Will, T.; Thiel, S.; Ziesche, T.M.; Joachim, C. From identification to forecasting: The potential of image recognition and artificial intelligence for aphid pest monitoring. Front. Plant Sci. 2023, 14, 1150748.
Alaaudeen, K.M.; Selvarajan, S.; Manoharan, H.; Jhaveri, R.H. Intelligent robotics harvesting system process for fruits grasping prediction. Sci. Rep. 2024, 14, 2820.
Adão, T.; Pádua, L.; Pinho, T.M.; Hruška, J.; Sousa, A.; Sousa, J.; Morais, R.; Peres, E. Multi-Purpose Chestnut Clusters Detection Using Deep Learning: A Preliminary Approach; XLII-3/W8; INESC TEC: Porto, Portugal, 2019; pp. 1–7.
Sun, Y.; Hao, Z.; Guo, Z.; Liu, Z.; Huang, J. Detection and mapping of chestnut using deep learning from high-resolution UAV-based RGB imagery. Remote Sens. 2023, 15, 4923.
Arakawa, T.; Tanaka, T.S.T.; Kamio, S. Detection of on-tree chestnut fruits using deep learning and RGB unmanned aerial vehicle imagery for estimation of yield and fruit load. Agron. J. 2024, 116, 973–981.
McCool, C.; Sa, I.; Dayoub, F.; Lehnert, C.; Upcroft, B.; Perez, T. Visual detection of occluded crop: For automated harvesting. In IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 2016; IEEE: New York, NY, USA, 2016.
Mamdouh, N.; Khattab, A. YOLO-based deep learning framework for olive fruit fly detection and counting. IEEE Access 2021, 9, 84252–84262.
Liao, Y.; Li, L.; Xiao, H.; Xu, F.; Shan, B.; Yin, H. YOLO-MECD: Citrus detection algorithm based on YOLOv11. Agronomy 2025, 15, 687.
Allmendinger, A.; Saltık, A.O.; Peteinatos, G.G.; Stein, A.; Gerhards, R. Assessing the capability of YOLO-and transformer-based object detectors for real-time weed detection. Precis. Agric. 2025, 26, 52. [Google Scholar] [CrossRef]
Mu, X.; Lu, Y.; Deng, B. A comparative benchmark of real-time detectors for blueberry detection towards precision orchard management. arXiv 2025, arXiv:2509.20580. [Google Scholar] [CrossRef]
Dutta, A.; Zisserman, A. The VIA annotation software for images, audio and video. In Proceedings of the 27th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2276–2279. [Google Scholar]
Lu, Y. ChestnutDetDataset: A Dataset for Chestnut Detection Toward Automated Picking of On-Ground Chestnuts [Data set]. Zenodo. 2026. Available online: https://zenodo.org/records/19052604 (accessed on 15 March 2026).
Zhang, Y.; Li, X.; Wang, F.; Wei, B.; Li, L. A comprehensive review of one-stage networks for object detection. In 2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC); IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 6 December 2025).
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. In 39th Conference on Neural Information Processing Systems (NeurIPS 2025); Neural Information Processing Systems: San Diego, CA, USA, 2025. [Google Scholar]
Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations; OpenReview: Amherst, MA, USA, 2021. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 16965–16974. [Google Scholar]
Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
Wang, S.; Xia, C.; Lv, F.; Shi, Y. RT-DETRv3: Real-time end-to-end object detection with hierarchical dense positive supervision. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2025; pp. 1628–1636. [Google Scholar] [CrossRef]
Liao, Z.; Zhao, Y.; Shan, X.; Yan, Y.; Liu, C.; Lu, L.; Ji, X.; Chen, J. RT-DETRv4: Painlessly furthering real-time object detection with vision foundation models. arXiv 2025, arXiv:2510.25257. [Google Scholar]
Le, N.T.; Thai, N.; Bui, C. Benchmarking Real-Time Object Detection: Evaluating YOLO and RT-DETR on Speed, Accuracy, and Efficiency. In Information and Communication Technology (SOICT 2024); Buntine, W., Fjeld, M., Tran, T., Tran, M.T., Huynh Thi Thanh, B., Miyoshi, T., Eds.; Communications in Computer and Information Science; Springer: Singapore, 2025; Volume 2351. [Google Scholar] [CrossRef]
Dang, F.; Chen, D.; Lu, Y.; Li, Z. YOLOWeeds: A novel benchmark of YOLO object detectors for multi-class weed detection in cotton production systems. Comput. Electron. Agric. 2023, 205, 107655. [Google Scholar] [CrossRef]
Zheng, Z.; Ye, R.; Wang, P.; Ren, D.; Zuo, W.; Hou, Q.; Cheng, M.-M. Localization distillation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 9407–9416. [Google Scholar]
Sapkota, R.; Meng, Z.; Churuvija, M.; Du, X.; Ma, Z.; Karkee, M. Comprehensive performance evaluation of YOLOv12, YOLO11, YOLOv10, YOLOv9 and YOLOv8 on detecting and counting fruitlet in complex orchard environments. Agric. Commun. 2026, 4, 100125. [Google Scholar] [CrossRef]
Saltık, A.O.; Allmendinger, A.; Stein, A. Comparative analysis of yolov9, yolov10 and rt-detr for real-time weed detection. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 177–193. [Google Scholar]
Deng, B.; Lu, Y.; Brainard, D. Development and preliminary evaluation of a machine vision-guided smart sprayer prototype toward precision vegetable weeding. J. Field Robot. 2025, 59, 96–101. [Google Scholar] [CrossRef]
Zhou, L.; Jin, S.; Wang, J.; Zhang, H.; Shi, M.; Zhou, H. 3D positioning of Camellia oleifera fruit-grabbing points for robotic harvesting. Biosyst. Eng. 2024, 246, 110–121. [Google Scholar] [CrossRef]
Ge, Y.; Xiong, Y.; From, P.J. Three-dimensional location methods for the vision system of strawberry-harvesting robots: Development and comparison. Precis. Agric. 2023, 24, 764–782. [Google Scholar]
Chen, Z.; Lei, X.; Yuan, Q.; Qi, Y.; Ma, Z.; Qian, S.; Lyu, X. Key technologies for autonomous fruit-and vegetable-picking robots: A review. Agronomy 2024, 14, 2233. [Google Scholar] [CrossRef]

Figure 1. Examples of original and annotated images of on-ground chestnuts. The green bounding boxes indicate the labeled chestnuts.

Figure 2. Histogram of the number of annotated chestnuts per image.

Figure 3. The proposed pipeline of chestnut detection by YOLO/RT-DETR object detectors.

Figure 4. Training curves of mAP@0.5 and mAP@[0.5:0.95] for all the evaluated YOLO models for chestnut detection.

Figure 5. Training curves of train box loss and validation box loss for all the evaluated YOLO models for chestnut detection.

Figure 6. Detection results of YOLOv12m on a test image under complex lighting and severe occlusion conditions.

Figure 7. Precision–Recall curves of different YOLO model variants for chestnut detection.

Figure 8. Relationship between number of parameters and (a) GFLOPs, (b) inference time for all YOLOv11–YOLOv13 variants. GFLOPs denote giga floating-point operations.

Figure 9. Training curves of mAP@0.5 and mAP@[0.5:0.95] for the RT-DETRv1, v2, v3, and v4 models for chestnut detection.

Figure 10. Training curves of train box loss and validation box loss for the RT-DETR models for chestnut detection.

Figure 11. Relationship between number of parameters and (a) GFLOPs, (b) inference time for all RT-DETRv1–RT-DETRv4 variants. GFLOPs denote giga floating-point operations.

Figure 12. Overall comparison between YOLO and RT-DETR variants for chestnut detection.

Figure 13. Detection results of RT-DETRv2-R101 on a test image under complex lighting and severe occlusion conditions. Red boxes indicate detected chestnuts.

Figure 14. Detection results of YOLOv12m on video frames. The left frame shows detection under optimal lighting conditions, while the right frame shows detection under challenging lighting conditions.

Table 1. Summary of the YOLO and RT-DETR variants and their open-source software packages.

Index	Models	URL	Reference
1	YOLOv11	https://github.com/ultralytics/ultralytics (accessed on 6 December 2025)	Jocher & Qiu (2024) [29]
2	YOLOv12	https://github.com/sunsmarterjie/yolov12 (accessed on 6 December 2025)	Tian et al. (2025) [30]
3	YOLOv13	https://github.com/iMoonLab/yolov13 (accessed on 6 December 2025)	Lei et al. (2025) [31]
4	RT-DETRv1	https://github.com/lyuwenyu/RT-DETR (accessed on 6 December 2025)	Zhao et al. (2024) [33]
5	RT-DETRv2	https://github.com/lyuwenyu/RT-DETR (accessed on 6 December 2025)	Lv et al. (2024) [34]
6	RT-DETRv3	https://github.com/clxia12/RT-DETRv3 (accessed on 6 December 2025)	Wang et al. (2025) [35]
7	RT-DETRv4	https://github.com/RT-DETRs/RT-DETRv4 (accessed on 6 December 2025)	Liao et al. (2025) [36]

Table 2. Comparative performance summary of all YOLOv11, v12, and v13 variants. GFLOPs denote giga floating-point operations.

Index	Family	Model	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@[0.5:0.95] (%)	GFLOPs	Inference Time (ms/Image)
1	YOLOv11	YOLOv11n	94.8 ± 1.0	86.9 ± 2.3	92.6 ± 1.4	75.0 ± 2.4	6.3	5.6 ± 0.2
2		YOLOv11s	95.5 ± 0.6	88.0 ± 1.2	93.5 ± 1.4	77.5 ± 0.7	21.3	8.3 ± 0.3
3		YOLOv11m	95.5 ± 0.8	87.6 ± 1.3	93.1 ± 0.6	79.1 ± 0.7	67.6	8.7 ± 0.9
4		YOLOv11l	95.3 ± 0.3	89.0 ± 0.7	93.8 ± 0.5	79.9 ± 0.7	86.6	11.1 ± 2.1
5		YOLOv11x	95.3 ± 0.5	88.9 ± 1.0	93.8 ± 0.5	80.1 ± 0.8	194.4	16.2 ± 2.7
		Average	95.3 ± 0.7	88.1 ± 1.6	93.4 ± 0.9	78.4 ± 2.3	75.2	10.0 ± 1.2
6	YOLOv12	YOLOv12n	91.3 ± 1.0	83.7 ± 1.2	91.6 ± 0.8	65.2 ± 1.2	6.3	8.3 ± 0.5
7		YOLOv12s	92.7 ± 0.7	86.3 ± 1.2	94.0 ± 0.6	70.5 ± 0.8	21.2	11.7 ± 0.7
8		YOLOv12m	92.9 ± 1.0	89.3 ± 1.8	95.1 ± 1.1	73.8 ± 2.4	67.1	11.7 ± 1.5
9		YOLOv12l	92.9 ± 0.9	88.1 ± 1.2	94.9 ± 0.8	74.3 ± 1.9	88.5	14.3 ± 1.7
10		YOLOv12x	92.4 ± 2.0	88.6 ± 1.1	94.6 ± 0.5	73.8 ± 0.7	198.5	22.1 ± 2.1
		Average	92.8 ± 1.1	87.6 ± 1.8	94.5 ± 1.1	71.5 ± 2.4	76.3	13.6 ± 1.3
11	YOLOv13	YOLOv13n	90.1 ± 0.4	80.5 ± 1.2	89.8 ± 0.6	60.5 ± 1.4	6.2	11.9 ± 1.0
12		YOLOv13s	92.1 ± 0.6	84.0 ± 0.9	92.3 ± 0.6	66.4 ± 1.2	20.7	14.8 ± 1.5
13		YOLOv13l	91.4 ± 1.9	83.7 ± 3.3	92.0 ± 2.3	66.3 ± 4.6	88.1	30.7 ± 2.3
14		YOLOv13x	91.2 ± 3.3	83.0 ± 6.0	91.0 ± 4.7	66.2 ± 8.6	198.7	47.8 ± 4.2
		Average	91.2 ± 2.8	82.8 ± 3.9	91.8 ± 2.8	64.9 ± 5.6	78.4	26.3 ± 2.3

Note that ± indicates the standard deviation. Bold text indicates that this indicator is the highest in the same series of models.

Table 3. Multiple comparison results of mAP performance among YOLOv11, YOLOv12, and YOLOv13 based on one-way ANOVA with least significant difference (LSD) tests at a significance level of 5%.

Model	mAP@0.5 (%)	mAP@[0.5:0.95] (%)
YOLOv11	93.4 ± 0.9 a	78.4 ± 2.3 a
YOLOv12	94.5 ± 1.1 b	71.5 ± 2.4 b
YOLOv13	91.8 ± 2.8 c	64.9 ± 5.6 c
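
The F-statistic behind a one-way ANOVA such as the one summarized in Table 3 can be sketched in a few lines. This is only an illustration under a loud assumption: the replicate-level results used for the reported analysis are not listed here, so the per-variant mean mAP@[0.5:0.95] values from Table 2 stand in as observations. The function name `one_way_anova_f` is hypothetical, and the LSD post hoc comparison is not shown.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# ASSUMPTION: per-variant means of mAP@[0.5:0.95] (%) copied from Table 2,
# used here in place of the replicate-level data, which are not available.
yolov11 = [75.0, 77.5, 79.1, 79.9, 80.1]
yolov12 = [65.2, 70.5, 73.8, 74.3, 73.8]
yolov13 = [60.5, 66.4, 66.3, 66.2]

f_stat, df_b, df_w = one_way_anova_f([yolov11, yolov12, yolov13])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Because the inputs are stand-ins, the resulting F value is not expected to reproduce the statistics behind Table 3; the sketch only shows how between-family and within-family variation are compared.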

Table 4. Normalized confusion matrix analysis of YOLOv11x and YOLOv12m at different IoU thresholds. IoU, TP, FP, and FN denote intersection over union, true positive, false positive, and false negative, respectively.

Model	IoU	TP	FP	FN
YOLOv11x	0.50	0.85	0.05	0.10
YOLOv11x	0.75	0.73	0.08	0.19
YOLOv12m	0.50	0.84	0.06	0.10
YOLOv12m	0.75	0.65	0.14	0.21
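
As a worked example, the fractions in Table 4 can be turned into precision, recall, and F1 using the standard definitions. The sketch below assumes that TP, FP, and FN in each row are normalized by the same total (each row sums to 1), so their ratios are meaningful; the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from TP, FP, and FN counts or fractions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Values copied from Table 4: (model, IoU threshold) -> (TP, FP, FN).
# ASSUMPTION: the three fractions in a row share one normalizing total.
TABLE_4 = {
    ("YOLOv11x", 0.50): (0.85, 0.05, 0.10),
    ("YOLOv11x", 0.75): (0.73, 0.08, 0.19),
    ("YOLOv12m", 0.50): (0.84, 0.06, 0.10),
    ("YOLOv12m", 0.75): (0.65, 0.14, 0.21),
}

for (model, iou), (tp, fp, fn) in TABLE_4.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{model} IoU={iou:.2f}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

For YOLOv11x at IoU 0.50, this gives a precision of about 0.944 and a recall of about 0.895. Raising the IoU threshold to 0.75 lowers both values for both models, with the larger drop for YOLOv12m.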

Table 5. Comparative performance summary of RT-DETRv1, v2, v3, and v4 variants. GFLOPs denote giga floating-point operations.

Index	Family	Model	Precision (%)	Recall (%)	mAP@0.5 (%)	mAP@[0.5:0.95] (%)	GFLOPs	Inference Time (ms/Image)
1	RT-DETRv1	RT-DETRv1-R18	88.5 ± 3.2	81.2 ± 2.8	84.3 ± 0.6	68.5 ± 0.5	152	72.1 ± 3.7
2		RT-DETRv1-R34	90.7 ± 2.5	82.6 ± 3.2	86.8 ± 0.6	70.6 ± 0.4	233	81.3 ± 2.5
3		RT-DETRv1-R50	92.1 ± 2.4	82.9 ± 2.0	87.7 ± 0.5	73.2 ± 0.7	342	90.4 ± 4.3
4		RT-DETRv1-R101	92.8 ± 2.6	83.1 ± 2.4	89.3 ± 0.8	74.5 ± 0.6	656	103.5 ± 4.5
		Average	91.0 ± 2.7	82.5 ± 2.6	87.0 ± 0.6	71.7 ± 0.6	346	86.8 ± 3.7
5	RT-DETRv2	RT-DETRv2-R18	92.5 ± 1.8	80.1 ± 4.0	88.3 ± 1.4	73.7 ± 0.8	153	47.4 ± 1.5
6		RT-DETRv2-R34	94.2 ± 2.2	82.9 ± 3.2	89.4 ± 1.1	75.1 ± 1.2	234	48.8 ± 2.4
7		RT-DETRv2-R50	94.7 ± 2.6	84.7 ± 3.0	90.0 ± 1.8	70.5 ± 1.1	343	56.1 ± 2.3
8		RT-DETRv2-R101	95.1 ± 1.5	86.3 ± 2.4	91.1 ± 1.6	71.9 ± 0.8	657	66.3 ± 3.2
		Average	94.1 ± 2.0	83.5 ± 3.2	89.7 ± 1.5	72.8 ± 1.0	347	54.6 ± 2.4
9	RT-DETRv3	RT-DETRv3-R18	87.6 ± 3.2	79.4 ± 1.8	84.5 ± 3.2	67.2 ± 2.7	154	52.7 ± 2.4
10		RT-DETRv3-R34	90.6 ± 3.4	82.5 ± 2.5	87.8 ± 2.6	71.8 ± 2.6	235	59.2 ± 2.5
11		RT-DETRv3-R50	92.7 ± 3.8	84.5 ± 3.4	90.4 ± 3.0	78.9 ± 2.5	344	67.1 ± 4.7
		Average	90.3 ± 3.5	82.1 ± 2.6	87.6 ± 3.0	72.6 ± 2.6	244	59.7 ± 3.2
12	RT-DETRv4	RT-DETRv4-R18	85.6 ± 1.8	80.3 ± 2.7	80.3 ± 1.8	63.6 ± 2.5	78	23.7 ± 2.1
13		RT-DETRv4-R34	87.0 ± 2.0	82.6 ± 2.5	83.5 ± 3.4	66.5 ± 2.6	167	27.3 ± 2.6
14		RT-DETRv4-R50	88.4 ± 1.4	83.9 ± 1.8	84.6 ± 3.2	70.9 ± 2.7	286	30.5 ± 3.3
15		RT-DETRv4-R101	89.5 ± 2.1	85.5 ± 2.4	86.1 ± 2.9	72.0 ± 2.0	658	38.0 ± 3.7
		Average	87.6 ± 1.8	83.1 ± 2.4	83.7 ± 2.8	68.3 ± 1.2	297	29.9 ± 2.9

Note that ± indicates the standard deviation. Bold text indicates that this indicator is the highest in the same series of models.
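
To relate the inference times in Tables 2 and 5 to throughput, the mean time per image can be converted to frames per second. The sketch below assumes one image per forward pass and ignores pre-processing, post-processing, and hardware differences, so the figures are rough upper bounds rather than measured frame rates; `frames_per_second` is a hypothetical helper.

```python
def frames_per_second(ms_per_image):
    """Convert a per-image inference time in milliseconds to frames per second."""
    if ms_per_image <= 0:
        raise ValueError("inference time must be positive")
    return 1000.0 / ms_per_image


# Mean inference times (ms/image) copied from Tables 2 and 5.
INFERENCE_MS = {
    "YOLOv11n": 5.6,
    "YOLOv12m": 11.7,
    "YOLOv11x": 16.2,
    "RT-DETRv4-R18": 23.7,
    "RT-DETRv2-R101": 66.3,
}

for model, ms in INFERENCE_MS.items():
    print(f"{model}: {ms:.1f} ms/image ~ {frames_per_second(ms):.1f} FPS")
```

Under these assumptions, YOLOv12m at 11.7 ms/image corresponds to roughly 85 FPS, while RT-DETRv2-R101 at 66.3 ms/image corresponds to roughly 15 FPS.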
