1. Introduction
Chestnut (Castanea spp.) is a nutritious and popular nut crop, rich in vitamins and minerals, low in fat, and high in dietary fiber. Its sweet flavor is particularly favored by American consumers. According to the 2022 USDA Census of Agriculture [
1], 2845 growers in the United States (U.S.) manage a total of 10,049 acres of fruiting and non-fruiting chestnut orchards, with an average farm size of 3.5 acres. Chestnut farms with a size of 2 hectares (4.94 acres) or less and fewer than 600 chestnut trees are considered small-scale farms by Kang and Guyer [
2]. Currently, the USDA defines small farms as operations with an annual gross income of less than $350,000 (https://www.ers.usda.gov/topics/farm-economy/farm-structure-and-organization, accessed on 15 March 2026). A recent study on chestnut production costs in Michigan reported an average total farm revenue of $12,500 per acre [3]. If that figure is applied to the national average farm size of 3.5 acres, it implies roughly $43,750 per farm, well below the USDA threshold. Given the small acreage and revenue levels, chestnut production in the U.S. can be characterized as small-scale farming under the USDA definition.
Michigan leads U.S. chestnut production, accounting for approximately 13% of the national planted area. However, about 35% of this area comprises non-fruiting chestnut orchards. This indicates that a substantial portion of the planted area has not yet reached productive capacity, suggesting that the industry has considerable potential for further growth and development.
Chestnuts are highly seasonal fruits that can only maintain their peak commercial quality, size, and health for a relatively short period of time [
4]. One of the primary challenges in chestnut production is its susceptibility to pest damage and quality degradation, which often leads to significant post-harvest loss. During the harvest period, chestnuts naturally fall from protective shells (known as burrs) to the orchard floor. Once in direct contact with the ground, the nuts are exposed to adverse environmental conditions, including soil, fallen branches and leaves, surface microbial communities, precipitation, fluctuations in temperature and humidity, etc. Fungi have been identified as the main cause of postharvest chestnut decay [
5]. Fungal infection typically occurs after nuts fall to the ground and come into contact with soil, plant residues, and/or dirty water [
6]. Additionally, fallen chestnuts are often subjected to vibration or friction, which can result in micro-cracks or abrasions on the shell or pericarp. These micro-wounds serve as the entry points for fungal spores, mycelium, or small insects. Damage caused by wildlife foraging further exacerbates yield and quality losses. Consequently, chestnuts must be harvested promptly after falling [
7] to minimize ground exposure and risk of nut deterioration and loss, and ensure optimal product quality.
Despite these risks, on-ground chestnut harvesting still relies primarily on manual picking, which is highly labor-intensive, time-consuming, and increasingly unsustainable as orchard acreage expands. As the chestnut industry continues to grow—particularly among small and family-operated farms—producers are often faced with harvest volumes that exceed what a limited workforce can reasonably manage. To reduce labor demands, various mechanical harvesting systems have been introduced; however, these solutions only partially address the challenges of efficiency, labor dependence, and nut quality preservation. Currently available mechanical harvesting equipment mainly includes vacuum-based harvesters and mechanical sweepers, which may be trailed, mounted, or self-propelled depending on orchard conditions and operational requirements [
8]. Vacuum harvesters, commonly used on uneven or sloping terrain, collect chestnuts from the orchard floor through suction pipes and deposit them into collection bags, with efficiency reaching up to approximately 900 kg/h. Mechanical sweepers gather windfallen nuts using rotating brushes and conveyor belts, with an efficiency of up to 1500 kg/h. However, if self-propelled machines are used, they are typically suitable for medium-to-large farms, usually requiring around 15–20 hectares of harvest area. If trailed or mounted machines are used, human labor is still necessary for operation.
Although these machines can reduce some manual effort, they still require continuous human involvement for driving, monitoring, and post-harvest handling, and their high acquisition costs (e.g., $50,000–$100,000 per unit) limit adoption by small-scale producers. Moreover, compared with manual picking, mechanical harvesting has been shown to increase the risk of physical damage to chestnuts, including bruising, abrasions, kernel darkening, and off-odor development [
9,
10]. In vacuum-based systems, for example, internal kernel damage caused by suction forces may not be immediately visible at harvest and often becomes apparent during storage, leading to increased postharvest losses and the need for additional handling or treatments [
11]. While such systems are often described as “automated,” they do not eliminate labor-intensive steps such as separation, quality inspection, and rehandling, nor do they substantially reduce overall labor costs. These limitations underscore the need not merely for improved mechanization, but for a genuinely autonomous, low-cost harvesting solution that minimizes labor input while preserving nut quality.
In recent years, several studies have explored small- to medium-scale, cost-effective harvesting assistance systems for chestnut production. Kang and Guyer (2008) [
2] developed and evaluated three chestnut harvester prototypes and proposed a venturi-based separation device to distinguish chestnuts from empty shells. Among them, the airlock blade system successfully picked up all the scattered material with a rate of material pickup of about 56 kg/h. However, the harvesting performance was inconsistent, and the system remained relatively bulky and energy-intensive. De Kleine and Guyer (2013) [
12] introduced an airflow-adjustable harvesting system capable of collecting and separating chestnuts from orchard debris, thereby improving the operational efficiency of small orchards. In tests, the highest chestnut harvesting efficiency reached 88.44%, but the performance of the harvesting system was strongly affected by the nut-to-debris ratio and material feed rate. More recently, Greg Peck and colleagues at Cornell AgriTech demonstrated the Silverfox Harvester, which can collect up to 800 pounds (about 363 kg) of chestnuts per hour at a cost of less than $4000, offering an affordable option for small producers [
13]. Nevertheless, this system still relies on manual operation and does not support autonomous harvesting. Overall, existing small-scale mechanical harvesting systems remain constrained by labor dependence, incomplete automation, performance reliability, and the need for additional separation and processing steps.
To address these challenges, there is a clear need to develop a low-cost, high-precision autonomous chestnut harvesting system that incorporates vision capabilities to identify chestnuts on the orchard floor and accurately collect them without causing damage. Such a system would minimize downstream separation processes, significantly reduce labor requirements, and enable efficient, high-quality harvesting tailored for small-scale chestnut producers.
As the core component of an intelligent harvesting platform, reliable machine vision is fundamental to accurate perception, decision-making, and robotic operation. Advances in machine learning (ML) and artificial intelligence (AI)-based vision technologies have significantly accelerated progress in agricultural automation. Early applications of ML focused on crop monitoring and yield prediction [
14], and have since expanded to a wide range of tasks. AI-based vision systems have greatly improved the efficiency of pest and disease identification, crop health monitoring, and phenotypic analysis, enabling rapid, data-driven management decisions without human intervention [
15]. In harvesting automation, Alaaudeen et al. (2024) [
16], for example, combined computer vision with robot harvesting to realize autonomous apple picking, reporting recognition success rates exceeding 95% and retry rates (the rate of retrying after a failed grasp) below 12%. Recent studies have also demonstrated the effectiveness of deep learning in chestnut-related vision tasks. Adão et al. (2019) [
17] successfully classified and segmented chestnuts using convolutional neural networks (CNNs), achieving a classification accuracy of 91%. Sun et al. (2023) [
18] applied semantic segmentation to aerial images for chestnut tree cover detection, obtaining an average F1 score of 86.13%. However, traditional ML and early deep learning approaches often exhibit limited generalization under the complex visual conditions encountered in real chestnut orchard environments. Challenges such as variable illumination, occlusion, orchard floor clutter, and dynamic environmental conditions frequently degrade detection accuracy as well as real-time performance, limiting their practical deployment in autonomous harvesting systems.
The first step toward automated chestnut harvesting is the development of a robust detection system capable of reliably identifying on-ground chestnuts. Such a system must effectively deal with challenges such as occlusion [
19], illumination variations [
20], and visually complex backgrounds, where objects such as leaves, stones, and soil share similar color and texture characteristics with chestnuts and can lead to false positives or missed detection. In recent years, YOLO (You Only Look Once) and RT-DETR (Real-Time DEtection TRansformer) models have demonstrated strong performance in agricultural target detection tasks. Mamdouh & Khattab (2021) [
21] employed an improved YOLOv4-based algorithm for olive fruit fly detection, achieving a precision of 0.84, a recall of 0.97, and a mean Average Precision (mAP) of 96.68%. Liao et al. (2025) [
22] proposed the YOLO-MECD model based on YOLOv11, achieving a precision of 84.4% and an mAP of 81.6%. Allmendinger et al. (2025) [
23] applied the RT-DETR-l model to weed detection, achieving an average precision of 82.44% and an average recall of 66.02%. Mu et al. (2025) [
24] conducted a comparative benchmark of YOLO (v8–v12) and RT-DETR (v1–v2) models for blueberry detection using a curated bush canopy dataset, achieving a maximum mAP@50 of 93.6% with the RT-DETRv2-X model. Despite these advances, a systematic and comparative evaluation of state-of-the-art real-time detection models for on-ground chestnut detection under real orchard conditions remains unexplored.
The overall objective of this study was therefore to systematically evaluate the applicability of state-of-the-art real-time object detection models for on-ground chestnut detection in orchard environments. Specifically, the objectives were to: (1) construct a labeled chestnut detection dataset that reflects real orchard conditions; (2) conduct a comprehensive quantitative comparison of representative real-time object detection models, including YOLOv11, YOLOv12, YOLOv13, as well as RT-DETRv1, RT-DETRv2, RT-DETRv3, and RT-DETRv4, in terms of detection accuracy, robustness under complex field conditions, and real-time inference performance; and (3) analyze the implications of the comparative results for the design and deployment of vision-based real-time automated chestnut harvesting systems. Both the dataset and software programs developed in this study have been made publicly available at
https://github.com/AgFood-Sensing-and-Intelligence-Lab/ChestnutDetection (accessed on 15 March 2026).
3. Results
3.1. YOLO Results
Figure 4 presents the training curves for mAP@0.5 and mAP@[0.5:0.95] across all evaluated model variants. All architectures exhibited rapid feature learning capabilities during the early training stages. Specifically, within the first 100 epochs, YOLOv11 achieved over 90% mAP@0.5, whereas YOLOv12 and YOLOv13 reached approximately 75% and 65%, respectively. Across all three model series, mAP@[0.5:0.95] exceeded 65% within the same training window. After approximately 150 training epochs, the detection accuracy across all variants converged and stabilized, indicating that the 200-epoch training schedule adopted was sufficient to achieve stable convergence. This rapid convergence demonstrates the ability of these models to effectively adapt to the dataset for chestnut detection, despite challenging ground-level orchard conditions involving shadows, occlusions, and background clutter.
To further examine training dynamics and potential overfitting,
Figure 5 shows the training curves for the training box loss and validation box loss of the YOLO models. As shown in the figure, the loss functions of all models decreased sharply within the first 20 epochs, indicating that the models could quickly learn representative features. Thereafter, the loss values gradually decreased and stabilized, indicating that the models gradually converged during training. Importantly, the validation set loss and training loss showed similar trends with no significant deviation, indicating that overfitting did not occur in the later stages of training. We continuously monitored the validation performance of the models using validation loss and mAP metrics. The stabilization of the loss curves was consistent with the improvement in mAP performance, confirming that the models maintained good generalization ability. Although the loss curves began to stabilize around 100–150 epochs, we continued training for 200 epochs to ensure that all models fully converged and achieved stable performance.
Table 2 summarizes the detection performance of all YOLO models on the test dataset. Overall, all three YOLO families achieved competitive performance, with the accuracy generally improving as the model scale increased. Across all variants, mAP@0.5 ranged from 89.8% (YOLOv13-n) to 95.1% (YOLOv12-m), while mAP@[0.5:0.95] ranged from 60.5% (YOLOv13-n) to 80.1% (YOLOv11-x).
Although the results in Table 2 show performance differences among YOLOv11, YOLOv12, and YOLOv13, the standard deviations of some metrics overlap. To determine whether these differences are statistically significant rather than the result of random fluctuations, a one-way analysis of variance (ANOVA) followed by Fisher’s Least Significant Difference (LSD) multiple comparison tests at the significance level of α = 0.05 was performed. The ANOVA results revealed significant differences among the three model series for both evaluation metrics, including mAP@0.5 (F = 7.59, p = 0.008) and mAP@[0.5:0.95] (F = 21.79, p < 0.001). Subsequent multiple comparison analysis further showed that all pairwise comparisons among YOLOv11, YOLOv12, and YOLOv13 exhibited statistically significant differences (p < 0.05). These results indicate that the observed performance differences among the three model families are statistically meaningful and unlikely to be caused by random variation.
Table 3 presents the results of multiple comparisons of mAP performance for YOLOv11, YOLOv12, and YOLOv13 based on one-way ANOVA and LSD tests. The models are labeled with different letters (a, b, c). These letters indicate the statistical significance of pairwise comparisons: models with the same letter show no significant difference, while models with different letters show statistically significant differences.
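The F statistic behind a one-way ANOVA of this kind can be sketched in a few lines. The example below is only an illustration: the function name one_way_anova_f and the toy scores are hypothetical and are not the values in Table 2, and the study's own statistical software is not specified here. The sketch computes the ratio of between-group to within-group mean squares; a p-value would then be read from the F distribution with the returned degrees of freedom (for instance with a statistics package such as SciPy).

```python
from statistics import mean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(x for g in groups for x in g) / n_total
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy scores (NOT the values in Table 2), one list per model family.
toy_scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_scores)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value means the spread between family means is large relative to the spread among variants within a family, which is the condition under which the LSD letter groupings become distinct.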
The YOLOv12-m model achieved the highest mAP@0.5 value of 95.1% and the highest recall value of 89.3% while maintaining a high precision value of 92.9%. These results demonstrate that YOLOv12-m has strong detection accuracy and achieves a good balance between precision and recall.
Figure 6 further illustrates an example image of the detection results of YOLOv12-m under complex lighting and severe occlusion conditions. In this test, the model correctly detected 47 out of 49 chestnuts, with only one false positive, achieving a precision (P) of 97.9% and a recall (R) of 95.9%. In contrast, YOLOv11-x attained the highest mAP@[0.5:0.95] (80.1%) and a high precision of 95.3%, with recall = 88.9%, suggesting superior bounding-box localization accuracy and robustness across varying IoU thresholds, which are important for reducing false positives in practical applications. Among the three model families, the YOLOv11 model consistently performed well in terms of precision and recall, with all variants achieving a precision exceeding 94% and a mean precision of 95.3%. Compared to YOLOv11, YOLOv12 showed a slight decrease in mean precision, but its recall remained similar (87.6%). Notably, YOLOv12 exhibited significant improvements in recall for medium, large, and super-large variants, indicating that its attention-based architectural enhancements—such as efficient attention and improved feature aggregation—help recover more true positives under challenging visual conditions. These architectural refinements in YOLOv12 were designed to better capture salient features without sacrificing real-time performance.
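The per-image precision and recall quoted for Figure 6 follow directly from the detection counts. As a minimal sketch (the helper name precision_recall is illustrative, not part of the released code), the calculation is:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts; empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts stated in the text for the Figure 6 image:
# 47 of 49 chestnuts detected (2 missed) and 1 false positive.
p_img, r_img = precision_recall(tp=47, fp=1, fn=49 - 47)
print(f"P = {p_img:.1%}, R = {r_img:.1%}")
```

With these counts the sketch reproduces the reported 97.9% precision (47/48) and 95.9% recall (47/49).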
To examine the performance differences between YOLOv11-x and YOLOv12-m, we conducted a confusion matrix-based analysis at different IoU thresholds. As shown in Table 4, both models exhibited high true-positive (TP) rates at IoU = 0.50. YOLOv11-x had a TP rate of 0.85, a false-positive (FP) rate of 0.05, and a false-negative (FN) rate of 0.10, while YOLOv12-m had a TP rate of 0.84, an FP rate of 0.06, and an FN rate of 0.10. These results indicate that the detection capabilities of the two models are comparable when the IoU threshold is relaxed, which is consistent with the strong mAP@0.5 achieved by YOLOv12-m. However, the differences became more pronounced when the IoU threshold was increased to 0.75. YOLOv11-x maintained a relatively high TP rate (0.73), with moderate FP (0.08) and FN (0.19) rates. In contrast, YOLOv12-m showed a clear decrease in TP rate (0.65) and marked increases in FP rate (0.14) and FN rate (0.21). This indicates that under stricter IoU criteria, YOLOv12-m is more prone to localization-related errors and background-induced false detections, especially in cluttered scenes or when parts of the chestnut are occluded.
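The effect of the IoU threshold on whether a detection counts as a true positive can be illustrated with a small sketch. The box coordinates below are hypothetical and the helper names iou and is_true_positive are illustrative only; they are not taken from the evaluation code used in this study.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, gt: Box, threshold: float) -> bool:
    """A prediction matches a ground-truth box if IoU >= threshold."""
    return iou(pred, gt) >= threshold


# Hypothetical example: a 10 x 10 ground-truth box and a prediction
# of the same size shifted 2 pixels to the right.
gt_box = (0.0, 0.0, 10.0, 10.0)
pred_box = (2.0, 0.0, 12.0, 10.0)
print(f"IoU = {iou(pred_box, gt_box):.3f}")
print("TP at 0.50:", is_true_positive(pred_box, gt_box, 0.50))
print("TP at 0.75:", is_true_positive(pred_box, gt_box, 0.75))
```

In this toy case the same prediction is accepted at IoU = 0.50 but rejected at IoU = 0.75, where it would count as both a false positive and a missed chestnut. This is the mechanism by which small localization offsets lower the TP rate and raise the FP and FN rates under the stricter threshold.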
Although YOLOv12 slightly outperformed YOLOv11 in terms of mAP@0.5, its mAP@[0.5:0.95] (71.5%) was significantly lower than YOLOv11’s (78.4%), indicating weaker performance under stricter localization criteria. In contrast, YOLOv13 performed poorly across all model scales, particularly in terms of mAP@[0.5:0.95], where its average score was only 64.86%. Even its best-performing variant (YOLOv13-s: precision = 92.1%, recall = 84.0%, mAP@0.5 = 92.3%, mAP@[0.5:0.95] = 66.4%) lagged behind both YOLOv11 and YOLOv12. From an architectural perspective, this discrepancy may be related to YOLOv13’s emphasis on global feature correlation mechanisms such as hypergraph-based adaptive correlation enhancement and full-pipeline feature distribution, which are designed to capture high-order relationships across the entire image space. While such mechanisms have shown benefits on certain benchmarks, they may be less effective for densely packed, small single-class target detection under complex lighting and severe occlusion, where the preservation of local fine-grained spatial details is critical for stringent bounding-box localization, especially at higher IoU thresholds.
Figure 7 shows the precision–recall (PR) curves for different YOLO model variants on the dataset. Overall, the PR curves for the YOLOv11 series consistently lie near the upper right corner of the figure. YOLOv11 maintains high precision at lower recall levels, resulting in a relatively gentle initial decline in the curve. As recall increases further, precision begins to decline more rapidly, leading to a steeper slope in the later stages of the curve. In contrast, the YOLOv12 series exhibits a different trend. Its PR curve shows a more pronounced decrease in precision at lower recall levels, indicating that precision declines earlier as the confidence threshold is relaxed. However, at higher recall levels, the decline in precision becomes slower, suggesting improved stability of precision when detecting more targets. The YOLOv13 series performs relatively weakly: its PR curve typically falls below the YOLOv11 and YOLOv12 curves for most of the recall range. This indicates that, at similar recall levels, YOLOv13 has lower precision, resulting in a smaller area under its PR curve. The trends of the curves are consistent with the data in Table 2.
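For readers less familiar with how a PR curve and its area are obtained, the following sketch ranks detections by confidence and accumulates precision and recall, then integrates the curve with the commonly used all-point interpolation. The detection list is hypothetical, the function names are illustrative, and the interpolation scheme is an assumption; the evaluation toolkit used in this study may differ in detail.

```python
def pr_curve(detections: list[tuple[float, bool]], n_gt: int):
    """Build precision and recall lists from (confidence, is_true_positive)
    pairs, sweeping the confidence threshold from high to low."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)
    return precisions, recalls


def average_precision(precisions: list[float], recalls: list[float]) -> float:
    """Area under the PR curve using the monotone precision envelope."""
    envelope = precisions[:]
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(envelope, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical detections for an image set containing 4 ground-truth chestnuts.
toy_detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
precisions, recalls = pr_curve(toy_detections, n_gt=4)
ap = average_precision(precisions, recalls)
print(f"AP = {ap:.4f}")
```

A curve that drops early (precision falling at low recall) loses area at every subsequent recall level, which is why an early decline has a larger effect on AP than a late one.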
Figure 8 illustrates the relationship between model complexity and computational performance, showing the trends of GFLOPs and inference time versus the number of model parameters. As the model scale increases, both inference time and GFLOPs exhibit an upward trend. Across corresponding scales, the three YOLO families demonstrated comparable computational complexity, with YOLOv11-x, YOLOv12-x, and YOLOv13-x all approaching 200 GFLOPs. Although these larger models incurred higher computational costs, their inference time remained below 50 ms, corresponding to frame rates exceeding 20 FPS (frames per second). This performance still meets the basic real-time detection requirements of ground-based chestnut harvesting systems.
YOLOv11 models appeared to exhibit the most favorable balance between accuracy and computational efficiency. YOLOv11-n achieved the fastest inference time of 5.6 ms, followed by YOLOv11-s at 8.3 ms, which—combined with its precision of 95.5% and mAP@0.5 of 93.45%—makes it particularly attractive for embedded and edge device-based applications. YOLOv12-m achieves an inference time of 11.7 ms while delivering the highest mAP@0.5 of 95.1%, supporting its feasibility for real-time deployment. In contrast, YOLOv13-x exhibited the highest computational load (198.7 GFLOPs) and longest inference (47.8 ms), rendering it the least suitable for real-time deployment.
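The relationship between per-image inference time and frame rate used above can be checked with a short sketch. The latencies are the ones quoted in the text; the 50 ms budget is an assumption that mirrors the 20 FPS figure mentioned earlier, and the variable names are illustrative.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Convert per-image inference time in milliseconds to frames per second."""
    return 1000.0 / latency_ms


# Inference times quoted in the text (ms per image).
reported_latency_ms = {
    "YOLOv11-n": 5.6,
    "YOLOv11-s": 8.3,
    "YOLOv12-m": 11.7,
    "YOLOv13-x": 47.8,
}

# Assumed real-time budget: 50 ms per image, i.e., 20 FPS.
budget_ms = 50.0

for name, ms in reported_latency_ms.items():
    status = "within" if ms < budget_ms else "over"
    print(f"{name}: {ms:.1f} ms -> {fps_from_latency(ms):.1f} FPS ({status} budget)")
```

These figures cover model inference only; image capture, preprocessing, and any downstream actuation would add to the per-frame time on a harvesting platform.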
3.2. RT-DETR Results
Figure 9 presents the training curves of mAP@0.5 and mAP@[0.5:0.95] for all the RT-DETR (v1–v4) models. All the model variants achieved mAP@0.5 values exceeding 80% and mAP@[0.5:0.95] values above 65% within the first 60 training epochs, followed by performance stabilization after approximately 100 epochs. These trends suggested effective learning and convergence across all models and confirmed that the 200-epoch training schedule adopted in this study was sufficient to achieve stable convergence.
To further analyze the training dynamics and assess potential overfitting,
Figure 10 shows the training box loss and validation box loss curves for the RT-DETRv1, RT-DETRv2, RT-DETRv3, and RT-DETRv4 models. As shown, both the training loss and validation loss decrease rapidly within the first 10 epochs, indicating that the model can quickly learn representative features. Thereafter, the rate of decrease slows down and eventually stabilizes as training progresses. The trend of the validation loss is similar to that of the training loss, with no significant deviation, indicating that the model effectively avoids overfitting.
Table 5 summarizes the detection performance of the evaluated RT-DETR variants. Similar to the YOLO results, detection accuracy generally improves with increasing model size. RT-DETRv2 significantly outperforms RT-DETRv1 across all evaluation metrics, demonstrating the effectiveness of its architectural improvements. In contrast, RT-DETRv3’s precision and recall are slightly lower than RT-DETRv1, despite a minor improvement in mAP.
Under the same training conditions and datasets, RT-DETRv4 performs worse than previous variants. RT-DETRv4 aims to leverage a VFM-based semantic distillation framework, which enriches feature representations by injecting high-level semantics from large pre-trained models into the detector during training. While this strategy has been shown to improve overall performance on large, diverse benchmark datasets such as COCO, existing research on dense object detection indicates that semantic knowledge alone is often insufficient to achieve optimal localization performance. In particular, Zheng et al. (2022) [
39] pointed out that methods emphasizing global semantic alignment may not fully capture the fine-grained spatial cues required for accurate bounding box regression, and that for dense object detection on benchmark datasets such as COCO, localization knowledge distillation usually brings larger improvements than semantic feature imitation. The semantic distillation design of RT-DETRv4 prioritizes global semantic alignment rather than localization-related features, whereas the chestnut dataset mainly consists of small, dense, and partially occluded targets with limited diversity of training samples. Localization-related features are crucial for stringent localization metrics such as mAP@[0.5:0.95], and in this experiment, RT-DETRv4’s mAP@[0.5:0.95] was significantly lower than that of the other models. A plausible explanation is therefore that the additional semantic supervision shifts the learning focus away from the precise localization features required for small-object detection, resulting in performance inferior to earlier RT-DETR variants.
Among all RT-DETR models, RT-DETRv2-R101 achieved the best overall performance (precision = 95.1%, recall = 86.3%, mAP@0.5 = 91.1%, mAP@[0.5:0.95] = 71.9%). Its precision and recall are comparable to those of the best-performing YOLOv11-series models, highlighting its strong detection capability.
Figure 11 illustrates the relationship among the model size (number of parameters), computational complexity (GFLOPs), and inference time for all evaluated RT-DETR models. Consistent with the trends observed in the YOLO experiments, both inference time and computational cost generally increase with model size. Compared to RT-DETRv1, v2, and v3, the RT-DETRv4 architecture substantially reduces the number of parameters across all backbone scales. Specifically, RT-DETRv4-R18 contains only 10 million parameters, representing a 50% reduction relative to the 20 million parameters used in the corresponding R18 variants of v1–v3. At larger scales, RT-DETRv4-R34 and RT-DETRv4-R50 reduce parameter counts from 31 million and 42 million to 19 million and 31 million, respectively. Even at its largest scale, RT-DETRv4-R101 uses only 62 million parameters, approximately 18% fewer than the 76 million parameters required by the v1 to v3 versions. These reductions in model size translated directly to improved inference efficiency.
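The parameter reductions quoted above can be verified with simple arithmetic. The sketch below uses only the parameter counts stated in the text (in millions); the helper name percent_reduction is illustrative.

```python
def percent_reduction(baseline: float, reduced: float) -> float:
    """Relative reduction of `reduced` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - reduced) / baseline


# (v1-v3 parameters, v4 parameters) in millions, as stated in the text.
params_millions = {
    "R18": (20, 10),
    "R34": (31, 19),
    "R50": (42, 31),
    "R101": (76, 62),
}

reductions = {
    backbone: percent_reduction(old, new)
    for backbone, (old, new) in params_millions.items()
}
for backbone, pct in reductions.items():
    print(f"{backbone}: {pct:.1f}% fewer parameters in RT-DETRv4")
```

The relative saving shrinks as the backbone grows, from 50% for R18 to about 18% for R101, so the efficiency advantage of RT-DETRv4 is most pronounced at the smallest scale.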
Among all RT-DETR variants, RT-DETRv4-R18 achieved the fastest inference time (23.7 ms), substantially outperforming other models using the same backbone network, including RT-DETRv1-R18 (72.1 ms), RT-DETRv2-R18 (47.4 ms), and RT-DETRv3-R18 (52.7 ms). In contrast, the model achieving the highest mAP@0.5, RT-DETRv2-R101, exhibited an inference time of 66.3 ms. Although all RT-DETR models satisfied basic real-time requirements, their overall inference speed remained slower than that of the YOLO-based detectors evaluated in this study.
3.3. YOLO vs. RT-DETR
Figure 12 illustrates the trade-off between detection accuracy (mAP@0.5 and mAP@[0.5:0.95]) and inference time for the evaluated YOLO and RT-DETR models. As noted above, the RT-DETR models, especially v1–v3, exhibited substantially slower inference speeds than the YOLO-based detectors. Although RT-DETR variants achieved mAP@0.5 values above 85%, their overall performance across both mAP metrics remained inferior to that of the best-performing YOLO models. When inference time and accuracy are jointly considered, the YOLOv11 family demonstrates the most favorable balance for real-time chestnut detection, providing consistently high accuracy with significantly lower latency, and thus representing the most practical choice for deployment in harvesting systems.
Figure 13 presents representative detection results from RT-DETRv2-R101, the strongest RT-DETR variant, applied to the same test image shown in
Figure 6. In this case, RT-DETRv2-R101 detected 44 out of 49 chestnuts with a precision of 95.1%, which is lower than the 97.9% achieved by YOLOv12-m under identical conditions.
Across multiple evaluation metrics, the performance gap between RT-DETR and YOLOv11/YOLOv12 indicates architectural differences between the two frameworks. RT-DETR models are designed to capture global contextual information and long-range spatial dependencies, which can be advantageous for complex scene understanding. However, this design can limit their ability to preserve the fine-grained spatial features required for the precise localization of small, densely distributed, and partially occluded chestnuts. In contrast, YOLO architectures emphasize hierarchical local feature extraction and multi-scale feature fusion, enabling more robust performance under the complex, ground-level orchard environments. Overall, these results suggest that while RT-DETR may be advantageous for certain complex tasks, YOLO models are more appropriate for chestnut detection applications that demand both high accuracy and real-time processing capability.
3.4. Dynamic Video Stream Detection Results
Figure 14 shows the detection results of YOLOv12-m on a video stream, displaying two frames extracted from the video footage. The video was filmed during the same orchard visit as the chestnut image collection at a commercial orchard (Owosso, Michigan) using the same handheld smartphone (iPhone 12, Apple Inc., Cupertino, CA, USA). These two frames demonstrate the detection performance under dynamic conditions (with camera shake and varying lighting).
In the left image, with stable lighting conditions, the model correctly detected all 10 chestnuts in the scene, achieving a precision and recall of 1.0. This indicates excellent detection accuracy under stable and interference-free lighting conditions. In the right image, with complex lighting conditions, the model correctly detected 8 chestnuts, missed 1 (false negative), and incorrectly detected 4 (false positive). Therefore, the precision is 0.67, and the recall is 0.89. Despite video judder and changing lighting conditions, the model still detected most chestnuts, but false positives and false negatives highlighted the challenges posed by dynamic lighting conditions.
The detection process required an average of 4.2 ms for preprocessing, 10.7 ms for inference, and 1.1 ms for postprocessing per frame, for a total of about 16.0 ms per frame (roughly 62 frames per second). These times reflect the computational efficiency of the model in real-time video stream detection.
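A per-frame time budget can be assembled from the three stage timings reported above. The sketch below sums the stages, converts the total to a frame rate, and reports the share of each stage; the dictionary keys are illustrative labels, and the calculation assumes the stages run sequentially with no overlap.

```python
# Average per-frame stage times reported in the text (milliseconds).
stage_ms = {"preprocess": 4.2, "inference": 10.7, "postprocess": 1.1}

# Assumption: stages execute one after another for each frame.
total_ms = sum(stage_ms.values())
pipeline_fps = 1000.0 / total_ms

print(f"Total per frame: {total_ms:.1f} ms ({pipeline_fps:.1f} FPS)")
for stage, ms in stage_ms.items():
    print(f"  {stage}: {ms:.1f} ms ({ms / total_ms:.0%} of frame time)")
```

Under this assumption, inference accounts for roughly two thirds of the frame time, so preprocessing is the next place to look if additional throughput is needed on embedded hardware.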
These results demonstrate that lighting conditions significantly impact video stream detection performance. Under fluctuating lighting conditions, precision drops significantly due to false positives, while recall remains relatively high. Detection accuracy is more sensitive to lighting changes under interference from factors such as camera shake, indicating that future model improvements should focus on enhancing robustness to cope with such environmental variations, thereby ensuring high detection performance in practical applications.
4. Discussion
Due to the complexity of orchard floor environments, research on ground-level chestnut detection in commercial orchard conditions remains limited. The results of this study demonstrate that both the emerging YOLO and RT-DETR models can effectively identify chestnuts in realistic orchard settings. Within the YOLO family, YOLOv11 consistently achieved the best overall detection performance, outperforming YOLOv12 and YOLOv13. These findings are consistent with those reported by Sapkota et al. (2024) [
40], who evaluated multiple YOLO variants (v8–v12) for in-orchard pre-thinning detection of green apples and found that YOLOv11 demonstrated excellent precision, while YOLOv12-l achieved the highest recall. Together, these results reinforce the robustness and suitability of YOLO models for chestnut detection. YOLOv11’s architecture preserves fine spatial details, enabling tight bounding box localization, while its relatively small model size also supports fast inference speed, making it promising for embedded deployment on harvesting equipment.
In contrast, RT-DETR models exhibited a significantly longer inference time than the YOLO series models. This observation aligns with findings from Saltık et al. (2024) [
41], who reported that RT-DETRv1 can achieve competitive mAP at larger image sizes, but at the expense of increased inference time. These characteristics suggest that RT-DETR may be better suited for offline or batch processing scenarios, where global contextual modeling is beneficial and real-time constraints are less stringent.
Several limitations of this study warrant further investigation. First, the chestnut dataset is relatively small, which may limit the generalizability of the findings to orchards with different cultivars, soil types, ground terrains, or harvesting conditions. In future work, we plan to expand the dataset by collecting images from multiple orchards, cultivars, and seasonal conditions. This will enable cross-orchard and cross-season validation experiments, allowing a more rigorous assessment of the robustness and generalization capability of the proposed model under diverse real-world agricultural environments. Second, although both CNN-based YOLO models and Transformer-based RT-DETR models were evaluated, the overall training configuration, including data augmentation strategies and hyperparameter settings, was primarily developed based on the YOLO family. YOLO detectors are well-suited to aggressive geometric and photometric data augmentation, whereas RT-DETR models, due to their query-based Transformer architecture, are generally more sensitive to augmentation strength, query configurations, and training schedules. While RT-DETR was also tuned through adjustments of key hyperparameters such as learning rate and training schedule, further refinement of augmentation pipelines and other model-specific training strategies may still be required to fully exploit its potential performance. Third, the modeling experiments were conducted using static images; real-world deployment will require validation under continuous video streams, varying illumination throughout the day, and mechanical vibrations from harvesting equipment. Motion blur has been shown to degrade image information acquisition and negatively affect object detection tasks in precision agriculture scenarios [
42], indicating the need for motion-robust models to achieve reliable field performance. Future work will therefore focus on expanding the scale and diversity of the dataset, further optimizing RT-DETR-specific training configurations, and validating the performance of models deployed on harvesting platforms under dynamic conditions. Finally, this study focused exclusively on chestnut detection and did not explore downstream tasks such as vision-mechanism integration, vision-guided chestnut picking, and harvesting platform locomotion, which are necessary for developing an autonomous chestnut harvesting system.
The observed performance differences among YOLO variants are closely related to their ability to preserve and exploit high-resolution features for small-object representation under complex backgrounds. Ground-level chestnuts are typically small, densely distributed, and frequently occluded by grass, leaves, or soil, which makes the retention of shallow spatial details and local structural cues particularly critical. Excessive spatial down-sampling in deeper network layers can suppress weak target responses, leading to missed detections or imprecise localization. Models that more effectively leverage multi-scale feature fusion, therefore, tend to achieve higher recall and better localization consistency, particularly under stricter IoU thresholds where accurate boundary regression is essential. This observation suggests that fine-grained spatial information plays a dominant role in distinguishing chestnuts from visually similar background elements.
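The sensitivity of small objects to stricter IoU thresholds can be illustrated with a short calculation. The sketch below applies the same 3-pixel localization error to a small box and a large box; the box coordinates are hypothetical values chosen for illustration and are not measurements from the chestnut dataset.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) in pixels."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shift(box, dx, dy):
    """Return the box translated by (dx, dy) pixels."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# Hypothetical ground-truth boxes: a 20 x 20 px small object and a 100 x 100 px large object.
small_gt = (100, 100, 120, 120)
large_gt = (100, 100, 200, 200)

# The same 3-pixel diagonal offset is applied to both predictions.
small_iou = iou(small_gt, shift(small_gt, 3, 3))
large_iou = iou(large_gt, shift(large_gt, 3, 3))

print(f"small object IoU: {small_iou:.3f}")
print(f"large object IoU: {large_iou:.3f}")
```

With these illustrative numbers, the small box still counts as a match at an IoU threshold of 0.5 but not at 0.75, while the large box passes both, which is why boundary regression quality matters most for small targets.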
These findings further suggest that future improvements should focus on specific architectural modifications aimed at enhancing small-target perception. One promising direction is to integrate attention mechanisms, such as the Convolutional Block Attention Module (CBAM), which applies channel attention followed by spatial attention, into the backbone or neck network to highlight information-rich spatial regions and suppress background interference. Previous research has shown that attention modules can significantly improve a detector’s ability to localize small targets by reallocating feature weights and enhancing weak target responses. Another potential strategy is to optimize the feature pyramid structure. For example, traditional PAN (path aggregation network) or FPN (feature pyramid network) structures can be replaced by enhanced multi-scale fusion mechanisms, including improved feature pyramid networks, bidirectional feature pyramid networks, or attention-guided pyramid structures. These methods strengthen the interaction between shallow high-resolution features and deep semantic features. Such multi-scale fusion strategies are widely used to mitigate the spatial information loss caused by repeated downsampling in deeper layers and to improve the detection of small targets. Furthermore, adding a prediction head to an earlier feature layer, which corresponds to a higher-resolution feature map, could improve the detection of densely distributed chestnuts, because the detector can then better capture subtle spatial cues and object boundaries that are typically lost in deeper layers. This architectural improvement can enhance the recall and localization consistency of targets in complex orchard environments, especially when the target is small, partially occluded, and visually similar to background elements.
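The benefit of a higher-resolution detection head can be seen from simple stride arithmetic. The sketch below assumes the stride pattern commonly used in YOLO-style detectors (8, 16, and 32 for the standard heads, with a hypothetical additional head at stride 4), a 640-pixel input, and a 24-pixel chestnut; none of these values are taken from the configurations evaluated in this study.

```python
# Assumed input resolution (pixels per side); illustrative only.
INPUT_SIZE = 640

# Assumed downsampling strides per detection level.
# P3 to P5 follow the common YOLO-style layout; P2 is the hypothetical extra small-object head.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

# Hypothetical apparent chestnut diameter in the input image, in pixels.
CHESTNUT_PX = 24


def grid_size(input_size, stride):
    """Number of feature-map cells along one side at the given stride."""
    return input_size // stride


def footprint_cells(object_px, stride):
    """Approximate width of the object measured in feature-map cells."""
    return object_px / stride


footprints = {level: footprint_cells(CHESTNUT_PX, s) for level, s in STRIDES.items()}

for level, stride in STRIDES.items():
    side = grid_size(INPUT_SIZE, stride)
    print(f"{level}: stride {stride:>2}, grid {side}x{side}, "
          f"chestnut spans about {footprints[level]:.2f} cells")
```

Under these assumptions the nut covers less than one cell at the coarsest level but about six cells at stride 4, which is the intuition behind adding a head on an earlier, higher-resolution feature map.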
Research is ongoing to develop a vision-guided chestnut harvesting system by integrating the detection model with robotic manipulation and harvesting mechanisms. In such systems, accurate chestnut localization requires combining 2D detection with depth or stereo information to achieve reliable 3D positioning. Previous studies have demonstrated the feasibility of this approach in agricultural robotics for specialty crops. For example, Zhou et al. (2024) [
43] replaced complex 3D CNN architectures with a lightweight combination of 2D detection and stereo vision for the localization of
Camellia oleifera fruit, while Ge et al. (2023) [
44] showed that bounding-box-based depth estimation can achieve faster and more accurate localization than full 3D clustering approaches. These methods provide useful references for ground-level chestnut localization. From a systems perspective, effective deployment will require coordination between key components, including vision modules, robotic manipulators, and harvesting end-effectors, as emphasized by Chen et al. [
45]. Considering that chestnut production in the U.S. is dominated by small-scale orchards, future harvesting solutions must balance detection performance with system cost and operational simplicity.
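The combination of 2D detection with depth information described above can be sketched with a standard pinhole camera model. This is a generic illustration rather than the method of any cited study: the camera intrinsics, the bounding box, and the depth samples are hypothetical values, and the use of a median over valid depth readings is one simple design choice among several.

```python
from statistics import median


def bbox_center(box):
    """Center pixel (u, v) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def robust_depth(depth_samples_m):
    """Median of the valid depth readings inside the box; zeros are treated as invalid."""
    valid = [d for d in depth_samples_m if d > 0]
    if not valid:
        raise ValueError("no valid depth readings inside the bounding box")
    return median(valid)


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Convert a pixel and its depth to camera-frame coordinates (X, Y, Z) in meters."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640 x 480 depth camera.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# Hypothetical detection box and depth readings (meters) sampled inside it.
box = (350, 270, 410, 330)
depth_samples = [0.0, 0.49, 0.50, 0.51, 0.0, 0.50]

u, v = bbox_center(box)
z = robust_depth(depth_samples)
position = backproject(u, v, z, FX, FY, CX, CY)

print("chestnut position in camera frame (m):", position)
```

Taking the median over the box rather than a single center pixel reduces the influence of missing readings and of background pixels at the box edges, which matters for small nuts lying among grass and leaves.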