1. Introduction
Cherries are among the most cherished fruits globally, known for their exquisite taste and nutritional value. In Chile, cherries have become a cornerstone of the agricultural export sector, with a reputation for exceptional quality. The introduction of innovative cherry varieties, such as the early-maturing Cherry Burst™ by Bloom Fresh, underscores the industry’s commitment to maintaining its competitive edge [1]. Despite logistical and quality assurance challenges, Chile remains the world’s leading cherry exporter, accounting for 95.7% of the Southern Hemisphere’s export supply [2]. In the 2022–2023 season alone, Chile exported 356,442 tons of cherries, predominantly to China, which remains the primary market for this prized fruit.
During the 2022–2023 season, Chile produced 445,500 tons of cherries and exported 356,442 tons, equivalent to 71.3 million 5-kg boxes. The Southern Hemisphere as a whole exported 372,337 tons of cherries, of which Chile accounted for 95.7%, followed by Argentina with 8173 tons in the 2021–2022 season, Australia with 3975 tons, New Zealand with 3220 tons, and, finally, South Africa with 624 tons [2,3]. The main market for Chilean cherries continues to be China, although there was a slight decrease in the 2022–2023 season compared to 2020–2021: 313,961 tons (88.1% of the total) were shipped to China and Hong Kong, versus 322,188 tons in the previous season. Other important markets included North America, which received 13,877 tons (12,741 to the US and 1135 to Canada), and Europe, which received 6254 tons (3317 to England, 1801 to Holland, 694 to Spain, and 441 to other countries). Specialists expect Chile to maintain its growth, with production projected to double to 830,000 tons, equivalent to 166 million 5-kg boxes, by the 2026–2027 season, given that current cherry plantations in Chile cover 62,000 hectares [2,3].
Regarding the volumes exported in the 2023–2024 season, exporters have shipped 413,979 tons of cherries to international markets. China continues to be the main destination, setting a new record with shipments 3.3% higher than the previous season, amounting to 377 thousand tons, according to Claudia Soler, executive director of the Cherry Committee [3]. The main cherry varieties shipped were Lapins with 43% of the total volume exported, Santina (21%), Regina (19%), and Sweetheart and Bing (4% each). Other notable varieties included Kordia (3%), Skeena (2%), and Royal Dawn and Rainier (1% each), among others. In total, the Chilean industry exports more than 36 cherry varieties worldwide. Some of these are new early varieties, such as Meda Rex, Sweet Aryana, and Royal Lynn, though these are still produced in low volumes, not exceeding 80 tons exported.
Ensuring quality is critical to the success of Chilean cherry exports. Stringent quality control measures are required to meet the high standards of international markets, particularly China, which demands premium-grade cherries. Traditional manual methods for quality assessment—such as using calibrating rings for size and color tablets for ripeness—are labor-intensive, slow, and prone to human error. Moreover, subjectivity in evaluating parameters like color and texture can lead to inconsistencies in quality control [3]. Hence, quality control of cherries remains key to safeguarding Chile’s strict export standards.
However, how is cherry quality evaluated at harvest? One method is the calibration and quality control of cherries using artificial vision, wherein image-processing software determines three types of information describing the quality of the fruit: the color as an indicator of ripeness, the presence of defects such as cracking, and the size [4]. At processing plants, before hydrocooling, the fruit is received by the quality control team, which obtains a representative sample from each lot for thorough evaluation. The purpose of fruit sampling is to determine fruit quality and condition. This information is used to manage processes, provide parameters to producers, and often to segregate raw materials and define packaging strategies. The vast majority of exporters destine 100% of their cherry exports to the Chinese market, so they work with different labels and formats. Fruit classified as Premium is designated with the exporter’s “TOP” label and is generally packed in smaller formats, since this fruit will command the highest selling prices. In such cases, accurate lot segregation is essential. Among the different cherry varieties, the sweet cherry is particularly prized. Its quality can be evaluated using several objective methodologies, such as caliber, color, texture, soluble solids content (SSC), titratable acidity (TA), and maturity indexes. Functional and nutritional compounds are also frequently determined, in response to consumer demand [5]. Most sweet cherries are consumed fresh, while a small portion is value-added to make processed food products [6].
Cherry producers always rely on good sampling, which implies not only the quantity of fruit to be analyzed but also how the sampling is conducted [7]. As a rule, the more fruit and the greater the number of bins, the better, as this allows producers to cover the great diversity of the fruit and obtain the most representative sample [3]. However, this objective must be balanced with operational efficiency. Until recently, producers relied on 100% manual receptions to characterize all the parameters, taking samples of between 100 and 200 fruits at most. How was the sample evaluated? For sizing, the complete sample was passed fruit by fruit through calibrating rings and sorted by caliber. A similar approach was used for color, comparing the fruit against color tablets to ensure its color was adequate. Both systems, being manual, were slow and tedious. In addition, visual assessments of color are susceptible to lighting conditions, which can reduce the accuracy of evaluation.
Recent advances in artificial intelligence (AI) and image-processing technologies offer transformative potential for the cherry industry. Automated systems leveraging image-based analysis and artificial neural networks have shown promise in addressing the limitations of traditional methods. These technologies enable rapid, non-destructive quality assessments, accurately measuring parameters like size, color, and the presence of defects. For instance, Baiocco [8] demonstrated the efficacy of an image-based system for detecting pits in cherries, significantly reducing errors and enhancing efficiency.
Similarly, near-infrared spectroscopy combined with chemometrics has been used to monitor cherry quality changes under various storage conditions, providing valuable insights into post-harvest management [9]. These innovations highlight the potential of AI-driven tools to revolutionize quality assessment practices in the cherry industry.
Despite growing enthusiasm for agricultural applications of transfer learning, developing AI systems for agriculture requires validation not just of theoretical performance, but particularly of real-world operational effectiveness. Our study addresses this critical need through the following research question:
R.Q.: How does the effectiveness of transfer learning models (namely, VGG16, ResNet50, and EfficientNetB0) vary when transitioning from controlled laboratory environments to practical implementation in commercial cherry orchards?
Rationale: Real-world validation is crucial because controlled environments do not adequately capture the variability of commercial orchards (e.g., variable lighting, diverse viewing angles, and natural obstructions). Moreover, diagnostic errors—such as false negatives that miss diseased plants or false positives that trigger unnecessary treatments—have direct economic consequences for growers and can undermine disease management strategies.
Hence, this paper investigates the potential of transfer learning for automated visual assessment of cherry tree health, focusing on its operational viability in real-world agricultural settings. Building upon established methodologies in agricultural AI [10,11], we critically examine how pre-trained convolutional neural networks perform when adapted for disease detection in commercial cherry orchards, with particular attention to discrepancies between laboratory and field performance.
The remainder of this paper is structured as follows: Section 2 provides machine learning background; Section 3 reviews previous work involving artificial intelligence and machine learning for plant disease detection; Section 4 presents our transfer learning approaches and compares three deep learning architectures for leaf health classification; Section 5 presents the experimental results; Section 6 discusses threats to validity; Section 7 analyzes the findings; and Section 8 provides conclusions and future research directions.
3. Related Work
AI is increasingly employed in plant health to enhance disease detection, improve plant breeding, and optimize growth conditions [26]. In the context of disease detection, ML techniques, including DL, are leveraged to develop advanced methods for identifying and classifying plant diseases. These methods enable early disease detection, facilitating timely intervention and significantly reducing plant mortality rates [27]. Algorithms such as random forests, neural networks, and support vector machines are commonly used to predict plant diseases based on observable symptoms, including changes in shape, size, and wilting [28,29]. In terms of health assessment techniques, artificial neural networks (ANN) and stacked models have proven particularly effective in analyzing phenotypic data [30]. Additionally, ML plays an important role in uncovering complex interactions within cellular systems, especially in identifying pathogen effector genes involved in plant immunity [31]. Indeed, ML serves as a powerful tool in plant health, providing innovative solutions for disease detection, health assessment, phenotyping, and genomics analysis.
For sweet cherries, ML algorithms such as YOLOv5 are used to detect and characterize stressed tissues by identifying infected leaves and branches. This supports the early detection of diseases or stress factors, such as water shortages, which is vital for preserving tree health and maximizing yield [32]. Moreover, ANN and adaptive neuro-fuzzy inference systems are employed to estimate antioxidant activity and anthocyanin content in sweet cherries during ripening, offering a faster and more economical alternative to traditional laboratory techniques [33]. Likewise, predictive models have been developed to estimate antioxidant content in cherry fruits from multispectral imagery captured by drones [34]. ML is also utilized in automating sweet cherry harvesting: machine vision systems are designed to detect and segment cherry tree branches, streamlining automated harvesting processes, which reduces labor costs and enhances efficiency by minimizing manual handling [35]. Furthermore, datasets such as Cherry CO are employed to train machine learning algorithms for cherry detection, segmentation, and maturity recognition, enabling the development of high-performance models that automate assessment and harvesting tasks, thereby further improving fruit farming efficiency [36].
To better understand the evolution of these methods, a comparison between traditional and technological approaches is presented in Table 1. This comparison highlights key differences in aspects such as speed, accuracy, scalability, and cost-effectiveness. Traditional methods, while accessible, are often limited by their manual and time-consuming nature. Conversely, technological approaches driven by AI offer scalable, automated solutions that significantly reduce operational costs and improve consistency.
As can be seen, AI, ML, and DL have played an important role in sweet cherry agriculture by improving disease detection, estimating fruit quality attributes, and automating harvesting processes. These advancements contribute to more efficient and sustainable farming practices, ultimately enhancing productivity and fruit quality. However, these approaches tend to focus primarily on the fruit itself, which does not necessarily enable the early detection of potential diseases affecting the tree. Therefore, our study aims to advance the care of sweet cherry trees through a proactive approach that leverages AI to analyze tree leaves for potential diseases, even before the fruit has developed.
5. Results
This section presents a comprehensive evaluation of the three deep learning architectures (VGG16, ResNet50, and EfficientNetB0) for cherry leaf disease detection, systematically comparing their performance under both controlled laboratory conditions and real-world orchard environments. The analysis is structured into two principal dimensions:
- Laboratory metrics: Quantitative assessment of validation accuracy, loss dynamics, and overfitting tendencies using standardized test sets.
- Field performance: Operational effectiveness evaluated through confusion matrices and robustness to environmental variables (lighting, occlusion, etc.).
5.1. Training and Validation Dynamics
The models were trained for 20 epochs with early stopping (patience = 3). Figure 6 shows the training progression versus validation metrics, both for accuracy (a) and loss (b), revealing fundamental differences in the learning behavior of the three architectures.
VGG16:
- Stopped at epoch 18 due to early stopping.
- Training accuracy reached 99.97% by epoch 5.
- Validation accuracy plateaued at 99.73% by epoch 5.
- Validation loss increased from 0.31 to 0.52 after epoch 5.
- Possible overfitting after epoch 5.

ResNet50:
- Stopped at epoch 13 due to early stopping.
- Achieved 98.80% training accuracy and 98.64% validation accuracy.
- Smooth validation loss reduction (0.0379 → 0.0367).
- Possible stable convergence.

EfficientNetB0:
- Stopped at epoch 8 due to early stopping.
- Stagnant at 50.78% training / 55.03% validation accuracy.
- Erratic validation loss (range 0.6743–2.0118).
- No meaningful learning occurred; failed convergence.
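For concreteness, the following is a minimal sketch of the training protocol summarized above, assuming a Keras/TensorFlow pipeline; the classification head, input resolution, optimizer, and the `train_ds`/`val_ds` dataset objects are illustrative assumptions rather than the study's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# ImageNet-pretrained backbone without its classification head;
# the convolutional layers are frozen for transfer learning.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Small binary head: diseased vs. healthy leaf (sizes are assumptions).
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 20 epochs with early stopping (patience = 3), as in Section 5.1.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# train_ds / val_ds: hypothetical tf.data.Dataset objects of labeled leaf images.
history = model.fit(train_ds, validation_data=val_ds,
                    epochs=20, callbacks=[early_stop])
```

The same skeleton applies to ResNet50 and EfficientNetB0 by swapping the imported backbone.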
5.2. Field Performance Validation
For a comprehensive assessment, and regardless of what the training curves might suggest, the models were evaluated under real-world orchard conditions using 1472 unseen images.
Table 4, Table 5 and Table 6 present the classification performance through confusion matrices, while Table 7 summarizes key operational metrics.
Table 4. Confusion Matrix for VGG16 Field Performance, showing ≈24.0% false negative rate.

| Actual \ Predicted | Diseased | Healthy | Total |
|---|---|---|---|
| Diseased | 460 | 387 | 847 |
| Healthy | 353 | 272 | 625 |
| Total | 813 | 659 | 1472 |
Table 5. Confusion Matrix for ResNet50 Field Performance, showing ≈22.4% false negatives.

| Actual \ Predicted | Diseased | Healthy | Total |
|---|---|---|---|
| Diseased | 470 | 377 | 847 |
| Healthy | 331 | 294 | 625 |
| Total | 801 | 671 | 1472 |
Table 6. Confusion Matrix for EfficientNetB0, showing that field performance is equivalent to random guessing (48% accuracy).

| Actual \ Predicted | Diseased | Healthy | Total |
|---|---|---|---|
| Diseased | 370 | 477 | 847 |
| Healthy | 287 | 338 | 625 |
| Total | 657 | 815 | 1472 |
Operational Metrics
The quantitative field performance metrics, computed with respect to the diseased class, reveal critical differences in model reliability under real-world conditions, as shown in Table 7.
Table 7. Field performance comparison.

| Model | Accuracy | F1-Score | Recall | Precision |
|---|---|---|---|---|
| VGG16 | 0.50 | 0.55 | 0.54 | 0.57 |
| ResNet50 | 0.52 | 0.57 | 0.55 | 0.59 |
| EfficientNetB0 | 0.48 | 0.49 | 0.44 | 0.56 |
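As a sanity check, the Table 7 values for VGG16 can be reproduced directly from the Table 4 confusion matrix, assuming "diseased" is treated as the positive class (the convention consistent with the reported figures):

```python
# Deriving the Table 7 metrics from the VGG16 confusion matrix (Table 4),
# treating "diseased" as the positive class.
tp, fn = 460, 387   # actual diseased: correctly flagged / missed
fp, tn = 353, 272   # actual healthy: false alarms / correctly cleared

total = tp + fn + fp + tn                           # 1472 field images
accuracy = (tp + tn) / total                        # 0.497 -> 0.50
precision = tp / (tp + fp)                          # 0.566 -> 0.57
recall = tp / (tp + fn)                             # 0.543 -> 0.54
f1 = 2 * precision * recall / (precision + recall)  # 0.554 -> 0.55

print(f"Accuracy={accuracy:.2f}  Precision={precision:.2f}  "
      f"Recall={recall:.2f}  F1={f1:.2f}")
```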
Three key operational insights emerge from these metrics:
- Laboratory performance does not guarantee field reliability: All models showed significant accuracy drops in real-world conditions despite high laboratory validation scores, revealing a major domain adaptation gap.
- False negatives compromise disease control: Even the best-performing model (ResNet50) missed over 22% of diseased leaves, posing a serious risk for timely phytosanitary intervention.
- Model architecture affects robustness: While ResNet50 maintained relatively better field performance, EfficientNetB0 failed to generalize, highlighting that lightweight models may lack the capacity needed for noisy, real-world inputs.
5.3. Laboratory vs. Field Performance Gap
Despite the promising results observed during controlled training and validation, a substantial gap emerged when models were deployed under real-world orchard conditions. This discrepancy highlights a key limitation in the practical use of transfer learning for agricultural diagnostics.
Two of the three architectures demonstrated high accuracy in laboratory settings, exceeding 98% in the case of VGG16 and ResNet50. However, when exposed to real-world variability such as uneven lighting, occlusion by other leaves or branches, and natural leaf deformities, their performance degraded dramatically. VGG16 and ResNet50 dropped to 50% and 52% accuracy, respectively, while EfficientNetB0 failed to generalize altogether, reaching only 48% accuracy.
This sharp decline is not merely a quantitative drop but a qualitative shift in behavior: laboratory-optimized models failed to capture the complexity and noise inherent in field data. In particular, the elevated false negative rates across all models indicate that diseased leaves are frequently misclassified as healthy, an outcome that directly threatens the timeliness and effectiveness of phytosanitary management.
These findings suggest that conventional training pipelines, even when using transfer learning and early stopping, are insufficient for deployment in unstructured, outdoor environments. Closing this performance gap will likely require domain adaptation strategies, more diverse and field-representative training datasets, and architectures explicitly designed for robustness under environmental variability.
To better understand the nature of this gap, a qualitative inspection of misclassified field images was performed. Most errors were associated with non-ideal lighting conditions (e.g., shadowed leaves or overexposure), occlusions from branches or neighboring leaves, and natural deformations such as curled or partially damaged leaves. These findings suggest that environmental variability, rather than model limitations alone, significantly contributed to reduced field accuracy.
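One practical step toward the field-representative training data called for above is augmentation that emulates the dominant error sources observed here (lighting extremes, partial occlusion, varied framing). The following is a minimal Keras sketch; the layer choices and parameter ranges are illustrative assumptions, not the configuration used in this study:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Augmentation pipeline emulating field variability.
field_augment = tf.keras.Sequential([
    layers.RandomBrightness(0.3),         # shadowed leaves / overexposure
    layers.RandomContrast(0.3),           # uneven illumination
    layers.RandomRotation(0.1),           # varied viewing angles
    layers.RandomTranslation(0.1, 0.1),   # off-center framing
    layers.RandomZoom(0.2),               # scale variation, partial crops
    layers.RandomFlip("horizontal"),
])

# Applied on-the-fly during training only, e.g.:
# train_ds = train_ds.map(lambda x, y: (field_augment(x, training=True), y))
```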
7. Discussion
The results presented in this study reveal a significant gap between the performance of deep learning models in controlled laboratory settings and their operational effectiveness in real-world orchard environments. While the VGG16 and ResNet50 architectures achieved validation accuracies exceeding 98% during training, their field performance dropped drastically to 50% and 52%, respectively. This sharp decline underscores the fragility of transfer learning-based systems when exposed to domains characterized by high variability and noise—conditions that are inherent to agricultural environments. While advanced domain adaptation techniques such as DeepCORAL [37] and DANN [38] could potentially address the domain shift challenges observed in our study, we deliberately excluded these methods to establish baseline transfer learning performance under realistic deployment constraints. This decision reflects practical considerations for agricultural applications: edge devices deployed in orchard environments typically operate under severe computational limitations that preclude complex adaptation pipelines. Moreover, the inherent variability of orchard conditions—including frequent occlusions, varying lighting, and seasonal changes—violates the assumptions of coherent domain shifts that underpin many adaptation algorithms.
A particularly concerning observation is the elevated false-negative rate, which exceeded 22% even in the best-performing model (ResNet50). In agricultural disease monitoring contexts, such misclassifications have direct operational consequences: undetected diseased leaves can lead to delayed interventions, pathogen spread, and substantial economic losses. While our study employed F1-scores to partially address the inherent class imbalance (approximately 55% diseased leaves in our dataset), these findings underscore the need for more nuanced evaluation frameworks in future research. Specifically, metrics such as the Area Under the Precision–Recall Curve (AUC-PR) and 95% confidence intervals would provide more robust performance assessments, particularly under field conditions where disease prevalence varies significantly across orchards and seasons. Such metrics would better illuminate the critical trade-offs between false alarms and missed detections—distinctions that are often obscured by aggregate accuracy measures but are essential for informing deployment decisions in real-world agricultural monitoring systems.
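A sketch of the evaluation recommended here, computing AUC-PR with a bootstrap 95% confidence interval, is shown below; `y_true` and `y_score` are hypothetical arrays of per-image labels and model scores, not artifacts of this study:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def auc_pr_with_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Average precision (AUC-PR) with a bootstrap (1 - alpha) CI."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    point = average_precision_score(y_true, y_score)
    boots = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)       # resample images with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                      # skip single-class resamples
        boots.append(average_precision_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)
```

Because AUC-PR is sensitive to class prevalence, reporting it alongside the interval would make cross-orchard and cross-season comparisons more honest than aggregate accuracy alone.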
The poor convergence performance of EfficientNetB0 under both laboratory and field conditions underscores a fundamental tension between architectural efficiency and domain adaptability. Our results suggest that the model’s failure stems not from inherent representational limitations, but from the mismatch between its architectural requirements and our standardized training protocol. EfficientNetB0’s MBConv blocks are specifically designed for higher input resolutions (≥240 px) and exhibit sensitivity to batch normalization parameters under small batch conditions—neither of which was accommodated in our uniform experimental setup. This finding has broader implications for the deployment of efficient architectures in specialized domains: while computational efficiency remains crucial for edge applications, it cannot substitute for proper architectural adaptation to domain-specific constraints. The challenge lies not in achieving computational efficiency per se, but in developing training protocols that can effectively leverage the architectural strengths of different models within the constraints of agricultural monitoring scenarios.
Moreover, qualitative analysis of misclassified images reveals that natural environmental factors—such as uneven lighting, leaf occlusions, and morphological deformations—had a tangible impact on model predictions. These observations suggest that simple transfer of pre-trained models is insufficient; instead, domain adaptation techniques, expanded datasets with representative field conditions, and robust learning strategies are necessary to bridge the lab-to-field performance gap.
Collectively, these findings call for a reassessment of validation protocols in AI-based agricultural applications. Evaluating systems solely under idealized conditions can produce results that do not generalize to operational settings, overstating their real-world utility. Future development should prioritize early-stage field validation to ensure proposed solutions are not only accurate in theory but also reliable and actionable in practice.
8. Conclusions and Future Work
This study investigated the application of transfer learning for automated cherry tree health monitoring, focusing on the performance gap between controlled laboratory conditions and real-world orchard environments. Three deep learning architectures—VGG16, ResNet50, and EfficientNetB0—were evaluated, revealing significant discrepancies in their effectiveness across these settings. While VGG16 and ResNet50 achieved high validation accuracies (99.73% and 98.64%, respectively) under laboratory conditions, their field performance dropped dramatically to approximately 50%, with elevated false negative rates posing a critical risk for disease management. EfficientNetB0, despite its computational efficiency, failed to generalize effectively, underscoring the limitations of lightweight models in noisy agricultural environments.
These findings highlight a fundamental challenge in deploying AI for agricultural diagnostics: models optimized for controlled conditions may lack the robustness required for real-world variability. Factors such as uneven lighting, occlusions, and natural leaf deformations significantly degraded model performance, emphasizing the need for domain-specific adaptations. The high false-negative rates observed in this study are particularly concerning, as they could delay interventions and exacerbate disease spread, with tangible economic consequences for cherry producers.
These results call for a paradigm shift in how AI models are validated for agricultural applications. Laboratory performance alone is insufficient to guarantee operational reliability; field testing must be integrated early in the development cycle to identify and address domain-specific challenges.
In terms of future work, research should empirically evaluate four complementary approaches to address the indoor-to-outdoor domain shift observed in this study: adversarial domain adaptation methods (e.g., Domain-Adversarial Neural Networks [38]) to align feature distributions between controlled and field environments; feature-level alignment techniques (e.g., DeepCORAL [37], Maximum Mean Discrepancy [39]) to minimize distributional discrepancies; parameter-efficient fine-tuning strategies using limited field-acquired samples to adapt pre-trained models to specific orchard conditions; and visual explanation techniques such as Class Activation Mapping (CAM) or Grad-CAM [40] to enhance model interpretability and prediction reliability under field conditions. These investigations will be essential for quantifying the performance gains achievable through domain adaptation while maintaining the computational efficiency required for practical edge deployment in agricultural monitoring systems. Such work will ultimately inform the development of robust, deployable AI-assisted solutions that can bridge the gap between laboratory research and real-world agricultural applications, particularly in Chilean cherry orchards.
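As a concrete illustration of the feature-level alignment direction, the following is a minimal TensorFlow sketch of the CORAL loss underlying DeepCORAL [37], which penalizes the distance between the feature covariances of a labeled laboratory ("source") batch and an unlabeled field ("target") batch; how it would be integrated into a full training loop is an open design choice, not something this study prescribes:

```python
import tensorflow as tf

def coral_loss(source_feats, target_feats):
    """CORAL loss: squared Frobenius distance between the feature
    covariances of a source (laboratory) batch and a target (field) batch,
    scaled by 1 / (4 d^2) as in DeepCORAL."""
    d = tf.cast(tf.shape(source_feats)[1], tf.float32)

    def cov(x):
        # Sample covariance of a (batch, features) matrix.
        x = x - tf.reduce_mean(x, axis=0, keepdims=True)
        n = tf.cast(tf.shape(x)[0], tf.float32)
        return tf.matmul(x, x, transpose_a=True) / (n - 1.0)

    diff = cov(source_feats) - cov(target_feats)
    return tf.reduce_sum(tf.square(diff)) / (4.0 * d * d)

# During adaptation, this term would be added to the supervised loss, e.g.:
# total_loss = task_loss + lambda_coral * coral_loss(f_lab, f_field)
```

Because the loss operates only on pooled feature batches, it adds negligible compute at inference time, which matters for the edge-deployment constraints discussed in Section 7.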