Algorithms
  • Article
  • Open Access

7 January 2026

An Algorithmic Framework for Cocoa Ripeness Classification: A Comparative Analysis of Modern Deep Learning Architectures on Drone Imagery

Department of Computer Science, Manhattan University, New York, NY 10471, USA
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Artificial Intelligence Algorithms for Prediction, Control, Classification, Regression, and Intelligent Signal Processing in Industry

Abstract

This study addresses the challenge of automating cocoa pod ripeness classification from drone imagery through a comprehensive and statistically rigorous investigation conducted on data collected from Ghanaian cocoa fields. We perform a direct comparison by subjecting a curated set of seven deep learning models to an identical, advanced algorithmic framework. This pipeline incorporates high-resolution (384 × 384) imagery, aggressive TrivialAugmentWide data augmentation, a weighted loss function with label smoothing, a unified two-stage fine-tuning strategy, and validation with Test Time Augmentation (TTA). To ensure statistical robustness, all experiments were repeated three times using different random seeds. Under these demanding experimental conditions, modern architectures demonstrated strong and consistent performance on this dataset: the Swin Transformer achieved the highest mean accuracy (79.27% ± 0.56%), followed closely by ConvNeXt-Base (79.21% ± 0.13%). In contrast, classic architectures such as ResNet-101 (55.86% ± 4.01%) and ResNet-50 (64.32% ± 0.94%) showed substantially reduced performance. A paired t-test confirmed that these differences are statistically significant (p < 0.05). These results suggest that, within the evaluated setting, modern CNN- and transformer-based architectures exhibit greater robustness under challenging, statistically validated conditions, indicating their potential suitability for drone-based agricultural monitoring tasks.

1. Introduction

1.1. The Challenge of the Chocolate Bean

From artisanal chocolate bars to global commodities, the quality of all cocoa products begins with a single, critical decision on the farm: the timing of the harvest. For the nation of Ghana, where cocoa is a cornerstone of the economy [1], this decision carries significant weight. A cocoa pod, if harvested too early, yields beans that fail to ferment properly; if harvested too late, the beans risk disease and degradation [2]. The traditional method for this crucial task relies on the trained eye of experienced farmers, a process that is inherently subjective, labor-intensive, and difficult to scale across vast and often remote plantations. This manual bottleneck represents a significant challenge to maximizing both the quality and quantity of the yield.

1.2. The Promise of an Eye in the Sky

Precision agriculture, leveraging technologies like Unmanned Aerial Vehicles (UAVs) or drones, offers a transformative solution [2]. An “eye in the sky” can survey thousands of trees in a fraction of the time required for manual inspection, capturing high-resolution imagery rich with potential data. However, translating this raw imagery into actionable intelligence presents its own set of complex technical hurdles. The task requires a system to solve a complex visual puzzle of color, texture, and shadow, often obscured by a dense canopy that creates variable lighting and frequent occlusions [3]. The sheer detail of the drone imagery, while valuable, demands a computational approach that is both powerful and efficient.
The economic stakes motivating this “eye in the sky” approach for Ghana cannot be overstated; cocoa is not just a crop but a primary driver of export revenue and a source of livelihood for millions. However, translating raw drone imagery into actionable harvest intelligence is a significant algorithmic challenge fraught with complexities beyond simple occlusion and variable lighting. The very nature of aerial data acquisition introduces motion blur from the drone’s movement and drastic scale variation, where pods can appear as a few pixels in one frame and a large object in another. Furthermore, the task is a classic “fine-grained classification” problem, characterized by high intra-class variance (e.g., two “riped” pods look different due to sun exposure or disease) and low inter-class variance (e.g., the subtle color shift between “mature-unripe” and “riped” stages). Any successful system must therefore be robust enough to learn these nuanced, distinguishing features from imperfect, real-world data.

1.3. The Evolution of Machine Perception

This study is built upon a foundational investigation into the efficacy of deep learning for this task. Preliminary screening of classic Convolutional Neural Networks (CNNs) indicated that deep, complex architectures are necessary to learn the subtle visual cues of cocoa ripeness, raising a critical research question: could newer, more powerful architectural paradigms unlock a new level of performance and generalization?
The field of computer vision has recently undergone a significant evolution, spurred by the success of Vision Transformers (ViTs) like the Swin Transformer, which adapt models from natural language processing to capture global spatial relationships in images [4]. Concurrently, a new generation of “modern” CNNs, such as ConvNeXt, has emerged, integrating design principles from transformers back into a convolutional framework to achieve state-of-the-art results with high efficiency [5]. This creates a compelling opportunity to compare these distinct architectural families (classic CNNs, Vision Transformers, and modern CNNs) on a single, challenging, real-world problem.

1.4. Our Contribution

While numerous studies have compared CNNs and transformers, few have rigorously evaluated their resilience under the “modern training recipes” required for high-resolution aerial imagery. The existing literature often relies on standard benchmarks or mild augmentations. This study fills a critical gap by subjecting a wide range of architectures to a unified “stress-test” framework.
The purpose of this framework is not merely to find the highest accuracy, but to simulate complex, real-world visual variance using aggressive augmentation, high-resolution imagery (384 × 384), and advanced optimization. To ensure our findings are reliable, all experiments were repeated three times with different random seeds for statistical validation.
The primary contributions are as follows:
  • A rigorous evaluation of architectural robustness, demonstrating that while modern architectures (ConvNeXt, Swin) thrive under aggressive augmentation pipelines, classic deep architectures (ResNet-101) suffer catastrophic performance collapse.
  • A statistically validated comparison of three distinct deep learning paradigms evaluated on the specific challenges of the Cocoa Ghana dataset, supported by paired t-tests and confidence intervals.
  • The definition of a reproducible, algorithmic training standard (Algorithm 1) that serves as a baseline for future agricultural deep learning research.
Algorithm 1 Unified Two-Stage Training and Evaluation Pipeline
1: Input: Dataset D, List of model names M_names
2: Parameters: Image size R_img ← 384, TTA repetitions N_TTA
3: L_train, L_val, L_TTA ← PrepareData(D, R_img)
4: Calculate class weights W_c from frequencies in L_train
5: for each model_name in M_names do
6:     Load pretrained model M; Replace final classifier layer
7:     // — Stage 1: Train Head —
8:     Freeze backbone; Unfreeze head
9:     Train M (Stage 1) using L_train, W_c, and high LR
10:    // — Stage 2: Fine-Tune Full Network —
11:    Unfreeze all layers
12:    Train M (Stage 2) using L_train, W_c, and low LR
13:    // — Evaluation —
14:    Test M on L_TTA using prediction averaging (TTA)
15: end for
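A minimal PyTorch sketch of the two-stage procedure in Algorithm 1 is shown below. It is an illustration under stated assumptions rather than the exact experimental code: the ConvNeXt backbone, the tiny synthetic data loader, and the uniform class weights are placeholders, and the validation loop with early stopping is omitted for brevity.

```python
# Illustrative PyTorch sketch of Algorithm 1 (not the authors' exact code).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder data: in practice this is replaced by the drone-image loaders (L_train).
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 384, 384), torch.randint(0, 5, (8,))), batch_size=4)
class_weights = torch.ones(5)  # W_c from Algorithm 1, computed from class frequencies

criterion = nn.CrossEntropyLoss(weight=class_weights.to(device), label_smoothing=0.1)

def run_stage(model, loader, lr, epochs):
    """Train only the parameters that currently require gradients (one stage)."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()
        scheduler.step()

# Load a pretrained model and replace the final classifier layer (5 cocoa classes).
model = models.convnext_base(weights="IMAGENET1K_V1")
model.classifier[2] = nn.Linear(model.classifier[2].in_features, 5)
model = model.to(device)

# Stage 1: freeze the backbone, train only the new classifier head at a high LR.
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True
run_stage(model, train_loader, lr=4e-4, epochs=15)

# Stage 2: unfreeze all layers and fine-tune end-to-end at a low LR.
for p in model.parameters():
    p.requires_grad = True
run_stage(model, train_loader, lr=4e-5, epochs=50)
```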

3. Methodology

This section details the comprehensive pipeline developed for training and evaluating deep learning models for cocoa pod classification. The methodology was designed as a two-part experimental process. First, a broad evaluation of classic CNN architectures was conducted to establish baseline performance. Second, an advanced training pipeline was developed and applied to a curated set of high-performing and state-of-the-art models to push the performance limits.

3.1. Dataset and Preparation

The study utilized the “Drone-based Agricultural Dataset for Crop Yield Estimation”, specifically the Cocoa Ghana subset, sourced from KaraAgro AI [18]. To illustrate the variability within the dataset, Figure 1 presents representative drone images from each of the five classes, highlighting challenges such as occlusion and varying illumination. This dataset consists of 4069 high-resolution images captured by drones over Ghanaian cocoa fields. Each image is paired with a corresponding label file that identifies the primary subject according to a five-class taxonomy: (0) cocoa-pod-mature-unripe, (1) cocoa-tree, (2) cocoa-pod-immature, (3) cocoa-pod-riped, and (4) cocoa-pod-spoilt.
Figure 1. Representative drone image examples from each of the five classes in the dataset. The images highlight the visual diversity and challenges, such as varying illumination, partial occlusion by canopy, different viewing angles, and subtle color differences between ripeness stages.
The dataset was partitioned into training, validation, and test sets using a fixed split ratio of 70%, 15%, and 15%, respectively. To support statistical validation, this splitting procedure was repeated multiple times using different random seeds, as described in Section 3.4 (an illustrative splitting sketch follows Table 2). Across all runs, this resulted in approximately 2848 images for training, 610 for validation, and 611 for testing per split. The distribution of images across the five classes, summarized in Table 2, reveals a notable class imbalance, particularly for the cocoa-pod-riped class, which is the least represented. This imbalance motivated the use of a weighted loss function during training. The dataset is publicly available under the CC BY 4.0 license, permitting use and redistribution with appropriate credit.
Table 2. Class distribution across data partitions.
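For illustration only, the sketch below shows one way to produce the seeded 70/15/15 split described above, assuming the images can be loaded through a torchvision ImageFolder-style directory; the directory path and the simple transform are placeholders, not the dataset's actual layout or loading code.

```python
# Illustrative seeded 70/15/15 split (paths and loading details are placeholders).
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

def make_splits(root, seed, image_size=384):
    tf = transforms.Compose([transforms.Resize((image_size, image_size)),
                             transforms.ToTensor()])
    full = datasets.ImageFolder(root, transform=tf)   # 4069 images, 5 classes
    n = len(full)
    n_train = int(0.70 * n)                           # ~2848 images
    n_val = int(0.15 * n)                             # ~610 images
    n_test = n - n_train - n_val                      # ~611 images
    gen = torch.Generator().manual_seed(seed)         # seeds 42, 84, 126 in the paper
    return random_split(full, [n_train, n_val, n_test], generator=gen)

train_set, val_set, test_set = make_splits("cocoa_ghana/", seed=42)
```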

3.2. Model Selection and Preliminary Screening

To identify the most viable candidates for the rigorous analysis, we first conducted a broad preliminary screening of fifteen pretrained architectures. This included various depths of ResNet, DenseNet, EfficientNet, VGG, and MobileNet. As detailed in Table A1, older architectures such as VGG and AlexNet failed to achieve competitive accuracy, likely due to their lack of residual connections and inefficient parameter usage.
From this screening, we selected a representative suite of seven models to advance to the main study. The selection criteria ensured coverage of distinct design paradigms:
  • ResNet-50 and ResNet-101: Selected as the standard benchmarks for residual CNNs to test whether increased depth correlates with robustness in this domain.
  • DenseNet-169: Selected to evaluate feature reuse via dense connections.
  • EfficientNet-B4: Chosen over B0 or B7 as it offers an optimal trade-off between parameter count and input resolution scaling for 384 × 384 imagery.
  • MobileNetV3-Large: Selected to represent the upper bound of lightweight, edge-deployable performance compared to the smaller variants.
  • Swin-Base and ConvNeXt-Base: Selected as state-of-the-art representatives of Vision Transformers and modern hybrid CNNs, respectively. The overall structure of the Swin Transformer, characterized by its hierarchical design and patch-merging capabilities, is depicted in Figure 2.
    Figure 2. The Swin Transformer architecture uses a hierarchical approach, processing images through stages of increasing depth while merging image patches to create a multi-scale feature representation (image from Liu et al. [4]).

3.3. Experimental Design and Training Framework

To ensure a fair comparison, all seven selected models were trained using an identical, state-of-the-art training pipeline. This framework was designed to maximize robustness and generalization. To provide a comprehensive overview of the proposed approach, Figure 3 illustrates the unified experimental design, detailing the workflow from data preparation and the two-stage fine-tuning strategy to the final statistical evaluation.
Figure 3. The unified experimental design and training framework. This flowchart summarizes the data preparation, the two-stage fine-tuning strategy applied to all models, and the statistical evaluation protocol.
  • Image Resolution and Augmentation: Images were resized to 384 × 384 pixels. This resolution was selected as a trade-off: it preserves fine-grained texture details of the cocoa pods better than standard 224 × 224 inputs, without incurring the prohibitive computational cost of 512 × 512 . The training set was processed using TrivialAugmentWide [19], an automated policy that randomly selects aggressive augmentations (e.g., rotation, color jitter, solarize) to maximize generalization.
  • Test Time Augmentation (TTA): For evaluation, we employed a stochastic TTA strategy to further assess model robustness. Rather than relying on fixed crops, the model performed inference on five distinct, randomly augmented views of each test image. Unlike the training phase, TTA utilized a conservative augmentation pipeline consisting of random horizontal flips (p = 0.5), random rotations (±15°), and subtle color jittering (brightness and contrast ±0.1). The final prediction was derived by averaging the softmax probabilities of these five stochastic views, reducing the impact of outliers and simulating varying viewing conditions.
  • Loss Function: To address the class imbalance, a Weighted Cross-Entropy Loss function was employed. Class weights w_c were computed using the inverse frequency formula w_c = N / (C × n_c), where N is the total number of samples, C is the number of classes (5), and n_c is the number of samples in class c. We also applied label smoothing (factor 0.1) to prevent overfitting. An illustrative PyTorch sketch of the augmentation, loss, and TTA components is provided at the end of this subsection.
  • Unified Two-Stage Fine-Tuning: All seven models were trained using the same two-stage strategy to ensure a fair comparison.
    Stage 1: Only the final classifier head was trained for 15 epochs with a high learning rate (LR = 1 × 10⁻³ or 4 × 10⁻⁴), allowing the new layers to adapt.
    Stage 2: The entire network was unfrozen and fine-tuned end-to-end for 50 epochs with a lower learning rate (LR = 3 × 10⁻⁵ or 4 × 10⁻⁵) and early stopping patience of 10.
  • Optimization: The AdamW optimizer was used with a Cosine Annealing Scheduler to systematically adjust the learning rate.
  • Hardware Environment: All experiments were conducted on a workstation equipped with an NVIDIA Quadro P5000 GPU (16GB VRAM). The batch size was set to 12 to accommodate the 384 × 384 resolution and gradient requirements. Total training time for the two-stage pipeline averaged approximately 4 h per model per seed.
In this work, we define fairness as the use of a shared training recipe across all evaluated architectures. By holding optimization settings constant, we aim to isolate the impact of architectural design choices while reducing variability due to hyperparameter tuning. Although some architectures may achieve higher absolute performance under architecture-specific optimization, a shared training protocol enables a more controlled comparison and supports clearer attribution of performance differences in an applied evaluation context.
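To make these components concrete, the sketch below assembles the augmentation, weighted-loss, and TTA pieces described in the list above using PyTorch/torchvision. The numeric settings mirror the values quoted in the text, while the placeholder label list, the ImageNet normalization statistics, and the helper names are illustrative assumptions rather than the exact experimental code.

```python
# Illustrative assembly of the augmentation, weighted-loss, and TTA components
# described above (a sketch, not the authors' exact code).
import torch
import torch.nn as nn
from collections import Counter
from torchvision import transforms

IMG = 384
# Standard ImageNet normalization (assumed; not stated explicitly in the text).
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# Training-time augmentation: aggressive TrivialAugmentWide policy at 384 x 384.
train_tf = transforms.Compose([
    transforms.Resize((IMG, IMG)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    normalize,
])

# Conservative TTA pipeline: flips (p = 0.5), rotations (+/-15 deg), subtle color jitter.
tta_tf = transforms.Compose([
    transforms.Resize((IMG, IMG)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    normalize,
])

def inverse_frequency_weights(labels, num_classes=5):
    """Class weights w_c = N / (C * n_c), computed from the training labels."""
    counts = Counter(labels)
    n = len(labels)
    return torch.tensor([n / (num_classes * counts[c]) for c in range(num_classes)],
                        dtype=torch.float32)

train_labels = [0, 1, 2, 3, 4, 3, 2, 1]  # placeholder integer labels for illustration
criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights(train_labels),
                                label_smoothing=0.1)

@torch.no_grad()
def predict_tta(model, pil_image, n_views=5, device="cpu"):
    """Average softmax probabilities over n_views stochastic views of one test image."""
    model.eval()
    views = torch.stack([tta_tf(pil_image) for _ in range(n_views)]).to(device)
    return torch.softmax(model(views), dim=1).mean(dim=0)
```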

3.4. Statistical Validation

To ensure the robustness of our conclusions, all experiments were repeated three times, each with a different random seed (42, 84, and 126). These seeds governed the train/validation/test data splits and model weight initialization. The performance metrics (accuracy, F1-score, test loss) are reported in the Results Section as the mean ± standard deviation across these three runs.
To determine whether the performance differences were statistically meaningful, we conducted a paired t-test on the accuracy results from the three runs. This test compares the sets of results, pairing them by the seed they were run with, to ascertain whether the difference between two models is statistically significant (p < 0.05).
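As an illustration of this procedure, the snippet below runs a paired t-test over three seed-matched accuracy values with SciPy; the listed accuracies are hypothetical placeholders, not the reported per-seed results.

```python
# Illustrative paired t-test over the three seeded runs (values are hypothetical).
from scipy import stats

seeds = [42, 84, 126]
acc_model_a = [0.793, 0.791, 0.792]   # placeholder per-seed accuracies, model A
acc_model_b = [0.520, 0.585, 0.571]   # placeholder per-seed accuracies, model B

t_stat, p_value = stats.ttest_rel(acc_model_a, acc_model_b)  # paired by seed
print(f"t({len(seeds) - 1}) = {t_stat:.2f}, p = {p_value:.4f}")
```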

3.5. Evaluation Protocol

The final performance of each fully trained model was evaluated on the hold-out test set to ensure an unbiased assessment. For our multi-class classification problem, performance was quantified using a suite of standard metrics derived from the model’s predictions. These metrics rely on four fundamental outcomes for each class: True Positives (TP), instances of the class that were correctly predicted; True Negatives (TN), instances correctly identified as not belonging to the class; False Positives (FP), instances incorrectly assigned to the class; and False Negatives (FN), instances of the class that the model missed.
The primary quantitative metrics are defined as follows:
  • Accuracy: This metric measures the overall correctness of the model across all classes. It is calculated as the ratio of all correct predictions to the total number of predictions made.
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision: Precision quantifies the reliability of a positive prediction. For a given class, it answers the question “Of all the instances the model predicted to be this class, what fraction was actually correct?”
    Precision = TP / (TP + FP)
  • Recall (Sensitivity): Recall measures the model’s ability to find all relevant instances of a class. It answers the question “Of all the actual instances of this class in the dataset, what fraction did the model correctly identify?”
    Recall = TP / (TP + FN)
  • F1-Score: The F1-score is the harmonic mean of precision and recall, providing a single score that balances both metrics. It is particularly useful when the class distribution is imbalanced.
    F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
For this study, we report the macro-averaged precision, recall, and F1-score, computed by calculating each metric independently for each class and then taking the unweighted average. This approach treats all classes equally, regardless of their frequency in the dataset. A minimal computation sketch for these metrics and the visualizations described below is provided after the following list.
In addition to these metrics, we generated two key visualizations to provide a more qualitative assessment of model performance:
  • Confusion Matrix: A confusion matrix is a C × C grid, where C is the number of classes. It provides a detailed breakdown of classification performance by showing the relationship between the true labels and the predicted labels. The diagonal elements represent the number of correctly classified instances for each class, while off-diagonal elements reveal the specific misclassifications, indicating which classes are most often confused with one another.
  • Precision–Recall (PR) Curve: A PR curve is a two-dimensional plot that illustrates the trade-off between precision (y-axis) and recall (x-axis) for a given class across a range of decision thresholds. A curve that bows out toward the top-right corner indicates a model with both high precision and high recall. The Area Under the PR Curve (AUC-PR) serves as a single, aggregate measure of performance, with a higher value indicating a more skillful classifier.
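The sketch referenced above shows how these metrics and visual summaries can be computed with scikit-learn; the label and score arrays are placeholders for the test-set outputs, and the tooling choice is an assumption for illustration, not a statement about the authors' implementation.

```python
# Illustrative metric computation with scikit-learn (placeholder predictions).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             confusion_matrix, precision_recall_curve,
                             average_precision_score)

num_classes = 5
y_true = np.array([0, 1, 2, 3, 4, 3, 2, 1])           # hypothetical true labels
y_score = np.random.rand(len(y_true), num_classes)    # hypothetical softmax scores
y_score /= y_score.sum(axis=1, keepdims=True)
y_pred = y_score.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # unweighted average over classes
cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))

# One-vs-rest PR curve and AUC-PR (average precision) per class.
pr_curves, auc_pr = {}, {}
for c in range(num_classes):
    binary = (y_true == c).astype(int)
    pr_curves[c] = precision_recall_curve(binary, y_score[:, c])
    auc_pr[c] = average_precision_score(binary, y_score[:, c])

print(f"accuracy={acc:.3f}  macro-P={prec:.3f}  macro-R={rec:.3f}  macro-F1={f1:.3f}")
```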

3.6. Computational Analysis

To provide a comprehensive analysis beyond predictive accuracy, we evaluated the computational and memory requirements of the top-performing models from each architectural family, as detailed in Table 3. An algorithm’s utility in a real-world application, such as deployment on a drone, is determined not only by its accuracy but also by its efficiency. The number of trainable parameters indicates the model’s memory footprint, while GFLOPs (Giga Floating-Point Operations) quantify the computational cost for a single forward pass. Finally, inference time measures the practical speed of the model on relevant hardware. Together, these metrics are crucial for determining the feasibility of deploying these algorithms in resource-constrained environments.
Table 3. Computational and efficiency analysis of all models from the advanced pipeline. All metrics are calculated for a 384 × 384 input resolution. Inference times were measured for a single image on an NVIDIA Quadro P5000 GPU.
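A sketch of how such efficiency figures can be obtained is given below; counting FLOPs with fvcore and using convnext_base as the profiled architecture are assumptions for illustration, since the paper does not name its profiling tool.

```python
# Illustrative measurement of parameters, GFLOPs, and latency at 384 x 384 input.
import time
import torch
from torchvision import models
from fvcore.nn import FlopCountAnalysis  # assumed profiling tool (pip install fvcore)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.convnext_base(weights=None).to(device).eval()   # example architecture
x = torch.randn(1, 3, 384, 384, device=device)

params_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
gflops = FlopCountAnalysis(model, x).total() / 1e9

with torch.no_grad():
    for _ in range(5):                      # warm-up passes
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(20):                     # timed passes, averaged
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / 20 * 1000

print(f"{params_m:.1f}M params, {gflops:.1f} GFLOPs, {latency_ms:.2f} ms/image")
```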

4. Results

This section presents the empirical outcomes of our statistically validated experimental framework. We first detail the aggregate performance of all seven models, followed by a per-class analysis of the top-performing architectures.

4.1. Statistical Performance Analysis

All seven models were trained three times with different random seeds (42, 84, 126) under the identical advanced pipeline described in Section 3. The aggregate performance is reported in Table 4 as the mean ± standard deviation.
Table 4. Consolidated performance metrics (mean ± Std. Dev. over 3 seeds). All models were trained under an identical advanced pipeline. The p-value from a paired t-test between ConvNeXt and ResNet101 was 0.0140.
The results reveal a clear hierarchy in architectural robustness. The modern architectures, specifically the Swin Transformer and ConvNeXt-Base, demonstrated superior performance, achieving accuracies of approximately 79.27% and 79.21%, respectively. More importantly, these models exhibited remarkable stability. ConvNeXt-Base, in particular, showed a negligible standard deviation of 0.13 across runs. To strictly quantify this reliability, we calculated 95% confidence intervals (CI) using the t-distribution for small samples (n = 3). ConvNeXt-Base achieved a tight 95% CI of [78.89%, 79.53%], confirming that its high performance is systematic and resilient to variations in data splits.
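For reference, an interval of this form can be computed from three per-seed accuracies with the t-distribution as sketched below; the listed values are placeholders rather than the exact per-seed results.

```python
# Illustrative 95% CI from three seeded runs using the t-distribution (n = 3).
import numpy as np
from scipy import stats

accs = np.array([0.791, 0.792, 0.793])            # placeholder per-seed accuracies
mean, sem = accs.mean(), stats.sem(accs)          # mean and standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(accs) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```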
In stark contrast, the classic residual architectures struggled significantly under this rigorous training regime. ResNet-101, despite its depth, suffered a severe performance collapse, dropping to a mean accuracy of 55.86%. The high standard deviation of 4.01 resulted in a wide confidence interval of [45.90%, 65.82%]. This extreme variance indicates that in some runs, the model failed to converge to a competitive solution, suggesting that it lacks the inductive biases necessary to handle the aggressive augmentations used in this framework.
The analysis of test loss values provides further insight into model confidence. The modern models achieved significantly lower test losses (approximately 0.60 to 0.62) compared to the classic ResNet models (0.95 to 1.13). A lower loss indicates that the modern models were not only correct more often but also more confident in their predictions. The high loss for ResNet-101 suggests that even when it predicted correctly, it did so with low probability margins, a trait that is undesirable for deployment in safety-critical or economically sensitive agricultural applications.
To statistically validate these observations, a paired t-test was conducted comparing the accuracy of the most robust modern CNN (ConvNeXt-Base) against the deepest classic CNN (ResNet-101). The assumption of normality was maintained based on the consistent experimental design. The analysis yielded a test statistic of t(2) ≈ 8.38 and a p-value of 0.0140. Since p < 0.05, we reject the null hypothesis and confirm that the architectural superiority of the modern backbone is statistically significant.
Formal statistical testing was primarily used to assess performance differences between modern and classic architectures, which constitute the central focus of this study. Comparisons among modern architectures with closely matched performance are therefore interpreted descriptively based on mean and variance.

4.2. Visual Performance Analysis

To further understand the classification capabilities of the proposed framework, we analyzed the confusion matrices and precision–recall curves of one of the top-performing models, ConvNeXt-Base.
As illustrated in Figure 4, the model demonstrates a balanced performance across all classes. Crucially, the confusion matrix (Figure 4a) shows minimal confusion between the most economically significant classes, ‘cocoa-pod-riped’ and ‘cocoa-pod-spoilt’. The precision–recall curves (Figure 4b) confirm this robustness, with the model maintaining high precision scores even at higher recall thresholds. This indicates that the unified two-stage training strategy successfully imparts fine-grained discriminative features to the modern architecture.
Figure 4. Comparative visual analysis of architectural robustness under the advanced pipeline. (a,b) The modern ConvNeXt-Base demonstrates high precision and clean separation between classes. (c,d) In contrast, the classic ResNet-101 exhibits significant confusion, particularly misclassifying ‘Tree’ and ‘Immature’ classes, visually confirming the performance collapse detailed in Table 4.
Conversely, the failure of the ResNet architectures can be traced to specific class confusions. As detailed in the Appendix A Table A1 data, ResNet-101 struggled profoundly with the ‘cocoa-tree’ class, achieving a recall of only 39%. It frequently misclassified background foliage as unripe pods. This suggests that the older receptive field designs are less capable of contextualizing objects within dense clutter compared to the large-kernel designs of ConvNeXt or the shifted-window attention mechanisms of Swin Transformers.

4.3. Model Interpretability

To validate that the models are learning relevant agricultural features rather than background noise, we employed Gradient-weighted Class Activation Mapping (Grad-CAM). This technique visualizes the regions of the image that most influenced the model’s prediction.
As shown in Figure 5, the ConvNeXt model consistently focuses its attention on the cocoa pods themselves, effectively ignoring the complex canopy and trunk structures. This localization capability confirms that the high accuracy metrics reported in Table 4 are a result of genuine feature learning, further validating the model’s potential for reliable real-world deployment. While Grad-CAM provides qualitative evidence of the model’s focus, we note that quantitative localization metrics (such as Intersection over Union) could not be calculated, as the dataset provides image-level classification labels rather than ground-truth bounding box annotations. However, the visual alignment between the activation heatmaps and the physical location of the pods in the modern architectures strongly correlates with the high classification accuracy observed.
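The following self-contained sketch illustrates the Grad-CAM computation using forward and backward hooks; the choice of target layer, the example backbone, and the placeholder input are assumptions for illustration, not the exact visualization code used for Figure 5.

```python
# Minimal hook-based Grad-CAM sketch (illustrative; target layer choice is assumed).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.convnext_base(weights="IMAGENET1K_V1").eval()   # example backbone
target_layer = model.features[-1]                              # last feature stage (assumed)

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def grad_cam(image_tensor, class_idx=None):
    """Return a (H, W) heatmap in [0, 1] highlighting evidence for the chosen class."""
    logits = model(image_tensor.unsqueeze(0))
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image_tensor.shape[1:], mode="bilinear",
                        align_corners=False).squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

heatmap = grad_cam(torch.randn(3, 384, 384))   # placeholder input tensor (C, H, W)
```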
Figure 5. Comparative Grad-CAM visualization of model interpretability. (a) The modern ConvNeXt model correctly focuses attention solely on the target cocoa pod (red activation), ignoring the background. (b) In contrast, the classic ResNet-101 displays erroneous localization, focusing intently on the tree trunk and branches rather than the pod. This misinterpretation of environmental features (confusing brown bark for ripe pods) illustrates the architectural inability to contextualize objects within dense foliage, leading to the misclassification.

4.4. Ablation Study Discussion

The advanced framework employed in this study integrates several components, including TrivialAugmentWide, Test Time Augmentation, and weighted loss. While a full ablation study to isolate the independent contribution of each component would provide further theoretical insight, such an analysis is computationally intensive and beyond the scope of this comparative study. The primary focus here remains on the relative robustness of differing architectures under a fixed, rigorous standard. Future work will dissect these components to further optimize the training pipeline.

5. Conclusions

This research examined the feasibility of using deep learning to automate cocoa pod ripeness classification from drone imagery collected in Ghana, a task of significant importance to local agricultural productivity. Through a rigorous and statistically validated evaluation, our findings indicate that while automated classification is feasible within this setting, model performance is highly sensitive to architectural choice.

5.1. Principal Findings

Using a unified experimental framework designed to reflect challenging visual conditions encountered in drone-based agricultural monitoring, we observed a consistent performance hierarchy across repeated runs on the studied dataset. Modern architectures such as the Swin Transformer and ConvNeXt-Base achieved stable accuracies of approximately 79%, demonstrating resilience to aggressive data augmentation and high intra-class variability in this context. In contrast, classic residual architectures such as ResNet-101 exhibited markedly lower and less stable performance under the same conditions, achieving a mean accuracy of 55.86%. A paired t-test confirmed that these differences are statistically significant ( p < 0.05 ). These results suggest that, for fine-grained classification tasks involving complex aerial imagery similar to the dataset studied here, modern Vision Transformers and hybrid CNN architectures may offer advantages over earlier CNN designs. We note, however, that further validation across diverse geographic regions, crop types, and imaging conditions is necessary to fully assess the generalizability of these findings.

5.2. Limitations and Future Horizons

While this study provides a robust architectural comparison, we acknowledge several limitations that define the path for future research:
  • Ablation Analysis: Our advanced pipeline integrated multiple techniques (weighted loss, TTA, aggressive augmentation). A granular ablation study to isolate the individual contribution of each component was not conducted due to computational constraints but remains a necessary step to optimize the training recipe further.
  • Domain Shift and Geographic Generalization: A significant limitation of this study is the geographic specificity of the dataset. All evaluation imagery was acquired from a single region in Ghana. Consequently, the models have not been validated against the visual variations found in other major cocoa-producing regions, such as Côte d’Ivoire, Indonesia, or Ecuador. Factors such as differing soil coloration (background noise), varying solar angles, and regional differences in cocoa pod varieties could introduce domain shift that degrades model performance. Future work must prioritize cross-regional validation to ensure the algorithm’s robustness for global deployment.
  • Task Granularity: This study focused on image-level classification. The immediate next step is to adapt the top-performing ConvNeXt backbone into an object detection framework (such as Faster R-CNN or YOLO) to provide farmers with precise pod counts and localization.

5.3. Implications for Precision Agriculture

The tools and techniques explored here are building blocks for systems that can empower farmers. By validating that modern deep learning models can maintain high precision even in complex environments, we have cleared a technical hurdle for the deployment of autonomous monitoring systems. However, this improved accuracy comes with a computational cost. Our results indicate that lightweight or older models may be insufficient for this specific aerial task. Consequently, the deployment of these robust algorithms will likely require edge-computing solutions capable of running heavier architectures like ConvNeXt, rather than relying solely on the most lightweight mobile processors. Transitioning these robust algorithms from research to the field offers the potential to optimize harvest timing, reduce waste, and secure livelihoods for smallholder farmers across Ghana and beyond.

Author Contributions

Conceptualization, T.M. and A.A.; methodology, T.M.; software, T.M.; validation, T.M. and A.A.; formal analysis, T.M.; investigation, T.M.; resources, A.A.; writing—original draft preparation, T.M.; writing—review and editing, T.M. and A.A.; visualization, T.M.; supervision, A.A.; project administration, T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We extend our sincere gratitude to Manhattan University and, specifically, the Kakos Center for Scientific Computing, for providing the essential computational resources necessary for this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

This appendix provides a detailed per-class breakdown of the precision, recall, and F1-score for all evaluated architectures. The data is organized into two phases: the Preliminary Screening, which covers the initial survey of fifteen models, and the Main Study, which details the performance of the seven selected models under the unified advanced pipeline (using data from the representative Seed 42).
Table A1. Comprehensive per-class performance metrics. The ‘Preliminary Screening’ section details the initial survey of 15 models that informed our selection. The ‘Main Study’ section details the final performance of the 7 selected models under the rigorous unified pipeline (Seed 42).
Phase | Model | Mature-Unripe (P/R/F1) | Tree (P/R/F1) | Immature (P/R/F1) | Riped (P/R/F1) | Spoilt (P/R/F1)
Preliminary Screening | ResNet101 | 0.83/0.82/0.82 | 0.75/0.69/0.72 | 0.78/0.77/0.77 | 0.77/0.83/0.80 | 0.84/0.92/0.88
Preliminary Screening | ResNet50 | 0.81/0.82/0.82 | 0.78/0.66/0.71 | 0.74/0.75/0.74 | 0.75/0.88/0.81 | 0.87/0.89/0.88
Preliminary Screening | DenseNet169 | 0.85/0.78/0.81 | 0.69/0.72/0.71 | 0.77/0.70/0.74 | 0.71/0.85/0.78 | 0.87/0.89/0.88
Preliminary Screening | ResNet152 | 0.79/0.83/0.81 | 0.76/0.62/0.68 | 0.73/0.76/0.74 | 0.76/0.81/0.78 | 0.84/0.88/0.86
Preliminary Screening | EfficientNet B6 | 0.78/0.83/0.80 | 0.73/0.64/0.68 | 0.77/0.71/0.74 | 0.74/0.78/0.76 | 0.84/0.89/0.87
Preliminary Screening | EfficientNet B3 | 0.79/0.80/0.79 | 0.71/0.66/0.69 | 0.78/0.72/0.75 | 0.73/0.78/0.76 | 0.83/0.92/0.87
Preliminary Screening | ResNet34 | 0.76/0.83/0.79 | 0.76/0.62/0.68 | 0.77/0.74/0.75 | 0.71/0.76/0.73 | 0.85/0.89/0.87
Preliminary Screening | MobileNet_v2 | 0.78/0.78/0.78 | 0.71/0.63/0.67 | 0.74/0.75/0.74 | 0.68/0.82/0.74 | 0.86/0.83/0.85
Preliminary Screening | DenseNet121 | 0.75/0.81/0.78 | 0.72/0.60/0.65 | 0.73/0.71/0.72 | 0.71/0.80/0.75 | 0.87/0.87/0.87
Preliminary Screening | ResNet18 | 0.74/0.76/0.75 | 0.72/0.62/0.67 | 0.74/0.73/0.74 | 0.72/0.83/0.77 | 0.84/0.86/0.85
Preliminary Screening | EfficientNet B5 | 0.78/0.81/0.79 | 0.75/0.56/0.64 | 0.67/0.70/0.69 | 0.68/0.74/0.71 | 0.78/0.88/0.83
Preliminary Screening | SqueezeNet1_0 | 0.76/0.80/0.78 | 0.65/0.68/0.67 | 0.74/0.73/0.74 | 0.72/0.53/0.61 | 0.76/0.85/0.80
Preliminary Screening | AlexNet | 0.47/0.71/0.57 | 0.52/0.32/0.40 | 0.53/0.56/0.54 | 0.44/0.20/0.28 | 0.67/0.68/0.67
Preliminary Screening | VGG19 | 0.44/0.72/0.55 | 0.25/0.11/0.15 | 0.00/0.00/0.00 | 0.17/0.28/0.21 | 0.42/0.49/0.45
Preliminary Screening | VGG16 | 0.39/0.84/0.53 | 0.00/0.00/0.00 | 0.09/0.01/0.02 | 0.24/0.05/0.08 | 0.27/0.54/0.36
Main Study (Adv.) | ConvNeXt-Base | 0.84/0.79/0.81 | 0.73/0.67/0.70 | 0.73/0.81/0.77 | 0.77/0.85/0.81 | 0.88/0.90/0.89
Main Study (Adv.) | Swin-Base | 0.83/0.83/0.83 | 0.74/0.70/0.72 | 0.78/0.73/0.75 | 0.77/0.83/0.80 | 0.87/0.95/0.91
Main Study (Adv.) | DenseNet169 | 0.84/0.73/0.78 | 0.70/0.71/0.70 | 0.76/0.74/0.75 | 0.69/0.82/0.75 | 0.82/0.89/0.85
Main Study (Adv.) | EfficientNet-B4 | 0.81/0.78/0.79 | 0.71/0.66/0.68 | 0.73/0.76/0.74 | 0.73/0.72/0.72 | 0.78/0.92/0.84
Main Study (Adv.) | MobileNetV3-L | 0.79/0.83/0.81 | 0.76/0.49/0.60 | 0.68/0.78/0.73 | 0.67/0.84/0.75 | 0.85/0.85/0.85
Main Study (Adv.) | ResNet50 | 0.78/0.64/0.70 | 0.60/0.46/0.52 | 0.59/0.75/0.66 | 0.56/0.60/0.58 | 0.67/0.89/0.77
Main Study (Adv.) | ResNet101 | 0.77/0.45/0.57 | 0.53/0.36/0.43 | 0.55/0.63/0.59 | 0.40/0.62/0.49 | 0.49/0.83/0.62

References

  1. Food and Agriculture Organization (FAO). Cocoa’s Contribution to Ghana’s Economy; Food and Agriculture Organization (FAO): Rome, Italy, 2018.
  2. Amoa-Awua, W.K.; Madsen, M.; Olaiya, A.; Ban-Kofi, L.; Jakobsen, M. Quality Manual for Production and Primary Processing of Cocoa; Cocoa Research Institute of Ghana (CRIG): Accra, Ghana, 2007.
  3. Abbas, A.; Zhang, Z.; Zheng, H.; Alami, M.M.; Alrefaei, A.F.; Abbas, Q.; Naqvi, S.A.H. Drones in Plant Disease Assessment, Efficient Monitoring, and Detection: A Way Forward to Smart Agriculture. Agronomy 2023, 13, 1524.
  4. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
  5. Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
  6. Saranya, N.; Srinivasan, K.; Kumar, S.P. Banana Ripeness Stage Identification: A Deep Learning Approach. J. Ambient Intell. Humanized Comput. 2022, 13, 4033–4039.
  7. Chen, L.; Li, S.; Bai, Q.; Yang, J.; Jiang, S.; Miao, Y. Review of Image Classification Algorithms Based on Convolutional Neural Networks. Remote Sens. 2021, 13, 4712.
  8. Rizzo, M.; Marcuzzo, M.; Zangari, A.; Gasparetto, A.; Albarelli, A. Fruit Ripeness Classification: A Survey. Artif. Intell. Agric. 2023, 6, 144–163.
  9. Moya, V.; Quito, A.; Pilco, A.; Vásconez, J.P.; Vargas, C. Crop Detection and Maturity Classification Using a YOLOv5-Based Image Analysis. Emerg. Sci. J. 2024, 8, 496–512.
  10. Ayikpa, K.J.; Mamadou, D.; Gouton, P.; Adou, K.J. Classification of Cocoa Pod Maturity Using Similarity Tools. Data 2023, 8, 99.
  11. Ayikpa, K.J.; Ballo, A.B.; Mamadou, D.; Gouton, P. Optimization of Cocoa Pods Maturity Classification Using Stacking. J. Imaging 2024, 10, 327.
  12. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep Learning for Real-Time Fruit Detection and Orchard Fruit Load Estimation: MangoYOLO. Precis. Agric. 2019, 20, 1107–1135.
  13. Ekawaty, Y.; Indrabayu; Areni, I.S. Automatic Cacao Pod Detection Under Outdoor Condition Using Computer Vision. In Proceedings of the 2019 4th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE), Yogyakarta, Indonesia, 20–21 November 2019; pp. 31–34.
  14. Suyuti, J.; Indrabayu; Zainuddin, Z.; Basri, B. Detection and Counting of the Number of Cocoa Fruits on Trees Using UAV. In Proceedings of the 2023 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), Bali, Indonesia, 13–15 July 2023; pp. 257–262.
  15. Ferraris, S.; Meo, R.; Pinardi, S.; Salis, M.; Sartor, G. Machine Learning as a Strategic Tool for Helping Cocoa Farmers in Côte D’Ivoire. Sensors 2023, 23, 7632.
  16. Rajeena P. P., F.; S. U., A.; Moustafa, M.A.; Ali, M.A.S. Detecting Plant Disease in Corn Leaf Using EfficientNet Architecture—An Analytical Approach. Electronics 2023, 12, 1938.
  17. Halstead, M.; McCool, C.; Denman, S.; Perez, T.; Fookes, C. Fruit Quantity and Ripeness Estimation Using a Robotic Vision System. IEEE Robot. Autom. Lett. 2018, 3, 2995–3002.
  18. KaraAgro AI Foundation. Drone-Based Agricultural Dataset for Crop Yield Estimation; Hugging Face: New York, NY, USA, 2023.
  19. Müller, S.G.; Hutter, F. TrivialAugment: Tuning-Free Yet State-of-the-Art Data Augmentation. arXiv 2021, arXiv:2103.10158.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
