1. Introduction
Avocado (Persea americana) is one of the most economically important fruit crops in Latin America, particularly in countries such as Mexico, Peru, and Chile [1]. Its economic relevance stems from its high commercial value and its role as a major agricultural export product.
Avocado production is affected by a wide range of diseases [2] caused by fungi, bacteria, viruses, and insects. Among the most important pathogens are Phytophthora cinnamomi, Colletotrichum gloeosporioides, Sphaceloma perseae, and Oidium spp. This study focuses on anthracnose, one of the most prevalent and destructive fungal diseases affecting avocado fruits. Anthracnose is primarily associated with fungi belonging to the genera Colletotrichum and Gloeosporium and can significantly reduce fruit quality and marketability.
Early detection of anthracnose [3] is particularly important because symptoms often remain latent during cultivation and become visible only during postharvest storage and transportation [4]. Consequently, infected fruits may be distributed together with healthy ones, causing economic losses throughout the supply chain. Automated detection systems can assist producers and distributors by identifying infected fruits before commercialization and preventing the spread of damage within storage and transport batches [5].
To address this challenge, this study proposes a weighted ensemble model based on convolutional neural networks (CNNs) for anthracnose detection in avocado fruits. The proposed approach is evaluated against widely used benchmark architectures, including VGG-16 [6], ResNet-18 [7], and MobileNetV2 [8]. For this purpose, a dedicated dataset containing 2218 images of healthy and anthracnose-infected Fuerte avocados was developed.
The methodological contributions of this work are based on the following aspects:
- Development of a dedicated dataset comprising 2218 images of healthy and anthracnose-infected Fuerte avocado fruits collected under real conditions.
- Design of a computationally efficient disease-detection framework based on shallow convolutional neural networks, providing an alternative to deeper and more resource-intensive architectures.
- Implementation of a class-imbalance handling strategy to improve the learning process and classification performance.
- Development and experimental optimization of a weighted ensemble model that combines multiple shallow CNN architectures for anthracnose detection.
- Comprehensive comparative evaluation against state-of-the-art CNN architectures, including VGG-16, ResNet-18, and MobileNetV2, supported by statistical significance analysis.
The literature review identified eight studies addressing disease detection in avocado leaves and fruits. Among them, only one study focused specifically on anthracnose detection and employed a dataset of 674 images collected from publicly available online repositories. In contrast, the present study introduces a newly developed dataset comprising 2218 images of Fuerte avocados, providing a substantially larger and more specialized resource for anthracnose detection research.
2. Related Works
For the literature review, studies indexed in Scopus between 2015 and 2025 were considered. The inclusion criteria comprised publications focused on the detection, classification, or diagnosis of diseases affecting fruit crops or fruit trees using computer vision, machine learning, deep learning, RGB imaging, multispectral imaging, or related technologies. Studies exclusively devoted to pathogen identification, agronomic management practices, epidemiological analyses, or approaches lacking automated detection methods were excluded.
A total of 33 relevant studies were identified. These studies employed a wide range of techniques, ranging from traditional machine learning algorithms to advanced deep learning architectures, for disease detection in fruits and other plant organs across different crop species. However, only eight studies specifically addressed avocado-related diseases. Among these, five focused on laurel wilt disease affecting different parts of the avocado plant, whereas only three investigated diseases occurring directly on avocado fruits. This finding reveals the limited attention that fruit disease detection in avocado has received within the existing literature.
Among the studies focused on avocado fruits, [9] investigated the detection of internal rot using X-ray imagery and reported an accuracy of 98% with the U-Net++ architecture. Study [10] addressed stem-end rot detection in Hass avocados using a dataset of 560 images and machine learning models such as Random Forest and XGBoost, achieving accuracies ranging from 56% to 65%. More recently, [11] employed a dataset of 674 images collected from publicly available repositories, including Kaggle, and evaluated several deep learning architectures, such as ResNet50, InceptionV3, EfficientNetB2, VGG16, and DenseNet121. Their best-performing model achieved an accuracy of 99.03%, a precision of 98.92%, and a recall of 98.89%.
Table 1 summarizes the related works on the topic of this manuscript.
Among the 33 selected studies, accuracy was the most frequently reported evaluation metric. However, only a limited number of studies provided complementary performance indicators such as precision, recall, F1-score, sensitivity, specificity, or statistical significance analyses. This observation suggests that model reliability and robustness are often insufficiently assessed in the existing literature, particularly in scenarios involving class imbalance, where accuracy alone may provide an incomplete representation of model performance.
Furthermore, many studies primarily focused on demonstrating proof-of-concept results under controlled experimental conditions. Consequently, the generalization capability and practical applicability of the proposed methods in real-world agricultural environments remain largely unexplored. These limitations highlight the need for more comprehensive evaluation frameworks that consider multiple performance metrics, dataset characteristics, and operational constraints.
Based on the literature review, several research gaps were identified regarding avocado fruit disease detection. In particular, the limited availability of dedicated datasets, the scarcity of studies focused on anthracnose detection, and the insufficient assessment of model reliability motivated the development of the proposed approach.
Table 2 summarizes the main differences between the related studies and the present work, highlighting the identified research gaps and the contributions of this study.
3. Methodology
Figure 1 summarizes the process followed to implement the proposed model and the other models under study. First, images of Fuerte avocados are acquired and labeled, and the training, validation, and test partitions are created. Subsequently, the architectures and models are implemented and evaluated using F1-score, recall, precision, and accuracy.
3.1. Data Collection
The constructed dataset consists of 2218 images. Images of fruits collected from different avocado trees located in the Moquegua Valley, Mariscal Nieto Province, Moquegua Region, Peru, were used. These images were collected between April and July 2025. The dataset includes images captured with a 50 MP camera with an f/18 aperture, at a resolution of 3072 × 4080 pixels, against a gray background in JPG format, in an environment with natural lighting conditions. For the implementation of the proposed model and the baseline models, the images were resized to 224 × 224 pixels in the same format.
3.2. Labeling
The labeling was carried out by a specialist in avocado diseases. Following the labeling process, of the 2218 images, 1730 correspond to the healthy class or 0, meaning the fruit shows no anthracnose disease, and 488 correspond to the unhealthy class or 1, meaning the fruit presents anthracnose.
Some examples of the labeling results are shown in Figure 2.
3.3. Data Subsets
For the experiments, the data were divided into training, validation, and test sets. Of these, 1572 images correspond to the training set, 216 to the validation set, and 430 to the test set. Of the 1572 training images, 1242 belong to the healthy class and 330 to the unhealthy class. Of the 216 validation images, 170 belong to the healthy class and 46 to the unhealthy class. Finally, of the 430 test images, 318 belong to the healthy class and 112 to the unhealthy class.
Table 3 shows a summary of elements per class in each subset.
3.4. Handling Class Imbalance
Various strategies can address class imbalance, such as over-sampling the minority class, under-sampling the majority class, and class weighting. This study employs class weighting, which modifies the loss function during training so that, instead of all errors having the same cost, each error is penalized according to the class of the misclassified sample. Consequently, the majority class receives a low weight, while the minority class receives a high weight.
The weight of each class, $w_c$, is estimated using Equation (1):

$$w_c = \frac{N}{K \cdot n_c} \qquad (1)$$

where $N$ is the number of samples of the training subset; $K$ is the number of classes; and $n_c$ is the number of samples of class $c$.
According to Table 3 and Equation (1), the class weights for classes 0 and 1 within the training subset are obtained, as presented in Table 4.
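As a worked illustration, the class weights can be recomputed from the training counts reported in Section 3.3. This is a minimal sketch that assumes Equation (1) is the common "balanced" heuristic N / (K · n_c), which matches the three quantities defined above; the function name `balanced_class_weights` is hypothetical and is not taken from the authors' code.

```python
def balanced_class_weights(class_counts):
    """Return {class: N / (K * n_c)} for a dict of per-class sample counts.

    Assumption: Equation (1) is the usual balanced class-weight heuristic.
    """
    total = sum(class_counts.values())   # N: samples in the training subset
    num_classes = len(class_counts)      # K: number of classes
    return {c: total / (num_classes * n) for c, n in class_counts.items()}


# Training-subset counts reported in Section 3.3 (0 = healthy, 1 = unhealthy).
train_counts = {0: 1242, 1: 330}
class_weights = balanced_class_weights(train_counts)

for label, weight in class_weights.items():
    print(f"class {label}: weight = {weight:.4f}")
```

Under this assumption the healthy class receives a weight of about 0.63 and the unhealthy class about 2.38. A dictionary of this form can be passed to the `class_weight` argument of Keras `Model.fit`.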
3.5. Data Augmentation
Data augmentation enables the generation of artificial variations in training images in order to increase data diversity and improve the model’s generalization capability. In this case, the training set contains 1572 original images, which were dynamically transformed during training through different geometric and spatial operations. The Python 3.11.7 code block shown in Figure 3 enables data augmentation on the training data. Its parameters are the following:
- rescale=1./255: normalizes pixel values to the range [0,1] by dividing each value by 255.
- rotation_range=20: randomly rotates each image by up to ±20 degrees.
- width_shift_range=0.2: horizontally shifts the image by up to 20% of its width.
- height_shift_range=0.2: vertically shifts the image by up to 20% of its height.
- shear_range=0.2: applies shear transformations to the image.
- zoom_range=0.2: performs random zoom-in or zoom-out transformations of up to 20%.
- horizontal_flip=True: horizontally flips some images.
- fill_mode='nearest': fills empty pixels generated by the transformations using the value of the nearest pixel.
This generator does not physically create new images stored on disk. Instead, during each epoch, it produces new transformed versions of the 1572 original images in real time (on-the-fly). As a result, the model observes different variations in the same image across different training iterations, which reduces overfitting and improves robustness to changes in orientation, position, and scale.
In this study, the training process is carried out over 50 epochs; therefore, the model will not only see 1572 fixed images, but also thousands of dynamically generated variants derived from them. This effectively increases the diversity of the training set without the need to collect new real images.
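The parameters described in this section correspond to the arguments of the Keras `ImageDataGenerator` class, so the configuration can be sketched as follows. Only the eight augmentation settings come from the text; the directory layout (`train_dir` with one subfolder per class), the batch size of 32, and `class_mode="categorical"` are assumptions made for illustration. The TensorFlow import is deferred so that the settings can be inspected without TensorFlow installed.

```python
# Augmentation settings exactly as listed in the text.
AUGMENTATION_KWARGS = {
    "rescale": 1.0 / 255,
    "rotation_range": 20,
    "width_shift_range": 0.2,
    "height_shift_range": 0.2,
    "shear_range": 0.2,
    "zoom_range": 0.2,
    "horizontal_flip": True,
    "fill_mode": "nearest",
}

# Upper bound on augmented views seen during training: one per image per epoch.
EPOCHS, TRAIN_IMAGES = 50, 1572
max_augmented_views = EPOCHS * TRAIN_IMAGES


def build_train_iterator(train_dir, batch_size=32):
    """Return an on-the-fly augmenting iterator over `train_dir`.

    Assumptions: `train_dir` has one subfolder per class, and batch_size and
    class_mode are illustrative values not stated in the text.
    """
    # Imported lazily so this module loads even without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    generator = ImageDataGenerator(**AUGMENTATION_KWARGS)
    return generator.flow_from_directory(
        train_dir,
        target_size=(224, 224),   # input size used by all models in the study
        batch_size=batch_size,
        class_mode="categorical",
        shuffle=True,
    )


print(f"At most {max_augmented_views} augmented views over {EPOCHS} epochs")
```

The figure of 78,600 views is an upper bound, since early stopping can end training before the 50th epoch.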
3.6. Modeling
The basic CNN models were implemented in Jupyter 7.2.2 IDE using the TensorFlow 2.18.0 library in Python 3.11.7; their respective architectures are shown in Figure 4.
The architectures of the basic CNN models consist of 3, 4, 5, or 6 convolutional blocks, depending on the model. Each model takes as input images of 224 × 224 pixels with 3 channels (RGB).
Each convolutional block contains a Conv2D layer that applies a number of filters to detect features such as edges and basic textures. It also includes a BatchNormalization layer that normalizes the activations to stabilize and accelerate training, and a MaxPooling2D layer that halves the spatial dimensions of the feature map (for example, from 224 × 224 to 112 × 112 in the first block).
The GlobalAveragePooling2D block, instead of flattening the tensor, computes the spatial average of each feature map, producing a vector of 256 values. This drastically reduces the number of parameters and helps prevent overfitting.
The final block corresponds to the classifier, which contains a fully connected Dense layer that learns combinations of the extracted features. The Dropout layer randomly deactivates a percentage of neurons during training to regularize the model and reduce overfitting. The output Dense layer produces a probability per class; in this study, the number of classes is 2.
All models were trained for up to 50 epochs with early stopping, using Adam as the optimizer and a learning rate of 0.0001.
In summary, Table 5 presents the hyperparameters of the implemented CNN models.
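The block structure described in this section can be sketched in Keras as shown below. This is an illustrative sketch, not the authors' implementation: the text fixes the input size, the Conv2D/BatchNormalization/MaxPooling2D block layout, the 256-value GlobalAveragePooling2D output, the two-class output, Adam, and the learning rate of 0.0001, whereas the per-block filter counts, the 3 × 3 kernels, the 128-unit Dense layer, the dropout rate of 0.5, and the loss function are assumptions. The helper names are hypothetical.

```python
def feature_map_sizes(input_size=224, num_blocks=3):
    """Spatial size after each 2x2 max-pooling step (floor division)."""
    sizes, size = [], input_size
    for _ in range(num_blocks):
        size //= 2
        sizes.append(size)
    return sizes


def filter_schedule(num_blocks, last_filters=256):
    """Assumed schedule: filters double per block and end at 256."""
    return [last_filters // 2 ** (num_blocks - 1 - i) for i in range(num_blocks)]


def build_cnn(num_blocks, dense_units=128, dropout_rate=0.5, learning_rate=1e-4):
    """Build a shallow CNN with `num_blocks` convolutional blocks."""
    # Imported lazily so the helpers above work without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([keras.Input(shape=(224, 224, 3))])
    for filters in filter_schedule(num_blocks):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(2))        # halves height and width
    model.add(layers.GlobalAveragePooling2D())   # one value per feature map
    model.add(layers.Dense(dense_units, activation="relu"))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(2, activation="softmax"))  # probability per class
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


for blocks in (3, 4, 5, 6):
    print(blocks, filter_schedule(blocks), feature_map_sizes(224, blocks))
```

With six blocks the feature map shrinks from 224 × 224 to 3 × 3 before global average pooling, which is one reason the pooled vector stays at 256 values regardless of depth.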
3.7. Evaluation
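The results of the implemented models were evaluated using Accuracy, Precision, Recall, and F1-Score, which are estimated through Equations (2)–(5):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

$$\mathrm{F1\text{-}Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5)$$

where TP are the True Positives, FP the False Positives, TN the True Negatives, and FN the False Negatives.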
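The four metrics can be computed directly from confusion-matrix counts, as in the minimal sketch below. The TP of 105 and FN of 15 are the values reported for the proposed ensemble in Section 4.2.1; the FP of 7 and TN of 303 are not stated in the text and are back-computed here from the reported precision and accuracy, so they should be read as assumptions. The function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# TP and FN as reported in the text; FP and TN are back-computed assumptions.
metrics = classification_metrics(tp=105, fp=7, tn=303, fn=15)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the sketch reproduces the metrics reported for the weighted ensemble (accuracy 0.9488, precision 0.9375, recall 0.8750, F1-score 0.9052).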
4. Results and Discussions
4.1. Results
The confusion matrices of the four models based on convolutional neural networks are shown in Figure 5. Likewise, the respective ROC curves of these models are shown in Figure 6.
According to Figure 6, the standalone models present an AUC above 95%, with CNN6 being the best model at an AUC of 98.80%, followed closely by CNN5 and CNN4. CNN3 is the model with the lowest AUC (95.60%).
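For readers unfamiliar with the AUC, it can be interpreted as the probability that a randomly chosen unhealthy image receives a higher predicted score than a randomly chosen healthy one. The sketch below computes it with that pairwise definition; the labels and scores are hypothetical values chosen only for illustration, not outputs of the models in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = unhealthy) and predicted probabilities of class 1.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of a few hundred images; rank-based implementations are preferable for larger sets.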
With the aim of improving the results, ensemble models with varying weights were implemented. The results of these models are shown in Table 6. The best ensemble model is the one that combines the four aforementioned models, presenting the confusion matrix shown in Figure 7.
According to Table 6, the best standalone model in most metrics, such as Accuracy, Precision, and F1-Score, is the CNN with six convolutional blocks (CNN6). In terms of Recall, however, the best model is the CNN with three convolutional blocks (CNN3).
On the other hand, it can be observed that of the two weighted ensemble models based on three models, the CNN4 + CNN5 + CNN6 model outperforms the best standalone CNN6 model in terms of Accuracy and Recall. However, in terms of Precision and F1-Score, it yields lower results.
The weighted ensemble model based on the four standalone models achieves the best results, outperforming the best standalone model in Accuracy, Recall, and F1-Score.
To define the ensemble weights, the standalone models were first ranked according to their individual F1-Score values, since this metric provides a balanced evaluation between precision and recall and is more appropriate for imbalanced classification scenarios. Based on this ranking, CNN6 obtained the highest performance, followed by CNN4, CNN5, and CNN3.
The weight assignment strategy was designed under the assumption that models with higher predictive reliability should contribute more strongly to the final ensemble decision. Therefore, larger weights were assigned to the best-performing models, while lower-performing models received smaller contributions.
Additionally, several alternative weight distributions were experimentally evaluated for both the three-model and four-model ensemble configurations in order to analyze the sensitivity of the ensemble to different contribution levels. The evaluated combinations are presented in Table 7. Among them, the second weight configuration achieved the best validation performance and was therefore selected for the final ensemble model.
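The ensemble prediction was computed as a weighted average of the individual model probabilities according to Equation (6):

$$P_{\mathrm{ens}}(x) = \frac{\sum_{i=1}^{M} w_i \, P_i(x)}{\sum_{i=1}^{M} w_i} \qquad (6)$$

where $P_i(x)$ is the probability predicted by the $i$-th standalone model for input $x$, $w_i$ is the weight assigned to that model, and $M$ is the number of models in the ensemble. Higher weights were assigned to models with higher F1-Score values in order to prioritize models with a better balance between false positives and false negatives.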
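The weighted averaging step can be illustrated with a short sketch. The weights and probabilities below are hypothetical: they follow the F1-Score ranking stated above (CNN6, CNN4, CNN5, CNN3) but are not the values of Table 7, and the 0.5 decision threshold is also an assumption.

```python
def weighted_ensemble(probabilities, weights):
    """Weighted average of per-model probabilities, normalized by the weight sum."""
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, probabilities)) / total_weight


# Hypothetical predicted probabilities of class 1 (unhealthy) for one image.
model_probs = {"CNN3": 0.40, "CNN4": 0.55, "CNN5": 0.62, "CNN6": 0.70}
# Hypothetical weights that respect the ranking CNN6 > CNN4 > CNN5 > CNN3.
model_weights = {"CNN3": 0.10, "CNN4": 0.30, "CNN5": 0.20, "CNN6": 0.40}

names = list(model_probs)
ensemble_prob = weighted_ensemble(
    [model_probs[n] for n in names],
    [model_weights[n] for n in names],
)
predicted_class = 1 if ensemble_prob >= 0.5 else 0  # assumed threshold

print(f"ensemble probability = {ensemble_prob:.3f}, class = {predicted_class}")
```

Dividing by the sum of the weights keeps the result a valid probability even when the weights do not add up to one.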
4.2. Discussions
In this section, the results achieved by the weighted ensemble model are discussed, and these are compared with the results achieved by other state-of-the-art models.
4.2.1. Benchmark Models
To analyze the results of the proposed model, three well-known state-of-the-art models were implemented: VGG-16, ResNet-18, and MobileNetV2, whose architectures are shown in Table 8.
VGG-16 is a well-known convolutional neural network (CNN) architecture in computer vision that contains 13 convolutional layers and 3 fully connected layers, for a total of 16 weight layers.
ResNet-18 is another CNN architecture that belongs to the ResNet (Residual Networks) family. It has 18 weight layers: 17 convolutional layers and a final fully connected layer. ResNet uses skip connections, or residual connections, that allow information to bypass layers.
MobileNet is a family of convolutional neural networks (CNNs) designed by Google to be used on resource-constrained devices such as mobile phones, tablets, or embedded systems.
Table 8 shows the hyperparameters of the benchmark models.
Table 9 shows the results achieved by the benchmark models.
According to Table 9, in general, the three benchmark models yield lower results than the weighted ensemble model. The benchmark model that stands out most among the three is VGG-16, with a Recall (0.9535) that surpasses all other models, including the one proposed in this study. In terms of Accuracy (0.9488), Precision (0.9375), and F1-Score (0.9052), the weighted ensemble model proposed in this study achieves superior results compared to all benchmark models.
It is important to highlight that the simpler models outperform the larger models in most metrics. This does not mean that they are better in general, but rather that, for the specific problem of anthracnose detection in avocado fruits, the shallow CNNs and the weighted ensemble model are more suitable.
It is also worth noting that the avocado domain is not represented in the ImageNet dataset, on which the large models were pre-trained.
The higher Recall of VGG-16 also deserves attention. According to Equation (4), Recall depends on TP and FN; for the VGG-16 model, a TP of 82 and an FN of 4 were obtained. Although this TP is lower than the TP of 105 achieved by the proposed model, the FN of 4 is well below the proposed model’s FN of 15, which strongly influences the Recall estimate. The main strength of VGG-16 compared to all implemented models is its TN count of 314, the highest among them. This directly influences the FN count, which is the lowest among them.
4.2.2. Statistical Analysis
To determine whether a statistically significant difference exists between the proposed model and the benchmark models, the McNemar test was conducted, the results of which are shown in Table 10.
According to Table 10, the McNemar test reveals that the differences between the proposed model and both ResNet-18 and MobileNetV2 are statistically significant, since their p-values are lower than 0.05. This indicates that the prediction errors produced by these benchmark models differ significantly from those of the proposed approach, suggesting that the proposed model achieves a genuinely different and improved classification behavior rather than a performance variation caused by random fluctuations.
In contrast, the p-value obtained for VGG-16 is greater than 0.05, indicating that the differences between the proposed model and VGG-16 are not statistically significant. Although the proposed approach achieved slightly better performance metrics, the McNemar test suggests that both models exhibit comparable classification behavior on the evaluated dataset.
Overall, these results support the robustness of the proposed model and confirm that its performance improvement over ResNet-18 and MobileNetV2 is statistically reliable.
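The McNemar test compares two classifiers on the same test images using only the discordant cases, that is, the images that one model classifies correctly and the other does not. The sketch below uses the chi-square form with continuity correction; the text does not state which variant was used, so this choice is an assumption, and the discordant counts are hypothetical rather than taken from Table 10.

```python
import math


def mcnemar_test(b, c):
    """McNemar chi-square test with continuity correction.

    b: images classified correctly only by model A.
    c: images classified correctly only by model B.
    Returns (statistic, p_value) for a chi-square distribution with 1 dof.
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant cases: no evidence of a difference
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical discordant counts between two models.
statistic, p_value = mcnemar_test(b=25, c=10)
significant = p_value < 0.05
print(f"statistic = {statistic:.2f}, p-value = {p_value:.4f}")
```

When the number of discordant cases is small, the exact binomial version of the test is usually preferred over the chi-square approximation.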
4.2.3. Computational Cost and Latency
The training time of the implemented models, as well as the inference time in seconds, is shown in Table 11. All measurements were taken on an HP computer (Boeblingen, Germany) with an Intel Core i7-13700H processor at 2.40 GHz and 16 GB of RAM, running Windows 11.
In this study, the computational cost refers to the time required to train a model and is estimated through Equation (7):

$$T = \sum_{e=1}^{E} \left( \sum_{s=1}^{S} t_s + t_{\mathrm{val}} \right) \qquad (7)$$

where
$T$: Total training time in seconds.
$E$: Total number of epochs executed during the training phase.
$S$: Number of steps (batches) per epoch.
$t_s$: Inference and backpropagation time for a single step $s$.
$t_{\mathrm{val}}$: Time taken for evaluation on the validation set at the end of each epoch.
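The training-time estimate amounts to summing the step times of every executed epoch and adding one validation pass per epoch. A minimal sketch follows, using hypothetical timings in seconds; the function name and the numbers are illustrative assumptions.

```python
def total_training_time(step_times_per_epoch, validation_times):
    """Sum of all step times plus one validation time per executed epoch.

    step_times_per_epoch: list of lists, one inner list of step times per epoch.
    validation_times: list with the validation time measured after each epoch.
    """
    return sum(
        sum(step_times) + val_time
        for step_times, val_time in zip(step_times_per_epoch, validation_times)
    )


# Hypothetical run: 2 executed epochs (E = 2) with 3 steps each (S = 3).
step_times = [[0.5, 0.5, 0.5], [0.4, 0.4, 0.4]]
validation_times = [0.2, 0.2]
training_time = total_training_time(step_times, validation_times)
print(f"T = {training_time:.1f} s")
```

Because early stopping is used, the number of executed epochs can be smaller than the configured maximum, which shortens the outer sum accordingly.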
According to Table 11, the benchmark models based on transfer learning exhibit lower training times compared to the CNN models trained from scratch. This behavior can be explained by the use of pre-trained ImageNet weights in VGG-16, ResNet-18, and MobileNetV2, where only the final classification layers are fine-tuned during training. As a result, the optimization process converges faster and requires fewer computational resources than models initialized with random weights. In contrast, the CNN3–CNN6 models must learn feature representations entirely from the training dataset, which increases the computational cost and training time.
Regarding inference latency, ResNet-18 and MobileNetV2 achieved similar performance, requiring 142 s for the complete test set and approximately 0.43 s per image. In comparison, VGG-16 presented the highest latency (194 s and 0.59 s per image), mainly due to its large number of parameters, particularly in the fully connected layers located at the end of the architecture. These dense layers significantly increase the computational burden during inference.
The CNN models trained from scratch (CNN3–CNN6) showed lower inference latency than the benchmark models, with per-image processing times ranging from 0.26 s to 0.30 s. This reduction is associated with their simpler architectures and lower parameter counts, which decrease computational complexity during forward propagation. Among these models, CNN3 achieved the lowest latency, whereas CNN5 required the highest computational cost among the proposed CNN architectures.
Finally, although the ensemble model achieved the best classification performance, its inference latency is inherently higher because predictions from multiple standalone models must be computed and combined. Consequently, the total latency of the ensemble corresponds approximately to the accumulated latency of its constituent models, representing a trade-off between predictive accuracy and computational efficiency.
4.2.4. Extra Tests
Beyond the experiments reported above, the proposed model was evaluated on two additional datasets: one containing data collected in 2026, and the dataset used to train the models reported in [11].
The 2026 dataset.
This dataset is still under development. At present, it consists of 233 images, including 154 images belonging to class 0 (healthy) and 79 images belonging to class 1 (unhealthy). These images were collected between March and May 2026. The same imaging equipment used for the 2025 dataset was employed for image acquisition.
Figure 8 presents the results obtained by the proposed model on the 2026 dataset, which was not used during training. The confusion matrix indicates an accuracy of 0.9657, a precision of 0.8987, a recall of 1.0000, and an F1-score of 0.9467. Although these images were collected in a different year, the metrics are comparable to those obtained on the original test set, which suggests that the proposed model is robust and generalizes across data acquisition periods.
The dataset used in [11].
This dataset contains three classes: healthy, anthracnose, and scab. For the purposes of this study, only the first two classes were considered, resulting in a total of 466 images, including 283 images from class 0 (healthy) and 183 images from class 1 (unhealthy). The image resolution is 300 × 300 pixels, which is considerably lower than that of the images used in the previous datasets.
Figure 9 presents the results obtained on this dataset.
According to Figure 9, when this dataset was used as an additional test set, the proposed model achieved an accuracy of 0.5043, a precision of 0.9071, a recall of 0.4368, and an F1-score of 0.5897. These results indicate that the model does not generalize well to this dataset. In particular, the confusion matrix reveals a high number of false negatives, resulting in a substantial reduction in recall.
This degradation suggests a domain shift between the training dataset and this external dataset, which limits the model’s ability to generalize. As summarized in Table 12, factors such as image resolution, image shape, color characteristics, acquisition conditions, and dataset composition may have contributed to the observed decline in predictive performance.
5. Conclusions and Future Work
5.1. Conclusions
Based on the experimental results, the proposed weighted ensemble model demonstrated strong performance for anthracnose detection in Fuerte avocado fruits, achieving an F1-score of 0.9052, a precision of 0.9375, and an accuracy of 0.9488. These results indicate that the proposed approach is an effective and computationally efficient alternative for automated anthracnose detection.
Furthermore, the weighted ensemble consistently achieved better overall performance than the benchmark models across most evaluation metrics. However, the statistical analysis performed using McNemar’s test showed that the observed improvements were statistically significant only when compared with ResNet-18 and MobileNetV2. Therefore, while the proposed model exhibited superior predictive performance, the differences with some competing approaches should be interpreted with caution from a statistical perspective.
Overall, the findings suggest that combining multiple shallow CNN architectures through a weighted ensemble strategy can provide a robust solution for anthracnose detection, while maintaining lower computational complexity than deeper state-of-the-art models.
5.2. Future Work
Although the proposed weighted ensemble achieved the best overall performance, the results indicate several opportunities for further improvement. In particular, the recall value obtained by the proposed model (0.8750) remains lower than that achieved by VGG-16 (0.9535). Since recall is especially important in disease detection applications, where minimizing false negatives is critical, future research could investigate hybrid ensemble or stacking approaches that combine lightweight CNN architectures with high-performing deep models to further enhance disease identification capabilities.
In addition, the experimental results showed that the CNN6 model achieved performance comparable to that of the weighted ensemble while requiring a simpler architecture. Therefore, future work may focus on optimizing individual CNN models through decision-threshold tuning, probability calibration, and advanced hyperparameter optimization techniques. Such strategies could improve the balance between true positive and true negative predictions while reducing computational requirements and inference latency.
Finally, future studies should evaluate the proposed approach on larger and more diverse datasets collected under different environmental conditions, acquisition devices, and geographical locations. This would enable a more comprehensive assessment of the model’s robustness, generalization capability, and suitability for deployment in real-world agricultural scenarios.