1. Introduction
Plant image recognition has become an important application of artificial intelligence in agriculture and biodiversity informatics. In farming, plant pests and diseases can devastate crop yields; up to 40% of global crop production is lost annually, costing over $220 billion, according to the Food and Agriculture Organization [1]. Early and accurate identification of plant ailments is therefore critical for food security. Beyond agriculture, automated plant species identification can assist botanists and laypersons in rapidly recognizing plant types, which is valuable for environmental monitoring and medicinal plant discovery [2]. Because manual plant identification is time-consuming and requires expert knowledge, computer vision techniques can automate the process with robust and accurate results. In recent years, deep learning has gained an important role in analyzing plant images for different problems, such as disease detection and species classification [2,3]. Convolutional Neural Network (CNN)-based models tend to learn rich hierarchical features from plant images and achieve superior results compared to previous systems [2].
However, CNNs have a limited ability to capture long-range dependencies in images because their receptive fields are fixed by the kernel sizes [4]. This limitation is problematic for plant images in which disease symptoms or species-specific traits appear in distant regions of the leaf or the plant. Recently, Vision Transformers (ViTs) have introduced a new approach to image recognition, replacing convolutional layers with self-attention mechanisms [4]. This approach improves the models’ ability to capture global features and to learn global patterns within plants. Despite these promising improvements, ViTs need large amounts of data to converge accurately, which can make them prone to overfitting on small datasets [3,5]. These complementary strengths and limitations have led researchers to develop hybrid models for plant image analysis.
In this study, we employed a hybrid deep learning model (HDL-PlantNet) that combines a CNN and a Transformer, specifically designed for plant leaf classification. The model includes the EfficientNetV2-Small as a convolutional backbone for initial feature extraction, a custom Transformer encoder to enhance extracted features for long-range dependencies, and a classification head. The aim of this architecture is to improve the class separability across diverse species captured in real-life environments while capturing fine-grained local features and global contextual patterns.
The primary contribution of HDL-PlantNet is its task-specific adaptation of a compact EfficientNetV2-S backbone, combined with a Transformer-based feature refinement module, for field-based plant species recognition. In contrast to large-scale hybrid architectures such as CoAtNet, MaxViT [6], and Swin-based models, HDL-PlantNet is designed to improve contextual feature modeling after convolutional feature extraction without a large increase in computational cost. This design is particularly relevant for plant identification under field conditions, where pattern distinctions are based on small or partially occluded regions.
The Cyprus Seasonal Flora Image Dataset (CSFID) [7] is used as the primary dataset in the experiments, and the Swedish Leaf Dataset [8] is used as a supplementary dataset to assess the consistency of HDL-PlantNet under controlled environments.
2. Related Studies
To leverage the strengths of both CNNs and Transformers, researchers have begun exploring hybrid architectures that combine convolutional and self-attention modules within a single model [4]. By merging CNN’s strong local feature extraction with the ViT’s global context modeling, such hybrid models can produce more discriminative representations of plant images. Recent studies in the agricultural domain indicate that CNN–Transformer hybrids can improve classification performance and robustness.
Lee et al. [9] demonstrated that a deep CNN could automatically learn discriminative leaf features and outperform hand-crafted feature methods, achieving 99.6% accuracy in classifying 44 plant species from the new MalayaKew leaf dataset. This study led to further plant recognition studies using CNNs on field imagery. Dyrmann et al. [10] built a custom CNN to distinguish 22 weed and crop species at early growth stages, reporting 86.2% accuracy. This validated CNNs’ potential for real-world agricultural classification, albeit with the recognition that more training images per class were needed to improve performance further [11].
Mohanty et al. [12] used the publicly available PlantVillage dataset with 54,306 leaf images and trained deep networks to identify 38 crop–disease classes. Their study achieved 99.35% overall accuracy on held-out test data. Carranza-Rojas et al. [13] performed one of the first large-scale plant species classification studies using deep learning. They evaluated CNNs on thousands of herbarium specimen images spanning over 1000 species, reporting that transfer learning yielded an impressive 90% top-5 accuracy. However, such high accuracy in lab conditions also raised concerns about overfitting and domain bias [14].
By 2020, Transformers, originally developed for natural language processing, had been adapted for vision tasks, signaling a new direction beyond pure CNNs. Dosovitskiy et al. [15] introduced the Vision Transformer (ViT), which treats an image as a sequence of patch embeddings and applies a Transformer encoder for classification [6]. Remarkably, a ViT pre-trained on sufficient data surpassed the accuracy of state-of-the-art CNNs on ImageNet, ending the dominance of CNNs in image recognition. However, ViTs have far fewer built-in inductive biases than CNNs, which makes them data-hungry and prone to overfitting on limited data [6,16]. Without extensive pre-training, the vanilla ViT underperformed ResNets on standard datasets [6].
As CNNs and Transformers each offered unique advantages, researchers increasingly experimented with hybrid architectures that combine convolutional and self-attention mechanisms. The general idea is to utilize CNN layers for efficient local feature extraction and Transformer blocks for modeling long-range dependencies and global context. In 2021, Google’s CoAtNet explicitly married convolutions and Transformers by stacking convolutional layers in early network stages and Transformer layers in later stages [17]. In 2022, a number of sophisticated hybrid models emerged in the computer vision community. Tu et al. [6] introduced MaxViT (Maximal Multi-Axis ViT), a hierarchical model that integrates depthwise convolutions (MBConv blocks) with a novel multi-axis self-attention mechanism in each building block.
Over the last decade, plant recognition research has shifted from traditional CNN solutions to Vision Transformers and, ultimately, to hybrid CNN–Transformer approaches that aim to harness the strengths of both. Recently, Xu et al. [18] introduced a symmetric hybrid network (PLTransformer) that uses multi-scale convolutional modules and an overlap-attentive downsampler to capture both local textures and global context in plant disease images. Their model achieves 99.95% on PlantVillage, demonstrating the power of CNN+Transformer fusion. Similarly, Lee et al. [19] propose a Plant-CNN-ViT ensemble (ResNet50, DenseNet, Xception, and ViT) for leaf classification, achieving nearly 100% on standard leaf datasets. These recent studies illustrate that integrating CNNs and Transformers can improve feature representation and performance.
Other recent work targets more challenging conditions. For instance, Zhang et al. [20] introduce the FewMedical-XJAU benchmark for fine-grained medicinal plants, featuring diverse natural backgrounds and high intra-class variability. They also propose a multimodal fusion model to cope with subtle differences under challenging conditions. Rodriguez-Vazquez et al. [21] show that unsupervised adversarial alignment can cut plant-counting error by 97% under strong cross-domain shifts. On large-scale plant tasks, Malik et al. [22] report only 87% accuracy on the PlantCLEF challenge using EfficientNet-B1, illustrating the difficulty of fine-grained species classification across hundreds of categories.
These studies highlight that background clutter, lighting changes, and novel species reduce performance in practice. Importantly, most existing hybrid models and benchmarks still assume relatively clean or controlled data.
3. Materials and Methods
3.1. Datasets
3.1.1. The Cyprus Seasonal Flora Image Dataset (CSFID)
The Cyprus Seasonal Flora Image Dataset [7] consists of 3072 labeled training and 768 labeled test images of 27 plant species commonly found in Cyprus. Each image represents a distinct flora class captured under real-world conditions, including herbs, shrubs, trees, and flowering plants. Images were captured in different seasons using different mobile phones under varied lighting, backgrounds, and scales, making the dataset suitable for robust training and for evaluating multiclass plant classification models intended for real-world deployment.
Table 1 presents the detailed plant species and training and test samples for each species.
In contrast to many existing plant datasets collected under laboratory or standardized conditions [8,12], the CSFID dataset was collected entirely in natural environments without isolation of plants or arrangement of scenes. This eliminates the need for controlled acquisition settings for the users and allows the dataset to reflect real-world scenarios, where multiple plant species, occlusions, background clutter, and environmental variations naturally occur. As a result, the CSFID dataset provides a more challenging and realistic benchmark for evaluating plant recognition models.
The dataset was originally split into a train and a test set. Single-plant images or uniform-background images were included in the training set, while more complex multi-plant images captured in crowded plant regions were reserved for testing. This provided an assessment of the model’s generalization under a challenging real-world distribution shift across multiple plants, partial occlusions, background interference, and varying spatial compositions.
The official CSFID split does not follow a conventional IID classification protocol. The training set of CSFID primarily contains single-plant or relatively uniform-background images, whereas the test set contains more challenging multi-plant scenes with occlusions, background clutter, and varying spatial compositions. For that reason, the reported results should be interpreted as classification performance under the official domain-shift evaluation setting rather than under a standard random train–test split. This train–test split protocol provides a realistic assessment of model robustness to distribution shifts commonly encountered in practical plant identification applications.
Additionally, the acquisition conditions of plant images, such as different seasons, times of day, weather conditions, and mobile camera devices, introduce challenging variations in illumination, color distribution, angle, and brightness. The multi-plant scenes in the test set also aim to evaluate the model’s abilities in real-world scenarios.
Figure 1 and Figure 2 present example images for plant species of the Cyprus Seasonal Flora Image Dataset. The first row shows training images, while the second row presents test images of the same plant species. It is clear that the test images have the same characteristics as the training images; however, they also include additional artifacts, plants, or environmental effects that make recognition challenging.
Figure 1 shows how the aloe vera, kalanchoe, and iris images differ between training and testing: single plants or different flower characteristics appear in the training images, whereas multiple plants, crowded environments, or flower characteristics appear in the test images. Similarly, Figure 2 presents plants and flowers captured during different seasons and maturity stages.
3.1.2. Swedish Leaf Dataset (SLD)
We considered the Swedish Leaf Dataset [8] as a supplementary dataset to assess the consistency of the proposed HDL-PlantNet architecture under controlled-environment and standardized imaging conditions. This benchmark dataset comprises leaf images from 15 plant species, with approximately 75 samples per class, captured under controlled conditions.
Figure 3 presents sample leaf images from the Swedish Leaf Dataset.
3.2. Deep Learning Models
3.2.1. VGG16
VGG16 [23] is a deep CNN with 16 learnable layers, including stacked 3 × 3 convolutional filters and max pooling layers. It is one of the simplest and most uniform deep learning architectures. VGG16 is widely used for image classification, transfer learning, and feature extraction due to its efficient learning of hierarchical features.
3.2.2. ResNet50
ResNet [24] introduced residual skip connections to CNNs to mitigate vanishing gradients in deep networks. ResNet50 is a 50-layer architecture and, due to its strong performance and lower computational cost, is commonly used in computer vision tasks.
3.2.3. EfficientNetV2 Small
EfficientNetV2-S [25] is an improved version of EfficientNet [26] that aims to provide high accuracy with faster training speed. It combines MBConv and fused MBConv blocks as a new-generation architecture. Its efficacy in scaling features in a balanced way makes it widely preferred for practical image classification.
3.2.4. ConvNeXt-Tiny
ConvNeXt-Tiny [27] is another new-generation lightweight CNN that incorporates design ideas from Vision Transformers while retaining CNN efficiency. ConvNeXt-Tiny achieves strong visual recognition performance and is widely used when efficient yet powerful CNN backbones are needed.
3.2.5. MaxViT
MaxViT [6] is a hybrid model that combines convolutional layers with Transformer-based attention mechanisms. Both local window and global grid attention are used to capture short- and long-range image relationships. This design enables strong performance on image classification and other vision tasks.
3.3. The Proposed HDL-PlantNet Model
The proposed HDL-PlantNet model is a custom hybrid architecture that combines an EfficientNetV2-S backbone with a Transformer encoder block to leverage the complementary strengths of both approaches. HDL-PlantNet is not proposed as a fundamentally new CNN–Transformer paradigm. Instead, its contribution lies in the task-specific adaptation of a compact Transformer-based contextual refinement module for challenging field-based plant recognition. The proposed model accepts input images normalized using the standard ImageNet mean and standard deviation.
EfficientNetV2-S is employed as the primary backbone network to extract hierarchical representations. The backbone was initialized with pretrained ImageNet weights and fine-tuned on the target plant dataset. Prior to the Transformer module, the EfficientNetV2-S backbone captures local textures, shapes, and edge information. Although the full EfficientNetV2-S backbone was retained within the HDL-PlantNet architecture, only a subset of its parameters participated in gradient-based optimization during fine-tuning. Specifically, the later backbone layers and the Transformer module remained trainable, whereas the remaining backbone layers were frozen. Therefore, trainable parameter counts reflect the adopted training strategy rather than the full architectural complexity of the model.
The original EfficientNetV2-S classification head, consisting of pooling, dropout, and fully connected output layers, is removed and replaced with a Transformer-enhanced custom classification module to improve representational efficacy. The extracted features are reshaped into token sequences and processed using a multi-head self-attention mechanism. This block consists of attention layers, residual skip connections, layer normalization, and a feed-forward network. The Transformer module is used to model spatial dependencies and relationships in different image regions. Therefore, the hybrid structure enables simultaneous local and global feature learning.
Then, an adaptive average pooling layer is used to generate a compact global descriptor for features. Finally, a fully connected classification layer maps the learned representation to the corresponding plant classes.
Figure 4 presents the basic block diagram of the proposed HDL-PlantNet model.
Table 2 shows the layer-by-layer composition of the proposed model with its parameters.
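The token-based refinement step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes a single attention head with identity query, key, and value projections, and it omits layer normalization and the feed-forward network. The function names (`feature_map_to_tokens`, `self_attention`, `refine`) are hypothetical and introduced only for illustration.

```python
import math


def feature_map_to_tokens(feature_map):
    """Flatten a C x H x W feature map (nested lists) into H*W tokens of length C."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity Q/K/V projections (no learned weights), so every
    output token is a convex combination of the input tokens.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    output = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        output.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return output


def refine(feature_map):
    """Tokens -> attention -> residual connection -> global average pooling."""
    tokens = feature_map_to_tokens(feature_map)
    attended = self_attention(tokens)
    residual = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    count = len(residual)
    return [sum(tok[d] for tok in residual) / count for d in range(len(residual[0]))]
```

Because every spatial position attends to every other position, the pooled descriptor can mix information from distant image regions, which is the property the hybrid design relies on. A learned multi-head version would add projection matrices, normalization, and a feed-forward network around the same core computation.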
3.4. Experimental Design and Evaluation Metrics
Because the Cyprus Seasonal Flora Image Dataset is pre-divided into training and testing subsets, the models were trained using the training set and subsequently evaluated on the test set. The test set was not used for iterative hyperparameter optimization or architecture tuning.
The Swedish Leaf Dataset is split into train and test sets using an 80:20 hold-out since the image characteristics are uniform for each class. The results are obtained using the test set.
Initially, hyperparameters were tuned for all models on the Cyprus Seasonal Flora Image Dataset training set using 5-fold cross-validation. The same five stratified cross-validation folds were used for all models to ensure a fair comparison. During the optimization process, the learning rate, batch size, number of epochs, and dropout rate were systematically adjusted. Hyperparameter selection was based primarily on the mean Macro-F1 score across validation folds because of the class imbalance present in CSFID.
Table 3 presents the details of the hyperparameter optimization.
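The fold construction and selection rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the study: the round-robin assignment, the seed handling, and the names `stratified_folds` and `select_best` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds with similar class proportions.

    Reusing the same labels, k, and seed reproduces the same folds, so every
    model can be validated on identical splits.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [None] * len(labels)
    offset = 0  # rotate the starting fold so fold sizes stay balanced
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            fold_of[index] = (position + offset) % k
        offset = (offset + len(indices)) % k
    return fold_of


def select_best(results):
    """Pick the configuration with the highest mean validation Macro-F1.

    results maps a configuration name to its list of per-fold Macro-F1 scores.
    """
    return max(results, key=lambda cfg: sum(results[cfg]) / len(results[cfg]))
```

For example, `select_best({"lr_a": [0.80, 0.90], "lr_b": [0.85, 0.95]})` returns `"lr_b"`; the configuration names here are placeholders rather than values from Table 3.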
As a result, a batch size of 16, a learning rate of , and a dropout rate of consistently achieved the best overall performance across all evaluated models. Even though minor variations occurred for the optimal epoch numbers between models, ranging between 8 and 11 epochs, the observed differences in F1-score within this interval remained at approximately . Therefore, we fixed the training epochs at 10 for all models.
After determining the optimal hyperparameters, each model was retrained on the full training set and evaluated once on the official test set. The test set was not used during model selection or hyperparameter optimization.
The models were trained using the cross-entropy loss function and the Adam optimizer, and were trained and evaluated independently on each dataset. Performance was evaluated on all datasets using common multiclass evaluation metrics: Accuracy, Macro-F1 Score, Macro-Recall, Macro-Precision, Macro-Specificity, and Macro-AUC.
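As a reference for how the macro-averaged metrics relate to a confusion matrix, the following sketch computes them in a one-vs-rest fashion and averages over classes with equal weight. It is a minimal illustration, not the evaluation code used in the study; the function name `macro_metrics` and the convention of scoring 0 when a denominator is empty are assumptions. Macro-AUC is omitted because it requires predicted scores rather than hard labels.

```python
def macro_metrics(confusion):
    """Compute accuracy and macro-averaged metrics.

    confusion[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, f1s, specificities = [], [], [], []

    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp

        # Assumed convention: a metric with an empty denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        specificities.append(specificity)

    return {
        "accuracy": sum(confusion[k][k] for k in range(n)) / total,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
        "macro_specificity": sum(specificities) / n,
    }
```

With the toy matrix `[[90, 0], [8, 2]]`, accuracy is 0.92 while Macro-Recall is only 0.60, because the minority class is mostly missed. This is why macro-averaged scores are more informative than accuracy on imbalanced data.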
After determining the final configuration, all competing models were trained and evaluated with the same fixed random seed and official dataset split to ensure reproducible, consistent comparison conditions. For all experiments, no additional data augmentation techniques are applied during training. Images are resized to pixels and normalized using the ImageNet mean and standard deviation.
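The normalization step can be written out per pixel. The sketch below uses the widely used ImageNet channel statistics; the source does not list the exact values, so treating them as the ones applied here is an assumption, and `normalize_pixel` is a hypothetical helper name.

```python
# Commonly used ImageNet channel statistics for RGB inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map an 8-bit (R, G, B) pixel to ImageNet-normalized channel values."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )
```

Each channel is first scaled to [0, 1] and then standardized, so the inputs match the statistics seen by the pretrained backbone during its original training.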
Additionally, a paired McNemar statistical test [28] was conducted on the Cyprus Seasonal Flora Image Dataset to assess whether the differences between the proposed model and the other models are statistically significant. To account for multiple pairwise comparisons among the evaluated models, Holm–Bonferroni correction was applied to control the family-wise error rate. Adjusted p-values were used when interpreting statistical significance.
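The statistical procedure can be sketched in a few lines. The source does not state which McNemar variant was used, so the exact binomial form below is an assumption (a chi-square approximation is a common alternative), and the function names are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: test samples the first model classified correctly and the second did not.
    c: test samples the second model classified correctly and the first did not.
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        value = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[index] = running_max
    return adjusted
```

Only the discordant pairs enter the test, so two models with similar accuracy can still differ significantly if their errors fall on different samples. The adjusted p-values from `holm_bonferroni` are then compared against the 0.05 threshold.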
All experiments were conducted on a Windows 11 system equipped with 64 GB RAM, Intel® Core™ i9-14900 KF, and NVIDIA GeForce RTX 5090 with PyTorch 2.8.0.
4. Results
4.1. Results on Cyprus Seasonal Flora Image Dataset
Macro-averaged results were obtained for all models on the Cyprus Seasonal Flora Image Dataset to enable a comparative evaluation. Class-based results were then analyzed for the model that achieved the highest macro-averaged performance.
4.1.1. Macro-Averaged Results on Cyprus Seasonal Flora Image Dataset
Macro-averaged results showed that the ResNet50 model did not achieve competitive performance in classifying plant species and achieved the lowest scores across all metrics. It achieved 46.99% overall accuracy and 38.62% Macro-F1 Score. Even though the MaxViT model achieved high accuracy (96.54%), it could not obtain effective results in other metrics. The MaxViT model achieved 42.31% Macro-Recall and 42.12% Macro-F1 Score, which indicates that the model focuses on dominant classes while failing to classify classes with lower image counts.
The VGG16 and ConvNeXt-Tiny models obtained similar results; however, ConvNeXt-Tiny achieved slightly higher results in all metrics. In particular, ConvNeXt-Tiny produced more precise results for positive predictions (Macro-Precision = 85.60%), whereas VGG16 failed to predict with high precision (78.31%).
EfficientNetV2-S achieved higher results compared to other pre-trained deep learning models and obtained 87.71% Macro-F1 Score and 86.71% Macro-Recall. However, the proposed HDL-PlantNet model achieved the highest observed performance in all metrics (Macro-F1 Score = 90.06%, Macro-Precision = 91.70%, and Macro-Recall = 89.38%) and outperformed all other models considered in this study.
Figure 5 presents the confusion matrices of the top 2 models, HDL-PlantNet and EfficientNetV2 Small.
Table 4 presents all results for all models in detail.
4.1.2. Class-Based Results on Cyprus Seasonal Flora Image Dataset
Table 5 presents the detailed class-based results obtained by the proposed HDL-PlantNet and the backbone EfficientNetV2-S, which obtained the second-highest scores on the Cyprus Seasonal Flora Image Dataset.
Since the proposed HDL-PlantNet outperformed the other models in the macro-averaged results, its class-based results are analyzed here.
The HDL-PlantNet model showed outstanding performance in classifying Crown Daisy, Iris, and Pine plants, achieving 100% across all metrics. However, the model had some difficulty discriminating between the mandarin (56.25%) and orange (57.14%) classes, which belong to the same family. The model tended to favor the lemon class, which had a larger number of training images.
Some species in the dataset contain fewer samples than others, which may reduce the models’ capability to sufficiently discriminate minor classes under complex field conditions. However, despite these challenges, the proposed HDL-PlantNet architecture maintained relatively stable macro-level performance. This suggests that the contextual-attention refinement mechanism may improve robustness for minority categories under difficult field conditions.
4.1.3. Statistical Results on Cyprus Seasonal Flora Image Dataset
McNemar analysis was used to evaluate paired prediction disagreement on identical test samples, complementing the macro-averaged performance metrics.
Table 6 presents the paired McNemar test results comparing the proposed HDL-PlantNet model with the benchmark architectures on the Cyprus Seasonal Flora Image Dataset. The significance decisions were based on Holm–Bonferroni-adjusted p-values.
The findings indicate that the HDL-PlantNet showed statistically significant improvements compared to VGG16 (), ResNet50 (), ConvNeXt-Tiny (), and MaxViT (). Statistical results suggest that the superior classification performance of the proposed HDL-PlantNet model against these models is unlikely to be due to random variation.
However, the comparison between HDL-PlantNet and EfficientNetV2-S is slightly above the 0.05 threshold. Since EfficientNetV2-S also serves as the backbone of the proposed model, this result suggests that the Transformer enhancement provided measurable but modest gains.
Overall, the statistical analysis supports the effectiveness of HDL-PlantNet compared to VGG16, ResNet50, ConvNeXt Tiny, and MaxViT. HDL-PlantNet achieved a practically meaningful improvement over EfficientNetV2-S in the Macro-F1 score, increasing from 87.71% to 90.06%, but the improvement was not statistically significant because it did not meet the conventional 0.05 significance threshold.
4.2. Results on Swedish Leaf Dataset
The experiment on the SLD was performed as a supplementary benchmark evaluation rather than an indicator of real-world generalization of the models.
The Swedish Leaf Dataset is used to test the proposed model to determine its effectiveness across different datasets, and a similar comparative evaluation is performed across all the considered deep learning models.
Given the low intra-class variation, uniform pattern representation, and clean backgrounds of leaf images in the Swedish Leaf Dataset, all models achieved full or similar recognition performance. However, the VGG16 model obtained slightly lower results with 95.56% Macro-Recall and 95.16% Macro-F1 Score.
Even though the EfficientNetV2-S model failed to recognize a few samples, it achieved >99% performance across all metrics. The other models, ResNet50, ConvNeXt-Tiny, and the proposed HDL-PlantNet, achieved 100% performance.
The Swedish Leaf Dataset results demonstrate that the considered models and the proposed architecture maintain stable performance under controlled imaging conditions.
Table 7 presents the obtained results on the Swedish Leaf Dataset in detail.
4.3. Ablation Study
An ablation study was conducted to evaluate the contributions of the Transformer module and its variants in HDL-PlantNet on the Cyprus Seasonal Flora Image Dataset. Four model variants were tested: the EfficientNetV2-S backbone alone (A0_NO_TX), HDL-PlantNet with a single transformer block using two attention heads (A1_H2), HDL-PlantNet with a single transformer block using eight attention heads (A1_H8), and HDL-PlantNet with two transformer blocks, each with four attention heads (A3_2B). All other components, including the EfficientNetV2-S backbone, classification head, and training schedule, were kept identical across experiments.
Table 8 presents the details and results of the ablation study.
The ablation results indicate that the transformer configuration influences the balance between Macro-Recall and Macro-Precision. The proposed HDL-PlantNet architecture achieved the highest overall Macro-F1 score (90.06%), representing a modest but practically meaningful improvement over the EfficientNetV2-S backbone (87.71%).
Increasing the number of attention heads or transformer blocks did not consistently improve performance. Although the variant with two transformer blocks and four attention heads achieved relatively higher Macro-Precision (99.07%), its Macro-Recall decreased substantially (Macro-Recall = 63.64%), indicating reduced effectiveness at identifying positive samples. This reduction might be attributed to over-parameterization with the limited dataset size, which was insufficient to effectively optimize a larger number of attention subspaces. Similarly, removing the transformer module resulted in lower Macro-Recall and Macro-F1 scores compared to the proposed HDL-PlantNet architecture.
Overall, the results indicate that larger Transformer configurations do not improve performance, particularly on relatively small, fine-grained plant image datasets. These findings are consistent with previous studies reporting that self-attention mechanisms can improve feature representation and contextual modeling in visual recognition tasks [29]. However, transformer integration was not universally beneficial; while the single-block transformer configurations improved performance, increasing the number of attention heads or transformer blocks led to notable performance degradation.
5. Discussion
The employed hybrid deep learning model, HDL-PlantNet, demonstrated encouraging performance on two datasets. The use of EfficientNetV2-S as the backbone, combined with a Transformer self-attention module, contributed to the modeling of long-range dependencies within the images. Consequently, color, structural, and regional features could be represented more effectively [4,30]. Nevertheless, the reported results reflect the model’s robustness to a predefined distribution shift and should not be interpreted as universally representative of all fine-grained plant classification settings.
Instead of employing multiple heavy Transformer blocks, we adopted a compact single-block configuration with carefully selected attention heads to preserve computational feasibility while maintaining strong discriminative performance. Additionally, the ablation study demonstrated that the Transformer encoder module integrated into the EfficientNetV2-S backbone provides a measurable contribution, and the absence of this module or structural modifications negatively affects feature representation.
The ablation study further shows that the proposed contextual refinement module improves performance compared with alternative Transformer configurations and the backbone-only architecture [4,20].
HDL-PlantNet achieved a balance between classification performance and computational efficiency. Although the proposed architecture includes additional Transformer-based operations compared to EfficientNetV2-S, the increase in computational complexity in GFLOPs was moderate (approximately ). Furthermore, HDL-PlantNet required substantially fewer trainable parameters (9,191,939) than the compared architectures due to partial parameter freezing. This led to reduced optimization complexity and the preservation of higher representations. Despite the increased architectural complexity, the proposed model maintained real-time inference performance with competitive latency (14.86 ms/image) and FPS (67.29).
Table 9 shows the comparison of the computational efficiency of the models in detail.
As determined during hyperparameter optimization, all experiments were performed using a batch size of 16. However, latency and FPS measurements were obtained using a batch size of 1 to evaluate single-image inference performance.
Latency values correspond to the average of 1000 forward passes after an initial warm-up phase of 100 inferences on the RTX 5090 GPU. Data-loading and preprocessing times were excluded from the measurements.
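The measurement protocol can be sketched as a small timing harness. This is an illustrative sketch, not the benchmarking code used in the study; `measure_latency` and its `forward` and `sample` arguments are hypothetical names for a callable model and a single preprocessed input.

```python
import time


def measure_latency(forward, sample, warmup=100, runs=1000):
    """Return (mean latency in ms per image, frames per second) for batch size 1.

    Warm-up calls are executed but not timed. Data loading and preprocessing
    are excluded because `sample` is prepared before the timer starts.
    """
    for _ in range(warmup):
        forward(sample)

    # Note: on a GPU, kernels run asynchronously, so the device should be
    # synchronized before each clock reading (in PyTorch, with
    # torch.cuda.synchronize()); this sketch times a plain Python callable.
    start = time.perf_counter()
    for _ in range(runs):
        forward(sample)
    elapsed = time.perf_counter() - start

    latency_ms = 1000.0 * elapsed / runs
    fps = runs / elapsed
    return latency_ms, fps
```

Under this definition, FPS is simply 1000 divided by the latency in milliseconds, which agrees with the reported figures: 1000 / 14.86 ≈ 67.29.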
The success and strength of the proposed model were primarily demonstrated on the Cyprus Seasonal Flora Image Dataset, which was introduced as the primary dataset and consists of intertwined plant images collected under real-life environmental conditions. The proposed model outperformed all five benchmark models across all evaluation metrics and successfully classified minority plant classes with relatively fewer images in the imbalanced dataset. These results are consistent with prior studies’ findings, such as ConvTransNet-S (with LPU + transformer), which showed large gains over standalone CNNs and ViTs in complex scenes [4]. Compared to large ensembles such as Plant-CNN-ViT [19], which achieved ~100% accuracy on lab leaf sets, our approach is more compact and explicitly tailored to noisy field images.
The proposed HDL-PlantNet demonstrated strong effectiveness in field-based plant recognition across varied natural backgrounds, lighting conditions, intertwined plants, and visually similar species.
The GradCAM++ [31,32] analysis demonstrated that the proposed model focused on biologically discriminative regions of different plant species under varying visual conditions. In Figure 6a,b, the activation map highlights leaf and flower structures of the olive tree, indicating that the model relies on seasonal characteristic texture and patterns for classification. In Figure 6c,d, the model successfully identified the olive tree from trunk-related structural features, despite the absence of visible leaves or flowers. Similarly, Figure 6e,f shows that the model concentrated on rose flower regions, whereas Figure 6g,h demonstrates that the model can also utilize leaf morphology and vein structures for rose classification even when flowers are absent from the scene.
However, images of lemons, oranges, and mandarins, which belong to the same botanical family, were occasionally confused. The minority classes were more frequently misclassified as the lemon class, which had more training images. This finding suggests that adding samples for the underrepresented classes would yield a more balanced and practically meaningful real-world dataset.
Figure 7 shows an example in which the model misclassified an orange image as a lemon.
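One common way to counteract this kind of majority-class bias without collecting new images is to weight each class inversely to its frequency, either in the loss function or in the sampler. The sketch below illustrates the widely used heuristic w_c = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the number of samples in class c. It is an illustration only and was not part of the reported experiments; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency class weights: w_c = N / (K * n_c).

    labels -- sequence of class labels, one entry per training image
    Returns a dict mapping each class to its weight; rare classes get
    weights above 1 and frequent classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
```

For a toy label list with six lemon, two orange, and two mandarin images, the lemon class receives a weight of about 0.56 and each minority class about 1.67, so every class contributes equally to a weighted loss in aggregate.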
Although the Swedish Leaf Dataset is a relatively easy benchmark on which modern deep models can achieve high performance [33], the results on it indicate that HDL-PlantNet performs consistently on controlled-environment data. These results agree with prior studies, which reported up to 99.47% accuracy on Swedish leaves using a CNN with data standardization and 100% using an ensemble of pre-trained CNNs [33].
Previous studies have reported similarly high accuracies mainly under controlled imaging conditions. Mohanty et al. [12] achieved 99.35% accuracy on the PlantVillage disease dataset, while Ferentinos [14] achieved approximately 99.5% using CNN-based models. Sekeroglu and Inan [34] obtained 97.2% accuracy on the primary leaf dataset created in a laboratory environment using a shallow multilayer perceptron.
However, these datasets largely consist of clean and isolated leaf images captured under favorable conditions. In contrast, the Cyprus Seasonal Flora Image Dataset presents a more realistic scenario. Therefore, the strong results obtained by HDL-PlantNet suggest that modern hybrid architectures can substantially narrow the long-standing performance gap between laboratory-based and field-based plant image classification.
Table 10 summarizes the plant and disease classification studies for different datasets.
Although the proposed HDL-PlantNet model achieved promising results, the study has several limitations. First, the imbalanced class distribution of plant images in CSFID may bias the model toward dominant classes. Second, although the proposed model was trained on two datasets with distinct characteristics, external validation is still required before real-world deployment. Third, visually similar species from the same botanical family remained challenging to distinguish and require further investigation. Future work will address these issues with larger multi-region datasets and balanced sampling strategies. Additionally, the proposed HDL-PlantNet is primarily a task-oriented adaptation of existing CNN–Transformer concepts and does not propose a fundamentally new architectural approach. Its primary contribution is to provide measurable benefits for plant recognition under challenging domain-shift conditions. Finally, using a single fixed random seed to ensure reproducibility may limit the generalizability of the results, and additional repetitions with multiple random seeds could provide a more comprehensive assessment.
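The multi-seed repetition suggested in the last limitation can be sketched as follows: the same training and evaluation routine is run once per seed, and the mean and sample standard deviation of the resulting scores are reported. This is a minimal sketch under stated assumptions; `evaluate_over_seeds` and `train_and_eval` are hypothetical names, and `train_and_eval(seed)` stands for whatever procedure seeds the random number generators, trains the model, and returns a scalar metric such as the macro F1 score.

```python
from statistics import mean, stdev


def evaluate_over_seeds(train_and_eval, seeds):
    """Repeat an experiment over several seeds and summarize the scores.

    train_and_eval -- hypothetical callable: seed -> scalar metric
    seeds          -- iterable of integer seeds
    Returns (mean score, sample standard deviation, list of scores).
    """
    scores = [train_and_eval(seed) for seed in seeds]
    # The sample standard deviation needs at least two runs.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread, scores
```

Reporting the result as mean ± standard deviation over, for example, five seeds would show whether the margin over the strongest baseline exceeds the run-to-run variation.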
6. Conclusions
In this study, we proposed a hybrid CNN–Transformer architecture, Hybrid Deep Learning PlantNet (HDL-PlantNet), designed for fine-grained plant species classification. Two datasets were used in the study. The primary dataset, the Cyprus Seasonal Flora Image Dataset, was used to incorporate real-world distortions and challenges, while the Swedish Leaf Dataset was used to assess the proposed model’s consistency under controlled conditions.
Comprehensive comparative experiments were conducted, and the proposed HDL-PlantNet model consistently outperformed several state-of-the-art models, achieving F1 scores of 90.06% and 100% on the Cyprus Seasonal Flora Image Dataset and the Swedish Leaf Dataset, respectively. These results moderately surpass the best-performing baseline (EfficientNetV2-S). The model achieves this under real-world field conditions by integrating an EfficientNetV2-S backbone with a Transformer encoder, which combines local and global feature representations within a unified framework.
The ablation study confirmed the role of the Transformer module, as its removal caused a performance decrease of up to 3%. This indicates that the selected hybrid configuration has the potential to provide more balanced performance than the backbone model alone. The results demonstrate that a CNN–Transformer hybrid architecture can achieve promising performance on a real-world, field-collected dataset. Overall, the proposed approach and released dataset enable real-time plant identification in complex real-life environments and support informed ecological decision-making for farmers, ecologists, and the general public.
Future work will include extending the species list in the Cyprus Seasonal Flora Image Dataset, evaluating the proposed model on additional field-domain datasets, and developing a mobile application for real-world use.