A CNN and Transformer-Based Framework for Fine-Grained Plant Species Classification in Real-World Environments

Malann, Daniel Chwaifo; Cavus, Nadire; Sekeroglu, Boran

doi:10.3390/app16125810

by Daniel Chwaifo Malann ¹,²,*, Nadire Cavus ¹,² and Boran Sekeroglu ²,³

¹ Computer Information Systems, Faculty of AI & Informatics, Near East University, N. Cyprus, Mersin 10, Nicosia 99138, Türkiye

² Computer Information Systems Research and Technology Center, Near East University, N. Cyprus, Mersin 10, Nicosia 99138, Türkiye

³ Computer Engineering, Faculty of AI & Informatics, Near East University, N. Cyprus, Mersin 10, Nicosia 99138, Türkiye

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 5810; https://doi.org/10.3390/app16125810

Submission received: 12 May 2026 / Revised: 3 June 2026 / Accepted: 3 June 2026 / Published: 9 June 2026

Abstract

Plant recognition plays a vital role in agriculture and biodiversity monitoring, and deep learning, particularly convolutional neural networks (CNNs), has gained increased attention for automating this task. However, CNNs have a limitation in their ability to handle complex patterns due to the difficulty in capturing global contextual information. Furthermore, plant datasets are often created in laboratory environments that minimize discrimination challenges, enabling the analysis of model performance. This study proposes a hybrid deep learning model, HDL-PlantNet, for real-world plant recognition on the primary dataset, the Cyprus Seasonal Flora Image Dataset (CSFID), comprising 27 plant species. The HDL-PlantNet model integrates an EfficientNetV2-S convolutional backbone with a Transformer encoder to capture both spatial contextual and long-range dependencies. Additionally, the Swedish Leaf Dataset is used as a supplementary dataset to analyze the consistency of the HDL-PlantNet under controlled environments. Five benchmark CNN models are used for comparative evaluation, and statistical tests and an ablation study are conducted to assess the results. The proposed model achieved the highest observed Macro-F1 and Macro-AUC scores among the evaluated models, reaching 90.06% and 99.59%, respectively. The results demonstrate that combining convolutional and Transformer architectures yields computationally effective performance in fine-grained plant classification while maintaining a compact model size suitable for further research. This study contributes to real-time plant identification studies and supports informed ecological decision-making.

Keywords:

HDL-PlantNet; plant recognition; EfficientNetV2-Small; transformer; Cyprus Seasonal Flora Image Dataset; Swedish Leaf Dataset

1. Introduction

Plant image recognition has become an important application of artificial intelligence in agriculture and biodiversity informatics. In farming, plant pests and diseases can devastate crop yields; up to 40% of global crop production is lost annually, costing over $220 billion, according to the Food and Agriculture Organization [1]. Early and accurate identification of plant ailments is therefore critical for food security. Beyond agriculture, automated plant species identification can assist botanists and laypersons in rapidly recognizing plant types, which is valuable for environmental monitoring and medicinal plant discovery [2]. Because manual plant identification is time-consuming and requires expert knowledge, computer vision techniques can automate the process with robust and accurate results. In recent years, deep learning has taken on an important role in analyzing plant images for problems such as disease detection and species classification [2,3]. Convolutional Neural Network (CNN)-based models learn rich hierarchical features from plant images and achieve superior results compared to earlier systems [2].

However, CNNs are limited in capturing long-range dependencies in images because their receptive fields are fixed by the kernel sizes [4]. This limitation is problematic for plant images in which disease symptoms or species-specific cues appear in distant regions of the leaf or the plant. Recently, Vision Transformers (ViTs) have introduced a new approach to image recognition, replacing convolutional layers with self-attention mechanisms [4]. This approach improves a model’s ability to capture global features and learn global patterns within plants. Despite these promising improvements, ViTs need large amounts of data to converge well, which makes them prone to overfitting on small datasets [3,5]. These complementary strengths and limitations have led researchers to create hybrid models for plant image analysis.

In this study, we employed a hybrid deep learning model (HDL-PlantNet) that combines a CNN and a Transformer, specifically designed for plant leaf classification. The model includes the EfficientNetV2-Small as a convolutional backbone for initial feature extraction, a custom Transformer encoder to enhance extracted features for long-range dependencies, and a classification head. The aim of this architecture is to improve the class separability across diverse species captured in real-life environments while capturing fine-grained local features and global contextual patterns.

The primary contribution of HDL-PlantNet is its task-specific adaptation of a compact EfficientNetV2-S backbone, combined with a Transformer-based feature refinement module, for field-based plant species recognition. In contrast to large-scale hybrid architectures such as CoAtNet, MaxViT [6], and Swin-based models, HDL-PlantNet is designed to improve contextual feature modeling after convolutional feature extraction without substantially increasing computational cost. This design is particularly relevant for plant identification under field conditions, where distinguishing patterns are often confined to small or partially occluded regions.

The Cyprus Seasonal Flora Image Dataset (CSFID) [7] is used as the primary dataset in the experiments, and the Swedish Leaf Dataset [8] is used as a supplementary dataset to assess the consistency of HDL-PlantNet under controlled environments.

2. Related Studies

To leverage the strengths of both CNNs and Transformers, researchers have begun exploring hybrid architectures that combine convolutional and self-attention modules within a single model [4]. By merging CNN’s strong local feature extraction with the ViT’s global context modeling, such hybrid models can produce more discriminative representations of plant images. Recent studies in the agricultural domain indicate that CNN–Transformer hybrids can improve classification performance and robustness.

Lee et al. [9] demonstrated that a deep CNN could automatically learn discriminative leaf features and outperform hand-crafted feature methods, achieving 99.6% accuracy in classifying 44 plant species from the new MalayaKew leaf dataset. This study led to further plant recognition studies using CNNs on field imagery. Dyrmann et al. [10] built a custom CNN to distinguish 22 weed and crop species at early growth stages, reporting 86.2% accuracy. This validated CNNs’ potential for real-world agricultural classification, albeit with the recognition that more training images per class were needed to improve performance further [11].

Mohanty et al. [12] used the publicly available PlantVillage dataset with 54,306 leaf images and trained deep networks to identify 38 crop–disease classes. Their study achieved 99.35% overall accuracy on held-out test data. Carranza-Rojas et al. [13] performed one of the first large-scale plant species classification studies using deep learning. They evaluated CNNs on thousands of herbarium specimen images spanning over 1000 species, reporting that transfer learning yielded an impressive 90% top-5 accuracy. However, such high accuracy in lab conditions also raised concerns about overfitting and domain bias [14].

By 2020, the transformers originally developed for natural language processing were adapted for vision tasks, signaling a new direction beyond pure CNNs. Dosovitskiy et al. [15] introduced the Vision Transformer (ViT), which treats an image as a sequence of patch embeddings and applies a Transformer encoder for classification [6]. Remarkably, a ViT pre-trained on sufficient data surpassed the accuracy of state-of-the-art CNNs on ImageNet, ending the dominance of CNNs in image recognition. However, ViTs have far fewer built-in inductive biases than CNNs, which made them data-hungry and prone to overfitting on limited data [6,16]. Without extensive pre-training, vanilla ViT underperformed ResNets on standard datasets [6].

As CNNs and Transformers each offered unique advantages, researchers increasingly experimented with hybrid architectures that combine convolutional and self-attention mechanisms. The general idea is to utilize CNN layers for efficient local feature extraction and Transformer blocks for modeling long-range dependencies and global context. In 2021, Google’s CoAtNet explicitly married convolutions and Transformers by stacking convolutional layers in early network stages and Transformer layers in later stages [17]. In 2022, a number of sophisticated hybrid models emerged in the computer vision community. Tu et al. [6] introduced MaxViT (Maximal Multi-Axis ViT), a hierarchical model that integrates depthwise convolutions (MBConv blocks) with a novel multi-axis self-attention mechanism in each building block.

Over the last decade, plant recognition research has shifted from traditional CNN solutions to vision transformers and, ultimately, to hybrid CNN–Transformer approaches that aim to harness the strengths of both. Recently, Xu et al. [18] introduced a symmetric hybrid network (PLTransformer) that uses multi-scale convolutional modules and an overlap-attentive downsampler to capture both local textures and global context in plant disease images. Their model achieves 99.95% on PlantVillage, demonstrating the power of CNN+Transformer fusion. Similarly, Lee et al. [19] propose a Plant-CNN-ViT ensemble (ResNet50, DenseNet, Xception, and ViT) for leaf classification, achieving nearly 100% on standard leaf datasets. These recent studies illustrate that integrating CNNs and transformers can improve feature representation and performance.

For instance, Zhang et al. [20] introduce the FewMedical-XJAU benchmark for fine-grained medicinal plants, featuring diverse natural backgrounds and high intra-class variability. They also propose a multimodal fusion model to cope with subtle differences under challenging conditions. Rodriguez-Vazquez et al. [21] show that unsupervised adversarial alignment can cut plant-counting error by 97% under strong cross-domain shifts. On large-scale plant tasks, Malik et al. [22] report only 87% accuracy on the PlantCLEF challenge using EfficientNet-B1, illustrating the difficulty of fine-grained species classification across hundreds of categories.

These studies highlight that background clutter, lighting changes, and novel species reduce performance in practice. Importantly, most existing hybrid models and benchmarks still assume relatively clean or controlled data.

3. Materials and Methods

3.1. Datasets

3.1.1. The Cyprus Seasonal Flora Image Dataset (CSFID)

The Cyprus Seasonal Flora Image Dataset [7] consists of 3072 labeled training and 768 labeled test images of 27 plant species commonly found in Cyprus. Each image represents a distinct flora class captured under real-world conditions, including herbs, shrubs, trees, and flowering plants. Images were captured in different seasons using different mobile phones under varied lighting, backgrounds, and scales to make the dataset suitable for robust training and real-world deployment and evaluation of multiclass plant classification models in computer vision tasks. Table 1 presents the detailed plant species and training and test samples for each species.

In contrast to many existing plant datasets collected under laboratory or standardized conditions [8,12], the CSFID dataset was collected entirely in natural environments without isolating plants or arranging scenes. Users therefore do not need controlled acquisition settings, and the dataset reflects real-world scenarios more faithfully, with multiple plant species, occlusions, background clutter, and environmental variations occurring naturally. As a result, the CSFID dataset provides a more challenging and realistic benchmark for evaluating plant recognition models.

The dataset was originally split into a train and a test set. Single-plant images or uniform-background images were included in the training set, while more complex multi-plant images captured in crowded plant regions were reserved for testing. This provided an assessment of the model’s generalization under a challenging real-world distribution shift across multiple plants, partial occlusions, background interference, and varying spatial compositions.

The official CSFID split does not follow a conventional IID classification protocol. The training set primarily contains single-plant or relatively uniform-background images, whereas the test set contains more challenging multi-plant scenes with occlusions, background clutter, and varying spatial compositions. For that reason, the reported results should be interpreted as classification performance under the official domain-shift evaluation setting rather than under a standard random train–test split. This protocol provides a realistic assessment of model robustness to the distribution shifts commonly encountered in practical plant identification applications.

Additionally, the acquisition conditions of plant images, such as different seasons, times of day, weather conditions, and mobile camera devices, introduce challenging variations in illumination, color distribution, angle, and brightness. The multi-plant scenes in the test set also aim to evaluate the model’s abilities in real-world scenarios.

Figure 1 and Figure 2 present example images for plant species of the Cyprus Seasonal Flora Image Dataset. The first row shows training images, while the second row presents test images of the same plant species. It is clear that the test images have the same characteristics as the training images; however, they also include additional artifacts, plants, or environmental effects that make recognition challenging.

Figure 1 shows how the aloe vera, kalanchoe, and iris images differ between training and testing: the training images show a single plant or distinct flower characteristics, while the test images show multiple plants, crowded environments, or different flower characteristics. Similarly, Figure 2 presents plants and flowers captured during different seasons and maturity stages.

3.1.2. Swedish Leaf Dataset (SLD)

We considered the Swedish Leaf Dataset [8] as a supplementary dataset to assess the consistency of the proposed HDL-PlantNet architecture under controlled-environment and standardized imaging conditions. This benchmark dataset comprises leaf images from 15 plant species, with approximately 75 samples per class, captured under controlled conditions. Figure 3 presents sample leaf images from the Swedish Leaf Dataset.

3.2. Deep Learning Models

3.2.1. VGG16

VGG16 [23] is a deep CNN with 16 learnable layers, including stacked 3 × 3 convolutional filters and max pooling layers. It is one of the simplest and most uniform deep learning architectures. VGG16 is widely used for image classification, transfer learning, and feature extraction due to its efficient learning of hierarchical features.

3.2.2. ResNet50

ResNet [24] introduced residual skip connections to mitigate vanishing gradients in very deep CNNs. ResNet50 is a 50-layer architecture and, due to its strong performance and lower computational cost, is commonly used in computer vision tasks.

3.2.3. EfficientNetV2 Small

EfficientNetV2-S [25] is an improved version of EfficientNet [26] that aims to provide high accuracy with faster training speed. It combines MBConv and fused MBConv blocks as a new-generation architecture. Its efficacy in scaling features in a balanced way makes it widely preferred for practical image classification.

3.2.4. ConvNeXt-Tiny

ConvNeXt-Tiny [27] is another new-generation lightweight CNN that incorporates design choices inspired by Vision Transformers while retaining CNN efficiency. ConvNeXt-Tiny achieves strong visual recognition performance and is widely used when efficient yet powerful CNN backbones are needed.

3.2.5. MaxViT

MaxViT [6] is a hybrid model that combines convolutional layers with Transformer-based attention mechanisms. Both local window and global grid attention are used to capture short- and long-range image relationships. This design enables strong performance on image classification and other vision tasks.

3.3. The Proposed HDL-PlantNet Model

The proposed HDL-PlantNet model is a custom hybrid architecture that combines an EfficientNetV2-S backbone with a Transformer encoder block to leverage the complementary strengths of both approaches. HDL-PlantNet is not proposed as a fundamentally new CNN–Transformer paradigm. Instead, its contribution lies in the task-specific adaptation of a compact Transformer-based contextual refinement module for challenging field-based plant recognition. The proposed model accepts 224 × 224 input images, normalized using the standard ImageNet mean and standard deviation.

EfficientNetV2-S is employed as the primary backbone network to extract hierarchical representations. The backbone was initialized with pretrained ImageNet weights and fine-tuned on the target plant dataset. Ahead of the Transformer module, the backbone captures local textures, shapes, and edge information. Although the full EfficientNetV2-S backbone was retained within the HDL-PlantNet architecture, only a subset of its parameters participated in gradient-based optimization during fine-tuning. Specifically, the later backbone layers and the Transformer module remained trainable, whereas the remaining backbone layers were frozen. Therefore, trainable parameter counts reflect the adopted training strategy rather than the full architectural complexity of the model.
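The partial fine-tuning strategy described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper name `freeze_early_blocks` and the `trainable_tail` parameter are hypothetical, and the text does not report how many backbone layers were left trainable. A toy three-stage stack stands in for the real backbone.

```python
import torch.nn as nn


def freeze_early_blocks(backbone: nn.Sequential, trainable_tail: int) -> int:
    """Freeze all but the last `trainable_tail` child blocks of `backbone`.

    Returns the number of parameters that still receive gradients.
    `trainable_tail` is a hypothetical knob; the paper does not report it.
    """
    blocks = list(backbone.children())
    cutoff = len(blocks) - trainable_tail
    for i, block in enumerate(blocks):
        for p in block.parameters():
            # Blocks before the cutoff are frozen, later ones stay trainable.
            p.requires_grad = i >= cutoff
    return sum(p.numel() for p in backbone.parameters() if p.requires_grad)


# Toy stand-in for a convolutional backbone with three stages (assumption).
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.Conv2d(8, 8, kernel_size=3),
    nn.Conv2d(8, 4, kernel_size=1),
)
trainable = freeze_early_blocks(toy_backbone, trainable_tail=1)
```

Counting only parameters with `requires_grad` set is one way a trainable-parameter figure that depends on the freezing strategy, rather than on the full architecture, could be obtained.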

The original EfficientNetV2-S classification head, consisting of pooling, dropout, and fully connected output layers, is removed and replaced with a Transformer-enhanced custom classification module to improve representational efficacy. The extracted features are reshaped into token sequences and processed using a multi-head self-attention mechanism. This block consists of attention layers, residual skip connections, layer normalization, and a feed-forward network. The Transformer module is used to model spatial dependencies and relationships in different image regions. Therefore, the hybrid structure enables simultaneous local and global feature learning.

Then, an adaptive average pooling layer is used to generate a compact global descriptor for features. Finally, a fully connected classification layer maps the learned representation to the corresponding plant classes. Figure 4 presents the basic block diagram of the proposed HDL-PlantNet model. Table 2 shows the layer-by-layer composition of the proposed model with its parameters.
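The data flow described in this section (convolutional feature map, token sequence, Transformer encoder block, adaptive average pooling, fully connected classifier) can be sketched in PyTorch as below. This is an illustrative sketch, not the published implementation: the class name `HybridCNNTransformer`, the default of four attention heads, the feed-forward width, and the use of the 0.2 dropout rate inside the encoder are assumptions. The helper `build_with_efficientnet_v2_s` assumes torchvision is installed and relies on torchvision's EfficientNetV2-S feature extractor emitting 1280 channels.

```python
import torch
import torch.nn as nn


class HybridCNNTransformer(nn.Module):
    """CNN feature map -> token sequence -> Transformer encoder -> pooled logits."""

    def __init__(self, backbone, feat_dim, num_classes, num_heads=4, dropout=0.2):
        super().__init__()
        self.backbone = backbone  # any module returning (B, C, H, W) features
        # One encoder block: self-attention, residuals, layer norm, feed-forward.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim,
            nhead=num_heads,
            dim_feedforward=2 * feat_dim,  # assumed width, not from the paper
            dropout=dropout,
            batch_first=True,
        )
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        fmap = self.backbone(x)                                 # (B, C, H, W)
        tokens = fmap.flatten(2).transpose(1, 2)                # (B, H*W, C)
        tokens = self.encoder(tokens)                           # (B, H*W, C)
        pooled = self.pool(tokens.transpose(1, 2)).squeeze(-1)  # (B, C)
        return self.fc(pooled)                                  # (B, num_classes)


def build_with_efficientnet_v2_s(num_classes=27):
    """Assumes torchvision is available; weights=None avoids any download."""
    from torchvision.models import efficientnet_v2_s

    features = efficientnet_v2_s(weights=None).features  # 1280 output channels
    return HybridCNNTransformer(features, feat_dim=1280, num_classes=num_classes)
```

With a 224 × 224 input and a backbone of total stride 32, the feature map would be 7 × 7, so the encoder would attend over 49 tokens; that stride is a property of the torchvision backbone assumed here, not a figure taken from the text.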

3.4. Experimental Design and Evaluation Metrics

Because the Cyprus Seasonal Flora Image Dataset is pre-divided into training and testing subsets, the models were trained using the training set and subsequently evaluated on the test set. The test set was not used for iterative hyperparameter optimization or architecture tuning.

The Swedish Leaf Dataset is split into train and test sets using an 80:20 hold-out since the image characteristics are uniform for each class. The results are obtained using the test set.

Initially, hyperparameters were tuned for all models on the Cyprus Seasonal Flora Image Dataset training set using 5-fold cross-validation. The same five stratified cross-validation folds were used for all models to ensure a fair comparison. During the optimization process, learning rate, batch size, number of epochs, and dropout were systematically adjusted, and selection was based primarily on the mean Macro-F1 score across validation folds because of the class imbalance in CSFID. Table 3 presents the hyperparameter variations explored during optimization.
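The tuning protocol above (shared stratified folds, selection by mean Macro-F1) can be illustrated with a small standard-library sketch. The function names `stratified_folds` and `select_best`, the fixed seed, and the round-robin assignment within each class are assumptions made for illustration; the paper does not specify how the folds were generated.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing every class.

    Returns a list `fold_of` where fold_of[i] is the fold of sample i.
    Reusing the same seed gives every model identical folds.
    """
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Deal the shuffled samples of this class round-robin into folds.
        for pos, idx in enumerate(idxs):
            fold_of[idx] = pos % k
    return fold_of


def select_best(fold_scores):
    """Pick the configuration with the highest mean per-fold Macro-F1.

    `fold_scores` maps a configuration name to its list of fold scores.
    """
    return max(fold_scores, key=lambda cfg: mean(fold_scores[cfg]))
```

Fixing the seed is what lets every candidate model be scored on the same five partitions, so differences in mean Macro-F1 come from the configuration rather than from the split.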

As a result, a batch size of 16, a learning rate of 1 × 10⁻⁴, and a dropout rate of 0.2 consistently achieved the best overall performance across all evaluated models. Even though minor variations occurred for the optimal epoch numbers between models, ranging between 8 and 11 epochs, the observed differences in F1-score within this interval remained at approximately ±1.1%. Therefore, we fixed the training epochs at 10 for all models.

After determining the optimal hyperparameters, each model was retrained on the full training set and evaluated once on the official test set. The test set was not used during model selection or hyperparameter optimization.

The models were trained using the cross-entropy loss function and the Adam optimizer, and they were trained and evaluated independently on each dataset. Performance was evaluated on all datasets using common multiclass evaluation metrics: Accuracy, Macro-F1 Score, Macro-Recall, Macro-Precision, Macro-Specificity, and Macro-AUC.
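For readers who want to reproduce the macro-averaged metrics, the following sketch computes them from predicted and true labels using one-vs-rest counts per class. It assumes Macro-F1 is the unweighted mean of per-class F1 scores (a common convention that the text does not spell out), and it omits Macro-AUC because that requires class scores rather than hard predictions. The function name `macro_metrics` is illustrative.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision, recall, specificity, and F1."""
    n = len(y_true)
    precisions, recalls, specificities, f1s = [], [], [], []
    for c in range(num_classes):
        # One-vs-rest confusion counts for class c.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = n - tp - fp - fn
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        specificities.append(spec)
        f1s.append(f1)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "macro_precision": sum(precisions) / num_classes,
        "macro_recall": sum(recalls) / num_classes,
        "macro_specificity": sum(specificities) / num_classes,
        "macro_f1": sum(f1s) / num_classes,
    }
```

Because every class contributes equally to the macro averages, a model that ignores small classes can score well on accuracy while scoring poorly on Macro-Recall and Macro-F1.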

After determining the final configuration, all competing models were trained and evaluated with the same fixed random seed and official dataset split to ensure reproducible, consistent comparison conditions. For all experiments, no additional data augmentation techniques are applied during training. Images are resized to 224 × 224 pixels and normalized using the ImageNet mean and standard deviation.
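As a concrete illustration of the normalization step, the sketch below applies the per-channel ImageNet statistics to a single RGB pixel. The constants are the values commonly used with ImageNet-pretrained models; the paper does not list them, so their use here is an assumption, as is the helper name `normalize_pixel`.

```python
# Commonly used ImageNet channel statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )
```

Applying the same statistics that the backbone saw during pretraining keeps the input distribution close to what the pretrained weights expect.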

Additionally, a paired McNemar statistical test [28] was conducted to assess whether the performance differences between models on the Cyprus Seasonal Flora Image Dataset are statistically significant. To account for multiple pairwise comparisons among the evaluated models, Holm–Bonferroni correction was applied to control the family-wise error rate. Adjusted p-values were used when interpreting statistical significance.
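The statistical procedure can be sketched with the standard library as follows. The text does not state whether the exact (binomial) or chi-square form of McNemar's test was used, so the exact two-sided form here is an assumption; the function names are illustrative. The Holm–Bonferroni step multiplies the sorted p-values by decreasing factors and enforces monotonicity.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from per-sample correctness flags.

    Only discordant samples (one model right, the other wrong) matter.
    """
    b = sum(a and not other for a, other in zip(correct_a, correct_b))
    c = sum(other and not a for a, other in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability at the smaller discordant count.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p-value is scaled by m, the next by m - 1, and so on.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted
```

Each comparison against a benchmark model would contribute one raw p-value from `mcnemar_exact`, and the adjusted values from `holm_bonferroni` would then be compared with the 0.05 threshold.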

All experiments were conducted on a Windows 11 system equipped with 64 GB RAM, an Intel® Core™ i9-14900KF, and an NVIDIA GeForce RTX 5090, using PyTorch 2.8.0.

4. Results

4.1. Results on Cyprus Seasonal Flora Image Dataset

Macro-averaged results are obtained and analyzed for all models on the Cyprus Seasonal Flora Image Dataset in order to perform comparative evaluation, and class-based results are obtained and analyzed using the model that achieved the highest observed performance determined by the macro-averaged results.

4.1.1. Macro-Averaged Results on Cyprus Seasonal Flora Image Dataset

Macro-averaged results showed that the ResNet50 model did not achieve competitive performance in classifying plant species and achieved the lowest scores across all metrics. It achieved 46.99% overall accuracy and 38.62% Macro-F1 Score. Even though the MaxViT model achieved high accuracy (96.54%), it could not obtain effective results in other metrics. The MaxViT model achieved 42.31% Macro-Recall and 42.12% Macro-F1 Score, which indicates that the model focuses on dominant classes while failing to classify classes with lower image counts.

The VGG16 and ConvNeXt-Tiny models obtained similar results; however, ConvNeXt-Tiny achieved slightly higher results in all metrics. In particular, ConvNeXt-Tiny produced more precise results for positive predictions (Macro-Precision = 85.60%), whereas VGG16 failed to predict with high precision (78.31%).

EfficientNetV2-S achieved higher results compared to other pre-trained deep learning models and obtained 87.71% Macro-F1 Score and 86.71% Macro-Recall. However, the proposed HDL-PlantNet model achieved the highest observed performance in all metrics (Macro-F1 Score = 90.06%, Macro-Precision = 91.70%, and Macro-Recall = 89.38%) and outperformed all other models considered in this study. Figure 5 presents the confusion matrices of the top 2 models, HDL-PlantNet and EfficientNetV2 Small. Table 4 presents all results for all models in detail.

4.1.2. Class-Based Results on Cyprus Seasonal Flora Image Dataset

Table 5 presents the detailed class-based results obtained by the proposed HDL-PlantNet and the backbone EfficientNetV2-S, which obtained the second-highest scores on the Cyprus Seasonal Flora Image Dataset.

Since the proposed HDL-PlantNet outperformed other models in macro-averaged results, it is analyzed in class-based results.

The HDL-PlantNet model showed outstanding performance in classifying Crown Daisy, Iris, and Pine plants, achieving 100% across all metrics. However, the model had some difficulty discriminating the mandarin (56.25%) and orange (57.14%) classes, which belong to the same family. The model tended to favor the lemon class, which had a larger number of training images.

Some species in the dataset contain fewer samples than others, which may reduce the models’ capability to sufficiently discriminate minor classes under complex field conditions. However, despite these challenges, the proposed HDL-PlantNet architecture maintained relatively stable macro-level performance. This suggests that the contextual-attention refinement mechanism may improve robustness for minority categories under difficult field conditions.

4.1.3. Statistical Results on Cyprus Seasonal Flora Image Dataset

McNemar analysis was used to evaluate paired prediction disagreement on identical test samples, complementing the macro-averaged performance metrics.

Table 6 presents the paired McNemar test results comparing the proposed HDL-PlantNet model with the benchmark architectures on the Cyprus Seasonal Flora Image Dataset. The significance decisions were based on Holm–Bonferroni-adjusted p-values.

The findings indicate that the HDL-PlantNet showed statistically significant improvements compared to VGG16 (p = 0.0099), ResNet50 (p = 1.4 × 10⁻⁵), ConvNeXt-Tiny (p = 0.0094), and MaxViT (p = 1.3 × 10⁻⁵). Statistical results suggest that the superior classification performance of the proposed HDL-PlantNet model against these models is unlikely to be due to random variation.

However, the adjusted p-value for the comparison between HDL-PlantNet and EfficientNetV2-S is slightly above the 0.05 threshold. Since EfficientNetV2-S also serves as the backbone of the proposed model, this result suggests that the Transformer enhancement provided measurable but modest gains.

Overall, the statistical analysis supports the effectiveness of HDL-PlantNet compared to VGG16, ResNet50, ConvNeXt-Tiny, and MaxViT. HDL-PlantNet achieved a practically meaningful improvement over EfficientNetV2-S in the Macro-F1 score, increasing from 87.71% to 90.06%, but this improvement did not reach statistical significance at the conventional 0.05 threshold.

4.2. Results on Swedish Leaf Dataset

The experiment on the SLD was performed as a supplementary benchmark evaluation rather than an indicator of real-world generalization of the models.

The Swedish Leaf Dataset is used to test the proposed model to determine its effectiveness across different datasets, and a similar comparative evaluation is performed across all the considered deep learning models.

Given the low intra-class variation, uniform pattern representation, and clean backgrounds of leaf images in the Swedish Leaf Dataset, all models achieved full or similar recognition performance. However, the VGG16 model obtained slightly lower results with 95.56% Macro-Recall and 95.16% Macro-F1 Score.

Even though the EfficientNetV2-S model failed to recognize a few samples, it achieved >99% performance across all metrics. The other models, ResNet50, ConvNeXt-Tiny, and the proposed HDL-PlantNet, achieved 100% performance.

The Swedish Leaf Dataset results demonstrate that the considered models and the proposed architecture maintain stable performance under controlled imaging conditions. Table 7 presents the obtained results on the Swedish Leaf Dataset in detail.

4.3. Ablation Study

An ablation study was conducted to evaluate the contributions of the Transformer module and its variants in HDL-PlantNet on the Cyprus Seasonal Flora Image Dataset. Four model variants were tested: the EfficientNetV2-S backbone alone (A0_NO_TX), HDL-PlantNet with a single transformer block using two attention heads (A1_H2), HDL-PlantNet with a single transformer block using eight attention heads (A1_H8), and HDL-PlantNet with two transformer blocks, each with four attention heads (A3_2B). All other components, including the EfficientNetV2-S backbone, classification head, and training schedule, were kept identical across experiments. Table 8 presents the details and results of the ablation study.
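The four ablation variants can be written down as a small configuration table, which makes the block and head counts explicit. The dataclass and helper names below are hypothetical, and the embedding width of 1280 is an assumption based on the usual EfficientNetV2-S feature dimension rather than a value given in this passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerVariant:
    name: str
    num_blocks: int  # number of stacked Transformer encoder blocks
    num_heads: int   # attention heads per block (0 when no block is used)


# Variant names and head/block counts follow the ablation description.
ABLATION_VARIANTS = [
    TransformerVariant("A0_NO_TX", num_blocks=0, num_heads=0),
    TransformerVariant("A1_H2", num_blocks=1, num_heads=2),
    TransformerVariant("A1_H8", num_blocks=1, num_heads=8),
    TransformerVariant("A3_2B", num_blocks=2, num_heads=4),
]


def heads_divide_width(variant, embed_dim=1280):
    """Multi-head attention needs embed_dim to be divisible by the head count."""
    return variant.num_heads == 0 or embed_dim % variant.num_heads == 0
```

Keeping the backbone, classification head, and training schedule fixed while iterating over such a list is what isolates the effect of the Transformer configuration.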

The ablation results indicate that the transformer configuration influences the balance between Macro-Recall and Macro-Precision. The proposed HDL-PlantNet architecture achieved the highest overall Macro-F1 score (90.06%), representing a modest but practically meaningful improvement over the EfficientNetV2-S backbone (87.71%).

Increasing the number of attention heads or transformer blocks did not consistently improve performance. Although the variant with two transformer blocks and four attention heads achieved relatively higher Macro-Precision (99.07%), its Macro-Recall decreased substantially (Macro-Recall = 63.64%), indicating reduced effectiveness at identifying positive samples. This reduction might be attributed to over-parameterization with the limited dataset size, which was insufficient to effectively optimize a larger number of attention subspaces. Similarly, removing the transformer module resulted in lower Macro-Recall and Macro-F1 scores compared to the proposed HDL-PlantNet architecture.

Overall, the results demonstrate that larger transformer blocks do not contribute to the convergence, particularly on relatively small, fine-grained plant image datasets. These findings are consistent with previous studies reporting that self-attention mechanisms can improve feature representation and contextual modeling in visual recognition tasks [29]. However, transformer integration was not universally beneficial; while the single-block transformer configurations improved performance, increasing the number of attention heads or transformer blocks led to notable performance degradation.

5. Discussion

The employed hybrid deep learning model, HDL-PlantNet, demonstrated encouraging performance on two datasets. The use of EfficientNetV2-S as the backbone, combined with a Transformer and attention head module, contributed to the modeling of long-range dependencies within the images. Consequently, color, structural, and regional features could be represented more effectively [4,30]. At the same time, the reported results reflect the model’s robustness to a predefined distribution shift and should not be interpreted as universally representative of all fine-grained plant classification settings.

Instead of employing multiple heavy Transformer blocks, we adopted a compact single-block configuration with carefully selected attention heads to preserve computational feasibility while maintaining strong discriminative performance. Additionally, the ablation study demonstrated that the Transformer encoder module integrated into the EfficientNetV2-S backbone provides a measurable contribution, and the absence of this module or structural modifications negatively affects feature representation.

The ablation study further shows that the proposed contextual refinement module improves performance compared with alternative Transformer configurations and the backbone-only architecture [4,20].

HDL-PlantNet achieved a balance between classification performance and computational efficiency. Although the proposed architecture includes additional Transformer-based operations compared to EfficientNetV2-S, the increase in computational complexity was moderate (approximately +13.8% in GFLOPs). Furthermore, HDL-PlantNet required substantially fewer trainable parameters (9,191,939) than the compared architectures due to partial parameter freezing. This led to reduced optimization complexity and the preservation of pretrained higher-level representations. Despite the increased architectural complexity, the proposed model maintained real-time inference performance with competitive latency (14.86 ms/image) and FPS (67.29). Table 9 shows the comparison of the computational efficiency of the models in detail.
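The efficiency figures quoted above can be reproduced from Table 9 with a few lines of arithmetic. The sketch below copies the EfficientNetV2-S and HDL-PlantNet rows from that table; the helper names are illustrative only. It assumes that FPS is simply the reciprocal of the single-image latency, which is consistent with the reported HDL-PlantNet values.

```python
# Values copied from Table 9 (EfficientNetV2-S and HDL-PlantNet rows).
effnet = {"gflops": 5.69, "latency_ms": 11.75, "total": 20_181_331, "trainable": 20_181_331}
hdl = {"gflops": 6.48, "latency_ms": 14.86, "total": 29_369_427, "trainable": 9_191_939}


def relative_increase(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def fps_from_latency(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


gflops_increase = relative_increase(hdl["gflops"], effnet["gflops"])
frozen_parameters = hdl["total"] - hdl["trainable"]
trainable_share = hdl["trainable"] / hdl["total"] * 100.0
hdl_fps = fps_from_latency(hdl["latency_ms"])

print(f"GFLOPs increase over the backbone: {gflops_increase:.1f}%")
print(f"Frozen parameters: {frozen_parameters:,}")
print(f"Trainable share of all parameters: {trainable_share:.1f}%")
print(f"FPS implied by latency: {hdl_fps:.2f}")
```

With the rounded table values, the GFLOPs increase comes out at about 13.9%, in line with the approximately 13.8% quoted in the text, and roughly two thirds of the parameters are frozen.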

As determined during hyperparameter optimization, all experiments were performed using a batch size of 16. However, latency and FPS measurements were obtained using a batch size of 1 to evaluate single-image inference performance.

Latency values correspond to the average of 1000 forward passes after an initial warm-up phase of 100 inferences on the RTX 5090 GPU. Data-loading and preprocessing times were excluded from the measurements.
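The measurement protocol described above (warm-up passes that are discarded, followed by an average over timed passes) can be sketched as a small, framework-independent harness. This is a minimal sketch, not the authors' benchmarking script: `measure_latency` and `fake_forward` are hypothetical names, and `fake_forward` is a CPU stand-in for a single-image forward pass.

```python
import time


def measure_latency(run_once, warmup=100, runs=1000):
    """Average latency (ms) and FPS of run_once after a warm-up phase."""
    for _ in range(warmup):
        run_once()  # warm-up passes are executed but not timed
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / runs * 1000.0
    return latency_ms, 1000.0 / latency_ms


def fake_forward():
    # Stand-in workload (assumption): replace with one batch-size-1 forward pass.
    # On a GPU the call would also need to wait for the device to finish
    # (for example torch.cuda.synchronize() in PyTorch) before the timer is read.
    return sum(i * i for i in range(1000))


latency_ms, fps = measure_latency(fake_forward)
print(f"{latency_ms:.4f} ms/image, {fps:.1f} FPS")
```

Keeping data loading and preprocessing outside `run_once` matches the statement that those costs were excluded from the measurements.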

The success and strength of the proposed model were primarily demonstrated on the Cyprus Seasonal Flora Image Dataset, which was introduced as the primary dataset and consists of intertwined plant images collected under real-life environmental conditions. The proposed model outperformed all five benchmark models across all evaluation metrics and successfully classified minority plant classes with relatively fewer images in the imbalanced dataset. These results are consistent with prior studies’ findings, such as ConvTransNet-S (with LPU + transformer), which showed large gains over standalone CNNs and ViTs in complex scenes [4]. Compared to large ensembles such as Plant-CNN-ViT [19], which achieved ~100% accuracy on lab leaf sets, our approach is more compact and explicitly tailored to noisy field images.

The proposed HDL-PlantNet demonstrated strong effectiveness in field-based plant recognition across varied natural backgrounds, lighting conditions, intertwined plants, and visually similar species.

The GradCAM++ [31,32] analysis demonstrated that the proposed model focused on biologically discriminative regions of different plant species under varying visual conditions. In Figure 6a,b, the activation map highlights leaf and flower structures of the olive tree, indicating that the model relies on seasonally characteristic textures and patterns for classification. In Figure 6c,d, the model successfully identified the olive tree from trunk-related structural features, despite the absence of visible leaves or flowers. Similarly, Figure 6e,f shows that the model concentrated on rose flower regions, whereas Figure 6g,h demonstrates that the model can also utilize leaf morphology and vein structures for rose classification even when flowers are absent from the scene.

However, it should be noted that images of lemons, oranges, and mandarins belonging to the same family were occasionally confused, with the minority classes more frequently misclassified as the lemon class, which had more training images. This finding suggests that increasing the number of samples for certain underrepresented classes would lead to a more balanced and practically meaningful real-world dataset. Figure 7 shows how the model misclassified an orange image as a lemon.

Even though the Swedish Leaf Dataset is a relatively easy benchmark on which modern deep models can achieve high performance [33], it indicates the consistency of HDL-PlantNet across controlled-environment datasets. The obtained results on this dataset are consistent with prior studies, reporting up to 99.47% accuracy on Swedish leaves using a CNN with data standardization, and an ensemble of pre-trained CNNs has reached 100% on this dataset [33].

Previous studies have reported similarly high accuracies mainly under controlled imaging conditions. Mohanty et al. [12] achieved 99.35% accuracy on the PlantVillage disease dataset, while Ferentinos [14] achieved approximately 99.5% using CNN-based models. Sekeroglu and Inan [34] obtained 97.2% accuracy on a private leaf dataset created in a laboratory environment using a shallow multilayer perceptron.

However, these datasets largely consisted of clean and isolated leaf images captured under favorable conditions. In contrast, the Cyprus Seasonal Flora Image Dataset presented a more realistic real-world scenario. Therefore, the strong results obtained by HDL-PlantNet suggest that modern hybrid architectures can substantially narrow the long-standing performance gap between laboratory-based and field-based plant image classification. Table 10 summarizes the plant and disease classification studies for different datasets.

Although the proposed HDL-PlantNet model achieved promising results, the study has several limitations. First, the imbalanced class distribution of plant images in CSFID might bias the model toward dominant classes. Second, even though the proposed model was trained on two different datasets with distinct characteristics, external validation is required before real-world deployment. Third, visually similar species from the same botanical family remained challenging to distinguish and require further investigation. Future work will address these issues by providing larger multi-region datasets and balanced sampling strategies. Additionally, the proposed HDL-PlantNet is primarily a task-oriented adaptation of existing CNN–Transformer concepts and does not propose a fundamentally new architectural approach. Its primary contribution is to provide measurable benefits for plant recognition under challenging domain-shift conditions. Finally, using a single fixed random seed to ensure reproducibility might limit the generalizability of the results, and additional repetitions with multiple random seeds could provide a more comprehensive assessment.
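As an illustration of what a balanced sampling or weighting strategy could look like, the sketch below computes inverse-frequency class weights for the three citrus classes, using the training counts from Table 1. This is one common heuristic (the same formula as the "balanced" class-weight option in scikit-learn) and is an assumption for illustration; it is not a method used in this study. Restricting the computation to three classes is also only for brevity.

```python
# Training image counts for the citrus classes, copied from Table 1.
train_counts = {"Lemon": 242, "Mandarin": 62, "Orange": 57}


def inverse_frequency_weights(counts):
    """Weight each class by total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}


weights = inverse_frequency_weights(train_counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Under this scheme the Orange and Mandarin classes receive roughly four times the weight of Lemon, which would counteract the tendency to assign minority citrus images to the lemon class. The weights could be applied either in the loss function or as sampling probabilities.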

6. Conclusions

In this study, we proposed a hybrid CNN–Transformer architecture, Hybrid Deep Learning PlantNet (HDL-PlantNet), designed for fine-grained plant species classification. Two datasets were used in the study. A primary dataset, the Cyprus Seasonal Flora Image Dataset, was used to incorporate real-world distortions and challenges, while the Swedish Leaf Dataset was used to assess the proposed model’s consistency under controlled conditions.

Comprehensive and comparative experiments were conducted, and the proposed HDL-PlantNet model matched or outperformed the five benchmark models, achieving Macro-F1 scores of 90.06% and 100% on the Cyprus Seasonal Flora Image Dataset and the Swedish Leaf Dataset, respectively. On the Cyprus Seasonal Flora Image Dataset, these results moderately surpass the best-performing baseline (EfficientNetV2-S) and show that integrating an EfficientNetV2-S backbone with a Transformer encoder, which combines local and global feature representations within a unified framework, yields measurable gains for plant species classification under real-world field conditions.

The ablation study confirmed the role of the Transformer module, as its removal reduced Macro-Recall and Macro-F1 by up to approximately 3 percentage points. This indicates that the selected hybrid configuration has the potential to provide more balanced performance than the backbone model alone. By demonstrating that a CNN–Transformer-based hybrid architecture can achieve promising performance on a real-world, field-collected dataset, the results of this study might help improve plant species recognition. Overall, the proposed approach and released dataset support real-time plant identification in complex real-life environments and informed ecological decision-making for farmers, ecologists, and the general public.

Our future work will include extending the species list in the Cyprus Seasonal Flora Image Dataset, evaluating the proposed model across field-domain datasets, and developing a mobile application for real-world use.

Author Contributions

Conceptualization, D.C.M., B.S. and N.C.; methodology, D.C.M.; software, D.C.M. and B.S.; validation, D.C.M., B.S. and N.C.; formal analysis, B.S. and N.C.; investigation, B.S.; resources, B.S.; data curation, B.S.; writing—original draft preparation, D.C.M., B.S. and N.C.; writing—review and editing, D.C.M., B.S. and N.C.; visualization, D.C.M., B.S. and N.C.; supervision, B.S. and N.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Cyprus Seasonal Flora Image Dataset is available at https://data.mendeley.com/datasets/dfy8grjkss/1 (accessed on 24 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. United Nations; USDA. United Nations Declares 2020 as the International Year of Plant Health; USDA: Washington, DC, USA, 2020. Available online: https://www.usda.gov/about-usda/news/press-releases/2020/01/27/united-nations-declares-2020-international-year-plant-health (accessed on 24 January 2026).
2. Tan, J.-W.; Chang, S.-W.; Abdul-Kareem, S.; Yap, H.J.; Yong, K.-T. Deep Learning for Plant Species Classification Using Leaf Vein Morphometric. IEEE ACM Trans. Comput. Biol. Bioinform. 2020, 17, 82–90.
3. Thakur, P.; Chaturvedi, S.; Khanna, P.; Sheorey, T.; Ojha, A. Vision transformer meets convolutional neural network for plant disease classification. Ecol. Inform. 2023, 77, 102245.
4. Jia, S.; Wang, G.; Li, H.; Liu, Y.; Shi, L.; Yang, S. ConvTransNet-S: A CNN-Transformer Hybrid Disease Recognition Model for Complex Field Environments. Plants 2025, 14, 2252.
5. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 200.
6. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. MaxViT: Multi-Axis Vision Transformer. arXiv 2022, arXiv:2204.01697.
7. Sekeroglu, B. The Cyprus Seasonal Flora Image Dataset (CSFID); Mendeley Data: New York, NY, USA, 2025; Version 1.
8. Söderkvist, O.J.O. Computer Vision Classification of Leaves from Swedish Trees. Master’s Thesis, Linköping University, Linköping, Sweden, 2001. Available online: https://www.cvl.isy.liu.se/en/research/datasets/swedish-leaf/ (accessed on 17 October 2025).
9. Lee, S.H.; Chan, C.S.; Wilkin, P.; Remagnino, P. Deep-plant: Plant identification with convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 452–456.
10. Dyrmann, M.; Karstoft, H.; Midtiby, H.S. Plant species classification using deep convolutional neural network. Biosyst. Eng. 2016, 151, 72–80.
11. Peteinatos, G.G.; Reichel, P.; Karouta, J.; Andújar, D.; Gerhards, R. Weed Identification in Maize, Sunflower, and Potatoes with the Aid of Convolutional Neural Networks. Remote Sens. 2020, 12, 4185.
12. Mohanty, S.P.; Hughes, D.P.; Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 2016, 7, 1419.
13. Carranza-Rojas, J.; Goeau, H.; Bonnet, P.; Mata-Montero, E.; Joly, A. Going deeper in the automated identification of Herbarium specimens. BMC Evol. Biol. 2017, 17, 181.
14. Ferentinos, K. Deep learning models for plant disease detection and diagnosis. Comput. Electron. Agric. 2018, 145, 311–318.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
16. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110.
17. Hasan, H.; Garcia, M.A.; Rashwan, H.; Puig, D. CoHAtNet: An integrated convolutional-transformer architecture with hybrid self-attention for end-to-end camera localization. Image Vis. Comput. 2025, 162, 105674.
18. Xu, C.; Yang, T. A Symmetric Multi-Scale Convolutional Transformer Network for Plant Disease Image Classification. Symmetry 2025, 17, 1232.
19. Lee, C.P.; Lim, K.M.; Song, Y.X.; Alqahtani, A. Plant-CNN-ViT: Plant Classification with Ensemble of Convolutional Neural Networks and Vision Transformer. Plants 2023, 12, 2642.
20. Zhang, T.; Huang, S.; Kezierbieke, G.; Halimu, Y.; Li, H. FewMedical-XJAU: A Challenging Benchmark for Fine-Grained Medicinal Plant Classification. Sensors 2025, 25, 5499.
21. Rodriguez-Vazquez, J.; Fernandez-Cortizas, M.; Perez-Saura, D.; Molina, M.; Campoy, P. Overcoming Domain Shift in Neural Networks for Accurate Plant Counting in Aerial Images. Remote Sens. 2023, 15, 1700.
22. Malik, O.A.; Ismail, N.; Hussein, B.R.; Yahya, U. Automated Real-Time Identification of Medicinal Plants Species in Natural Environment Using Deep Learning Models—A Case Study from Borneo Region. Plants 2022, 11, 1952.
23. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
25. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021.
26. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019.
27. Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986.
28. Kavzoglu, T. Object-Oriented Random Forest for High Resolution Land Cover Mapping Using Quickbird-2 Imagery. In Handbook of Neural Computation; Academic Press: Cambridge, MA, USA, 2017; pp. 607–619.
29. Borhani, Y.; Khoramdel, J.; Najafi, E. A deep learning based approach for automated plant disease classification using vision transformer. Sci. Rep. 2022, 12, 11554.
30. E. M., S.; Chandy, D.A.; P. M., S.; Poulose, A. A Hybrid Deep Learning Model for Aromatic and Medicinal Plant Species Classification Using a Curated Leaf Image Dataset. AgriEngineering 2025, 7, 243.
31. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.
32. Chattopadhyay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V. Grad-CAM++: Generalized Gradient-based Visual Explanations for Deep Convolutional Networks. arXiv 2017, arXiv:1710.11063.
33. Li, G.; Zhang, R.; Qi, D.; Ni, H. Plant-Leaf Recognition Based on Sample Standardization and Transfer Learning. Appl. Sci. 2024, 14, 8122.
34. Sekeroglu, B.; Inan, Y. Leaves Recognition System Using a Neural Network. Procedia Comput. Sci. 2016, 102, 578–582.

Figure 1. Example train (first row) and test (second row) plant images of Cyprus Seasonal Flora Image Dataset, (a) Aloe Vera, (b) Kalanchoe, and (c) Iris.

Figure 2. Example train (first row) and test (second row) plant images of Cyprus Seasonal Flora Image Dataset, (a) Rose, (b) Rosemary, and (c) Basil.

Figure 3. Example leaf images of Swedish Leaf Dataset (SLD).

Figure 4. Basic block diagram of the proposed HDL-PlantNet model.

Figure 5. Confusion matrices of best-performing models: (a) The proposed HDL-PlantNet. (b) Backbone EfficientNetV2 Small.

Figure 6. GradCAM++ visualization results for HDL-PlantNet. (a,c,e,g) Original test images. (b,d,f,h) Corresponding GradCAM++ activation maps. The highlighted regions indicate the discriminative visual features used by the model during classification, including flowers, leaves, trunk structures, and local texture patterns under different seasonal and scene complexities.

Figure 7. GradCAM++ Visualizations, (a) Orange image and (b) GradCAM++ Visualization of misclassified image.

Table 1. The Cyprus Seasonal Flora Image Dataset distribution according to plant species.

Plant Species	Class	Train (n)	Test (n)	Total (n)
Aloe Vera	P1	112	28	140
Arabian Jasmine	P2	104	26	130
Basil	P3	56	14	70
Cape marguerite	P4	136	34	170
Crown Daisy	P5	44	11	55
Cycas	P6	154	39	193
Cypress	P7	114	28	142
Fig	P8	62	16	78
Geranium	P9	180	45	225
Grapevine	P10	62	15	77
Iris	P11	127	32	159
Jasmine	P12	42	11	53
Kalanchoe	P13	69	17	86
Lemon	P14	242	61	303
Loquat	P15	202	50	252
Magnolia	P16	133	33	166
Mandarin	P17	62	16	78
Nerium oleander	P18	75	19	94
Nettle	P19	42	11	53
Night Blooming Jasmine	P20	167	42	209
Olive	P21	166	41	207
Orange	P22	57	14	71
Pine	P23	29	7	36
Polygala myrtifolia	P24	45	11	56
Rose	P25	365	91	456
Rosemary	P26	80	20	100
Yellow Jasmine	P27	145	36	181

Table 2. Layer-by-layer composition of the proposed HDL-PlantNet model.

Layer (Type)	Output Shape	Parameters	Description
Input Image	$224 \times 224 \times 3$	0	Input plant image (RGB)
EfficientNetV2-S Backbone	$7 \times 7 \times 1280$	≈22 million	Pretrained CNN feature extractor (stack of Conv + Fused-MBConv/MBConv blocks)
Transformer Encoder	$7 \times 7 \times 1280$	≈9 million	4-head self-attention + feed-forward block over $7 \times 7$ spatial tokens (includes LayerNorm and dropout)
Global Average Pooling	$1 \times 1 \times 1280$	0	Averages spatial features into a 1280-D global descriptor
Flatten	1280	0	Flattens the 1280-D pooled feature for classification
Fully Connected (Dense)	K	$1280 \times K + K$	Output layer for K plant classes (single linear layer producing logits)
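The layer composition in Table 2 can be followed as a sequence of tensor shapes. The sketch below is a shape walk-through only, not the authors' implementation: the function name is hypothetical, and it assumes the standard multi-head attention convention in which the 1280 channels are split evenly across the four heads and the 7 × 7 grid is flattened into 49 tokens.

```python
def hdl_plantnet_shapes(num_classes, height=7, width=7, channels=1280, heads=4):
    """Trace the output shapes listed in Table 2 for a single input image."""
    if channels % heads != 0:
        raise ValueError("channels must be divisible by the number of heads")
    tokens = height * width  # each spatial position becomes one token
    return {
        "backbone_output": (height, width, channels),
        "token_sequence": (tokens, channels),
        "per_head_dim": channels // heads,           # assumed even split across heads
        "attention_scores_per_head": (tokens, tokens),
        "encoder_output": (height, width, channels),
        "pooled": (channels,),                       # global average pooling
        "logits": (num_classes,),
        "dense_parameters": channels * num_classes + num_classes,
    }


shapes = hdl_plantnet_shapes(num_classes=27)  # 27 species in CSFID
for name, value in shapes.items():
    print(f"{name}: {value}")
```

For the 27 CSFID classes, the final dense layer in this sketch has 1280 × 27 + 27 = 34,587 parameters, following the formula in the last row of the table.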

Table 3. Hyperparameter optimization details for the models.

Hyperparameter	Range	Final Value
Learning Rate	$1 \times 10^{- 3}$ , $1 \times 10^{- 4}$ , $1 \times 10^{- 5}$	$1 \times 10^{- 4}$
Batch Size	4, 8, 16, 32, 64	16
Dropout	0.2, 0.3, 0.5	0.2
Epochs	5–25	10
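The search space in Table 3 can be written down as a small configuration. The sketch below enumerates the candidate combinations for the three discrete hyperparameters; treating the search as an exhaustive grid is an assumption made for illustration, since the table does not say how the ranges were explored, and the epoch range (5–25) is left out of the enumeration for that reason.

```python
from itertools import product

# Candidate values copied from Table 3.
search_space = {
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "batch_size": [4, 8, 16, 32, 64],
    "dropout": [0.2, 0.3, 0.5],
}

# Final values reported in Table 3.
final_config = {"learning_rate": 1e-4, "batch_size": 16, "dropout": 0.2, "epochs": 10}

# Every combination of the discrete candidates (hypothetical exhaustive grid).
grid = [dict(zip(search_space, values)) for values in product(*search_space.values())]

print(f"{len(grid)} candidate configurations")
chosen = {key: final_config[key] for key in search_space}
print("final configuration is in the grid:", chosen in grid)
```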

Table 4. Macro-averaged comparative performances on Cyprus Seasonal Flora Image Dataset.

Model	Accuracy (%)	Macro-Precision (%)	Macro-Recall (%)	Macro-F1 Score (%)	Macro-AUC (%)
VGG16	98.99	78.31	82.75	83.52	94.02
ResNet50	46.99	42.16	39.69	38.62	39.13
ConvNext-Tiny	99.00	85.60	84.63	84.16	83.30
EfficientNetV2-S	99.27	90.84	86.71	87.71	99.57
MaxViT	96.54	56.02	42.31	42.12	49.50
HDL-PlantNet	99.32	91.70	89.38	90.06	99.59

Bold values indicate the highest results.

Table 5. Class-based accuracy (%) results of the proposed HDL-PlantNet and backbone EfficientNetV2 Small on Cyprus Seasonal Flora Image Dataset.

Plant	Test Samples	HDL-PlantNet	EfficientNetV2 S
Aloe Vera	28	96.43	96.42
Arabian Jasmine	26	69.23	80.76
Basil	14	92.86	78.57
Cape Marguerite	34	100	100
Crown Daisy	11	100	81.82
Cycas	39	100	100
Cypress	28	92.86	92.86
Fig	16	87.5	100
Geranium	45	93.33	95.55
Grapevine	15	73.33	100
Iris	32	100	100
Jasmine	11	100	63.64
Kalanchoe	17	94.12	82.35
Lemon	61	100	95.08
Loquat	50	80	88
Magnolia	33	84.85	90.90
Mandarin	16	56.25	50.00
Nerium oleander	19	84.21	63.16
Nettle	11	100	100
Night Blooming Jasmine	42	85.71	95.23
Olive	41	100	95.12
Orange	14	57.14	28.57
Pine	7	100	100
Polygala myrtifolia	11	90.91	90.91
Rose	91	92.31	92.31
Rosemary	20	85	80.00
Yellow Jasmine	36	97.22	100
Macro Averaged	768 (total)	89.38	86.71

Bold values indicate the highest results.
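The "Macro Averaged" row of Table 5 can be checked directly from the per-class values. The sketch below copies the HDL-PlantNet column, averages it, and recovers the number of correctly classified test images per class. The last two quantities are inferences from the table rather than figures stated by the authors: the pooled top-1 accuracy implied by the per-class counts, and a one-vs-rest accuracy averaged over the 27 classes, which is one reading that would be consistent with the Accuracy value in Table 4.

```python
# (test samples, HDL-PlantNet class accuracy in %) copied from Table 5, in table order.
table5 = [
    (28, 96.43), (26, 69.23), (14, 92.86), (34, 100), (11, 100), (39, 100),
    (28, 92.86), (16, 87.5), (45, 93.33), (15, 73.33), (32, 100), (11, 100),
    (17, 94.12), (61, 100), (50, 80), (33, 84.85), (16, 56.25), (19, 84.21),
    (11, 100), (42, 85.71), (41, 100), (14, 57.14), (7, 100), (11, 90.91),
    (91, 92.31), (20, 85), (36, 97.22),
]

num_classes = len(table5)
total = sum(n for n, _ in table5)

# Unweighted mean of the per-class values: every class counts equally.
macro_recall = sum(acc for _, acc in table5) / num_classes

# Correct images per class, recovered by rounding n * accuracy.
correct = sum(round(n * acc / 100) for n, acc in table5)
errors = total - correct
pooled_top1 = correct / total * 100

# Inference: each error is one false negative and one false positive across
# the one-vs-rest decisions, so it flips 2 of num_classes * total decisions.
one_vs_rest_accuracy = (1 - 2 * errors / (num_classes * total)) * 100

print(f"macro recall: {macro_recall:.2f}%")
print(f"correct images: {correct}/{total} ({pooled_top1:.2f}% pooled)")
print(f"one-vs-rest accuracy: {one_vs_rest_accuracy:.2f}%")
```

The unweighted mean reproduces the 89.38% in the table, which equals the Macro-Recall reported for HDL-PlantNet in Table 4.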

Table 6. Detailed statistical results on Cyprus Seasonal Flora Image Dataset.

Comparison	McNemar p-Value	p-Value Status	Interpretation
HDL-PlantNet vs. VGG16	0.0099	<0.05	Significant
HDL-PlantNet vs. ResNet50	$1.4 \times 10^{- 5}$	<0.05	Significant
HDL-PlantNet vs. ConvNeXt-Tiny	0.0094	<0.05	Significant
HDL-PlantNet vs. EfficientNetV2-S	0.0512	>0.05	Not significant (marginal)
HDL-PlantNet vs. MaxViT	$1.3 \times 10^{- 5}$	<0.05	Significant
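McNemar's test compares two classifiers on the same test set using only the discordant pairs: images that one model classifies correctly and the other does not. The sketch below implements the exact (binomial) two-sided variant with the standard library. Whether the p-values in Table 6 come from the exact or the chi-square variant is not stated, and the discordant counts used in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: samples model A classifies correctly and model B does not.
    c: samples model B classifies correctly and model A does not.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the models never disagree on correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, not taken from the study.
p_value = mcnemar_exact(b=20, c=8)
print(f"p = {p_value:.4f}")
```

A p-value below 0.05 would be read as a significant difference between the two models, matching the threshold used in the table.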

Table 7. Detailed results on the Swedish Leaf Dataset.

Model	Accuracy (%)	Macro-Recall (%)	Macro-Specificity (%)	Macro-F1 Score (%)	Macro-AUC (%)
VGG16	95.56	95.56	99.68	95.16	99.43
ResNet50	100.00	100.00	100.00	100.00	100.00
ConvNext-Tiny	100.00	100.00	100.00	100.00	100.00
EfficientNetV2-S	99.11	99.11	99.94	99.12	100.00
MaxViT	99.82	98.66	99.90	98.67	100.00
HDL-PlantNet	100.00	100.00	100.00	100.00	100.00

Bold values indicate the highest results.

Table 8. Detailed results of the ablation study.

Ablation	Transformer Block	Attention Head	Accuracy (%)	Macro-Recall (%)	Macro-Precision (%)	Macro-F1 Score (%)	Macro-AUC (%)
A0_NO_TX	-	-	99.27	86.71	90.84	87.71	99.57
A1_H2	1	2	97.79	84.12	94.11	89.01	98.39
A1_H8	1	8	94.40	75.56	75.55	61.26	93.51
A3_2B	2	4 + 4	99.48	63.64	99.07	77.78	98.84
HDL-PlantNet	1	4	99.32	89.38	91.70	90.06	99.59

Bold values indicate the highest results.

Table 9. Comparison of the computational complexity and inference efficiency of the proposed HDL-PlantNet and benchmark architectures.

Model	Total Parameters	Trainable Parameters	FLOPs (GFLOPs)	Latency (ms/Image)	FPS
ResNet50	23,563,355	23,563,355	8.178	5.00	199.93
MaxViT	30,421,475	30,421,475	11.12	22.29	44.84
VGG16	134,371,163	134,371,163	3.09	1.474	679.56
ConvNeXt Tiny	27,840,891	27,840,891	8.90	5.26	189.85
EfficientNetV2-S	20,181,331	20,181,331	5.69	11.75	85.08
HDL-PlantNet	29,369,427	9,191,939	6.48	14.86	67.29

Total parameter count reflects architectural complexity. Trainable parameter count depends on training strategy. Since HDL-PlantNet employed partial parameter freezing, trainable parameter counts should not be interpreted as a direct architectural comparison with fully trainable baseline models.

Table 10. Plant and disease classification studies for different datasets.

Study	Dataset	Dataset Property	Approach	Accuracy
Mohanty et al. [12]	PlantVillage	Controlled cond.	Deep CNN models	99.35%
Ferentinos [14]	PlantVillage (augmented version)	Controlled cond.	Deep CNN models	99.50%
Sekeroglu and Inan [34]	Private leaf dataset	Controlled cond.	MLP	97.20%
Jia et al. [4]	PlantVillage	Controlled cond.	CNN–Transformer Hybrid (ConvTransNet-S)	98.85%
Lee et al. [19]	Swedish, Folio, Flavia, MalayaKew	Controlled/Semi-controlled	Plant-CNN-ViT	100–99.83%
Malik et al. [22]	PlantCLEF 2015	Uncontrolled cond.	EfficientNet-B1-based deep learning model	87–84%
This study	SLD	Controlled cond.	Custom CNN-based model	100%
This study	CSFID	Real-world uncontrolled	Custom CNN-based model	99.32%

