1. Introduction
Tomatoes are among the most widely cultivated crops worldwide and play a critical role in global food production. However, tomato plants remain highly susceptible to numerous fungal, bacterial, and viral diseases, including early blight, late blight, Septoria leaf spot, and mosaic virus. These diseases typically appear as discoloration, lesions, or necrotic patterns on leaves and can substantially reduce crop yield and fruit quality when not detected at early stages. Consequently, accurate and timely tomato leaf disease identification is essential for sustainable agriculture, food security, and reducing excessive pesticide usage.
Recent advances in deep learning and computer vision have significantly improved automated plant disease diagnosis. Convolutional Neural Networks (CNNs) have demonstrated strong capabilities in extracting local spatial and texture features from leaf images, while Vision Transformers (ViTs) have achieved remarkable performance in modeling long-range contextual dependencies through self-attention mechanisms. Despite their strong predictive performance, most state-of-the-art CNN and Transformer-based architectures remain computationally expensive, requiring millions of parameters and substantial floating-point operations (FLOPs). Such computational demands restrict their deployment on resource-constrained edge devices commonly used in smart agriculture environments, including embedded platforms, mobile devices, and IoT-enabled monitoring systems. Therefore, developing lightweight and adaptive deep learning frameworks capable of balancing classification accuracy and computational efficiency has become a critical research challenge for edge-oriented agricultural applications.
To address these limitations, we propose EcoTomHybridNet, a lightweight policy-guided adaptive inference framework for tomato leaf disease classification under edge-computing constraints. The architecture combines a compact convolutional backbone with a dual-branch inference design: a lightweight convolutional branch for efficient prediction and a Transformer-enhanced branch that uses local self-attention for richer contextual representation learning. Unlike conventional lightweight hybrid architectures that rely on static inference pipelines, EcoTomHybridNet introduces and experimentally validates a policy-guided routing mechanism that allocates each sample to the lightweight or the Transformer-enhanced branch according to input complexity. This strategy avoids unnecessary Transformer computations for simpler samples while preserving strong predictive performance on more challenging inputs.
To further enhance representation capability without significantly increasing computational complexity, EcoTomHybridNet employs knowledge distillation from a ViT-Tiny teacher model. Through this teacher–student learning framework, the lightweight student network inherits discriminative knowledge from the larger teacher architecture while maintaining a compact design suitable for edge deployment. In contrast to existing lightweight agricultural classification approaches primarily relying on static architectural optimization, the proposed framework experimentally demonstrates adaptive resource-aware routing capable of reducing computational cost while preserving competitive classification accuracy.
Beyond architectural optimization, EcoTomHybridNet is integrated within an IoT-enabled edge–cloud smart agriculture framework designed for practical intelligent farming scenarios. This system-level perspective highlights the applicability of the proposed framework for real-world agricultural monitoring environments involving embedded devices, wireless sensor networks, intelligent edge computing infrastructures, and resource-aware disease diagnosis pipelines.
The main contributions of this work are summarized as follows:
We introduce a lightweight policy-guided adaptive inference framework for tomato leaf disease classification, enabling dynamic computational allocation between lightweight convolutional and Transformer-enhanced branches according to input complexity.
We experimentally validate a lightweight adaptive routing mechanism capable of reducing computational overhead through input-dependent branch selection while preserving competitive classification performance.
We design a compact CNN–Transformer hybrid architecture optimized for edge-oriented agricultural applications by combining lightweight convolutional feature extraction with local self-attention modeling.
We employ knowledge distillation from a ViT-Tiny teacher model to improve the discriminative capability and generalization performance of the lightweight student network while maintaining low computational complexity suitable for embedded deployment.
We integrate the proposed framework within an IoT-enabled edge–cloud smart agriculture architecture targeting practical intelligent farming applications and resource-constrained deployment environments.
The remainder of this paper is organized as follows.
Section 2 reviews recent advances in lightweight CNN–Transformer architectures, adaptive inference strategies, and knowledge distillation methods for plant disease classification.
Section 3 presents the proposed EcoTomHybridNet architecture and the associated training methodology.
Section 4 describes the dataset preparation, experimental setup, and implementation details.
Section 5 reports quantitative and qualitative experimental results, including ablation studies, robustness analysis, and adaptive routing evaluations.
Section 6 discusses computational efficiency and comparisons with related lightweight approaches. The following sections further address deployment considerations, IoT integration, and edge–cloud smart agriculture applications. Finally, the last section concludes the paper and outlines future research directions.
2. Related Works
2.1. Hybrid CNN–Transformer Models
Recent advances in computer vision have promoted the development of lightweight hybrid architectures that combine convolutional operations with self-attention mechanisms in order to jointly improve representation capability and computational efficiency. In agricultural disease classification, lightweight attention-based frameworks have demonstrated promising results for balancing predictive performance and resource efficiency. Zhang et al. proposed a lightweight dual-attention network for tomato leaf disease identification that improves discriminative feature extraction while maintaining relatively low computational complexity [1].
Transformer-based contextual learning has also been explored for tomato leaf disease recognition. Karimanzira proposed a Vision Transformer framework integrating cascaded group attention and Focaler–CIoU optimization to improve contextual representation learning and address class imbalance [2]. Their ViT–CGA architecture achieved strong classification performance while emphasizing explainability through explainable artificial intelligence techniques.
Among real-time plant disease detection approaches, YOLO-based architectures have become widely adopted because of their strong object localization capabilities. Abudukelimu et al. introduced DM–YOLO, an enhanced YOLOv9 framework integrating dynamic modules to improve small-lesion detection performance [3]. Similarly, Wang et al. proposed BED–YOLO, an optimized YOLOv10n-based detector incorporating deformable convolutions and bidirectional feature pyramids to improve tomato leaf disease detection accuracy [4].
Several additional lightweight and hybrid architectures have also been proposed for agricultural disease recognition. Sowmya and Guruprasad combined Inception-v4 features with YOLOv8 to achieve real-time disease detection performance [5]. Compact Convolutional Transformers (CCT) demonstrated that lightweight Transformer-based architectures can achieve strong classification accuracy with moderate computational requirements [6]. MobileNetV3-based frameworks such as FL–ToLeD emphasized low-resource deployment through compact convolutional designs [7]. Similarly, EGWT integrated EfficientNet backbones with group-wise Transformers to improve edge-oriented efficiency [8], while MSCPNet employed multi-scale convolutional pooling to achieve strong classification performance with relatively few parameters [9].
Recent years have also witnessed the emergence of lightweight agricultural frameworks targeting classification, segmentation, and detection tasks. Sharma et al. proposed an ensemble framework combining ResNet50 and MobileNetV2 features to improve tomato disease classification accuracy [10]. Hybrid-DSCNN integrated U-Net and SegNet architectures to enhance inference speed and predictive performance [11]. DS-DETR introduced Transformer-based segmentation for disease grading [12], while optimized YOLOv8n variants focused on embedded real-time detection [13]. Modified YOLOv8-Seg architectures incorporating Ghost modules and BiFPN structures further improved segmentation efficiency across multiple datasets [14]. Additional studies explored transfer learning, feature selection, and ensemble optimization strategies to improve robustness under limited-data conditions [15].
Collectively, these studies demonstrate increasing interest in balancing predictive performance and computational efficiency in agricultural deep learning systems. Nevertheless, most existing frameworks continue to rely on static inference mechanisms in which identical computational pathways are applied to all samples independently of prediction difficulty or image complexity.
2.2. Knowledge Distillation in Agriculture
Knowledge distillation has become an effective strategy for designing lightweight agricultural deep learning models. Multi-Task Distillation Learning (MTDL) combined disease classification and severity estimation through staged teacher–student optimization, achieving strong predictive performance with reduced parameter counts [16]. Similarly, the Multi-Objective Hybrid Knowledge Distillation (MOHKD) framework employed multiple distillation objectives to train efficient student networks under resource-constrained conditions [17].
Additional teacher–student learning approaches, including PDLM–TK [18], ensemble self-distillation for ShuffleNetV2 [19], and dynamic temperature distillation strategies [20,21], further demonstrated the effectiveness of distillation techniques for improving lightweight plant disease classification.
Cross-domain robustness studies have shown that models trained on controlled datasets such as PlantVillage often experience substantial performance degradation when evaluated under real-world agricultural conditions characterized by illumination variability, occlusions, environmental noise, and complex backgrounds [22].
In the broader computer vision literature, several influential lightweight Transformer-based architectures have inspired efficient agricultural models. Swin Transformer employs hierarchical shifted-window attention to reduce computational complexity while preserving strong visual representation learning capabilities [23]. MobileViT integrates lightweight Transformer modules into convolutional pipelines to construct mobile-friendly architectures with competitive accuracy and reduced parameter counts [24]. EfficientFormer further improves inference efficiency by designing Transformer-based architectures capable of operating at MobileNet-level latency while maintaining strong classification performance [25].
In parallel, advanced knowledge distillation strategies such as Decoupled Knowledge Distillation (DKD) [26], Contrastive Representation Distillation (CRD) [27], and Teacher Assistant Knowledge Distillation (TAKD) [28] demonstrated that compact student networks can preserve strong predictive capability while substantially reducing computational complexity.
2.3. Limitations of Existing Works
Although existing lightweight and hybrid architectures achieve favorable trade-offs between accuracy and efficiency, several important limitations remain unresolved. First, most agricultural disease classification frameworks continue to rely on static inference pipelines in which all samples are processed using identical computational pathways regardless of image complexity or prediction difficulty. Consequently, computational resources are uniformly allocated even for relatively simple samples that may not require expensive Transformer-based processing.
Second, many high-performing agricultural models rely on increasingly sophisticated hybridization strategies, heavy Transformer modules, or ensemble-based designs that substantially increase computational cost and complicate deployment on low-power edge devices commonly used in smart agriculture systems. Although high predictive performance is often achieved on controlled datasets, real-world deployment remains challenging due to limited computational resources and environmental variability.
Third, despite the growing popularity of lightweight architectures and knowledge distillation techniques, adaptive inference and dynamic computational allocation remain relatively underexplored in agricultural disease classification compared with general computer vision applications, where dynamic routing, early-exit mechanisms, and conditional computation strategies have received increasing attention. This limitation partially explains the scarcity of directly comparable adaptive agricultural baselines in the existing literature.
2.4. Motivation of the Proposed Framework
Motivated by these limitations, EcoTomHybridNet introduces a lightweight adaptive inference framework integrating CNN–Transformer learning, knowledge distillation, and policy-guided branch allocation under edge-computing constraints. Unlike conventional lightweight agricultural classifiers, the proposed framework experimentally evaluates dynamic computational allocation, regulating inference complexity according to input characteristics while preserving strong predictive performance.
More specifically, EcoTomHybridNet combines a lightweight convolutional backbone with a Transformer-enhanced branch and a lightweight policy-guided routing mechanism capable of dynamically allocating computational resources according to input complexity. This adaptive strategy aims to reduce unnecessary Transformer computations for simpler samples while maintaining strong predictive capability for more challenging inputs.
To the best of our knowledge, few existing agricultural disease classification frameworks explicitly combine lightweight CNN–Transformer learning with experimentally validated policy-guided adaptive inference capable of dynamically regulating computational allocation according to input complexity under edge-computing constraints.
3. Proposed Method: EcoTomHybridNet
EcoTomHybridNet is designed as an adaptive resource-aware framework for tomato leaf disease classification under edge-computing constraints, aiming to achieve an effective balance between high predictive performance and computational efficiency. The proposed architecture combines a lightweight convolutional backbone with a dual-branch inference design composed of a fast convolutional branch for computationally efficient prediction and a Transformer-enhanced branch integrating local self-attention for richer contextual feature extraction and improved modeling of long-range dependencies.
Unlike conventional lightweight hybrid architectures that rely on static inference pipelines, EcoTomHybridNet introduces a lightweight policy-guided routing mechanism that allocates samples between the fast convolutional branch and the Transformer-enhanced branch according to input complexity. This adaptive routing reduces unnecessary computational overhead for simpler samples while preserving strong predictive performance on more challenging inputs. During the supervised optimization stage, the hybrid Transformer branch is first employed as the primary inference pathway to stabilize training and ensure reproducible optimization behavior. The policy-guided routing mechanism is then activated and evaluated under adaptive inference settings to analyze its computational allocation behavior and decision-making efficiency.
To further enhance representation capability without substantially increasing computational complexity, EcoTomHybridNet is trained using knowledge distillation from a larger ViT-Tiny teacher model. Through this teacher–student learning framework, the lightweight student network inherits discriminative representations from the teacher model while maintaining a compact architecture suitable for resource-constrained edge deployment.
Figure 1 summarizes the overall architecture and training pipeline.
The PlantVillage tomato dataset is used to train both the ViT-Tiny teacher model and the EcoTomHybridNet student network. The student architecture integrates a shared convolutional backbone, a fast convolutional branch, a Transformer-enhanced hybrid branch, and a lightweight policy-guided routing network. During supervised optimization, the hybrid branch is initially employed to ensure stable training and reproducible evaluation. Subsequently, adaptive policy-guided inference is activated and analyzed through dedicated dynamic inference experiments to evaluate the effectiveness of the proposed routing mechanism under varying input complexity conditions.
3.1. Problem Formulation
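Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the tomato leaf dataset, where $x_i \in \mathbb{R}^{3 \times H \times W}$ is an RGB image and $y_i \in \{1, \dots, C\}$ is the disease label with $C = 10$ classes. We aim to learn a classifier
$$f_\theta : \mathbb{R}^{3 \times H \times W} \to \Delta^{C-1},$$
where $\Delta^{C-1}$ is the probability simplex over $C$ classes, such that the empirical risk
$$\hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big),$$
with per-sample loss $\ell$, is minimized under constraints on model size and FLOPs. We decompose $f_\theta$ into a backbone $B$, a fast head $h_{\mathrm{fast}}$, a hybrid Transformer head $h_{\mathrm{hyb}}$ and a policy network $\pi_\theta$:
$$F = B(x), \quad p_{\mathrm{fast}} = h_{\mathrm{fast}}(F), \quad p_{\mathrm{hyb}} = h_{\mathrm{hyb}}(F), \quad \pi = \pi_\theta(F).$$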
3.2. Convolutional Backbone and Fast Head
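The backbone consists of three convolution–batch normalization–SiLU blocks, each reducing the spatial resolution of the feature maps by a factor of two. Given an input image $x \in \mathbb{R}^{3 \times H \times W}$, we obtain feature maps
$$F_k \in \mathbb{R}^{D_k \times \frac{H}{2^k} \times \frac{W}{2^k}}, \quad k = 1, 2, 3,$$
where $D_k$ is the number of channels after block $k$. We denote $F = F_3$ and use it as the shared representation:
$$F = B(x).$$
The fast head applies global average pooling,
$$g = \mathrm{GAP}(F) \in \mathbb{R}^{D_3},$$
followed by a fully connected layer and softmax:
$$p_{\mathrm{fast}} = \mathrm{softmax}(W_{\mathrm{fast}}\, g + b_{\mathrm{fast}}),$$
where $W_{\mathrm{fast}} \in \mathbb{R}^{C \times D_3}$ and $b_{\mathrm{fast}} \in \mathbb{R}^{C}$.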
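The fast head described above (global average pooling, then a fully connected layer and softmax) can be sketched in a few lines. This is a minimal pure-Python illustration rather than the authors' PyTorch code; the function names and the nested-list feature layout `[channels][height][width]` are assumptions made for the example.

```python
import math

def global_average_pool(feature_map):
    """Average every channel of a [channels][height][width] nested list."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fast_head(feature_map, weights, bias):
    """Hypothetical fast head: pooling, one linear layer, then softmax."""
    pooled = global_average_pool(feature_map)
    logits = [sum(w * g for w, g in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

Because the pooled vector has one entry per channel, the cost of this head is independent of the spatial resolution once pooling is done, which is what makes it the cheap branch.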
3.3. Hybrid Transformer Branch
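To capture long-range dependencies, we apply a local Transformer encoder on top of $F$. We first project channels with a 1 × 1 convolution:
$$\tilde{F} = \mathrm{Conv}_{1 \times 1}(F) \in \mathbb{R}^{d \times H' \times W'}, \quad H' = H/8, \; W' = W/8.$$
We then flatten spatial dimensions into a sequence of tokens
$$X \in \mathbb{R}^{n \times d}, \quad n = H' W'.$$
In each Transformer block, we compute queries, keys and values:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V,$$
where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$. For a local window $\mathcal{W}$, the self-attention is
$$\mathrm{Attn}(Q_{\mathcal{W}}, K_{\mathcal{W}}, V_{\mathcal{W}}) = \mathrm{softmax}\!\left(\frac{Q_{\mathcal{W}} K_{\mathcal{W}}^{\top}}{\sqrt{d}}\right) V_{\mathcal{W}}.$$
We use two Transformer blocks, each followed by a feed-forward network and layer normalization. After the Transformer, we perform token average pooling:
$$\bar{x} = \frac{1}{n} \sum_{j=1}^{n} X^{(L)}_{j},$$
where $X^{(L)}$ is the output after $L = 2$ blocks. The hybrid prediction is then
$$p_{\mathrm{hyb}} = \mathrm{softmax}(W_{\mathrm{hyb}}\, \bar{x} + b_{\mathrm{hyb}}),$$
where $W_{\mathrm{hyb}} \in \mathbb{R}^{C \times d}$ and $b_{\mathrm{hyb}} \in \mathbb{R}^{C}$.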
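To make the local self-attention step concrete, the sketch below applies scaled dot-product attention independently inside non-overlapping windows of a token sequence. It is an illustration under stated assumptions, not the paper's implementation: it uses a single head, 1-D windows over the flattened tokens, and hypothetical names (`project`, `local_window_attention`).

```python
import math

def project(token, matrix):
    """Multiply a length-d token by a d x d matrix (row-vector convention)."""
    return [sum(token[i] * matrix[i][j] for i in range(len(token)))
            for j in range(len(matrix[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_window_attention(tokens, w_q, w_k, w_v, window_size):
    """Single-head attention restricted to non-overlapping token windows.

    ASSUMPTION: windows are contiguous 1-D chunks of the token sequence.
    """
    dim = len(w_q[0])
    outputs = []
    for start in range(0, len(tokens), window_size):
        window = tokens[start:start + window_size]
        queries = [project(t, w_q) for t in window]
        keys = [project(t, w_k) for t in window]
        values = [project(t, w_v) for t in window]
        for q in queries:
            # Scaled dot-product scores against keys of the same window only.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                      for k in keys]
            weights = softmax(scores)
            outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                            for c in range(dim)])
    return outputs
```

Restricting attention to a window of size w reduces the score computation from n² to roughly n·w token pairs, which is the motivation for local attention on edge hardware.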
3.4. Policy Network and Resource-Aware Inference
The policy network takes $F$ and optional contextual features $c$ as input:
$$\pi = \pi_\theta(F, c) \in [0, 1],$$
where $\pi$ denotes the probability of selecting the hybrid Transformer-enhanced branch, while $1 - \pi$ corresponds to routing the input through the lightweight convolutional branch. The policy network estimates whether an input requires the expressive Transformer-enhanced branch or can be efficiently processed using the lightweight convolutional branch. During the primary optimization stage, the hybrid branch serves as the main inference pathway to ensure stable convergence and reproducible evaluation. The effectiveness of the proposed policy-guided adaptive routing mechanism is further validated through dedicated dynamic inference experiments presented in Section 5.6.
3.5. Teacher–Student Distillation and Training Objective
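Let $p_{\mathrm{t}}(x)$ denote the teacher output (class probabilities) for an input $x$. For the student, we use two branches $p_{\mathrm{fast}}(x)$ and $p_{\mathrm{hyb}}(x)$. We denote the standard cross-entropy loss for a prediction $p$ and label $y$ as
$$\mathcal{L}_{\mathrm{CE}}(p, y) = -\log p_{y}.$$
The distillation loss uses a softened distribution with temperature $T$:
$$p^{(T)}_{c} = \frac{\exp(z_c / T)}{\sum_{j=1}^{C} \exp(z_j / T)},$$
where $z$ are the corresponding logits. The Kullback–Leibler loss between the softened teacher distribution $p^{(T)}_{\mathrm{t}}$ and a softened student distribution $p^{(T)}_{\mathrm{s}}$ is
$$\mathcal{L}_{\mathrm{KL}} = \sum_{c=1}^{C} p^{(T)}_{\mathrm{t},c} \log \frac{p^{(T)}_{\mathrm{t},c}}{p^{(T)}_{\mathrm{s},c}}.$$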
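For each sample, the total loss combines the cross-entropy terms of the student branches with the distillation term, weighted by fixed coefficients.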
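The ingredients of this objective (cross-entropy, temperature-softened softmax and Kullback–Leibler divergence) can be sketched as follows. This is a hedged illustration only: the weighted-sum form and the default values `temperature=4.0` and `alpha=0.5` are placeholders assumed for the example, not the values used in the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """ASSUMPTION: weighted sum alpha * CE + (1 - alpha) * KL.

    temperature and alpha are hypothetical placeholders, not paper values.
    """
    ce = cross_entropy(softmax(student_logits), label)
    kl = kl_divergence(softmax(teacher_logits, temperature),
                       softmax(student_logits, temperature))
    return alpha * ce + (1.0 - alpha) * kl
```

When student and teacher logits coincide, the KL term vanishes and only the supervised cross-entropy remains, which is a quick sanity check for any implementation of this loss.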
3.6. Training Pseudo-Code
Algorithm 1 summarizes the teacher training procedure, and Algorithm 2 details the student training with distillation.
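Algorithm 1. Training the ViT-Tiny teacher
Input: training set $\mathcal{D}$, epochs $E$
Initialize teacher parameters $\phi$
for $e = 1$ to $E$ do
    for each mini-batch $(x, y) \subset \mathcal{D}$ do
        $p_{\mathrm{t}} \leftarrow T_\phi(x)$ {ViT-Tiny forward}
        $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{CE}}(p_{\mathrm{t}}, y)$
        update $\phi$ using $\nabla_\phi \mathcal{L}$
    end for
end for
Output: trained teacher parameters $\phi$

Algorithm 2. Training EcoTomHybridNet with knowledge distillation
Input: training set $\mathcal{D}$, epochs $E$, trained teacher $T_\phi$
for $e = 1$ to $E$ do
    for each mini-batch $(x, y) \subset \mathcal{D}$ do
        $p_{\mathrm{t}} \leftarrow T_\phi(x)$ {teacher probabilities}
        $(p_{\mathrm{fast}}, p_{\mathrm{hyb}}) \leftarrow S_\theta(x)$ {student branches}
        compute the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ of the student predictions against $y$
        compute the distillation loss $\mathcal{L}_{\mathrm{KL}}$ between softened teacher and student distributions
        $\mathcal{L} \leftarrow$ weighted sum of $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{KL}}$
        update $\theta$ using $\nabla_\theta \mathcal{L}$
    end for
end for
Output: trained student parameters $\theta$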
The adaptive resource-aware inference procedure implemented in this work is summarized in Algorithm 3.
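Algorithm 3. Resource-aware inference
Input: image $x$, student $S_\theta$, decision threshold $\tau$
$F \leftarrow B(x)$ {shared backbone features}
$\pi \leftarrow \pi_\theta(F)$ {probability of selecting the hybrid branch}
if $\pi < \tau$ then
    return fast prediction $p_{\mathrm{fast}}$
else
    return hybrid prediction $p_{\mathrm{hyb}}$
end if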
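The routing rule of Algorithm 3 amounts to a single threshold comparison on the policy output. A minimal sketch, assuming the policy returns the probability of choosing the hybrid branch and using a placeholder threshold of 0.5 (the paper's threshold value is not restated here); `route` and `adaptive_predict` are hypothetical names.

```python
def route(policy_prob, threshold=0.5):
    """Return the branch name for one sample.

    policy_prob is the (assumed) probability of selecting the hybrid branch;
    threshold=0.5 is a placeholder, not the value used in the paper.
    """
    return "fast" if policy_prob < threshold else "hybrid"

def adaptive_predict(features, policy_fn, fast_fn, hybrid_fn, threshold=0.5):
    """Run only the selected head on the shared backbone features."""
    branch = route(policy_fn(features), threshold)
    prediction = fast_fn(features) if branch == "fast" else hybrid_fn(features)
    return branch, prediction
```

Only one head is evaluated per sample, so the Transformer cost is paid solely for inputs the policy considers difficult; the backbone and policy costs are paid for every sample.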
3.7. Smart Agriculture IoT Pipeline
To illustrate how the proposed model can be integrated into a real-world smart farming system, Figure 2 presents an end-to-end IoT pipeline. Raw data from heterogeneous sensors (RGB cameras on drones and fixed rigs, soil moisture probes, humidity and temperature gauges, and air-quality sensors) are gathered via low-power wireless networks (BLE, ZigBee or LoRaWAN) by an ESP32-based sensor node. An edge device such as a Raspberry Pi, Jetson Nano or smartphone performs image pre-processing and runs the EcoTomHybridNet inference. Outputs are passed to an edge gateway for aggregation before being relayed to the cloud or farm management platform. A mobile application alerts farmers to detected diseases and suggests interventions, while a web dashboard enables agronomists to monitor plant health and disease prevalence. Cloud services maintain a model repository and data lake for continuous updates. Finally, actuators on the farm (irrigation pumps, sprayers, ventilation fans) can be triggered in response to model predictions and human decisions.
4. Experimental Setup
4.1. Dataset and Preprocessing
We use the PlantVillage tomato leaf disease dataset, which contains a total of 16,012 images distributed across 10 classes. The dataset is partitioned into three disjoint subsets (training, validation, and test) with a split ratio of 70:15:15, as detailed in Table 1. The partition is performed randomly with a fixed seed (42) to ensure reproducibility, and all splits are stratified to preserve the class distribution across subsets.
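A stratified 70:15:15 partition with a fixed seed can be sketched as below. This is an illustrative pure-Python version, not the authors' data pipeline; the function name and the per-class rounding rule are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split sample indices into train/val/test while preserving class ratios.

    ASSUMPTION: per-class counts are rounded, and the remainder goes to test.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(round(ratios[0] * len(indices)))
        n_val = int(round(ratios[1] * len(indices)))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test
```

Splitting class by class keeps rare disease categories represented in all three subsets, which a purely random split does not guarantee.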
During training, we apply data augmentations $\mathcal{A}$: random horizontal flip, small rotations and colour jitter. The effective training samples are $\tilde{x}_i = \mathcal{A}(x_i)$. All images are resized to 224 × 224 and normalized with the ImageNet mean and standard deviation.
To further assess the stability of the proposed model and reduce the risk of overfitting to a single train–test split, we also performed 5-fold cross-validation in addition to the fixed train/validation/test split. In each iteration, three folds were used for training, one fold for validation and one fold for testing. The reported results correspond to the average performance across all folds.
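The 3/1/1 fold rotation can be written down explicitly. The sketch below assumes that fold k is held out for testing and the next fold (cyclically) for validation; the paper does not state which fold plays which role, so this ordering is a hypothetical choice.

```python
def five_fold_assignments(num_folds=5):
    """List the train/val/test fold roles for every cross-validation iteration.

    ASSUMPTION: fold k is the test fold and fold (k + 1) mod num_folds is the
    validation fold; the remaining folds are used for training.
    """
    plans = []
    for k in range(num_folds):
        test_fold = k
        val_fold = (k + 1) % num_folds
        train_folds = [f for f in range(num_folds)
                       if f not in (test_fold, val_fold)]
        plans.append({"train": train_folds, "val": val_fold, "test": test_fold})
    return plans
```

With this rotation every fold serves as the test set exactly once, so averaging the five test scores uses each image once for evaluation.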
4.2. Evaluation Metrics
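Given a confusion matrix with entries $M_{ij}$ (true class $i$, predicted class $j$), the per-class precision, recall and F1-score are defined as
$$P_i = \frac{M_{ii}}{\sum_{k} M_{ki} + \varepsilon}, \quad R_i = \frac{M_{ii}}{\sum_{k} M_{ik} + \varepsilon}, \quad \mathrm{F1}_i = \frac{2 P_i R_i}{P_i + R_i + \varepsilon},$$
where $\varepsilon$ is a small constant to avoid division by zero. Macro-averaged metrics are obtained by averaging over classes:
$$P_{\mathrm{macro}} = \frac{1}{C} \sum_{i=1}^{C} P_i, \quad R_{\mathrm{macro}} = \frac{1}{C} \sum_{i=1}^{C} R_i, \quad \mathrm{F1}_{\mathrm{macro}} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{F1}_i.$$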
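These definitions translate directly into code. The sketch below computes per-class and macro-averaged metrics from a confusion matrix stored as a nested list with rows as true classes and columns as predicted classes; the function names and the default `eps` are illustrative assumptions.

```python
def per_class_metrics(confusion, eps=1e-12):
    """Return (precision, recall, f1) per class from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    size = len(confusion)
    metrics = []
    for i in range(size):
        true_positive = confusion[i][i]
        predicted_as_i = sum(confusion[r][i] for r in range(size))  # column sum
        actually_i = sum(confusion[i])                               # row sum
        precision = true_positive / (predicted_as_i + eps)
        recall = true_positive / (actually_i + eps)
        f1 = 2 * precision * recall / (precision + recall + eps)
        metrics.append((precision, recall, f1))
    return metrics

def macro_average(metrics):
    """Unweighted mean of precision, recall and F1 over classes."""
    count = len(metrics)
    return tuple(sum(m[k] for m in metrics) / count for k in range(3))
```

Macro averaging weights every class equally, so a weak minority class lowers the score even when overall accuracy is high.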
For efficiency, we also report the computational cost measured in floating-point operations (FLOPs), which are computed using ptflops (version 0.7.2) at resolution 224 × 224.
To further evaluate the statistical stability of the proposed model, we additionally report the mean and standard deviation over multiple experimental runs. Moreover, 95% confidence intervals are computed for the main evaluation metrics. When applicable, paired statistical comparisons are conducted against baseline models to assess whether the observed performance differences are statistically significant.
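Mean, standard deviation and a 95% confidence interval over repeated runs can be computed as sketched below. The normal-approximation factor `z = 1.96` is an assumption made for simplicity; with only a handful of runs, a Student-t quantile would give a wider and more appropriate interval.

```python
import math
import statistics

def summarize_runs(values, z=1.96):
    """Return (mean, sample std, (ci_low, ci_high)) for a list of run scores.

    ASSUMPTION: normal approximation with z = 1.96 for the 95% interval.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(len(values))
    return mean, std, (mean - half_width, mean + half_width)
```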
4.3. Implementation Details
The teacher and student models are implemented using PyTorch 2.0. The teacher network is trained for 30 epochs with a batch size of 32 and an initial learning rate of $3 \times 10^{-4}$, using the AdamW optimizer combined with a cosine annealing learning rate schedule. The student model follows the same training configuration, including the optimizer and scheduling strategy, and is also trained for 30 epochs. The computational complexity, expressed in FLOPs, is estimated directly from the training code at the specified input resolution.
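For reference, a cosine annealing schedule over 30 epochs starting at 3 × 10⁻⁴ follows the closed form sketched below. The minimum learning rate of zero and the absence of warm-up or restarts are assumptions; the paper does not specify them.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=30, base_lr=3e-4, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    ASSUMPTION: min_lr = 0.0 and a single decay cycle without warm-up.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)
```

The rate starts at the base value, reaches half of it at the midpoint (epoch 15) and decays smoothly to the minimum at the final epoch.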
To ensure reproducibility, all runs are conducted with a fixed random seed of 42, enabling deterministic behavior wherever applicable. Training is performed on an NVIDIA RTX 3060 GPU (NVIDIA, Santa Clara, CA, USA) equipped with 6 GB of memory, within a consistent software environment to avoid variability across experiments. The total training time is approximately 12 h for the student model and 15 h for the teacher model.
Furthermore, all hyperparameters, training procedures, and preprocessing steps are explicitly specified to facilitate replication of the results.
To further assess experimental stability, all experiments were repeated across multiple independent runs using different random initializations. The reported performance values correspond to the average results obtained across these runs.
5. Results and Discussion
5.1. Accuracy and Classification Performance
Table 2 compares EcoTomHybridNet with the ViT-Tiny teacher and a fast-CNN ablation on the PlantVillage test set. EcoTomHybridNet achieves a test accuracy of 99.42%, a macro F1 of 99.12% and a weighted F1 of 99.43%. The confusion matrix in Figure 3 shows that most samples are classified correctly, with minor confusion between Target spot disease and Septoria leaf spot. Classes such as Early blight and Leaf mold are classified perfectly.
Contextualizing with hybrid backbones and distillation advances. The strong performance of EcoTomHybridNet is consistent with recent trends in hybrid and distilled models. For example, the Swin Transformer backbone achieves state-of-the-art accuracy on general vision tasks while maintaining linear complexity [23], and MobileViT attains high accuracy with only about six million parameters [24]. EfficientFormer further demonstrates that Transformer-based models can match MobileNet-level speed with competitive accuracy [25]. In the distillation domain, techniques such as DKD [26] and CRD [27] have shown that decoupling or contrastive objectives can transfer richer information from teachers to students. EcoTomHybridNet increases the parameter count compared to the Fast-CNN baseline due to the addition of the Transformer branch but achieves significantly higher accuracy with a moderate computational overhead. Advanced distillation methods like TAKD [28] also highlight the importance of bridging the capacity gap, which may inspire future iterations of EcoTomHybridNet.
To ensure that the high accuracy is not due to overfitting or favorable data splits, we conducted additional cross-validation experiments. The model consistently achieved high accuracy across folds, with low variance, indicating stable learning behavior and robustness to data partitioning. In addition, the performance improvement obtained by EcoTomHybridNet over the Fast-CNN baseline remained consistent across repeated runs, suggesting that the observed gains are not due to random initialization effects.
5.2. Per-Class Metrics
Table 3 summarizes per-class precision, recall and F1-score for EcoTomHybridNet. All classes reach F1 > 98%, and several achieve perfect scores. Misclassifications are rare and mainly occur between visually similar lesion types, such as Septoria leaf spot and Target spot disease. The high macro-averaged precision indicates that performance is well balanced across classes.
5.3. Confusion Matrix Analysis
To examine class-wise behavior, Figure 3 shows the normalized confusion matrix for EcoTomHybridNet on the PlantVillage test set. Rows correspond to ground-truth classes and columns to predicted classes. Most mass lies along the diagonal, confirming that the model correctly recognizes almost all images in every class. The few off-diagonal entries mainly arise from confusion between Septoria leaf spot and Target spot disease.
Rows correspond to ground-truth classes and columns to predicted classes. Values are normalized per class. The strong diagonal indicates excellent classification performance, with minor confusion between visually similar disease categories.
5.4. Qualitative Results
Figure 4 presents one representative test image per class, along with the predicted label and confidence score from EcoTomHybridNet. Correct predictions are shown with green titles; potential misclassifications would be highlighted in red. These qualitative examples illustrate that the model can handle a variety of symptom patterns, backgrounds and illumination conditions.
One representative test image per class with EcoTomHybridNet predictions and confidence scores. Green titles denote correct predictions; red titles (if present) would indicate misclassifications.
5.5. Ablation Study
To analyze the contribution of each component in EcoTomHybridNet, we conducted a series of ablation experiments. Table 4 presents the impact of progressively incorporating the Transformer branch, knowledge distillation, and the policy network into the convolutional backbone. The baseline model consists solely of the backbone with the fast convolutional head.
The addition of the Transformer branch increases the number of parameters due to the introduction of self-attention layers. However, this increase remains moderate and results in a significant improvement in classification accuracy, highlighting the importance of capturing long-range dependencies. In addition, we explicitly compared the model trained with and without knowledge distillation. The distilled version consistently achieved slightly higher accuracy. Although the absolute gain is small, it was positive across repeated runs, indicating that the improvement is stable rather than the result of random variation.
Distillation primarily contributes to improved feature alignment, better class separation, and enhanced robustness, which are not fully captured by top-1 accuracy alone. This behavior is consistent with prior studies showing that distillation often yields marginal accuracy gains but improves model calibration and generalization.
When the policy network is included (i.e., always selecting the hybrid branch), its computational overhead remains minimal and does not affect accuracy. This ablation setting allows us to isolate the individual contribution of each module under static inference. The adaptive routing capability enabled by the policy-guided mechanism is quantitatively evaluated in Section 5.6 under dynamic inference conditions.
5.6. Dynamic Inference with the Policy-Guided Routing Mechanism
To evaluate the effectiveness of the proposed adaptive inference strategy, the lightweight policy-guided routing mechanism was enabled during evaluation and compared against a static inference configuration. While the previous ablation study analyzed the architectural contribution and computational overhead of the policy network under static routing conditions, the experiments presented in this section further investigate its ability to dynamically allocate computational resources according to input complexity. Unlike conventional static lightweight agricultural classifiers, the proposed adaptive routing mechanism dynamically regulates computational allocation according to sample complexity, thereby enabling resource-aware inference under edge deployment constraints. Although the current implementation relies on a lightweight threshold-based routing policy, the objective of this work is to experimentally demonstrate the feasibility of adaptive resource-aware inference for agricultural edge intelligence rather than proposing a highly complex gating architecture.
Two inference configurations were evaluated:
Always Hybrid: all samples are processed through the Transformer-enhanced hybrid branch, independently of the policy output.
Adaptive (Policy): samples are dynamically routed according to the policy network using a fixed decision threshold. When the predicted routing probability satisfies the threshold condition, inference is performed using the lightweight fast convolutional branch; otherwise, the Transformer-enhanced branch is selected.
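The two configurations above can be illustrated with a minimal sketch of threshold-based routing. The branch names, the threshold value of 0.5, and the direction of the comparison (a probability at or above the threshold selects the fast branch) are assumptions made for illustration only; they are not taken from the reported implementation.

```python
from dataclasses import dataclass

# Hypothetical branch identifiers used only in this sketch.
FAST_BRANCH = "fast_cnn"
HYBRID_BRANCH = "hybrid_transformer"


@dataclass
class RoutingDecision:
    branch: str
    routing_probability: float


def route_sample(routing_probability: float,
                 threshold: float,
                 always_hybrid: bool = False) -> RoutingDecision:
    """Pick an inference branch from the policy output.

    always_hybrid=True reproduces the static "Always Hybrid" setting,
    in which the policy output is ignored.
    """
    if not 0.0 <= routing_probability <= 1.0:
        raise ValueError("routing_probability must lie in [0, 1]")
    if always_hybrid:
        return RoutingDecision(HYBRID_BRANCH, routing_probability)
    # ASSUMPTION: probability >= threshold means "simple enough for the fast branch".
    if routing_probability >= threshold:
        return RoutingDecision(FAST_BRANCH, routing_probability)
    return RoutingDecision(HYBRID_BRANCH, routing_probability)


if __name__ == "__main__":
    tau = 0.5  # hypothetical threshold, not the value used in the experiments
    for p in (0.9, 0.2):
        print(p, route_sample(p, tau).branch)
```

Keeping the static mode as a flag on the same function makes it straightforward to evaluate both configurations with identical pre- and post-processing.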
Table 5 summarizes the resulting classification performance and computational complexity measured in FLOPs. The adaptive inference configuration achieves 99.20% classification accuracy while reducing the computational cost from 0.36 GFLOPs to 0.25 GFLOPs per image, corresponding to approximately 30% computational savings compared with the static always-hybrid configuration. Although a slight reduction in classification accuracy is observed relative to the full hybrid inference mode (99.42%), the adaptive routing strategy maintains highly competitive predictive performance while significantly decreasing computational requirements.
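The savings figure quoted above follows directly from the two reported per-image costs, as the short sketch below shows. The second helper estimates which share of samples would have to take the fast branch to reach a given average cost; it needs the fast-branch cost, which is not stated in this section, so the value passed in the example is purely hypothetical, and the policy overhead is assumed to be negligible.

```python
def relative_saving(baseline: float, reduced: float) -> float:
    """Fractional reduction of `reduced` relative to `baseline`."""
    return (baseline - reduced) / baseline


def implied_fast_fraction(hybrid_gflops: float,
                          adaptive_gflops: float,
                          fast_gflops: float) -> float:
    """Share f of samples on the fast branch, from
    adaptive = f * fast + (1 - f) * hybrid (policy overhead ignored)."""
    return (hybrid_gflops - adaptive_gflops) / (hybrid_gflops - fast_gflops)


if __name__ == "__main__":
    saving = relative_saving(0.36, 0.25)        # values reported in the text
    print(f"FLOPs saving: {saving:.1%}")        # about 30.6%
    print(f"Accuracy gap: {99.42 - 99.20:.2f} percentage points")
    # HYPOTHETICAL fast-branch cost of 0.15 GFLOPs, for illustration only.
    print(f"Implied fast share: {implied_fast_fraction(0.36, 0.25, 0.15):.2f}")
```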
These results demonstrate that the proposed policy-guided routing mechanism can effectively reduce unnecessary Transformer computations for simpler samples while preserving strong classification capability for more challenging inputs. More broadly, the experiments highlight the potential of adaptive resource-aware inference for lightweight edge-oriented agricultural applications, where balancing computational efficiency and predictive accuracy is critical for real-time deployment on embedded and IoT-enabled devices.
6. Parameter and Resource Comparison
To better position EcoTomHybridNet within the existing literature, we provide an indicative comparison of its computational cost and reported performance with a selection of recent methods.
Table 6 summarizes the number of parameters, FLOPs (when available), and reported performance metrics for each model, as stated in the original publications. It is important to note that these results are obtained under different datasets, preprocessing pipelines, and evaluation protocols, and therefore do not constitute a strictly fair comparison. References to the corresponding studies are provided via citation numbers.
A rigorous statistical comparison with recent state-of-the-art models would require unified training conditions, identical preprocessing pipelines, and evaluation on the same datasets. Such large-scale re-implementation is beyond the scope of the current work and is left for future investigation.
7. Deployment-Oriented Efficiency Discussion
In real-world smart agriculture applications, deep learning models must provide not only high classification accuracy but also strong computational efficiency for deployment on resource-constrained edge devices. Since inference latency can vary significantly depending on hardware specifications, software optimization strategies, and deployment conditions, the latency measurements reported in this section should be interpreted as representative rather than absolute performance benchmarks. To further evaluate the practical deployment capability of the proposed framework, we provide indicative latency measurements on commonly used embedded platforms together with a qualitative analysis of EcoTomHybridNet’s deployment-oriented efficiency in terms of parameter count, computational complexity, memory footprint, and compatibility with edge computing devices widely adopted in IoT-enabled smart agriculture systems.
7.1. Compactness in Parameters and FLOPs
As summarized in Table 2, EcoTomHybridNet comprises approximately 0.801 million learnable parameters and requires about 0.36 GFLOPs per 224 × 224 input image. This corresponds to a reduction of more than 40% in parameters and roughly 35% in FLOPs compared with the ViT-Tiny teacher, while maintaining nearly identical classification accuracy. The compact convolutional backbone combined with local, rather than global, self-attention significantly reduces the memory footprint, facilitating deployment on devices with limited RAM. Compared with the fast-CNN baseline, EcoTomHybridNet introduces a moderate increase in parameter count due to the Transformer branch, but significantly improves accuracy, demonstrating a favorable accuracy–complexity trade-off.
7.2. Suitability for Edge Devices
As shown in Figure 5, common edge platforms for plant disease monitoring include the NVIDIA Jetson family (e.g., Jetson Nano and Xavier) and single-board computers such as the Raspberry Pi 4. The Jetson Nano provides approximately 472 GFLOPS of peak FP16 compute via a 128-core GPU and is equipped with 4 GB of RAM, whereas the Raspberry Pi relies on an ARM CPU and integrated GPU with considerably lower peak throughput. EcoTomHybridNet's requirement of 0.36 GFLOPs per inference falls well within the Jetson's computational budget, and its weights occupy only a small fraction of the available memory when stored in half-precision (FP16) or integer formats. On the Raspberry Pi, despite limited GPU acceleration, the modest parameter count suggests that CPU-only inference remains feasible at reasonable frame rates, particularly when batching inputs or reducing image resolution.
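A back-of-the-envelope check of these claims can be sketched as follows. The weight-storage estimate counts parameters only (activations, buffers, and runtime overhead are excluded), and the latency figure is an idealized compute-bound lower bound that assumes the device sustains its peak throughput, which real workloads do not reach.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mb(num_params: float, dtype: str) -> float:
    """Storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


def compute_bound_latency_ms(gflops_per_inference: float,
                             peak_gflops: float) -> float:
    """Idealized lower bound on latency if peak throughput were sustained."""
    return gflops_per_inference / peak_gflops * 1000.0


if __name__ == "__main__":
    params = 0.801e6  # parameter count reported in the text
    for dtype in ("fp32", "fp16", "int8"):
        print(dtype, f"{weight_memory_mb(params, dtype):.2f} MB")
    # 0.36 GFLOPs per image against the Jetson Nano's quoted 472 GFLOPS peak.
    print(f"{compute_bound_latency_ms(0.36, 472.0):.2f} ms (ideal bound)")
```

The gap between this bound and any measured latency reflects memory traffic, kernel launch overhead, and framework cost rather than arithmetic alone.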
7.3. Qualitative Latency Considerations
Inference latency on edge computing devices is influenced by several factors, including processor frequency, memory bandwidth, software optimization, parallel execution capability, and thermal constraints. Because runtime measurements may vary significantly depending on deployment conditions and hardware configurations, the latency observations discussed in this section should be interpreted as indicative rather than universally generalizable benchmarks.
From an architectural perspective, EcoTomHybridNet incorporates several design choices intended to improve inference efficiency on resource-constrained platforms. First, the proposed model avoids computationally expensive global self-attention mechanisms, whose complexity scales quadratically with the number of tokens. Instead, EcoTomHybridNet employs local self-attention windows, resulting in substantially lower computational complexity and improved parallelization efficiency. Second, the lightweight convolutional backbone utilizes stride-2 downsampling operations and compact convolution kernels, reducing intermediate feature-map sizes, memory transfers, and overall computational overhead.
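The difference between global and windowed attention can be made concrete by counting the multiply-accumulate operations of the two attention matrix products. This is a simplified sketch: it ignores the linear projections and softmax, assumes non-overlapping windows, and uses hypothetical token counts, window size, and embedding width that are not taken from the model configuration.

```python
def global_attention_macs(num_tokens: int, dim: int) -> int:
    """MACs for QK^T and (attention x V) when every token attends to all tokens."""
    return 2 * num_tokens * num_tokens * dim


def windowed_attention_macs(num_tokens: int, window_tokens: int, dim: int) -> int:
    """MACs when each token attends only to the tokens in its own window."""
    if num_tokens % window_tokens != 0:
        raise ValueError("num_tokens must be a multiple of window_tokens")
    return 2 * num_tokens * window_tokens * dim


if __name__ == "__main__":
    # HYPOTHETICAL sizes: a 14 x 14 token grid, 7 x 7 windows, 64 channels.
    n, w, d = 14 * 14, 7 * 7, 64
    g = global_attention_macs(n, d)
    l = windowed_attention_macs(n, w, d)
    print(g, l, g / l)  # the ratio equals n / w
```

Under these assumptions the cost grows linearly in the number of tokens for a fixed window size, instead of quadratically.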
In addition, the lightweight policy-guided routing mechanism introduces only minimal computational overhead while enabling adaptive computational allocation according to input complexity. During the main experimental configuration, the hybrid Transformer branch is used as the primary inference path to ensure stable optimization and consistent evaluation. Dynamic inference experiments further demonstrate that the proposed routing strategy can selectively utilize the lightweight convolutional branch for less complex samples, thereby reducing unnecessary Transformer computations while maintaining competitive classification performance. Collectively, these architectural characteristics indicate that EcoTomHybridNet is well suited for low-latency and resource-efficient inference on representative embedded platforms such as NVIDIA Jetson devices and Raspberry Pi systems under practical edge deployment conditions.
7.4. Energy and Thermal Footprint
Efficient energy utilization is a critical requirement for intelligent agricultural systems deployed in remote or resource-limited environments, where embedded devices often operate continuously under constrained power budgets. In such scenarios, excessive computational demand may increase battery consumption, generate thermal instability, and reduce long-term operational reliability.
EcoTomHybridNet was intentionally designed to maintain a lightweight computational profile while preserving strong predictive capability. The compact convolutional backbone reduces memory-access operations, whereas the localized self-attention mechanism avoids the high computational overhead associated with global Transformer attention. As a result, the overall architecture limits unnecessary computational activity and contributes to lower energy utilization during inference.
In addition, the adaptive routing strategy introduced by the lightweight policy network enables conditional computation according to input complexity. Under adaptive inference settings, simpler samples can be processed through the lightweight convolutional branch without activating the more computationally intensive Transformer-enhanced pathway. This selective computational allocation mechanism not only reduces FLOPs but also contributes to limiting processor utilization and thermal stress during continuous operation.
From a deployment perspective, maintaining moderate thermal behavior is particularly important for fanless embedded devices commonly used in smart agriculture systems, including edge gateways, mobile devices, and low-power monitoring platforms. Although a dedicated hardware-level thermal profiling analysis was beyond the scope of this work, the reduced computational complexity of EcoTomHybridNet suggests favorable operational characteristics for prolonged edge deployment scenarios.
Future optimization strategies may further improve energy efficiency through quantization-aware inference, mixed-precision computation, TensorRT acceleration, and hardware-specific compilation techniques tailored for embedded AI accelerators.
7.5. Empirical Edge Deployment Evaluation
To further assess the practical deployment capability of EcoTomHybridNet on resource-constrained hardware, empirical inference benchmarks were conducted on two representative embedded platforms: the NVIDIA Jetson Nano and the Raspberry Pi 4. The experiments were implemented using PyTorch 2.0. TensorRT acceleration was enabled on the Jetson Nano to optimize GPU inference, whereas CPU-only execution was employed on the Raspberry Pi 4. Average per-image inference latency was measured over 100 independent runs using input images of resolution 224 × 224.
Table 7 summarizes the approximate inference latency, throughput in frames per second (FPS), compute unit utilized, and qualitative power characteristics observed on both platforms. On the Jetson Nano, EcoTomHybridNet achieved an average inference latency of approximately 25 ms (around 40 FPS) when executed on the GPU, while maintaining low power consumption. On the Raspberry Pi 4, the model required approximately 120 ms per inference (around 8 FPS) under CPU-only execution and exhibited moderate power consumption.
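A measurement of this kind can be sketched with a small timing harness. The warm-up phase and the use of a generic callable are assumptions of this sketch, not details of the reported benchmark; on a GPU, the callable would also need to synchronize the device before returning so that asynchronous kernels are included in the timing.

```python
import time
from statistics import mean
from typing import Callable


def latency_to_fps(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


def benchmark(infer: Callable[[], object], runs: int = 100, warmup: int = 10) -> float:
    """Mean wall-clock latency of `infer` in milliseconds over `runs` calls."""
    for _ in range(warmup):          # ASSUMPTION: discard warm-up iterations
        infer()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a single-image forward pass.
    ms = benchmark(lambda: sum(range(10_000)))
    print(f"{ms:.3f} ms per call")
    print(latency_to_fps(25.0), round(latency_to_fps(120.0), 1))
```

The conversion reproduces the throughput values quoted above: 25 ms corresponds to 40 FPS and 120 ms to roughly 8 FPS.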
These experimental findings demonstrate that EcoTomHybridNet is capable of achieving near real-time inference on representative embedded platforms while maintaining low computational complexity and a compact memory footprint. More importantly, the obtained deployment results further support the practical applicability of the proposed adaptive resource-aware framework for low-power edge devices commonly employed in IoT-enabled smart agriculture systems.
In summary, EcoTomHybridNet balances accuracy and efficiency in a manner that makes it attractive for deployment on low-power devices commonly used in agricultural monitoring. Its small parameter count and moderate FLOPs allow it to fit within the memory and compute budgets of platforms such as the Jetson Nano and Raspberry Pi, and its architectural design suggests favorable latency and energy characteristics. Future work could explore quantization and hardware-specific optimizations to further reduce runtime and energy costs.
8. IoT-Enabled Smart Agriculture Framework
8.1. IoT Data Acquisition Layer
Although the primary focus of EcoTomHybridNet is algorithmic efficiency, deploying such models in real agricultural settings necessitates a broader infrastructure that leverages the Internet of Things (IoT). Smart agriculture relies on a network of heterogeneous sensors and actuators (cameras, soil moisture probes, weather stations, pest traps and irrigation controllers) connected through low-power wide area networks (LPWAN), Wi-Fi or cellular technologies (e.g., 4G/5G or NB-IoT) [29]. These devices continuously monitor crop health, environmental conditions and operational parameters, generating streams of data that can be analyzed for actionable insights. In the context of tomato disease detection, high-resolution camera modules mounted on drones or fixed rigs capture leaf images, while microcontrollers and single-board computers perform on-device inference using lightweight models such as EcoTomHybridNet. Sensor data and classification results are transmitted via message-oriented middleware (e.g., MQTT) to edge gateways, which aggregate and forward the information to cloud services for storage, visualization and decision support. The resulting cyber–physical system enables farmers to receive real-time alerts, adjust irrigation or pesticide schedules and document disease outbreaks over large fields. As illustrated in Figure 2, this IoT pipeline provides a concrete blueprint for connecting sensors, edge devices and cloud platforms.
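As an illustration of the messaging step, the sketch below builds a topic string and a JSON payload that an edge device could hand to an MQTT client for publication. The topic layout, field names, and device identifier are hypothetical; the paper does not specify a message schema, and no MQTT client library is invoked here.

```python
import json
from datetime import datetime, timezone
from typing import Optional, Tuple


def build_detection_message(device_id: str,
                            predicted_class: str,
                            confidence: float,
                            branch: str,
                            timestamp: Optional[str] = None) -> Tuple[str, str]:
    """Return (topic, JSON payload) describing one classification result."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc).isoformat()
    topic = f"farm/tomato/{device_id}/detections"   # HYPOTHETICAL topic layout
    payload = {
        "device_id": device_id,
        "predicted_class": predicted_class,
        "confidence": round(confidence, 4),
        "branch": branch,            # which inference branch produced the result
        "timestamp": timestamp,
    }
    return topic, json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    topic, body = build_detection_message("cam-01", "early_blight", 0.97,
                                          "hybrid_transformer")
    print(topic)
    print(body)
```

Sending only the label, confidence, and metadata rather than the image itself keeps the message small enough for low-bandwidth links.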
The IoT-enabled framework also facilitates automated data collection for continuous model improvement. By embedding our model in smart cameras distributed across greenhouses or open fields, thousands of images can be automatically labelled and uploaded to a central repository, where they are curated and used to fine-tune the model. Edge devices equipped with connectivity modules (Bluetooth Low Energy, ZigBee or LoRaWAN) periodically sync their logs with a server, enabling remote monitoring and over-the-air updates. Such a design not only provides a next-generation networking component but also closes the loop between sensing, actuation and learning. Future extensions could incorporate federated learning, where multiple farms collaboratively train a shared model without disclosing raw data, preserving privacy and reducing bandwidth consumption.
8.2. Edge–Cloud Architecture for Tomato Disease Detection
Building upon the IoT foundation illustrated in Figure 2, we propose an edge–cloud architecture that balances local processing with centralized analytics. In this architecture, data acquisition and preliminary inference take place at the network edge: camera modules or smartphones capture tomato leaf images, preprocess them and execute EcoTomHybridNet on embedded GPUs or NPUs. Thanks to the model's compact footprint and low FLOP count, inference latency at the edge is acceptable (see Section 4), enabling timely feedback to farmers even in areas with limited connectivity. When an image is classified as diseased, the edge device can immediately trigger an alert or log the event for subsequent agronomic intervention.
Simultaneously, selected images and metadata (GPS coordinates, environmental readings and timestamp) are uploaded to a cloud platform over wireless or wired networks. The cloud aggregates data from multiple edge nodes, performs more computationally intensive analyses (e.g., trend detection, anomaly detection and spatiotemporal modelling) and stores a large archive of labelled images. Cloud resources support the retraining of EcoTomHybridNet or the evaluation of alternative architectures using distributed computing frameworks. Updated model weights are disseminated back to edge devices via periodic synchronization, ensuring that local predictors remain accurate as new disease variants emerge. This bidirectional flow of information—edge-to-cloud for data and cloud-to-edge for models—embodies the concept of next-generation networking by harnessing high bandwidth connections when available (5G/6G) and falling back to LPWAN or satellite links in rural areas. By decoupling inference and training across edge and cloud tiers, our architecture delivers scalability, robustness and responsiveness, laying the groundwork for large-scale deployment of tomato disease detection within smart agriculture ecosystems.
9. Network Architecture and Training Pipeline
Figure 6 presents an overview of the EcoTomHybridNet architecture and training pipeline. First, the ViT-Tiny teacher model is trained on the PlantVillage tomato leaf disease dataset. The teacher’s logits are subsequently used to supervise the EcoTomHybridNet student through a knowledge distillation framework. The student architecture consists of a shared lightweight convolutional backbone, a fast convolutional branch for efficient inference, a Transformer-enhanced hybrid branch with local self-attention for richer contextual feature extraction, and a lightweight policy-guided routing network designed for adaptive resource-aware inference.
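A common way to supervise a student with teacher logits is the temperature-scaled distillation loss, sketched below in plain Python for a single sample. This is the standard formulation and is offered only as an illustration; the temperature, the weighting factor `alpha`, and the exact loss used for EcoTomHybridNet are assumptions, since this section does not state them.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      label: int,
                      temperature: float = 4.0,   # HYPOTHETICAL value
                      alpha: float = 0.5) -> float:  # HYPOTHETICAL value
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-target KL divergence between softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student) if t > 0.0)
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


if __name__ == "__main__":
    print(distillation_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0))
```

The factor of T squared keeps the gradient magnitude of the soft-target term comparable to that of the hard-label term as the temperature changes.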
In the current experimental configuration, the Transformer-enhanced hybrid branch is used as the primary inference path to ensure stable optimization and maximize classification performance. The fast convolutional branch and the lightweight policy-guided routing mechanism are further evaluated through dedicated dynamic inference experiments in order to analyze adaptive computational allocation under resource-constrained deployment conditions.
Figure 6. Overview of the EcoTomHybridNet architecture and training pipeline. The ViT-Tiny teacher model (top left) is trained on the PlantVillage dataset, and its logits are used to supervise the EcoTomHybridNet student through a knowledge distillation framework. The student architecture integrates a lightweight convolutional backbone, a fast convolutional branch, a Transformer-enhanced hybrid branch, and a lightweight policy-guided routing network. In the current experimental configuration, the hybrid branch serves as the primary inference path for final predictions, while the policy-guided routing mechanism is further analyzed through dedicated dynamic inference evaluations.
10. Robustness and Generalization Experiments
Deep learning models trained on the PlantVillage dataset exhibit high accuracy on in-distribution data, but their robustness to domain shift and image perturbations is a critical consideration for deployment. In this section, we present complementary evaluations inspired by the guidelines suggested in [30] and follow-up analyses of cross-dataset performance.
10.1. Discussion on Cross-Dataset Generalization
Although cross-dataset evaluation is essential to assess real-world generalization, this study is limited to the PlantVillage dataset due to the lack of access to large-scale, labelled field datasets with consistent class definitions.
However, to address this limitation, we rely on complementary evaluation strategies, including k-fold cross-validation, robustness testing under noise and blur perturbations, and comparison with findings reported in the literature.
Previous studies have shown that models trained on PlantVillage often suffer from significant performance degradation when applied to external datasets, due to differences in background complexity, illumination, and acquisition conditions. Therefore, the high performance achieved in this work should be interpreted as an upper-bound estimate under controlled conditions rather than a guarantee of real-world performance [22, 31].
Future work will focus on extending the evaluation to cross-domain datasets and incorporating domain adaptation techniques to improve generalization.
10.2. Noise and Blur Robustness
A simpler, dataset-agnostic way to probe model robustness is to test its resilience to common image corruptions, such as noise, blur and illumination changes. We followed this strategy by adding Gaussian noise (standard deviation σ = 0.05) and synthetic motion blur to the PlantVillage test images and measuring classification accuracy.
Table 8 summarizes the results. On clean images, the model achieves 99.42% accuracy (matching Table 2). Adding moderate Gaussian noise reduces accuracy to the high 97% range, while motion blur reduces it to around 96%. These drops, although noticeable, indicate that EcoTomHybridNet retains good predictive capability under realistic degradations. We note that more severe corruptions or combinations of noise, blur and illumination changes could further challenge the model, and thus we encourage future robustness evaluations across a broader set of corruptions.
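The two perturbations used in this evaluation can be sketched on a small grayscale image stored as nested lists. Several details are assumptions of this sketch: pixel values are taken to lie in [0, 1] (so that σ = 0.05 is relative to the full intensity range), the blur is a horizontal box average, and the kernel length of 5 is hypothetical, since the blur parameters are not stated in the text.

```python
import random
from typing import List, Optional

Image = List[List[float]]


def add_gaussian_noise(image: Image, sigma: float = 0.05,
                       seed: Optional[int] = None) -> Image:
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def horizontal_motion_blur(image: Image, kernel_size: int = 5) -> Image:
    """Average each pixel with its horizontal neighbours (edges replicated)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            window = [row[min(width - 1, max(0, x + dx))]
                      for dx in range(-half, half + 1)]
            new_row.append(sum(window) / kernel_size)
        blurred.append(new_row)
    return blurred


if __name__ == "__main__":
    img = [[0.0, 0.0, 1.0, 0.0, 0.0]]
    print(horizontal_motion_blur(img, kernel_size=3))
    print(add_gaussian_noise(img, sigma=0.05, seed=0))
```

Fixing the random seed makes the corrupted test set reproducible, so that accuracy differences between runs reflect the model rather than the noise draw.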
Even with these perturbations, EcoTomHybridNet outperforms many lightweight baselines, suggesting that our hybrid architecture and distilled training confer a degree of robustness. As additional background, cross-dataset evaluations reported in the literature on PlantVillage and other datasets have revealed that performance degradation across domains can be substantial; however, evaluating robustness to simple corruptions provides complementary insights into model behavior under controlled perturbations. Together, these experiments indicate that our model not only achieves very high accuracy on the PlantVillage benchmark but also maintains competitive performance under moderate noise and blur.
Although the current robustness evaluation focuses on moderate Gaussian noise and motion blur perturbations, future work will investigate more comprehensive robustness benchmarks involving illumination changes, occlusions, compression artifacts, and standardized corruption datasets such as ImageNet-C.
10.3. Cross-Validation and Calibration Analysis
While EcoTomHybridNet achieves very high accuracy on the PlantVillage dataset, such performance may raise concerns regarding potential overfitting and dataset bias. To address this issue, we conducted a 5-fold cross-validation experiment, the results of which are reported in Table 9. The model maintains consistently high performance across all folds, with very low variance in both accuracy and macro F1-score. This stability indicates that the model does not depend on a particular data split and exhibits reliable learning behavior.
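For reference, a 5-fold partition of a dataset can be generated as in the sketch below. It uses a plain shuffled split; whether the reported experiment used stratification by class, and which random seed was used, are not stated, so both are assumptions here.

```python
import random
from typing import List, Tuple


def k_fold_indices(num_samples: int, k: int = 5,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs covering all samples."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


if __name__ == "__main__":
    for train, validation in k_fold_indices(10, k=5):
        print(len(train), sorted(validation))
```

Each sample appears in exactly one validation fold, so every image contributes to the reported per-fold metrics once.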
Furthermore, we analyzed the model’s prediction confidence to evaluate its calibration properties. The predicted probabilities were generally well-aligned with the true outcomes, suggesting that the model does not suffer from overconfident predictions despite its high accuracy.
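One widely used summary of calibration is the Expected Calibration Error (ECE), sketched below. It is shown as an example of how alignment between confidence and correctness can be quantified; the text does not state which calibration measure was used, and the bin count of 10 is an assumption.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               num_bins: int = 10) -> float:
    """Weighted mean gap between accuracy and mean confidence over equal-width bins."""
    total = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(num_bins - 1, int(conf * num_bins))
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy example: two fully confident predictions, one of them wrong.
    print(expected_calibration_error([1.0, 1.0], [True, False]))
```

A value near zero means that, within each confidence bin, the fraction of correct predictions matches the average predicted probability.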
Combined with the robustness analysis under noise and blur (Section 10.2) and the discussion of cross-dataset generalization (Section 10.1), these findings provide evidence that EcoTomHybridNet generalizes well within the PlantVillage distribution and does not exhibit significant overfitting.
The low standard deviation observed across folds suggests stable model behavior and low sensitivity to data partitioning. In addition, the narrow confidence intervals further support the statistical consistency of the proposed approach.
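The fold-level summary statistics can be computed as in the following sketch, which reports the mean, the sample standard deviation, and a t-based 95% confidence interval. The fold accuracies in the example are hypothetical placeholders and are not the values of Table 9; the critical value 2.776 is the two-sided 95% Student t quantile for 4 degrees of freedom (five folds). Because the training sets of different folds overlap, such an interval is only approximate.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

T_CRIT_95_DF4 = 2.776  # two-sided 95% Student t quantile, 4 degrees of freedom


def fold_summary(scores: Sequence[float],
                 t_crit: float = T_CRIT_95_DF4) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, sample std, (ci_low, ci_high)) for per-fold scores."""
    m = mean(scores)
    s = stdev(scores)                      # sample standard deviation (n - 1)
    half_width = t_crit * s / sqrt(len(scores))
    return m, s, (m - half_width, m + half_width)


if __name__ == "__main__":
    # HYPOTHETICAL fold accuracies (%), for illustration only.
    example = [99.3, 99.4, 99.5, 99.4, 99.4]
    m, s, (low, high) = fold_summary(example)
    print(f"mean={m:.2f} std={s:.3f} 95% CI=[{low:.2f}, {high:.2f}]")
```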
10.4. Statistical Limitations and Future Validation
Despite the encouraging experimental results, several statistical limitations remain. The current evaluation relies primarily on a single public dataset and does not include large-scale statistical hypothesis testing against recent state-of-the-art models under identical experimental conditions. Moreover, robustness evaluation is currently limited to moderate perturbations and does not cover standardized corruption benchmarks. Future work will therefore focus on broader statistical validation, repeated independent experiments, confidence interval estimation, and cross-dataset benchmarking under unified evaluation protocols.
11. Conclusions and Perspectives
In this work, we introduced EcoTomHybridNet, an adaptive resource-aware CNN–Transformer framework designed for efficient and accurate tomato leaf disease classification under edge-computing constraints. The proposed architecture combines a lightweight convolutional backbone with a Transformer-enhanced branch based on local self-attention, while a lightweight policy-guided routing mechanism dynamically allocates samples between a fast convolutional branch and a more expressive hybrid branch according to input complexity. In addition, knowledge distillation from a ViT-Tiny teacher model is employed to transfer discriminative representations to the compact student network without substantially increasing computational complexity.
Experimental results on the PlantVillage tomato dataset demonstrate that EcoTomHybridNet achieves 99.42% test accuracy while reducing computational complexity by more than 30% in FLOPs and over 40% in parameter count compared with the teacher model. Furthermore, adaptive inference experiments demonstrate that the proposed policy-guided routing mechanism reduces computational cost from 0.36 GFLOPs to 0.25 GFLOPs per image while preserving highly competitive classification performance. These results confirm the effectiveness of adaptive resource-aware inference for balancing predictive accuracy and computational efficiency in lightweight agricultural deep learning systems.
The principal contribution of this work lies in the experimental investigation of adaptive resource-aware inference for agricultural disease classification through lightweight policy-guided branch allocation. Unlike conventional static lightweight classifiers, EcoTomHybridNet dynamically regulates computational usage according to input complexity, enabling substantial computational savings while maintaining highly competitive classification accuracy. Compared with recent lightweight and edge-oriented hybrid architectures, the proposed framework explicitly integrates input-dependent computational allocation instead of relying exclusively on static inference pipelines. This adaptive routing capability significantly improves the suitability of EcoTomHybridNet for deployment on resource-constrained edge devices commonly employed in IoT-enabled smart agriculture environments.
To improve experimental reliability and reduce the risk of overfitting, additional validation strategies were conducted, including k-fold cross-validation and robustness evaluation under Gaussian noise and motion blur perturbations. The obtained results demonstrate stable behavior across different data partitions with low variance, suggesting that EcoTomHybridNet learns meaningful visual representations rather than memorizing specific training samples. Nevertheless, both our analyses and previous studies indicate that models trained exclusively on controlled datasets such as PlantVillage may experience performance degradation when evaluated on field-acquired images characterized by complex backgrounds, illumination variability, occlusions, and environmental noise. Consequently, the results reported in this study should be interpreted as strong controlled-environment performance rather than a definitive measure of real-world field generalization.
Despite the encouraging experimental findings, several limitations remain. The current evaluation primarily relies on a single controlled dataset and does not yet include large-scale cross-dataset benchmarking under unified experimental protocols. In addition, although the proposed adaptive routing mechanism demonstrates promising computational savings, the current implementation still relies on a threshold-based routing strategy. More advanced learned gating mechanisms, uncertainty-aware routing strategies, and hardware-aware adaptive inference policies could further improve the balance between computational efficiency and predictive performance.
Future work will therefore focus on extending the evaluation to field-acquired datasets, incorporating domain adaptation and domain generalization techniques, and conducting broader statistical comparisons with recent state-of-the-art approaches under identical experimental settings. Additional research directions include integrating disease detection and localization modules, applying advanced compression techniques such as quantization and pruning, and optimizing the framework for low-power embedded hardware. Finally, deeper integration within IoT-enabled edge–cloud smart agriculture ecosystems and large-scale real-world deployment on embedded platforms will further strengthen the scalability, robustness, and practical applicability of EcoTomHybridNet for next-generation intelligent farming systems.