1. Introduction
Tomato (Solanum lycopersicum), a globally cultivated solanaceous crop, encompasses a diverse array of varieties, including large red tomatoes, cherry tomatoes, and grape tomatoes, esteemed for their exquisite flavor and substantial nutritional value. These attributes underpin its strong domestic and international markets [1,2]. However, modern agriculture’s reliance on manual and empirical pest and disease control struggles to meet precision demands. Diverse pests and diseases, complicated by varying symptoms and environmental factors like lighting and foliar occlusion, often lead to missed detections and delayed interventions [3,4].
Deep learning advancements have improved image-based pest and disease detection, enhancing detection accuracy and computational efficiency [5]. However, diverse pests, subtle features, and environmental factors such as occlusion and variable lighting complicate accurate identification in complex field conditions [6,7,8]. Traditional manual inspections fail to provide timely results, and ground-based equipment faces challenges in delivering real-time monitoring. Agricultural robots, with automation and intelligence, offer a robust solution for pest and disease management [9,10,11,12]. Detection algorithms for agricultural robots must achieve robust and accurate detection of tomato pests and diseases in dynamic field conditions, while maintaining lightweight and energy-efficient models for real-time operation on edge devices [13].
To address these challenges, lightweight models have become a focal point of research, balancing the trade-offs between accuracy, computational cost, and deployment feasibility on resource-constrained devices [14]. Lightweight models such as YOLOv5n and MobileNetV3 have advanced agricultural pest and disease detection. For example, CTR_YOLOv5n integrates coordinate attention and Swin Transformer, achieving a Precision of 95.2% with a model size of 5.1 megabytes [15]. MobileNetV3-based models attain an accuracy of 79.52% with a model size of 2.36 megabytes through attention mechanisms [16]. Lightweight models based on Swin Transformer, such as TrIncNet and HCFormer, combine convolutional neural networks (CNNs) and Transformers, demonstrating good performance on datasets like PlantVillage [17]. However, these models often struggle with occlusion, variable lighting, and complex backgrounds, showing limited effectiveness for small, dense targets like spider mites and bacterial spot, and their computational complexity hinders deployment on low-latency edge devices [18,19,20,21].
As shown in Figure 1, this study systematically evaluates mainstream models for agricultural pest and disease detection across dimensions such as lightweight design, accuracy, real-time processing, and deployment compatibility using a radar chart. By including lightweight variants like MobileNetV3 and Swin Transformer, the analysis provides a comprehensive benchmark. Building on these insights, we propose AgriLiteNet (Lightweight Networks for Agriculture), an innovative lightweight object detection model designed for real-time, high-precision detection of tomato pests and diseases on resource-constrained edge devices. AgriLiteNet employs a dual-branch architecture that synergistically integrates an optimized MobileNetV3 CNN branch for capturing local features and a streamlined Swin Transformer branch for modeling global context. It further enhances performance with a lightweight channel–spatial mixed attention module (L-CSMA) for effective feature fusion and a lightweight feature pyramid network (L-FPN) to strengthen multi-scale feature representation, enabling robust detection of small targets like spider mites, densely clustered targets like bacterial spot, and large targets such as late blight under complex field conditions.
The primary contributions of this study are as follows:
- (1) Introduction of AgriLiteNet: This study presents AgriLiteNet, an innovative dual-branch lightweight detection model that combines an optimized MobileNetV3 (1.2 million parameters, 0.2 GFLOPs) with a streamlined Swin Transformer (0.3 million parameters, 0.3 GFLOPs). This architecture effectively integrates the local feature extraction capabilities of CNNs with the global modeling strengths of Transformers, delivering robust detection performance in complex field environments.
- (2) Efficient Feature Fusion Mechanisms: This study develops an advanced feature fusion framework comprising the lightweight channel–spatial mixed attention module (L-CSMA) and the L-FPN. The L-CSMA dynamically recalibrates the weight distribution across channel and spatial features, while the L-FPN enhances the integration of multi-scale target information. Together, these components significantly improve the model’s generalization capabilities across a diverse array of pest and disease targets.
- (3) Design of a Deployment-Oriented Evaluation Framework: This study constructs a comprehensive evaluation framework tailored for real-world deployment scenarios, incorporating a standardized dataset of 12,000 tomato pest and disease images across 10 categories. The evaluation criteria include not only detection accuracy but also model complexity, inference speed, and power efficiency, providing a multi-dimensional benchmark for lightweight agricultural detection models. This framework enables fair comparison across existing methods and highlights the deployment feasibility of AgriLiteNet in edge-based agricultural robotics.
2. Materials and Methods
2.1. Dataset
The dataset developed in this study is tailored specifically for the real-time detection of tomato pests and diseases in agricultural robotics applications. Initially, 11,000 images were sourced from the publicly available PlantVillage dataset on Kaggle, which comprises field-collected images of tomato seedlings.
To enhance dataset diversity and improve the model’s generalization capabilities, additional images were captured at the Modern Agricultural Park in Xuzhou City, Jiangsu Province, China (38°36′14.58″ N, 118°02′41.94″ E), as shown in Figure 2. This park cultivates several tomato varieties, including Jinpeng No. 8, Zhongza No. 9, and Cherry Tomato. Image acquisition took place primarily between March and April, encompassing various growth stages of the tomato plants. A Redmi K30 smartphone camera, manufactured by Xiaomi Technology Co., Ltd., Beijing, China, was used for image collection. The camera was mounted on a tripod and positioned 0.1 m from the tomato foliage to ensure stability and consistency during data acquisition.
To ensure experimental robustness, 1000 supplementary images of tomato leaves were collected under diverse environmental conditions, including variations in angles and illumination. These additions enhanced the dataset’s representativeness and adaptability. The final dataset consisted of 12,000 images covering nine pest and disease categories plus healthy leaves, with approximately 1200 images per category (1000 for training, 100 for validation, and 100 for testing). It was partitioned into three subsets: 10,000 images allocated for training to enable the model to learn the distinguishing characteristics of various tomato pests and diseases; 1000 images designated for validation to fine-tune model parameters and reduce overfitting; and the remaining 1000 images reserved for testing to rigorously evaluate the model’s performance on previously unseen data.
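For readers who want to reproduce this split, a minimal loading sketch is given below; it assumes an illustrative directory-per-class layout (e.g., tomato_dataset/train/<class>/), not the authors' released structure, and uses standard torchvision utilities.

```python
import torch
from torchvision import datasets, transforms

# Assumed layout: tomato_dataset/{train,val,test}/<class_name>/*.jpg,
# with 10 class folders (9 pest/disease categories plus healthy leaves).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # input resolution used throughout this study
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("tomato_dataset/train", transform=preprocess)
val_set = datasets.ImageFolder("tomato_dataset/val", transform=preprocess)
test_set = datasets.ImageFolder("tomato_dataset/test", transform=preprocess)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32, shuffle=False, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=32, shuffle=False, num_workers=4)

print(len(train_set), len(val_set), len(test_set))   # expected roughly 10,000 / 1000 / 1000
```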
As shown in Figure 3, the dataset encompasses nine categories of pests and diseases, alongside healthy tomato leaves; the representative samples in the figure underscore the dataset’s diversity. The identified pests and diseases include the following:
- (a) Healthy: Plants exhibit no signs of disease or pest damage, characterized by vibrant green leaves with uniform coloration, firm stems, and well-formed fruits, indicating optimal growth and development.
- (b) Bacterial Spot: Caused by Xanthomonas campestris, this disease produces small, water-soaked spots on leaves that turn dark brown to black with yellow halos, often leading to rot on leaves, stems, and fruits.
- (c) Early Blight: Triggered by Alternaria solani, it manifests as concentric, dark brown lesions with yellowing edges on lower leaves, causing early leaf drop and reduced photosynthesis.
- (d) Late Blight: Induced by Phytophthora infestans, this rapidly spreading disease results in large, irregular gray-green to black lesions on leaves, often with white mold, leading to leaf rot and fruit decay.
- (e) Leaf Mold: Caused by Passalora fulva, it forms olive-green to grayish mold layers on the undersides of leaves, with yellowing on upper surfaces, causing desiccation and leaf detachment.
- (f) Septoria Leaf Spot: Due to Septoria lycopersici, it produces small, circular grayish-white spots with dark brown borders on leaves, typically on lower foliage, leading to defoliation.
- (g) Spider Mites: Tetranychus urticae, sap-feeding pests, cause minute yellow speckles and stippling on leaf surfaces, progressing to widespread yellowing and desiccation.
- (h) Target Spot: Initiated by Corynespora cassiicola, it is characterized by concentric, bullseye-like lesions with alternating light and dark rings on leaves, resulting in premature defoliation.
- (i) Mosaic Virus: Transmitted by various viruses, it induces yellow-white mottling and mosaic patterns on leaves, accompanied by deformation and stunted plant growth.
- (j) Yellow Leaf Curl Virus: Caused by begomoviruses, it triggers upward curling and yellowing of new leaves, with vein clearing and reduced leaf size, significantly lowering yield.
Figure 3. Tomato pest and disease dataset samples.
These pests and diseases variably impair tomato growth, yield, and fruit quality, underscoring the need for precise and timely detection.
2.2. Methods
2.2.1. AgriLiteNet
As shown in Figure 4, AgriLiteNet employs an enhanced lightweight Swin Transformer and MobileNetV3 as its primary feature extraction backbones, integrating an L-CSMA to enable deep fusion of local and global features, and utilizing an L-FPN to perform multi-scale object detection. The architecture of AgriLiteNet is carefully designed to reconcile high accuracy with low computational complexity, as expressed in Equation (1), which delineates its forward propagation:

$$\hat{y} = \mathrm{L\text{-}FPN}\big(\mathrm{L\text{-}CSMA}(F_{\mathrm{local}}, F_{\mathrm{global}})\big), \quad F_{\mathrm{local}} = \mathrm{MobileNetV3}(x), \quad F_{\mathrm{global}} = \mathrm{SwinT}(x) \tag{1}$$

Here, $F_{\mathrm{local}}$ denotes the local features extracted by MobileNetV3 and $F_{\mathrm{global}}$ represents the global features derived from the lightweight Swin Transformer, while L-CSMA and L-FPN correspond to the feature fusion and multi-scale detection operations, respectively.
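A minimal sketch of the forward pass in Equation (1) is shown below; the four submodules are placeholders for the components described in the following subsections, so the class is illustrative rather than the authors' exact implementation.

```python
import torch.nn as nn

class AgriLiteNet(nn.Module):
    """Dual-branch detector: local CNN features + global Transformer features -> fusion -> multi-scale head."""
    def __init__(self, cnn_branch, swin_branch, l_csma, l_fpn):
        super().__init__()
        self.cnn_branch = cnn_branch    # lightweight MobileNetV3 branch (local features)
        self.swin_branch = swin_branch  # lightweight Swin Transformer branch (global features)
        self.l_csma = l_csma            # channel-spatial mixed attention fusion (Section 2.2.4)
        self.l_fpn = l_fpn              # lightweight feature pyramid and detection heads (Section 2.2.5)

    def forward(self, x):
        f_local = self.cnn_branch(x)             # F_local in Equation (1)
        f_global = self.swin_branch(x)           # F_global in Equation (1)
        fused = self.l_csma(f_local, f_global)   # L-CSMA fusion
        return self.l_fpn(fused)                 # multi-scale predictions
```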
Figure 4. Architecture of the AgriLiteNet model.
2.2.2. Enhanced Lightweight Swin Transformer
The Swin Transformer [22] enhances tomato pest and disease detection by capturing global contextual information, outperforming CNNs in identifying complex leaf patterns under challenging field conditions (e.g., poor lighting, complex backgrounds). Its hierarchical multi-scale feature extraction improves accuracy for distinguishing healthy and diseased leaves. However, its high computational complexity limits deployment on edge devices like NVIDIA Jetson Orin NX, where real-time performance is critical. This study introduces optimizations to reduce complexity while retaining global modeling capabilities, enabling efficient operation on resource-constrained agricultural robots.
As shown in Figure 5, this study streamlines the Swin Transformer from its original four stages to two, significantly alleviating the computational burden by reducing the number of downsampling operations. The first stage retains high-resolution input to capture fine-grained features, while the second stage employs max-pooling downsampling to aggregate global information, striking a balance between efficiency and feature representation. The core operation leverages shifted window self-attention, as formalized in Equation (2), which computes self-attention for an input feature map $X$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V \tag{2}$$
Figure 5. Architecture of the lightweight Swin Transformer.
Here, $Q$, $K$, and $V \in \mathbb{R}^{N \times d_k}$ represent the query, key, and value matrices, respectively, where $N$ is the number of pixels within the window, $d_k$ is the key dimension, and $B$ denotes the relative positional bias. The window size for self-attention is reduced from 7 × 7 to 3 × 3, lowering the computational complexity to approximately one-fifth of the original, while still providing sufficient local contextual modeling. The number of attention heads is decreased from 6 to 2, further optimizing computational overhead and parameter count. Additionally, channel dimensions are compressed to 16 and 32 across stages, significantly reducing storage and computational demands, resulting in a compact global feature representation $F_{\mathrm{global}}$. By condensing the original four-stage Swin Transformer into two stages, the computational load is reduced from 4.5 GFLOPs to 0.3 GFLOPs, while retaining global modeling capabilities and minimizing the overhead of window-based self-attention.
A lightweight Transformer with approximately 30K parameters and 0.3 GFLOPs is proposed, substantially reducing computational overhead compared to the original Swin-T model (28 M parameters, 4.5 GFLOPs), thereby enhancing its suitability for deployment on resource-constrained agricultural robots. While this design enables fast inference and excels at capturing global contextual information for large-scale pattern recognition, it exhibits limitations in extracting fine-grained local features—crucial for early-stage pest and disease identification. To overcome this challenge, a lightweight CNN is integrated to leverage its strength in local feature extraction. The resulting dual-branch architecture effectively combines global and local representations, delivering robust multi-scale detection performance while preserving a compact and efficient design.
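To make the reduced attention setting concrete, the sketch below implements plain (non-shifted) window self-attention with 3 × 3 windows and 2 heads on top of nn.MultiheadAttention; the relative positional bias B and the shifting scheme are omitted for brevity, so it approximates rather than reproduces the module.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Non-shifted window self-attention (sketch of the reduced 3x3-window, 2-head setting)."""
    def __init__(self, dim: int, window: int = 3, heads: int = 2):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (B, C, H, W), H and W divisible by the window size
        B, C, H, W = x.shape
        w = self.window
        # partition into non-overlapping w x w windows -> (B * num_windows, w * w, C)
        x = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)                 # self-attention within each window
        x = x + out                                 # residual connection
        # merge windows back to (B, C, H, W)
        x = x.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x

# stage-1 example with 16 channels; the second stage would use 32 channels
y = WindowSelfAttention(dim=16, window=3, heads=2)(torch.randn(1, 16, 48, 48))
```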
2.2.3. Enhanced Lightweight MobileNetV3
MobileNetV3, tailored for deployment on resource-constrained devices, incorporates depthwise separable convolutions, inverted residuals, and linear bottlenecks to minimize computational overhead while preserving strong feature extraction capabilities [23]. The integration of the h-swish activation function further enhances its operational efficiency. With optimizations for high-resolution imagery, MobileNetV3 meets the precision requirements of agricultural robots without compromising real-time performance. Its computational efficiency stems from the depthwise separable convolution formalized in Equation (3):

$$\hat{F}_{c} = K_{c} * F_{c}, \qquad F_{\mathrm{out}} = W_{p} \cdot \hat{F} \tag{3}$$

Here, $K_{c}$ represents the depthwise convolution kernel for the $c$-th channel, yielding an output feature map $\hat{F}_{c}$. Subsequent 1 × 1 pointwise convolutions with weights $W_{p}$ adjust the channel dimensions, producing $F_{\mathrm{out}}$. As shown in Figure 6, we prune the network to its first seven blocks, truncating at the initial stride-2 downsampling layer, reducing the output feature map from the full model’s 7 × 7 × 576 to 28 × 28 × 40. The squeeze-and-excitation (SE) module is removed to minimize latency, while retaining the efficient 3 × 3 depthwise and 1 × 1 pointwise convolution combination. The optimized model significantly reduces parameter count and computational load (approximately 1.2 M parameters and 0.2 GFLOPs) while preserving robust local feature representation.
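A hedged sketch of this pruning with torchvision's MobileNetV3 is shown below; the Large variant is used here because its first seven feature blocks produce a 28 × 28 × 40 map for a 224 × 224 input, and SE blocks are swapped for identities as described, though the authors' exact block boundaries may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

def strip_se(module: nn.Module) -> None:
    """Recursively replace squeeze-and-excitation blocks with identities (removed here to cut latency)."""
    for name, child in module.named_children():
        if child.__class__.__name__ == "SqueezeExcitation":
            setattr(module, name, nn.Identity())
        else:
            strip_se(child)

backbone = mobilenet_v3_large(weights=None)
local_branch = nn.Sequential(*backbone.features[:7])   # keep only the first seven blocks
strip_se(local_branch)

f_local = local_branch(torch.randn(1, 3, 224, 224))
print(f_local.shape)   # torch.Size([1, 40, 28, 28])
```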
The integration of lightweight MobileNetV3 and Swin Transformer enables complementary fusion of local details and global context, enhancing multi-scale detection performance and robustness in complex field environments. However, their differing feature representations may lead to information redundancy or conflict, and the overall computational complexity still requires optimization. A refined architecture is needed to improve feature interaction efficiency and real-time performance.
2.2.4. Lightweight Channel–Spatial Mixed Attention Module
To address the mismatch between local features extracted by MobileNetV3 and global context modeled by the lightweight Swin Transformer, the L-CSMA module is introduced as a transitional layer [24]. This module efficiently recalibrates and fuses features, enhancing multi-scale representation for robust pest and disease detection. It standardizes feature scales using depthwise separable convolutions and spatial interpolation, refines critical information through channel and spatial attention mechanisms, and integrates features via adaptive weighted fusion with residual connections. The operational flow is formalized in Equations (4) and (5), which represent channel and spatial attention, respectively:

$$F_{\mathrm{ch}} = \sigma\!\big(W_{c} \cdot \mathrm{GAP}(F)\big) \otimes F \tag{4}$$

$$F_{\mathrm{sp}} = \sigma\!\big(\mathrm{DWConv}(W_{s} \cdot F_{\mathrm{ch}})\big) \otimes F_{\mathrm{ch}} \tag{5}$$

Here, $\mathrm{GAP}(\cdot)$ denotes global average pooling, $\sigma$ is the Sigmoid activation function, $W_{c}$ and $W_{s}$ are the weight matrices for channel and spatial attention, respectively, and $\mathrm{DWConv}(\cdot)$ represents depthwise separable convolution. The fused feature $F_{\mathrm{fused}}$ is further optimized via a residual connection, as shown in Equation (6):

$$F_{\mathrm{out}} = \alpha F_{\mathrm{fused}} + (1 - \alpha) F \tag{6}$$

where $\alpha$ is an adaptive weight parameter. By dynamically adjusting channel and spatial feature weights, L-CSMA enhances the model’s ability to represent multi-scale pest and disease features while maintaining low computational complexity (approximately 0.1 GFLOPs).
As shown in Figure 7, the L-CSMA module receives local feature maps from the lightweight MobileNetV3 and global feature maps from the lightweight Swin Transformer. Depthwise separable convolutions unify the channel dimension to 64, and nearest-neighbor interpolation adjusts the spatial resolution to 28 × 28, resolving scale mismatches. The channel attention mechanism computes channel statistics via global average pooling (GAP) and MLP to generate channel weights, recalibrating the feature map to emphasize critical information. Subsequently, the spatial attention mechanism compresses the feature map into a single-channel spatial map through cross-channel averaging, applying depthwise separable convolutions to extract spatial dependencies and generate spatial weights, which are multiplied element-wise with the input features to optimize spatial distribution. The processed features are fused through weighted summation, with fusion weights dynamically adjusted by learned adaptive parameters, and a residual connection mitigates information loss. This hierarchical processing ensures effective integration of local and global features while maintaining computational efficiency.
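The sketch below mirrors this flow (Equations (4)–(6)) with 64 fusion channels and a 28 × 28 working resolution; the channel widths of the two inputs and the scalar sigmoid-gated fusion weight are assumptions chosen to match the surrounding text, not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_separable(in_ch, out_ch):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
    )

class LCSMA(nn.Module):
    def __init__(self, local_ch=40, global_ch=32, dim=64, size=28):
        super().__init__()
        self.size = (size, size)
        self.proj_local = dw_separable(local_ch, dim)     # unify channel dimension to 64
        self.proj_global = dw_separable(global_ch, dim)
        self.channel_mlp = nn.Sequential(                 # channel attention: GAP -> MLP -> sigmoid
            nn.Linear(dim, dim // 4), nn.ReLU(inplace=True), nn.Linear(dim // 4, dim))
        self.spatial_conv = dw_separable(1, 1)            # spatial attention on the channel-averaged map
        self.alpha = nn.Parameter(torch.zeros(1))         # adaptive fusion weight, Equation (6)

    def attend(self, x):
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3))))            # Equation (4): channel weights
        x = x * ca[:, :, None, None]
        sa = torch.sigmoid(self.spatial_conv(x.mean(dim=1, keepdim=True)))  # Equation (5): spatial weights
        return x * sa

    def forward(self, f_local, f_global):
        f_local = F.interpolate(self.proj_local(f_local), size=self.size, mode="nearest")
        f_global = F.interpolate(self.proj_global(f_global), size=self.size, mode="nearest")
        fused = self.attend(f_local) + self.attend(f_global)                # weighted summation of branches
        a = torch.sigmoid(self.alpha)
        return a * fused + (1 - a) * (f_local + f_global)                   # Equation (6): residual fusion
```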
2.2.5. Lightweight Feature Pyramid Network
To further enhance the performance of agricultural robotic systems in detecting multi-scale tomato pests and diseases, this study introduces the L-FPN as a core component of the multi-scale detection head, building upon the L-CSMA module [25]. The L-FPN is designed to leverage an efficient multi-level feature fusion strategy, fully utilizing the integrated features output by L-CSMA to bolster the model’s capability to detect targets of varying sizes while maintaining lightweight characteristics suitable for the computational constraints of edge devices.
As shown in Figure 8, the L-FPN begins with the fused feature map output by the L-CSMA module, incorporating intermediate features extracted at different stages by the lightweight MobileNetV3 and lightweight Swin Transformer to form a multi-level feature input. The feature fusion process is formalized in Equation (7):

$$P_{i} = \mathrm{DWConv}\big(C_{i} + \mathrm{Up}(P_{i+1})\big) \tag{7}$$

Here, $P_{i}$ represents the feature map at the $i$-th level, $C_{i}$ denotes the input feature at the corresponding scale, $\mathrm{Up}(\cdot)$ is nearest-neighbor interpolation, and $\mathrm{DWConv}(\cdot)$ signifies depthwise separable convolution.
Figure 8. Architecture of the lightweight feature pyramid network.
The L-FPN architecture encompasses the following key steps: First, depthwise separable convolutions perform downsampling and upsampling on the L-CSMA output to generate a series of feature maps at different resolutions, covering small, medium, and large-scale targets. Next, a top-down feature fusion pathway upsamples deep-layer features (rich in global context but low in resolution) using nearest-neighbor interpolation and combines them element-wise with shallow-layer features (rich in local details but lacking global context). Subsequently, lateral connections and lightweight convolutional layers smooth the fused features, minimizing information distortion during fusion. Finally, detection heads are appended to each scale’s feature map to predict the bounding box locations and classification results for tomato pests and diseases. As shown in Equation (8), the detection head employs lightweight convolutions to predict bounding boxes and classes, and is trained with the following composite loss:

$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\mathrm{cls}} + \lambda_{2}\,\mathcal{L}_{\mathrm{GIoU}} \tag{8}$$

Here, $\mathcal{L}_{\mathrm{cls}}$ is the cross-entropy classification loss, $\mathcal{L}_{\mathrm{GIoU}}$ is the Generalized Intersection over Union (GIoU) bounding box regression loss, and $\lambda_{1}$ and $\lambda_{2}$ are weighting coefficients. The lightweight design of L-FPN constrains the computational load to 0.2 GFLOPs while improving detection accuracy for small and densely distributed targets.
The multi-scale detection head of L-FPN further optimizes computational efficiency and detection performance. Each detection head comprises a set of lightweight convolutional layers, including a 3 × 3 depthwise separable convolution for feature adjustment and two 1 × 1 convolutions for bounding box regression and target classification, respectively.
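A compact sketch of the top-down fusion in Equation (7) and this per-scale head is given below; the 64-channel width, three pyramid levels, and single-anchor 4-value box output are illustrative assumptions rather than the authors' exact configuration. The GIoU term of Equation (8) could be built on torchvision.ops.generalized_box_iou, with cross-entropy for the classification term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_separable(in_ch, out_ch):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
    )

class LFPN(nn.Module):
    """Lightweight FPN: P_i = DWConv(C_i + Up(P_{i+1})), followed by a detection head at every scale."""
    def __init__(self, dim=64, num_classes=10, num_levels=3):
        super().__init__()
        self.smooth = nn.ModuleList([dw_separable(dim, dim) for _ in range(num_levels)])
        self.head_conv = nn.ModuleList([dw_separable(dim, dim) for _ in range(num_levels)])
        self.box_pred = nn.ModuleList([nn.Conv2d(dim, 4, 1) for _ in range(num_levels)])
        self.cls_pred = nn.ModuleList([nn.Conv2d(dim, num_classes, 1) for _ in range(num_levels)])

    def forward(self, feats):                      # feats: [C_1 (high-res), ..., C_L (low-res)]
        pyramid = [self.smooth[-1](feats[-1])]     # deepest level
        for i in range(len(feats) - 2, -1, -1):    # top-down pathway, Equation (7)
            up = F.interpolate(pyramid[0], size=feats[i].shape[-2:], mode="nearest")
            pyramid.insert(0, self.smooth[i](feats[i] + up))
        outputs = []
        for i, p in enumerate(pyramid):
            p = self.head_conv[i](p)               # 3x3 depthwise-separable feature adjustment
            outputs.append((self.box_pred[i](p), self.cls_pred[i](p)))   # two 1x1 heads: boxes, classes
        return outputs

feats = [torch.randn(1, 64, 56, 56), torch.randn(1, 64, 28, 28), torch.randn(1, 64, 14, 14)]
predictions = LFPN()(feats)    # one (box map, class map) pair per scale
```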
2.3. Experimental Setup and Parameter Configuration
Model training was conducted on a workstation equipped with an NVIDIA RTX 4070 GPU (12 GB GDDR6X, 5888 CUDA cores), an Intel® Core i7-14700HX processor, and 32 GB of DDR5 RAM. The deployment platform utilized the NVIDIA Jetson Orin NX embedded module, which features a 6-core ARM Cortex-A78AE CPU, a 1024-core Ampere GPU with 32 Tensor Cores, and 16 GB of LPDDR5 memory. This configuration reflects real-world deployment conditions for agricultural robotics.
The training environment was based on Ubuntu 20.04, Python 3.8, and PyTorch 1.12.0, with GPU acceleration provided by CUDA 11.6 and cuDNN 8.4. For deployment, the Jetson Orin NX ran JetPack 5.1.2 and utilized the TensorRT 8.5 inference engine optimized for FP16 precision.
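The article does not detail the export pipeline; one common route, sketched here as an assumption, is to export the trained model to ONNX and then build an FP16 engine on the Jetson with TensorRT's trtexec tool.

```python
import torch

# `model` is assumed to be the trained AgriLiteNet instance; input is a 224 x 224 RGB image.
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "agrilitenet.onnx",
    input_names=["images"], output_names=["predictions"], opset_version=13,
)

# On the Jetson Orin NX, an FP16 engine can then be built with:
#   trtexec --onnx=agrilitenet.onnx --saveEngine=agrilitenet.engine --fp16
```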
To simulate field conditions, data were collected using an RGB camera (1920 × 1080 resolution, 30 FPS). Input images were resized to 224 × 224 with 3 channels before processing. Detailed configurations are summarized in Table 1, and the embedded deployment setup is illustrated in Figure 9.
2.3.1. Model Training Parameters
During training, input images were resized to 224 × 224 to balance computational efficiency with deployment requirements. A batch size of 32 was used, and the model underwent 300 epochs of training on an NVIDIA RTX 4070 to ensure stable convergence. To enhance generalization and robustness, data augmentation techniques were employed, including random flipping, cropping, and color jittering. These methods simulate field conditions such as varying illumination and partial occlusion, enabling the model to perform reliably in real-world scenarios.
Table 2 provides detailed parameter configurations, including the resolution, training epochs, and data augmentation strategies used in the experiments.
2.3.2. Hyperparameter Configuration
During model optimization, this study selected AdamW as the primary optimizer to facilitate rapid convergence tailored to the lightweight model’s requirements. To enhance parameter update efficiency and stability, a weight decay coefficient of 5 × 10⁻² was set, mitigating overfitting risks and improving the model’s generalization across diverse field scenarios. A dynamic learning rate scheduling strategy was employed, with an initial learning rate of 1 × 10⁻⁴, gradually reduced to 1 × 10⁻⁶ via a cosine annealing mechanism, balancing rapid learning in early training stages with stable convergence in later phases. These settings enable the selection of an optimal model within a limited number of iterations.
Beyond optimization strategies, a suite of data augmentation techniques was implemented, including random horizontal flipping, random cropping, color jittering, and brightness adjustment. These techniques simulate field conditions such as illumination variability, target occlusion, and perspective differences, thereby enhancing the model’s robustness and adaptability. During training, the probabilities for data augmentation were set as follows: random flipping at 0.5, cropping at 0.8, color jittering at 0.2, and brightness adjustment at 0.2. Through these augmentation strategies, the model effectively handles unfamiliar data while maintaining high detection performance. Detailed hyperparameter settings for training are presented in Table 3.
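Putting Tables 2 and 3 together, the training setup can be sketched as follows; the model, data loader, and loss call are placeholders (compute_loss is a hypothetical helper standing in for Equation (8)).

```python
import torch
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.RandomResizedCrop(224)], p=0.8),
    transforms.RandomApply([transforms.ColorJitter(saturation=0.2, hue=0.1)], p=0.2),
    transforms.RandomApply([transforms.ColorJitter(brightness=0.2)], p=0.2),
    transforms.ToTensor(),
])

# `model` and `train_loader` are assumed to be the AgriLiteNet instance and dataset loader defined earlier.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-6)

for epoch in range(300):
    for images, targets in train_loader:
        loss = model.compute_loss(images, targets)   # hypothetical helper wrapping Equation (8)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```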
3. Results
To evaluate the effectiveness of the proposed AgriLiteNet model under realistic agricultural conditions, a comprehensive performance assessment was conducted across multiple model variants and benchmark baselines. The evaluation focuses on the contributions of the L-CSMA and L-FPN modules to attention optimization and multi-scale detection. Specifically, three representative types of tomato pests and diseases were selected to test detection performance: target spot (small target, ~20 × 20 pixels), late blight (large target, ~100 × 100 pixels), and bacterial spot (dense target, ~10 × 10 pixels). The evaluation metric mAP@0.5 was computed using a test set of 1000 images, with approximately 100 images per category. Comparative experiments were also conducted to examine AgriLiteNet’s performance in terms of accuracy, computational efficiency, and deployment suitability for resource-constrained agricultural robots.
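For reference, mAP@0.5 can be computed with an off-the-shelf metric; the snippet below uses torchmetrics' MeanAveragePrecision restricted to the 0.5 IoU threshold, with toy boxes standing in for real detector output.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_thresholds=[0.5])   # evaluate only at IoU = 0.5

preds = [{"boxes": torch.tensor([[20.0, 20.0, 60.0, 60.0]]),
          "scores": torch.tensor([0.95]),
          "labels": torch.tensor([1])}]               # toy prediction for one image
targets = [{"boxes": torch.tensor([[22.0, 18.0, 58.0, 62.0]]),
            "labels": torch.tensor([1])}]             # matching ground truth

metric.update(preds, targets)
print(metric.compute()["map"])                        # equals mAP@0.5 under this threshold setting
```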
3.1. Performance Comparison of AgriLiteNet and Its Variants
As shown in Figure 10, AgriLiteNet-FPN achieved the highest Precision for all targets, with mAP@0.5 scores of 0.9768 for target spot, 0.9874 for late blight, and 0.9891 for bacterial spot. This outstanding performance is attributed to its Feature Pyramid Network (FPN), which effectively enhances multi-scale feature fusion. AgriLiteNet closely followed, achieving mAP@0.5 scores of 0.9754, 0.9864, and 0.9883, respectively, by leveraging a lightweight channel–spatial mixed attention module that balances detection accuracy and efficiency across scales. In contrast, AgriLiteNet-CBAM showed slightly lower Precision, with mAP@0.5 scores of 0.9673, 0.9852, and 0.9828, indicating that its channel and spatial attention mechanisms introduced computational noise that affected detection accuracy.
As summarized in Table 4, AgriLiteNet demonstrates significant advantages in lightweight design. It features 2.0 M parameters, a computational complexity of 0.6 GFLOPs, an inference speed of 35 FPS, and a model size of just 5 MB. In comparison, AgriLiteNet-CBAM has a higher parameter count of 2.9 M, a complexity of 0.65 GFLOPs, an inference speed of 27 FPS, and a model size of 6 MB, while AgriLiteNet-FPN has 3.5 M parameters, 0.8 GFLOPs, 25 FPS, and a 7 MB model size. AgriLiteNet’s lightweight architecture, achieved through depthwise separable convolutions and channel pruning, substantially reduces resource demands, with an inference power consumption of only 15 W, outperforming AgriLiteNet-CBAM (16 W) and AgriLiteNet-FPN (17 W).
3.2. AgriLiteNet Model Analysis
To thoroughly investigate the training dynamics and performance of AgriLiteNet, we tracked key metrics over 300 training epochs, including mAP@0.5, Confidence, Precision, Recall, and F1 score, and visualized their relationships. The experiments utilized the Adam optimizer (learning rate: 0.001, decay rate: 0.1) with a batch size of 32, incorporating Mosaic data augmentation and random flipping to enhance model generalization [26].
The training results for the AgriLiteNet object detection model, as shown in Figure 11, demonstrate significant performance improvements over the training process, reflecting excellent convergence and robust learning capabilities. Over 300 epochs, both training and validation losses decreased markedly, indicating sustained enhancements in localization accuracy and classification performance. Precision improved from 0.49 to 0.99, and Recall dropped from 0.97 to 0.31, showcasing the model’s ability to balance accuracy and comprehensiveness in target detection. The mAP@0.5 metric surged from 0.28 to 0.99, while mAP@0.5:0.95 rose from 0.15 to 0.79, underscoring the model’s effectiveness in addressing detection challenges in complex scenarios through multi-scale feature fusion, achieving an optimal balance between high Precision and high Recall values.
Figure 12 shows the relationships among Confidence, Precision, Recall, and F1 score during training. Confidence exhibited a positive correlation with mAP@0.5, indicating increasing detection certainty as feature learning progressed. The Recall curve rose more rapidly than Precision in the early stages, while Precision improved significantly in later stages, reflecting better classification accuracy over time. The F1 score stabilized above 0.90, indicating a strong balance between Precision and Recall. For small targets such as spider mites (mAP@0.5 = 0.9750) and dense targets such as bacterial spot (mAP@0.5 = 0.9900), the F1 score exceeded 0.92, demonstrating the model’s multi-scale optimization capability.
3.3. AgriLiteNet and Leading Models
The detection performance of AgriLiteNet was evaluated against baseline models—Suppression Mask R-CNN, Cas-VSwin Transformer, YOLOv5n, and GMC-MobileV3. All models were trained for 300 epochs using consistent data augmentation and optimization settings. Performance was measured using mean average precision at an IoU threshold of 0.5 (mAP@0.5).
Figure 13 presents the results based on a test set of 1200 images, approximately 100 per category. Panel (a) shows the mAP@0.5 variation across epochs. Panel (b) compares mAP@0.5 and power consumption, with dot sizes representing model sizes. Suppression Mask R-CNN achieved the highest mAP@0.5 of 0.98955, with 0.99555 on late blight and 0.99206 on Yellow Leaf Curl Disease. Cas-VSwin Transformer achieved an mAP@0.5 of 0.47, including 0.49247 on Mosaic Disease. AgriLiteNet recorded an mAP@0.5 of 0.98735, including 0.9900 on bacterial spot, 0.9874 on late blight, and 0.9750 on spider mites. YOLOv5n and GMC-MobileV3 both achieved an mAP@0.5 of 0.47, with 0.47 on Mosaic Disease. Panel (b) shows that AgriLiteNet has 2.0 million parameters and 0.608 GFLOPs, compared to 44.5 million and 4.8 GFLOPs for Suppression Mask R-CNN, and 30.0 million and 3.2 GFLOPs for Cas-VSwin Transformer. AgriLiteNet runs at 35 frames per second, with a power consumption of 15 watts. YOLOv5s achieves 37 FPS with 16 watts. Compared to YOLOv5n and GMC-MobileV3, AgriLiteNet offers slightly higher computational cost with approximately 0.005 improvement in mAP@0.5.
Table 5 summarizes these metrics, including parameter count, computational complexity, inference speed, model size, and power consumption, confirming that AgriLiteNet maintains a balance between detection accuracy and resource efficiency. It demonstrates suitability for deployment in edge devices such as the NVIDIA Jetson Orin NX.
3.4. Confusion Matrix and Multi-Stage mAP Analysis
The classification performance of AgriLiteNet was evaluated across nine tomato pest and disease categories, comparing it with baseline models: Suppression Mask R-CNN [27], Cas-VSwin Transformer [28], YOLOv5n [29], and GMC-MobileV3 [30]. Confusion matrices were generated at the 300th epoch, and the mean average precision at an intersection-over-union threshold of 0.5, denoted mAP@0.5, was tracked at the 25th, 100th, and 300th epochs. All models were trained with consistent data augmentation and optimization settings.
Figure 14 shows the confusion matrix of AgriLiteNet and other advanced models on a test set of 1000 images (containing nine tomato pests and diseases and healthy leaves, with 100 images for each category), illustrating the classification accuracy and misclassification patterns. AgriLiteNet achieved classification accuracies exceeding 95% for most categories, with 100% accuracy for bacterial spot, early blight, and late blight. Lower accuracies were observed for spider mites at 94% and leaf mold at 95%, attributed to subtle small-target features causing background confusion. Misclassification rates for Mosaic Disease and Yellow Leaf Curl Disease ranged from 1% to 3% due to texture similarities with healthy leaves. Suppression Mask R-CNN recorded the highest accuracies, achieving 100% for bacterial spot, early blight, and late blight, but 95% for spider mites due to resolution limitations. Cas-VSwin Transformer matched Suppression Mask R-CNN on complex targets like Mosaic Disease at 98% but achieved 95% for spider mites, limited by its attention mechanism. YOLOv5n and GMC-MobileV3 showed lower accuracies for spider mites at 92% and leaf mold at 93% and 94%, respectively, reflecting weaker feature extraction.
Figure 15 depicts mAP@0.5 progression at the 25th, 50th, and 300th epochs. At the 25th epoch, mAP@0.5 values ranged from 0.80 to 0.90, with AgriLiteNet at 0.86 to 0.89, YOLOv5n at 0.84 to 0.87, Suppression Mask R-CNN at 0.83 to 0.90, Cas-VSwin Transformer at 0.82 to 0.885, and GMC-MobileV3 at 0.80 to 0.86. By the 50th epoch, mAP@0.5 improved to 0.92 to 0.98, with AgriLiteNet excelling on bacterial spot at 0.97 and late blight at 0.975, and Suppression Mask R-CNN leading on Mosaic Disease at 0.975. At the 300th epoch, AgriLiteNet’s mAP@0.5 reached 0.975 to 0.992, closely approaching Suppression Mask R-CNN at 0.985 to 0.995 and Cas-VSwin Transformer at 0.965 to 0.99, while surpassing YOLOv5n at 0.975 to 0.985 and GMC-MobileV3 at 0.965 to 0.98.
3.5. Visualization of AgriLiteNet’s Experimental Results
The detection capabilities of AgriLiteNet across nine tomato pest and disease categories were visualized using input images, heatmaps, and output detection maps, as shown in Figure 16. These visualizations were generated based on a test set of approximately 1000 images, with around 100 samples per category. The input images highlight the distinctive visual characteristics of each pest and disease, including dense black spots of bacterial spot, ring-shaped lesions of early blight, necrotic patches of late blight, green mold of leaf mold, water-soaked lesions of septoria leaf spot, yellow speckles of spider mites, bullseye lesions of target spot, striations of Mosaic Disease, and curled leaves of Yellow Leaf Curl Disease. Heatmaps, generated using Gradient-weighted Class Activation Mapping (Grad-CAM), highlight the regions the model focuses on [31]. The heatmaps for target spot and early blight concentrate on lesion centers, while bacterial spot and septoria leaf spot show dense activation points. Mosaic Disease exhibits uniform activation across striated regions, indicating effective global feature extraction. Heatmaps for late blight, leaf mold, and spider mites accurately cover critical areas. The output detection maps include bounding boxes, class labels, and Confidence scores. Mosaic Disease achieves a Confidence score of 0.99, with clear delineation of the affected areas. Target spot and early blight lesions are enclosed with Confidence scores between 0.93 and 0.99. Bacterial spot and septoria leaf spot maps display multiple bounding boxes with scores from 0.96 to 0.99, with minor misses due to background interference. Yellow Leaf Curl Disease and late blight are detected with Confidence scores of 0.99 and 0.98, respectively. Leaf mold and spider mites show slight bounding box deviations but maintain Confidence above 0.93. These results indicate that AgriLiteNet can accurately locate and identify various types of pest and disease targets, demonstrating strong multi-scale detection capabilities.
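The heatmaps were produced with Grad-CAM [31]; a minimal hook-based sketch is shown below, assuming the model exposes a per-class score for the category of interest (the authors' exact target layer and scoring head are not specified).

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Return a Grad-CAM heatmap (H, W) for `class_idx` from the activations/gradients of `target_layer`."""
    store = {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

    score = model(image.unsqueeze(0))[0, class_idx]   # assumes a (1, num_classes) score tensor
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = store["grad"].mean(dim=(2, 3), keepdim=True)           # global-average-pool the gradients
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8))[0, 0]
```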
4. Discussion
AgriLiteNet was developed to meet the practical demands of real-time pest and disease detection in precision agriculture, particularly under the constraints of edge computing platforms. The experimental results validate its effectiveness, achieving a mean average precision (mAP@0.5) of 0.98735, closely rivaling heavier models like Suppression Mask R-CNN (0.98955) and Cas-VSwin Transformer (0.98874), while surpassing lighter models like YOLOv5n (0.98249) and GMC-MobileV3 (0.98143). With only 2.0 million parameters, 0.608 GFLOPs, and a power consumption of 15 W, AgriLiteNet delivers an inference speed of 35 FPS on the NVIDIA Jetson Orin NX, making it highly suitable for field deployment.
AgriLiteNet’s performance across diverse target types—small (spider mites, ~20 × 20 pixels), dense (bacterial spot, ~10 × 10 pixels), and large (late blight, ~100 × 100 pixels)—demonstrates its robustness in multi-scale detection. The model achieved 100% classification accuracy for bacterial spot, early blight, and late blight, as shown in the confusion matrices (Figure 14), outperforming expectations for a lightweight model. This success is attributed to the lightweight channel–spatial mixed attention (L-CSMA) module, which dynamically prioritizes critical features, enhancing detection of distinctive patterns like bacterial spot’s dense spots or late blight’s large lesions. The lightweight feature pyramid network (L-FPN) further strengthens multi-scale representation, enabling precise localization of small and dense targets, as evidenced by mAP@0.5 scores of 0.9750 for spider mites and 0.9900 for bacterial spot (Figure 10).
Comparatively, Suppression Mask R-CNN, with 44 million parameters and 12.3 GFLOPs, achieved the highest mAP@0.5 (0.98955) but at a significant computational cost (8 FPS, 22 W), limiting its suitability for edge devices. Cas-VSwin Transformer (28 million parameters, 16.3 GFLOPs) performed well on complex targets like Mosaic Disease (0.49247) but was slower (12 FPS, 20 W). YOLOv5n and GMC-MobileV3, with lower mAP@0.5 (0.98249 and 0.98143, respectively), showed comparable lightweight characteristics but struggled with small targets like spider mites (92–94% accuracy). AgriLiteNet’s balance of accuracy and efficiency, as visualized in Figure 13, positions it as a practical solution for real-time agricultural applications.
Unexpected findings further underscore AgriLiteNet’s design effectiveness. Despite the lightweight architecture (2.0 M parameters, 0.608 GFLOPs), the model achieved 100% classification accuracy for bacterial spot, early blight, and late blight—outperforming expectations for lightweight models. This performance is likely attributed to the lightweight channel–spatial mixed attention (L-CSMA) module, which enhances feature prioritization for distinctive pathological patterns [32]. Additionally, the model maintained a 35 FPS inference speed on the Jetson Orin NX platform, matching the performance of YOLOv5n despite incorporating Transformer-based components, demonstrating the efficiency of the hybrid architecture.
Conversely, the model’s slightly lower accuracy for spider mite detection (94%) was surprising, considering the design’s emphasis on small-target recognition via the feature pyramid network. This performance drop may result from global modeling in the streamlined Swin Transformer overshadowing subtle local features, especially under background clutter, such as dense foliage [33]. Furthermore, visualization results suggest limitations in complex environments involving strong illumination or occlusion, indicating that while the model generalizes well under typical field conditions, its adaptability requires further optimization.
From a theoretical perspective, AgriLiteNet exemplifies the synergistic integration of convolutional neural networks (CNNs) and vision transformers. MobileNetV3 effectively captures local features, aligning with hierarchical representation theories, while the Swin Transformer contributes global contextual awareness, in line with attention-based modeling frameworks. The L-CSMA and lightweight FPN modules further enhance multi-scale detection, validating the architectural choices through both empirical and theoretical lenses.
5. Conclusions
This study introduces AgriLiteNet, a lightweight object detection model optimized for real-time tomato pest and disease detection on edge-computing agricultural robots. By combining MobileNetV3 and a streamlined Swin Transformer with tailored attention and feature fusion modules, AgriLiteNet delivers state-of-the-art accuracy (mAP@0.5 of 0.98735), high-speed inference (35 FPS), and low power consumption (15 W), making it suitable for field deployment on platforms like the NVIDIA Jetson Orin NX.
AgriLiteNet demonstrates superior performance across diverse pest and disease types, with notable precision in detecting small, dense, and large-scale targets. Its compact model size and high computational efficiency support widespread deployment in precision agriculture, contributing to reduced pesticide usage, improved crop health management, and environmental sustainability.
Despite its advantages, the model has limitations, including potential underperformance in detecting subtle early-stage lesions due to the reduced local receptive field of the lightweight Swin Transformer, and dataset bias due to seasonally limited image collection. Future research should focus on expanding the dataset to cover more growth stages and environmental conditions, incorporating multimodal inputs such as infrared imagery, refining attention mechanisms for subtle feature enhancement, and exploring compression techniques like quantization. Integration with other robotic functions, such as precision spraying, may also promote the development of multifunctional intelligent agricultural systems, further extending AgriLiteNet’s impact on sustainable and automated farming.
Author Contributions
Conceptualization, C.Y., B.Z., D.Z. and T.Z.; methodology, B.Z. and Q.L.; software, D.Z. and C.Y.; validation, B.Z., M.M. and T.Z.; formal analysis, T.Z. and Q.L.; investigation, C.Y., M.M. and J.B.; resources, T.Z., Q.L. and J.B.; data curation, D.Z. and C.Y.; writing—original draft preparation, B.Z. and D.Z.; writing—review and editing, C.Y. and M.M.; visualization, C.Y. and M.M.; supervision, C.Y. and T.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in the study are included in the article.
Acknowledgments
The authors express their heartfelt gratitude to Chengyang Yang from Pizhou Shi Mingde Experimental School for his invaluable contributions to the establishment of the dataset used in this study. Special thanks are extended to Jingyu Yang and Zhongxiu Meng from Zouzhuang Central School for their dedicated efforts in cultivating the tomato samples used for experimental purposes. The authors also wish to acknowledge the editors and anonymous reviewers for their constructive comments and insightful suggestions, which significantly enhanced the quality and clarity of this manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Egea, I.; Estrada, Y.; Flores, F.B.; Bolarín, M.C. Improving Production and Fruit Quality of Tomato under Abiotic Stress: Genes for the Future of Tomato Breeding for a Sustainable Agriculture. Environ. Exp. Bot. 2022, 204, 105086.
- Čechura, L.; Žáková Kroupová, Z.; Samoggia, A. Drivers of Productivity Change in the Italian Tomato Food Value Chain. Agriculture 2021, 11, 996.
- Buja, I.; Sabella, E.; Monteduro, A.G.; Chiriacò, M.S.; De Bellis, L.; Luvisi, A.; Maruccio, G. Advances in Plant Disease Detection and Monitoring: From Traditional Assays to In-Field Diagnostics. Sensors 2021, 21, 2129.
- Gehlot, M.; Saxena, R.K.; Gandhi, G.C. “Tomato-Village”: A Dataset for End-to-End Tomato Disease Detection in a Real-World Environment. Multimed. Syst. 2023, 29, 3305–3328.
- Kondoyanni, M.; Loukatos, D.; Maraveas, C.; Drosos, C.; Arvanitis, K.G. Bio-Inspired Robots and Structures toward Fostering the Modernization of Agriculture. Biomimetics 2022, 7, 69.
- Wang, X.; Liu, J. An Efficient Deep Learning Model for Tomato Disease Detection. Plant Methods 2024, 20, 61.
- Moussafir, M.; Chaibi, H.; Saadane, R.; Chehri, A.; Rharras, A.E.; Jeon, G. Design of Efficient Techniques for Tomato Leaf Disease Detection Using Genetic Algorithm-Based and Deep Neural Networks. Plant Soil 2022, 479, 251–266.
- Xu, C.; Ding, J.; Qiao, Y.; Zhang, L. Tomato Disease and Pest Diagnosis Method Based on the Stacking of Prescription Data. Comput. Electron. Agric. 2022, 197, 106997.
- Rai, A.K.; Kumar, N.; Katiyar, D.; Kumar, R. Unlocking Productivity Potential: The Promising Role of Agricultural Robots in Enhancing Farming Efficiency. Int. J. Plant Soil Sci. 2023, 35, 624–633.
- Mohamed, E.S.; Belal, A.A.; Abd-Elmabod, S.K.; El-Shirbeny, M.A.; Gad, A.; Zahran, M.B.; El-Hagarey, M.E.; El-Kafrawy, S.B.; El-Ramady, H.R. Smart Farming for Improving Agricultural Management. Egypt. J. Remote Sens. Space Sci. 2021, 24, 971–981.
- Jin, Y.; Liu, J.; Xu, Z.; Zhang, X.; Wang, Y. Development Status and Trend of Agricultural Robot Technology. Int. J. Agric. Biol. Eng. 2021, 14, 1–19.
- Botta, A.; Cavallone, P.; Baglieri, L.; Sangiovanni, V.; Rizzo, G. A Review of Robots, Perception, and Tasks in Precision Agriculture. Appl. Mech. 2022, 3, 830–854.
- Thangaraj, R.; Anandamurugan, S.; Pandiyan, P.; Kaliappan, V.K. Artificial Intelligence in Tomato Leaf Disease Detection: A Comprehensive Review and Discussion. J. Plant Dis. Prot. 2022, 129, 469–488.
- Chu, H.; Tan, J.; Zhu, D.; Ji, Y. Lightweight Plant Disease Recognition Model Based on Small Datasets in Complex Contexts. Enterp. Inf. Syst. 2024, 18, 1234–1245.
- Ma, L.; Yu, Q.; Yu, H.; Zhang, J. Maize Leaf Disease Identification Based on YOLOv5n Algorithm Incorporating Attention Mechanism. Agronomy 2023, 13, 521.
- Lan, Y.; Lin, S.; Du, H.; Guo, Y.; Deng, X. Real-Time UAV Patrol Technology in Orchard Based on the Swin-T YOLOX Lightweight Model. Remote Sens. 2022, 14, 5806.
- Pagire, V.; Chavali, M.; Kale, A. A Comprehensive Review of Object Detection with Traditional and Deep Learning Methods. Signal Process. 2025, 237, 110075.
- Gao, A.; Geng, A.J.; Song, Y.P.; Ren, L.L.; Zhang, Y.; Han, X. Detection of maize leaf diseases using improved MobileNet V3-small. Int. J. Agric. Biol. Eng. 2023, 16, 225–232.
- Jia, L.; Wang, T.; Chen, Y.; Zhang, Y.; Li, H. MobileNet-CA-YOLO: An Improved YOLOv7 Based on the MobileNetV3 and Attention Mechanism for Rice Pests and Diseases Detection. Agriculture 2023, 13, 1285.
- Jia, W.; Wei, J.; Zhang, Q.; Pan, N.; Niu, Y.; Yin, X.; Ding, Y.; Ge, X. Accurate segmentation of green fruit based on optimized Mask R-CNN application in complex orchard. Front. Plant Sci. 2022, 13, 955256.
- Zeng, M.; Chen, S.; Liu, H.; Wang, W.; Xie, J. HCFormer: A Lightweight Pest Detection Model Combining CNN and ViT. Agronomy 2024, 14, 1940.
- Pacal, I.; Alaftekin, M.; Zengul, F.D. Enhancing Skin Cancer Diagnosis Using Swin Transformer with Hybrid Shifted Window-Based Multi-head Self-attention and SwiGLU-Based MLP. J. Digit. Imaging 2024, 37, 3174–3192.
- Si, H.; Wang, Y.; Zhao, W.; Wang, M.; Song, J.; Wan, L.; Song, Z.; Li, Y.; Fernando, B.; Sun, C. Apple Surface Defect Detection Method Based on Weight Comparison Transfer Learning with MobileNetV3. Agriculture 2023, 13, 824.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision–ECCV 2018, 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VII; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11211, pp. 3–19.
- Xie, J.; Pang, Y.; Nie, J.; Cao, J.; Han, J. Latent Feature Pyramid Network for Object Detection. IEEE Trans. Multimed. 2023, 25, 2153–2163.
- Yang, C.; Chen, Z.; Sonza, R.L. Detection of Crop Diseases and Insect Pests Based on Convolutional Neural Network. In Proceedings of the 2024 12th International Conference on Information and Education Technology (ICIET), Chengdu, China, 21–23 March 2024; pp. 397–401.
- Chu, P.; Li, Z.; Lammers, K.; Lu, R.; Liu, X. Deep Learning-Based Apple Detection Using a Suppression Mask R-CNN. Pattern Recognit. Lett. 2021, 147, 206–211.
- Gao, L.; Zhang, J.; Yang, C.; Zhou, Y. Cas-VSwin Transformer: A Variant Swin Transformer for Surface-Defect Detection. Comput. Ind. 2022, 140, 103689.
- Bie, M.; Liu, Y.; Li, G.; Hong, J.; Li, J. Real-Time Vehicle Detection Algorithm Based on a Lightweight You-Only-Look-Once (YOLOv5n-L) Approach. Expert Syst. Appl. 2023, 213, 119108.
- Tian, X.; Shi, L.; Luo, Y.; Zhang, Y.; Wang, M.; Li, J. Garbage Classification Algorithm Based on Improved MobileNetV3. IEEE Access 2024, 12, 123456.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.
- Yilma, D.; Lee, J.; Kim, Y. A Lightweight Attention-Based Convolutional Neural Network for Tomato Leaf Disease Classification. Agriculture 2022, 12, 228.
- Zhao, K.; Lu, R.; Wang, S.; Yang, X.; Li, Q.; Fan, J. ST-YOLOA: A Swin-Transformer-Based YOLO Model with an Attention Mechanism for SAR Ship Detection under Complex Background. Front. Neurorobot. 2023, 17, 1170163.
Figure 1. Radar chart comparing detection models for tomato pest and disease detection.
Figure 2. Aerial view of Xuzhou modern agricultural park.
Figure 6. Architecture of the lightweight MobileNetV3.
Figure 7. Architecture of the lightweight channel–spatial mixed attention.
Figure 9. Agricultural robot deployment setup.
Figure 10. The mAP trends of AgriLiteNet and its variants over 300 epochs for (a) small objects, (b) large objects, and (c) dense objects.
Figure 11. AgriLiteNet performance variation with 300 epochs.
Figure 12. Relationships of Confidence, Precision, Recall, and F1 score: (a) Precision–Confidence, (b) Recall–Confidence, (c) Precision–Recall and (d) F1–Confidence.
Figure 13. Comprehensive performance comparison of AgriLiteNet with other advanced models: (a) mAP changes over 300 epochs and (b) mAP vs. power consumption.
Figure 14. Confusion matrix of AgriLiteNet and other advanced models on the test set.
Figure 15. Comparison of mAP performance of AgriLiteNet and other advanced models at different epochs.
Figure 16. Input image, heat map and output results of AgriLiteNet on a validation set containing 9 tomato pests and diseases.
Table 1. Training and deployment environment specifications.
Component | Training Environment | Deployment Environment |
---|---|---|
Operating System | Ubuntu 20.04 | JetPack 5.1.2 (Ubuntu 20.04) |
Processor | Intel® Core i7-14700 | 6-core ARM Cortex-A78AE |
Graphics Card | NVIDIA RTX 4070 | 1024-core Ampere GPU |
Memory | 32 GB DDR5 | 16 GB LPDDR5 |
Python Version | 3.8 | 3.8 |
CUDA Version | 11.6 | 11.4 |
cuDNN Version | 8.4 | 8.6 |
TensorRT Version | - | 8.5 |
PyTorch Version | 1.12.0 | 1.12.0 |
Table 2. AgriLiteNet training parameter settings.
Parameter Label | Selected Configuration |
---|---|
Number of Epochs | 300 |
Image Size | 224 × 224 |
Batch Size | 32 |
Data Augmentation Methods | Random flipping, cropping, color jittering, brightness adjustment |
Table 3. AgriLiteNet hyperparameter settings.
Hyperparameter | Setting Value |
---|---|
Optimizer | AdamW |
Learning Rate (Lr0) | 0.0001 |
Weight Decay Coefficient | 0.05 |
Random Horizontal Flip Probability | 0.5 |
Random Crop Probability | 0.8 |
Color Jitter Probability | 0.2 |
Brightness Adjustment Probability | 0.2 |
Table 4. Comprehensive performance statistics of AgriLiteNet and its variants.
Model | mAP@0.5 | Parameters (M) | Computational Complexity (GFLOPs) | Model Size (MB) | FPS | Power Consumption (W) |
---|---|---|---|---|---|---|
AgriLiteNet | 0.98735 | 2.0 | 0.6 | 5 | 35 | 15 |
AgriLiteNet-CBAM | 0.98682 | 2.9 | 0.65 | 6 | 27 | 16 |
AgriLiteNet-FPN | 0.98813 | 3.5 | 0.8 | 7 | 25 | 17 |
Table 5. Comprehensive performance statistics of AgriLiteNet and other advanced models.
Model | mAP@0.5 | Parameters (M) | Computational Complexity (GFLOPs) | Model Size (MB) | FPS | Power Consumption (W) |
---|---|---|---|---|---|---|
AgriLiteNet | 0.98735 | 2.0 | 0.6 | 5 | 35 | 15 |
Suppression Mask R-CNN | 0.98955 | 44.0 | 12.3 | 70 | 8 | 22 |
Cas-VSwin Transformer | 0.98874 | 28.0 | 16.3 | 45 | 12 | 20 |
YOLOv5n-L | 0.98249 | 1.9 | 0.6 | 5 | 35 | 17 |
GMC-MobileNetV3 | 0.98143 | 5.4 | 0.3 | 8 | 30 | 15 |