Lightweight Visual Detection Framework for Real-Time Rice Leaf Disease Identification on Edge Mobile Robots

Xu, Yan; Liu, Yinan; Meng, Xiangchen; Yuan, Qing; Wang, Dazhong; Wu, Liyan; Yue, Xiang; Feng, Longlong; Liu, Cuihong

doi:10.3390/agriculture16131383

by Yan Xu, Yinan Liu, Xiangchen Meng, Qing Yuan, Dazhong Wang, Liyan Wu, Xiang Yue, Longlong Feng and Cuihong Liu *

College of Engineering, Shenyang Agricultural University, Shenyang 110866, China

* Author to whom correspondence should be addressed.

Agriculture 2026, 16(13), 1383; https://doi.org/10.3390/agriculture16131383 (registering DOI)

Submission received: 26 May 2026 / Revised: 20 June 2026 / Accepted: 23 June 2026 / Published: 25 June 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Rice leaf diseases severely threaten global food security, and efficient on-site detection remains challenging for resource-constrained field inspection robots. This work introduces a lightweight visual detection framework designed for the real-time and accurate identification of rice leaf diseases on agricultural edge mobile platforms. A dataset of 4622 annotated images compiled from mobile-device acquisition and publicly available online sources, covering three representative disease categories, together with an independent public benchmark, was used for evaluation. The framework integrates three complementary modules: adaptive multi-scale feature extraction via a dynamic hybrid convolution backbone (C3k2-DICN), cross-scale parameter sharing in the detection head (CSDH) to reduce redundancy, and dual-path downsampling (ADown) to preserve disease-discriminative information during resolution compression. Compared to the YOLO11n baseline, the proposed approach reduced GFLOPs by 36.5% and parameter count by 34.6%, while achieving 88.42% mAP@0.5 and 45.82% mAP@0.5:0.95 on the compiled dataset and 91.71% mAP@0.5 on the public benchmark, indicating accuracy competitive with or superior to all evaluated comparison models. Deployed on an NVIDIA Jetson TX2 with TensorRT FP16 acceleration, the model ran in real time on-device, reaching 32.2 FPS for the TensorRT inference stage and 19.8 FPS for the full end-to-end pipeline including image pre- and post-processing. The framework offers a practical basis for lightweight on-device rice disease detection; closed-loop validation on a moving field robot is left to future work.

Keywords: rice disease detection; lightweight neural network; YOLO; dynamic convolution; edge deployment; field inspection robot; plant protection

1. Introduction

Rice (Oryza sativa L.), as the primary caloric staple for over 3.5 billion people worldwide, plays an irreplaceable strategic role in global food security [1]. However, fungal and bacterial diseases such as blast, brown spot, and bacterial blight persist throughout the entire rice growing season. Estimated mean annual global yield losses due to these diseases are as high as 10–30% [2,3,4], posing a serious threat to agricultural production in major rice-growing regions. Currently, field disease diagnosis remains heavily reliant on in-person surveys and subjective judgment by plant protection experts. Such methods are not only labor-intensive and poorly reproducible, but also extremely limited in coverage efficiency for large-scale continuous planting areas. Once the optimal treatment window is missed, economic losses are often irreversible [5,6]. Against this backdrop, developing an automated disease detection system that can adapt to complex field environments has become an urgent priority in modern plant protection research.

The rapid development of deep learning-based image recognition has opened new pathways for automated agricultural disease diagnosis. Convolutional neural networks can learn end-to-end multi-level visual representations of lesions from raw images, achieving performance far superior to traditional image processing methods in various crop disease detection tasks [7,8]. In particular, the YOLO family of single-stage detection frameworks has been widely adopted in agricultural intelligent sensing due to its fast inference speed and flexible end-to-end deployment [9]. Recent work has further extended this direction: deep learning-based object detection for crop monitoring has been systematically reviewed by Lu et al. [10], covering pest, yield, weed, and growth detection applications that collectively demonstrate the broad applicability of efficient detection architectures in precision agriculture.

Toward higher detection accuracy, researchers have mainly pursued two strategies: stronger backbones and richer multi-scale feature aggregation. Representative works integrate attention or specialized convolution into the YOLO family (for example, the FasterPW-plus-attention design of BMDNet-YOLO [11], the multi-scale small-object optimization of MDAS-YOLO [12], large separable-kernel receptive-field expansion [13], and guided-filtering combined with BiFPN fusion [14]), reporting high accuracy on their respective crop datasets. The shared limitation, however, is that such accuracy gains are usually obtained at the cost of increased parameters and computation, which constrains deployment on resource-limited field devices and motivates explicitly efficiency-oriented designs.

Accordingly, lightweight network design has received considerable attention in agricultural detection. Building on depthwise-separable convolution, efficient channel attention, and compact backbones, models such as DPDB-YOLO [15], DE-YOLO [16], a MobileNetv4–BiFPN fusion [17], and YOLO-VW [18] substantially reduce parameters while maintaining accuracy, and transformer-based and privacy-preserving variants have also been reported [19]. Nevertheless, most of these solutions are tuned for a specific crop or for controlled imaging conditions, and their generalizability to open-field environments is limited. A favorable accuracy–efficiency trade-off for real-time, edge-deployed detection on agricultural robots therefore remains an open problem, which is the gap this work addresses.

Focusing on rice disease detection in field environments, models must address a dual challenge. First, dynamic natural lighting, mutual canopy occlusion, and high diversity in lesion morphology at different disease stages make precise localization and discrimination extremely difficult [20]. Second, the computational overhead of most high-accuracy models far exceeds the capacity of field edge devices such as unmanned aerial vehicles (UAVs) and handheld plant protection instruments [21], creating an engineering gap between laboratory performance and field-usable systems.

The objective of this study is twofold: to design a lightweight rice leaf disease detection model that improves the accuracy–efficiency trade-off for edge deployment, and to assess its on-device inference feasibility on an embedded GPU platform. Two related questions are beyond the present scope and are treated as future work: closed-loop field evaluation on a moving robot, and the determination of an optimal image scale for training or on-device operation. Based on the above analysis, this paper proposes DCA-YOLO, a lightweight rice disease detection model for edge deployment with YOLO11n as the baseline. Our contributions are summarized as follows: (1) We propose C3k2-DICN, a dynamic hybrid convolution module employing a multi-branch structure with data-driven dynamic weight allocation to enhance adaptive perception of lesion features at different scales, achieving co-optimization of detection accuracy and model lightweighting. (2) We introduce the Cross-Scale Shared Detection Head (CSDH), which replaces independently parameterized per-scale prediction branches with a cross-scale parameter-sharing design to compress detection head redundancy. (3) We incorporate and adapt the Adaptive Dual-path Downsampling module (ADown) into YOLO11n, which decouples spatial resolution compression into complementary parallel paths to maximize retention of disease-discriminative information during downsampling, and systematically validate its contribution within the proposed architecture through controlled ablation experiments.

To position these contributions precisely, we clarify the origin of each module. The C3k2-DICN backbone is newly designed in this work: its core Dynamic Inception Mixer, the data-driven dynamic-kernel-weight mechanism, and the morphology-aligned strip-kernel branches form, to our knowledge, a new combination tailored to rice-lesion morphology. The CSDH is a newly designed cross-scale shared detection head that instantiates the general principle of cross-scale parameter sharing as a concrete scale-adaptive-projection plus shared-refinement structure. The ADown module is adopted from YOLOv9 [22] and is integrated and systematically validated within the proposed architecture rather than newly proposed here. To make these distinctions explicit, Table 1 summarizes the architectural provenance of each DCA-YOLO component, indicating which modules are newly designed, adapted, or left unchanged relative to the YOLO11n baseline.

The remainder of this paper is organized as follows: Section 2 presents the materials and methods; Section 3 reports the experimental results; Section 4 discusses the findings; Section 5 concludes the paper.

2. Materials and Methods

2.1. Dataset Compilation and Processing

This study used a rice leaf disease image dataset of 4622 annotated images assembled from two complementary sources: images acquired by the authors using mobile devices, and images obtained from publicly available online repositories. The dataset covers three representative disease categories: rice blast (Magnaporthe oryzae), brown spot (Bipolaris oryzae), and bacterial blight (Xanthomonas oryzae pv. oryzae). Because the images came from heterogeneous sources and were not acquired under a single standardized protocol, their resolution, capture devices, optical settings, and acquisition conditions varied across images and were not systematically documented. None of the images were captured with the onboard camera of the field robot platform described in Section 2.8.

The source images further differ in viewpoint, background, and image quality, and some exhibit shallow depth of field, partial blur, or uneven illumination. Because such characteristics are inherited from the source images rather than introduced by a controlled-capture setup, the shallow depth of field visible in some examples reflects this source heterogeneity rather than a specific imaging configuration. During curation, clearly mislabeled, duplicated, or very-low-quality images were removed. The crop growth stage and disease severity associated with each image were not systematically recorded and consequently cannot be controlled or reported here, which is acknowledged as a limitation in Section 4.

All bounding boxes were manually annotated by three of the authors using the LabelMe annotation tool (version 5.10.1), and every image was cross-checked by at least one other annotator to ensure labeling consistency; images with evidently incorrect or inconsistent labels were discarded during curation. To enable independent verification, the per-class image and bounding-box counts are reported for each split. For the training set, bacterial blight, brown spot, and rice blast comprise 1101, 1096, and 1039 images (1909, 2928, and 3650 bounding boxes), respectively; for the validation set, 317, 337, and 270 images (524, 872, and 918 boxes); and for the test set, 164, 169, and 131 images (268, 376, and 471 boxes). In total, the dataset provides 11,916 annotated lesion instances. Because a few images contain lesions of more than one disease, the per-class image counts sum to slightly more than the corresponding split totals. Because the web-sourced portion of the images is subject to third-party copyright, the images are not redistributed; instead, the curated annotation files and the list of original online sources are provided. The online images were obtained from the Kaggle data-sharing platform (https://www.kaggle.com, accessed on 5 January 2026).
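The reported totals can be cross-checked with simple arithmetic. The sketch below only re-adds the per-class counts quoted above; the dictionary layout and variable names are illustrative and are not part of the released annotation files.

```python
# Per-class counts quoted in the text, ordered as
# (bacterial blight, brown spot, rice blast).
images = {
    "train": (1101, 1096, 1039),
    "val": (317, 337, 270),
    "test": (164, 169, 131),
}
boxes = {
    "train": (1909, 2928, 3650),
    "val": (524, 872, 918),
    "test": (268, 376, 471),
}

boxes_per_split = {split: sum(counts) for split, counts in boxes.items()}
total_boxes = sum(boxes_per_split.values())
# Per-class image sums count an image once per disease it contains.
images_per_split = {split: sum(counts) for split, counts in images.items()}

print(boxes_per_split)   # {'train': 8487, 'val': 2314, 'test': 1115}
print(total_boxes)       # 11916
print(images_per_split)  # {'train': 3236, 'val': 924, 'test': 464}
```

The box counts add up to the stated 11,916 lesion instances. The per-class image sums count an image once for every disease it contains, so they can exceed the number of distinct images in a split.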

Following a stratified random sampling strategy, the dataset was first partitioned at the image level into training (3235 images), validation (924 images), and test (463 images) subsets in a 7:2:1 ratio; the per-class image and bounding-box counts reported above refer to these original, pre-augmentation images. To address class imbalance at the lesion-instance level (bacterial blight has the fewest annotated boxes in the training set), data augmentation was then applied only to the minority-class samples within the training subset after this split, including random rotation (±30°), horizontal and vertical flipping, brightness perturbation (±20%), and Gaussian noise injection, whereas the validation and test subsets retained only original, non-augmented images. Because the partition is performed at the image level and strictly precedes augmentation, no augmented variant of any training image can appear in the validation or test subset, which eliminates train–test data leakage and the associated risk of inflated evaluation results.
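The ordering described above (split first, augment the training minority class afterwards) can be illustrated with a minimal sketch. It assumes one label per image, uses toy identifiers, and replaces the actual image transforms with placeholder names; the helper functions `stratified_split` and `augment_minority` are hypothetical and do not reproduce the exact procedure or split sizes used in this study.

```python
import random
from collections import defaultdict

def stratified_split(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (image_id, label) pairs per class into train/val/test lists."""
    by_class = defaultdict(list)
    for image_id, label in items:
        by_class[label].append(image_id)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        n_train = round(ratios[0] * len(ids))
        n_val = round(ratios[1] * len(ids))
        train += [(i, label) for i in ids[:n_train]]
        val += [(i, label) for i in ids[n_train:n_train + n_val]]
        test += [(i, label) for i in ids[n_train + n_val:]]
    return train, val, test

def augment_minority(train, minority_label, copies=2):
    """Append placeholder augmented variants for one class, training only."""
    extra = [(f"{i}_aug{k}", label)
             for i, label in train if label == minority_label
             for k in range(copies)]
    return train + extra

# Hypothetical toy data: 100 images per class.
items = [(f"img{c}_{n}", c) for c in ("blight", "spot", "blast") for n in range(100)]
train, val, test = stratified_split(items)
train_aug = augment_minority(train, "blight")

# Augmented ids derive only from training ids, so held-out sets stay disjoint.
held_out = {i for i, _ in val + test}
source_ids = {i.split("_aug")[0] for i, _ in train_aug}
print(len(train), len(val), len(test), len(train_aug))  # 210 60 30 350
print(held_out.isdisjoint(source_ids))                  # True
```

Because augmentation only ever reads from the training list, the disjointness check holds by construction, which is the property the paragraph above relies on.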

To evaluate cross-domain generalization capability, this study additionally incorporated a publicly available rice disease dataset from the Roboflow Universe platform (https://universe.roboflow.com/pest-i83ul/rice-leaf-disease-weadf, accessed on 25 May 2026) as an external independent benchmark. This dataset contains 2553 images covering various common rice diseases, and exhibits significant differences from the compiled dataset in image acquisition conditions, capture devices, and annotation styles, and therefore serves as an independent benchmark for testing whether the proposed architectural improvements generalize to a different dataset. Example images from the compiled dataset are shown in Figure 1, and the category and scale distribution of the public dataset are presented in Figure 2.

2.2. Proposed DCA-YOLO Architecture

Object detection has been widely deployed across domains such as agricultural monitoring and industrial inspection [23]. Among existing frameworks, the YOLO family has established mainstream status in agricultural intelligent sensing by virtue of its real-time performance [24,25]. This paper adopts YOLO11n, the lightweight variant of the YOLO11 series [26], as the baseline model. While YOLO11n achieves competitive detection accuracy, two aspects offer room for further improvement toward resource-constrained agricultural edge deployment: (1) the backbone feature extraction can be enhanced with more adaptive multi-scale receptive field design to better capture the morphological heterogeneity of rice lesions; and (2) the independently parameterized per-scale detection head branches introduce parameter redundancy that can be reduced through cross-scale sharing without sacrificing detection performance. To exploit these opportunities, we propose DCA-YOLO, which systematically enhances YOLO11n along three dimensions (Figure 3): C3k2-DICN for adaptive multi-scale feature extraction, a lightweight Cross-Scale Shared Detection Head (CSDH) to reduce detection head redundancy, and ADown to preserve discriminative disease features during downsampling. The network retains the YOLO11 backbone–neck–head structure, with P1–P5 denoting the five backbone feature-extraction stages and the neck following the standard YOLO11 PAN–FPN design that combines a top-down feature pyramid network (FPN) with a bottom-up path-aggregation network (PAN).

2.3. C3k2-DICN Dynamic Hybrid Convolution Module

A fundamental challenge in fine-grained visual recognition is that discriminative patterns vary substantially in scale, orientation, and spatial extent across different object instances. Fixed-kernel convolutional networks are inherently limited in dynamically adapting receptive fields to input content. To address this, this paper proposes the Dynamic Inception Mixer (DIM) module as a data-driven, content-adaptive feature extraction unit. DIM is embedded into the C3k2 module of YOLO11 to construct the improved C3k2-DICN (Dynamic Inception Convolution Network) module (Figure 4a). The DIM module adopts a hierarchically progressive architectural design, composed of three core components.

2.3.1. Dynamic Inception Depth-Wise Convolution Layer (DynamicInceptionDWConv2d)

This layer introduces a data-driven Dynamic Kernel Weights (DKW) mechanism. This mechanism follows the principle of dynamic convolution, which aggregates several candidate kernels through input-dependent attention instead of applying a single fixed kernel [27]. Rice leaf diseases exhibit highly heterogeneous morphological characteristics: blast lesions are typically spindle-shaped with gray-white centers and brown margins; bacterial blight produces elongated stripe-shaped necrotic lesions along the vein direction; and brown spot presents as near-circular scattered distributions with strong background interference.

To address these morphological differences, three depthwise separable convolutional branches are deployed in parallel: the 3 × 3 square kernel captures local texture details of near-circular lesions; the 1 × k horizontal strip kernel (k = 11, empirically selected on the validation set from candidates {7, 9, 11, 13}) aligns with transverse extension of bacterial blight stripe lesions; and the k × 1 vertical strip kernel perceives the longitudinal extension of blast’s spindle-shaped lesions. This decomposition of a depthwise convolution into a square kernel and two orthogonal strip (band) kernels follows the Inception-style depthwise design, in which a large-kernel depthwise convolution is split into a small square kernel and orthogonal band kernels for efficiency [28]. The DKW generator compresses the spatial dimension via global average pooling, generates three sets of channel-level attention weights through a 1 × 1 convolution, and performs adaptive weighted fusion after Softmax normalization:

y = \sigma\left(\mathrm{BN}\left(\sum_{i=1}^{3} \alpha_i(x) \cdot F_i(x)\right)\right)    (1)

where F_i denotes the i-th depthwise convolutional branch, α_i(x) represents the dynamically generated normalized attention weights, σ is the Sigmoid Linear Unit (SiLU) activation function, and BN denotes the batch normalization layer.
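As a numerical illustration of Equation (1), the sketch below fuses three branch responses for a single channel at a single spatial position. It is a simplified scalar version written for clarity: batch normalization is omitted, and the function names `softmax`, `silu`, and `dkw_fuse` as well as the input values are illustrative assumptions rather than the implementation used in this work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def dkw_fuse(branch_outputs, logits):
    """Fuse three branch responses for one channel at one position.

    branch_outputs: F_1(x), F_2(x), F_3(x) as floats.
    logits: the three pre-softmax scores for this channel.
    Batch normalization is omitted to keep the sketch minimal.
    """
    alphas = softmax(logits)
    fused = sum(a * f for a, f in zip(alphas, branch_outputs))
    return alphas, silu(fused)

# Hypothetical values: the second branch receives the largest score.
alphas, y = dkw_fuse([0.2, 1.0, -0.4], [0.0, 2.0, 0.0])
print([round(a, 3) for a in alphas], round(y, 3))
```

With equal scores the three weights reduce to 1/3 each, so the layer falls back to a plain average of the branches; unequal scores shift the response toward the kernel shape favored for the current input.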

2.3.2. Multi-Scale Hybrid Convolution Module (DynamicInceptionMixer)

Rice leaf lesion features exhibit significant complementarity across different receptive field scales: kernel size 3 corresponds to local perception of early-stage micro-lesions, while kernel size 5 accommodates modeling of mid- to late-stage lesion expansion. The DynamicInceptionMixer module adopts a channel splitting and parallel processing strategy: input features are evenly divided along the channel dimension into two groups, respectively fed into DynamicInceptionDWConv2d layers configured with different kernel sizes (default {3, 5}), achieving parallel extraction and aggregation of cross-scale lesion features. After concatenating the two paths, a 1 × 1 convolution performs cross-channel information interaction and feature recalibration (Figure 4c).

2.3.3. Unified Network Building Block (DynamicIncMixerBlock)

DynamicIncMixerBlock constructs the network building unit based on a dual-path residual structure. The feature mixing path applies batch normalization before feeding into DynamicInceptionMixer to extract multi-scale lesion spatial features; the channel interaction path employs a Convolutional Gated Linear Unit (Convolutional GLU) [29] for efficient channel-dimension feature transformation, where its gating branch dynamically suppresses background noise channels. Both residual branches incorporate Layer Scale and DropPath mechanisms:

z_1 = x + \mathrm{DropPath}(\gamma_1 \cdot \mathrm{Mixer}(\mathrm{BN}(x)))    (2)

z_2 = z_1 + \mathrm{DropPath}(\gamma_2 \cdot \mathrm{CGLU}(\mathrm{BN}(z_1)))    (3)

where γ₁ and γ₂ are learnable Layer Scale scaling parameters.
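The two residual updates in Equations (2) and (3) share the same pattern, sketched below for a single sample represented as a list of per-channel values. The `mixer` and `cglu` callables are hypothetical stand-ins for Mixer(BN(·)) and CGLU(BN(·)), and the DropPath variant shown (drop the whole branch or rescale it by the keep probability) is the common stochastic-depth formulation, assumed here rather than taken from the text.

```python
import random

def drop_path(value, drop_prob, training, rng):
    """Stochastic depth for one sample: drop the branch or rescale it."""
    if not training or drop_prob == 0.0:
        return value
    keep_prob = 1.0 - drop_prob
    if rng.random() < keep_prob:
        return [v / keep_prob for v in value]
    return [0.0 for _ in value]

def scaled_residual(x, branch, gamma, drop_prob=0.0, training=False, rng=None):
    """Compute x + DropPath(gamma * branch(x)) element-wise over channels."""
    rng = rng or random.Random(0)
    update = [g * b for g, b in zip(gamma, branch(x))]
    update = drop_path(update, drop_prob, training, rng)
    return [xi + ui for xi, ui in zip(x, update)]

# Hypothetical stand-ins for Mixer(BN(.)) and CGLU(BN(.)).
mixer = lambda v: [2.0 * vi for vi in v]
cglu = lambda v: [vi + 1.0 for vi in v]

x = [1.0, -1.0]
z1 = scaled_residual(x, mixer, gamma=[0.1, 0.1])   # Equation (2), inference mode
z2 = scaled_residual(z1, cglu, gamma=[0.5, 0.5])   # Equation (3), inference mode
print(z1, z2)
```

At inference DropPath is the identity, so each block simply adds a Layer Scale weighted branch output to its input.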

2.4. Cross-Scale Shared Detection Head (CSDH)

A well-recognized challenge in multi-scale visual detection is that independently parameterized prediction heads for each scale lead to parameter redundancy and hinder scale-invariant representation learning. Cross-scale parameter sharing forces the model to use a unified set of refinement parameters across all detection scales, acting as an inductive bias that encourages generalizable, scale-agnostic discriminative features. Weight-shared detection heads across multiple feature scales have also been adopted in recent real-time detectors to reduce head redundancy [30]. The standard YOLO11n detection head independently configures regression and classification branches for each detection scale, with parameter count growing linearly with the number of detection layers. CSDH replaces this redundant design with cross-scale parameter sharing (Figure 5).

The construction of CSDH follows a three-stage logic: scale-adaptive projection, shared refinement, and decoupled head prediction. First, an independent 3 × 3 convolutional projection layer with Group Normalization (GN) [31] (chosen over Batch Normalization because small per-scale batch sizes make BN statistics unreliable, consistent with the finding that batch normalization in a weight-shared head degrades performance owing to inter-scale statistical differences [30]) is configured for each detection scale to map multi-scale features to a unified hidden dimension C_h:

h_i = \mathrm{Conv}^{(i)}_{3 \times 3,\,\mathrm{GN}}(f_i), \quad i = 1, 2, \dots, N_l    (4)

Second, all scale features are jointly fed into a parameter-shared refinement module Φ, consisting of a cascaded 3 × 3 depthwise convolution and 1 × 1 pointwise convolution:

z_i = \Phi(h_i) = \mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}_{3 \times 3}(h_i))    (5)

Third, the shared regression head cv2 and classification head cv3 are applied to the refined features, with a learnable per-scale amplitude scaling factor S_i (initialized to 1.0) applied to the regression output:

b_i = S_i \cdot \mathrm{cv2}(z_i), \quad c_i = \mathrm{cv3}(z_i), \quad y_i = \mathrm{Concat}(b_i, c_i)    (6)

At inference, the final bounding box coordinates and class probabilities are decoded as:

\mathrm{bbox}_i = \mathrm{DFL}(b_i) \times s_i, \quad \mathrm{cls}_i = \sigma(c_i)    (7)

where s_i denotes the stride of detection scale i, DFL denotes Distribution Focal Loss, and σ here denotes the sigmoid function that maps class logits to probabilities (not the SiLU activation of Equation (1)). Because the refinement and prediction parameters are shared across scales, their count is a constant independent of N_l, which significantly reduces the overall model parameter count while maintaining multi-scale detection performance.
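The dependence of the head size on N_l can be made concrete with a parameter count. The sketch below is an illustration under stated assumptions: convolutions are bias-free, Group Normalization contributes one scale and one shift per channel, and the input widths and hidden dimension are hypothetical values, not those of DCA-YOLO. The prediction heads cv2 and cv3 are left out because their layer shapes are not given above; being shared, they would add a further constant term.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weights of a bias-free k x k convolution."""
    return (c_in // groups) * c_out * k * k

def csdh_params(in_channels, hidden):
    """Split head parameters into per-scale and shared parts."""
    # Per-scale 3x3 projection plus GroupNorm scale and shift (2 * hidden).
    per_scale = sum(conv_params(c, hidden, 3) + 2 * hidden for c in in_channels)
    # Shared refinement: 3x3 depthwise followed by 1x1 pointwise convolution.
    shared = (conv_params(hidden, hidden, 3, groups=hidden)
              + conv_params(hidden, hidden, 1))
    return per_scale, shared

# Hypothetical channel widths for two, three, and four detection scales.
for widths in [(64, 128), (64, 128, 256), (64, 128, 256, 512)]:
    per_scale, shared = csdh_params(widths, hidden=64)
    print(len(widths), per_scale, shared)
```

Only the projection term grows with the number of scales; the shared refinement term is the same in every row of the output.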

2.5. Adaptive Dual-Path Downsampling Module (ADown)

The trade-off between spatial resolution compression and information retention is a fundamental design tension in hierarchical visual feature learning. This paper adopts the Adaptive Dual-path Downsampling (ADown) module originally proposed in YOLOv9 [22], which achieves a better balance through a multi-strategy parallel design (Figure 6). The input feature map X is first subjected to progressive pre-downsampling via average pooling:

X_{pre} = \mathrm{AvgPool}(X;\ k = 2,\ s = 1,\ p = 0)    (8)

The pre-processed features are equally divided along the channel dimension into two paths. Path 1 (Semantic-aware path) employs a stride-2 3 × 3 convolution: y₁ = Conv_3×3(x₁, s = 2). Path 2 (Structure-enhancement path) applies max pooling followed by a 1 × 1 convolution: y₂ = Conv_1×1(MaxPool(x₂)). The dual-path outputs are finally concatenated:

y = \mathrm{Concat}(y_1, y_2)    (9)

The channel splitting strategy reduces parameter count by approximately 86% compared to full-channel stride convolution. Furthermore, compared to traditional large-stride convolution (s = 4), ADown’s two-stage progressive downsampling effectively mitigates abrupt information reduction, which is critical for retaining low-level visual features such as rice lesion edges.
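To check that both paths deliver the same spatial size before concatenation, the sketch below traces one spatial axis through the module using the standard output-size formula for convolution and pooling. The pooling settings of Equation (8) are taken from the text; the padding of 1 on the stride-2 convolution and the 3 × 3, stride-2, padding-1 max-pooling window are assumptions (they are not stated above), and the function names are illustrative.

```python
def pooled_size(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def adown_output_size(size):
    """Trace one spatial axis through the two ADown paths."""
    pre = pooled_size(size, kernel=2, stride=1, padding=0)      # Equation (8)
    # Assumed settings: padding 1, and a 3x3 stride-2 max-pooling window.
    path1 = pooled_size(pre, kernel=3, stride=2, padding=1)     # stride-2 3x3 conv
    pooled = pooled_size(pre, kernel=3, stride=2, padding=1)    # max pooling
    path2 = pooled_size(pooled, kernel=1, stride=1, padding=0)  # 1x1 conv
    return pre, path1, path2

print(adown_output_size(640))  # (639, 320, 320)
```

Under these assumptions a 640-pixel axis is first reduced to 639 by the average pooling and then halved to 320 on both paths, so the two outputs can be concatenated along the channel dimension as in Equation (9).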

2.6. Model Training and Inference Settings

Experiments were conducted in a high-performance computing environment; the hardware configuration, software environment, and training parameters are listed in Table 2, and all remaining hyperparameters kept the official Ultralytics default values. To ensure a fair comparison, every model (including all baselines and comparison detectors) was trained and evaluated under identical conditions: the same 640 × 640 input size, the same 300 training epochs, the same SGD optimizer and learning-rate schedule, the same data-augmentation pipeline, and the same training/validation/test splits and stopping criterion. All comparison models were retrained by the authors under these unified settings rather than quoting results from the original publications, and none of the models (DCA-YOLO and every baseline alike) were initialized from COCO- or ImageNet-pretrained weights; all networks were trained from random initialization. For the YOLO-family detectors, the hyperparameter configuration was kept identical to that of DCA-YOLO, and no hyperparameters were individually tuned per model, so the reported differences reflect architectural design rather than tuning or pretraining advantages.

2.7. Evaluation Metrics

This paper constructs an evaluation framework from two dimensions (model efficiency and detection accuracy) to comprehensively assess the practical value of the proposed lightweight approach. Parameter count (Parameters/M) and floating-point operations (GFLOPs) directly determine whether the model can achieve real-time deployment on resource-constrained agricultural edge devices.

Precision (P) and recall (R) measure detection quality from complementary dimensions:

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}    (10)

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative detections, respectively. Mean average precision (mAP) is obtained by averaging the area under the P–R curve across all categories:

\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R)\, dR    (11)

where N denotes the total number of target categories. This paper primarily reports mAP@0.5 (Intersection over Union (IoU) threshold = 0.5) and mAP@0.5:0.95 (averaged over IoU ∈ {0.50, 0.55, …, 0.95}), the latter serving as the core accuracy metric given its more stringent localization requirements.
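The quantities in Equations (10) and (11) can be sketched in a few lines. The example below computes precision and recall from detection counts, the IoU between two axis-aligned boxes, the mean over per-class AP values, and the list of IoU thresholds averaged in mAP@0.5:0.95. All counts and box coordinates are hypothetical, and the integration of the P–R curve that yields each per-class AP is not shown.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, as in Equation (10)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_ap(per_class_ap):
    """mAP as the unweighted mean of per-class AP values, Equation (11)."""
    return sum(per_class_ap) / len(per_class_ap)

# The ten IoU thresholds averaged in mAP@0.5:0.95.
thresholds = [round(0.50 + 0.05 * k, 2) for k in range(10)]

# Hypothetical counts and boxes, for illustration only.
print(precision_recall(tp=80, fp=20, fn=10))
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(thresholds)
```

In this toy case the two boxes overlap on half of each box, giving an IoU of one third, so the prediction would not count as a true positive at the 0.5 threshold.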

2.8. Edge Deployment and On-Device Measurement Protocol

To assess the on-device inference feasibility of DCA-YOLO on a representative edge platform, this study used the NVIDIA Jetson TX2 (NVIDIA Corporation, Santa Clara, CA, USA) as the target edge computing unit, mounted on a self-developed wheeled robot platform. The Jetson TX2 features a 256-core Pascal GPU (1.3 GHz), dual-core Denver2 and quad-core ARM Cortex-A57 heterogeneous CPU (up to 2.0 GHz), 8 GB 128-bit LPDDR4 memory, and 32 GB eMMC5.1 storage, with a rated TDP of 15 W.

The deployment pipeline follows the standard PyTorch → ONNX → TensorRT acceleration paradigm using TensorRT 8.2.1 (paired with JetPack 4.6.2). FP16 mixed precision was adopted for two reasons: INT8 inference on Pascal-architecture GPUs requires layer-wise activation calibration that is sensitive to the high-frequency texture features of small lesion targets, whereas FP16 halves weight storage relative to FP32 and confines the quantization-induced accuracy loss to within 0.1 percentage points. All measurements were conducted under Jetson TX2 15 W Max-N mode at 640 × 640 input resolution, averaged over the 463 test images. The optimization pipeline is detailed for reproducibility: the model was exported to ONNX (opset 13) and compiled with TensorRT using a 4 GB workspace, automatic layer/tensor fusion, and a fixed input (batch size = 1); 50 warm-up iterations preceded timing and 300 measured iterations were averaged, and no pruning or re-parameterization beyond the proposed modules was applied.
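The warm-up and averaging protocol can be sketched as a small timing harness. This is an illustrative outline only: `dummy_infer` is a hypothetical stand-in for one inference call, and on a GPU the timed region would additionally need to wait for the device to finish its work, a step not shown here. The helper `fps_to_ms` converts the frame rates reported in the abstract into per-frame latencies.

```python
import time

def benchmark(infer, warmup=50, iters=300):
    """Return mean latency in ms and FPS for a zero-argument callable."""
    for _ in range(warmup):          # warm-up runs are not timed
        infer()
    start = time.perf_counter()
    for _ in range(iters):
        infer()
    elapsed = time.perf_counter() - start
    mean_ms = 1000.0 * elapsed / iters
    return mean_ms, 1000.0 / mean_ms

def fps_to_ms(fps):
    """Per-frame latency in milliseconds implied by a frame rate."""
    return 1000.0 / fps

# Hypothetical stand-in for one TensorRT inference call.
dummy_infer = lambda: sum(range(1000))
mean_ms, fps = benchmark(dummy_infer)

print(round(fps_to_ms(32.2), 1))  # 31.1
print(round(fps_to_ms(19.8), 1))  # 50.5
```

Read this way, the reported 32.2 FPS for the inference stage corresponds to roughly 31 ms per frame and the 19.8 FPS end-to-end figure to roughly 50.5 ms per frame, the difference being the time spent on pre- and post-processing.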

3. Results

3.1. Detection Results

The detection results of DCA-YOLO on the compiled dataset are shown in Figure 7c. Figure 7a shows the original disease images, and Figure 7b shows the ground truth annotations. DCA-YOLO accurately identified the primary disease regions in the images, with detection bounding boxes highly consistent with the ground truth annotations and relatively few missed and false detections.

Figure 8 further illustrates the details of the detection results. Across the panels, correctly localized high-confidence detections are distinguished from false or missed detections; the complete color and line scheme of each panel is defined in the figure caption. Such false or missed detections mainly occur where the lesion area is small, the contrast with the background is low, or multiple disease types overlap. These results indicate that DCA-YOLO performs well on typical samples but still risks missed detections on extremely difficult samples, which points to a direction for subsequent model optimization.

In the field, rice plants are frequently affected by more than one disease at the same time. DCA-YOLO localizes and classifies bacterial blight, brown spot, and rice blast in a single forward pass; as shown in Figure 7, it detects multiple lesion instances within a single image, including images in which lesions of different diseases co-occur. The normalized confusion matrix in Figure 9 further shows that the three diseases are rarely mistaken for one another; the only non-negligible inter-disease error is about 1% of brown-spot instances predicted as rice blast. The dominant error mode is therefore missed detection of small, low-contrast lesions or background false positives rather than confusion between co-occurring diseases. Disease categories beyond the three studied here, and the stronger visual interference when many lesions of different diseases overlap densely, remain for future work.

3.2. Ablation Study

To systematically evaluate the independent contribution of each improved module to the model’s overall performance, this paper takes the original YOLO11n as the baseline and sequentially incorporates the DICN, CSDH, and ADown modules, designing four groups of controlled experiments. Results are presented in Table 3.

The baseline model achieved an mAP@0.5 of 87.18% at 6.3 GFLOPs, serving as the reference. Introducing DICN alone raised mAP@0.5 to 87.77% (+0.59 pp), reduced the parameter count from 2.6 M to 2.3 M, and lowered GFLOPs to 5.8. Further introducing CSDH raised mAP@0.5 to 88.33% (+1.15 pp over baseline) and lowered GFLOPs to 5.1; the marginal change in mAP@0.5:0.95 at this stage (+0.01 pp) suggests that CSDH’s parameter-sharing design primarily benefits category discrimination rather than bounding box localization precision. After introducing ADown, the complete model achieved an mAP@0.5 of 88.42% (+1.24 pp) and an mAP@0.5:0.95 of 45.82% (+0.66 pp over baseline), with GFLOPs reduced to 4.0 (a 36.5% reduction) and the parameter count to 1.7 M (a 34.6% reduction). Each module contributed incrementally across accuracy and efficiency metrics, supporting the contribution of each component and their complementary gains.

It should be noted, however, that Table 3 follows a sequential-addition protocol, which does not fully isolate an individual module from interaction effects. To address this, each module was additionally added individually to the YOLO11n baseline (mAP@0.5 = 87.18%, mAP@0.5:0.95 = 45.16%, 6.3 GFLOPs, 2.6 M parameters). Added on its own, C3k2-DICN reached 87.77% mAP@0.5 and 45.25% mAP@0.5:0.95 (5.8 GFLOPs, 2.32 M parameters); CSDH reached 88.01% and 45.40% (5.6 GFLOPs, 2.42 M); and ADown reached 88.31% and 45.28% (5.3 GFLOPs, 2.10 M). Every module therefore improved mAP@0.5 over the baseline on its own (by +0.59, +0.83, and +1.13 percentage points, respectively) while simultaneously reducing both computation and parameters, and the full DCA-YOLO (88.42% mAP@0.5, 45.82% mAP@0.5:0.95) exceeded every single-module variant, indicating that the three modules contribute complementary, partly synergistic gains rather than a single dominant effect.
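The relative reductions and accuracy gains quoted above follow directly from the reported values, as the short check below shows. It uses only numbers stated in this section; the dictionary keys and the helper name `relative_reduction` are illustrative.

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

baseline = {"map50": 87.18, "map50_95": 45.16, "gflops": 6.3, "params_m": 2.6}
full = {"map50": 88.42, "map50_95": 45.82, "gflops": 4.0, "params_m": 1.7}

print(round(relative_reduction(baseline["gflops"], full["gflops"]), 1))      # 36.5
print(round(relative_reduction(baseline["params_m"], full["params_m"]), 1))  # 34.6
print(round(full["map50"] - baseline["map50"], 2))                           # 1.24
print(round(full["map50_95"] - baseline["map50_95"], 2))                     # 0.66

# Single-module mAP@0.5 gains over the baseline, in percentage points.
single = {"C3k2-DICN": 87.77, "CSDH": 88.01, "ADown": 88.31}
gains = {name: round(v - baseline["map50"], 2) for name, v in single.items()}
print(gains)  # {'C3k2-DICN': 0.59, 'CSDH': 0.83, 'ADown': 1.13}
```

GFLOPs and parameter changes are relative reductions in percent, whereas the accuracy changes are absolute differences in percentage points.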

To assess run-to-run variability, the full model was trained five times with different random seeds; the results (mean ± standard deviation) were mAP@0.5 = 88.37 ± 0.14%, mAP@0.5:0.95 = 45.83 ± 0.11%, precision = 81.74 ± 0.73%, recall = 84.68 ± 1.19%, and F1 = 83.17 ± 0.34% (GFLOPs and parameter count unchanged at 4.0 and 1.68 M). The single-run values reported above (88.42% and 45.82%) fall within one standard deviation of these means, and the improvement in mAP@0.5 over the baseline (about 1.2 percentage points) is far larger than the seed-induced standard deviation (0.14 percentage points). One-sample t-tests of the five-seed results against the baseline reference values indicate that the improvements in mAP@0.5 and mAP@0.5:0.95 are statistically significant (p < 0.001), treating each baseline value as a fixed reference; a fully paired test would additionally require multi-seed runs of the baseline, which we identify as a straightforward extension.
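The one-sample t statistics can be approximated from the reported summary statistics alone. The sketch below assumes that the quoted standard deviations are sample standard deviations over the five seeds (the text does not state this), and it compares the statistics with the standard two-sided critical value of Student's t for 4 degrees of freedom at α = 0.001 instead of computing exact p-values. The function name `one_sample_t` is illustrative.

```python
import math


def one_sample_t(mean: float, sd: float, n: int, reference: float) -> float:
    """t statistic for H0: population mean == reference, from summary statistics."""
    standard_error = sd / math.sqrt(n)
    return (mean - reference) / standard_error


N_SEEDS = 5
# Two-sided critical value of Student's t, df = 4, alpha = 0.001 (standard table value).
T_CRIT_DF4_ALPHA_001 = 8.610

# Five-seed mean and SD of the full model vs. the fixed baseline reference value.
t_map50 = one_sample_t(mean=88.37, sd=0.14, n=N_SEEDS, reference=87.18)
t_map50_95 = one_sample_t(mean=45.83, sd=0.11, n=N_SEEDS, reference=45.16)

print(f"t(mAP@0.5)      = {t_map50:.1f}")
print(f"t(mAP@0.5:0.95) = {t_map50_95:.1f}")
print("exceeds critical value:",
      t_map50 > T_CRIT_DF4_ALPHA_001,
      t_map50_95 > T_CRIT_DF4_ALPHA_001)
```

Because the inputs are rounded to two decimals, the statistics are approximate; under the stated assumption both exceed the critical value, which is consistent with the reported p < 0.001.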

3.3. Per-Class Performance and Error Analysis

Because a single global mAP can mask uneven behavior across categories, per-class detection performance on the test set is reported in Table 4, and the three per-class AP values average to the overall 88.42% mAP@0.5 and 45.82% mAP@0.5:0.95 reported above. At the IoU = 0.5 level, brown spot was the strongest class and bacterial blight the weakest, the latter showing the lowest recall (77.24%) and AP@0.5 (82.40%), consistent with its thin, elongated, low-contrast stripe lesions being comparatively easy to miss. The ranking reversed under the stricter AP@0.5:0.95 criterion: bacterial blight attained the highest value (52.48%) and rice blast the lowest (41.05%), indicating that blast lesions, although readily detected at IoU = 0.5, are harder to localize tightly because of their small, diffuse, and irregular appearance, whereas detected bacterial-blight lesions are localized comparatively tightly.

To make the error structure explicit, the normalized confusion matrix on the test set is shown in Figure 9. The matrix is column-normalized, so each column sums to one over the instances of the corresponding true class; the background column is normalized in the same way over the total false-positive detections, and its entries are therefore proportions rather than raw counts. Inter-class confusion among the three diseases is negligible: the only non-zero off-diagonal entry between diseases corresponds to two brown-spot instances predicted as rice blast (about 1%), so bacterial blight, brown spot, and rice blast are almost never mistaken for one another. The dominant error mode is therefore confusion with the background rather than between diseases: 16%, 5%, and 8% of the true bacterial-blight, brown-spot, and rice-blast instances, respectively, were missed (assigned to background), while some background regions were detected as disease, most often as rice blast (45% of such cases), followed by bacterial blight (31%) and brown spot (25%). The normalized diagonal entries of the confusion matrix are 0.84 (bacterial blight), 0.94 (brown spot), and 0.92 (rice blast). Although Table 4 and Figure 9 were produced by the same test-set evaluation, these diagonal entries are not intended to duplicate the Table 4 recall values: Table 4 reports the per-class recall returned by the detection metrics, whereas the confusion matrix provides a threshold-dependent, normalized summary of class-wise prediction outcomes. The clear separation among the disease classes is the quantitative basis for the discrimination discussed in Section 4, whereas the GradCAM++ maps in Section 3.5 provide only qualitative support. The confusion matrix also complements the qualitative detection examples in Figure 8.
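The column-normalization property can be checked against the proportions quoted in the text. The sketch below arranges those rounded values with rows as predicted classes and columns as true classes; this orientation, and setting the undefined background-to-background cell to zero, are assumptions made for illustration, since Figure 9 itself is not reproduced here.

```python
from typing import List

CLASSES = ["bacterial_blight", "brown_spot", "rice_blast", "background"]

# Rows = predicted class, columns = true class (same order as CLASSES).
# Entries are the rounded proportions quoted in the text.
MATRIX: List[List[float]] = [
    # true:  BB    BS    RB    background
    [0.84, 0.00, 0.00, 0.31],  # predicted bacterial blight
    [0.00, 0.94, 0.00, 0.25],  # predicted brown spot
    [0.00, 0.01, 0.92, 0.45],  # predicted rice blast
    [0.16, 0.05, 0.08, 0.00],  # predicted background (missed lesions)
]


def column_sums(matrix: List[List[float]]) -> List[float]:
    """Sum of each column, rounded to two decimals."""
    n_cols = len(matrix[0])
    return [round(sum(row[j] for row in matrix), 2) for j in range(n_cols)]


def inter_disease_confusion(matrix: List[List[float]], n_disease: int = 3) -> float:
    """Largest off-diagonal proportion among the disease classes only."""
    return max(
        matrix[i][j]
        for i in range(n_disease)
        for j in range(n_disease)
        if i != j
    )


sums = column_sums(MATRIX)
for name, total in zip(CLASSES, sums):
    print(f"{name:>16}: column sum = {total:.2f}")
print("max disease-to-disease confusion:", inter_disease_confusion(MATRIX))
```

The three disease columns sum to one, while the background column sums to 1.01, presumably because the three quoted shares (45%, 31%, 25%) are individually rounded.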

Beyond aggregate accuracy, the dominant failure modes can be characterized directly from the per-class and confusion-matrix results. The principal weaknesses are (i) missed detections of bacterial blight, whose thin, low-contrast stripe lesions give the lowest recall (77.24%) and are most often assigned to the background, and (ii) background false positives that are disproportionately labeled as rice blast, together with the comparatively loose localization of rice-blast lesions at strict IoU thresholds (AP@0.5:0.95 = 41.05%), consistent with their small, diffuse, and irregular margins. Because the images were compiled from heterogeneous public sources rather than captured with the field robot, robustness under field-specific conditions (partial occlusion and mutual canopy shading, motion blur from robot movement, and cast shadows or strongly uneven field illumination) could not be evaluated here and is left to future work using field-collected imagery (Section 4). This analysis clarifies where detection is currently most reliable and motivates the targeted improvements discussed in Section 4.

3.4. Comparison with State-of-the-Art Methods

To evaluate DCA-YOLO against existing detectors, representative mainstream object detection models were selected for comparison, covering the two-stage detector Faster R-CNN [32] and multiple single-stage YOLO variants [33,34,35,36,37,38]. A supplementary evaluation was also conducted on the Roboflow public dataset to assess cross-dataset generalization. Results are presented in Table 5 and Table 6, respectively.

As shown in Table 5, DCA-YOLO achieved the best overall performance on the compiled dataset: an mAP@0.5 of 88.42% and an mAP@0.5:0.95 of 45.82%, with only 4.0 GFLOPs and 1.7 M parameters. Notably, DCA-YOLO surpassed Faster R-CNN (mAP@0.5 = 87.80%) while requiring only about 1/33 of its GFLOPs (4.0 vs. 134.0) and 1/24 of its parameters (1.7 M vs. 41.4 M), showing that a lightweight architecture can match two-stage detector accuracy at a fraction of the computational cost. Among lightweight single-stage models, YOLOv9t had the parameter count closest to that of DCA-YOLO (2.0 M vs. 1.7 M), yet DCA-YOLO outperformed it on both mAP@0.5 (88.42% vs. 87.40%, +1.02 pp) and mAP@0.5:0.95 (45.82% vs. 44.60%, +1.22 pp) while requiring 47.4% fewer GFLOPs (4.0 vs. 7.6), indicating that the proposed architectural changes deliver accuracy gains alongside efficiency benefits. Table 6 excludes Faster R-CNN (134.0 GFLOPs, 41.4 M parameters) and YOLOv7-tiny (13.0 GFLOPs, 6.0 M parameters) because their resource footprints already disqualify them from the edge deployment scenario evaluated here; the remaining models represent practically viable lightweight alternatives. As further shown in Table 6, on the Roboflow public dataset DCA-YOLO achieved the best results (mAP@0.5 = 91.71%, mAP@0.5:0.95 = 56.78%), outperforming the comparison methods evaluated here and supporting the cross-dataset robustness of the proposed framework.

3.5. Feature Visualization Analysis

To examine qualitatively where the network attends, GradCAM++ was used to compare the feature activation distributions of the baseline model and DCA-YOLO. GradCAM++ produces spatially localized class activation heatmaps by weighting feature maps with coefficients derived from first- and higher-order gradients of the target class score with respect to those feature maps.

As shown in Figure 10, for three typical disease samples (bacterial blight, brown spot, and blast), the baseline model exhibited dispersed response regions and uneven activation intensity: in bacterial blight samples, substantial activation energy was distributed in non-lesion background regions; in brown spot samples, activation intensity at some lesion positions was low; in blast samples, activation boundaries were blurred. In contrast, DCA-YOLO heatmaps showed marked improvements across all three disease types: activation regions were highly concentrated on the lesion body, and background misactivation was clearly suppressed. These visualizations are qualitatively consistent with the effectiveness of DCA-YOLO’s improvements at the feature level.

3.6. On-Device Inference Performance

Following the deployment and measurement protocol described in Section 2.8 (the wheeled robot platform that hosts the Jetson TX2 edge unit is shown in Figure 11), the on-device inference performance of DCA-YOLO was measured as follows.

DCA-YOLO was benchmarked under two timing scopes. For the TensorRT inference stage alone (including host–device memory transfer and stream synchronization but excluding image I/O and post-processing), the model achieved an average latency of 31.0 ms per frame, corresponding to 32.2 FPS. For the complete end-to-end pipeline (image loading, letterbox resizing, normalization, host-to-device transfer, TensorRT inference, device-to-host transfer, and non-maximum suppression), the average latency was 50.6 ms per frame, corresponding to 19.8 FPS on the held-out test images. Even under this stricter end-to-end measure, the throughput remains sufficient for the slow traverse speeds typical of field-inspection robots, as quantified in Section 4. It should be noted, however, that the reduction in GFLOPs does not translate proportionally into a higher on-device frame rate: the multi-branch and dynamic-convolution operations of DCA-YOLO lower the theoretical computation but introduce additional memory-access and kernel-launch overhead that partly offsets the expected reduction in inference time. The lightweighting benefit of the proposed design therefore manifests as a smaller model size, a reduced memory footprint, and a lower per-frame energy cost (together with the improved detection accuracy) rather than as a higher inference frame rate. The deployed model has only 1.7 M parameters and occupies 7.21 MB in ONNX format, fitting comfortably within the memory and storage constraints of the Jetson TX2 platform (8 GB LPDDR4, 32 GB eMMC). These results indicate that DCA-YOLO fits within the compute, memory, and storage budget of this edge platform.
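The relation between the two timing scopes can be illustrated with a few lines of Python. The sketch converts the reported average latencies to frame rates and estimates the share of end-to-end time spent outside the TensorRT stage, assuming the two averages are directly comparable; the helper name `fps_from_latency_ms` is illustrative. From the rounded 31.0 ms figure the inference-stage rate comes out at about 32.3 FPS, which agrees with the reported 32.2 FPS to within rounding of the latency.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second implied by an average per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


INFERENCE_MS = 31.0   # TensorRT stage incl. host-device transfers (reported)
END_TO_END_MS = 50.6  # full pipeline incl. image loading, preprocessing, NMS (reported)

fps_inference = fps_from_latency_ms(INFERENCE_MS)
fps_end_to_end = fps_from_latency_ms(END_TO_END_MS)

# Time per frame spent outside the TensorRT stage (loading, letterbox,
# normalization, NMS), assuming the two averages are comparable.
overhead_ms = round(END_TO_END_MS - INFERENCE_MS, 1)
overhead_share_pct = round(100.0 * overhead_ms / END_TO_END_MS, 1)

print(f"inference stage: {fps_inference:.2f} FPS")
print(f"end-to-end:      {fps_end_to_end:.2f} FPS")
print(f"non-inference overhead: {overhead_ms} ms ({overhead_share_pct}% of end-to-end)")
```

Under this assumption roughly two fifths of the end-to-end latency lies outside the network itself, which suggests that CPU-side pre- and post-processing is a natural target for further optimization.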

Beyond throughput, the following on-device runtime indicators were recorded to characterize the deployment cost. These were captured with NVIDIA’s tegrastats utility at one-second intervals and comprise GPU and CPU utilization, system memory usage, VDD_IN input power, and on-chip temperature. During pure TensorRT inference, the average GPU utilization was 95.0% (peak 97%) and the average CPU utilization was 10.6%, indicating that inference is GPU-bound; in the end-to-end setting, the average GPU utilization fell to 56.3% while the GPU waited on CPU-side image loading, preprocessing, and NMS, and the average CPU utilization rose to 13.9%. Because the Jetson TX2 uses a shared-memory architecture without dedicated video memory, memory consumption is reported as system RAM: the peak footprint during inference was 2624 MB for pure inference and 2650 MB end-to-end. The average input power measured from the VDD_IN rail was 10.5 W during pure inference and 8.3 W end-to-end (peaks of 10.8 W and 8.6 W, respectively), both well within the platform’s 15 W rating; the peak on-chip temperature remained at 50 °C (GPU ≤ 38 °C) throughout the measurement window. Operating endurance was not estimated, as it depends on the battery capacity of the eventual mobile platform, which is not yet fixed. Importantly, these indicators were obtained on the held-out test images with the Jetson TX2 running on the field robot platform while the platform was stationary; the system was not evaluated during continuous robot motion in the field. Closed-loop in-field validation (including the effects of robot travel speed, image-acquisition rate, camera mounting and angle, vibration, and motion blur) is identified as future work (Section 4 and Section 5).
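A rough board-level energy cost per processed frame can be estimated as average input power multiplied by average per-frame latency. The sketch below is a back-of-the-envelope calculation rather than a measured result: it assumes that the reported VDD_IN averages and the reported latencies describe the same steady-state operation, and the function name `energy_per_frame_j` is illustrative.

```python
def energy_per_frame_j(avg_power_w: float, latency_ms: float) -> float:
    """Approximate energy per frame in joules: power (W) x time per frame (s)."""
    return avg_power_w * latency_ms / 1000.0


# Reported averages: VDD_IN input power and per-frame latency for each timing scope.
e_inference = energy_per_frame_j(avg_power_w=10.5, latency_ms=31.0)
e_end_to_end = energy_per_frame_j(avg_power_w=8.3, latency_ms=50.6)

print(f"pure inference: ~{e_inference:.2f} J per frame")
print(f"end-to-end:     ~{e_end_to_end:.2f} J per frame")
```

Although the average power is lower in the end-to-end setting, the longer time per frame makes the estimated energy per processed frame higher there, so the end-to-end figure is the more relevant one for endurance planning once a battery capacity is fixed.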

4. Discussion

The experimental results indicate that DCA-YOLO achieves a favorable balance between detection accuracy and computational efficiency for rice disease detection on the datasets evaluated in this study. The three proposed modules (C3k2-DICN, CSDH, and ADown) contribute complementary improvements, as shown by the ablation study. The progressive stacking of the three modules shows monotonically improving trends across all metrics, supporting their contribution; the isolated effect of each module is further examined through the single-module ablation reported in Section 3.2.

The C3k2-DICN module addresses the inherent limitation of fixed-kernel convolutions in adapting to the highly heterogeneous morphological characteristics of rice diseases. The multi-branch dynamic design is intended to provide content-adaptive feature extraction for blast’s spindle-shaped lesions, bacterial blight’s stripe-shaped necrotic patterns, and brown spot’s near-circular distributions, which correspond to distinct pathogen types. The CSDH’s parameter-sharing strategy primarily improves classification-level discrimination, contributing a 0.56 percentage point gain in mAP@0.5 (87.77%→88.33%) while its effect on mAP@0.5:0.95 is marginal (+0.01 pp); this is consistent with the design intent of CSDH, which targets scale-agnostic feature sharing for category discrimination rather than bounding box localization precision. The ADown module’s dual-path design makes the most significant contribution to mAP@0.5:0.95 (+0.56 pp, from 45.26% to 45.82%), reflecting its role in preserving disease-discriminative edge and texture detail during downsampling, precisely the low-level spatial information required for accurate lesion boundary localization under the stringent IoU thresholds used in mAP@0.5:0.95 evaluation. Importantly, attributing improved discrimination among the three diseases specifically to these modules rests on the per-class metrics and confusion matrix reported in Section 3.3, which quantify how well the classes are separated, together with the module-wise ablation in Section 3.2; the GradCAM++ maps in Section 3.5 illustrate where the network attends but do not by themselves establish class discrimination. It should also be stated plainly that, although the mAP@0.5 values are high, the mAP@0.5:0.95 values (around 45.8% on the compiled dataset) are moderate rather than excellent. This is expected under the stricter localization criterion for small, low-contrast, and diffuse lesions, and is compounded by the variable quality and heterogeneous resolution of the source images; the principal advantage of DCA-YOLO therefore lies in its accuracy–efficiency trade-off rather than in absolute localization accuracy.

From an agricultural application perspective, DCA-YOLO’s on-device profile on the Jetson TX2 (19.8 FPS end-to-end and 32.2 FPS for the inference stage, 1.7 M parameters, 7.21 MB ONNX model) provides on-device throughput adequate for slow-traverse field inspection and makes it a promising candidate for future integration into wheeled robots and UAV-mounted systems. The 34.6% reduction in parameter count translates into a smaller runtime memory footprint, which would leave additional headroom on the Jetson TX2 for concurrent on-board processes such as the robot operating system, chassis control, and path planning that share the same 8 GB LPDDR4 memory pool. The reduced per-inference computational cost (4.0 vs. 6.3 GFLOPs) also lowers energy consumption per frame, which would help extend battery endurance on battery-powered platforms. Compared with recent lightweight agricultural object-detection studies, including rice disease detection and related crop or grain inspection tasks [2,5,6,16,18,21], DCA-YOLO achieves a favorable accuracy–efficiency balance, providing a practical basis toward closing the gap between laboratory performance and in-field operation, which remains to be validated. As a planning consideration for future field deployment rather than a demonstrated result, the end-to-end throughput of 19.8 FPS corresponds to one processed frame roughly every 51 ms; at an illustrative travel speed of 0.5–1.0 m/s this would place one processed frame about every 2.5–5.1 cm of forward motion, suggesting the throughput should be adequate for dense canopy coverage once the system is deployed and validated in motion.
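The frame-spacing estimate is simple arithmetic: the distance travelled between two processed frames equals travel speed multiplied by end-to-end latency. The sketch below reproduces the 2.5 to 5.1 cm range and inverts the relation to give the speed that corresponds to a target spacing; the function names and the 10 cm target are hypothetical values chosen for illustration, not parameters of the authors' robot.

```python
def frame_spacing_cm(speed_m_s: float, latency_ms: float) -> float:
    """Forward distance (cm) covered between consecutive processed frames."""
    return speed_m_s * (latency_ms / 1000.0) * 100.0


def max_speed_m_s(target_spacing_cm: float, latency_ms: float) -> float:
    """Travel speed (m/s) at which processed frames are `target_spacing_cm` apart."""
    return (target_spacing_cm / 100.0) / (latency_ms / 1000.0)


END_TO_END_MS = 50.6  # reported end-to-end latency per frame

for speed in (0.5, 1.0):  # illustrative travel speeds from the text
    spacing = frame_spacing_cm(speed, END_TO_END_MS)
    print(f"{speed:.1f} m/s -> one processed frame every {spacing:.2f} cm")

# Hypothetical target: at most 10 cm of travel between processed frames.
print(f"speed for 10 cm spacing: {max_speed_m_s(10.0, END_TO_END_MS):.2f} m/s")
```

Whether a given spacing is dense enough depends on the camera footprint on the canopy, which is not specified in this study and would have to be measured on the deployed platform.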

Machine-learning detection of foliar plant diseases has advanced rapidly, and several recent studies report very high accuracy on both fungal diseases such as rice blast and brown spot and bacterial diseases such as bacterial blight. For example, YOLO-LeafNet attains an mAP@0.5 of about 0.99 across four crop species [39], and BGM-YOLO improves the detection of small lesions against complex natural backgrounds [40]. Such figures, however, are not directly comparable with the present results, because they are typically reported on curated single-leaf datasets (for example, PlantVillage-style images) with relatively clean backgrounds and often with substantially heavier models. YOLO-LeafNet, for instance, requires about 28.5 GFLOPs, roughly seven times the 4.0 GFLOPs of DCA-YOLO. The dataset used in this study is instead deliberately heterogeneous and field-representative, with varied capture devices, resolutions, viewpoints, and illumination; this lowers absolute accuracy but more faithfully reflects open-field conditions. Relative to recent lightweight agricultural detectors for rice disease recognition and other crop or grain inspection scenarios [2,5,6,16,18,21], DCA-YOLO attains competitive or higher accuracy with markedly fewer parameters (1.7 M) and less computation (4.0 GFLOPs). Its contribution therefore lies primarily in an accuracy–efficiency trade-off that makes on-device, real-time inference feasible.

The framework-agnostic nature of the proposed modules means they can be readily integrated into other YOLO variants or lightweight detection architectures for broader agricultural sensing tasks, including pest detection, weed identification, and crop growth monitoring [10,41,42]. GradCAM++ visualization further suggests that DCA-YOLO attends to lesion regions while markedly suppressing background activation, offering qualitative interpretive support rather than direct proof of class discrimination; this is important for building trust among plant protection practitioners who may use these systems in practice.

The compiled dataset is visually diverse, spanning many capture devices, resolutions, and imaging conditions. This diversity exposes the model to a wide range of appearances and should improve its robustness to incidental variation in image quality, but it is not a substitute for controlled domain coverage. Because the capture metadata are undocumented, the variation cannot be resolved into specific, characterized domains, and the experiments therefore cannot quantify how well the model would transfer to systematically different settings such as new geographic regions, growth stages, seasons, or rice varieties, where symptoms may look different. The strong result on the independent public benchmark (Table 6) points to some cross-dataset generalization, but establishing robust cross-domain transfer would require controlled, multi-site, multi-season data collection.

Key limitations of this study include: (1) the compiled dataset covers only three major rice diseases (blast, brown spot, bacterial blight) and does not encompass other common pathogens such as sheath blight and false smut that may co-occur in field conditions; and (2) all experiments were conducted under a single geographic and seasonal setting, and performance under variable weather, growth stages, and regional disease strains remains to be validated. To avoid over-generalization, we explicitly distinguish demonstrated performance from expected generalization: the reported accuracy is demonstrated only on the two evaluated datasets (one assembled from mobile-device and online images and one public benchmark), whereas robustness across different rice varieties, growth stages, weather conditions, and regional disease strains (all of which may alter the visual appearance of symptoms) is anticipated but not yet verified, and should be confirmed through dedicated multi-site, multi-season trials before broad field claims are made. Three further limitations follow from the nature of the data and the evaluation. (3) As detailed in Section 2.1, the images were compiled from heterogeneous mobile-device and online sources with uncontrolled, undocumented capture settings (resolution, aperture, depth of field), and some show shallow depth of field or blur; this variability constrains image quality and reproducibility. (4) The disease growth stage and severity of the source images are not documented, so the dataset cannot support questions about an optimal image scale, stage-specific accuracy, or whether the crop was naturally infected or artificially inoculated. (5) The model was evaluated on static images and in on-device tests on the field robot platform with the platform stationary; no closed-loop evaluation was performed during continuous robot motion in the field. As a direct consequence of (4), the early-warning capacity of the system (that is, reliable detection of incipient, early-stage symptoms in time to support management decisions) cannot be established from the present data and is identified as an explicit objective for future work using stage-labeled, field-collected imagery. Future work should extend the disease category coverage, apply knowledge distillation and quantization-aware training to further reduce deployment overhead, and validate generalization across diverse agricultural deployment scenarios.

5. Conclusions

This study set out to design a lightweight rice leaf disease detection model that improves the accuracy–efficiency trade-off for edge deployment and to assess its on-device inference feasibility on an embedded GPU. Both objectives were met. The proposed DCA-YOLO, built on YOLO11n with three complementary modules (C3k2-DICN, CSDH, and ADown), attains detection accuracy competitive with or better than mainstream detectors while using substantially less computation, and it runs in real time on an NVIDIA Jetson TX2. The main finding is that careful, content-adaptive lightweight design, rather than added model capacity, can deliver accurate rice disease detection within the compute, memory, and power budget of a field-deployable edge platform, helping to narrow the gap between laboratory performance and on-device operation.

These results should be interpreted together with the conditions under which they were obtained: accuracy was demonstrated on heterogeneous, mobile- and web-sourced imagery for three diseases, and on-device throughput was measured with the platform stationary rather than during continuous field motion, so absolute localization accuracy and field robustness remain to be confirmed. Future work will broaden the disease and domain coverage and pursue closed-loop validation on a moving robot, supported by further model compression through knowledge distillation and quantization-aware training.

Author Contributions

Conceptualization, Y.X. and C.L.; Methodology, Y.X., Y.L. and X.M.; Software, Y.L.; Validation, Y.X., Y.L. and Q.Y.; Formal Analysis, Y.L.; Investigation, Y.X.; Resources, C.L.; Data Curation, Y.L., D.W. and L.W.; Writing—Original Draft Preparation, Y.L.; Writing—Review & Editing, Y.X. and C.L.; Visualization, Y.L. and X.Y.; Supervision, C.L.; Project Administration, C.L.; Funding Acquisition, L.F. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Foundation of Education Department of Liaoning Province, grant number LJ212510157008.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The image dataset used in this study comprises images acquired by the authors with mobile devices together with images from publicly available online sources; because the online portion is subject to third-party copyright, the original images are not redistributed by the authors. To support reproducibility, the curated bounding-box annotation files, the per-class annotation statistics, and the exact training/validation/test split lists, together with the list of original image sources (the public online repository identified in Section 2.1, namely the Kaggle data-sharing platform, https://www.kaggle.com, accessed on 5 January 2026), are made available upon reasonable request to the corresponding author. The public benchmark dataset is openly available at https://universe.roboflow.com/pest-i83ul/rice-leaf-disease-weadf (accessed on 25 May 2026).

Acknowledgments

The authors thank the College of Engineering, Shenyang Agricultural University, for providing research facilities and support. During the preparation of this work, the authors used ChatGPT (GPT-5.5 Thinking; OpenAI, San Francisco, CA, USA) in order to improve language readability and check grammar. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Jain, S.; Sahni, R.; Khargonkar, T.; Gupta, H.; Verma, O.P.; Sharma, T.K.; Bhardwaj, T.; Agarwal, S.; Kim, H. Automatic rice disease detection and assistance framework using deep learning and a chatbot. Electronics 2022, 11, 2110.
2. Li, P.; Zhou, J.; Sun, H.; Zeng, J. RDRM-YOLO: A high-accuracy and lightweight rice disease detection model for complex field environments based on improved YOLOv5. Agriculture 2025, 15, 479.
3. Ramli, N.A.; Pratondo, A.; Sulaiman, S.A.; Yusoff, W.N.S.W.; Abu, N. Detection of paddy plant diseases using Google Teachable Machine. In Recent Advances on Soft Computing and Data Mining; Springer: Cham, Switzerland, 2024; pp. 360–369.
4. Zhou, T.; Wei, L. YOLO-DP: A detection model of fifteen common rice diseases and pests. Sci. Rep. 2025, 15, 35968.
5. Fang, K.; Zhou, R.; Deng, N.; Li, C.; Zhu, X. RLDD-YOLOv11n: Research on rice leaf disease detection based on YOLOv11. Agronomy 2025, 15, 1266.
6. Pan, C.; Wang, S.; Wang, Y.; Liu, C. SSD-YOLO: A lightweight network for rice leaf disease detection. Front. Plant Sci. 2025, 16, 1643096.
7. Kanna, S.K.; Ramalingam, K.; Pazhanivelan, P.; Jagadeeswaran, R.; Prabu, P.C. YOLO deep learning algorithm for object detection in agriculture: A review. J. Agric. Eng. 2024, 55, 1641.
8. Xu, Y.; Yu, H.; Wu, L.; Song, Y.; Liu, C. Contingency planning of visual contamination for wheeled mobile robots with chameleon-inspired visual system. Electronics 2023, 12, 2365.
9. Sangaiah, A.K.; Anandakrishnan, J.; Devarapelly, A.R.; Mohamad, M.L.A.B.; Bian, G.B.; Alenazi, M.J.F. R-UAV-Net: Enhanced YOLOv4 with graph-semantic compression for transformative UAV sensing in paddy agronomy. IEEE Trans. Cogn. Commun. Netw. 2025, 11, 1197–1209.
10. Lu, H.; Dong, B.; Zhu, B.; Ma, S.; Zhang, Z.; Peng, J.; Song, K. A survey on deep learning-based object detection for crop monitoring: Pest, yield, weed, and growth applications. Vis. Comput. 2025, 41, 3037–3058.
11. Sun, H.; Wang, R.F. BMDNet-YOLO: A lightweight and robust model for high-precision real-time recognition of blueberry maturity. Horticulturae 2025, 11, 1202.
12. Ma, B.; Xu, J.; Liu, R.; Mu, J.; Li, B.; Xie, R.; Liu, S.; Hu, X.; Zheng, Y.; Zhang, H.; et al. MDAS-YOLO: A lightweight adaptive framework for multi-scale and dense pest detection in apple orchards. Horticulturae 2025, 11, 1273.
13. Yang, L.; Guo, F.; Zhang, H.; Cao, Y.; Feng, S. Research on lightweight rice false smut disease identification method based on improved YOLOv8n model. Agronomy 2024, 14, 1934.
14. Zhang, E.; Zhang, H. An intelligent apple identification method via the collaboration of YOLOv5 algorithm and fast-guided filter theory. J. Circuits Syst. Comput. 2024, 33, 2450188.
15. Jia, H.; Zhang, L.; Liang, X.; Yin, P.; You, H.; Li, D. DPDB-YOLO: A lightweight YOLOv13 cherry tomato ripeness detection method with adaptive extraction module and multi-scale feature fusion architecture. Ind. Crops Prod. 2025, 238, 122419.
16. Liang, Z.; Xu, X.; Yang, D.; Liu, Y. The development of a lightweight DE-YOLO model for detecting impurities and broken rice grains. Agriculture 2025, 15, 848.
17. Fang, L.; Gao, G.; Li, J.; Zhang, Z. Lightweight and efficient real-time tomato detection based on improved YOLOv11 network. In Proceedings of the 40th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Zhengzhou, China, 17–19 May 2025; pp. 1971–1976.
18. Liao, J.; He, X.; Liang, Y.; Wang, H.; Zeng, H.; Luo, X.; Li, X.; Zhang, L.; Xing, H.; Zang, Y. A lightweight cotton Verticillium wilt hazard level real-time assessment system based on an improved YOLOv10n model. Agriculture 2025, 15, 911.
19. Zhang, H.; Ren, G. Intelligent leaf disease diagnosis: Image algorithms using Swin Transformer and federated learning. Vis. Comput. 2025, 41, 4815–4838.
20. Gan, B.; Pu, G.; Xing, W.; Wang, L.; Liang, S. Enhanced YOLOv8 with lightweight and efficient detection head for detecting rice leaf diseases. Sci. Rep. 2025, 15, 22179.
21. Wang, J.; Ma, S.; Wang, Z.; Ma, X.; Yang, C.; Chen, G.; Wang, Y. Improved lightweight YOLOv8 model for rice disease detection in multi-scale scenarios. Agronomy 2025, 15, 445.
22. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2025; pp. 1–21.
23. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275.
24. Gong, X.; Yu, J.; Zhang, H.; Dong, X. AED-YOLO11: A small object detection model based on YOLO11. Digit. Signal Process. 2025, 166, 105411.
25. Mi, J.; Gan, Z.; Tan, P.; Chang, X.; Wang, Z.; Xie, H. Pavement crack detection based on Star-YOLO11. CMC Comput. Mater. Contin. 2026, 86, 1–22.
26. Zhang, X.; Wei, L.; Yang, R. TriPerceptNet: A lightweight multi-scale enhanced YOLOv11 model for accurate rice disease detection in complex field environments. Front. Plant Sci. 2025, 16, 1614929.
27. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039.
28. Yu, W.; Zhou, P.; Yan, S.; Wang, X. InceptionNeXt: When Inception meets ConvNeXt. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5672–5683.
29. Shi, D. TransNeXt: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2024; pp. 17773–17783.
30. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784.
31. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
33. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
34. Olorunshola, O.E.; Irhebhude, M.E.; Evwiekpaefe, A.E. A comparative study of YOLOv5 and YOLOv7 object detection algorithms. J. Comput. Soc. Inform. 2023, 2, 1–12.
35. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on YOLOv8 and its advancements. In Data Intelligence and Cognitive Informatics; Springer: Singapore, 2024; pp. 529–545.
36. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
37. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
38. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
39. Kaur, R.; Mittal, U.; Wadhawan, A.; Almogren, A.; Singla, J.; Bharany, S.; Hussen, S.; Rehman, A.U.; Al-Huqail, A.A. YOLO-LeafNet: A robust deep learning framework for multispecies plant disease detection with data augmentation. Sci. Rep. 2025, 15, 28513.
40. Yu, C.; Xie, J.; Tony, F.J.A. BGM-YOLO: An accurate and efficient detector for detecting plant disease. PLoS ONE 2025, 20, e0322750.
41. Xu, Y.; Qiao, X.; Ding, L.; Li, X.; Chen, Z.; Yue, X. Enhanced YOLOv5 with ECA module for vision-based apple harvesting using a 6-DOF robotic arm in occluded environments. Agriculture 2025, 15, 1850.
42. Yue, X.; Qi, K.; Na, X.; Zhang, Y.; Liu, Y.; Liu, C. Improved YOLOv8-Seg network for instance segmentation of healthy and diseased tomato plants in the growth stage. Agriculture 2023, 13, 1643.

Figure 1. Example images from the compiled dataset and target scale distribution: (a) rice diseases; (b) target scale distribution.

Figure 2. Disease categories, proportions, and target scale distribution of the public dataset: (a) rice disease categories and proportions; (b) target scale distribution.

Figure 3. Improved YOLO11 architecture diagram. Modules newly designed in this work are outlined in red.

Figure 4. C3k2-DICN architecture diagram: (a) overall C3k2-DICN structure; (b) DynamicIncMixerBlock; (c) DynamicInceptionMixer; (d) DynamicInceptionDWConv2d (core).

Figure 5. CSDH architecture diagram.

Figure 6. ADown architecture diagram.

Figure 7. Example detection results of DCA-YOLO on the compiled dataset: (a) original images; (b) ground truth annotations; (c) detection results. The bounding boxes denote the predicted disease category: blue for bacterial blight, cyan for brown spot, and white for rice blast; the ground-truth annotations in panel (b) are drawn in green.

Figure 8. Details of example detection results. (a,b) bacterial blight; (c) brown spot; (d) rice blast. Bounding boxes denote the predicted disease category: blue for bacterial blight, cyan for brown spot, and white for rice blast; ground-truth annotations are shown in green. A green line marks a correctly detected lesion, whereas a red line or circle marks a missed detection.

Figure 9. Normalized confusion matrix of DCA-YOLO on the compiled-dataset test set (rows: predicted class; columns: true class). The background row corresponds to missed detections (false negatives), and the background column to false positives (background regions predicted as a disease).

Figure 10. Feature visualization heatmaps comparing the YOLO11n baseline and DCA-YOLO across the three disease types.

Figure 11. The self-developed wheeled robot platform used to host the Jetson TX2 edge unit, intended for future in-field deployment.

Table 1. Architectural provenance of the DCA-YOLO components relative to the YOLO11n baseline.

Component	Network Position	Basis/Source	Designation
YOLO11n	Overall detection framework	Ultralytics YOLO11	Baseline (adopted unchanged)
C3k2-DICN	Backbone	This work: Dynamic Inception Mixer with data-driven dynamic kernel-weighting and morphology-aligned strip-kernel branches	Newly designed
CSDH	Detection head	This work: scale-adaptive projection with shared refinement, realizing cross-scale parameter sharing	Newly designed
ADown	Downsampling	YOLOv9 [22]	Adapted and integrated; validated by ablation
Neck	Feature aggregation	Standard YOLO11 PAN-FPN	Unchanged

Table 2. Experimental environment and parameter configuration.

Category	Configuration Item	Parameter Value
Hardware Platform	CPU	Intel Xeon Gold 6430 (Intel Corporation, Santa Clara, CA, USA; 16 cores, 2.00 GHz)
	Memory	120 GB
	GPU	NVIDIA GeForce RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA)
Software Environment	Operating System	Linux
	Deep Learning Framework	PyTorch 2.2.2
	CUDA Version	12.1
	Python Version	3.10.14
Training Parameters	Input Size	640 × 640
	Training Epochs	300
	Batch Size	32
	Optimizer	SGD
	Initial Learning Rate	0.01
	Momentum Factor	0.937
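
The training parameters in Table 2 can be collected into a single configuration mapping. The following is a minimal sketch, not the authors' training script: the key names (imgsz, epochs, batch, optimizer, lr0, momentum) are assumed to follow Ultralytics training-argument naming, since Table 1 lists Ultralytics YOLO11 as the baseline, and the file names in the trailing comment are hypothetical placeholders.

```python
# Training hyperparameters copied from Table 2.
# Key names follow Ultralytics-style training arguments (an assumption).
train_args = {
    "imgsz": 640,        # input size 640 x 640
    "epochs": 300,       # training epochs
    "batch": 32,         # batch size
    "optimizer": "SGD",  # optimizer
    "lr0": 0.01,         # initial learning rate
    "momentum": 0.937,   # momentum factor
}


def describe(args: dict) -> str:
    """Render the configuration as a single comma-separated line."""
    return ", ".join(f"{key}={value}" for key, value in args.items())


print(describe(train_args))

# With Ultralytics installed, such a mapping would typically be passed as
# keyword arguments (model and dataset file names below are hypothetical):
#   from ultralytics import YOLO
#   YOLO("dca-yolo.yaml").train(data="rice.yaml", **train_args)
```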

Table 3. Ablation study results on the compiled-dataset test set.

Model	DICN	CSDH	ADown	mAP@0.5 (%)	mAP@0.5:0.95 (%)	GFLOPs	Params (M)
YOLO11n	×	×	×	87.18	45.16	6.3	2.6
YOLO11n + DICN	✓	×	×	87.77	45.25	5.8	2.3
YOLO11n + DICN + CSDH	✓	✓	×	88.33	45.26	5.1	2.2
DCA-YOLO	✓	✓	✓	88.42	45.82	4.0	1.7

Note: ✓ indicates the module is used; × indicates the module is not used.
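
The relative savings quoted in the abstract follow directly from the first and last rows of Table 3. As a short worked check (the function and variable names are illustrative, not from the paper):

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline * 100.0


# First and last rows of Table 3.
yolo11n = {"GFLOPs": 6.3, "Params (M)": 2.6, "mAP@0.5 (%)": 87.18}
dca_yolo = {"GFLOPs": 4.0, "Params (M)": 1.7, "mAP@0.5 (%)": 88.42}

for key in ("GFLOPs", "Params (M)"):
    saving = relative_reduction(yolo11n[key], dca_yolo[key])
    print(f"{key}: {saving:.1f}% lower than YOLO11n")

# Accuracy is compared in absolute percentage points, not relative terms.
gain = dca_yolo["mAP@0.5 (%)"] - yolo11n["mAP@0.5 (%)"]
print(f"mAP@0.5 gain: {gain:.2f} percentage points")
```

Rounded to one decimal place, this reproduces the 36.5% GFLOPs and 34.6% parameter reductions reported in the abstract.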

Table 4. Per-class detection performance of DCA-YOLO on the compiled-dataset test set.

Class	Precision (%)	Recall (%)	F1 (%)	AP@0.5 (%)	AP@0.5:0.95 (%)
Bacterial blight	75.56	77.24	76.39	82.40	52.48
Brown spot	90.52	88.30	89.40	93.57	43.93
Rice blast	82.50	83.10	82.80	89.28	41.05
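
The F1 column and the overall mAP values can be recomputed from the per-class entries in Table 4. The sketch below assumes the standard definitions (F1 as the harmonic mean of precision and recall, mAP as the unweighted mean of per-class AP); the variable names are illustrative.

```python
# Per-class rows of Table 4: (precision, recall, AP@0.5, AP@0.5:0.95), all in %.
per_class = {
    "Bacterial blight": (75.56, 77.24, 82.40, 52.48),
    "Brown spot": (90.52, 88.30, 93.57, 43.93),
    "Rice blast": (82.50, 83.10, 89.28, 41.05),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


for name, (p, r, _, _) in per_class.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")

# Unweighted mean over the three classes.
map50 = sum(row[2] for row in per_class.values()) / len(per_class)
map5095 = sum(row[3] for row in per_class.values()) / len(per_class)
print(f"mAP@0.5 = {map50:.2f}%, mAP@0.5:0.95 = {map5095:.2f}%")
```

Under these definitions the class means round to 88.42% and 45.82%, matching the DCA-YOLO row of Table 3.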

Table 5. Comparative experimental results of different models on the compiled-dataset test set.

Model	mAP@0.5 (%)	mAP@0.5:0.95 (%)	GFLOPs	Params (M)
Faster R-CNN	87.80	45.20	134.0	41.4
YOLOv7-tiny	84.80	39.70	13.0	6.0
YOLOv5n	87.60	45.20	7.1	2.5
YOLOv5s	86.50	44.10	23.8	9.1
YOLOv8n	87.30	43.90	8.1	3.0
YOLOv9t	87.40	44.60	7.6	2.0
YOLOv10n	85.80	43.30	6.5	2.3
YOLO11n	87.18	45.16	6.3	2.6
YOLOv12n	87.20	44.70	6.3	2.6
DCA-YOLO (Ours)	88.42	45.82	4.0	1.7

Table 6. Comparative experimental results of different models on the public-dataset test set.

Model	mAP@0.5 (%)	mAP@0.5:0.95 (%)	GFLOPs	Params (M)
YOLOv5n	87.69	52.05	7.1	2.5
YOLOv5s	89.33	53.73	23.8	9.1
YOLOv8n	85.93	52.55	8.1	3.0
YOLOv9t	89.30	53.55	7.6	2.0
YOLOv10n	86.98	51.66	6.5	2.3
YOLO11n	88.84	54.62	6.3	2.6
YOLOv12n	80.99	45.24	6.3	2.6
DCA-YOLO (Ours)	91.71	56.78	4.0	1.7
