1. Introduction
Rice (Oryza sativa L.), as the primary caloric staple for over 3.5 billion people worldwide, plays an irreplaceable strategic role in global food security [1]. However, fungal and bacterial diseases such as blast, brown spot, and bacterial blight persist throughout the entire rice growing season. Estimated mean annual global yield losses due to these diseases are as high as 10–30% [2,3,4], posing a serious threat to agricultural production in major rice-growing regions. Currently, field disease diagnosis remains heavily reliant on in-person surveys and subjective judgment by plant protection experts. Such methods are not only labor-intensive and poorly reproducible, but also extremely limited in coverage efficiency for large-scale continuous planting areas. Once the optimal treatment window is missed, economic losses are often irreversible [5,6]. Against this backdrop, developing an automated disease detection system that can adapt to complex field environments has become an urgent priority in modern plant protection research.
The rapid development of deep learning-based image recognition has opened new pathways for automated agricultural disease diagnosis. Convolutional neural networks can learn end-to-end multi-level visual representations of lesions from raw images, achieving performance far superior to traditional image processing methods in various crop disease detection tasks [
7,
8]. In particular, the YOLO family of single-stage detection frameworks has been widely adopted in agricultural intelligent sensing due to its fast inference speed and flexible end-to-end deployment [
9]. Recent work has further extended this direction: deep learning-based object detection for crop monitoring has been systematically reviewed by Lu et al. [
10], covering pest, yield, weed, and growth detection applications that collectively demonstrate the broad applicability of efficient detection architectures in precision agriculture.
Toward higher detection accuracy, researchers have mainly pursued two strategies: stronger backbones and richer multi-scale feature aggregation. Representative works integrate attention or specialized convolution into the YOLO family (for example, the FasterPW-plus-attention design of BMDNet-YOLO [
11], the multi-scale small-object optimization of MDAS-YOLO [
12], large separable-kernel receptive-field expansion [
13], and guided-filtering combined with BiFPN fusion [
14]), reporting high accuracy on their respective crop datasets. The shared limitation, however, is that such accuracy gains are usually obtained at the cost of increased parameters and computation, which constrains deployment on resource-limited field devices and motivates explicitly efficiency-oriented designs.
Accordingly, lightweight network design has received considerable attention in agricultural detection. Building on depthwise-separable convolution, efficient channel attention, and compact backbones, models such as DPDB-YOLO [
15], DE-YOLO [
16], a MobileNetv4–BiFPN fusion [
17], and YOLO-VW [
18] substantially reduce parameters while maintaining accuracy, and transformer-based and privacy-preserving variants have also been reported [
19]. Nevertheless, most of these solutions are tuned for a specific crop or for controlled imaging conditions, and their generalizability to open-field environments is limited. A favorable accuracy–efficiency trade-off for real-time, edge-deployed detection on agricultural robots therefore remains an open problem, which is the gap this work addresses.
For rice disease detection in field environments, models must address a dual challenge. First, dynamic natural lighting, mutual canopy occlusion, and high diversity in lesion morphology at different disease stages make precise localization and discrimination extremely difficult [20]. Second, the computational overhead of most high-accuracy models far exceeds the capacity of field edge devices such as unmanned aerial vehicles (UAVs) and handheld plant protection instruments [21], creating an engineering gap between laboratory performance and field-usable systems.
The objective of this study is twofold: to design a lightweight rice leaf disease detection model that improves the accuracy–efficiency trade-off for edge deployment, and to assess its on-device inference feasibility on an embedded GPU platform. Two related questions are beyond the present scope and are treated as future work: closed-loop field evaluation on a moving robot, and the determination of an optimal image scale for training or on-device operation. Based on the above analysis, this paper proposes DCA-YOLO, a lightweight rice disease detection model for edge deployment with YOLO11n as the baseline. Our contributions are summarized as follows: (1) We propose C3k2-DICN, a dynamic hybrid convolution module employing a multi-branch structure with data-driven dynamic weight allocation to enhance adaptive perception of lesion features at different scales, achieving co-optimization of detection accuracy and model lightweighting. (2) We introduce the Cross-Scale Shared Detection Head (CSDH), which replaces independently parameterized per-scale prediction branches with a cross-scale parameter-sharing design to compress detection head redundancy. (3) We incorporate and adapt the Adaptive Dual-path Downsampling module (ADown) into YOLO11n, which decouples spatial resolution compression into complementary parallel paths to maximize retention of disease-discriminative information during downsampling, and systematically validate its contribution within the proposed architecture through controlled ablation experiments.
To position these contributions precisely, we clarify the origin of each module. The C3k2-DICN backbone is newly designed in this work: its core Dynamic Inception Mixer, the data-driven dynamic-kernel-weight mechanism, and the morphology-aligned strip-kernel branches form, to our knowledge, a new combination tailored to rice-lesion morphology. The CSDH is a newly designed cross-scale shared detection head that instantiates the general principle of cross-scale parameter sharing as a concrete scale-adaptive-projection plus shared-refinement structure. The ADown module is adopted from YOLOv9 [
22] and is integrated and systematically validated within the proposed architecture rather than newly proposed here. To make these distinctions explicit, Table 1 summarizes the architectural provenance of each DCA-YOLO component, indicating which modules are newly designed, adapted, or left unchanged relative to the YOLO11n baseline.
The remainder of this paper is organized as follows: Section 2 presents the materials and methods; Section 3 reports the experimental results; Section 4 discusses the findings; and Section 5 concludes the paper.
2. Materials and Methods
2.1. Dataset Compilation and Processing
This study used a rice leaf disease image dataset of 4622 annotated images assembled from two complementary sources (images acquired by the authors using mobile devices and images obtained from publicly available online repositories), covering three representative disease categories: rice blast (Magnaporthe oryzae), brown spot (Bipolaris oryzae), and bacterial blight (Xanthomonas oryzae pv. oryzae). Because the images came from heterogeneous sources and were not acquired under a single standardized protocol, their resolution, capture devices, optical settings, and acquisition conditions varied across images and were not systematically documented. None of the images were captured with the onboard camera of the field robot platform described in Section 2.8.
The source images further differ in viewpoint, background, and image quality, and some exhibit shallow depth of field, partial blur, or uneven illumination. Because such characteristics are inherited from the source images rather than introduced by a controlled-capture setup, the shallow depth of field visible in some examples reflects this source heterogeneity rather than a specific imaging configuration. During curation, clearly mislabeled, duplicated, or very-low-quality images were removed. The crop growth stage and disease severity associated with each image were not systematically recorded and consequently cannot be controlled or reported here, which is acknowledged as a limitation in
Section 4.
All bounding boxes were manually annotated by three of the authors using the LabelMe annotation tool (version 5.10.1), and every image was cross-checked by at least one other annotator to ensure labeling consistency; images with evidently incorrect or inconsistent labels were discarded during curation. To enable independent verification, the per-class image and bounding-box counts are reported for each split. For the training set, bacterial blight, brown spot, and rice blast comprise 1101, 1096, and 1039 images (1909, 2928, and 3650 bounding boxes), respectively; for the validation set, 317, 337, and 270 images (524, 872, and 918 boxes); and for the test set, 164, 169, and 131 images (268, 376, and 471 boxes). In total, the dataset provides 11,916 annotated lesion instances. Because a few images contain lesions of more than one disease, the per-class image counts sum to slightly more than the corresponding split totals. Because the web-sourced portion of the images is subject to third-party copyright, the images are not redistributed; instead, the curated annotation files and the list of original online sources are provided. The online images were obtained from the Kaggle data-sharing platform (
https://www.kaggle.com, accessed on 5 January 2026).
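The reported counts can be cross-checked with a few lines of arithmetic. The sketch below uses only the per-class figures quoted above and the split sizes stated in this section (3235, 924, and 463 images); the variable names are illustrative and not part of the authors' tooling.

```python
# Per-class counts in the order: bacterial blight, brown spot, rice blast.
images = {
    "train": (1101, 1096, 1039),
    "val": (317, 337, 270),
    "test": (164, 169, 131),
}
boxes = {
    "train": (1909, 2928, 3650),
    "val": (524, 872, 918),
    "test": (268, 376, 471),
}
split_totals = {"train": 3235, "val": 924, "test": 463}

# Bounding boxes per split and overall
boxes_per_split = {name: sum(counts) for name, counts in boxes.items()}
total_boxes = sum(boxes_per_split.values())

# Per-class image counts minus the split size: the surplus comes from
# images that are counted under more than one disease class.
multi_class_surplus = {
    name: sum(images[name]) - split_totals[name] for name in images
}

print(boxes_per_split)
print(total_boxes)
print(multi_class_surplus)
```

The box counts sum to the stated 11,916 instances, and the surplus values show how many image counts in each split are attributable to multi-disease images.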
Following a stratified random sampling strategy, the dataset was first partitioned at the image level into training (3235 images), validation (924 images), and test (463 images) subsets in a 7:2:1 ratio; the per-class image and bounding-box counts reported above refer to these original, pre-augmentation images. To address class imbalance at the lesion-instance level (bacterial blight has the fewest annotated boxes in the training set), data augmentation was then applied only to the minority-class samples within the training subset after this split, including random rotation (±30°), horizontal and vertical flipping, brightness perturbation (±20%), and Gaussian noise injection, whereas the validation and test subsets retained only original, non-augmented images. Because the partition is performed at the image level and strictly precedes augmentation, no augmented variant of any training image can appear in the validation or test subset, which eliminates train–test data leakage and the associated risk of inflated evaluation results.
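The split-then-augment ordering described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes one label per image, uses hypothetical toy identifiers, and stands in for real image augmentation with a placeholder that only derives new identifiers.

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (image_id, label) pairs per class into train/val/test at image level."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)
    train, val, test = [], [], []
    for label, ids in sorted(by_class.items()):
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        train += [(i, label) for i in ids[:n_train]]
        val += [(i, label) for i in ids[n_train:n_train + n_val]]
        test += [(i, label) for i in ids[n_train + n_val:]]
    return train, val, test

def augment_minority(train, minority_label, copies=1):
    """Append placeholder augmented variants of minority-class training images only."""
    augmented = [
        (f"{image_id}_aug{k}", label)
        for image_id, label in train
        if label == minority_label
        for k in range(copies)
    ]
    return train + augmented

# Hypothetical toy data: 100 images per class
samples = [
    (f"{label}-{n}", label)
    for label in ("blight", "brown_spot", "blast")
    for n in range(100)
]
train, val, test = stratified_split(samples)
train_aug = augment_minority(train, "blight")  # applied after the split
```

Because augmentation runs only on the already-separated training list, every augmented identifier derives from a training image, so the validation and test lists cannot contain a variant of it.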
To evaluate cross-domain generalization capability, this study additionally incorporated a publicly available rice disease dataset from the Roboflow Universe platform (
https://universe.roboflow.com/pest-i83ul/rice-leaf-disease-weadf, accessed on 25 May 2026) as an external independent benchmark. This dataset contains 2553 images covering various common rice diseases, and exhibits significant differences from the compiled dataset in image acquisition conditions, capture devices, and annotation styles, and therefore serves as an independent benchmark for testing whether the proposed architectural improvements generalize to a different dataset. Example images from the compiled dataset are shown in
Figure 1, and the category and scale distribution of the public dataset are presented in
Figure 2.
2.2. Proposed DCA-YOLO Architecture
Object detection has been widely deployed across domains such as agricultural monitoring and industrial inspection [
23]. Among existing frameworks, the YOLO family has established mainstream status in agricultural intelligent sensing by virtue of its real-time performance [
24,
25]. This paper adopts YOLO11n, the lightweight variant of the YOLO11 series [
26], as the baseline model. While YOLO11n achieves competitive detection accuracy, two aspects offer room for further improvement toward resource-constrained agricultural edge deployment: (1) the backbone feature extraction can be enhanced with more adaptive multi-scale receptive field design to better capture the morphological heterogeneity of rice lesions; and (2) the independently parameterized per-scale detection head branches introduce parameter redundancy that can be reduced through cross-scale sharing without sacrificing detection performance. To exploit these opportunities, we propose DCA-YOLO, which systematically enhances YOLO11n along three dimensions (
Figure 3): C3k2-DICN for adaptive multi-scale feature extraction, a lightweight Cross-Scale Shared Detection Head (CSDH) to reduce detection head redundancy, and ADown to preserve discriminative disease features during downsampling. The network retains the YOLO11 backbone–neck–head structure, with P1–P5 denoting the five backbone feature-extraction stages and the neck following the standard YOLO11 PAN–FPN design that combines a top-down feature pyramid network (FPN) with a bottom-up path-aggregation network (PAN).
2.3. C3k2-DICN Dynamic Hybrid Convolution Module
A fundamental challenge in fine-grained visual recognition is that discriminative patterns vary substantially in scale, orientation, and spatial extent across different object instances. Fixed-kernel convolutional networks are inherently limited in dynamically adapting receptive fields to input content. To address this, this paper proposes the Dynamic Inception Mixer (DIM) module as a data-driven, content-adaptive feature extraction unit. DIM is embedded into the C3k2 module of YOLO11 to construct the improved C3k2-DICN (Dynamic Inception Convolution Network) module (
Figure 4a). The DIM module adopts a hierarchically progressive architectural design, composed of three core components.
2.3.1. Dynamic Inception Depth-Wise Convolution Layer (DynamicInceptionDWConv2d)
This layer introduces a data-driven Dynamic Kernel Weights (DKW) mechanism. This mechanism follows the principle of dynamic convolution, which aggregates several candidate kernels through input-dependent attention instead of applying a single fixed kernel [
27]. Rice leaf diseases exhibit highly heterogeneous morphological characteristics: blast lesions are typically spindle-shaped with gray-white centers and brown margins; bacterial blight produces elongated stripe-shaped necrotic lesions along the vein direction; and brown spot presents as near-circular scattered distributions with strong background interference.
To address these morphological differences, three depthwise separable convolutional branches are deployed in parallel: the 3 × 3 square kernel captures local texture details of near-circular lesions; the 1 × k horizontal strip kernel (k = 11, empirically selected on the validation set from candidates {7, 9, 11, 13}) aligns with transverse extension of bacterial blight stripe lesions; and the k × 1 vertical strip kernel perceives the longitudinal extension of blast’s spindle-shaped lesions. This decomposition of a depthwise convolution into a square kernel and two orthogonal strip (band) kernels follows the Inception-style depthwise design, in which a large-kernel depthwise convolution is split into a small square kernel and orthogonal band kernels for efficiency [
28]. The DKW generator compresses the spatial dimension via global average pooling, generates three sets of channel-level attention weights through a 1 × 1 convolution, and performs adaptive weighted fusion after Softmax normalization:
where $F_i$ denotes the $i$-th depthwise convolutional branch, $\alpha_i(x)$ represents the dynamically generated normalized attention weights, $\sigma$ is the Sigmoid Linear Unit (SiLU) activation function, and BN denotes the batch normalization layer.
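One form of the fusion that is consistent with these definitions can be sketched as follows. It assumes that batch normalization and SiLU are applied after the weighted sum and that the Softmax is taken across the three branches for each channel; the symbols $y$ and GAP (global average pooling) are introduced here for illustration only.

```latex
\alpha(x) = \mathrm{Softmax}\big(\mathrm{Conv}_{1\times1}(\mathrm{GAP}(x))\big),
\qquad
y = \sigma\Big(\mathrm{BN}\Big(\sum_{i=1}^{3} \alpha_i(x) \odot F_i(x)\Big)\Big)
```

Under this reading, each channel receives its own convex combination of the square, horizontal-strip, and vertical-strip responses, so the effective kernel shape varies with the input.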
2.3.2. Multi-Scale Hybrid Convolution Module (DynamicInceptionMixer)
Rice leaf lesion features exhibit significant complementarity across different receptive field scales: kernel size 3 corresponds to local perception of early-stage micro-lesions, while kernel size 5 accommodates modeling of mid- to late-stage lesion expansion. The DynamicInceptionMixer module adopts a channel splitting and parallel processing strategy: input features are evenly divided along the channel dimension into two groups, respectively fed into DynamicInceptionDWConv2d layers configured with different kernel sizes (default {3, 5}), achieving parallel extraction and aggregation of cross-scale lesion features. After concatenating the two paths, a 1 × 1 convolution performs cross-channel information interaction and feature recalibration (
Figure 4c).
2.3.3. Unified Network Building Block (DynamicIncMixerBlock)
DynamicIncMixerBlock constructs the network building unit based on a dual-path residual structure. The feature mixing path applies batch normalization before feeding into DynamicInceptionMixer to extract multi-scale lesion spatial features; the channel interaction path employs a Convolutional Gated Linear Unit (Convolutional GLU) [
29] for efficient channel-dimension feature transformation, where its gating branch dynamically suppresses background noise channels. Both residual branches incorporate Layer Scale and DropPath mechanisms:
where $\gamma_1$ and $\gamma_2$ are the learnable Layer Scale parameters.
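A dual-path residual update matching this description can be sketched as below. Two points are assumptions rather than statements from the text: that the two residual branches are applied sequentially, and that no additional normalization precedes the Convolutional GLU. DIM abbreviates DynamicInceptionMixer, and $x'$ and $y$ are illustrative symbols.

```latex
x' = x + \mathrm{DropPath}\big(\gamma_1 \odot \mathrm{DIM}(\mathrm{BN}(x))\big),
\qquad
y = x' + \mathrm{DropPath}\big(\gamma_2 \odot \mathrm{ConvGLU}(x')\big)
```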
2.4. Cross-Scale Shared Detection Head (CSDH)
A well-recognized challenge in multi-scale visual detection is that independently parameterized prediction heads for each scale lead to parameter redundancy and hinder scale-invariant representation learning. Cross-scale parameter sharing forces the model to use a unified set of refinement parameters across all detection scales, acting as an inductive bias that encourages generalizable, scale-agnostic discriminative features. Weight-shared detection heads across multiple feature scales have also been adopted in recent real-time detectors to reduce head redundancy [
30]. The standard YOLO11n detection head independently configures regression and classification branches for each detection scale, with parameter count growing linearly with the number of detection layers. CSDH replaces this redundant design with cross-scale parameter sharing (
Figure 5).
The construction of CSDH follows a three-stage logic: scale-adaptive projection, shared refinement, and decoupled head prediction. Group Normalization (GN) [31] is used in place of Batch Normalization (BN) because small per-scale batch sizes make BN statistics unreliable, consistent with the finding that BN in a weight-shared head degrades performance owing to inter-scale statistical differences [30]. First, an independent 3 × 3 convolutional projection layer with GN is configured for each detection scale to map multi-scale features to a unified hidden dimension $C_h$:
Second, all scale features are jointly fed into a parameter-shared refinement module Φ, consisting of a cascaded 3 × 3 depthwise convolution and 1 × 1 pointwise convolution:
Third, the shared regression head cv2 and classification head cv3 are applied to the refined features, with a learnable per-scale amplitude scaling factor $S_i$ (initialized to 1.0) applied to the regression output:
At inference, the final bounding box coordinates and class probabilities are decoded as:
where $s_i$ denotes the stride of detection scale $i$, and DFL denotes Distribution Focal Loss. Because the refinement module and the prediction heads are shared, their parameter count stays constant regardless of the number of detection layers $N_l$, which significantly reduces the overall model parameter count while maintaining multi-scale detection performance.
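The scaling argument can be illustrated with a small parameter count. The sketch below is not the actual CSDH or YOLO11n head: the channel widths, the hidden dimension, the 1 × 1 prediction layers, and the omission of normalization parameters are all hypothetical choices made only to show that the shared part does not grow with the number of scales.

```python
def conv_params(c_in, c_out, k, groups=1, bias=False):
    """Weights (and optional biases) of a k x k convolution."""
    return (c_in // groups) * c_out * k * k + (c_out if bias else 0)

def shared_block_params(c_hidden, n_reg, n_cls):
    """Refinement plus prediction layers; in a shared head these exist once."""
    refine = conv_params(c_hidden, c_hidden, 3, groups=c_hidden)  # 3x3 depthwise
    refine += conv_params(c_hidden, c_hidden, 1)                  # 1x1 pointwise
    heads = conv_params(c_hidden, n_reg, 1, bias=True)            # regression head
    heads += conv_params(c_hidden, n_cls, 1, bias=True)           # classification head
    return refine + heads

def shared_head_params(in_channels, c_hidden, n_reg, n_cls):
    """Per-scale 3x3 projections, one shared block, one scale factor per scale."""
    projections = sum(conv_params(c, c_hidden, 3) for c in in_channels)
    scale_factors = len(in_channels)
    return projections + shared_block_params(c_hidden, n_reg, n_cls) + scale_factors

def unshared_head_params(in_channels, c_hidden, n_reg, n_cls):
    """Same layers, but refinement and prediction are duplicated per scale."""
    projections = sum(conv_params(c, c_hidden, 3) for c in in_channels)
    per_scale = shared_block_params(c_hidden, n_reg, n_cls)
    return projections + len(in_channels) * per_scale

# Hypothetical configuration: three scales, 64 hidden channels, 3 classes
cfg = dict(in_channels=(64, 128, 256), c_hidden=64, n_reg=64, n_cls=3)
shared_total = shared_head_params(**cfg)
unshared_total = unshared_head_params(**cfg)
print(shared_total, unshared_total, unshared_total - shared_total)
```

In this toy setting the per-scale projections still scale with the number of detection layers, while the refinement and prediction cost is paid once; the saving therefore grows with every additional scale.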
2.5. Adaptive Dual-Path Downsampling Module (ADown)
The trade-off between spatial resolution compression and information retention is a fundamental design tension in hierarchical visual feature learning. This paper adopts the Adaptive Dual-path Downsampling (ADown) module originally proposed in YOLOv9 [22], which achieves a better balance through a multi-strategy parallel design (Figure 6). The input feature map $x$ is first subjected to progressive pre-downsampling via average pooling:
The pre-processed features are equally divided along the channel dimension into two paths, $x_1$ and $x_2$. Path 1 (semantic-aware path) employs a stride-2 3 × 3 convolution: $y_1 = \mathrm{Conv}_{3\times3}(x_1,\ s = 2)$. Path 2 (structure-enhancement path) applies max pooling followed by a 1 × 1 convolution: $y_2 = \mathrm{Conv}_{1\times1}(\mathrm{MaxPool}(x_2))$. The dual-path outputs are finally concatenated along the channel dimension: $y = \mathrm{Concat}(y_1, y_2)$.
The channel splitting strategy reduces parameter count by approximately 86% compared to full-channel stride convolution. Furthermore, compared to traditional large-stride convolution (s = 4), ADown’s two-stage progressive downsampling effectively mitigates abrupt information reduction, which is critical for retaining low-level visual features such as rice lesion edges.
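To make the two-stage downsampling concrete, the sketch below traces tensor shapes through the two paths. The pooling kernel sizes, strides, and paddings are assumptions taken from the commonly used public YOLOv9 implementation of ADown, not values stated in this paper, and the function names are illustrative.

```python
def out_size(size, kernel, stride, padding):
    """Output length along one spatial axis for a pooling or convolution window."""
    return (size + 2 * padding - kernel) // stride + 1

def adown_shapes(c_in, c_out, height, width):
    """Trace (channels, height, width) through the two ADown paths."""
    # Pre-downsampling: 2x2 average pooling, stride 1, no padding (assumed)
    h, w = out_size(height, 2, 1, 0), out_size(width, 2, 1, 0)
    c_half = c_in // 2  # equal channel split into x1 and x2
    # Path 1: 3x3 convolution, stride 2, padding 1 (padding assumed)
    y1 = (c_out // 2, out_size(h, 3, 2, 1), out_size(w, 3, 2, 1))
    # Path 2: 3x3 max pooling, stride 2, padding 1 (assumed), then 1x1 convolution
    y2 = (c_out // 2, out_size(h, 3, 2, 1), out_size(w, 3, 2, 1))
    # Channel concatenation requires matching spatial sizes
    assert y1[1:] == y2[1:]
    return {
        "split": (c_half, h, w),
        "y1": y1,
        "y2": y2,
        "out": (y1[0] + y2[0], y1[1], y1[2]),
    }

shapes = adown_shapes(c_in=64, c_out=128, height=80, width=80)
print(shapes)
```

With these assumed settings, an 80 × 80 map is first reduced to 79 × 79 by the stride-1 pooling and then to 40 × 40 by each stride-2 path, so the overall reduction is a factor of two reached in two steps.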
2.6. Model Training and Inference Settings
Experiments were conducted in a high-performance computing environment; detailed hardware configurations, software environments, and training parameters are presented in Table 2, with the remaining hyperparameters kept at the official Ultralytics default values. To ensure a fair comparison, every model (including all baselines and comparison detectors) was trained and evaluated under identical conditions: the same 640 × 640 input size, the same 300 training epochs, the same SGD optimizer and learning-rate schedule, the same data-augmentation pipeline, and the same training/validation/test splits and stopping criterion. All comparison models were retrained by the authors under these unified settings rather than quoting results from the original publications, and none of the models (DCA-YOLO and every baseline alike) was initialized from COCO- or ImageNet-pretrained weights; all networks were trained from random initialization. For the YOLO-family detectors, the hyperparameter configuration was kept identical to that of DCA-YOLO, and no hyperparameters were individually tuned per model, so the reported differences reflect architectural design rather than tuning or pretraining advantages.
2.7. Evaluation Metrics
This paper constructs an evaluation framework from two dimensions (model efficiency and detection accuracy) to comprehensively assess the practical value of the proposed lightweight approach. Parameter count (Parameters/M) and floating-point operations (GFLOPs) directly determine whether the model can achieve real-time deployment on resource-constrained agricultural edge devices.
Precision ($P$) and recall ($R$) measure detection quality from complementary dimensions:
where TP, FP, and FN denote the numbers of true positive, false positive, and false negative detections, respectively. Mean average precision (mAP) is obtained by averaging the area under the P–R curve across all categories:
where $N$ denotes the total number of target categories. This paper primarily reports mAP@0.5 (Intersection over Union (IoU) threshold = 0.5) and mAP@0.5:0.95 (averaged over IoU ∈ {0.50, 0.55, …, 0.95}), the latter serving as the core accuracy metric given its more stringent localization requirements.
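For reference, the standard definitions that match the symbols used in this subsection can be written as follows; $\mathrm{AP}_i$ is introduced here to denote the area under the P–R curve of category $i$.

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}

\mathrm{AP}_i = \int_0^1 P_i(R)\,\mathrm{d}R, \qquad
\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i
```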
2.8. Edge Deployment and On-Device Measurement Protocol
To assess the on-device inference feasibility of DCA-YOLO on a representative edge platform, this study used the NVIDIA Jetson TX2 (NVIDIA Corporation, Santa Clara, CA, USA) as the target edge computing unit, mounted on a self-developed wheeled robot platform. The Jetson TX2 features a 256-core Pascal GPU (1.3 GHz), dual-core Denver2 and quad-core ARM Cortex-A57 heterogeneous CPU (up to 2.0 GHz), 8 GB 128-bit LPDDR4 memory, and 32 GB eMMC5.1 storage, with a rated TDP of 15 W.
The deployment pipeline follows the standard PyTorch → ONNX → TensorRT acceleration paradigm using TensorRT 8.2.1 (paired with JetPack 4.6.2). FP16 mixed precision was adopted for two reasons. First, INT8 inference on Pascal-architecture GPUs requires layer-wise activation calibration, which is sensitive to the high-frequency texture features of small lesion targets. Second, FP16 reduces weight storage to 50% of FP32 while confining the resulting accuracy loss to within 0.1 percentage points. All measurements were conducted in the Jetson TX2 15 W Max-N mode at 640 × 640 input resolution and averaged over the 463 test images. For reproducibility, the optimization pipeline was as follows: the model was exported to ONNX (opset 13) and compiled with TensorRT using a 4 GB workspace, automatic layer/tensor fusion, and a fixed input shape (batch size = 1); 50 warm-up iterations preceded timing, 300 measured iterations were averaged, and no pruning or re-parameterization beyond the proposed modules was applied.
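The warm-up-then-average timing protocol can be sketched as below. This is a generic illustration, not the authors' measurement script: `infer` is a hypothetical callable standing in for the compiled engine, and it is assumed to block until its result is ready (on a GPU this would require explicit synchronization before the clock is read).

```python
import time

def measure_latency(infer, sample, warmup=50, iters=300):
    """Return (mean latency in ms, frames per second) for infer(sample)."""
    for _ in range(warmup):
        infer(sample)  # warm-up runs are executed but not timed
    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start
    mean_ms = 1000.0 * elapsed / iters
    fps = iters / elapsed if elapsed > 0 else float("inf")
    return mean_ms, fps

# Usage with a hypothetical stand-in for the inference call
def dummy_infer(frame):
    return sum(frame)

latency_ms, throughput = measure_latency(dummy_infer, list(range(1000)))
print(f"{latency_ms:.3f} ms per frame, {throughput:.1f} FPS")
```

Excluding the warm-up runs keeps one-time costs such as memory allocation and kernel selection out of the reported average.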
4. Discussion
The experimental results indicate that DCA-YOLO achieves a favorable balance between detection accuracy and computational efficiency for rice disease detection on the datasets evaluated in this study. The three modules (C3k2-DICN, CSDH, and ADown) contribute complementary improvements, as shown by the ablation study. Progressively stacking the three modules yields monotonic improvement across all metrics, supporting their contribution; the isolated effect of each module is further examined through the single-module ablation reported in Section 3.2.
The C3k2-DICN module addresses the inherent limitation of fixed-kernel convolutions in adapting to the highly heterogeneous morphological characteristics of rice diseases. The multi-branch dynamic design is intended to provide content-adaptive feature extraction for blast’s spindle-shaped lesions, bacterial blight’s stripe-shaped necrotic patterns, and brown spot’s near-circular distributions, which correspond to distinct pathogen types. The CSDH’s parameter-sharing strategy primarily improves classification-level discrimination, contributing a 0.56 percentage point gain in mAP@0.5 (87.77%→88.33%), while its effect on mAP@0.5:0.95 is marginal (+0.01 pp); this is consistent with the design intent of CSDH, which targets scale-agnostic feature sharing for category discrimination rather than bounding box localization precision. The ADown module’s dual-path design makes the most significant contribution to mAP@0.5:0.95 (+0.56 pp, from 45.26% to 45.82%), reflecting its role in preserving disease-discriminative edge and texture detail during downsampling, which is precisely the low-level spatial information required for accurate lesion boundary localization under the stringent IoU thresholds used in mAP@0.5:0.95 evaluation. Importantly, attributing improved discrimination among the three diseases specifically to these modules rests on the per-class metrics and confusion matrix reported in Section 3.3, which quantify how well the classes are separated, together with the module-wise ablation in Section 3.2; the GradCAM++ maps in Section 3.5 illustrate where the network attends but do not by themselves establish class discrimination. It should also be stated plainly that, although the mAP@0.5 values are high, the mAP@0.5:0.95 values (around 45.8% on the compiled dataset) are moderate rather than excellent. This is expected under the stricter localization criterion for small, low-contrast, and diffuse lesions, and is compounded by the variable quality and heterogeneous resolution of the source images; the principal advantage of DCA-YOLO therefore lies in its accuracy–efficiency trade-off rather than in absolute localization accuracy.
From an agricultural application perspective, DCA-YOLO’s on-device profile on the Jetson TX2 (19.8 FPS end-to-end and 32.2 FPS for the inference stage, 1.7 M parameters, 7.21 MB ONNX model) provides on-device throughput adequate for slow-traverse field inspection and makes it a promising candidate for future integration into wheeled robots and UAV-mounted systems. The 34.6% reduction in parameter count translates into a smaller runtime memory footprint, which would leave additional headroom on the Jetson TX2 for concurrent on-board processes such as the robot operating system, chassis control, and path planning that share the same 8 GB LPDDR4 memory pool. The reduced per-inference computational cost (4.0 vs. 6.3 GFLOPs) also lowers energy consumption per frame, which would help extend battery endurance on battery-powered platforms. Compared with recent lightweight agricultural object-detection studies, including rice disease detection and related crop or grain inspection tasks [
2,
5,
6,
16,
18,
21], DCA-YOLO achieves a favorable accuracy–efficiency balance, providing a practical basis toward closing the gap between laboratory performance and in-field operation, which remains to be validated. As a planning consideration for future field deployment rather than a demonstrated result, the end-to-end throughput of 19.8 FPS corresponds to one processed frame roughly every 51 ms; at an illustrative travel speed of 0.5–1.0 m/s this would place one processed frame about every 2.5–5.1 cm of forward motion, suggesting the throughput should be adequate for dense canopy coverage once the system is deployed and validated in motion.
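The frame-spacing estimate above follows directly from the throughput and the travel speed. The short sketch below reproduces that arithmetic; the speeds are the illustrative values from the text, and the function name is hypothetical.

```python
def frame_spacing_cm(fps, speed_m_per_s):
    """Forward distance travelled between consecutive processed frames, in cm."""
    return 100.0 * speed_m_per_s / fps

fps = 19.8                  # end-to-end throughput reported on the Jetson TX2
period_ms = 1000.0 / fps    # time between consecutive processed frames
spacing = [frame_spacing_cm(fps, v) for v in (0.5, 1.0)]  # illustrative speeds

print(f"{period_ms:.1f} ms between frames")
print([f"{s:.1f} cm" for s in spacing])
```

Because spacing scales linearly with speed, doubling the travel speed doubles the ground distance between processed frames at a fixed throughput.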
Machine-learning detection of foliar plant diseases has advanced rapidly, and several recent studies report very high accuracy on both fungal diseases such as rice blast and brown spot and bacterial diseases such as bacterial blight. For example, YOLO-LeafNet attains an mAP@0.5 of about 0.99 across four crop species [
39], and BGM-YOLO improves the detection of small lesions against complex natural backgrounds [
40]. Such figures, however, are not directly comparable with the present results, because they are typically reported on curated single-leaf datasets (for example, PlantVillage-style images) with relatively clean backgrounds and often with substantially heavier models. YOLO-LeafNet, for instance, requires about 28.5 GFLOPs, roughly seven times the 4.0 GFLOPs of DCA-YOLO. The dataset used in this study is instead deliberately heterogeneous and field-representative, with varied capture devices, resolutions, viewpoints, and illumination; this lowers absolute accuracy but more faithfully reflects open-field conditions. Relative to recent lightweight agricultural detectors for rice disease recognition and other crop or grain inspection scenarios [
2,
5,
6,
16,
18,
21], DCA-YOLO attains competitive or higher accuracy with markedly fewer parameters (1.7 M) and less computation (4.0 GFLOPs). Its contribution therefore lies primarily in an accuracy–efficiency trade-off that makes on-device, real-time inference feasible.
The framework-agnostic nature of the proposed modules means they can be readily integrated into other YOLO variants or lightweight detection architectures for broader agricultural sensing tasks, including pest detection, weed identification, and crop growth monitoring [
10,
41,
42]. GradCAM++ visualization further suggests that DCA-YOLO attends to lesion regions while markedly suppressing background activation, offering qualitative interpretive support rather than direct proof of class discrimination; this is important for building trust among plant protection practitioners who may use these systems in practice.
The compiled dataset is visually diverse, spanning many capture devices, resolutions, and imaging conditions. This diversity exposes the model to a wide range of appearances and should improve its robustness to incidental variation in image quality, but it is not a substitute for controlled domain coverage. Because the capture metadata are undocumented, the variation cannot be resolved into specific, characterized domains, and the experiments therefore cannot quantify how well the model would transfer to systematically different settings such as new geographic regions, growth stages, seasons, or rice varieties, where symptoms may look different. The strong result on the independent public benchmark (
Table 6) points to some cross-dataset generalization, but establishing robust cross-domain transfer would require controlled, multi-site, multi-season data collection.
Key limitations of this study include: (1) the compiled dataset covers only three major rice diseases (blast, brown spot, bacterial blight) and does not encompass other common pathogens such as sheath blight and false smut that may co-occur in field conditions; and (2) all experiments were conducted under a single geographic and seasonal setting, and performance under variable weather, growth stages, and regional disease strains remains to be validated. To avoid over-generalization, we explicitly distinguish demonstrated performance from expected generalization: the reported accuracy is demonstrated only on the two evaluated datasets (one assembled from mobile-device and online images and one public benchmark), whereas robustness across different rice varieties, growth stages, weather conditions, and regional disease strains (all of which may alter the visual appearance of symptoms) is anticipated but not yet verified, and should be confirmed through dedicated multi-site, multi-season trials before broad field claims are made. Three further limitations follow from the nature of the data and the evaluation. (3) As detailed in
Section 2.1, the images were compiled from heterogeneous mobile-device and online sources with uncontrolled, undocumented capture settings (resolution, aperture, depth of field), and some show shallow depth of field or blur; this variability constrains image quality and reproducibility. (4) The disease growth stage and severity of the source images are not documented, so the dataset cannot support questions about an optimal image scale, stage-specific accuracy, or whether the crop was naturally infected or artificially inoculated. (5) The model was evaluated on static images and in on-device tests on the field robot platform with the platform stationary; no closed-loop evaluation was performed during continuous robot motion in the field. As a direct consequence of (4), the early-warning capacity of the system (that is, reliable detection of incipient, early-stage symptoms in time to support management decisions) cannot be established from the present data and is identified as an explicit objective for future work using stage-labeled, field-collected imagery. Future work should extend the disease category coverage, apply knowledge distillation and quantization-aware training to further reduce deployment overhead, and validate generalization across diverse agricultural deployment scenarios.
5. Conclusions
This study set out to design a lightweight rice leaf disease detection model that improves the accuracy–efficiency trade-off for edge deployment and to assess its on-device inference feasibility on an embedded GPU. Both objectives were met. The proposed DCA-YOLO, built on YOLO11n with three complementary modules (C3k2-DICN, CSDH, and ADown), attains detection accuracy competitive with or better than mainstream detectors while using substantially less computation, and it runs at 19.8 FPS end-to-end on an NVIDIA Jetson TX2. The main finding is that careful, content-adaptive lightweight design, rather than added model capacity, can deliver accurate rice disease detection within the compute, memory, and power budget of a field-deployable edge platform, helping to narrow the gap between laboratory performance and on-device operation.
These results should be interpreted together with the conditions under which they were obtained: accuracy was demonstrated on heterogeneous, mobile- and web-sourced imagery for three diseases, and on-device throughput was measured with the platform stationary rather than during continuous field motion, so absolute localization accuracy and field robustness remain to be confirmed. Future work will broaden the disease and domain coverage and pursue closed-loop validation on a moving robot, supported by further model compression through knowledge distillation and quantization-aware training.