1. Introduction
In recent decades, China’s citrus industry has experienced rapid development, with the planting area continuously expanding [
1]. Citrus pests and diseases have long been major constraints on citrus growth and development, significantly affecting yield and quality, and posing a serious threat to the citrus industry [
2]. Therefore, creating an accurate, efficient, and automated approach for detecting pests and diseases in citrus is essential for effective management and for promoting the long-term sustainability of the citrus sector.
Traditional citrus pest and disease detection methods rely heavily on manual observation, which is labor-intensive and prone to subjective bias, resulting in low detection efficiency [
3]. To improve both accuracy and efficiency, automated detection systems based on deep learning and computer vision have emerged as a research hotspot in recent years [
4]. By collecting image data of plants and analyzing it through deep learning models, these systems can significantly enhance the precision and efficiency of pest and disease detection, enabling real-time monitoring of large-scale agricultural areas [
5]. The application of such technologies not only reduces dependence on manual inspection but also improves the accuracy of pest control, thereby decreasing the use of chemical pesticides, mitigating environmental pollution, and promoting sustainable agricultural development [
6]. Traditional machine learning methods, such as the Color Co-occurrence Method (CCM), analyze crop texture and color features in a straightforward manner and are typically evaluated under controlled laboratory conditions [
7]. For example, Sharif et al. [
8] enhanced input images and extracted feature points using an optimized weighted segmentation method and finally used a multi-class support vector machine (M-SVM) to classify citrus diseases based on selected features. However, traditional machine learning techniques often struggle in complex environments due to occlusion, varying lighting conditions, and other external interferences. They typically suffer from long processing times and poor robustness, making them insufficient for practical use in real-world scenarios [
9].
With the rise of artificial intelligence and the rapid development of deep learning technologies, significant breakthroughs have been achieved in image classification and object detection. Convolutional Neural Networks (CNNs) have demonstrated tremendous potential in image recognition tasks [
10]. These technologies have been widely applied in agricultural pest and disease detection, enabling automated and accurate identification by training models to recognize specific features of pests and diseases [
11]. Among the many deep learning models, the YOLO (You Only Look Once) series has become one of the most preferred object detection frameworks due to its high-speed real-time detection capabilities [
12]. YOLO models can simultaneously perform object localization and classification through a single forward pass, greatly improving detection efficiency [
13]. In agriculture, in particular, the high efficiency of YOLO models makes them highly valuable for real-time monitoring of pests and diseases [
14].
Although the YOLO series has achieved remarkable results in citrus pest and disease detection, challenges remain in detection accuracy under complex backgrounds, in the detection of small objects, and in robustness against occlusion [
15]. To address these issues, researchers have proposed various targeted improvements. Zhang et al. [
16] enhanced YOLOv4 by introducing a multi-scale feature fusion module, which improved the model’s ability to detect small objects and increased its robustness in complex environments. Similarly, Hu et al. [
17] improved YOLOv5 using techniques such as data augmentation and transfer learning, thereby enhancing the model’s generalization ability across different environments and achieving promising results in citrus pest and disease detection tasks. In addition, Xu et al. [
18] applied convolutional layer pruning in YOLOv4-Tiny to reduce parameter count and adopted efficient computation strategies that significantly improved detection speed, enabling real-time performance on resource-constrained devices. Li et al. [
19] proposed a lightweight strategy that integrated loss function optimization and attention mechanisms, effectively reducing redundant computation, enhancing real-time capability, and maintaining high detection accuracy even in low-power environments. However, despite the benefits of lightweight strategies in reducing computational complexity, limitations such as insufficient generalization and suboptimal small-object detection accuracy remain. To address these limitations, researchers have also explored methods such as multi-task learning and ensemble learning to further improve YOLO’s performance in citrus pest and disease detection. For example, Song et al. [
20] proposed a multi-task learning framework based on YOLOv5 that integrated object detection, image classification, and segmentation tasks, thereby improving the overall effectiveness of pest and disease identification. Meanwhile, the introduction of ensemble learning strategies has contributed to improved model stability and accuracy. By combining multiple YOLO models, ensemble learning effectively reduces the bias of individual models and enhances adaptability and robustness across various environments [
21]. Soeb et al. [
22] introduced a tea leaf pest and disease dataset and demonstrated the superior performance of YOLOv7 in object detection and recognition. Dai et al. [
23] improved YOLOv8 to significantly enhance accuracy and robustness in citrus disease detection, offering a feasible direction for developing more lightweight and efficient models. Recent studies have shown that YOLOv11 achieves the fastest detection and image processing speed in complex orchard environments [
24], and its performance surpasses that of many current mainstream models across various domains [
25,
26]. In summary, although numerous excellent modules and network architectures have been developed for crop pest and disease recognition, these approaches still commonly face challenges such as limited accuracy, simplified background settings, and large model sizes. Hence, it is crucial to enhance these techniques to effectively identify citrus pests and diseases amid challenging, ever-changing environments.
This paper proposes a high-precision and lightweight citrus pest and disease detection model named YOLOv11-RDTNet. Based on improvements to the YOLOv11 architecture, the proposed model addresses the limitations of detection performance in complex orchard environments. It enhances both the accuracy and speed of recognition while being lightweight. The model demonstrates significant effectiveness in the intelligent identification of citrus pests and diseases and offers valuable insights for the efficient detection of pests and diseases in other crops.
2. Materials and Methods
2.1. Materials
2.1.1. Data Acquisition
Diseased plants typically exhibit visible damage or alterations on leaves, stems, flowers, or fruits. In most cases, each type of plant pest or disease presents distinct symptoms that enable accurate identification based on visual characteristics. Among these plant parts, leaves are often the earliest to show signs of infection, making them a critical indicator for diagnosing plant abnormalities. Therefore, this study focused on citrus leaves as the primary research subject to investigate the diagnosis and identification of citrus pests and diseases.
To train and evaluate the proposed YOLOv11-RDTNet model for citrus pest and disease detection, a dedicated image dataset was constructed. The data for this study were collected in Lantian Village, No. 888, Xitang Fairy Tale Town, Aotou Town, Conghua District, Guangzhou City, Guangdong Province, China (latitude: 23.64241° N, longitude: 113.50641° E). Images of citrus leaves were captured using a smartphone under natural field conditions. To ensure diversity in the dataset, photos were taken on sunny days at different times of day (9:00 a.m.–12:00 p.m. and 2:00–5:00 p.m.), at a fixed distance of approximately 8 cm from the leaf surface. The dataset encompassed a wide range of lighting conditions, weather variations, and complex background scenarios. Additionally, particular attention was paid to capturing partially occluded leaves and small targets, in order to better simulate real-world detection challenges and improve model robustness in practical agricultural environments. The final dataset consisted of 1382 annotated images covering five categories of citrus pests and diseases: anthracnose, canker, yellow vein disease, coal pollution disease, and leaf miner moth. The number of images per category was relatively balanced to ensure both diversity and representativeness of the dataset. Representative samples of each pest and disease type are shown in
Figure 1.
Images were captured with a Xiaomi 13 smartphone at a resolution of 1920 × 1080 pixels and stored as JPG files. Each infected leaf image was annotated with the LabelImg 1.8.6 tool, and the annotations were saved in XML format to ensure compatibility with different algorithm requirements. The dataset was randomly divided into training, testing, and validation sets at a ratio of 8:1:1. The dataset split details are shown in
Table 1.
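For reproducibility, the 8:1:1 split described above can be realized with a simple random partition. The sketch below assumes a flat directory of JPG images; the directory path, file extension, and random seed are illustrative and not taken from the original pipeline.

```python
import random
from pathlib import Path

def split_dataset(image_dir, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly partition image files into training, testing, and validation lists (8:1:1)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * ratios[0])
    n_test = int(len(images) * ratios[1])
    return {
        "train": images[:n_train],
        "test": images[n_train:n_train + n_test],
        "val": images[n_train + n_test:],
    }

# Example with a hypothetical path:
# splits = split_dataset("datasets/citrus/images")
```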
2.1.2. Data Enhancement
To enhance the generalization ability of the citrus pest and disease detection model, reduce its reliance on specific data features, and improve its adaptability to complex environments and diverse targets, a systematic data augmentation process was applied to the dataset in this study. Data augmentation involves applying various transformations to the original images to increase the diversity of training samples, effectively mitigating overfitting and improving the model’s robustness in real-world scenarios. A total of eight augmentation techniques were used to expand the dataset, including: random lighting adjustment [
27], horizontal flipping, contrast variation [
28], Gaussian noise [
29], salt-and-pepper noise [
30], saturation adjustment [
31], random scaling, and mosaic augmentation.
(1) Random lighting simulates variations in illumination conditions by randomly adjusting image brightness; (2) horizontal flipping simulates different shooting angles by randomly rotating images horizontally; (3) random contrast adjustment alters pixel brightness differences to generate high-contrast or low-contrast images, simulating image quality changes caused by camera settings or lighting; (4) Gaussian noise mimics sensor noise or blurring by randomly adding Gaussian noise, enhancing model robustness to low-quality images; (5) salt-and-pepper noise simulates extreme noise caused by sensor malfunction or signal interference, improving the model’s stability in noisy environments; (6) random saturation adjustment produces grayscale to highly saturated image variations, enhancing adaptability to color diversity; (7) random scaling simulates targets of varying sizes by randomly resizing target areas, improving multi-scale detection performance; and (8) mosaic augmentation stitches four different images into one, combining multiple targets and diverse backgrounds within a single image, thereby further expanding the diversity of training data and improving generalization in complex scenes.
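To make the augmentation pipeline concrete, the sketch below implements four of the eight operations (random lighting, horizontal flipping, Gaussian noise, and salt-and-pepper noise) with OpenCV and NumPy; the parameter ranges are illustrative assumptions rather than the exact values used in this study. Note that geometric operations such as flipping, scaling, and mosaic also require the bounding-box annotations to be transformed accordingly, whereas the photometric and noise operations leave them unchanged.

```python
import cv2
import numpy as np

def random_brightness(img, low=0.6, high=1.4):
    """Scale pixel intensities to simulate lighting variation (technique 1); range is assumed."""
    factor = np.random.uniform(low, high)
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def horizontal_flip(img):
    """Mirror the image left-right to simulate a different shooting angle (technique 2)."""
    return cv2.flip(img, 1)

def add_gaussian_noise(img, sigma=10.0):
    """Add zero-mean Gaussian noise to mimic sensor noise or blur (technique 4)."""
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def add_salt_pepper_noise(img, amount=0.01):
    """Randomly set pixels to black or white to simulate impulse noise (technique 5)."""
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0
    out[mask > 1 - amount / 2] = 255
    return out
```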
Figure 2 illustrates the effect of data augmentation on the dataset. The final collection comprised 12,438 images in total: the 1382 original images plus 11,056 augmented versions (eight augmented copies per original), with no overlap between the training and validation sets. At this stage, dataset construction was complete.
2.2. YOLOv11-RDTNet Improvement
2.2.1. New Object Detection Algorithm: YOLOv11
YOLOv11 is the latest generation of object detection algorithms developed by Ultralytics. Building upon previous versions of the YOLO series, it introduces substantial improvements in both network architecture and training methodology, significantly enhancing detection speed, accuracy, and efficiency [
32]. In addition to traditional object detection, YOLOv11 extends its functionality to a variety of computer vision tasks, including instance segmentation, pose estimation, oriented object detection, image classification, and object tracking. As illustrated in
Figure 3, YOLOv11 adopts an improved backbone and neck design that greatly enhances feature extraction capabilities. The backbone utilizes a series of convolutional and deconvolutional layers, combined with residual connections and bottleneck structures, to reduce model complexity while maintaining strong performance. The neck, situated between the backbone and head, is responsible for feature fusion and enhancement, further boosting detection capability. Furthermore, YOLOv11 introduces new feature extraction modules such as C3K2 and C2PSA. The C3K2 module replaces the C2f module in YOLOv8 by using two small convolutional kernels (kernel size 2) instead of a single large kernel, improving computational efficiency while preserving feature representation. The C2PSA module incorporates a spatial attention mechanism that enhances the model’s focus on key regions, significantly improving the detection of small and overlapping objects. By refining the model’s ability to selectively attend to regions of interest, YOLOv11 outperforms previous versions in scenarios requiring fine-grained object recognition. It also retains the SPPF module from YOLOv8 to facilitate multi-scale feature fusion and further optimize inference speed. With a refined architecture and optimized training strategy, YOLOv11 achieves a better balance between speed and accuracy, demonstrating higher mean Average Precision (mAP) on the COCO dataset, faster inference, and fewer parameters compared to earlier versions. Owing to the incorporation of the C2PSA module, its performance on small and densely packed targets is notably enhanced [
33]. With robust cross-environment adaptability and support for diverse vision tasks, YOLOv11 proves to be a powerful tool for real-time computer vision applications across mobile and cloud-based platforms. Therefore, it was selected as the baseline model in this study.
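As a point of reference, the baseline model used here can be loaded and run through the Ultralytics Python API. The sketch below assumes the standard yolo11n.pt nano weights and an illustrative image path.

```python
from ultralytics import YOLO

# Load the YOLOv11 nano baseline (weight name follows the Ultralytics naming convention).
model = YOLO("yolo11n.pt")

# Run inference on a citrus leaf image; the file path is illustrative.
results = model.predict("citrus_leaf.jpg", imgsz=640, conf=0.25)
for r in results:
    # Box coordinates, predicted class ids, and confidence scores.
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```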
2.2.2. Robust Feature Downsampling Module: RFD
Robust Feature Downsampling (RFD) is a novel and general-purpose downsampling module [
34]. It has two variants: Shallow RFD (SRFD) and Deep RFD (DRFD), designed to optimize different stages of feature extraction and enhance feature robustness. Compared to conventional downsampling modules, RFD integrates multi-path downsampling features to improve the representational capability of feature maps while minimizing information loss.
Figure 4 shows RFD’s framework utilizing three distinct downsampling methods: convolutional, slicing-based, and max pooling. Each of these techniques offers unique characteristics and advantages, collectively contributing to the powerful performance of the RFD module.
The SRFD (Shallow Robust Feature Downsampling) module is designed to process shallow feature maps and consists of two key components: a feature enhancement layer and slice downsampling. The convolution layer in the feature enhancement stage extracts additional information from the input, boosting the expressive power of the feature maps. The enhanced feature maps then undergo slice-based downsampling, which retains key details from the original input and preserves the fine structure of shallow features while minimizing information loss. In contrast, the DRFD (Deep Robust Feature Downsampling) module handles deeper, more complex feature maps by combining three downsampling techniques (convolutional downsampling, max-pooling downsampling, and slice downsampling) to extract more robust and highly discriminative deep features, thus improving the model's overall performance in complex tasks. Convolutional downsampling preserves key texture information through filtering operations, max-pooling downsampling enhances local invariance by selecting maximum values in local regions, and slice downsampling reduces redundant information while retaining key features by processing the feature map in slices. By integrating these methods, the DRFD module maintains the structural integrity of feature maps and ensures that critical details are not lost, thereby increasing feature-map granularity and further enhancing the model's accuracy and robustness.
Figure 5 provides a comprehensive breakdown of the SRFD and DRFD architectures, highlighting key components: GConv (group convolution), DWConvD (Depthwise Separable Convolution Downsampling), CutD (slice downsampling), and MaxD (max-pooling downsampling). Each element plays a distinct role in the overall framework [
35]. Unlike traditional downsampling methods such as single convolution or pooling operations (as used in models like YOLOv5 and YOLOv8), the RFD module adopts a three-path fusion structure (convolution + slicing + max pooling) to retain more spatial structural information. Compared to similar structures such as SPPF, the RFD module emphasizes robust downsampling and is specifically designed with shallow (SRFD) and deep (DRFD) branches to accommodate different receptive fields. Although this structure was originally proposed in [35] for remote sensing object detection, we are the first to introduce it to orchard pest and disease small-object detection, adapting its structure and retraining it to verify its effectiveness in complex agricultural environments.
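The three-path fusion idea can be illustrated with a simplified PyTorch block that downsamples by a factor of two along a convolutional path, a slice (space-to-depth) path, and a max-pooling path, then fuses the results with a 1 × 1 convolution. This is a minimal sketch of the concept only, not the authors' SRFD/DRFD implementation; the channel widths and fusion layer are assumptions.

```python
import torch
import torch.nn as nn

class MultiPathDownsample(nn.Module):
    """Simplified 2x downsampling that fuses conv, slice (space-to-depth), and max-pool paths."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv_path = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
        self.slice_path = nn.Conv2d(c_in * 4, c_out, kernel_size=1)   # applied after space-to-depth
        self.pool_path = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(c_in, c_out, kernel_size=1))
        self.fuse = nn.Conv2d(c_out * 3, c_out, kernel_size=1)

    def forward(self, x):
        conv = self.conv_path(x)
        # Slice downsampling: rearrange each 2x2 neighbourhood into channels (lossless).
        sliced = torch.cat([x[..., 0::2, 0::2], x[..., 1::2, 0::2],
                            x[..., 0::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        sliced = self.slice_path(sliced)
        pooled = self.pool_path(x)
        return self.fuse(torch.cat([conv, sliced, pooled], dim=1))

# x = torch.randn(1, 64, 80, 80)
# y = MultiPathDownsample(64, 128)(x)   # -> shape (1, 128, 40, 40)
```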
2.2.3. Dynamic Group Shuffle Transformer: DGST
To improve the accuracy and efficiency of object detection, researchers have continuously explored new network architectures and algorithms. Against this backdrop, the Dynamic Group Shuffle Transformer (DGST) was proposed as an innovative technique [
36]. As shown in
Figure 6, the DGST module combines the ideas of Dynamic Group Convolution and the Shuffle Transformer. Dynamic Group Convolution is a flexible convolutional operation that dynamically adjusts its grouping strategy according to the characteristics of the input data and can automatically adapt the number of groups to task complexity, thereby balancing computational cost and model performance. The Shuffle Transformer is an improved Transformer-based architecture that applies a shuffle operation to the input sequence, enhancing the model's generalization capability and enabling it to better learn the intrinsic structure of the input data. At its core, the DGST module adopts a 3:1 partition strategy, in which one-third of the module performs group convolution and channel shuffle operations, and the fully connected layers are replaced with convolutional operations of equivalent functionality so that the module can substitute for the original neck component. This design not only reduces computational requirements but also aligns better with the characteristics of convolutional neural networks, potentially offering superior performance. By integrating dynamic group convolution with shuffle transformation, DGST enhances the model's feature extraction and representation capabilities, improving the accuracy and efficiency of object detection while reducing computational demand and model size. Unlike the static group convolutions used in ShuffleNet or MobileNet, DGST employs a dynamic allocation strategy that allows the grouping structure to adapt to task variations, while the shuffle mechanism enhances cross-group information flow. Although this idea is inspired by the lightweight network designs in [
36], we are the first to introduce it into the YOLOv11 architecture and tailor it specifically for object detection through structural pruning and lightweight optimization. This enhancement improves the representation capability for small-object detection while maintaining model efficiency. To our knowledge, this design has not yet been applied in agricultural pest and disease detection scenarios.
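The group convolution and channel shuffle operations at the heart of DGST can be sketched as follows. This is a simplified, static-group illustration applying the operations to one subset of the channels; the dynamic group selection and the exact partitioning of the original module are not reproduced, and the channel counts and group numbers are chosen only for demonstration.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups so information can flow between them."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class GroupShuffleBlock(nn.Module):
    """Simplified block: group conv + channel shuffle on part of the channels, identity on the rest."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.split = channels // 3          # illustrative partition of the channels
        self.groups = groups
        self.gconv = nn.Conv2d(self.split, self.split, kernel_size=3,
                               padding=1, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(self.split)
        self.act = nn.SiLU()

    def forward(self, x):
        a, b = x[:, :self.split], x[:, self.split:]
        a = channel_shuffle(self.act(self.bn(self.gconv(a))), self.groups)
        return torch.cat([a, b], dim=1)

# x = torch.randn(1, 96, 40, 40)
# y = GroupShuffleBlock(96)(x)   # output has the same shape as the input
```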
2.2.4. Task-Aligned Dynamic Detection Head: TADDH
The Task-Aligned Dynamic Detection Head (TADDH) is an innovative technique in computer vision and deep learning that has demonstrated significant advantages in object detection tasks [
37]. It was proposed to address the inconsistency between the localization and classification tasks, which are often handled independently in conventional object detection models. Such separation can result in accurate localization but incorrect classification, or vice versa. The TADDH dynamically aligns these two tasks to improve the overall performance of the model. As illustrated in
Figure 7, the TADDH employs shared convolutional layers to substantially reduce the number of parameters, making the model lighter and more suitable for deployment on resource-constrained devices. To accommodate the varying object scales targeted by different detection heads, it introduces a scale adjustment layer that adapts the resolution of the feature maps. An interactive feature learner, utilizing multiple convolutional layers, extracts and amalgamates representations for intertask interactions. In the localization branch, the TADDH integrates these interactive features with DCNv2 (Deformable Convolutional Network v2) to generate the necessary offsets and modulation masks for deformable sampling. In the classification branch, the interactive features enable dynamic feature selection. DCNv2 enhances standard deformable convolution with multi-level offset learning, enabling flexible sampling across multi-scale feature maps. Its modulation mechanism dynamically adjusts sampling positions and intensities, improving the network’s adaptability. By promoting effective interaction between the localization and classification branches, TADDH enhances both tasks’ precision and robustness, thereby significantly improving detection accuracy.
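The DCNv2-based localization branch described above can be sketched with torchvision's modulated deformable convolution, where a small convolution predicts the offsets and modulation masks from the interactive features. The layer sizes and the way the interactive features are formed are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLocBranch(nn.Module):
    """Localization-branch sketch: predict offsets/masks, then apply modulated deformable conv."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        k = kernel_size
        # 2*k*k offset values (x, y per sampling point) plus k*k modulation scalars per location.
        self.offset_mask = nn.Conv2d(channels, 3 * k * k, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, interactive_feat):
        om = self.offset_mask(interactive_feat)
        offset, mask = om[:, :18], torch.sigmoid(om[:, 18:])   # split for k = 3
        return self.dcn(interactive_feat, offset, mask)

# feat = torch.randn(1, 64, 40, 40)
# out = DeformableLocBranch(64)(feat)   # -> shape (1, 64, 40, 40)
```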
In
Figure 7, the Task Decomposition module divides the concatenated features into task-specific representations. Based on this, the computed task interaction features are used to perform object classification and localization simultaneously, allowing mutual awareness between the two tasks. Because the two tasks share a single branch, however, the integrated interaction features inevitably introduce some contention between them: classification and localization focus on different objectives and therefore rely on different types of features, such as different levels of abstraction and receptive fields. To address this issue, we introduced a layer attention mechanism that dynamically computes task-specific features across hierarchical levels, enabling more precise task decomposition and improving the coordination between classification and localization.
In the context of
Figure 8, there are two tasks: classification and localization. To compute the task-specific features for each of these tasks, a layer attention mechanism was introduced. The core idea behind this mechanism was that the interactions between different feature layers could provide valuable insights into the relationship between tasks, thereby facilitating the extraction of features specific to each task. The task-specific features were computed using Equation (1).
Among them, ω_k represents the k-th element of the learned layer attention weights ω. It is computed based on the cross-layer task interaction features using Equation (2) and is capable of perceiving and recording the dynamic interactions between different layers.
Among them, fc_1 and fc_2 represent two fully connected layers, and σ denotes the sigmoid activation function. x^inter is the result obtained by performing average pooling on X^inter, while X^inter is obtained by performing a concatenate operation on the interactive features X_k^inter.
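A compact PyTorch sketch of this layer attention computation (Equations (1) and (2)) is given below. The hidden width of the two fully connected layers and the ReLU between them are assumptions, as are the tensor shapes and variable names.

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """Layer-attention sketch: re-weight each interactive feature layer to obtain task-specific features."""
    def __init__(self, channels, num_layers):
        super().__init__()
        self.fc1 = nn.Linear(num_layers * channels, num_layers * channels // 4)  # width assumed
        self.fc2 = nn.Linear(num_layers * channels // 4, num_layers)

    def forward(self, inter_feats):                  # list of N tensors, each (B, C, H, W)
        x_inter = torch.cat(inter_feats, dim=1)      # concatenate along channels: (B, N*C, H, W)
        pooled = x_inter.mean(dim=(2, 3))            # global average pooling: (B, N*C)
        # Eq. (2): w = sigmoid(fc2(fc1(x_inter))); a ReLU between the layers is an assumption.
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(pooled))))   # (B, N)
        # Eq. (1): task-specific feature of layer k is w_k * X_k^inter.
        return [w[:, k].view(-1, 1, 1, 1) * f for k, f in enumerate(inter_feats)]

# feats = [torch.randn(2, 64, 40, 40) for _ in range(3)]
# task_feats = LayerAttention(64, 3)(feats)   # three re-weighted feature maps
```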
This method not only focuses on information from individual network layers but also analyzes feature interactions across different layers to gain a deeper understanding of intertask relationships, thereby enabling the extraction of more accurate and effective task-specific features. The design of TADDH is inspired by the concept of task alignment and deformable convolution (DCNv2). Unlike the original detection head in YOLOv8, we introduced a task-interactive feature fusion mechanism that modelled the features required for localization and classification separately through a layer attention mechanism, while using lightweight shared convolutions to reduce the number of parameters. Compared with existing detection heads (such as YOLOHead in YOLOv5 and the Decoupled Head in YOLOv8), the TADDH is not only more lightweight in structure but also better suited for multi-scale feature alignment, significantly improving detection accuracy for small-scale disease targets. Our implementation was based on the work of Zhong et al. [
37] in power equipment detection, but we adapted and optimized it specifically for the pest and disease detection scenario.
2.2.5. YOLOv11-RDTNet Model
This study introduces YOLOv11-RDTNet, a streamlined and effective citrus pest and disease detection model derived from the YOLOv11 framework. In the backbone, the model integrates shallow and deep Robust Feature Downsampling (SRFD and DRFD) modules. Additionally, the original C3k2 modules in both the backbone and neck are replaced with the Dynamic Group Shuffle Transformer (DGST) module. Finally, a Task-Aligned Dynamic Detection Head (TADDH) is employed in place of the original detection head. As depicted in
Figure 9, the YOLOv11-RDTNet model consists of five main components: input, backbone, neck, head, and output.
The proposed model incorporates a series of carefully designed components to enhance detection performance under complex conditions. The input layer employs several data augmentation methods, such as lighting and contrast variations, to simulate pest and disease images under varied lighting conditions. These augmentations significantly improve the model's adaptability to diverse lighting environments and its robustness against complex backgrounds and noise interference. In the backbone, the RFD (Robust Feature Downsampling) modules, comprising shallow (SRFD) and deep (DRFD) variants, are introduced as plug-and-play downsampling units. These modules integrate multi-path downsampling features to improve the expressive capacity of feature maps while minimizing information loss. Feature downsampling is a critical step in image processing, as it reduces computational cost, accelerates processing, and enables the model to learn more abstract and representative features. The design of the RFD modules enables deeper downsampling while preserving rich feature information. Moreover, the original C3k2 modules in both the backbone and neck are replaced with the DGST (Dynamic Group Shuffle Transformer) module, which divides input feature maps into multiple groups and performs group-wise convolution operations independently, improving the model's capacity to capture fine local characteristics. In the head, the model adopts the novel Task-Aligned Dynamic Detection Head (TADDH), which introduces a dynamic alignment mechanism to promote coordination between the classification and localization tasks during training. By incorporating feature fusion and optimization strategies, the model effectively combines features of different scales and levels, achieving enhanced detection capability in complex scenes while significantly reducing the number of parameters and model size, thus improving both robustness and generalization.
2.3. Training Environment and Evaluation Metrics
The YOLOv11 model was trained using the PyTorch 2.5.1 deep learning framework. The testing platform was a deep learning server equipped with an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM), a 14-core Intel (R) Xeon (R) Gold 6330 CPU (2.00 GHz), CUDA version 12.4, and 50 GB of memory. To ensure consistency of experimental conditions, no pre-trained weights were used. The model was trained for 300 epochs with a batch size of 16. The SGD optimizer was adopted with a learning rate set to 0.01. The loss function used consisted of the three built-in components of YOLOv11: box_loss, cls_loss, and dfl_loss.
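For reference, the training configuration above maps onto the Ultralytics training API roughly as follows. The dataset YAML name is an assumed placeholder, and the baseline YOLOv11n config is used here in place of the modified YOLOv11-RDTNet definition.

```python
from ultralytics import YOLO

# Build the model from its YAML config rather than pre-trained weights,
# matching the "no pre-trained weights" setting described above.
model = YOLO("yolo11n.yaml")

model.train(
    data="citrus.yaml",   # assumed dataset config listing the five classes and split paths
    epochs=300,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    pretrained=False,
)
```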
To select the optimal model, this study employed mean average precision (mAP), precision (P), recall (R), frames per second (FPS), and parameter count (parameters/M) as evaluation metrics to compare and assess model performance.
In Equations (3) and (4), TP, FP, and FN represent true positives, false positives, and false negatives, respectively. TP refers to the number of citrus pest and disease instances correctly identified by the model, FP refers to the number of non-citrus pest and disease instances incorrectly identified as positive, and FN refers to the number of citrus pest and disease instances missed by the model. Precision measures the proportion of true positives relative to all detections, and recall is the proportion of true positives compared to all annotated examples.
In Equation (5), AP refers to the area under the precision–recall (PR) curve. In Equation (6), mAP is the average of the per-category AP scores, with N indicating the total number of classes in the dataset. Since the dataset contains five citrus pest and disease categories, N = 5.
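For reference, the standard definitions consistent with the descriptions of Equations (3)–(6) above are:

```latex
P  = \frac{TP}{TP + FP} \quad (3), \qquad
R  = \frac{TP}{TP + FN} \quad (4), \qquad
AP = \int_{0}^{1} P(R)\, \mathrm{d}R \quad (5), \qquad
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (6)
```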
In addition, the inference speed (FPS) of all models was measured on the same hardware platform. The input image resolution was uniformly set to 640 × 640, and the batch size was set to 1 to ensure fair comparison under single-image input conditions. The measured inference time included only the forward pass, excluding data preprocessing and post-processing, in order to maximize the comparability of FPS values. All models were tested with their default configurations, without any additional acceleration plugins or unofficial optimization strategies, to ensure the consistency and transparency of the evaluation. The FPS was calculated as the reciprocal of the average inference time per image, and the standard deviation was recorded to reflect the stability of the runtime performance.
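The measurement protocol described here (batch size 1, 640 × 640 input, forward pass only, FPS as the reciprocal of the mean latency, with the standard deviation recorded) can be approximated with the following sketch; the warm-up count, number of timed runs, and the assumption of a CUDA device are illustrative choices.

```python
import time
import numpy as np
import torch

@torch.no_grad()
def measure_fps(model, n_runs=200, warmup=20, imgsz=640, device="cuda"):
    """Time only the forward pass on single 640x640 inputs; return mean FPS and latency std."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):            # warm-up iterations are excluded from timing
        model(x)
    times = []
    for _ in range(n_runs):
        torch.cuda.synchronize()       # assumes a CUDA device; ensures accurate timing
        t0 = time.perf_counter()
        model(x)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    times = np.array(times)
    return 1.0 / times.mean(), times.std()   # FPS, standard deviation of latency (s)
```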
In this article, GenAI tools were used solely for translating and polishing the manuscript, with no other applications.
4. Discussion
The YOLOv11-RDTNet model was evaluated on the test dataset and compared with seven state-of-the-art object detection models, demonstrating outstanding performance across multiple metrics. (1) The proposed model achieved an mAP50 of 87%, which represented improvements of 4.8%, 8.2%, 5.0%, 5.2%, 5.4%, 15.2%, and 12.4% over YOLOv11n, YOLOv10n, YOLOv9t, YOLOv8n, YOLOv5s, Faster R-CNN, and SSD, respectively. These results indicate that YOLOv11-RDTNet offers superior precision in citrus pest and disease recognition tasks. (2) In terms of parameter efficiency and model compactness, YOLOv11-RDTNet reduced the parameter count by 40.3% compared to the original YOLOv11, with the total model size compressed to just 3.4 MB. This highlights the model’s potential for lightweight deployment, as it substantially reduces parameter volume and computational complexity while maintaining high performance. Consequently, YOLOv11-RDTNet significantly enhances computational efficiency and reduces resource consumption, enabling efficient operation on resource-limited platforms such as mobile and embedded devices. These features make the model highly practical for real-world agricultural applications.
Zhu et al. [
38] proposed a multi-model fusion network (MMFN) for citrus leaf disease detection based on model ensemble and transfer learning. Although their model achieved a high classification accuracy of 98.68%, it was trained on a publicly available dataset with a simple and repetitive background, which limited its applicability in complex real-world environments. Apacionado et al. [
39] attempted to detect sooty mold on citrus leaves in complex scenarios using YOLOv7, but the model only achieved an mAP of 74.4%. In contrast, the YOLOv11-RDTNet model proposed in this study not only performed effectively in complex backgrounds but also demonstrated high detection precision, showcasing its superior performance. Furthermore, Yan et al. [
40] introduced a citrus disease classification model based on an improved ConvNeXt architecture; however, the model was limited to detecting only three types of diseases and suffered from a large size and high parameter count, resulting in slow detection speeds that were unsuitable for deployment on mobile devices. Compared with these existing methods, the YOLOv11-RDTNet model overcomes the aforementioned limitations, maintaining high accuracy in complex environments while remaining lightweight and efficient. This makes it a practical and effective solution for pest and disease management in citrus production and offers valuable support for promoting sustainable agricultural development.
In agricultural practice, the proposed model can be integrated into smart agriculture management platforms to enable early warning of citrus pests and diseases, precision pesticide application, and digital management. This would effectively reduce economic losses caused by pests and diseases, minimize the use of chemical pesticides, and improve fruit quality and yield. Furthermore, the design concept of this model can be extended to the identification of pests and diseases in other crops, providing a technical foundation for building a universal and efficient crop health monitoring system. Therefore, the outcomes of this study have strong potential for practical application and hold significant importance for promoting intelligent and precision agriculture.
5. Conclusions
In this study, we proposed an innovative model named YOLOv11-RDTNet for the detection of citrus pests and diseases under complex environmental conditions. The model incorporated three major enhancements. First, the traditional convolutional downsampling modules were replaced with the RFD downsampling modules (SRFD and DRFD), which fuse convolutional, slice-based, and max-pooling downsampling paths to retain richer feature information, thereby enhancing the model's adaptability and detection accuracy in challenging scenarios. Second, the original C3k2 module was substituted with the DGST module, which effectively addressed issues such as leaf occlusion, background clutter, and multi-scale lesion detection, while maintaining a lightweight architecture. Third, the standard detection head was replaced with the TADDH detection head, which not only boosted detection precision but also significantly reduced the model's parameter count and computational complexity. Consequently, YOLOv11-RDTNet significantly enhanced computational efficiency and reduced resource consumption, enabling efficient operation on resource-limited platforms such as mobile and embedded devices. These features make the model highly practical for real-world agricultural applications.
Although the proposed model demonstrated significant advantages in citrus pest and disease detection and showed great application potential, there are still some limitations to address. Specifically: ① Lack of cross-regional independent validation: The model was trained and tested on splits drawn from a single dataset, without evaluation on independent images from different regions or years. As a result, the model's robustness and stability under diverse external data conditions require further validation. ② Challenges under complex and extreme conditions: Although multiple mechanisms were introduced in this study to enhance detection capability in complex backgrounds, the model still faced challenges in scenarios with extreme lighting, severe occlusion, or the coexistence of multiple diseases. ③ Limited coverage of pest and disease types: The current dataset includes only five common types: anthracnose, canker, yellow vein disease, coal pollution disease, and leaf miner moth. Other agriculturally significant pests and viral diseases, such as citrus rust mite and psyllids, are not yet included. This limits the model's ability to recognize more complex and diverse pest and disease systems in real-world applications.

Based on the above limitations, future research will focus on the following directions. First, the context-learning mechanism is critical for enabling the model to effectively handle complex and dynamic environments. At present, the model's detection performance remains limited under extreme lighting conditions or in cluttered backgrounds. To address this issue, future research could explore more efficient context modeling approaches. For example, incorporating Graph Neural Networks (GNNs) may enhance the model's global understanding of the pest occurrence environment by leveraging their powerful capability to model relationships between nodes. Alternatively, improvements to current attention mechanisms could help the model more accurately capture key features, thereby maintaining high detection accuracy even in complex scenarios. Second, multi-scale feature fusion plays a crucial role in improving the precision of pest and disease detection. However, existing methods exhibit limitations in cross-domain adaptation and few-shot learning settings. To overcome these challenges, future work may focus on developing dynamic weighting strategies for multi-scale features, allowing the model to flexibly adjust the importance of each scale based on varying detection needs and environmental conditions. This would enhance the model's adaptability and robustness. Moreover, the introduction of meta-learning techniques presents a promising direction; by enabling the model to quickly adapt to new environments with limited data, it can significantly improve generalization performance across diverse conditions.

With the implementation of these improvements, it is reasonable to expect a significant enhancement in overall model performance. This would not only provide strong technical support for the development of smart agriculture but also assist farmers in managing citrus pests and diseases more efficiently, thereby reducing production costs and increasing agricultural output. More importantly, it would contribute to the sustainable development of agriculture and support the implementation of rural revitalization strategies.