1. Introduction
The citrus industry, including the farming of Citrus reticulata Blanco, contributes substantially to global agriculture, especially in tropical and subtropical regions, but faces serious threats from Huanglongbing (HLB). Rapid and accurate detection is crucial for maintaining productivity and farmer incomes [1,2]. Machine learning (ML) has advanced disease diagnosis through digital imaging; however, traditional ML methods rely on manual feature extraction, whereas deep learning (DL) overcomes this limitation by automatically learning discriminative features, thereby improving efficiency and accuracy [3,4]. Although DL models have high parameter counts and training costs, they offer fast inference, aiding early disease identification and preventing economic losses [5,6,7,8]. ML has gained traction in object detection for its strengths in feature extraction and pattern recognition [9]. Convolutional Neural Networks (CNNs), such as DenseNet-121, have been applied to citrus disease detection, achieving 95% accuracy through pre-trained weights [10,11]. While You Only Look Once (YOLO) achieves 90% precision through comprehensive data augmentation [12], its design generates excessive redundant bounding boxes. This necessitates Non-Maximum Suppression (NMS), a process that may inadvertently suppress valid detections of small objects due to their lower confidence scores [13]. Transformer-based models, such as the Detection Transformer (DETR), use self-attention to capture global relationships and have demonstrated superior generalization [14,15,16,17]. DETR integrates a CNN backbone with a transformer encoder–decoder, eliminating NMS and simplifying the detection pipeline [18,19,20]. However, DETR suffers from slow convergence and struggles with small object detection, prompting various improvements. Xizhou et al. [21] developed Deformable-DETR, which improves shape adaptability by employing deformable attention mechanisms. Meng et al. [22] introduced Conditional-DETR, which optimizes attention with a variational autoencoder. Gao et al. [23] presented Spatially Modulated Co-Attention DETR (SMCA-DETR), which significantly improves small object detection by incorporating multi-scale attention modules. Cao et al. [24] proposed CF-DETR, a coarse-to-fine framework that enhances detection accuracy through local context refinement. Hao et al. [25] proposed DINO (Detection transformer with Instance Noise Optimization), which reduces label noise to achieve better convergence.
Notably, the Real-Time Detection TRansformer (RT-DETR) [26] has made breakthroughs in real-time object detection with its efficient hybrid encoder, which effectively handles multi-scale features. In evaluations on the Common Objects in Context (COCO) dataset, RT-DETR has shown advantages over similarly sized YOLO detectors in both speed and accuracy. Zhu et al. [27] proposed an improved RT-DETR that integrates features generated from multi-scale perception, greatly enhancing feature extraction and improving accuracy in drone detection, achieving precisions of 95.6% and 97.8% on two drone datasets. Li et al. [28] introduced EMSC-DETR, an improved high-precision and robust model for free-range chicken detection based on RT-DETR, which significantly enhances the computational efficiency of the transformer. Although EMSC-DETR is faster than other transformer-based models, its parameter count and computational complexity cannot match those of lightweight models.
To achieve high precision and recall in identifying Huanglongbing (HLB) in citrus while also considering cost and training time, this study forgoes YOLO in favor of a lightweight model from the DETR series, the real-time end-to-end object detector RT-DETR, as the baseline for improvement. Compared to YOLO, RT-DETR features a hybrid encoder that effectively handles multi-scale features and Intersection over Union (IoU)-aware query selection, enabling it to achieve higher computational efficiency than YOLO models of comparable accuracy and superior performance in detecting small objects in agricultural contexts. Huangfu et al. [29] utilized the lightweight HHS-RT-DETR model for citrus Huanglongbing detection; their experiments show an accuracy of 92.4%, outperforming YOLOv5m and YOLOv8n, though room for further improvement remains. Li et al. [30] demonstrated the applicability of RT-DETR-SoilCuc to cucumber germination detection in soil environments, with certain advantages over similarly sized YOLO series models. However, these studies primarily focused on large groups or individual objects, without adequately addressing challenges such as severe occlusion or small object detection.
Based on the challenges identified in detecting Huanglongbing (HLB) in citrus, including low detection accuracy and inefficiency, this paper proposes an improved HLB detection model called MSHLB-DETR. Built on RT-DETR, this model integrates a novel transformer module (SDRM) and a contextual information learning module (CG Block), significantly enhancing detection accuracy for HLB in complex environments while reducing model complexity and increasing detection speed. This advancement provides robust technical support for precise and efficient HLB detection in citrus orchards.
The main contributions of this study are summarized as follows:
(1) A dataset for citrus Huanglongbing (HLB) disease is constructed, capturing images of HLB-affected citrus in natural orchard environments. The dataset records the growth conditions of citrus leaves under natural settings. Additionally, a novel RGB image enhancement algorithm is proposed, specifically addressing the challenge of subtle visual color features associated with HLB symptoms on citrus leaves.
(2) To address the issue of small target feature loss, a novel transformer module called the Smart Disease Recognition Multi-scale Transformer (SDRM) is proposed. SDRM incorporates a space-to-depth (SPD) module and an inverted residual mobile block (IRMB), which facilitate deep interaction and information flow between local and global features, minimizing the loss of critical information and significantly enhancing the computational efficiency of the transformer.
(3) Additionally, to address the detection and differentiation challenges caused by occlusions in diseased citrus leaves, this study introduces an innovative feature learning module within the transformer encoder called the Context-Guided Block (CG Block). Inspired by traditional self-attention mechanisms and the human visual system’s reliance on contextual information for scene comprehension, the CG Block learns both the local features of target objects and contextual information regarding their surroundings. This results in a more precise feature representation, enhancing the fusion of adjacent feature maps. By fully utilizing the global context information obtained from adjacent high-level features in the backbone network, the model achieves a more accurate localization of overlapping diseased leaf targets.
2. Materials and Methods
2.1. Data Acquisition and Preprocessing
The dataset used in this study consists of 4347 citrus Huanglongbing (HLB) images captured in a citrus orchard in Ganzhou, Jiangxi Province, China, during February and July 2024. The specific experimental plot was located at geographical coordinates 25°48′ N, 114°55′ E. All images were acquired using a REDMI K40 (Xiaomi Technology Co., Ltd., Beijing, China) with a resolution of 3024 × 3024 pixels. The images were captured at varying distances ranging from approximately 0.5 to 3 m from the citrus trees, representing typical working distances for in-field plant phenotyping and manual inspection. This distance range ensures the acquisition of both close-up leaf details and broader canopy-level features, effectively capturing the multi-scale characteristics of HLB symptoms in natural growing conditions. Environmental conditions during image acquisition were systematically recorded to ensure experimental transparency. Data collection was conducted during daylight hours (8:00–17:00) under natural lighting conditions, with illumination intensities ranging from 15,000 to 80,000 lux depending on weather conditions and the time of day. The temperature varied between 12 and 28 °C during February and 25 and 35 °C during July, reflecting the typical seasonal variations in the region. Relative humidity ranged from 55% to 85% across different collection days. All images were acquired under clear to partly cloudy conditions to ensure consistent lighting quality. These environmental parameters cover the typical growing conditions of citrus orchards in this region and provide important context for understanding the visual characteristics of the captured leaf images.
Prior to augmentation, all collected images underwent manual quality screening. Images were removed if they were blurry, had severe lighting issues, contained irrelevant subjects, or were near-duplicates of other shots, ensuring the quality and relevance of the base dataset. To enhance model robustness and effectively prevent overfitting, a series of data augmentation techniques was applied to the image data [31,32]. To improve detection accuracy in the Huanglongbing-affected regions of citrus images, an innovative data augmentation method was proposed, specifically targeting the features of diseased leaves. Through an in-depth analysis of the visual characteristics of these diseased leaves, a transformation formula was developed for image processing, as shown in Equation (1).
I = α·R + β·G − (1 − (R + G)/255)·(γ + β)    (1)

Equation (1) provides a physiologically informed transformation to amplify HLB symptoms and suppress confounding noise by processing the digital intensity values (0–255) of the RGB channels. Here, I denotes the adjusted pixel intensity after enhancement, applied uniformly to each pixel. The constants α, β, and γ are empirically tuned to emphasize the red and green spectral regions (R: 620–750 nm; G: 495–570 nm) most affected by HLB, leveraging the correspondence between the camera’s color filters and human visual perception. The first component, α·R + β·G, is tailored to the pathology of HLB. It enhances the specific spectral bands (red and green) that signify the loss of chlorophyll and the emergence of yellowing pigments, while the diagnostically irrelevant blue channel is omitted to minimize its susceptibility to noise. The second component, −(1 − (R + G)/255)·(γ + β), acts as an adaptive corrector. It counters the effects of uneven lighting by boosting contrast in dimly lit areas and simultaneously suppressing the influence of common dark background noises like soil and shadows. The synergistic effect of both terms ensures that HLB signatures are robustly enhanced against complex orchard backgrounds, providing a purified input for the detection model. Prior to training and detection, the aforementioned method is employed to augment the collected images of Huanglongbing-infected citrus. The color of the citrus leaf epidermis is transformed into a deeper orange-yellow hue, overall brightness is increased, and the contrast between diseased and healthy leaves is enhanced. Furthermore, the color of the diseased leaf areas becomes more pronouncedly orange-yellow, significantly improving the recognition accuracy of the diseased regions. This enhancement provides better input images for subsequent object detection algorithms, thereby increasing the recognition accuracy of the Huanglongbing detection model for citrus leaves.
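As a concrete illustration, the following NumPy sketch applies the Equation (1) transformation per pixel. The weights α, β, and γ are tuned empirically in the paper; the values below are placeholders, not the authors’ settings.

```python
import numpy as np

def enhance_hlb_rgb(image: np.ndarray, alpha: float = 1.2,
                    beta: float = 0.8, gamma: float = 0.5) -> np.ndarray:
    """Sketch of the Equation (1) enhancement for an RGB image.

    image: uint8 array of shape (H, W, 3), channels in RGB order.
    alpha, beta, gamma: illustrative weights only; the tuned values
    are not given in this excerpt.
    """
    r = image[..., 0].astype(np.float32)
    g = image[..., 1].astype(np.float32)
    # First term: amplify the red/green bands that carry HLB yellowing.
    # Second term: adaptive correction that suppresses dark background
    # pixels (soil, shadow), where R + G is small.
    intensity = alpha * r + beta * g - (1.0 - (r + g) / 255.0) * (gamma + beta)
    return np.clip(intensity, 0.0, 255.0).astype(np.uint8)
```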
As shown in Figure 1, this study employed several image enhancement methods, including Histogram Equalization, RGB Enhancement, Gaussian Noise addition, and Horizontal Rotation, to expand the dataset. The augmented dataset consists of a total of 4367 images, which were then split into training, validation, and testing sets at an approximate 7:2:1 ratio. Image annotation was performed using the Python-based labeling tool LabelImg (https://github.com/tzutalin/labelImg, accessed on 8 October 2025), which utilizes rectangular annotation frames and generates label files in VOC format. Since the symptoms of citrus Huanglongbing (HLB) naturally occurring in orchards typically involve systemic pathological changes across the entire leaf rather than localized lesions, and considering the need to evaluate the overall health status of the citrus tree, this study adopts a whole-leaf annotation approach. This strategy not only ensures the accurate capture of systemic disease characteristics but also provides a more comprehensive basis for assessing tree health in orchard management. The categories include Huanglongbing (HLB), healthy (health), and other minor diseases (ill), with respective proportions of 75%, 18%, and 7%. The “ill” category comprises conditions that may present visual similarities to HLB or cause detection challenges, including citrus scab, citrus canker, and nutritional deficiencies (particularly zinc and magnesium deficiency). This grouping reflects field conditions where multiple pathologies may co-occur and allows for an evaluation of model specificity against non-HLB conditions. It is noteworthy that all images retained their natural backgrounds to ensure that the model was trained and evaluated under realistic orchard conditions.
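As an illustration of the annotation format, the short Python sketch below reads one LabelImg VOC XML file and returns its labelled whole-leaf boxes. The field names follow the standard VOC layout that LabelImg emits; the class names are those defined above.

```python
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path: str) -> list[tuple[str, int, int, int, int]]:
    """Read one LabelImg VOC annotation file and return its whole-leaf
    boxes as (class_name, xmin, ymin, xmax, ymax) tuples. Class names
    follow the dataset above: 'HLB', 'health', or 'ill'."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(float(bb.findtext("xmin"))),
                      int(float(bb.findtext("ymin"))),
                      int(float(bb.findtext("xmax"))),
                      int(float(bb.findtext("ymax")))))
    return boxes
```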
2.2. Structure and Features of Datasets
The structure and characteristics of the dataset have a decisive impact on model performance. In aggregate, the training, validation, and testing sets in this study contain 3056, 874, and 437 images, respectively. In some datasets, the distribution of target sizes is uneven, with a predominance of small targets and only a few large targets, as illustrated in Figure 2. Following the COCO evaluation protocol established by Lin et al. [33], targets are categorized as small (area < 32²), medium (32² ≤ area ≤ 96²), or large (area > 96²), where “area” refers to the number of pixels a target occupies, measured in square pixels.
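The COCO size thresholds quoted above translate directly into code; this small helper is a plain restatement of the protocol, not part of the authors’ pipeline.

```python
def coco_size_category(area_px: float) -> str:
    """Categorize a bounding-box area (in square pixels) using the COCO
    thresholds cited in the text: 32^2 = 1024 and 96^2 = 9216."""
    if area_px < 32 ** 2:
        return "small"
    if area_px <= 96 ** 2:
        return "medium"
    return "large"

# Example: a 25 x 30 px box is "small"; a 100 x 100 px box is "large".
assert coco_size_category(25 * 30) == "small"
assert coco_size_category(100 * 100) == "large"
```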
Table 1 summarizes the total number of bounding boxes and the proportions of the different target sizes for each split, with “boxes” indicating the total number of bounding boxes [34]. In total, the datasets contain 24,258 targets, with small targets comprising the largest proportion at 55.12%. This distribution naturally creates significant challenges for object detection in Huanglongbing images:
Environment: Detecting diseased citrus leaves in outdoor environments is significantly more challenging than in controlled settings. Complex backgrounds and highly variable lighting conditions impede the discrimination of diseased leaves from healthy ones based on color and shape. These factors necessitate the development of robust models capable of distinguishing subtle symptomatic features within cluttered visual scenes.
Target Scale: When detecting diseased citrus leaves in images, the sizes of leaves and diseased areas vary with the distance between the camera and the leaves. As the distance from the camera to the target increases, the number of pixels occupied by the leaves and diseased areas decreases, resulting in noticeable scale variations among leaves within a single image. This issue is particularly pronounced in side-angle shots, where perspective changes lead to even greater variations in leaf size and shape, further increasing the difficulty of accurately detecting diseased leaves.
Target Characteristics: In the natural environment of citrus trees, leaves often cluster together, introducing additional challenges for image-based disease detection due to overlapping and occlusion. The dense arrangement of leaves complicates the precise localization and regression of bounding boxes, particularly under mutual occlusion. These occlusions, together with the need for tightly positioned bounding boxes, increase the difficulty of feature extraction and can lead to convergence issues during model training.
2.3. RT-DETR Structure
RT-DETR is an innovative real-time end-to-end object detector that integrates the multi-scale feature processing capabilities of Vision Transformers (ViTs), ensuring high-speed performance without sacrificing accuracy. It consists of a backbone network, hybrid encoder, transformer decoder, and auxiliary prediction head, eliminating handcrafted components like the Non-Maximum Suppression (NMS) found in the YOLO series. The backbone network uses a CNN to extract features at different levels, including high-level features (downsampled 32 times), mid-level features (downsampled 16 times), and low-level features (downsampled 8 times). The hybrid encoder enhances multi-scale feature interaction through Attention-based Intra-scale Feature Interaction (AIFI) and Cross-scale Feature Fusion (CCFM) modules, improving feature representation. Additionally, IoU-aware query selection optimizes object query initialization, while adjustable decoder layers provide flexible inference speed. The experimental results show that RT-DETR outperforms YOLO models of similar size in both speed and accuracy on the COCO dataset. Given its high detection performance, RT-DETR serves as a strong foundation for detecting Huanglongbing (HLB) in citrus. The structure of RT-DETR is illustrated in Figure 3.
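To make the three feature levels concrete, the snippet below computes the corresponding feature map sizes; the 640 × 640 input resolution is an assumption for illustration.

```python
# Illustrative spatial sizes of the three backbone feature levels for a
# 640 x 640 input (the input resolution here is an assumption):
for name, stride in [("low-level", 8), ("mid-level", 16), ("high-level", 32)]:
    size = 640 // stride
    print(f"{name} (1/{stride}): {size} x {size}")
# low-level (1/8): 80 x 80; mid-level (1/16): 40 x 40; high-level (1/32): 20 x 20
```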
2.4. The Improved MSHLB-DETR Model
To address the challenges of multi-scale object detection in complex agricultural environments, this study proposes an enhanced and efficient detection model based on RT-DETR, named Multi-Scale Huanglongbing DETR (MSHLB-DETR), with its architecture depicted in Figure 4. While extensively validated on the specific task of citrus Huanglongbing detection, the model’s design principles are generalizable. MSHLB-DETR comprises four key components: a backbone feature extraction network, an optimized encoder, a feature fusion network, and a decoder. In the ResNet-18 backbone, the novel SDRM replaces the first downsampling layer in Stage 2, substituting for the original stride-2 depthwise convolution (DWConv). This modification enables more efficient feature extraction for small targets in the early stages of the network. The SDRM integrates a space-to-depth (SPD) module and an inverted residual mobile block (IRMB). The SPD component, consisting of an SPD layer and a non-strided convolutional layer, mitigates the information loss typical of strided convolution and pooling operations, thereby preserving finer image details. The IRMB module combines the lightweight characteristics of CNNs with the dynamic modeling capability of transformers, enhancing accuracy in dense prediction tasks. Crucially, the SDRM converts spatial features into depth features, facilitating deep interaction between local and global features while boosting computational efficiency. Furthermore, the model enhances the original transformer encoder with the Context-Guided Block (CG Block), a feature learning module that captures the combined features of local details and their surrounding context, thereby improving the integration of global contextual information. More detailed explanations of these improved modules are provided below.
2.4.1. Space-to-Depth (SPD)
Capturing diseased citrus leaves in orchard environments presents two main challenges arising from multi-scale targets: small targets are difficult to detect accurately, and targets at the image edges are harder to detect. These challenges often overlap, as small targets located at the image edges are both small and edge targets, facing an even higher level of detection difficulty. The combination of small target size and incomplete information for edge targets requires detection algorithms with greater precision and robustness to ensure the accurate recognition of these challenging targets [35]. Classical CNNs such as AlexNet [36] and ResNet [37] play key roles in visual recognition but rely on strided convolution and pooling, which may cause information loss, degrading small target detection due to low resolution and limited contextual information. In multi-scale citrus leaf images, models tend to prioritize larger targets, as they occupy more prominent areas, leading to suboptimal small target detection.
To address low-resolution and small target detection challenges, this study introduces Space-to-Depth Convolution (SPD-Conv), which replaces strided convolution and pooling with two CNN building blocks, a space-to-depth (SPD) layer and a non-strided convolution layer, enhancing feature preservation for small and edge targets. As shown in Figure 5, the feature map (Figure 5a) represents the traditional input feature map with channel count (C), height (H), and width (W). Through the space-to-depth operation (Figure 5b), spatial blocks of pixels are rearranged into the depth/channel dimension, increasing the channel count to 4 while halving the spatial dimensions. Afterward, channel merging is applied (Figure 5c), where the four different channel groups are combined along the channel dimension. The merged feature map is then added to other processed feature maps (Figure 5d). Finally, a convolution with a stride of 1 is applied to the resulting output feature map (Figure 5e), reducing the channel dimension to 1 while maintaining the spatial resolution, which is still 1/4 of the original size. Liu et al. [38] have demonstrated the effectiveness of SPD-Conv in mitigating information loss during the downsampling process, and Han et al. [39] have shown its excellent performance in small target detection and segmentation tasks.
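A minimal PyTorch sketch of the SPD-Conv idea described above follows; the kernel size and channel widths are illustrative choices, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a stride-1 convolution, as described
    for Figure 5: each 2x2 neighbourhood is moved into the channel axis
    (C -> 4C, spatial size halved), then a non-strided convolution mixes
    the channels without discarding information."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_ch, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Slice even/odd rows and columns and stack them as channels:
        # (B, C, H, W) -> (B, 4C, H/2, W/2), a lossless rearrangement.
        tl = x[..., ::2, ::2]    # top-left pixel of each 2x2 block
        bl = x[..., 1::2, ::2]   # bottom-left
        tr = x[..., ::2, 1::2]   # top-right
        br = x[..., 1::2, 1::2]  # bottom-right
        x = torch.cat([tl, bl, tr, br], dim=1)
        return self.conv(x)
```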
2.4.2. Inverted Residual Mobile Block (IRMB)
In densely packed canopies, mutual occlusion among leaves results in severely occluded or merged leaf appearances, which complicates the accurate regression of bounding boxes for individual instances.
The inverted residual mobile block (IRMB) combines the efficiency of CNN architectures for local feature modeling with the dynamic modeling capabilities of transformers for long-range interactions. This study constructs a ResNet-like model composed solely of IRMBs for downstream tasks, with IRMB stacking at different levels enabling the development of a more efficient and lightweight model.
For the dense prediction of diseased citrus leaves, this study designs a four-stage ResNet-like efficient model based on a series of IRMBs. To ensure lightweight computation, multi-head self-attention is replaced with window multi-head self-attention, and standard convolutions are substituted with depthwise convolution (DWConv). Since the query (Q), key (K), and value (V) features generate a large number of additional parameters, we assume that the input image X ∈ ℝ^(C×H×W) satisfies Q = X and K = X. The efficient module first applies a multi-layer perceptron (MLP) operation with an expansion rate λ to perform dimensionality expansion:

X_e = MLP_e(X), X_e ∈ ℝ^(λC×H×W)    (2)

where X_e represents the expanded feature map obtained by processing the original input X through a multi-layer perceptron with expansion factor λ; “X ∈ ℝ^(C×H×W)” indicates that the values at each channel and each position of the input image X are real numbers, with C, H, and W corresponding to the number of channels, height, and width of the image. The term X_e ∈ ℝ^(λC×H×W) indicates that the output maintains the same spatial dimensions (H × W) but expands the channel dimension from C to λC. Imagine this as creating an “enhanced version” of the original features: while the spatial structure remains intact, each position now carries λ times more feature information, enabling a richer representation of subtle disease patterns. During this operation, skip connections are incorporated to enhance the information flow and stability within the network while reducing the number of parameters. The resulting window multi-head self-attention, implemented through these modifications, is referred to as extended window multi-head self-attention (EW-MHSA). The efficient operator F(·) is redefined using EW-MHSA to further enhance image features, and its implementation can be expressed as follows:

F(X) = DW-Conv(EW-MHSA(X)) + X    (3)
This equation defines our efficient feature transformation operator F(·) that integrates extended window multi-head self-attention (EW-MHSA), depthwise convolution (DW-Conv), and skip connections. The processing pipeline operates as follows: EW-MHSA captures long-range dependencies between image regions through its attention mechanism; DW-Conv then processes local feature patterns, while skip connections preserve original information during transformations. This design achieves an optimal balance between global feature capture and local detail modeling. The inverted residual mobile block effectively combines the advantages of dynamic global context and static local features, significantly enhancing the transmission of feature information and the expressive power of the network, leading to improved performance in dense object detection. Therefore, the inverted residual mobile block is chosen to update the basic building blocks of the RT-DETR backbone model.
As illustrated in Figure 6, the process begins with generating the query (Q) and key (K) vectors, which serve as the foundation for the self-attention mechanism. A dilated convolution is then applied to generate the value (V) vector. Next, window self-attention is performed on Q, K, and V to enable long-range interactions. Subsequently, DWConv is used to model local features, effectively decoupling channel mixing and spatial mixing, thereby reducing computational load while preserving local feature sensitivity.
Finally, a compressing convolution restores the number of channels, and the output is added to the input to form a residual connection, facilitating gradient flow in deep networks and enhancing model performance. Notably, since the core operations of dilated convolution and the self-attention mechanism involve matrix multiplication, self-attention can be computed before executing the dilated convolution [40]. This approach reduces the number of floating-point operations while maintaining computational equivalence, thereby improving model efficiency.
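The following simplified PyTorch sketch mirrors the IRMB pipeline described above (expansion, attention, DWConv, compression, residual). It uses plain global multi-head attention in place of the windowed EW-MHSA and is an illustration under stated assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class IRMBSketch(nn.Module):
    """Minimal sketch of the inverted residual mobile block: expand
    channels by a factor lambda with a pointwise MLP, apply self-attention
    for long-range interaction, model local patterns with a depthwise
    convolution, then compress channels and add the residual. Window
    partitioning and the Q = K = X sharing of EW-MHSA are omitted."""

    def __init__(self, channels: int, expand: int = 4, heads: int = 4):
        super().__init__()
        hidden = channels * expand
        self.expand = nn.Conv2d(channels, hidden, 1)        # MLP_e: C -> lambda*C
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.compress = nn.Conv2d(hidden, channels, 1)      # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y = self.expand(x)                                  # X_e in R^(lambda*C x H x W)
        tokens = y.flatten(2).transpose(1, 2)               # (B, H*W, lambda*C)
        tokens, _ = self.attn(tokens, tokens, tokens)       # long-range interaction
        y = tokens.transpose(1, 2).reshape(b, -1, h, w)
        y = self.dwconv(y)                                  # local feature modelling
        return x + self.compress(y)                         # residual connection
```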
2.4.3. Smart Disease Recognition for Citrus Huanglongbing with Multi-Scale (SDRM)
To address the challenges of citrus disease detection in complex backgrounds, multi-scale targets, and clustering occlusions, this study introduces the Smart Disease Recognition Module (SDRM), which strategically integrates the complementary strengths of the space-to-depth (SPD) and inverted residual mobile block (IRMB) components. The SPD component serves as a feature-preserving frontend that converts spatial information into channel depth through lossless downsampling, effectively reducing information loss while enhancing the detection of small-scale leaf targets. The IRMB component then processes these enriched features using an efficient transformer-inspired architecture, employing attention mechanisms to improve feature flow in densely occluded regions. The synergistic SDRM architecture enhances global feature extraction while preserving the local details and boundary features of citrus leaves, making it particularly suitable for detecting HLB symptoms under challenging orchard conditions where both fine-scale details and contextual understanding are essential for accurate diagnosis. The SDRM structure is illustrated in Figure 7.
The SDRM consists of two branches. One branch applies DWConv operations, which reduce both parameters and computational load compared to regular convolution. The other branch uses DWConv followed by two SPD operations to reduce the feature map size; it then connects to the IRMB module, identifying key features in dense regions and supplementing global information to improve the model’s accuracy in recognizing clustered regions. After deconvolution (DeConv), feature maps X1 and X2 are concatenated along the channel dimension, and a final DWConv produces the output feature map. This process ensures that the feature map produced by the SDRM possesses both local and global representational capabilities.
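Putting the pieces together, the sketch below wires the two SDRM branches as described, reusing the SPDConv and IRMBSketch modules from the earlier sketches; channel widths, kernel sizes, and the fusion layer are assumptions.

```python
import torch
import torch.nn as nn

class SDRMSketch(nn.Module):
    """Two-branch SDRM layout: a stride-2 DWConv branch, and a
    DWConv -> SPD x2 -> IRMB -> DeConv branch, concatenated and fused
    with a depthwise-then-pointwise convolution. Assumes SPDConv and
    IRMBSketch from the earlier sketches are in scope."""

    def __init__(self, ch: int):
        super().__init__()
        # Branch 1: lightweight depthwise downsampling (H -> H/2).
        self.branch1 = nn.Conv2d(ch, ch, 3, stride=2, padding=1, groups=ch)
        # Branch 2: DWConv, two SPD steps (H -> H/4), IRMB, upsample to H/2.
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.spd1 = SPDConv(ch, ch)
        self.spd2 = SPDConv(ch, ch)
        self.irmb = IRMBSketch(ch)
        self.deconv = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        # Fuse the concatenated branches (DWConv then pointwise projection).
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, 3, padding=1, groups=2 * ch),
            nn.Conv2d(2 * ch, ch, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.branch1(x)                      # local branch at H/2
        y = self.spd2(self.spd1(self.dw(x)))      # lossless downsampling to H/4
        x2 = self.deconv(self.irmb(y))            # global branch back at H/2
        return self.fuse(torch.cat([x1, x2], dim=1))
```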
2.4.4. Context-Guided Block (CG Block)
To enhance the accuracy of citrus disease detection, this study proposes a Context-Guided (CG) Block inspired by the human visual system. The CG Block encodes multi-level contextual information through three complementary pathways: local features, surrounding context, and global semantic guidance. It incorporates an efficient attention-like mechanism that jointly models spatial dependencies while applying channel-wise weighting based on global context, similar to the SENet architecture. This design effectively balances representational power and computational efficiency, making it particularly well-suited for dense object recognition scenarios. By integrating local and contextual features, CG Block significantly improves the model’s ability to detect citrus diseases in complex orchard environments, especially under conditions of occlusion, cluttered backgrounds, and subtle symptom presentation. It enhances the recognition of fine-grained and edge-based disease characteristics, thereby optimizing both detection accuracy and model reliability.
As shown in Figure 8, human vision improves recognition by incorporating local details (purple) and global context (red), underscoring the importance of surrounding information. This combined approach strengthens feature extraction and detection capability in challenging scenes.
Figure 8d illustrates the CG Block’s structure. The local feature extractor f_loc(·) uses a 3 × 3 convolution to learn from the eight neighboring vectors. The surrounding context extractor f_sur(·) applies a 3 × 3 dilated convolution to expand the receptive field. The joint feature extractor f_joi(·) concatenates the two sets of features, followed by Batch Normalization (BN) and Parametric Rectified Linear Unit (PReLU) activation. The global feature extractor f_glo(·) employs global pooling and fully connected layers to generate a weighting vector, refining feature emphasis through element-wise multiplication. The effectiveness of the CG Block in enhancing contextual learning has been demonstrated in prior research [41].
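A compact PyTorch sketch of the four extractors follows; the branch channel widths and the reduction ratio of the global gate are illustrative, and the exact arrangement inside the paper’s encoder may differ.

```python
import torch
import torch.nn as nn

class CGBlockSketch(nn.Module):
    """Sketch of the Context-Guided Block following Figure 8d: a 3x3 conv
    for local features (f_loc), a 3x3 dilated conv for surrounding context
    (f_sur), concatenation + BN + PReLU as the joint extractor (f_joi),
    and a squeeze-excitation style global gate (f_glo)."""

    def __init__(self, ch: int, dilation: int = 2, reduction: int = 8):
        super().__init__()
        self.f_loc = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)
        self.f_sur = nn.Conv2d(ch, ch, 3, padding=dilation,
                               dilation=dilation, groups=ch, bias=False)
        self.f_joi = nn.Sequential(nn.BatchNorm2d(2 * ch), nn.PReLU(2 * ch))
        self.f_glo = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * ch, (2 * ch) // reduction, 1), nn.ReLU(),
            nn.Conv2d((2 * ch) // reduction, 2 * ch, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.f_loc(x), self.f_sur(x)], dim=1)
        joint = self.f_joi(joint)
        # Global context yields per-channel weights that gate the joint
        # local + surrounding features (element-wise multiplication).
        return joint * self.f_glo(joint)
```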
2.5. Evaluation Metrics and Experimental Environments
This study evaluates the object detection model using key metrics, including parameters (Params), giga floating-point operations (GFLOPs), frames per second (FPS), and mean average precision (mAP). Params reflect the model’s storage requirements, while GFLOPs measure computational complexity based on forward and backward propagation. FPS serves as an indicator of processing speed; together, these metrics assess the model’s deployment efficiency. mAP quantifies detection performance by calculating the area under the precision–recall (PR) curve. Precision is the ratio of correctly identified samples to all detected samples, while recall is the ratio of correctly identified samples to all actual positive samples. As precision increases, recall often decreases, highlighting their trade-off. Precision and recall are calculated as follows:

Precision = TP / (TP + FP)    (4)
Recall = TP / (TP + FN)    (5)

where TP (True Positive) is the number of correctly predicted positive samples, FP (False Positive) is the number of incorrectly predicted positive samples, and FN (False Negative) is the number of positive samples missed by the model. The mean average precision (mAP) evaluates overall object detection performance by averaging AP across categories, making it suitable for multi-class tasks. mAP50 represents the average precision at a 50% IoU threshold, while mAP50:95 averages precision over IoU thresholds from 50% to 95% in 5% intervals. IoU measures the accuracy of predicted bounding boxes. FPS (frames per second) indicates the number of images processed per second. The main formulas are as follows:

mAP = (1/C) · Σᵢ₌₁^C APᵢ    (6)
FPS = 1000 / (preprocess + inference + postprocess)    (7)

where C represents the total number of categories, and preprocess, inference, and postprocess denote the per-image preprocessing, inference, and postprocessing times, respectively, in milliseconds. The experimental environment used in this study includes a Windows 11 system (CPU: Intel(R) Core(TM) i5-12490F; GPU: NVIDIA GeForce RTX 4070 SUPER; RAM: 32 GB). Model construction was conducted using Python 3.10.14, with the CUDA 12.1 library for GPU acceleration and the PyTorch 2.3.0 deep learning framework. Hyperparameter settings are listed in Table 2.
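The evaluation formulas above translate directly into code; the timing values in the usage example are invented for illustration.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def fps(preprocess_ms: float, inference_ms: float, postprocess_ms: float) -> float:
    """FPS = 1000 / (preprocess + inference + postprocess), times in ms."""
    return 1000.0 / (preprocess_ms + inference_ms + postprocess_ms)

# Illustrative values only: 90 TP, 10 FP, 5 FN -> P = 0.90, R ~ 0.947;
# 1.2 + 8.5 + 0.8 ms per image -> roughly 95 FPS.
print(precision_recall(tp=90, fp=10, fn=5))
print(fps(1.2, 8.5, 0.8))
```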
3. Experiment Results and Analysis
3.1. Backbone Network Design and Selection
The backbone network is crucial to model performance. Baidu developed four RT-DETR variants, and this study selected RT-DETR-L, RT-DETR-VGG, RT-DETR-R18, RT-DETR-R34, and RT-DETR-RegNet for comparison. RT-DETR-L serves as the baseline without backbone substitution. Evaluations were based on parameters, computational complexity, and detection accuracy, with the results detailed in
Table 3. As a lightweight ResNet model, ResNet18 features 18 weight layers and low complexity. It has only 19.9M parameters, significantly fewer than other networks, reducing computational requirements, storage, and deployment costs while minimizing overfitting risk. ResNet18 also has the lowest computational demand at 50.8 GFLOPs, making it suitable for resource-constrained environments. It achieved the highest
at 80.2%, enhancing small target detection and localization precision. As a result, RT-DETR-R18 is selected as the base model for subsequent improvements in this study.
3.2. Improving the Backbone Network Based on the SDRM
The SDRM, integrating SPD and the IRMB, significantly improves detection accuracy and generalization. To validate its efficiency, ablation experiments were conducted, with the results shown in Table 4. Adding SPD and the IRMB individually to the baseline increased mAP50 by 8.8% and 10.8%, respectively, while combining both in the SDRM led to a 15.5% increase together with a reduction in parameters. SPD reduces feature map size while increasing channels, accelerating transformer computations by minimizing attention cycles and inner product operations. The IRMB enhances local–global feature interaction. Overall, the SDRM effectively leverages both feature types, improving model generalization for complex scenes while boosting computational efficiency. Metric visualizations are shown in Figure 9.
3.3. The Effectiveness of the CG Block Module
After improving the backbone network of RT-DETR with the SDRM, the model demonstrates a significant improvement in mAP metrics on the test set. However, further enhancement is still required for prediction performance in complex occlusion scenarios. The original RT-DETR model employs a conventional self-attention mechanism in its transformer encoder component, which may limit expressive feature mapping, particularly for dense occlusion prediction tasks. To address this limitation and further enhance its generalization ability while preserving its prediction capability, the improved model incorporates the CG Block, a lightweight module that enhances object detection by fusing multi-scale context through dilated convolutions and channel attention mechanisms.
To validate the effectiveness of the CG Block’s hierarchical context fusion, we compared it with representative attention mechanisms (SimAM [42] and CAA [43]) under identical settings, focusing on their trade-offs between accuracy and efficiency. The CG Block achieves 96.0% mAP50 with only 48.3 GFLOPs, surpassing SimAM and CAA by 3.1% and 1.4% in accuracy while running 2.8× and 1.8× faster, respectively. This demonstrates its superior capability in balancing accuracy and efficiency. The impact of the different attention modules on the recognition results is presented in Table 5.
3.4. Ablation Experiments on the Proposed MSHLB-DETR
To assess the effectiveness of the SDRM and CG Block, ablation experiments on MSHLB-DETR were conducted, with the results shown in Table 6. Replacing DWConv with the SDRM in RT-DETR increased mAP50 by 14% while achieving the lowest parameter count. Modifying the transformer encoder to include only the CG Block improved FPS and enhanced dense target recognition. Compared to RT-DETR-R18, MSHLB-DETR achieved the highest mAP50 at 96.0%, with other metrics slightly improved or comparable, demonstrating superior generalization. The training curves in Figure 10 reveal that RT-DETR-R18 exhibited significant oscillations during the early training phase and slower convergence, suggesting a potential risk of overfitting. In contrast, the other models showed rapid mAP50 improvement by the 75th epoch, followed by stable convergence. Overall, models equipped with the SDRM and CG Block outperformed the baseline methods, demonstrating faster convergence and the highest accuracy while also improving training stability.
3.5. Comparison Experiments of Different Models
In this experimental phase, we thoroughly evaluated the MSHLB-DETR model’s capability to detect citrus leaf diseases, comparing it against eight leading object detection algorithms to illustrate the performance differences between network models, with particular emphasis on MSHLB-DETR’s superior accuracy and detection speed. The comparison models included both CNN-based and transformer-based architectures. Specifically, YOLOv5, YOLOv8 [44], YOLOv12 [45], and Faster R-CNN [46] were chosen as CNN-based models, while DEIM-D-FINE-L [47], RT-DETRv3-R18 [48], DINO, and RT-DETR-R18 represented the transformer-based models. To ensure fairness, the performance metrics selected for comparison included mAP50, mAP50:95, parameters, GFLOPs, and FPS. A visualization of the comparison models is presented in Table 7 and Figure 11.
Among the CNN-based models, the YOLO series demonstrates somewhat lower accuracy and generalization across various scenarios compared to the MSHLB-DETR model. Specifically, YOLOv5 achieves an mAP50 of 72.5% on the test set, significantly lower than MSHLB-DETR’s 96.0%. Although the YOLOv8 model benefits from a more complex architecture, yielding relatively good results in mAP50 and FPS (outperforming YOLOv5’s 72.5% and 41.0 FPS and YOLOv12’s 78.7% and 40.8 FPS), its anchor-based detection paradigm, constrained receptive fields, and NMS-induced errors hinder its ability to match transformer-based models, which explains why its mAP50 remains significantly lower than MSHLB-DETR’s 96.0%. Additionally, Faster R-CNN’s single-layer, lower-resolution feature maps result in suboptimal accuracy. When training CNN-based models, there is a tendency to focus on recognizing and locating larger target objects; this bias may reduce detection accuracy for smaller targets, especially when multiple object sizes are present in the image [49]. In contrast, the proposed MSHLB-DETR achieved higher detection accuracy, particularly on small targets, outperforming the CNN-based models.
Among the transformer-based models, DETR has the lowest mAP50 and shows a slower convergence rate compared to the other models trained on the same number of images. Notably, DINO demonstrates remarkable performance on the test set due to its rapid convergence during training, with an mAP50 only 6.6% lower than that of MSHLB-DETR. However, its GFLOPs are nearly double those of MSHLB-DETR, and it shows lower precision in mAP50:95. Furthermore, transformer-based models, particularly those based on DETR, often perform well in small object detection, as DETR’s design enables it to capture global context information about all objects within an image. This capability allows DETR-based models to detect small objects more effectively, thereby enhancing overall performance in object detection tasks. In conclusion, the proposed MSHLB-DETR effectively addresses challenges such as small object detection and the loss of contextual information, making it superior in these aspects.
In natural environments, citrus leaf detection must account for complex conditions including heterogeneous lighting, partial occlusion by other plant organs, and cluttered backgrounds. To address these challenges, we evaluated the performance of various models in different scenarios, comparing MSHLB-DETR, YOLOv8, Faster R-CNN, and RT-DETR-R18.
Figure 12 provides comprehensive visual evidence of MSHLB-DETR’s detection capabilities under challenging orchard conditions. The comparative analysis reveals several distinct advantages of our approach. In occlusion-heavy scenarios (Figure 12B), MSHLB-DETR demonstrates remarkable resilience, accurately identifying diseased leaves that remain undetected by other methods. While Faster R-CNN, YOLOv8, and RT-DETR-R18 exhibit significant missed detections and positional inaccuracies for occluded targets, our model maintains precise localization through its context-guided design. This capability is particularly evident in edge occlusion cases where conventional detectors struggle with partial visibility. The model’s proficiency with densely packed small targets (Figure 12A,C) further underscores its architectural superiority. Where competing methods fail to distinguish individual leaves in clustered regions, MSHLB-DETR’s innovative feature fusion strategy enables the clear separation and accurate detection of small-scale targets. This advantage is crucial for practical applications where early disease symptoms often appear on distant or minimally visible leaves. Multi-scale detection challenges (Figure 12C) are effectively addressed by our approach, as demonstrated by MSHLB-DETR’s consistent performance across varying target sizes. The missed detections observed in YOLOv8 and RT-DETR-R18 under these conditions highlight the limitations of their feature representation capabilities compared to our method. Notably, in complex background environments (Figure 12D), MSHLB-DETR excels at detecting diseased leaves under mild occlusion and at image edges. The model’s precision in capturing subtle boundary features allows it to identify minute pathological changes even when leaves partially blend with the background, a capability that stems directly from our context-aware architecture. These visual comparisons collectively demonstrate that MSHLB-DETR’s innovations in feature fusion and context guidance translate into tangible performance improvements across all of the challenging conditions examined. The model’s consistent accuracy under occlusion, dense clustering, multi-scale variation, and complex backgrounds confirms its robustness and readiness for real-world agricultural applications.
As illustrated in Figure 13, a visual comparison between the heatmaps generated by different models and the corresponding original images reveals significant differences in attention distribution for citrus disease detection. Although MSHLB-DETR, YOLOv8, Faster R-CNN, and RT-DETR-R18 each exhibit their respective strengths, the heatmaps produced by MSHLB-DETR stand out prominently. Not only does MSHLB-DETR accurately highlight the high-concentration regions of citrus disease, but its attention distribution is also highly aligned with the actual disease locations in the original images. Compared to YOLOv8 and Faster R-CNN, MSHLB-DETR demonstrates superior control over false-positive hotspots, with attention more distinctly and centrally focused on the critical targets, thereby reducing the risk of both false alarms and missed detections.
While RT-DETR-R18 displays broader coverage of the detection area, it appears less precise in identifying disease features under complex scenarios. As shown in Figure 12A,C, the RT-DETR-R18 model struggles to effectively focus on the subtle features of citrus disease within large and complex detection regions. This limitation is likely attributable to its coarse-grained attention across expansive areas, which may compromise its ability to capture fine-grained disease characteristics, increasing the likelihood of erroneous predictions. In contrast, the heatmaps generated by MSHLB-DETR not only intuitively reflect its high sensitivity to small targets under complex backgrounds but also reveal its strong specificity in object detection tasks. This specificity is manifested in the model’s capacity to accurately distinguish and identify critical features of citrus disease, maintaining robust performance even under conditions of dense coverage and visual occlusion.
Figure 14 presents the confusion matrices of MSHLB-DETR and the main comparative models on the test dataset across three categories: HLB, Healthy, and Other. The results demonstrate that MSHLB-DETR achieves the highest classification reliability, with most samples concentrated along the diagonal, particularly for HLB detection. In contrast, YOLOv8 and RT-DETR-R18 show higher misclassification rates between Healthy and Other leaves, indicating difficulty in handling subtle visual differences. These findings confirm that MSHLB-DETR not only improves detection accuracy, as reported in Table 7, but also achieves higher reliability in distinguishing between disease and non-disease classes. This further highlights the model’s practical potential for deployment in real-world citrus orchards.
3.6. Discussion
The experimental results presented in this study demonstrate that the proposed MSHLB-DETR model achieves state-of-the-art performance in detecting citrus Huanglongbing under complex orchard conditions. This section provides a comprehensive analysis of these findings and their implications.
The significant performance improvement in MSHLB-DETR over baseline models can be attributed to its innovative architectural design. The 15.8% increase in mAP50 compared to RT-DETR-R18, while maintaining comparable inference speed, highlights the effectiveness of our approach. This enhancement is particularly notable given the challenging nature of our dataset, which contains over 52% small targets and numerous occluded leaf scenarios. The SDRM proves crucial for handling small targets through its space-to-depth transformation and inverted residual mobile block. This design effectively mitigates information loss during downsampling while preserving computational efficiency. Simultaneously, CG Block demonstrates strong capability in processing occluded leaves by leveraging contextual information from surrounding regions. The synergistic combination of these modules enables robust performance in dense orchard environments where traditional detectors often fail. The development of MSHLB-DETR addresses critical limitations in current agricultural disease detection systems. By maintaining high accuracy under varying lighting conditions, complex backgrounds, and different leaf arrangements, our model shows strong potential for real-world deployment. This capability represents a significant advancement over conventional methods that typically require controlled imaging conditions.
Despite its strong performance, we acknowledge certain limitations in this study. First, the model’s sensitivity to early-stage infections requires further enhancement. The current dataset contains a larger proportion of severely infected leaves with obvious visual features. While this distribution reflects the reality of field scouting priorities and contributes to the model’s high overall accuracy, it may limit the model’s sensitivity to early-stage, mildly infected leaves whose symptoms are subtle and harder to recognize. Second, the model’s generalization capability across different citrus varieties and geographical regions requires further validation. Finally, the current computational requirements, although efficient for a research model, may present challenges for deployment on resource-constrained edge devices. These limitations define clear directions for our future research. Our immediate plan is to collect more samples with mild and moderate infection levels and to annotate symptom severity explicitly. This will enable a stratified evaluation of model performance and direct improvement in early detection capabilities. Concurrently, we will focus on expanding the training dataset to include more diverse growing conditions and citrus cultivars. To address deployment challenges, we will develop optimized versions of MSHLB-DETR through model compression and quantization techniques. Furthermore, the adaptable architecture of our model shows promising potential for extension to other critical agricultural vision tasks, such as pest detection and nutrient deficiency identification.
In summary, MSHLB-DETR represents a significant step forward in agricultural computer vision, demonstrating that specialized architectural designs can effectively address the unique challenges of plant disease detection in complex environments.