1. Introduction
Tomato, scientifically known as
Solanum lycopersicum L., is one of the most widely cultivated and consumed crops worldwide. In 2024, global tomato production exceeded 186 million tonnes across approximately 4.9 million hectares [
1]. Although forecasts suggest that future production may slightly decline [
2], tomatoes remain an important subject of agricultural research. Tomatoes are rich in various vitamins and antioxidants, such as lycopene, which plays a crucial role in reducing the risks of chronic diseases, including microvascular complications related to diabetes, viral infections, and various types of cancer [
3,
4,
5,
6,
7,
8]. Due to their numerous health benefits, tomatoes are increasingly regarded as a functional food that promotes human health [
9]. Given their nutritional and economic value, ensuring the healthy growth of tomato plants is critical. Unfortunately, tomato plants are highly susceptible to various leaf diseases. These diseases significantly impact both the yield and quality of tomatoes, causing substantial losses to agriculture and related industries [
10,
11,
12,
13,
14,
15,
16]. In traditional agricultural production, the early prevention and control of tomato leaf diseases relies on field sampling and manual assessments by specialized technicians, a process that is not only inefficient but also prone to subjective biases [
17].
Furthermore, with rising labor costs, there is an urgent need for equipment and methods that can reduce the costs and lower the expertise required for agricultural work.
Recent agricultural vision studies have pushed toward deployable and system-aware sensing pipelines, including foundation model-assisted UAV forest monitoring and wireless collaborative inference for plant disease recognition [
18,
19]. These works motivate our focus on deployable, system-aware feature design for tomato disease detection on low-power hardware. Researchers have gradually shifted from traditional machine learning (ML) algorithms to convolutional neural network (CNN) models for object detection and classification tasks [
20,
21]. Various CNN architectures have been applied to target detection and classification tasks in the field of agriculture. Gehlot and Saini [
22] used several CNN architectures, including AlexNet [
23], VGG-16 [
24], GoogleNet [
25], DenseNet-121 [
26], and ResNet-101 [
27], to classify 10 types of tomato leaf diseases in the PlantVillage [
28] dataset.
Accordingly, classic CNNs now underpin two-stage and single-stage detectors, extending beyond classification and improving the size–speed trade-off for agricultural object detection. Two-stage object detection algorithms [
29], such as Faster R-CNN [
30] and Mask R-CNN [
31], generate candidate boxes first and then perform classification and bounding box regression. In contrast, single-stage object detection algorithms, such as You Only Look Once (YOLO) [
32], the Single-Shot Multi-Box Detector (SSD) [
33], and EfficientDet [
34], complete the object detection task in a single forward pass. These algorithms do not require the generation of candidate regions and directly predict the class and position of the object, offering faster speeds and better real-time performance. However, single-stage object detection algorithms are generally slightly less accurate than two-stage algorithms. Nevertheless, some advanced single-stage algorithms have made significant progress in terms of accuracy. Liu et al. [
35] proposed an early recognition method for tomato gray leaf spot based on the YOLOv3 model, where the original backbone of YOLOv3 was replaced with MobileNetv2, aiming to balance accuracy and real-time detection. The model was evaluated using F1 scores and AP values on images captured under four conditions and compared with Faster R-CNN and SSD models. The experimental results showed that the proposed model significantly improved the recognition performance. After YOLOv3, the YOLO series evolved rapidly. As the accuracy was continuously improved, YOLO’s speed and real-time performance increasingly surpassed those of two-stage algorithms and other single-stage algorithms, making it a mainstream choice in the field of object detection. Meanwhile, Wang et al. [
36] proposed a tomato leaf disease detection model, TDGA, based on deep learning with a global attention mechanism. TDGA enhanced feature extraction through a global attention mechanism on the original YOLOv5, used switchable atrous convolution (SAConv) to improve the detection capabilities, and adopted the efficient IoU loss (EIoU) instead of the traditional IoU loss (CIoU) to address issues like aspect ratio ambiguity and sample imbalance. Their experimental results showed that TDGA achieved average accuracy of 91.40%, which was 2.93% higher than that of the original YOLOv5 network. Tang et al. [
37] introduced a tomato leaf disease detection model based on PLPNet, which was built upon the YOLOX architecture. The model incorporated a perceptual adaptive convolution module for effective feature extraction, a location reinforcement attention mechanism to suppress soil interference, and a switchable atrous convolution and deconvolution network with secondary observation and feature consistency mechanisms to address disease interclass similarity. Their experimental results showed that PLPNet achieved 94.5% mAP50, 54.4% average recall (AR), and a speed of 25.45 frames per second (FPS) on a self-built dataset.
However, the accuracy of the model was improved by sacrificing its size, resulting in a lower recall rate. In addition, deploying the model on edge devices further reduces the FPS, making it difficult to meet the practical requirements of embedded edge devices. In practical agricultural applications, the performance of models is often limited by the hardware constraints of edge devices, such as limited computational power, memory, and energy supply [
38,
39,
40,
41,
42,
43]. Therefore, lightweighting models has become a critical research direction. Early efforts focused on creating smaller, more efficient CNN architectures for classification tasks. For example, Agarwal et al. [
44] designed a simplified CNN with only eight hidden layers, achieving high accuracy with minimal computational resources. This work, along with others, provided valuable insights into optimizing classic CNNs like VGG16 [
24] for resource-constrained environments.
Since then, researchers have increasingly focused on the performance of models deployed on edge devices and the need for model lightweighting. Liu et al. [
45] proposed an improved tomato leaf disease recognition method based on YOLOX, which addressed the issue of class imbalance by introducing a sample-adaptive cross-entropy loss function. The backbone of YOLOX was replaced with MobileNetV3 to achieve lightweight feature extraction. The authors added a convolutional block attention module (CBAM) between the backbone and neck network of YOLOX to enhance the feature extraction performance. Their experimental results showed that the improved YOLOX model increased the accuracy by 1.27%, reduced the memory usage by 35.34%, and improved the detection speed by 50.20%. After quantizing the model with TensorRT, the detection speed reached 11.1 FPS on the Jetson Nano embedded device. However, the real-time performance of the model remained limited, and, after deployment on actual hardware, the FPS dropped significantly because of hardware constraints. Liu et al. [
46] designed an improved YOLOv8n-based tomato leaf disease detection algorithm aimed at addressing the challenge of real-time detection on embedded devices. The algorithm introduced a lightweight multi-scale module (LMSM) and an attention lightweight subsampling module (ALSA) to effectively extract multi-scale semantic information for tomato leaf diseases, addressing the issues of irregular spot sizes and dense foliage. The head network was redesigned using partial convolution and group convolution along with a parameter-sharing method, and scalable auxiliary bounding boxes and loss function optimization strategies were introduced to further enhance its performance. After pruning, the computation was reduced by 61.7%, the model size decreased by 55.6%, and the FPS increased by 44.8%, while maintaining high accuracy. After TensorRT quantization, the detection speed reached 19.70 FPS on the Jetson Nano. Compared to previous studies, this algorithm improved the FPS after deployment, enhancing its real-time performance. However, 19.70 FPS still did not meet the required standard, and the accuracy of the model decreased slightly compared to the original YOLOv8n.
Outside the YOLO family, lightweight object detectors have also been deployed on embedded platforms for vision tasks. Single-Shot Multi-Box Detector (SSD) variants built on top of MobileNetV3 [
33,
47] achieve low-latency inference on ARM-class CPUs but tend to be weaker in small-object recall, which matters for lesion-level agricultural imagery. EfficientDet-Lite [
34] scales the BiFPN-based detector down to Edge-TPU and Cortex-A targets and reports strong accuracy–efficiency trade-offs, yet the typical deployment assumes a TensorFlow Lite backend rather than the BPU-accelerated tool chain that we target in this work. On the Transformer side, MobileViT [
48] and related mobile-friendly hybrid architectures have been ported to Raspberry Pi and Jetson-class boards for vision tasks, but their current exploration in leaf-level disease detection is still limited and the INT8 quantization support on agricultural NPU/BPU chips is less mature than for convolutional backbones. Our work therefore focuses on a convolution-dominated lightweight detector that maps cleanly onto the RDK X5 BPU, while acknowledging that the broader MobileNet-SSD/EfficientDet-Lite/MobileViT design space remains a relevant reference point for future cross-platform comparison.
Overall, previous research in applying object detection algorithms to agricultural scenarios has been constrained by the hardware performance, cost, model size, and speed on edge devices. While these efforts have advanced the field, there remains considerable scope for exploration in achieving an optimal balance between accuracy and real-time performance in agricultural embedded systems. In real agricultural applications, most farms rely on portable, low-cost hardware with limited computing power and energy supplies [
49,
50]. Previous works often neglect this gap between research and deployment: (1) many models cannot operate in real time on edge processors; (2) lightweight versions sacrifice accuracy excessively; and (3) few studies offer a complete prototype validated on physical hardware.
To address these challenges and narrow the PlantVillage-to-edge deployment gap, we propose HGS-YOLO, a system-oriented deployable lightweight adaptation of YOLOv11 for leaf-level tomato disease detection. Rather than introducing a novel detector architecture, the system is designed to maintain competitive detection performance while substantially reducing the model complexity and supporting practical edge deployment. The overall system workflow is illustrated in
Figure 1. After deploying the model on embedded devices, it achieved the real-time detection of tomato leaf diseases under the imaging conditions used in our benchmark, in a controlled indoor test on separately collected leaf samples, and in a small-scale outdoor field round carried out with the handheld unit in a working tomato greenhouse on an independent site. The field round was kept small on purpose, so we report only the operational evidence from it (whether it runs, the detection rate, the confidence, and the latency); testing how well it generalizes to other sites, seasons, and weather was beyond the scope of this study and is left as future work. The main contributions of this paper are summarized as follows:
We present a system-oriented lightweight adaptation of YOLOv11 that combines the HGNetV2 backbone, the HS-FPN neck with CA, and the MPDIoU loss to achieve a better accuracy–efficiency trade-off for tomato leaf disease detection under edge constraints.
We conduct a structured evaluation including loss function comparison, backbone and neck ablations, brightness sensitivity analysis, and cross-model benchmarking to characterize the trade-offs among accuracy, complexity, robustness, and deployment suitability.
We implement a fully integrated handheld system on the RDK X5 platform using post-training quantization (PTQ) and BPU acceleration, achieving 93.6% with 86.5% recall on the modified PlantVillage benchmark and 90.3% , 89.0% precision, 82.3% recall, and 85.5% F1 on the indoor deployment test set, together with 25.0 ± 0.4 ms end-to-end latency, 40.0 ± 0.6 FPS, and 9.8 ± 0.4 W average system power in indoor deployment testing.
We take the handheld system into a working tomato greenhouse for a small outdoor field round, where it runs end-to-end and produces on-device disease detections under natural lighting and background. This moves the system past a benchmark-only prototype to one that has been shown to work in the field, while leaving wider validation across sites, seasons, and weather as the next step rather than a present claim.
2. Materials and Methods
In this study, following the construction of a comprehensive dataset, a system-oriented deployable lightweight adaptation of YOLOv11 was developed for leaf-level tomato disease detection and named HGS-YOLO. Instead of pursuing raw benchmark gains alone, the model was co-optimized across its backbone, neck, loss function, and deployment pipeline to maintain competitive detection performance under strict edge constraints. To rigorously evaluate this design, a series of comparative experiments were conducted, encompassing analyses of various loss functions, backbone structures, neck configurations, and brightness sensitivity, as well as performance comparisons against classical models. Experimental data were subsequently collected and analyzed to assess the trade-offs, advantages, and limitations of the proposed system. Once the experimental results confirmed that the model met the application requirements for lightweight deployment and practical detection quality, it was deployed on embedded devices and further validated through controlled indoor device tests using separately collected tomato leaf samples. This validation process involved sample collection conducted by the research team in Yuzhong County, Lanzhou City, Gansu Province, China, where authentic tomato leaf disease samples were obtained and then used for indoor deployment evaluation under controlled imaging conditions.
Unless otherwise stated, the benchmark baseline reported in the following tables and figures corresponds to the nano-scale YOLOv11n configuration. Throughout the paper, we write “HS-FPN” when referring to the architecture; the hyphen-free spelling “HSFPN” appears only inside model-variant labels (e.g., YOLOv11n-HSFPN) and inside compact table headers where a hyphen would break column alignment. The two spellings refer to the same structure.
2.1. Data Acquisition
The dataset formed the foundation of the study. In this research, we selected images of eight types of tomato leaf diseases along with healthy tomato leaf photos.
Figure 2 illustrates the eight diseased categories, whereas the healthy class is omitted from the panel for space. The dataset used in this study is a modified version of PlantVillage [
51], consisting of 4323 images and 15,135 annotations. It includes a variety of scenes, such as plain backgrounds, individual plant photographs, and collages of leaves affected by different diseases. The dataset is divided into nine categories: Early Blight, Healthy, Late Blight, Leaf Miner, Leaf Mold, Mosaic Virus, Septoria, Spider Mites, and Yellow Leaf Curl Virus. The final data were divided into training, validation, and test sets at a ratio of 8:1:1.
Unlike the original PlantVillage dataset, which is dominated by single leaves under relatively clean or controlled backgrounds, the modified dataset was reconstructed for object detection and lightweight deployment research. Low-quality images were removed before any annotation work using three concrete screening rules: (i) images in which more than 30% of pixels were saturated above value 240 in the HSV V channel were discarded as overexposed; (ii) images whose variance in the Laplacian was below 80 (on the luminance channel of resized copies) were discarded as blurred; (iii) images in which no lesion region could be identified by the annotator within 10 s of inspection were discarded as having extremely weak lesion visibility. Starting from 5041 candidate images collected from PlantVillage and online supplementary sources, 718 images were removed via these rules (412 via rule (i), 193 via rule (ii), and 113 via rule (iii)), leaving the 4323 images reported above. The remaining images were converted from image-level category labels to YOLO-format object detection annotations. In the final dataset, approximately 2000 images (46%) correspond to plain-background single leaves, approximately 1000 images (23%) correspond to individual plant photographs captured with more complex greenhouse or field backgrounds, and approximately 1323 images (31%) correspond to collage samples. The individual plant photographs introduce more realistic interference factors, such as soil, support structures, and illumination variation, while the collage images explain the relatively high average number of annotations per image.
The collage samples were generated automatically using Python scripts based on OpenCV through mosaic or grid composition rather than fully manual assembly. Each collage follows a regular grid on a canvas: nine source leaves are drawn without replacement from a pool of 2000 plain-background single leaves and segmented individual-plant crops, resized to fit a grid cell, and pasted with a 4-pixel gap so that adjacent bounding boxes never overlap. A cross-category constraint requires that at least five distinct disease classes appear within each collage; we enforced this via rejection sampling, redrawing the nine-source set until the constraint was satisfied. Lesion-rich classes (Early Blight, Late Blight, Septoria) were upweighted at a ratio of 1.5× during sampling to force the detector to confront visually similar pathological textures in the same receptive field. For annotation, the diseased leaf was used as the basic bounding-box unit so that each box preserved sufficient venation and lesion-texture context for detection. Accordingly, the detection task addressed in this study is leaf-level disease object detection—that is, localizing and classifying individual diseased leaves—rather than lesion-level segmentation or pixelwise disease region delineation.
All images were annotated using standard computer vision tools such as LabelImg or Roboflow. The annotation process was conducted independently by 2–3 annotators with relevant professional backgrounds, followed by cross-checking. When disagreements regarding bounding-box boundaries occurred, a senior researcher made the final decision to improve the consistency and quality of the ground truth. To quantify inter-annotator reliability, a shared subset of images was independently relabelled by two annotators; the resulting agreement amounted to Cohen’s
, which corresponds to almost-perfect agreement on the Landis–Koch scale. This figure is reported as a single aggregate value; per-class
values were all above 0.80 except that for Spider Mites, which is consistent with the per-class analysis in
Section 3.1.2. To reduce the risk of information leakage, source image-level separation was enforced during data preparation. Original images used to generate collage samples were assigned to the training set only and were never allowed to appear in the validation or test sets, either as single images or as components of a collage. In addition, near-duplicate images from the same capture session or continuous multi-angle shooting sequence were grouped into the same split, typically the training split, to avoid an overly optimistic evaluation.
In addition to the modified PlantVillage benchmark, a separately collected deployment-test set composed of tomato leaf disease samples from Yuzhong County was prepared for indoor evaluation after deployment. These samples were not used for model training or validation. The manually checked and annotated deployment-test subset contained 192 images and was used to quantitatively assess the deployed system under controlled indoor lighting conditions; the deployment-performance metrics reported in Table 13 in
Section 3.3 were computed on these 192 images.
To verify that the source image-level separation described above was not silently undone by distributional overlap between the splits, we performed a structural similarity index (SSIM) check on 200–300 randomly sampled images per split.
Table 1 reports the mean and standard deviation for four SSIM aggregates: the pairwise SSIM between random non-duplicate training images, the nearest-neighbor SSIM from each validation image to the training pool, the nearest-neighbor SSIM from each test image to the training pool, and the nearest-neighbor SSIM from synthetic (collage) training images to the validation and test pools. The elevation of the val/test-to-train nearest-neighbor SSIM over the train-to-train random SSIM is only 0.07–0.08—smaller than one standard deviation of the random baseline (0.118)—so the validation and test splits are not, on average, closer to the training set than a random training pair would be. The synthetic-collage figure (0.521) is also within the same range, which is consistent with the fact that collage source images never crossed the train/val/test boundary.
2.2. Improved YOLOv11 Method
In this study, HGS-YOLO is developed as a system-oriented deployable lightweight adaptation of the YOLOv11n detector for leaf-level tomato disease detection on edge agricultural devices. It replaces the original backbone network with HGNetV2, integrates the HS-FPN structure and channel attention into the neck network, and adopts the MPDIoU during training. These adjustments reduce the model complexity while preserving the competitive multi-scale feature representation and stable convergence. The contribution of the design lies in the coordinated system-level integration of existing lightweight components and deployment requirements into a single deployable configuration, rather than in proposing fundamentally new detection modules.
The choice of YOLOv11n as the base model is a tooling-and-support decision rather than an accuracy decision. Three considerations drove this choice. First, YOLOv11n is the nano-tier variant of the most recent mainline Ultralytics release at the time of this work, so the same training code, export pipeline, and INT8 PTQ calibration tools could be reused for every variant in our ablation without reimplementing them per architecture. Second, its baseline parameter count (2.6 M) already sits within the order of magnitude that we target for handheld BPU deployment, which means that the lightweighting work described in this paper operates inside the nano tier rather than being derived from a larger model family. Third, the YOLOv8/v11 detection head, loss configuration, and AdamW schedule are the de facto reference on which recent edge agriculture studies (for example, [
46]) are built, which makes a direct cross-study comparison more feasible. We do not claim that YOLOv11n is uniquely suited on accuracy grounds; the multi-seed comparison in
Section 3.2.4 shows that HGS-YOLO is in fact slightly less accurate than the YOLOv11n baseline on the benchmark, and the case for our modifications rests on the deployment-side evidence given in
Section 3.3.
To avoid any ambiguity about what is reused and what is new in this paper,
Table 2 summarizes the provenance of each component. HGS-YOLO does not propose new detection modules; the three algorithmic blocks (HGNetV2 backbone, HS-FPN neck with channel attention, and MPDIoU loss) are adopted as published. The originality lies in the coordinated co-selection of these blocks under a shared deployment constraint, the PTQ calibration procedure used to map the combined model onto INT8, the kernel-level mapping onto the RDK X5 BPU, and the handheld hardware integration validated by our indoor tests.
For clarity of reading, the equations in the remainder of
Section 2 fall into three types: (i) standard-form equations reproduce textbook definitions—IoU, CIoU and MPDIoU losses, precision/recall/F1, mAP, FLOPs, and AdamW—for self-containedness; their implementations coincide with the Ultralytics reference code used in this work; (ii) implementation-accurate equations describe operators that we reuse as published—the Stem module, HG block aggregation, channel attention, and HS-FPN top-down path—and match the corresponding source code; (iii) exposition-level equations describe the numerical form of the INT8 PTQ mapping (Equations (26) and (27) in
Section 2.5) applied before BPU inference; the actual per-layer calibration search, operator fusion, and tensor-layout transformations are executed by the D-Robotics RDK X5 tool chain rather than by us and are not reproduced here.
2.2.1. Backbone Improvement
To reduce the size and computational load of the model, and to improve its performance in real-time detection scenarios, the original backbone network of YOLOv11n was replaced with the lightweight network HGNetV2, proposed by Baidu PaddlePaddle. Additionally, the core module of the original network, named C3K2, was substituted with the HG block, which is the core module of HGNetV2. HGNetV2 is based on VOVNet [
57] as the baseline model, utilizing as many
standard convolutions as possible to maximize the computational density. While HGNetV2 was originally designed for efficient GPU inference, its regular convolution-dominated architecture also maps well to fixed-point accelerators such as the BPU used in this study, because
convolutions are the primary operation type natively supported by most NPU/BPU hardware schedulers. The basic structure of HGS-YOLO is shown in
Figure 3, where the backbone consists of a Stem module and four HG stages.
The Stem module transforms the raw input into an initial feature representation through successive convolution–normalization–activation layers augmented with a residual-style shortcut:
where
denotes a Conv-BN-SiLU block,
denotes a depthwise
convolution,
denotes channelwise concatenation, and
is the stem output fed to the first HG stage. Following the standard YOLO stride-level convention, the subscript of
corresponds to the spatial scale index of the feature map, so the stem produces
and the four successive HG stages produce
,
,
, and
, respectively.
From a functional perspective, the hierarchical feature extraction through the four stages can be written as
where
denotes the
-th HG stage and
is its output feature map. The multi-scale backbone features
are then forwarded to the neck for subsequent cross-scale selection and fusion. The complete architecture of HGS-YOLO is shown in
Figure 3.
The HG block [
52,
53] adopts a hierarchical data processing design, as shown in
Figure 4, enabling the network to capture both low-level and high-level features at multiple abstraction levels. Internally, each HG block applies
K successive
convolution layers to produce intermediate features
and then aggregates them:
where
is the block input and
y is the block output after a pointwise projection. The progressive concatenation ensures that each subsequent layer receives an increasingly rich context while keeping individual convolution kernels small, which is favorable for both parameter efficiency and fixed-point hardware utilization on the target BPU.
Because the detection task is defined at the leaf level (
Section 2.1), the bounding box for each diseased leaf encompasses the entire leaf surface, including both the disease-affected area and the surrounding healthy tissue. The hierarchical feature aggregation of HG blocks helps the model to distinguish visually similar leaf categories, such as Early Blight and Late Blight leaves, whose surface textures share common color and spot patterns, as well as differentiate diseased leaves from complex backgrounds such as soil.
2.2.2. Neck Improvement
In order to further lighten the model and enhance its multi-scale feature representation capabilities, the neck structure of the model adopts the HS-FPN [
54] architecture. HS-FPN consists of two main modules: the feature selection module and the feature fusion module. The structure of HS-FPN can also be observed in
Figure 3.
The channel attention (CA) mechanism in the feature selection module plays a crucial role in leaf-level disease detection. Its structure is shown in
Figure 5. Diseased leaves in images are often affected by resolution and lighting variations and can be confused with soil or other background elements. The CA mechanism highlights important channels, thereby improving the model’s focus on disease-discriminative features, reducing background noise, and increasing its sensitivity to different disease categories. In addition, dimension matching (DM) ensures the effective alignment of features across different scales through global average pooling and global max pooling, allowing the model to detect diseased leaves at various scales. In the feature fusion module, selective feature fusion (SFF) enhances the model’s ability to represent different disease categories (such as those with yellow–brown or black surface patterns) by combining low-level texture features with high-level semantic features. The model adjusts the features through bilinear interpolation or convolution, improving the detection accuracy. The backbone extracts features at various scales; then, through top-down feature fusion achieved by the SFF module, these multi-scale representations are combined to enhance the model’s ability to detect diseased leaves of varying sizes, from small individual leaves to larger leaf clusters in collage images.
For a feature tensor
, the channel descriptors generated by global average pooling and global max pooling can be written as
The corresponding channel attention weights and reweighted features are then expressed as
where
denotes the sigmoid function and ⊙ denotes channelwise reweighting with broadcast over spatial dimensions. After channel selection, the simplified cross-scale fusion process of HS-FPN can be described as
where
denotes dimension matching,
denotes upsampling of the higher-level semantic feature, and
denotes channel concatenation before convolutional fusion. Using the same stride-level notation as in Equation (
2), the top-down SFF path of HS-FPN operates on three adjacent backbone levels
,
, and
and produces two fused outputs:
where
denotes the channel attention-reweighted feature at level
l and
is the corresponding fused output forwarded to the detection head. In the formulation above, both
and
are obtained by fusing adjacent channel attention-reweighted backbone features rather than cascading the fused output of the higher level, which is consistent with the selective feature fusion principle of HS-FPN and ensures that the index set on the right-hand side of Equation (
7) is fully defined by the backbone outputs in Equation (
2).
The same process is drawn explicitly in the neck in
Figure 3 and can be read in three stages. (1) Selection: The three backbone outputs
,
,
(shown as the three feature maps exiting the HGNetV2 stack) each pass through a channel attention block that produces
,
,
, with lesion-relevant channels emphasized and background-like channels suppressed. (2) Dimension matching:
is upsampled to the resolution of
and
is upsampled to the resolution of
(the up arrows in
Figure 3), so the SFF inputs occupy a shared spatial grid at each level. (3) Selective feature fusion: At level 4,
is combined with
to form
; at level 3,
is combined with
to form
; the bottom-up augmentation path then refines
and
before handing them to the detection head. This reading aligns Equations (4)–(7) with the boxes and arrows in
Figure 3, so the reader can trace each symbol back to an element of the neck diagram.
2.2.3. Loss Function Improvement
In HGS-YOLO, the choice of loss function plays a crucial role in accelerating model convergence and improving the overall detection performance. YOLOv11n originally utilized the Complete Intersection over Union (CIoU) loss function. The fundamental idea of CIoU is to measure the distance between the predicted and ground-truth bounding boxes by considering their overlap area (IoU), as well as the center point distance, aspect ratio, and the size of the boxes [
55,
56]. The IoU is defined as follows:
In this equation,
B represents the predicted bounding box,
represents the ground-truth bounding box,
is the intersection area, and
is the union area. The CIoU loss function is calculated as follows:
is the basic IoU loss term, where
and
are the center coordinates of the predicted and ground-truth bounding boxes, respectively.
and
are the width and height of the smallest enclosing rectangle.
is the weight coefficient, and
is the term related to the aspect ratio similarity, both of which are used to improve the convergence speed. The
,
, and
terms are defined as follows:
where
w and
h are the width and height of the predicted bounding box, and
and
are the width and height of the ground-truth bounding box. However, when the predicted bounding box and the ground-truth bounding box have the same aspect ratio and coincide at the center, but their width and height values differ, CIoU loses its effectiveness and degenerates into the standard IoU. To address this issue, this study adopts the MPDIoU loss function. MPDIoU calculates the minimum point distances between the upper-left and lower-right corners of the predicted and ground-truth bounding boxes, more accurately accounting for orientation and positional offsets. It introduces the Euclidean distance to calculate the relative positional relationship between the bounding boxes, thereby adjusting the bounding boxes more precisely during training. The MPDIoU loss is calculated as follows:
where
and
are the squared Euclidean distances between the upper-left and lower-right corners of the predicted and ground-truth bounding boxes, respectively. (
) and (
) are the coordinates of the upper-left and lower-right corners of the predicted bounding box, respectively, while (
) and (
) are the coordinates of the upper-left and lower-right corners of the ground-truth bounding box, respectively. The final MPDIoU loss is calculated as follows:
Note that Equation (
15) defines the MPDIoU similarity metric, which ranges from
to 1 and equals 1 when the predicted and ground-truth boxes coincide perfectly. The actual training loss minimized by the optimizer is
so that the loss is non-negative and approaches zero as the prediction improves. From the above equations, it can be seen that MPDIoU takes into account the positional differences between the bounding boxes through minimum point distances, addressing the degeneracy of CIoU when the predicted and ground-truth boxes share the same aspect ratio and center but differ in scale, thereby supporting more stable localization during training.
To make the degeneracy concrete,
Figure 6 visualizes a controlled toy example: the ground-truth box is fixed at the origin with a unit width and unit height, and the predicted box is placed at the same center with the same aspect ratio but its width and height are both scaled by a factor
s. The prediction is therefore identical to the ground truth only at
; for any other
s, the two boxes differ in scale while keeping exactly the aspect ratio and center, which causes CIoU to collapse to its IoU term. Panel (a) plots
and
against
s, and panel (b) plots the corresponding gradient magnitudes
. Both losses reach zero at
, as expected. Away from
, however, the gradient of
decays noticeably faster than that of
, especially in the severely mismatched regime
and
. In this regime, MPDIoU keeps supplying a larger correction signal, which is exactly the property that the algebraic argument above predicts and is what motivates the substitution in HGS-YOLO.
The total multi-task training loss of HGS-YOLO aggregates the classification, bounding-box regression, and distribution focal losses:
where
is the binary cross-entropy classification loss,
is the distribution focal loss that supervises the discretized bounding-box distribution, and
,
,
are the balancing weights. We inherit the Ultralytics YOLOv11n default values
,
, and
without additional tuning, because varying
between 5.0 and 10.0 on a small held-out subset changed the
by less than 0.3 percentage points in our runs, which is within the run-to-run noise of single-seed training. In HGS-YOLO, the only change relative to the YOLOv11n baseline is the substitution of
with
in the box-regression term; all other loss components and their default weights remain unchanged.
2.3. Model Evaluation
To evaluate HGS-YOLO from both detection quality and deployment efficiency perspectives, this study uses the precision (P), recall (R), F1 score, average precision (AP), mean average precision (mAP), model parameters, GFLOPs, model size, latency, FPS, memory usage, and power consumption. For benchmark validation, AP-based metrics were obtained under the standard detector validation protocol. For deployment inference and visualization, candidate detections were filtered with a confidence threshold of 0.25 and then processed by non-maximum suppression (NMS) with an IoU threshold of 0.70. These two threshold values constitute the default real-time inference configuration of the Ultralytics YOLO tool chain and have been widely adopted as a reproducible baseline for YOLO-family detectors reported in the recent object detection literature [
36,
46]. A confidence threshold of 0.25 retains detections that are sufficiently confident for on-device display while avoiding the excessive false-positive rate produced by a much lower threshold, whereas the NMS IoU threshold of 0.70 is a commonly used trade-off value that suppresses duplicate predictions for the same leaf-level object while still preserving closely positioned but genuinely different diseased leaves. We also verified on a held-out validation subset that changing the confidence threshold within [0.20, 0.30] and the NMS IoU threshold within [0.60, 0.75] caused only minor fluctuations in the benchmark
(below 0.3 percentage points), indicating that the default pair is a reasonable operating point rather than a tuning-sensitive choice. For a prediction of class
i, a detection is counted as a true positive only when the predicted category matches a ground-truth instance and the IoU between the matched boxes is not lower than the evaluation threshold
. Unmatched predictions are counted as false positives, and ground-truth objects that are not matched by any prediction are counted as false negatives. Because this study addresses multi-class object detection rather than binary classification, true negatives are not used as a primary statistic. In the comparative ablation and cross-model tables, the main benchmark metric is mAP@0.5, denoted as mAP50, and it is reported together with the recall, parameters, GFLOPs, and model size to highlight the accuracy–complexity trade-off. Additional HGS-YOLO-specific results, including the per-class AP50, mAP@[0.5:0.95], PTQ sensitivity, and indoor deployment precision/F1, are further reported in Tables 4 and 11–14. The relevant calculations for these metrics are shown as follows:
Here,
denotes the average precision of class
i at IoU threshold
, computed as the area under the precision–recall curve:
where
is the precision at recall level
r for class
i under the IoU threshold
.
denotes the total number of tomato leaf disease categories. In this paper,
corresponds to the mAP50, which is used as the primary benchmark metric in Tables 3, 7, and 8. More comprehensive HGS-YOLO results, including the mAP@[0.5:0.95], per-class AP50, and pre-/post-PTQ accuracy differences, are reported separately because these records were not retained uniformly for every compared baseline.
In addition to detection quality metrics, model complexity is quantified by the total number of learnable parameters and the number of floating-point operations (FLOPs). For a single convolutional layer with kernel size
k, input channels
, output channels
, and output spatial dimensions
, the FLOPs are computed as
The total GFLOPs reported in the comparison tables are obtained by summing the FLOPs of all layers in the network and dividing by
. Frames per second (FPS) and latency are measured as reciprocal quantities under the deployment conditions specified in
Section 2.5.
2.4. Training Platform and Settings
The model training for this experiment was carried out on a workstation equipped with a 12th-Gen Intel Core i9-12900K CPU and an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory and 32 GB RAM under WSL2-Ubuntu 22.04. The development environment used Visual Studio Code 1.95, Python 3.9.21, and PyTorch 2.0.1 with CUDA 11.8. Training was performed on a single GPU with batch size 8 for 300 epochs. The input image size was set to
pixels in RGB JPG format. The AdamW optimizer was adopted with an initial learning rate of
. Unlike the original Adam, AdamW decouples the weight decay term from the gradient-based update:
where
and
are the bias-corrected first and second moment estimates of the gradient,
is a small constant for numerical stability,
is the weight decay coefficient, and
is the learning rate at step
t. Decoupled weight decay improves generalization regularity compared with L2 regularization folded into the gradient, which is beneficial for training lightweight models with limited parameter budgets.
The remaining training hyperparameters follow the Ultralytics YOLO v8/v11 defaults so that our configuration is easy to reproduce. The weight decay coefficient is
. The learning rate follows a linear warm-up over the first 3 epochs from
to
and then decays to
on a cosine schedule until epoch 300. Data augmentation consists of mosaic (probability 1.0, disabled during the last 10 epochs), HSV jitter (H = 0.015, S = 0.7, V = 0.4), random horizontal flip (probability 0.5), random translation (fraction 0.1), and random scale (fraction 0.5); no mixup, copy–paste, or vertical flip was used. Early stopping uses patience of 100 epochs on the validation fitness metric; in practice, no run triggered early stopping before epoch 300. The complete training and deployment configuration, including every value listed above together with the PTQ settings used for BPU deployment, is reported in
Table A1 of
Appendix A so that the pipeline can be reproduced directly from this paper.
2.5. Edge Device
The 3D modeling, physical display, and electrical connections of the entire edge device are shown in
Figure 7. The deployed hardware includes the D-Robotics RDK X5 (D-Robotics, Shenzhen, China), the Hikvision MV-CS016-10UC industrial camera (Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou, China), a DJI TB48s battery (SZ DJI Technology Co., Ltd., Shenzhen, China), a battery holder, and a power conversion board. The system was assembled as a handheld platform for indoor deployment testing. The final deployment mode used INT8 PTQ on the RDK X5 BPU with batch size 1. Unless otherwise stated, the deployment experiments followed the same
input resolution as the benchmark setting. For deployment inference and visualization, the confidence threshold was set to 0.25 and the NMS IoU threshold was set to 0.70. During operation, the handheld device not only displays the current detection boxes in real time but also saves the corresponding detection results to local storage in real time for later review and traceability.
For the PTQ procedure used during deployment conversion, the floating-point tensor value
x can be mapped to an integer value
q through a standard affine quantization rule:
where
s is the scale factor,
z is the zero point,
and
define the target integer range, and
is the dequantized approximation of
x. For symmetric INT8 quantization (
,
,
), the per-tensor scale is determined by the calibration-set statistics:
The worst-case rounding error for any single value is bounded by
. Equations (26) and (27) are exposition-level: they describe the numerical form of the INT8 PTQ mapping that is applied before BPU inference. The calibration-set search, the per-layer scale determination, operator fusion, and layout transformation are carried out automatically by the D-Robotics RDK X5 tool chain using a vendor-supplied calibration set of 300 representative benchmark images, and they are not reproduced in closed form here. The assembled handheld edge deployment platform is shown in
Figure 7.
4. Conclusions
The contribution of this work is a system-level integration study rather than a new detector architecture, and the evidence presented supports a deployment-side argument rather than an accuracy-side argument. Three observations can be derived from the results. First, a three-seed retraining comparison with the YOLOv11n baseline shows that HGS-YOLO is slightly less accurate than the baseline on the modified PlantVillage benchmark ( vs. , percentage points, Welch ), so the case for the proposed design should not be framed as an accuracy gain. The accuracy cost is paid deliberately in exchange for a smaller and cheaper model. Second, on the RDK X5 BPU, every other deployment-side metric moves in HGS-YOLO’s favor: it achieves the lowest end-to-end latency ( ms, 19.6% below YOLOv11n), the highest end-to-end FPS (, 24.2% above YOLOv11n), the smallest peak memory footprint (1.34 GB vs. 1.55 GB), and the lowest average system power (9.8 W vs. 10.2 W) among all six compared detectors, at single-charge endurance of roughly 13 h on a DJI TB48s battery. Third, the dominant cost of moving from the workstation to the handheld device is not quantization: PTQ contributes only a 0.6-percentage-point drop, whereas the change from benchmark imaging to controlled indoor leaf samples contributes a further 2.7 points, so practical deployment efforts should focus on distribution alignment rather than on targeting quantization loss.
These observations are accompanied by several limitations. The evaluation has been carried out on a modified PlantVillage benchmark, on separately collected tomato leaf samples imaged indoors, and on a small-scale outdoor field-validation round (18 context stills and 20 short clips, covering a single day, single site, and single operator), reported in
Section 3.4. The field round shows that the system works under natural greenhouse conditions, but it does not demonstrate an indoor-to-field accuracy drop on an annotated subset, and it does not cover other sites, other seasons, varied weather, or long-term repeated use. The multi-seed comparison covered only HGS-YOLO and the YOLOv11n baseline; the other lightweight variants are still represented by single-seed point estimates, so small differences between them should be read as suggestive. The PTQ pipeline was run on one chip family (RDK X5 BPU) and one quantization policy (symmetric INT8). We therefore position HGS-YOLO as a handheld disease detection prototype that has been validated on the benchmark, indoors, and in a small outdoor field round; whether it generalizes to many sites, seasons, and weather conditions is the open question that the larger campaign described in the Conclusions is intended to resolve.
The natural next step is a larger field campaign that builds on the small round presented in
Section 3.4. We plan to collect data at more sites, in more seasons, and under direct sun, overcast skies, and dawn or dusk light, as well as adding footage taken in rain and strong wind. We also plan to hand-annotate a few hundred video frames so that the mAP, precision, and recall can be tracked numerically across these conditions. On the model side, we will extend the multi-seed comparison from the YOLOv11n baseline to every lightweight variant and broaden the PTQ study to mixed-precision and per-channel quantization on at least one more edge chip. We will report these results separately once measured; we list them here so that the reader understands what is required to transform the findings of the present small field round into a full field deployment claim.