1. Introduction
Tomatoes are an important component of the global vegetable trade, and their production has become a leading industry for increasing farmers' income and improving agricultural efficiency [1]. However, various diseases, such as bacterial spot, early blight, late blight, leaf mold, septoria leaf spot, spider mites, target spot, yellow leaf curl virus, and mosaic virus [2], can occur during tomato production, threatening high and stable yields. Therefore, the accurate recognition of tomato leaf diseases is crucial for ensuring high and stable yields and reducing economic losses.
The traditional diagnosis of tomato leaf disease relies on visual inspection, which cannot meet the demands of modern agricultural development [3]. With the advancement of image-processing technology, deep learning has gradually been applied in agricultural fields [4,5]. Currently, object recognition algorithms can be categorized into two types: two-stage and one-stage. The first category, two-stage algorithms, generates candidate regions through convolutional neural networks and then performs classification and regression. This approach offers high accuracy and scalability, but it is unsuitable for real-time recognition applications. Meng et al. [6] introduced the CA attention mechanism into the EfficientNet v2 model and applied transfer learning based on ImageNet, achieving 96.8% accuracy in mushroom disease classification. However, the model still carries a large computational load and has not been optimized for small-target diseases. Zhang et al. [7] combined the CBAM attention mechanism and multi-scale convolution in the Inception v3 model, using transfer learning to optimize the training process, and achieved an accuracy of 98.4%. However, its large number of parameters limits its application in resource-constrained environments. These two-stage object recognition algorithms suffer from slow detection speeds and cannot guarantee real-time performance.
The second category, one-stage algorithms, uses a single CNN to predict both the class and spatial location of the target, balancing accuracy and speed. Lv et al. [8] enhanced the YOLOv5 model by integrating attention mechanisms and a Transformer encoder, improving its ability to recognize apple leaf diseases and, in particular, to distinguish the visually similar Alternaria spot and gray spot disease. Yang et al. [9] proposed a corn leaf disease recognition method based on YOLOv8 by constructing a real-image dataset and incorporating the GAM attention module, achieving accurate detection in complex backgrounds and addressing the challenges posed by field environments. Abulizi et al. [10], based on YOLOv9, introduced a lightweight dynamic upsampling mechanism, DySample, to improve the detection of small lesions, achieving over a 2% improvement across several metrics. Yan Chenhuizi et al. [11] replaced the backbone network of YOLOv4 with MobileNetv3 and added a coordinate attention module, achieving a good balance between accuracy and efficiency while keeping the parameter count to 39.35 M. Zhou Wei et al. [12] replaced the CSPDarkNet-53 backbone of YOLOv4 with GhostNet and improved PANet by incorporating depthwise separable convolutions, achieving an accuracy of 79.36% in rice disease identification, although there is still room for improvement. These studies propose various optimizations for the object recognition problem and improve recognition capabilities over traditional approaches; however, their detection accuracy still requires further validation.
To address the issues of subtle differences among various tomato leaf diseases, complex and irregular shapes, and low detection accuracy in existing recognition methods, this paper proposes the SXA-YOLO model (an improvement based on YOLO, where S stands for the SAAPAN architecture, X represents the XIoU loss function, and A denotes the AsDDet module) as a symmetry-aware recognition model. This model establishes a bidirectional balanced feature interaction and a dual-cooperative task processing mechanism. The model adopts a classic three-stage design consisting of Backbone, Neck, and Head, utilizing modular components to achieve efficient multi-scale object recognition, effectively meeting the requirements of agricultural monitoring. The main contributions of this study are as follows.
By synergistically integrating the C3k2 module (Cross-stage Partial Concatenated Bottleneck Convolution with Dual-kernel Design) and the SPPF (Spatial Pyramid Pooling-Fast) module, we construct the Backbone network to enable efficient pyramid pooling, thereby establishing a structured feature foundation for subsequent symmetric feature fusion.
The Neck network innovatively integrates a bidirectional feature pyramid, proposing the SAAPAN architecture (Symmetry-Aware Adaptive Path Aggregation Architecture), which facilitates the formation of symmetric feature flows in both top-down and bottom-up directions.
The head network incorporates the AsDDet module (Adaptive Symmetry-aware Decoupled Detection Head) in conjunction with the XIoU loss function (Extended Intersection over Union), enhancing the geometric alignment accuracy between predicted bounding boxes and ground truth targets. By adopting a decoupled head structure, we maintain functional symmetry while improving recognition performance.
The structure of this paper is arranged as follows: Section 3 presents the overall architecture of the model and details the design of each module. Section 4 describes the dataset and experimental setup. Section 5 covers comparative experiments, result visualization analysis, and ablation studies. Finally, Section 6 summarizes the paper and outlines future research directions.
2. Related Work
In recent years, leaf disease recognition has been extensively studied in precision agriculture and crop management. Because the YOLO framework balances accuracy and speed, it is highly suitable for real-time agricultural applications. Several studies have used YOLO models for leaf disease recognition, proving their effectiveness in identifying different crop diseases. This section reviews existing work from three perspectives: the application of general models, lightweight improvements for agricultural scenarios, and structural optimization strategies. Based on this review, we clarify the starting point of this paper.
2.1. General Agricultural Object Recognition Based on the YOLO Framework
Early studies primarily validated and compared the applicability of different YOLO variants in agricultural recognition tasks. For instance, Koirala et al. [13] first applied the fusion of features from YOLOv3 and YOLOv2-Tiny for mango recognition, achieving a 98.3% mAP50 and laying the foundation for subsequent fruit recognition tasks. Sozzi et al. [14] systematically compared six versions of YOLO, ultimately identifying YOLOv4-Tiny as the optimal model for balancing speed and accuracy in grape bunch recognition. Aldakheel et al. [15] trained the YOLOv4 model on a dataset of 14 plant leaf diseases for disease classification, with results demonstrating strong performance. These studies established an empirical foundation for applying YOLO in the agricultural domain. However, they primarily focus on a single crop or specific pests and diseases, so the trained models have limited generalization capability and are difficult to transfer directly to similar tasks on other crops.
2.2. Lightweight and Customized Improvements for Agricultural Scenarios
To accommodate mobile or edge computing devices, a series of studies has focused on model lightweighting. Zhang et al. [16] integrated MobileNetV2 with depthwise separable convolutions, Cui et al. [17] used an enhanced ShuffleNet with a simplified detection head, and Li et al. [18] introduced Apple-CSP and attention mechanisms, reducing computational complexity. At the same time, customized architectures tailored for specific crops have also been proposed. Zhang et al. [19] designed "OrangeYolo" for citrus recognition, improving the mAP50 to 95.7%; Wang et al. [20] combined ShuffleNet v2 with YOLOv5 for lychee recognition, achieving a 93.9% mAP50; Cao et al. [21] introduced an attention mechanism to build "YOLOv4-LightC-CBAM," reaching a 95.39% mAP50 in mango recognition; Chandana et al. [22] proposed the lightweight MANGO YOLO5, improving upon YOLOv5 by 3%; Abulizi et al. [10] proposed an improved model called DM-YOLO for detecting tomato leaf diseases in natural environments, which enhances the extraction of small disease features, the localization of overlapping lesion edges, and the suppression of interference from complex backgrounds; Quach et al. [23] developed a tomato fruit health recognition system with real-time tracking and counting capabilities using the YOLOv8-Grad-CAM++ model, further improving detection accuracy; and Lai et al. [24] applied YOLOv4 to oil palm fresh fruit bunch recognition, expanding the application range of these models. These studies highlight the importance of domain adaptation, but they primarily focus on individual crops or the replacement of specific lightweight components. Many existing models enhance small-target features by integrating attention mechanisms such as CBAM, CA, and GAM. However, frequent upsampling operations and complex attention computations, while emphasizing key locations, distort the original detail information of other areas. In multi-scale processing, improvements to FPN/PAN structures have been made, yet shallow detail information still struggles to influence deeper decisions. When addressing complex backgrounds, more intricate classification heads or refined loss functions often lead to the backbone features shared within coupled detection heads being dominated by a single task, leaving insufficient signal for the other task and making it difficult to accurately delineate irregular disease boundaries.
2.3. Structural Optimization for Complex Scenarios and Performance Enhancement
More recent research has focused on enhancing model performance in complex agricultural environments through more refined network architecture design. Hu et al. [25] integrated ECA and CA attention mechanisms to develop a multi-module YOLOv7-L, while Ren et al. [26] proposed a lightweight LFEBNet backbone and a CPEA attention module, creating a more efficient YOLO-Lite network. Wu et al. [27] developed the YOLOv11-PGC model, specifically designed for tomato ripeness recognition, by integrating a polarization state space strategy with a global context module, thereby extending the reach of precision agriculture recognition. Feng et al. [28] introduced the IFEM module and a degradation-aware loss function to construct the underwater recognition model IFEM-YOLOv13, which balances optical compensation and symmetry recovery, effectively enhancing object recognition in complex media. These studies not only improve accuracy but also pursue model lightweighting and structural optimization, providing important technical groundwork for real-time disease recognition in complex agricultural scenarios.
2.4. Research Gaps and the Entry Point of This Study
Despite advancements in model accuracy and structural optimization in the aforementioned studies, the detection of tomato leaf diseases continues to face complex challenges, including small targets, variable shapes, and complicated backgrounds. Existing advanced improvements often employ a modular stacking strategy, such as integrating attention mechanisms or enhancing specific pathways. However, these strategies can create significant fragmentation when addressing complex challenges; for example, attention mechanisms may amplify background noise and dilute details of small targets, while asymmetric feature pyramids can prevent shallow detail information from influencing deeper decisions. Enhancing a single network layer may alleviate one issue but can exacerbate another. Therefore, conventional modular enhancements are not suitable for tackling these complex challenges.
Based on the aforementioned research foundations and technical context, this study proposes the SXA-YOLO symmetry-aware recognition model, aimed at improving disease detection accuracy in complex scenarios. In contrast to previous work, this paper does not introduce isolated new modules; instead, it takes "symmetry" as a design principle that permeates the entire process of feature extraction, fusion, and decision-making, aiming to achieve equitable interaction of multi-scale features and collaborative optimization of the classification and regression tasks.
The tomato leaf disease dataset used in this paper, which includes the concentric rings of early blight and the amorphous lesions of late blight, often features lesions of various sizes coexisting within a single image, with a distribution that emphasizes both local and global features. A unidirectional Feature Pyramid Network (FPN) tends to lose detail when transmitting feature information, whereas the proposed bidirectional symmetric feature flow ensures that subtle disease details and leaf-wide contextual semantics are fused on an equal footing, improving detection accuracy for targets of varying sizes. Additionally, distinguishing similar disease features, such as those of spider mites and leaf mold, requires a balanced perception of multiple attributes, including color, texture, and edges. Traditional coupled heads may lean towards localization during optimization, while the functionally symmetric decoupled head proposed in this paper, through its dual-branch design, forces the model to learn the geometric features needed for precise localization in a parallel and balanced manner, thus enhancing recognition capabilities. Table 1 compares this study with representative state-of-the-art works.
3. SXA-YOLO Symmetry-Aware Recognition Model
The background of tomato leaf disease images is complex, and the differences in appearance among disease types are subtle. Moreover, the same disease may vary in shape, position, and size, leading to false and missed detections. To address the insufficient extraction of key features from disease images against natural backgrounds, this study proposes the SXA-YOLO symmetry-aware recognition model. The model consists of a backbone network, a neck network, and a head network. The input data are processed by convolution operations in the backbone network to extract features, which are then fused at multiple levels in the neck network and passed into the head network to generate the classification and location information of the detected objects. The structure of the model is shown in Figure 1.
The backbone is based on the YOLOv11 framework and consists of five convolutional layers, four C3k2 (Cross-stage Partial Concatenated Bottleneck Convolution with Dual-kernel Design) modules, and one SPPF (Spatial Pyramid Pooling-Fast) module. It progressively extracts features from the input image, from low level to high level, gradually reducing the spatial resolution while increasing the number of channels to obtain more abstract and semantically rich representations, all while maintaining the symmetry of the feature hierarchy and structure. The C3k2 module is a residual structure with a multi-path parallel architecture, allowing it to retain shallow detail features, such as the fine punctate textures of early-stage diseases, while extracting deep semantic information from the extensive necrotic areas of late-stage diseases. The SPPF module is a symmetric spatial pyramid structure built from concatenated multi-scale pooling operations, which synchronously extracts contextual information from multiple receptive fields without altering the size of the feature maps. This enables effective fusion of cross-scale representations, such as the mold-layer features on the underside of leaf mold and the overall wrinkled morphology of yellow leaf curl virus disease. The C2PSA module has been removed because its spatial attention mechanism dynamically enhances certain spatial locations within the feature map, which leads to excessive suppression of features in other areas during the early stages of the backbone network.
The input image first passes through two 3 × 3 convolutions with a stride of two, reducing the resolution to 1/4 and expanding the channel count to 128. Subsequently, two sets of C3k2 modules, combined with regular convolutional branches, enhance feature diversity, with the resolution gradually reduced to 1/8 and 1/16. In the deeper stages of the network, C3k2 employs a bottleneck structure to strengthen semantic extraction and, with convolutional downsampling, compresses the resolution to 1/32 while increasing the channel count to 1024. Finally, the SPPF module aggregates context information through multiscale pooling, outputting a high-dimensional 20 × 20 × 1024 feature pyramid and achieving a hierarchical representation from low-level details to high-level semantics. This provides a rich and structurally symmetric multi-scale feature foundation for the subsequent detection head.
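The efficiency of SPPF's concatenated pooling rests on a simple identity: chaining small stride-1 max pools reproduces the larger receptive fields of classic SPP. The sketch below (our illustration, not the paper's code) demonstrates this in 1-D, assuming the usual 5 × 5 kernels of YOLO-style SPPF; two chained 5-pools equal one 9-pool and three equal one 13-pool.

```python
# Illustrative sketch: why SPPF's cascaded pooling covers the 5/9/13
# receptive fields of classic SPP at lower cost. A 1-D stride-1 max pool
# with "same" padding stands in for the 2-D case.

def maxpool1d(x, k):
    """Stride-1 max pool with same padding; k is assumed odd."""
    pad = k // 2
    padded = [float("-inf")] * pad + list(x) + [float("-inf")] * pad
    return [max(padded[i:i + k]) for i in range(len(x))]

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

once = maxpool1d(x, 5)
twice = maxpool1d(once, 5)      # receptive field 5 + 5 - 1 = 9
thrice = maxpool1d(twice, 5)    # receptive field 9 + 5 - 1 = 13

assert twice == maxpool1d(x, 9)
assert thrice == maxpool1d(x, 13)
```

Because each pooling stage reuses the previous result, SPPF computes three effective scales with three cheap k = 5 operations instead of separate k = 5, 9, 13 pools.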
In the Neck, features extracted from different layers of the backbone are integrated to enhance the model's ability to detect targets of varying sizes. Building on the top-down (FPN) and bottom-up (PAN) pathways, the SAAPAN architecture (Symmetry-Aware Adaptive Path Aggregation Architecture) is proposed to improve computational accuracy and feature representation capability. Here, P1–P5 are the feature maps output from different layers of the backbone network, while N1–N3 are the feature maps used for prediction in the detection head. First, the P4 feature output from the backbone is downsampled through SimConvWrapper (Simplified Convolution Wrapper) and concatenated with the P5 feature; this is then processed by the RepHDW (Re-parameterizable Hybrid Dilated Convolution Wrapper) reparameterization module, yielding the N3 feature. Subsequently, this feature is upsampled, fused with the P4 feature, and passed through the C2f module to generate the N2 feature. After another upsampling, it is concatenated with the original P3 feature to obtain the N1 feature. The entire network employs a bidirectional feature pyramid structure, alternating between the RepHDW and C2f modules within the PAN-FPN pathway to achieve symmetric and efficient feature fusion. The RepHDW module enhances feature representation during training through a multi-branch structure, which can be merged into a single branch for faster computation during inference, whereas the C2f module achieves lightweight feature interaction through cross-stage partial connections. Finally, three feature maps of different scales are output: N1 (80 × 80), N2 (40 × 40), and N3 (20 × 20), used to detect small-scale lesions (such as the tiny spots of septoria leaf spot), medium-scale lesions (such as the circular lesions of target spot), and large-scale lesions (such as the concentric rings of early blight), reflecting the structural symmetry of multi-scale feature interactions.
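The fusion order above can be sanity-checked with a minimal shape trace. This sketch is ours, not the paper's code: only the 80/40/20 spatial sizes come from the text, while the stride-2 downsampling and 2× upsampling conventions are standard YOLO-style assumptions.

```python
# Hedged shape trace of the SAAPAN fusion order for a 640 x 640 input.
# Only spatial sizes are tracked; channel counts are omitted.

def down(hw):   # stride-2 convolution halves the spatial size
    return hw // 2

def up(hw):     # 2x upsampling doubles it
    return hw * 2

P3, P4, P5 = 80, 40, 20          # backbone outputs at strides 8/16/32

n3 = down(P4)                    # SimConvWrapper downsamples P4 to 20 x 20
assert n3 == P5                  # ...so it can be concatenated with P5
N3 = n3                          # RepHDW keeps the spatial size: N3 is 20 x 20

n2 = up(N3)                      # upsample toward the stride-16 level
assert n2 == P4                  # fuse at the P4 scale, then C2f: N2 is 40 x 40
N2 = n2

n1 = up(N2)                      # upsample toward the stride-8 level
assert n1 == P3                  # concatenate with P3: N1 is 80 x 80
N1 = n1

assert (N1, N2, N3) == (80, 40, 20)
```

The assertions confirm that each concatenation in the described path joins feature maps of matching resolution, which is what makes the bidirectional flow "symmetric" in scale.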
The Head adopts a decoupled structure, where the bounding box regression, confidence prediction, and classification tasks are handled separately, enhancing detection accuracy through functional symmetry. It receives N1, N2, and N3 from the Neck, each handling a different scale of the recognition task. An innovative XIoU (Extended Intersection over Union) loss function is proposed to optimize bounding box regression by considering factors such as center distance, aspect ratio, and IoU. Compared to traditional IoU, it achieves more precise localization, particularly improving the accuracy of locating small targets such as the speckled leaf discoloration caused by spider mites. To enhance small object recognition, the AsDDet (Adaptive Symmetry-aware Decoupled Detection Head) module is introduced, which improves the detection head by increasing network depth, incorporating dynamic convolutions, and using attention mechanisms to further strengthen feature representation. By integrating dynamic label assignment strategies with the collaborative optimization of the XIoU loss function, the geometric alignment between predicted boxes and ground-truth targets is improved. While maintaining the advantages of functional symmetry, this approach also achieves breakthroughs in recognizing subtle features, such as the complex mosaic symptoms of the mosaic virus and the halo edges of bacterial spots.
3.1. SAAPAN
The SAAPAN architecture in the neck network consists of three modules that work collaboratively together with the C2f module, clearly demonstrating a symmetrical design in terms of functionality. The SimConvWrapper serves as the basic feature extraction unit, capturing shallow features through a Conv-BN-ReLU structure. The ConvWrapper performs high-level feature transformation using a Conv-BN-SiLU structure. The RepHDW module utilizes reparameterizable depthwise separable convolutions to achieve multi-scale feature fusion, supporting dynamic depth expansion and mixed kernel sizes. The modules complement one another structurally, creating a clear and bidirectionally symmetric feature processing workflow. Figure 2 illustrates the flow of feature map changes, using distinct color mappings for different layers to aid visual differentiation and annotating the input and output sizes at each layer. A detailed description of this process follows.
The SimConvWrapper is a lightweight wrapper that encapsulates the SimConv module, combining simplified convolution with ReLU activation, as shown in Figure 3. In Figure 3a, the module employs a 3 × 3 convolution kernel (with default stride = 1 and padding = 1) to maintain the feature map resolution, accelerates training convergence via batch normalization, and enhances nonlinear expression capability with the ReLU activation function. As the foundational feature extraction unit within the SAAPAN architecture, this module offers advantages in computational efficiency and parameter reduction, providing a stable underlying representation for subsequent symmetric feature fusion. It is particularly beneficial for preserving local detail features, such as the clear halo edges of bacterial spots and the small circular lesions of septoria leaf spot.
The ConvWrapper, which is based on standard convolution with the SiLU activation function, provides stronger feature representation capability, as shown in Figure 3b. This module uses a 3 × 3 convolution kernel by default, with automatic padding to maintain the feature map size, stabilizes the data distribution through batch normalization, and employs the SiLU activation function for a smoother nonlinear feature transformation. It supports grouped convolutions and customizable strides, primarily performing high-level feature transformation and enhancement within the SAAPAN architecture. It effectively extracts complex semantic information, such as the irregular diffusion areas of late blight and the texture of the fungal layer on the underside of leaf mold, thereby forming a hierarchical symmetry in both structure and function with the SimConvWrapper.
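The functional difference between the two wrappers comes down to their activations. The following minimal sketch (ours, not the paper's code) contrasts ReLU's hard gating in SimConvWrapper with SiLU's smooth gating in ConvWrapper.

```python
# Minimal activation sketch: SimConvWrapper uses ReLU (hard zeroing of
# negatives), ConvWrapper uses SiLU, i.e. x * sigmoid(x) (smooth gating).
import math

def relu(x):
    return max(0.0, x)

def silu(x):
    return x / (1.0 + math.exp(-x))   # equivalent to x * sigmoid(x)

# ReLU discards all negative responses; SiLU passes small negative values
# with attenuated weight, which yields smoother gradients near zero.
assert relu(-1.0) == 0.0
assert -0.5 < silu(-1.0) < 0.0

# For large positive inputs SiLU approaches the identity, like ReLU.
assert abs(silu(5.0) - 5.0) < 0.05
```

The smoother transition around zero is why the higher-level ConvWrapper favors SiLU, while the cheaper ReLU suffices for the shallow SimConvWrapper path.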
RepHDW is a reparameterizable depthwise separable convolution module within the SAAPAN architecture, designed as a three-stage symmetrical process of "expansion-fusion-compression," as shown in Figure 4. First, the module expands the channel count of the input feature map to twice its original size using 1 × 1 convolutions, establishing a closed-loop symmetry in both structure and computation. The feature X is then split into two symmetric branches, X1 and X2. This symmetric splitting mechanism enables the parallel processing of differentiated morphological features, such as the concentric ring structures of early blight and the scattered yellowing patches caused by the mosaic virus. Depthwise separable convolutions are then applied to each branch for multi-scale feature extraction. Finally, the features from the branches are concatenated and compressed through a 1 × 1 convolution for output. This design maintains strict symmetry in the splitting and fusion of features. The module innovatively combines reparameterization, using multiple branches during training and a single merged branch during inference, with a dynamic channel adjustment mechanism to improve accuracy while reducing the computational load. It thereby also reflects the functional symmetry of the network architecture across the training and inference phases.
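The train-time/inference-time equivalence that reparameterization relies on can be shown with a toy example. This 1-D sketch is our illustration of the general idea, not RepHDW itself: a 3-tap branch, a 1-tap branch, and an identity shortcut used during training collapse into one 3-tap kernel for inference, with numerically identical outputs.

```python
# Toy 1-D illustration of structural reparameterization: a multi-branch
# block (3-tap conv + 1-tap conv + identity) folds into a single kernel.

def conv1d(x, w):
    """Stride-1 convolution with zero 'same' padding (odd kernel length)."""
    pad = len(w) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[j] * xp[i + j] for j in range(len(w))) for i in range(len(x))]

x = [1.0, -2.0, 3.0, 0.5, -1.0]
w3 = [0.2, 0.5, -0.1]     # 3-tap training branch
w1 = [0.7]                # 1-tap training branch

# Training-time block: run all branches, then sum their outputs.
multi_branch = [a + b + c for a, b, c in
                zip(conv1d(x, w3), conv1d(x, w1), x)]

# Inference-time block: fold the 1-tap branch and the identity (weight 1.0)
# into the centre tap of a single 3-tap kernel.
w_merged = [w3[0], w3[1] + w1[0] + 1.0, w3[2]]
single_branch = conv1d(x, w_merged)

# The merged kernel reproduces the multi-branch output exactly.
assert all(abs(a - b) < 1e-9 for a, b in zip(multi_branch, single_branch))
```

Because the merge is exact, RepHDW-style modules pay the representational cost of multiple branches only at training time, while inference runs a single fused convolution.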
3.2. XIoU
XIoU is an extended variant of IoU that introduces a penalty term for the aspect ratio difference in bounding boxes based on the DIoU. The workflow first computes the basic IoU as shown in (1).
In this context, Inter represents the intersection area (overlapping region) between the predicted and ground-truth bounding boxes, and Union represents the union area (total coverage area) between the predicted and ground-truth bounding boxes. IoU measures the degree of overlap between two boxes, where a higher value indicates a better match.
Subsequently, the center distance penalty (similar to DIoU) is combined with a newly added aspect ratio penalty term, in which the aspect ratio is transformed by a sigmoid function and the squared difference is then calculated, as shown in (2)–(5).
Here, (x_gt, y_gt) represents the center coordinates of the ground-truth box, whereas (x_pred, y_pred) denotes the center coordinates of the predicted box. ρ² indicates the squared Euclidean distance between the centers of the two boxes (normalized by dividing by 4 to align with the unit of the side length). c_w and c_h represent the width and height of the minimum enclosing box, and c² is its squared diagonal length. q refers to the sigmoid transformation applied to the aspect ratio, compressing the value into the range (1, 2) to prevent numerical instability at extreme aspect ratios; q_pred and q_gt are its values for the predicted and ground-truth boxes, and v is the squared difference between them.
Finally, the contributions of the two penalty terms are balanced through an adaptive weight, and the output is XIoU, as shown in (6) and (7), where α denotes the adaptive weight, set to 0.8 in this work. Compared to CIoU, which uses arctan to measure the angle difference, XIoU employs a simpler sigmoid function to handle the aspect ratio, maintaining sensitivity to the center point distance while penalizing shape differences more smoothly. The penalty strength is controlled through adjustable parameters, providing more robust guidance for box regression. Specifically, the XIoU loss function enhances the model's localization accuracy for characteristically shaped lesions, such as the circular halo of bacterial spots and the annular lesions of target spot, by balancing the penalties for center point distance and shape difference through the adaptive weight. In addition, its smooth shape penalty mechanism effectively stabilizes bounding box regression for morphologically variable targets, such as the overall leaf curling caused by the yellow leaf curl virus and the irregular expansion areas associated with late blight.
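The description above can be sketched in code. Note the hedging: the exact formulas are given by the paper's Equations (1)–(7); here the specific mapping q = 1 + sigmoid(w / h) for the (1, 2) range, the helper names, and the corner-coordinate box convention are our assumptions, with the adaptive weight α fixed at 0.8 as stated.

```python
# Hedged sketch of XIoU as described in the text: IoU minus a DIoU-style
# center-distance penalty minus an alpha-weighted sigmoid aspect-ratio
# penalty. The q mapping and helper names are assumptions, not the paper's code.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xiou(pred, gt, alpha=0.8):
    """Boxes given as (x1, y1, x2, y2) corner coordinates."""
    # (1) Basic IoU: intersection area over union area.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)

    # Center-distance penalty, normalized by the squared diagonal of the
    # minimum enclosing box (rho^2 divided by 4 as described in the text).
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = ((cx_g - cx_p) ** 2 + (cy_g - cy_p) ** 2) / 4
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    # Smooth aspect-ratio penalty via sigmoid instead of CIoU's arctan.
    q_pred = 1.0 + sigmoid((pred[2] - pred[0]) / (pred[3] - pred[1]))
    q_gt = 1.0 + sigmoid((gt[2] - gt[0]) / (gt[3] - gt[1]))
    v = (q_pred - q_gt) ** 2

    return iou - rho2 / c2 - alpha * v

# Identical boxes score 1; a shifted box is penalized below 1.
box = (10.0, 10.0, 50.0, 30.0)
assert abs(xiou(box, box) - 1.0) < 1e-9
assert xiou((12.0, 10.0, 52.0, 30.0), box) < 1.0
```

Used as a loss, this would typically be minimized as 1 − XIoU, so perfectly aligned boxes contribute zero loss.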
3.3. AsDDet
AsDDet employs a parallel symmetric dual-branch structure that handles the regression and classification tasks simultaneously. The regression branch (REG) enhances feature interaction through standard convolution and channel shuffling operations and outputs the bounding box coordinate distribution, specifically optimizing for geometric features that require accurate localization, such as the clear edges of bacterial spots and the circular contours of target spots. The classification branch (CLS) employs depthwise separable convolutions to reduce computational burden while outputting class probabilities, focusing on distinguishing the textural differences between early blight and late blight, as well as the color and pattern features associated with leaf mold and mosaic virus. The two branches are structurally symmetric, functioning independently while working collaboratively, reflecting the functional symmetry of the detection head. During the training phase, the AsDDet module employs a dynamic optimization strategy to set key parameters. For positive sample matching, a dynamic anchor-to-ground-truth intersection over union (IoU) threshold of 0.5 is established, with anchors above this threshold treated as positive samples. To make learning more adaptive, the module introduces a hard-negative-ignoring mechanism, setting the ignore band to (0.4, 0.5); anchors within this range are temporarily ignored during training to stabilize gradient updates. In the inference phase, the module adaptively allocates computation based on classification confidence. A dynamic allocation threshold of 0.7 is set: enhanced convolutions are used for feature extraction in prediction regions with confidence above this threshold, while lightweight convolutions are applied to regions with lower confidence, achieving an efficient allocation of computational resources.
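The thresholding logic above reduces to two small decision rules, sketched here with the 0.5 / (0.4, 0.5) / 0.7 values from the text; the function names and the treatment of the exact boundary values are our assumptions.

```python
# Sketch of AsDDet's threshold rules (names and boundary handling assumed).

def assign_label(iou, pos_thr=0.5, ignore_band=(0.4, 0.5)):
    """Training-time label assignment by anchor-to-ground-truth IoU."""
    if iou >= pos_thr:
        return "positive"
    if ignore_band[0] < iou < ignore_band[1]:
        return "ignore"          # hard negatives near the boundary are skipped
    return "negative"

def choose_branch(confidence, alloc_thr=0.7):
    """Inference-time compute allocation by classification confidence."""
    return "enhanced_conv" if confidence > alloc_thr else "lightweight_conv"

assert assign_label(0.72) == "positive"
assert assign_label(0.45) == "ignore"     # inside the (0.4, 0.5) ignore band
assert assign_label(0.20) == "negative"
assert choose_branch(0.90) == "enhanced_conv"
assert choose_branch(0.30) == "lightweight_conv"
```

The ignore band keeps ambiguous anchors out of the loss, so gradients are driven only by clearly positive and clearly negative samples.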
First, the input multiscale feature maps are fed into both REG and CLS for parallel processing. REG extracts positional features through two 3 × 3 convolutions and outputs the bounding box distribution, whereas CLS extracts semantic features through depthwise separable convolution followed by conventional convolution and outputs class predictions. The processing flows of the two branches form a structural mirror symmetry, with each branch focusing on one of the complementary tasks of geometric localization and semantic classification. During training, the concatenated raw predictions are output directly for loss computation. During inference, the regression outputs are decoded into actual coordinates through DFL and concatenated with the classification scores to produce the final detection results, with post-processing steps that include dynamic anchor generation, conversion to bounding boxes (dist2bbox), and class score normalization. The entire process follows a "feature extraction, two-branch symmetric prediction, differentiated output" architecture, as shown in Figure 5, achieving efficient object recognition.
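The dist2bbox step can be sketched for a single anchor point. This mirrors the standard YOLO-style decode, where the regression branch predicts distances to the four box sides; the signature below is our assumption, not the paper's exact implementation.

```python
# Hedged sketch of the dist2bbox post-processing step: the regression branch
# predicts distances (left, top, right, bottom) from an anchor point, which
# are converted into corner coordinates.

def dist2bbox(anchor, ltrb):
    """Decode (l, t, r, b) side distances around an anchor into (x1, y1, x2, y2)."""
    ax, ay = anchor
    l, t, r, b = ltrb
    return (ax - l, ay - t, ax + r, ay + b)

# An anchor at (100, 100) with side distances (10, 20, 30, 40) decodes to a
# box spanning (90, 80) to (130, 140).
assert dist2bbox((100.0, 100.0), (10.0, 20.0, 30.0, 40.0)) == (90.0, 80.0, 130.0, 140.0)
```

Because the distances are non-negative by construction, the anchor point always lies inside the decoded box, which keeps the regression target well behaved.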
6. Conclusions
To address the challenges of multiple categories, subtle differences, and complex irregular shapes in tomato leaf disease recognition, the SXA-YOLO symmetry-aware recognition model was proposed. The model achieves equitable fusion of multi-scale features and collaborative optimization of the classification and regression tasks by constructing a backbone network with a symmetrical feature hierarchy, integrating a neck network with symmetrical fusion paths, and employing a detection head with task-functional symmetry. Tested on the PlantifyDr dataset, the model recognizes ten types of tomato leaf diseases, achieving mAP50 and mAP50-95 scores of 0.993 and 0.932, respectively, and outperforming nine other object recognition models, including those in the YOLO series, CenterNet, EfficientDet, DETR, and EfficientNet. The model effectively reduces false negatives and false positives in tomato leaf disease recognition, demonstrating its advantage in handling complex backgrounds and multiple diseases. However, training the model requires substantial computational power, which places high demands on hardware, and its multi-object recognition capability in dense scenes still requires improvement. Future work will therefore focus on further optimizing the model structure to reduce its size, enabling real-time inference on embedded platforms (such as the Jetson series and mobile devices); constructing a cross-season, cross-region dataset that covers a broader range of disease types and growth stages to enhance the model's adaptability to complex environmental changes; and exploring applications in multi-target recognition to facilitate the translation of research findings into practical use.