1. Introduction
Tomatoes are an important component of the global vegetable trade, and their production has become a leading industry for increasing farmers' income and improving agricultural efficiency [1]. However, various diseases, such as bacterial spot, early blight, late blight, leaf mold, septoria leaf spot, spider mites, target spot, yellow leaf curl virus, and mosaic virus [2], can occur during tomato production, threatening high and stable yields. Therefore, the accurate recognition of tomato leaf diseases is crucial for ensuring high and stable yields and reducing economic losses.
The traditional diagnosis of tomato leaf disease relies on visual inspection, which cannot meet the demands of modern agricultural development [3]. With the advancement of image-processing technology, deep learning has gradually been applied in agricultural fields [4,5]. Currently, object recognition algorithms can be categorized into two types: two-stage and one-stage. The first category, two-stage algorithms, generates candidate regions through convolutional neural networks and then performs classification and regression. This approach offers high accuracy and scalability, but it is unsuitable for real-time recognition applications. Meng et al. [6] introduced the CA attention mechanism into the EfficientNet v2 model and applied transfer learning based on ImageNet, achieving 96.8% accuracy in mushroom disease classification. However, the model still carries a large computational load and has not been optimized for small-target diseases. Zhang et al. [7] combined the CBAM attention mechanism and multi-scale convolution in the Inception v3 model, using transfer learning to optimize the training process, and achieved an accuracy of 98.4%. However, its large number of parameters limits its application in resource-constrained environments. These two-stage object recognition algorithms suffer from slow detection speeds and cannot guarantee real-time performance.
The second category, one-stage algorithms, uses a single CNN to predict both the class and spatial location of the target, balancing accuracy and speed. Lv et al. [8] enhanced the YOLOv5 model by integrating attention mechanisms and a Transformer encoder, improving its ability to recognize apple leaf diseases and, in particular, to distinguish the visually similar Alternaria spot and gray spot disease. Yang et al. [9] proposed a corn leaf disease recognition method based on YOLOv8 by constructing a real-image dataset and incorporating the GAM attention module, achieving accurate detection in complex backgrounds and addressing the challenges posed by field environments. Abulizi et al. [10], based on YOLOv9, introduced a lightweight dynamic upsampling mechanism, DySample, to improve the detection of small lesions, achieving over a 2% improvement across several metrics. Yan Chenhuizi et al. [11] replaced the backbone network of YOLOv4 with MobileNetv3 and added a coordinate attention module, achieving a good balance between accuracy and efficiency while keeping the parameter count to 39.35 M. Zhou Wei et al. [12] replaced the CSPDarkNet-53 backbone of YOLOv4 with GhostNet and improved PANet by incorporating depthwise separable convolutions, achieving an accuracy of 79.36% in rice disease identification, although there is still room for improvement. These studies propose various optimizations for the object recognition problem and improve recognition capabilities over traditional approaches; however, their detection accuracy still requires further validation.
To address the issues of subtle differences among various tomato leaf diseases, complex and irregular shapes, and low detection accuracy in existing recognition methods, this paper proposes the SXA-YOLO model (an improvement based on YOLO, where S stands for the SAAPAN architecture, X represents the XIoU loss function, and A denotes the AsDDet module) as a symmetry-aware recognition model. This model establishes a bidirectional balanced feature interaction and a dual-cooperative task processing mechanism. The model adopts a classic three-stage design consisting of Backbone, Neck, and Head, utilizing modular components to achieve efficient multi-scale object recognition, effectively meeting the requirements of agricultural monitoring. The main contributions of this study are as follows.
By synergistically integrating the C3k2 module (Cross-stage Partial Concatenated Bottleneck Convolution with Dual-kernel Design) and the SPPF (Spatial Pyramid Pooling-Fast) module, we construct the Backbone network to enable efficient pyramid pooling, thereby establishing a structured feature foundation for subsequent symmetric feature fusion.
The Neck network innovatively integrates a bidirectional feature pyramid, proposing the SAAPAN architecture (Symmetry-Aware Adaptive Path Aggregation Architecture), which facilitates the formation of symmetric feature flows in both top-down and bottom-up directions.
The head network incorporates the AsDDet module (Adaptive Symmetry-aware Decoupled Detection Head) in conjunction with the XIoU loss function (Extended Intersection over Union), enhancing the geometric alignment accuracy between predicted bounding boxes and ground truth targets. By adopting a decoupled head structure, we maintain functional symmetry while improving recognition performance.
The structure of this paper is arranged as follows: Section 3 presents the overall architecture of the model and details the design of each module. Section 4 describes the dataset and experimental setup. Section 5 covers comparative experiments, result visualization analysis, and ablation studies. Finally, Section 6 summarizes the paper and outlines future research directions.
2. Related Work
In recent years, leaf disease recognition has been extensively studied in precision agriculture and crop management. Because the YOLO framework balances accuracy and speed, it is highly suitable for real-time agricultural applications. Several studies have used YOLO models for leaf disease recognition, proving their effectiveness in identifying different crop diseases. This section reviews existing work from three perspectives: the application of general models, lightweight improvements for agricultural scenarios, and structural optimization strategies. Based on this review, we clarify the starting point of this paper.
2.1. General Agricultural Object Recognition Based on the YOLO Framework
Early studies primarily validated and compared the applicability of different YOLO variants in agricultural recognition tasks. For instance, Koirala et al. [13] first applied the fusion of features from YOLOv3 and YOLOv2-Tiny for mango recognition, achieving a 98.3% mAP50 and laying the foundation for subsequent fruit recognition tasks. Sozzi et al. [14] systematically compared six versions of YOLO, ultimately identifying YOLOv4-Tiny as the optimal model for balancing speed and accuracy in grape bunch recognition. Aldakheel et al. [15] trained the YOLOv4 model on a dataset of 14 plant leaf diseases for disease classification, with results demonstrating strong performance. These studies established an empirical foundation for applying YOLO in the agricultural domain. However, they primarily focus on a single crop or specific pests and diseases, so the trained models have limited generalization capability and are difficult to transfer directly to similar tasks on other crops.
2.2. Lightweight and Customized Improvements for Agricultural Scenarios
To accommodate mobile or edge computing devices, a series of studies has focused on model lightweighting. Zhang et al. [16] integrated MobileNetV2 with depthwise separable convolutions, Cui et al. [17] used an enhanced ShuffleNet with a simplified detection head, and Li et al. [18] introduced Apple-CSP and attention mechanisms, reducing computational complexity. At the same time, customized architectures tailored for specific crops have also been proposed. Zhang et al. [19] designed "OrangeYolo" for citrus recognition, improving the mAP50 to 95.7%; Wang et al. [20] combined ShuffleNet v2 with YOLOv5 for lychee recognition, achieving a 93.9% mAP50; Cao et al. [21] introduced an attention mechanism to build "YOLOv4-LightC-CBAM," reaching a 95.39% mAP50 in mango recognition; Chandana et al. [22] proposed the lightweight MANGO YOLO5, improving upon YOLOv5 by 3%; Abulizi et al. [10] proposed an improved model called DM-YOLO for detecting tomato leaf diseases in natural environments, which enhances the extraction of small disease features, the localization of overlapping lesion edges, and the suppression of interference from complex backgrounds; Quach et al. [23] developed a tomato fruit health recognition system with real-time tracking and counting capabilities using the YOLOv8-Grad-CAM++ model, further improving detection accuracy; and Lai et al. [24] applied YOLOv4 to oil palm fresh fruit bunch recognition, expanding the application range of these models. These studies highlight the importance of domain adaptation, but they primarily focus on individual crops or the replacement of specific lightweight components. Many existing models enhance small-target features by integrating attention mechanisms such as CBAM, CA, and GAM. However, frequent upsampling operations and complex attention computations, while emphasizing key locations, distort the original detail information of other areas. In multi-scale processing, improvements to FPN/PAN structures have been made, yet shallow detail information still struggles to influence deeper decisions. When addressing complex backgrounds, more intricate classification heads or refined loss functions often lead to the backbone features shared within coupled detection heads being dominated by a single task, leaving insufficient signal for the other task and making it difficult to accurately delineate irregular disease boundaries.
2.3. Structural Optimization for Complex Scenarios and Performance Enhancement
More recent research has focused on enhancing model performance in complex agricultural environments through more refined network architecture design. Hu et al. [25] integrated ECA and CA attention mechanisms to develop a multi-module YOLOv7-L, while Ren et al. [26] proposed a lightweight LFEBNet backbone and a CPEA attention module, creating a more efficient YOLO-Lite network. Wu et al. [27] developed the YOLOv11-PGC model, specifically designed for tomato ripeness recognition, by integrating a polarization state space strategy with a global context module, thereby extending the reach of precision agriculture recognition. Feng et al. [28] introduced the IFEM module and a degradation-aware loss function to construct the underwater recognition model IFEM-YOLOv13, which balances optical compensation and symmetry recovery, effectively enhancing object recognition in complex media. These studies not only improve accuracy but also pursue model lightweighting and structural optimization, providing important technical groundwork for real-time disease recognition in complex agricultural scenarios.
2.4. Research Gaps and the Entry Point of This Study
Despite advancements in model accuracy and structural optimization in the aforementioned studies, the detection of tomato leaf diseases continues to face complex challenges, including small targets, variable shapes, and complicated backgrounds. Existing advanced improvements often employ a modular stacking strategy, such as integrating attention mechanisms or enhancing specific pathways. However, these strategies can create significant fragmentation when addressing complex challenges; for example, attention mechanisms may amplify background noise and dilute details of small targets, while asymmetric feature pyramids can prevent shallow detail information from influencing deeper decisions. Enhancing a single network layer may alleviate one issue but can exacerbate another. Therefore, conventional modular enhancements are not suitable for tackling these complex challenges.
Based on the aforementioned research foundations and technical context, this study proposes the SXA-YOLO symmetry-aware recognition model, aimed at improving disease detection accuracy in complex scenarios. In contrast to previous work, this paper does not introduce isolated new modules; instead, it takes "symmetry" as a design principle that permeates the entire process of feature extraction, fusion, and decision-making, aiming to achieve equitable interaction of multi-scale features and collaborative optimization of the classification and regression tasks.
The tomato leaf disease dataset used in this paper, which includes the concentric rings of early blight and the amorphous lesions of late blight, often features lesions of various sizes coexisting within a single image, with a distribution that emphasizes both local and global features. A unidirectional Feature Pyramid Network (FPN) tends to lose detail when transmitting feature information, whereas the proposed bidirectional symmetric feature flow ensures that subtle disease details and leaf-wide contextual semantics are fused on an equal footing, improving detection accuracy for targets of varying sizes. Additionally, distinguishing similar disease features, such as those of spider mites and leaf mold, requires a balanced perception of multiple attributes, including color, texture, and edges. Traditional coupled heads may lean towards localization during optimization, while the functionally symmetric decoupled head proposed in this paper, through its dual-branch design, forces the model to learn the geometric features needed for precise localization in a parallel and balanced manner, thus enhancing recognition capabilities. Table 1 compares this study with representative state-of-the-art works.
3. SXA-YOLO Symmetry-Aware Recognition Model
The background of tomato leaf disease images is complex, and the differences in appearance among disease types are subtle. Moreover, the same disease may vary in shape, position, and size, leading to false and missed detections. To address the insufficient extraction of key features from disease images against natural backgrounds, this study proposes the SXA-YOLO symmetry-aware recognition model. The model consists of a backbone network, a neck network, and a head network. The input data are processed by convolution operations in the backbone network to extract features, which are then fused at multiple levels in the neck network and passed into the head network to generate the classification and location information of the detected objects. The structure of the model is shown in Figure 1.
The backbone is based on the YOLOv11 framework and consists of five convolutional layers, four C3k2 (Cross-stage Partial Concatenated Bottleneck Convolution with Dual-kernel Design) modules, and one SPPF (Spatial Pyramid Pooling-Fast) module. It progressively extracts features from the input image, from low level to high level, gradually reducing the spatial resolution while increasing the number of channels to obtain more abstract and semantically rich representations, all while maintaining the symmetry of the feature hierarchy and structure. The C3k2 module is a residual structure with a multi-path parallel architecture, allowing it to retain shallow detail features, such as the fine punctate textures of early-stage diseases, while extracting deep semantic information from the extensive necrotic areas of late-stage diseases. The SPPF module is a symmetric spatial pyramid structure built from concatenated multi-scale pooling operations, which synchronously extracts contextual information from multiple receptive fields without altering the size of the feature maps. This enables effective fusion of cross-scale representations, such as the mold-layer features on the underside of leaf mold and the overall wrinkled morphology of yellow leaf curl virus disease. The C2PSA module has been removed because its spatial attention mechanism dynamically enhances certain spatial locations within the feature map, which leads to excessive suppression of features in other areas during the early stages of the backbone network.
The input image first passes through two 3 × 3 convolutions with a stride of two, reducing the resolution to 1/4 and expanding the channel count to 128. Subsequently, two sets of C3k2 modules, combined with regular convolutional branches, enhance feature diversity, with the resolution gradually reduced to 1/8 and 1/16. In the deeper stages of the network, C3k2 employs a bottleneck structure to strengthen semantic extraction and, with convolutional downsampling, compresses the resolution to 1/32 while increasing the channel count to 1024. Finally, the SPPF module aggregates context information through multiscale pooling, outputting a high-dimensional 20 × 20 × 1024 feature pyramid and achieving a hierarchical representation from low-level details to high-level semantics. This provides a rich and structurally symmetric multi-scale feature foundation for the subsequent detection head.
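The efficiency of SPPF's concatenated pooling rests on a simple identity: chaining small stride-1 max pools reproduces the larger receptive fields of classic SPP. The sketch below (our illustration, not the paper's code) demonstrates this in 1-D, assuming the usual 5 × 5 kernels of YOLO-style SPPF; two chained 5-pools equal one 9-pool and three equal one 13-pool.

```python
# Illustrative sketch: why SPPF's cascaded pooling covers the 5/9/13
# receptive fields of classic SPP at lower cost. A 1-D stride-1 max pool
# with "same" padding stands in for the 2-D case.

def maxpool1d(x, k):
    """Stride-1 max pool with same padding; k is assumed odd."""
    pad = k // 2
    padded = [float("-inf")] * pad + list(x) + [float("-inf")] * pad
    return [max(padded[i:i + k]) for i in range(len(x))]

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

once = maxpool1d(x, 5)
twice = maxpool1d(once, 5)      # receptive field 5 + 5 - 1 = 9
thrice = maxpool1d(twice, 5)    # receptive field 9 + 5 - 1 = 13

assert twice == maxpool1d(x, 9)
assert thrice == maxpool1d(x, 13)
```

Because each pooling stage reuses the previous result, SPPF computes three effective scales with three cheap k = 5 operations instead of separate k = 5, 9, 13 pools.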
In the Neck, features extracted from different layers of the backbone are integrated to enhance the model's ability to detect targets of varying sizes. Building on the top-down (FPN) and bottom-up (PAN) pathways, the SAAPAN architecture (Symmetry-Aware Adaptive Path Aggregation Architecture) is proposed to improve computational accuracy and feature representation capability. Here, P1–P5 are the feature maps output from different layers of the backbone network, while N1–N3 are the feature maps used for prediction in the detection head. First, the P4 feature output from the backbone is downsampled through SimConvWrapper (Simplified Convolution Wrapper) and concatenated with the P5 feature; this is then processed by the RepHDW (Re-parameterizable Hybrid Dilated Convolution Wrapper) reparameterization module, yielding the N3 feature. Subsequently, this feature is upsampled, fused with the P4 feature, and passed through the C2f module to generate the N2 feature. After another upsampling, it is concatenated with the original P3 feature to obtain the N1 feature. The entire network employs a bidirectional feature pyramid structure, alternating between the RepHDW and C2f modules within the PAN-FPN pathway to achieve symmetric and efficient feature fusion. The RepHDW module enhances feature representation during training through a multi-branch structure, which can be merged into a single branch for faster computation during inference, whereas the C2f module achieves lightweight feature interaction through cross-stage partial connections. Finally, three feature maps of different scales are output: N1 (80 × 80), N2 (40 × 40), and N3 (20 × 20), used to detect small-scale lesions (such as the tiny spots of septoria leaf spot), medium-scale lesions (such as the circular lesions of target spot), and large-scale lesions (such as the concentric rings of early blight), reflecting the structural symmetry of multi-scale feature interactions.
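The fusion order above can be sanity-checked with a minimal shape trace. This sketch is ours, not the paper's code: only the 80/40/20 spatial sizes come from the text, while the stride-2 downsampling and 2× upsampling conventions are standard YOLO-style assumptions.

```python
# Hedged shape trace of the SAAPAN fusion order for a 640 x 640 input.
# Only spatial sizes are tracked; channel counts are omitted.

def down(hw):   # stride-2 convolution halves the spatial size
    return hw // 2

def up(hw):     # 2x upsampling doubles it
    return hw * 2

P3, P4, P5 = 80, 40, 20          # backbone outputs at strides 8/16/32

n3 = down(P4)                    # SimConvWrapper downsamples P4 to 20 x 20
assert n3 == P5                  # ...so it can be concatenated with P5
N3 = n3                          # RepHDW keeps the spatial size: N3 is 20 x 20

n2 = up(N3)                      # upsample toward the stride-16 level
assert n2 == P4                  # fuse at the P4 scale, then C2f: N2 is 40 x 40
N2 = n2

n1 = up(N2)                      # upsample toward the stride-8 level
assert n1 == P3                  # concatenate with P3: N1 is 80 x 80
N1 = n1

assert (N1, N2, N3) == (80, 40, 20)
```

The assertions confirm that each concatenation in the described path joins feature maps of matching resolution, which is what makes the bidirectional flow "symmetric" in scale.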
The Head adopts a decoupled structure, where the bounding box regression, confidence prediction, and classification tasks are handled separately, enhancing detection accuracy through functional symmetry. It receives N1, N2, and N3 from the Neck, each handling a different scale of the recognition task. An innovative XIoU (Extended Intersection over Union) loss function is proposed to optimize bounding box regression by considering factors such as center distance, aspect ratio, and IoU. Compared to traditional IoU, it achieves more precise localization, particularly improving the accuracy of locating small targets such as the speckled leaf discoloration caused by spider mites. To enhance small object recognition, the AsDDet (Adaptive Symmetry-aware Decoupled Detection Head) module is introduced, which improves the detection head by increasing network depth, incorporating dynamic convolutions, and using attention mechanisms to further strengthen feature representation. By integrating dynamic label assignment strategies with the collaborative optimization of the XIoU loss function, the geometric alignment between predicted boxes and ground-truth targets is improved. While maintaining the advantages of functional symmetry, this approach also achieves breakthroughs in recognizing subtle features, such as the complex mosaic symptoms of the mosaic virus and the halo edges of bacterial spots.
3.1. SAAPAN
The SAAPAN architecture in the neck network consists of three modules that work collaboratively together with the C2f module, clearly demonstrating a symmetrical design in terms of functionality. The SimConvWrapper serves as the basic feature extraction unit, capturing shallow features through a Conv-BN-ReLU structure. The ConvWrapper performs high-level feature transformation using a Conv-BN-SiLU structure. The RepHDW module utilizes reparameterizable depthwise separable convolutions to achieve multi-scale feature fusion, supporting dynamic depth expansion and mixed kernel sizes. The modules complement one another structurally, creating a clear and bidirectionally symmetric feature processing workflow. Figure 2 illustrates the flow of feature map changes, using distinct color mappings for different layers to aid visual differentiation and annotating the input and output sizes at each layer. A detailed description of this process follows.
The SimConvWrapper is a lightweight wrapper that encapsulates the SimConv module, combining simplified convolution with ReLU activation, as shown in Figure 3. In Figure 3a, the module employs a 3 × 3 convolution kernel (with default stride = 1 and padding = 1) to maintain the feature map resolution, accelerates training convergence via batch normalization, and enhances nonlinear expression capability with the ReLU activation function. As the foundational feature extraction unit within the SAAPAN architecture, this module offers advantages in computational efficiency and parameter reduction, providing a stable underlying representation for subsequent symmetric feature fusion. It is particularly beneficial for preserving local detail features, such as the clear halo edges of bacterial spots and the small circular lesions of septoria leaf spot.
The ConvWrapper, which is based on standard convolution with the SiLU activation function, provides stronger feature representation capability, as shown in Figure 3b. This module uses a 3 × 3 convolution kernel by default, with automatic padding to maintain the feature map size, stabilizes the data distribution through batch normalization, and employs the SiLU activation function for a smoother nonlinear feature transformation. It supports grouped convolutions and customizable strides, primarily performing high-level feature transformation and enhancement within the SAAPAN architecture. It effectively extracts complex semantic information, such as the irregular diffusion areas of late blight and the texture of the fungal layer on the underside of leaf mold, thereby forming a hierarchical symmetry in both structure and function with the SimConvWrapper.
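The functional difference between the two wrappers comes down to their activations. The following minimal sketch (ours, not the paper's code) contrasts ReLU's hard gating in SimConvWrapper with SiLU's smooth gating in ConvWrapper.

```python
# Minimal activation sketch: SimConvWrapper uses ReLU (hard zeroing of
# negatives), ConvWrapper uses SiLU, i.e. x * sigmoid(x) (smooth gating).
import math

def relu(x):
    return max(0.0, x)

def silu(x):
    return x / (1.0 + math.exp(-x))   # equivalent to x * sigmoid(x)

# ReLU discards all negative responses; SiLU passes small negative values
# with attenuated weight, which yields smoother gradients near zero.
assert relu(-1.0) == 0.0
assert -0.5 < silu(-1.0) < 0.0

# For large positive inputs SiLU approaches the identity, like ReLU.
assert abs(silu(5.0) - 5.0) < 0.05
```

The smoother transition around zero is why the higher-level ConvWrapper favors SiLU, while the cheaper ReLU suffices for the shallow SimConvWrapper path.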
RepHDW is a reparameterizable depthwise separable convolution module within the SAAPAN architecture, designed as a three-stage symmetrical process of "expansion-fusion-compression," as shown in Figure 4. First, the module expands the channel count of the input feature map to twice its original size using 1 × 1 convolutions, establishing a closed-loop symmetry in both structure and computation. The feature X is then split into two symmetric branches, X1 and X2. This symmetric splitting mechanism enables the parallel processing of differentiated morphological features, such as the concentric ring structures of early blight and the scattered yellowing patches caused by the mosaic virus. Depthwise separable convolutions are then applied to each branch for multi-scale feature extraction. Finally, the features from the branches are concatenated and compressed through a 1 × 1 convolution for output. This design maintains strict symmetry in the splitting and fusion of features. The module innovatively combines reparameterization, using multiple branches during training and a single merged branch during inference, with a dynamic channel adjustment mechanism to improve accuracy while reducing the computational load. It thereby also reflects the functional symmetry of the network architecture across the training and inference phases.
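The train-time/inference-time equivalence that reparameterization relies on can be shown with a toy example. This 1-D sketch is our illustration of the general idea, not RepHDW itself: a 3-tap branch, a 1-tap branch, and an identity shortcut used during training collapse into one 3-tap kernel for inference, with numerically identical outputs.

```python
# Toy 1-D illustration of structural reparameterization: a multi-branch
# block (3-tap conv + 1-tap conv + identity) folds into a single kernel.

def conv1d(x, w):
    """Stride-1 convolution with zero 'same' padding (odd kernel length)."""
    pad = len(w) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[j] * xp[i + j] for j in range(len(w))) for i in range(len(x))]

x = [1.0, -2.0, 3.0, 0.5, -1.0]
w3 = [0.2, 0.5, -0.1]     # 3-tap training branch
w1 = [0.7]                # 1-tap training branch

# Training-time block: run all branches, then sum their outputs.
multi_branch = [a + b + c for a, b, c in
                zip(conv1d(x, w3), conv1d(x, w1), x)]

# Inference-time block: fold the 1-tap branch and the identity (weight 1.0)
# into the centre tap of a single 3-tap kernel.
w_merged = [w3[0], w3[1] + w1[0] + 1.0, w3[2]]
single_branch = conv1d(x, w_merged)

# The merged kernel reproduces the multi-branch output exactly.
assert all(abs(a - b) < 1e-9 for a, b in zip(multi_branch, single_branch))
```

Because the merge is exact, RepHDW-style modules pay the representational cost of multiple branches only at training time, while inference runs a single fused convolution.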
3.2. XIoU
XIoU is an extended variant of IoU that introduces a penalty term for the aspect ratio difference in bounding boxes based on the DIoU. The workflow first computes the basic IoU as shown in (1).
In this context, Inter represents the intersection area (overlapping region) between the predicted and ground-truth bounding boxes, and Union represents the union area (total coverage area) between the predicted and ground-truth bounding boxes. IoU measures the degree of overlap between two boxes, where a higher value indicates a better match.
Subsequently, the center distance penalty (similar to DIoU) is combined with a newly added aspect ratio penalty term, in which the aspect ratio is transformed by a sigmoid function and the squared difference is then calculated, as shown in (2)–(5).
Here, (x_gt, y_gt) represents the center coordinates of the ground-truth box, whereas (x_pred, y_pred) denotes the center coordinates of the predicted box. ρ² indicates the squared Euclidean distance between the centers of the two boxes (normalized by dividing by 4 to align with the unit of the side length). c_w and c_h represent the width and height of the minimum enclosing box, and c² is its squared diagonal length. q refers to the sigmoid transformation applied to the aspect ratio, compressing the value into the range (1, 2) to prevent numerical instability at extreme aspect ratios; q_pred and q_gt are its values for the predicted and ground-truth boxes, and v is the squared difference between them.
Finally, the contributions of the two penalty terms are balanced through an adaptive weight, and the output is XIoU, as shown in (6) and (7), where α denotes the adaptive weight, set to 0.8 in this work. Compared to CIoU, which uses arctan to measure the angle difference, XIoU employs a simpler sigmoid function to handle the aspect ratio, maintaining sensitivity to the center point distance while penalizing shape differences more smoothly. The penalty strength is controlled through adjustable parameters, providing more robust guidance for box regression. Specifically, the XIoU loss function enhances the model's localization accuracy for characteristically shaped lesions, such as the circular halo of bacterial spots and the annular lesions of target spot, by balancing the penalties for center point distance and shape difference through the adaptive weight. In addition, its smooth shape penalty mechanism effectively stabilizes bounding box regression for morphologically variable targets, such as the overall leaf curling caused by the yellow leaf curl virus and the irregular expansion areas associated with late blight.
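The description above can be sketched in code. Note the hedging: the exact formulas are given by the paper's Equations (1)–(7); here the specific mapping q = 1 + sigmoid(w / h) for the (1, 2) range, the helper names, and the corner-coordinate box convention are our assumptions, with the adaptive weight α fixed at 0.8 as stated.

```python
# Hedged sketch of XIoU as described in the text: IoU minus a DIoU-style
# center-distance penalty minus an alpha-weighted sigmoid aspect-ratio
# penalty. The q mapping and helper names are assumptions, not the paper's code.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xiou(pred, gt, alpha=0.8):
    """Boxes given as (x1, y1, x2, y2) corner coordinates."""
    # (1) Basic IoU: intersection area over union area.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter)

    # Center-distance penalty, normalized by the squared diagonal of the
    # minimum enclosing box (rho^2 divided by 4 as described in the text).
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = ((cx_g - cx_p) ** 2 + (cy_g - cy_p) ** 2) / 4
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2

    # Smooth aspect-ratio penalty via sigmoid instead of CIoU's arctan.
    q_pred = 1.0 + sigmoid((pred[2] - pred[0]) / (pred[3] - pred[1]))
    q_gt = 1.0 + sigmoid((gt[2] - gt[0]) / (gt[3] - gt[1]))
    v = (q_pred - q_gt) ** 2

    return iou - rho2 / c2 - alpha * v

# Identical boxes score 1; a shifted box is penalized below 1.
box = (10.0, 10.0, 50.0, 30.0)
assert abs(xiou(box, box) - 1.0) < 1e-9
assert xiou((12.0, 10.0, 52.0, 30.0), box) < 1.0
```

Used as a loss, this would typically be minimized as 1 − XIoU, so perfectly aligned boxes contribute zero loss.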
3.3. AsDDet
AsDDet employs a parallel symmetric dual-branch structure that handles the regression and classification tasks simultaneously. The regression branch (REG) enhances feature interaction through standard convolution and channel shuffling operations and outputs the bounding box coordinate distribution, specifically optimizing for geometric features that require accurate localization, such as the clear edges of bacterial spots and the circular contours of target spots. The classification branch (CLS) employs depthwise separable convolutions to reduce computational burden while outputting class probabilities, focusing on distinguishing the textural differences between early blight and late blight, as well as the color and pattern features associated with leaf mold and mosaic virus. The two branches are structurally symmetric, functioning independently while working collaboratively, reflecting the functional symmetry of the detection head. During the training phase, the AsDDet module employs a dynamic optimization strategy to set key parameters. For positive sample matching, a dynamic anchor-to-ground-truth intersection over union (IoU) threshold of 0.5 is established, with anchors above this threshold treated as positive samples. To make learning more adaptive, the module introduces a hard-negative-ignoring mechanism, setting the ignore band to (0.4, 0.5); anchors within this range are temporarily ignored during training to stabilize gradient updates. In the inference phase, the module adaptively allocates computation based on classification confidence. A dynamic allocation threshold of 0.7 is set: enhanced convolutions are used for feature extraction in prediction regions with confidence above this threshold, while lightweight convolutions are applied to regions with lower confidence, achieving an efficient allocation of computational resources.
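The thresholding logic above reduces to two small decision rules, sketched here with the 0.5 / (0.4, 0.5) / 0.7 values from the text; the function names and the treatment of the exact boundary values are our assumptions.

```python
# Sketch of AsDDet's threshold rules (names and boundary handling assumed).

def assign_label(iou, pos_thr=0.5, ignore_band=(0.4, 0.5)):
    """Training-time label assignment by anchor-to-ground-truth IoU."""
    if iou >= pos_thr:
        return "positive"
    if ignore_band[0] < iou < ignore_band[1]:
        return "ignore"          # hard negatives near the boundary are skipped
    return "negative"

def choose_branch(confidence, alloc_thr=0.7):
    """Inference-time compute allocation by classification confidence."""
    return "enhanced_conv" if confidence > alloc_thr else "lightweight_conv"

assert assign_label(0.72) == "positive"
assert assign_label(0.45) == "ignore"     # inside the (0.4, 0.5) ignore band
assert assign_label(0.20) == "negative"
assert choose_branch(0.90) == "enhanced_conv"
assert choose_branch(0.30) == "lightweight_conv"
```

The ignore band keeps ambiguous anchors out of the loss, so gradients are driven only by clearly positive and clearly negative samples.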
First, the input multiscale feature maps are fed into both REG and CLS for parallel processing. REG extracts positional features through two 3 × 3 convolutions and outputs the bounding box distribution, whereas CLS extracts semantic features through depthwise separable convolution followed by conventional convolution and outputs class predictions. The processing flows of the two branches form a structural mirror symmetry, with each branch focusing on one of the complementary tasks of geometric localization and semantic classification. During training, the concatenated raw predictions are output directly for loss computation. During inference, the regression outputs are decoded into actual coordinates through DFL and concatenated with the classification scores to produce the final detection results, with post-processing steps that include dynamic anchor generation, conversion to bounding boxes (dist2bbox), and class score normalization. The entire process follows a "feature extraction, two-branch symmetric prediction, differentiated output" architecture, as shown in Figure 5, achieving efficient object recognition.
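The dist2bbox step can be sketched for a single anchor point. This mirrors the standard YOLO-style decode, where the regression branch predicts distances to the four box sides; the signature below is our assumption, not the paper's exact implementation.

```python
# Hedged sketch of the dist2bbox post-processing step: the regression branch
# predicts distances (left, top, right, bottom) from an anchor point, which
# are converted into corner coordinates.

def dist2bbox(anchor, ltrb):
    """Decode (l, t, r, b) side distances around an anchor into (x1, y1, x2, y2)."""
    ax, ay = anchor
    l, t, r, b = ltrb
    return (ax - l, ay - t, ax + r, ay + b)

# An anchor at (100, 100) with side distances (10, 20, 30, 40) decodes to a
# box spanning (90, 80) to (130, 140).
assert dist2bbox((100.0, 100.0), (10.0, 20.0, 30.0, 40.0)) == (90.0, 80.0, 130.0, 140.0)
```

Because the distances are non-negative by construction, the anchor point always lies inside the decoded box, which keeps the regression target well behaved.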
6. Conclusions
To address the challenges of multiple categories, subtle differences, and complex irregular shapes in tomato leaf disease recognition, the SXA-YOLO symmetry-aware recognition model was proposed. The model achieves equitable fusion of multi-scale features and collaborative optimization of the classification and regression tasks by constructing a backbone network with a symmetrical feature hierarchy, integrating a neck network with symmetrical fusion paths, and employing a detection head with task-functional symmetry. Tested on the PlantifyDr dataset, the model recognizes ten types of tomato leaf diseases, achieving mAP50 and mAP50-95 scores of 0.993 and 0.932, respectively, and outperforming nine other object recognition models, including those in the YOLO series, CenterNet, EfficientDet, DETR, and EfficientNet. The model effectively reduces false negatives and false positives in tomato leaf disease recognition, demonstrating its advantage in handling complex backgrounds and multiple diseases. However, training the model requires substantial computational power, which places high demands on hardware, and its multi-object recognition capability in dense scenes still requires improvement. Future work will therefore focus on further optimizing the model structure to reduce its size, enabling real-time inference on embedded platforms (such as the Jetson series and mobile devices); constructing a cross-season, cross-region dataset that covers a broader range of disease types and growth stages to enhance the model's adaptability to complex environmental changes; and exploring applications in multi-target recognition to facilitate the translation of research findings into practical use.