1. Introduction
The tomato is an important vegetable crop cultivated globally. However, tomato diseases frequently arise from natural environmental factors, such as climate change, and from human interventions, such as poor drainage and insufficient fertilization, resulting in substantial reductions in yield and economic value [1]. Early disease detection enables precisely targeted pesticide application to prevent the spread of disease and minimize crop loss [2]. Most disease symptoms on tomato plants appear on the leaves as spots, yellowing, necrosis, leaf distortion, and similar signs, which serve as important indicators for disease identification [3]. Traditional disease detection usually relies on visual examination or laboratory tests, both of which are not only time- and labor-intensive but also require specialized knowledge, making large-scale disease monitoring and accurate diagnosis difficult to achieve [4].
The emergence and development of deep learning technologies, particularly Convolutional Neural Networks (CNNs), has led to their introduction into agricultural disease recognition [5], which not only alleviates the workload but also enhances the accuracy of disease detection. Vini et al. [6] proposed TrioConvTomatoNet, a deep convolutional neural network architecture that achieved a precision of 99.39% in classifying tomato diseases. Chakrabarty et al. [7] introduced a hybrid framework integrating a transformer architecture with a lightweight CNN, which showed high precision and recall when detecting rice leaf diseases. Zhang et al. [8] adopted ResNet-50 as the foundational framework of their proposed model, whose precision for tomato disease recognition on the Plant Village dataset [9] reached 98.25%. Liu et al. [10] introduced a multi-scale constrained deformable convolution network, MCDCNet, which enhanced apple leaf disease detection by extracting reliable features across varying scales and geometries; its accuracy reached 66.8% for apple leaf disease in a complex natural environment, an improvement of 3.85% over existing state-of-the-art models. Although CNNs demonstrate strong performance in agricultural disease detection tasks, they still face challenges in agricultural scenarios that require real-time processing, because their inference is slow on devices with constrained computational capabilities. In response to these challenges, the You Only Look Once (YOLO) series of models [11,12,13,14,15,16,17,18] was proposed. YOLO predicts bounding boxes and class probabilities for an entire image via a single neural network, which greatly simplifies the traditional multi-step detection process and achieves excellent accuracy while maintaining a high detection speed [11]. YOLO therefore has strong advantages in scenarios requiring both high precision and real-time performance, such as agricultural automation and crop health monitoring. Li et al. [19] introduced an improved lightweight model based on YOLOv5s for identifying vegetable diseases; the model achieved an mAP@0.5 of 93.1% on a dataset with five diseases, effectively reducing the missed and false detections caused by complex backgrounds and small-scale disease symptoms. Guo et al. [20] developed YOLOv7-TMRTM to rapidly and accurately detect rice leaf disease symptoms of various sizes; it outperforms the baseline YOLOv7-tiny model in detecting leaf spots of various sizes and small targets of various types. Yang et al. [21] introduced the slim-neck module and Global Attention Mechanism (GAM) into YOLOv8, achieving improvements of 3.56%, 7.3%, 3.79%, and 4.65% in the mAP@0.5, mAP@0.5:0.95, precision, and recall for corn leaf disease detection, respectively. Yan et al. [22] introduced FSM-YOLO, an improved convolutional neural network for detecting apple leaf diseases, which enhanced detection accuracy by introducing adaptive feature capture and spatial context awareness; the model achieved a 2.7% improvement in the mAP@0.5 over the baseline YOLOv8s on the ALDD dataset. However, YOLO models and CNNs are constrained by their local receptive fields and usually find it difficult to capture global spatial features and dependencies [23], which can cause them to miss scattered lesion spots and subtle disease signs on leaves in complex background scenarios.
The State Space Model (SSM)-based approach, exemplified by Mamba [24], performs well in modeling long-distance dependencies while maintaining computational complexity that is linear in sequence length, and it has become an efficient and widely applicable sequence model that addresses the computational inefficiency of transformers on long sequences. Several studies have applied SSMs to object detection with good results. FER-YOLO-Mamba [25] combines Mamba and YOLO, integrating the inherent advantage of convolutional layers in local feature extraction with the strength of SSMs in revealing long-distance dependencies, and shows strong robustness and generalization in facial expression detection and classification. Mamba-YOLO [26] is an object-detection model based on the SSM that not only optimizes the SSM foundation but also adapts it specifically for object detection. Extensive experiments on public benchmark datasets, such as COCO and VOC, demonstrated that Mamba-YOLO surpasses existing YOLO series models, showcasing its substantial potential and competitive edge. More recently, Mamba2 [27] further refined the Selective State Space Model (S6) by introducing State Space Duality (SSD): it treats the state-transition matrix as a scalar and extends the dimensions of the state space, improving both model performance and the efficiency of training and inference. Hence, in this research, we incorporated Mamba2 into the neck network and propose the YOLO-BSMamba model for tomato leaf disease recognition. The innovations of the algorithm are as follows:
(1) A Similarity-Based Attention Mechanism (SimAM) [28] is introduced into the backbone network to reduce background noise interference, highlight the diseased areas, and further enhance the model's adaptability to various complex backgrounds (a minimal sketch of SimAM is given after this list).
(2) A Hybrid Convolutional Mamba (HCMamba) module is proposed, which fuses the local detail information extracted by convolution with the global context information provided by the SSM (see the schematic sketch after this list). This design enhances the model's capacity to capture both global and fine-grained image features, thereby improving disease localization and classification.
(3) The weighted bidirectional feature pyramid network (BiFPN) is used as the feature-fusion module of the network. BiFPN's multi-scale feature fusion improves the model's ability to detect diseases of differing severity, and its weighted feature fusion improves the model's sensitivity to key disease areas.
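To make contribution (1) concrete, the following is a minimal PyTorch sketch of SimAM as defined in the original paper [28]: a parameter-free mechanism that weights every neuron by an inverse energy term computed from its deviation from the channel mean. The module name and the default λ value follow the SimAM paper; where the module is placed in the backbone is not shown here.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: neurons that deviate more from their
    channel mean receive lower energy and therefore higher weights."""
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularizer from the SimAM paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # Squared deviation of each position from its channel mean.
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # Per-channel variance estimate.
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # Inverse energy; lower energy marks more distinctive neurons.
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)
```

For contribution (2), the paper describes HCMamba only at the level of its design idea (a convolutional branch for local detail fused with an SSM branch for global context), so the sketch below is a hypothetical schematic rather than the authors' exact module. It assumes the `mamba_ssm` package for the SSM branch; the depthwise-separable local branch and the 1 × 1 fusion convolution are illustrative choices.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumption: pip install mamba-ssm (CUDA required)

class HCMambaSketch(nn.Module):
    """Schematic hybrid conv + SSM block; NOT the paper's exact HCMamba."""
    def __init__(self, channels: int):
        super().__init__()
        # Local branch: depthwise 3x3 + pointwise 1x1 for fine-grained detail.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        # Global branch: Mamba over the flattened spatial sequence.
        self.norm = nn.LayerNorm(channels)
        self.ssm = Mamba(d_model=channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        seq = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        glob = self.ssm(self.norm(seq))              # linear-time global context
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1)) + x  # residual fusion
```

For example, `HCMambaSketch(256)` would process a (B, 256, H, W) neck feature map while preserving its shape.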
3. Results
This section provides a detailed account of the performance of the YOLO-BSMamba model on the tomato leaf disease detection task. We first introduce the experimental platform and parameter settings, including the hardware configuration and the key training parameters. We then demonstrate, through comparative experiments against the YOLOv8s model, the improvements YOLO-BSMamba achieves across various metrics. Ablation experiments verify the contributions of the proposed modules (the HCMamba module, the SimAM attention mechanism, and BiFPN) to the model's performance. Finally, comparison with other YOLO series models further substantiates the superiority of YOLO-BSMamba for tomato leaf disease detection.
3.1. Experimental Platform and Parameter Settings
All experiments were carried out using Python 3.8 and PyTorch 2.0.0, and an RTX 3090 GPU with 24 GB of memory was used for training. The detailed configurations of the experimental setup are outlined in Table 2, ensuring transparency and facilitating replication of the study.
In the training process, the input image dimensions for the network were set to 640 × 640 pixels, and stochastic gradient descent (SGD) was employed as the optimization strategy. The initial learning rate was set to 0.01, with a momentum of 0.9 and a weight decay of 0.0005. The batch size was 16, and the number of epochs was 300. We also used a warmup strategy for the first three epochs, gradually raising the learning rate to 0.01. Training for 300 epochs took 10 h.
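For reference, these settings correspond to the following PyTorch sketch. The placeholder model and loop skeleton are illustrative only, since YOLOv8-style trainers configure the optimizer and warmup internally.

```python
import torch
import torch.nn as nn

# Placeholder for the YOLO-BSMamba network; a real trainer builds the full model.
model = nn.Conv2d(3, 16, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)

BASE_LR, WARMUP_EPOCHS, EPOCHS, BATCH_SIZE = 0.01, 3, 300, 16

for epoch in range(EPOCHS):
    if epoch < WARMUP_EPOCHS:
        # Linear warmup: ramp the learning rate up to BASE_LR over 3 epochs.
        for g in optimizer.param_groups:
            g["lr"] = BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # ... run one training epoch on 640x640 inputs with batch size 16 ...
```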
3.2. Comparison of Performance Between YOLO-BSMamba and YOLOv8s
Figure 7 illustrates the changes in the mAP@0.5 and the total loss during the training of both models. The loss curves show that the total loss of YOLO-BSMamba (blue curve) declines rapidly during the initial training phase and remains below that of YOLOv8s (red curve) throughout training, with the pronounced early decline reflecting swift convergence. Similarly, the mAP@0.5 curves show that, after stabilizing, YOLO-BSMamba consistently outperforms YOLOv8s on the validation set. This indicates that YOLO-BSMamba possesses higher accuracy and stronger generalization ability.
Table 3 shows the results for the YOLOv8s and YOLO-BSMamba models and indicates that YOLO-BSMamba outperforms YOLOv8s across several key metrics. Specifically, YOLO-BSMamba shows improvements of 3.0%, 3.1%, 2.0%, 4.8%, and 4.3% in P, R, F1 score, mAP@0.5, and mAP@0.5:0.95, respectively, compared to YOLOv8s.
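For clarity, these metrics follow their standard object-detection definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, AP_i is the average precision of class i (the area under its precision-recall curve), and N is the number of disease categories:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2PR}{P + R}, \quad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$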
Table 4 compares the performance of the YOLOv8s and YOLO-BSMamba models in detecting the various tomato leaf diseases, offering a detailed evaluation of their detection capabilities in each category. Overall, YOLO-BSMamba demonstrates superior performance to YOLOv8s, exhibiting a higher mAP@0.5, P, and R across the majority of disease categories. The mAP@0.5 results show that YOLO-BSMamba surpasses YOLOv8s in all categories, with the most significant improvements in early blight and septoria detection, where the mAP@0.5 increases by 6.5% and 18.8%, respectively. Meanwhile, the P and R metrics for late blight, spider mites, and yellow leaf curl virus are slightly lower than those of YOLOv8s in some cases, while YOLO-BSMamba achieves superior performance across the remaining six disease categories, underscoring its robustness in disease-specific detection.
3.3. Visual Analysis
To visually illustrate the detection performance of the YOLO-BSMamba model, we analyzed it using a confusion matrix and feature heatmaps. The confusion matrix illustrates the model's classification accuracy, while the heatmaps show, in an intuitive visual form, where the model focuses its attention.
Figure 8 displays the normalized confusion matrices of YOLOv8s and YOLO-BSMamba. Darker cells on the main diagonal indicate higher detection accuracy, while darker off-diagonal cells indicate that the corresponding row and column classes are more frequently mistaken for each other. Overall, the YOLO-BSMamba model demonstrates superior detection accuracy across most disease categories compared to YOLOv8s. As can be observed from Figure 8b, the confusion matrix of YOLO-BSMamba shows darker diagonal elements than that of YOLOv8s, pointing to higher detection accuracy across most disease categories. YOLO-BSMamba also exhibits a lower rate of misclassifying the various categories as "background" than YOLOv8s, indicating superior performance in distinguishing the classes from the background. Although its performance in detecting spider mites is slightly lower than that of YOLOv8s, the difference is small and does not significantly affect the overall improvement. These results suggest that YOLO-BSMamba offers superior detection accuracy for most categories of tomato leaf disease, with a reduced misclassification rate, showcasing stronger generalization ability and making it more appropriate for complex and diverse agricultural disease detection applications. We also note confusion between some diseases; for example, the confusion between late blight and septoria may originate from the similar shapes of the lesion areas they produce on leaves. Moreover, under complex background interference, the model does not capture the boundaries and texture features of these lesion areas accurately enough, which affects classification accuracy.
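For readers reproducing this analysis, a row-normalized confusion matrix of the kind shown in Figure 8 can be computed as in the short sketch below; the labels are hypothetical stand-ins for matched detections, with an extra class index representing "background".

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-detection labels; class 3 stands in for "background"
# (a missed ground truth appears as true class -> predicted background).
y_true = np.array([0, 0, 1, 2, 2, 3, 1])
y_pred = np.array([0, 1, 1, 2, 3, 3, 1])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
# Row-normalize: each cell is the fraction of a true class assigned to each prediction.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm_norm.round(2))
```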
Figure 9 presents the original images (Figure 9a) alongside the feature heatmaps of YOLOv8s (Figure 9b) and YOLO-BSMamba (Figure 9c). The red regions in the heatmaps represent the areas on which the model focuses. The figure makes clear that the two models differ in their attention to the diseased areas.
In the heatmap for YOLOv8s shown in
Figure 9b, the distribution of high-attention areas is relatively sparse, with some lesion areas not being effectively marked and some background areas being falsely detected. This indicates that YOLOv8s is prone to interference when dealing with complex backgrounds, leading to a dispersion of attention on the diseased areas and a decrease in recognition accuracy. In contrast, the heatmap for YOLO-BSMamba in
Figure 9c demonstrates denser and more continuous high-attention areas, with the core regions and edge features of the diseased leaves captured more distinctly. This indicates that YOLO-BSMamba focuses its attention more tightly on the diseased areas in complex background scenarios, covers the lesion regions more comprehensively and accurately, and captures the boundaries and morphological details of the affected areas more effectively.
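The text does not specify how the heatmaps in Figure 9 were generated, so the following is a hedged, generic sketch of a hook-based activation heatmap (Grad-CAM-style tools are a common alternative); the tiny stand-in model, the hooked layer, and the random input are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in network and input; in practice these would be the trained detector
# and a real 640x640 image tensor.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.SiLU(),
                      nn.Conv2d(8, 8, 3, padding=1))
image = torch.randn(1, 3, 640, 640)

# Capture the activations of a late layer via a forward hook.
feats = {}
handle = model[2].register_forward_hook(lambda m, i, o: feats.update(map=o.detach()))
with torch.no_grad():
    _ = model(image)
handle.remove()

# Average channel activations, upsample to image size, and min-max normalize
# so the result can be overlaid on the input as a heatmap.
heat = feats["map"].mean(dim=1, keepdim=True)
heat = F.interpolate(heat, size=image.shape[-2:], mode="bilinear", align_corners=False)
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
```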
3.4. Ablation Experiments
To evaluate the influence of the introduced modules on the overall network performance, we carried out a series of ablation studies, and the results on the test set are presented in
Table 5.
Table 5 shows that introducing the HCMamba module alone yields increases of 0.4% in P, 1.2% in R, 0.8% in the F1 score, 2.7% in the mAP@0.5, and 1.6% in the mAP@0.5:0.95, indicating that the HCMamba module can significantly improve the model's general detection performance. When the SimAM attention mechanism is incorporated independently, the model achieves an mAP@0.5 of 0.842 and an mAP@0.5:0.95 of 0.693, improvements of 2.3% and 1.9%, respectively, while the F1 score rises to 0.804. This enhancement is likely attributable to SimAM's ability to suppress non-target information in complex backgrounds, thereby strengthening the representation of critical features. When both the HCMamba module and SimAM are incorporated, the model improves across all evaluation metrics: the F1 score increases to 0.811, and the mAP@0.5 and mAP@0.5:0.95 reach 0.860 and 0.710, respectively. These results suggest that the two modules complement each other: SimAM suppresses non-target information, while HCMamba excels at extracting critical features, together improving the capacity for feature representation. When all three modules (HCMamba, SimAM, and BiFPN) are integrated, the model achieves its best overall performance: P increases to 0.858, the F1 score reaches 0.819, and the mAP@0.5 and mAP@0.5:0.95 improve to 0.867 and 0.720, respectively. Through bidirectional cross-scale connections and weighted feature fusion, BiFPN ensures comprehensive integration of features across levels and scales, effectively mitigating the loss of critical feature information. The integration of BiFPN, HCMamba, and SimAM further improves the model's capacity to detect leaf diseases in complex scenarios.
To further assess the effectiveness of the HCMamba module for tomato leaf disease detection, we compared it experimentally with the Swin Transformer module. The Swin Transformer, an advanced transformer-based model, excels in many computer vision tasks; it has strong feature extraction and representation capabilities, making it suitable for complex image data and for capturing long-range dependencies. In the experiment, we replaced the HCMamba module in the neck network with a Swin Transformer module, and the experimental results are shown in
Table 6.
The experimental results show that HCMamba outperforms the Swin Transformer in recall, F1 score, and mean average precision. Specifically, HCMamba's F1 score of 0.819 surpasses the Swin Transformer's 0.797, indicating a better balance between precision and recall. In addition, HCMamba achieves an mAP@0.5 of 0.846 and an mAP@0.5:0.95 of 0.693, surpassing the Swin Transformer's 0.839 and 0.676, respectively, demonstrating its capability for accurate lesion localization and classification across different confidence thresholds.
3.5. Performance Comparison with YOLO Series Models
A comparative evaluation framework was implemented to assess the YOLO-BSMamba model's performance against other YOLO series models. In the experiment, YOLOv5s, YOLOv6s, YOLOv7-tiny, YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s were used as comparison models, with the same datasets and experimental parameters for training and testing. The experimental results are shown in
Table 7.
As evidenced in
Table 7, the YOLO-BSMamba model demonstrates superior performance compared to the other models, achieving strong results in P, R, F1 score, mAP@0.5, and mAP@0.5:0.95. YOLO-BSMamba achieved a P of 0.858, higher than all other models under comparison. For recall (R), it achieved 0.784, also a strong result. Moreover, its F1 score is notably superior to those of the comparative models, suggesting that the model not only identifies targets accurately but also minimizes omissions. Regarding the mAP@0.5, YOLO-BSMamba achieved 0.864, outperforming all other models. For the mAP@0.5:0.95, YOLO-BSMamba surpassed all models except YOLOv9s, indicating a comparable level of detection accuracy. These results indicate that YOLO-BSMamba strikes a superior balance in comprehensive performance, establishing its competitiveness and practicality for tomato disease detection.
To more clearly evaluate the accuracy of all compared models for tomato disease detection, we compared the mAP@0.5 of each model across all categories, which comprehensively reflects each model's accuracy on every category.
Table 8 presents the mAP@0.5 for the detection of the various diseases across all compared models, with the best performance shown in bold. As the table shows, our proposed YOLO-BSMamba model achieved the highest mAP@0.5 for early blight, healthy, leaf miner, mosaic virus, and septoria. Although its detection accuracy for late blight, leaf mold, spider mites, and yellow leaf curl virus is somewhat lower than that of more recent models such as YOLOv9s, YOLOv10s, and YOLOv11s, YOLO-BSMamba remains the best in overall performance, suggesting that it maintains a competitive advantage in tomato disease detection.
4. Discussion
This study proposed the YOLO-BSMamba model for tomato leaf disease detection in complex background scenarios. The experimental results show that the model's precision, recall, and mean average precision (mAP) are all significantly improved, confirming the model's potential for agricultural applications.
Through ablation experiments, we verified the effectiveness of the HCMamba module, the SimAM attention mechanism, and BiFPN in improving the model's performance. Compared to other models, YOLO-BSMamba has several distinct advantages. Unlike traditional CNN-based models, which struggle to capture long-range dependencies and global context in complex background scenarios, YOLO-BSMamba leverages the SSM within the HCMamba module to model these dependencies effectively, leading to more accurate disease localization and classification. SimAM allows the model to focus on disease-related regions, lessening the influence of complex backgrounds, and BiFPN ensures comprehensive integration of features across scales, improving the model's sensitivity to diseases of different severities and morphologies. YOLO-BSMamba's improved accuracy and generalization ability make it a promising tool for precision agriculture tasks such as disease detection and targeted pesticide application.
Although these components jointly enhance the model's performance, the detection accuracy for certain diseases, such as spider mites and yellow leaf curl virus, remains relatively low. This indicates that the model is inefficient at extracting features of subtle morphological patterns: spider mite infestations present minute chlorotic spots that demand correspondingly fine-scale features, while yellow leaf curl virus manifests as curled leaves whose symptomatic regions occupy a relatively small effective pixel area in the image. There is also confusion between some disease categories; for example, late blight and gray mold show a degree of mutual misclassification. These findings suggest that future model optimization should enhance the ability to distinguish between similar disease characteristics, for instance by introducing more suitable feature-extraction modules or expanding the training data, to improve the model's robustness and accuracy on similar-disease classification tasks.
5. Conclusions
In this study, we proposed the YOLO-BSMamba model, which incorporates the HCMamba module, the SimAM attention mechanism, and BiFPN to address the challenges of detecting tomato leaf diseases in complex background scenarios. Through extensive experiments and comparisons with state-of-the-art models, YOLO-BSMamba demonstrated strong performance across multiple evaluation metrics. On the tomato leaf disease dataset we constructed, YOLO-BSMamba significantly improved precision, recall, F1 score, mAP@0.5, and mAP@0.5:0.95, with respective gains of 3.0%, 3.1%, 3.0%, 4.8%, and 4.3% over YOLOv8s. The ablation experiments further verified the effectiveness of each module and showed that the combination of HCMamba, SimAM, and BiFPN synergistically improves the overall performance of the model. The HCMamba module effectively extracts both fine-grained and coarse-grained features, significantly enhancing the model's capacity to distinguish subtle disease patterns from background noise. The SimAM attention mechanism further refines the focus on disease-relevant regions, while the BiFPN module facilitates efficient multi-scale feature fusion, ensuring reliable detection across diverse scales and complexities. The weighting mechanisms currently used in BiFPN are mainly softmax-based fusion and fast normalized fusion (a minimal sketch of the latter is given below). Although BiFPN has achieved remarkable results through its weighted feature-fusion mechanism, introducing a learnable attention-based weighting mechanism could further optimize feature integration; future research could explore incorporating such a mechanism, since integrating attention into BiFPN's weighting process would allow the model to adaptively learn the importance of different feature paths.
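As a reference for the discussion above, the following is a minimal sketch of EfficientDet-style fast normalized fusion, the non-softmax weighting commonly used in BiFPN; the class name and the default epsilon are illustrative.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Fuse same-shape feature maps with learnable non-negative weights
    normalized by their sum (EfficientDet's fast normalized fusion)."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, *feats: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.weights)       # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)       # cheap normalization without softmax
        return sum(wi * f for wi, f in zip(w, feats))
```

For example, `FastNormalizedFusion(2)(p4_td, p4_in)` would fuse a top-down feature with its same-scale lateral input, and the attention-based variant suggested above would replace these scalar weights with weights predicted from the features themselves.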
Compared to other models, YOLO-BSMamba has several distinct advantages; as discussed above, leveraging the SSM within the HCMamba module lets it capture the long-range dependencies and global context that traditional CNN-based models struggle with, leading to more accurate disease localization and classification. The architectural design of YOLO-BSMamba also possesses a degree of universality, making it potentially applicable to detection tasks for other crop diseases. To verify this universality, future research could conduct experiments on other crop disease datasets and investigate whether the model can be applied, via fine-tuning, to diseases of other crops such as cotton, wheat, or corn. Future research will also focus on optimizing the model architecture through techniques such as pruning and quantization to reduce computational costs. Furthermore, we expect to enhance the model's adaptability to specific diseases, such as spider mites and yellow leaf curl virus, through further optimization of the model structure or adjustment of training strategies, with the aim of achieving even higher detection accuracy and broader applicability in agricultural disease detection.