Real-Time Detection of Apple Leaf Diseases in Natural Scenes Based on YOLOv5

Abstract: Aiming at the problem of accurately locating and identifying multi-scale and differently shaped apple leaf diseases from a complex background in natural scenes, this study proposed an apple leaf disease detection method based on an improved YOLOv5s model. Firstly, the model utilized the bidirectional feature pyramid network (BiFPN) to achieve multi-scale feature fusion efficiently. Then, the transformer and convolutional block attention module (CBAM) attention mechanisms were added to reduce the interference from invalid background information, improving the expression of disease characteristics and increasing the accuracy and recall of the model. Experimental results showed that the proposed BTC-YOLOv5s model (with a model size of 15.8M) can effectively detect four types of apple leaf diseases in natural scenes, with 84.3% mean average precision (mAP). With an octa-core CPU, the model could process 8.7 leaf images per second on average. Compared with the classic detection models SSD, Faster R-CNN, YOLOv4-tiny, and YOLOx, the mAP of the proposed model was increased by 12.74%, 48.84%, 24.44%, and 4.2%, respectively, and the model offered higher detection accuracy and faster detection speed. Furthermore, the proposed model demonstrated strong robustness, with mAP exceeding 80% under strong noise conditions, such as exposure to bright lights, dim lights, and fuzzy images. In conclusion, the new BTC-YOLOv5s was found to be lightweight, accurate, and efficient, making it suitable for application on mobile devices. The proposed method could provide technical support for early intervention and treatment of apple leaf diseases.


Introduction
As one of the top four popular fruits in the world, apple is highly nutritious and provides significant medicinal value [1]. In China, apple production has expanded, making it the world's largest apple producer. However, a variety of diseases hamper the healthy growth of apples, seriously affecting their quality and yield and causing significant economic losses. According to statistics, there are approximately 200 types of apple diseases, most of which occur in apple leaf areas. Therefore, to ensure the healthy development of the apple planting industry, accurate and efficient leaf disease identification and control measures are needed [2].
In traditional disease identification techniques, fruit farmers and experts rely on visual examination based on their experience, a method which is inefficient and highly subjective. With the advance of computer and information technology, image recognition technology has been gradually applied in agriculture. Many researchers have applied machine vision algorithms to extract features such as color, shape, and texture from disease images and input them into specific classifiers to accomplish plant disease recognition tasks [3]. Zhang et al. [4] processed apple disease images using HSI, YUV, and gray models; then, the authors extracted features using genetic algorithms and correlation-based feature selection, and ultimately discriminated apple powdery mildew, mosaic, and rust diseases using an SVM classifier with an identification accuracy of more than 90%. However, the complex image background and the feature extraction, dominated by strong experience,
FGVC7 and FGVC8 [22,23] consist of apple leaf disease images used in the Plant Pathology Fine-Grained Visual Categorization competition hosted by Kaggle. The images were captured by Cornell AgriTech using a Canon Rebel T5i DSLR and smartphones, with a resolution of 4000 × 2672 pixels for each image. There are four kinds of apple leaf diseases, namely rust, frogeye leaf spot, powdery mildew, and scab. These diseases occur frequently and cause significant losses in the quality and yield of apples. Sample images of the dataset are shown in Figure 1.
Figure 1. FGVC7 and FGVC8 disease images.
PlantDoc [24] is a dataset of non-laboratory images constructed by Davinder Singh et al. in 2020 for visual plant disease detection. It contains 2598 images of plant diseases in natural scenes, involving 13 species of plants and as many as 17 diseases. Most of the images in PlantDoc have low resolution, large noise, and an insufficient number of samples, making detection more difficult. In this study, apple rust and scab images were used to enhance and validate the generalization of the proposed model. Examples of disease images are shown in Figure 2.
From the collected datasets, we selected (1) images with light intensity varying with the time of day, (2) images captured using different shooting angles, (3) images with different disease intensities, and (4) images from different disease stages to ensure the richness and diversity of the dataset. Finally, a total of 2099 apple leaf disease images were selected. LabelImg software was used to label the images with the disease type, center coordinates, width, and height of each disease spot. In total, we annotated 10,727 lesion instances, and the annotations are shown in Table 1. The labeled dataset was randomly divided into training and test sets at a ratio of 8:2. This dataset was called ALDD (apple leaf disease data) and was used to train and test the model.
An actual apple orchard is a complex environment containing many disturbances, and the selected data alone is far from sufficient. To enrich the image dataset, mosaic image enhancement [16] and online data enhancement were chosen to expand the dataset. Mosaic image enhancement involves a random selection of 4 images from the training set, which are combined into one image after rotation, scaling, and hue adjustment. This approach not only enriches the image background and increases the number of instances, but also indirectly boosts the batch size. This accelerates model training and is favorable to improving small target detection performance. Online augmentation applies data augmentation during model training, which keeps the sample size unchanged while ensuring the diversity of the overall sample, and improves the model's robustness by continuously expanding the sample space. It mainly includes alterations to hue, saturation, and brightness, as well as translation, rotation, flipping, and other operations. The total size of the dataset is constant; however, the data input in each epoch changes, which is more conducive to fast convergence of the model. Examples of enhanced images are shown in Figure 3.
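The mosaic step described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the implementation used in this study: images are small grayscale nested lists, the rotation and hue adjustments are omitted, and the names `resize_nearest` and `mosaic` are hypothetical. In a real pipeline the bounding-box labels would have to be rescaled and shifted along with each tile.

```python
import random


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list-of-lists image."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def mosaic(images, size=8, rng=None):
    """Combine four images into one size x size canvas around a random centre."""
    assert len(images) == 4
    rng = rng or random.Random()
    # Random split point, kept away from the border so every tile is non-empty.
    cx = rng.randint(size // 4, 3 * size // 4)
    cy = rng.randint(size // 4, 3 * size // 4)
    # (row_start, row_end, col_start, col_end) of the four quadrants.
    quads = [(0, cy, 0, cx), (0, cy, cx, size),
             (cy, size, 0, cx), (cy, size, cx, size)]
    canvas = [[0] * size for _ in range(size)]
    for img, (r0, r1, c0, c1) in zip(images, quads):
        tile = resize_nearest(img, r1 - r0, c1 - c0)  # scale image to its quadrant
        for r in range(r0, r1):
            canvas[r][c0:c1] = tile[r - r0]
    return canvas
```

Because the split point is drawn at random on every call, the same four source images yield a different composite each time, which is how the background variety and the number of instances per training image increase.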

YOLOv5s Model
Depending on the network depth and feature map width, YOLOv5 can be divided into YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x [25]. As the depth and width increase, the number of layers of the network increases and the structure becomes more complex. In order to meet the requirements of lightweight deployment and real-time detection, reduce the storage space occupied by the model, and improve the identification speed, YOLOv5s was selected as the baseline model in this study.
The YOLOv5s was composed of four parts: input, backbone, neck, and prediction. The input section included mosaic data enhancement, adaptive calculation of the anchor box, and adaptive scaling of images. The backbone module performed feature extraction and consisted of four parts: focus, CBS, C3, and spatial pyramid pooling (SPP). There were two types of C3 [26] modules in YOLOv5s for backbone and neck, as shown in Figure 4. The first one used the residual units at the backbone layer, while the second one did not. SPP [27] performed the maximum pooling of feature maps using kernels of different sizes in order to fuse multiple receptive fields and generate semantic information. The neck layer used a combination of feature pyramid networks (FPN) [28] and path aggregation networks (PANet) [29] to fuse the image features. The prediction included three detection layers, corresponding to 20 × 20, 40 × 40, and 80 × 80 feature maps, respectively, for detecting large, medium, and small targets. Finally, the distance between the predicted boxes and the true boxes was calculated using the complete intersection over union (CIOU) [30] loss function, and non-maximum suppression (NMS) was applied to remove the redundant boxes and retain the detection boxes with the highest confidence. The YOLOv5s network model is shown in Figure 4.
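The final NMS step can be sketched as a greedy procedure over plain IoU. This is a minimal illustration rather than the YOLOv5 implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and the 0.45 threshold is an assumed value that the text does not specify.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For densely packed lesions the threshold matters: a low value merges neighbouring spots into one detection, while a high value leaves duplicate boxes on the same spot.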

Bidirectional Feature Pyramid Network
The YOLOv5s combines FPN and PANet for multi-scale feature fusion, with FPN enhancing semantic information in a top-down fashion and PANet enhancing location information from the bottom up. This combination enhances the feature fusion capability of the neck layer. However, when fusing input features at different resolutions, the features are simply summed, although their contributions to the fused output features are usually unequal. To address this problem, Tan et al. [31] developed the BiFPN based on efficient bidirectional cross-scale connections and weighted multi-scale feature fusion. The BiFPN introduced learnable weights in order to learn the importance of different input features, while top-down and bottom-up multi-scale feature fusion was applied iteratively. The structure of BiFPN is shown in Figure 5. The BiFPN removes any node with only one input edge: such a node performs no feature fusion, so its contribution to a network that aims to fuse different features is minimal, and removing it simplifies the bidirectional network. Additionally, an extra edge is added between the input and output nodes that are at the same layer to obtain higher-level fusion features through iterative stacking. The BiFPN introduces a simple and efficient weighted feature fusion mechanism by adding a learnable weight that assigns different degrees of importance to feature maps of different resolutions. The formulas are given in Equations (1) and (2), where $P_i^{in}$ is the input feature of layer i, $P_i^{td}$ is the intermediate feature on the top-down pathway of layer i, $P_i^{out}$ is the output feature on the bottom-up pathway of layer i, $\omega$ is the learnable weight, $\varepsilon = 0.0001$ is a small value to avoid numerical instability, Resize is a downsampling or upsampling operation, and Conv is a convolution operation.
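With the symbols defined above, Equations (1) and (2) can be written in the fast normalized fusion form of Tan et al. [31]. The index convention used here (layer i+1 as the neighbor on the top-down path, layer i-1 as the neighbor on the bottom-up path) and the primes distinguishing the second set of weights are assumptions made for this sketch.

```latex
P_i^{td} = \mathrm{Conv}\!\left(
  \frac{\omega_1 P_i^{in} + \omega_2\,\mathrm{Resize}\!\left(P_{i+1}^{in}\right)}
       {\omega_1 + \omega_2 + \varepsilon}\right) \tag{1}

P_i^{out} = \mathrm{Conv}\!\left(
  \frac{\omega_1' P_i^{in} + \omega_2' P_i^{td} + \omega_3'\,\mathrm{Resize}\!\left(P_{i-1}^{out}\right)}
       {\omega_1' + \omega_2' + \omega_3' + \varepsilon}\right) \tag{2}
```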
The neck layer with BiFPN added a fusion of multi-scale features to provide powerful semantic information to the network. It helped to detect apple leaf diseases of different sizes and alleviated the network's inaccurate identification of overlapping and fuzzy targets.
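The weighted fusion at the core of BiFPN can be sketched in a few lines. This is only an illustration of the normalization: feature maps are flattened to lists of equal length, the Resize and Conv operations are left out, and the function name is hypothetical. Clamping the weights with a ReLU follows the formulation of Tan et al. [31]; the default `eps` matches the ε = 0.0001 quoted in the text.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps (flat lists) with learnable weights.

    Each weight is clamped to be non-negative (ReLU) and the weights are
    normalized by their sum plus eps, so the result is a weighted average.
    """
    w = [max(0.0, x) for x in weights]
    total = sum(w) + eps
    n = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total
            for k in range(n)]
```

Compared with a plain sum, a feature map whose learned weight shrinks towards zero contributes almost nothing to the fused output, which is how the network expresses that inputs at different resolutions are not equally important.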

Transformer Encoder Block
There was a high density of lesions on apple leaves. After mosaic data enhancement, the number of lesions and the amount of background information increased, which made it difficult to accurately locate the diseased areas. To avoid this problem, the transformer [32] attention mechanism was added to the end of the backbone layer. The transformer module was employed to capture global contextual information and establish long-range dependencies between feature channels and disease targets. The transformer encoder module used a self-attention mechanism to explore the feature representation capability and had excellent performance in highly dense scenarios [33]. The self-attention mechanism was designed based on the principles of human vision and allocated resources according to the importance of visual objects. The self-attention mechanism had a global receptive field, which modeled long-range contextual information, captured rich global semantic information, and assigned different weights to different semantic information to make the network focus more on key information [34]. It was calculated as in Equation (3), and contained three basic elements: query, key, and value, denoted by Q, K, and V, respectively.
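In the standard scaled dot-product form, the self-attention of Equation (3) reads:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{3}
```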
where $d_k$ is the number of input feature map channels, which is used to normalize the data and avoid gradient growth. Each transformer encoder is composed of a multi-head attention and a feed-forward neural network. The structure of the multi-head attention mechanism is shown in Figure 6. It differs from the self-attention mechanism in that the self-attention mechanism uses only one set of Q, K, and V values, while multi-head attention uses multiple sets of Q, K, and V values to compute multiple matrices and stitch them together. The different linear transformations correspond to different vector spaces, which can help the current encoding to focus on the current pixels and acquire semantic information about the context [35]. The multi-head attention mechanism enhances the ability to extract disease features by capturing long-distance dependent information without increasing the computational complexity, and improves the model's detection performance.
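The difference between single-head and multi-head attention can be made concrete with a small sketch. It is a simplification under stated assumptions: inputs are lists of row vectors, the learned linear projections that produce Q, K, and V are omitted (each head simply attends over its own slice of the channels), and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out


def multi_head_attention(X, num_heads):
    """Split channels into heads, attend per head, then concatenate.

    No learned projections are applied; this only shows the split/concat logic.
    """
    d = len(X[0])
    assert d % num_heads == 0
    h = d // num_heads
    heads = []
    for i in range(num_heads):
        Xi = [row[i * h:(i + 1) * h] for row in X]
        heads.append(attention(Xi, Xi, Xi))
    # Stitch the per-head outputs back together along the channel axis.
    return [sum((head[t] for head in heads), []) for t in range(len(X))]
```

Each head works on d / num_heads channels, so the total cost stays close to that of a single head over all d channels, which is the sense in which multi-head attention adds representational variety without adding computational complexity.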

Convolutional Block Attention Module
Determining the disease species relies more on local information in the feature map, while the localization of lesions is more concerned with the location information. This model used the CBAM [36] attention mechanism in the improved YOLOv5s to weight the features in space and channels and enhance the model's attention to local and spatial information.
As shown in Figure 7, the CBAM contained two sub-modules: the channel attention module (CAM) and the spatial attention module (SAM), for channel and spatial attention, respectively. The input feature map $F \in \mathbb{R}^{C \times H \times W}$ was first passed through the one-dimensional convolution operation $M_c \in \mathbb{R}^{C \times 1 \times 1}$ of the CAM, and the convolution result was multiplied with the input features. The output result of CAM was then used as input, the two-dimensional convolution operation $M_s \in \mathbb{R}^{1 \times H \times W}$ of the SAM was performed, and then the result was multiplied with the CAM output to obtain the final result. The calculation formulas are given in Equations (4) and (5).
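In the notation of the CBAM paper [36], with $\otimes$ denoting element-wise multiplication, $F'$ the CAM output, and $F''$ the final output (the primed symbols are introduced here for illustration), the two steps read:

```latex
F' = M_c(F) \otimes F \tag{4}

F'' = M_s(F') \otimes F' \tag{5}
```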
The CAM in CBAM focused on the weights of different channels and multiplied the channels with the corresponding weights to increase attention to important channels. The feature map F of size H × W × C was average-pooled and max-pooled to obtain two 1 × 1 × C channel mappings, and then a two-layer shared multi-layer perceptron (MLP) operation was performed. The two outputs were summed element by element, and then a sigmoid activation function was applied to output the final result. The calculation process is shown in Equation (6).
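Following the CBAM paper [36], with $\sigma$ denoting the sigmoid function, the channel attention of Equation (6) can be written as:

```latex
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{6}
```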
As shown in Equation (7), the SAM was more concerned with the location information of the lesions. The CAM output was average-pooled and max-pooled to obtain two H' × W' × 1 channel maps. The final result was obtained by concatenating the two feature maps, followed by a 7 × 7 convolution operation and a sigmoid activation function.
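The two sub-modules can be put together in a small sketch. In the notation of the CBAM paper [36], the spatial attention is $M_s(F') = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')]))$. The code below is an illustration under stated assumptions, not the module used in BTC-YOLOv5s: feature maps are nested lists of shape C × H × W, the MLP has no bias terms, the convolution uses zero padding, and all function and parameter names (`W1`, `W2`, `kernel`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(F, W1, W2):
    """CAM: F is C x H x W; W1 is (C/r) x C and W2 is C x (C/r), shared MLP."""
    def mlp(v):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in W1]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    flat = [[p for row in ch for p in row] for ch in F]
    avg = [sum(ch) / len(ch) for ch in flat]   # global average pooling
    mx = [max(ch) for ch in flat]              # global max pooling
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(F, kernel):
    """SAM: kernel is 2 x k x k, applied to the [avg; max] channel-pooled maps."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(F[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m, plane in enumerate((avg, mx)):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:  # zero padding outside the map
                            s += kernel[m][di][dj] * plane[y][x]
            out[i][j] = sigmoid(s)
    return out


def cbam(F, W1, W2, kernel):
    """Apply channel attention first, then spatial attention."""
    Mc = channel_attention(F, W1, W2)
    F1 = [[[Mc[c] * p for p in row] for row in ch] for c, ch in enumerate(F)]
    Ms = spatial_attention(F1, kernel)
    return [[[Ms[i][j] * p for j, p in enumerate(row)]
             for i, row in enumerate(ch)] for ch in F1]
```

The output has the same shape as the input, which is why the module can be dropped in front of the SPP layer without changing the rest of the network.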

BTC-YOLOv5s Detection Model
Based on the original advantages of the YOLOv5s model, this study proposed an improved BTC-YOLOv5s algorithm for detecting apple leaf diseases. While maintaining detection speed, it improved the accuracy of identifying apple leaf diseases in a complex environment. The proposed algorithm was improved mainly in three parts: the BiFPN, transformer, and CBAM attention mechanism. Firstly, the CBAM module was added in front of the SPP in the YOLOv5s backbone layer to highlight useful information and suppress useless information in the disease detection task, thereby improving the model's detection accuracy. Secondly, the C3 module was replaced with the C3TR module, which contains a transformer, to improve the ability to extract apple leaf disease features. Thirdly, we replaced the concat layer with the BiFPN layer, and a path from the 6th layer was added to the 20th layer. The features generated by the backbone at the same layer were bidirectionally connected with the features generated by the FPN and the PANet to provide stronger information representation capability. Figure 8 shows the overall framework of the BTC-YOLOv5s model for this study.

Experimental Equipment and Parameter Settings
The model was trained and tested on a Linux system running under the PyTorch 1.10.0 deep learning framework, using the following device specifications: Intel(R) Xeon(R) E5-2686 v4 @ 2.30 GHz processor, 64 GB of memory, and NVIDIA GeForce RTX3090 graphics card with 24 GB of video memory. The software was executed on CUDA 11.3, cuDNN 8.2.1, and Python 3.8.
During training, the initial learning rate was set to 0.01, and the cosine annealing strategy was employed to decrease the learning rate. Additionally, the neural network parameters were optimized using the stochastic gradient descent (SGD) method, with a momentum of 0.937 and a weight decay of 0.0005. The model was trained for 150 epochs, the image batch size was set to 32, and the input image resolution was uniformly adjusted to 640 × 640. Table 2 shows the tuned training parameters.
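The shape of a cosine annealing schedule with these settings can be sketched as follows. The initial rate of 0.01 and the 150 epochs come from the text; the final learning rate is not stated, so `final_fraction` (the ratio of the last rate to the initial one) is a hypothetical parameter with an assumed value, and `cosine_lr` is an illustrative name rather than a framework function.

```python
import math


def cosine_lr(epoch, total_epochs=150, lr0=0.01, final_fraction=0.01):
    """Cosine-annealed learning rate from lr0 down to lr0 * final_fraction."""
    # cos_term falls smoothly from 1 (epoch 0) to 0 (last epoch).
    cos_term = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lr0 * (final_fraction + (1 - final_fraction) * cos_term)
```

The rate drops slowly at the start, fastest around the midpoint, and slowly again near the end, which leaves many low-rate epochs for fine adjustment of the weights.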

Model Evaluation Metrics
The evaluation metrics are divided into two aspects: performance assessment and complexity assessment. The model performance evaluation metrics include precision, recall, mAP, and F1 score. The model complexity evaluation metrics include model size, floating point operations (FLOPs), and FPS, which evaluate the computational efficiency and image processing speed of the model.
Precision is the ratio of the correctly predicted positive samples to the total number of samples predicted as positive and is used to measure the classification ability of a model, while the recall measures the ratio of the correctly predicted positive samples to the total number of positive samples. The AP is the integral of precision over recall, that is, the area under the precision-recall curve, and the mAP is the average of AP over all classes, which reflects the overall performance of the model for target detection and classification. F1 score is the harmonic mean of precision and recall, and it uses both precision and recall to evaluate the performance of the model. The calculation formulas are shown in Equations (8)-(12).
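In their standard form, and in the order in which the metrics are introduced above, the five definitions read as follows; writing precision as a function P(R) of recall is a notational assumption.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

AP = \int_0^1 P(R)\, dR, \qquad
mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```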
where TP is the number of positive samples that were correctly detected, FP is the number of negative samples that were incorrectly detected as positive, and FN is the number of positive samples that were missed, i.e., incorrectly predicted as negative.
where n is the number of disease species.
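These metrics are straightforward to compute once the counts are known. The sketch below is illustrative only: the function names are hypothetical, and the AP routine uses the all-point interpolation familiar from PASCAL VOC-style evaluation, which is an assumption about how the integral is approximated rather than a description of the evaluation code used in this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (recalls sorted ascending)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, a class with 8 correct detections, 2 false detections, and 2 missed lesions has a precision, recall, and F1 score of 0.8 each.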
The model size refers to the amount of memory required for storing the model. FLOPs is used to measure the complexity of the model, which is the total number of multiplication and addition operations performed by the model. The lower the FLOPs value, the less computation is required for model inference, and the faster model computation will be. The formula for FLOPs is shown in Equations (13) and (14). The FPS indicates the number of pictures processed per second by the model, which can assess the processing speed and is crucial for real-time disease detection. Considering that the model can be implemented on mobile devices with low computational cost, an octa-core CPU without a graphics card was selected to run the test.
where $C_{in}$ represents the input channel, $C_{out}$ represents the output channel, $K$ represents the convolution kernel size, and $W_{out}$ and $H_{out}$ represent the width and height of the output feature map, respectively.
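As an illustration of how such counts are obtained, the sketch below uses one common convention that counts multiplications and additions separately and ignores bias terms. Whether Equations (13) and (14) take exactly this form is an assumption, and the function names are hypothetical.

```python
def conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply and add operations of one k x k convolution layer (no bias).

    Each output element needs c_in * k * k multiplications and
    c_in * k * k - 1 additions.
    """
    return (2 * c_in * k * k - 1) * h_out * w_out * c_out


def linear_flops(c_in, c_out):
    """Multiply and add operations of one fully connected layer (no bias)."""
    return (2 * c_in - 1) * c_out
```

Under this convention, a 3 × 3 convolution from 3 to 32 channels with a 320 × 320 output map costs about 0.17 GFLOPs; summing such terms over all layers gives the model-level figure.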

Performance Evaluation
The proposed BTC-YOLOv5s model was validated using the constructed ALDD test set. Additionally, the same optimized parameters were used to compare results with the YOLOv5s baseline model. As shown in Table 3, the improved model achieved similar AP scores for frogeye leaf spots as the original model, while significantly improving the detection performance for the other three diseases. Notably, scab disease, with its irregular lesion shape, was the most difficult to detect, and the improved model achieved a 3.3% increase in AP, which was the largest improvement. These results indicated that the proposed model effectively detected all four diseases with improved accuracy. Figure 9 shows evaluation results of precision, recall, mAP@0.5, and mAP@0.5:0.95 for the baseline model YOLOv5s and the improved model BTC-YOLOv5s trained with 150 epochs.
Figure 9 shows that the precision and recall curves fluctuated within a narrow range after 50 epochs, but that the BTC-YOLOv5s curve remained consistently above the baseline model curve. The mAP@0.5 curve of the improved model intersected with that of the baseline model at around 60 epochs. Although the mAP@0.5 of the baseline model increased rapidly in the early stage, the BTC-YOLOv5s model improved steadily in the later stage and showed better results. The mAP@0.5:0.95 curve also demonstrated a similar behavior.
As apple leaf diseases were small and densely distributed, for further verification of the BTC-YOLOv5s model's accuracy, the test set was divided into two groups based on lesion density, namely sparse distribution and dense distribution of lesions. We compared the detection results of the baseline model and the improved model. The mAP@0.5 of the BTC-YOLOv5s model for sparse and dense lesion images was 87.3% and 81.4%, respectively, which was 1.7% and 0.7% higher than that of the baseline model. As shown in Figure 10, yellow circles represent missed detections and red circles represent false detections. It can be seen that, irrespective of whether the disease is sparse or dense, the baseline model YOLOv5s missed small or blurred lesions (the first row of images in Figure 10a,b). However, the improved model resolved this issue and detected small lesions or diseases on the leaves that were not in the focus range (the second row of images in Figure 10a,b). Additionally, the BTC-YOLOv5s model had higher confidence levels. The baseline model also mistakenly detected non-diseased parts such as apples, background, and other irrelevant objects (Figure 10(a3,b1)), and there was a false detection whereby scab was mistakenly detected as rust (Figure 10(b5)). The improved model could concentrate more on diseases and extract the distinguishing characteristics between different diseases at a deeper level to avoid the above errors. Furthermore, the lesions of frogeye leaf spot, scab, and rust were small, dense, and distributed in different parts of the leaves, while powdery mildew typically affected the whole leaf. As a result, the scale of the detection boxes ranged from large to small, and the proposed model was able to adapt well to the scale changes of different diseases.
Therefore, the BTC-YOLOv5s model could not only adapt to the detection of different disease distributions but could also adapt to the changes in apple leaf diseases with different scales and characteristics, showing excellent detection results.

Results of Ablation Experiments
This study verified the effectiveness of different optimization modules via ablation experiments. We constructed several improved models by adding the BiFPN module (BF), transformer module (TR), and CBAM attention module sequentially to the baseline model YOLOv5s and compared the results on the same test data. The experimental results are shown in Table 4.
In Table 4, the precision and mAP@0.5 of the baseline model YOLOv5s were 78.4% and 82.7%. By adding three optimization modules, namely the BiFPN module, transformer module, and CBAM attention module, both precision and mAP@0.5 were improved compared to the baseline model. Specifically, the precision increased by 3.3%, 3.3%, and 1.1%, respectively, and the mAP@0.5 increased by 0.5%, 1%, and 0.2%, respectively. The final combination of all three optimization modules achieved the best results, with precision, mAP@0.5, and mAP@0.5:0.95 all reaching the highest values, which were 5.7%, 1.6%, and 0.1% higher than those of the baseline model, respectively. By fusing cross-channel information with spatial information, the CBAM attention mechanism focused on important features while suppressing irrelevant ones. Additionally, the transformer module used the self-attention mechanism to establish long-range dependencies between feature channels and the disease features. The BiFPN module fused the above features across scales to improve the identification of overlapping and fuzzy targets. As a result of the combination of three modules, the BTC-YOLOv5s model achieved the best performance. In Table 4, BF and TR represent the BiFPN module and transformer module, respectively.

Analysis of Attention Mechanisms
In order to assess the effectiveness of the CBAM attention mechanism module, other structures of the BTC-YOLOv5s model were retained as experimental parameter settings, and only the CBAM module was replaced with other mainstream attention mechanism modules, such as SE [37], CA [38], and ECA [39] modules, for comparison purposes.
Table 5 shows that the attention mechanism could significantly improve the accuracy of the model. The mAP@0.5 of SE, CA, ECA, and CBAM models reached 83.4%, 83.6%, 83.6%, and 84.3%, respectively, which was 0.4%, 0.6%, 0.6%, and 1.3% higher than that of the YOLOv5s + BF + TR model. Each attention mechanism improved the mAP@0.5 to varying degrees, with the CBAM model performing the best and reaching 84.3%, which was 0.9%, 0.7%, and 0.7% higher than that of SE, CA, and ECA models, respectively, and the mAP@0.5:0.95 was also the highest among the four attention mechanisms. The SE and ECA attention mechanisms only took into account the channel information in the feature map, while the CA attention mechanism encoded the channel relations using the location information. In contrast, the CBAM attention mechanism combined spatial and channel attention, emphasizing the information on disease features in the feature map, which was more conducive to disease identification and localization. Moreover, the attention module did not increase the model size or FLOPs, indicating that it was a lightweight module. The BTC-YOLOv5s model with the CBAM module achieved improved recognition accuracy while maintaining the same model size and computational cost.

Comparison of State-of-the-Art Models
The current mainstream two-stage detection model Faster R-CNN and the one-stage detection models SSD, YOLOv4-tiny, and YOLOx-s were selected for comparison experiments. The ALDD dataset was used for training and testing, with the same experimental parameters across all models. The experimental results are shown in Table 6. Among all models, the mAP@0.5 and F1 score of Faster R-CNN were lower than 50%, with a large model size and computational effort, resulting in only 0.16 FPS, making it unsuitable for real-time detection of apple leaf diseases. The one-stage detection model SSD had an mAP@0.5 value of 71.56% and a model size of 92.1 MB, which did not meet the detection requirements in terms of model accuracy and complexity. In the YOLO model series, YOLOv4-tiny had an mAP@0.5 of only 59.86%, and the accuracy was too low. The YOLOx-s achieved 80.1% mAP@0.5, but the FLOPs were 26.64 G, and it processed only 4.08 pictures per second. Neither of them was conducive to mobile deployment. The proposed BTC-YOLOv5s model had the highest mAP@0.5 and F1 score among all models, exceeding SSD, Faster R-CNN, YOLOv4-tiny, YOLOx-s, and YOLOv5s by 12.74%, 48.84%, 24.44%, 4.2%, and 1.6%, respectively. The model size and FLOPs were similar to those of the baseline model, and the FPS reached 8.7 frames per second, meeting the needs of real-time detection of apple leaf diseases in real scenarios.
As seen in Figure 11, the BTC-YOLOv5s model outperformed the other five models in terms of detection accuracy. Additionally, the BTC-YOLOv5s model exhibited comparable model size, computational effort, and detection speed to the other lightweight models. In summary, the overall performance of the BTC-YOLOv5s model was excellent and could accomplish accurate and efficient apple leaf disease detection tasks in real-world scenarios.

Robustness Testing
In actual production, the detection of apple leaf diseases may be affected by various environmental factors such as overexposure, dim light, and low-resolution images. In this study, these conditions were simulated on the test set images by enhancing brightness, reducing brightness, and adding Gaussian noise, resulting in a total of 1191 images (397 images per case). We evaluated the robustness of the optimized BTC-YOLOv5s model under a variety of interference environments to determine its detection effectiveness. Additionally, we tested the model's ability to detect concurrent diseases by adding 50 images containing multiple diseases. Experimental results are shown in Figure 12.
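The three perturbations can be sketched on a small grayscale image stored as nested lists of 8-bit values. The brightness factors and the noise standard deviation used in this study are not stated, so the values passed in are up to the caller, and the function names are hypothetical.

```python
import random


def clip(v):
    """Round to the nearest integer and clamp to the 8-bit range [0, 255]."""
    return max(0, min(255, int(round(v))))


def adjust_brightness(img, factor):
    """factor > 1 simulates bright light, factor < 1 simulates dim light."""
    return [[clip(p * factor) for p in row] for row in img]


def add_gaussian_noise(img, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to each pixel."""
    rng = rng or random.Random()
    return [[clip(p + rng.gauss(0.0, sigma)) for p in row] for row in img]
```

Applying each function once to every test image gives three perturbed copies of the test set, which matches the structure of 397 images per condition described above.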
From the detection results, the model could accurately detect frogeye leaf spot, rust, and powdery mildew images under all three noise conditions (bright light, dim light, and blurry), with few missed detections. Scab was also correctly identified, but a certain degree of missed detections occurred in dim light and blurry conditions. This is mainly because the scab lesions appeared black, and under dim light conditions the overall background of the image had a color similar to that of the lesions. As shown in the fifth row of Figure 12, the model also demonstrated detection capabilities for images with concurrent diseases, although a few missed detections occurred in the blurry condition. The experimental results achieved an mAP of more than 80%. Overall, the BTC-YOLOv5s model still exhibited strong robustness under extreme conditions, such as blurred images and insufficient light.

Multi-Scale Detection
Multi-scale detection is a challenging task in apple leaf disease detection due to the varying sizes of the lesions. In this study, frogeye leaf spot, scab, and rust lesions are typically small and dense, while powdery mildew is a whole lesion distributed over the leaf. The size of the spots that need to be detected relative to the proportion of the whole image can vary widely between images or even within the same image. To address this issue, this study introduced the BiFPN into YOLOv5s based on the idea of multi-scale feature fusion to improve the model's ability. The BiFPN stacks the entire feature pyramid framework multiple times, providing the network with strong feature representation capabilities. It also performs weighted feature fusion, allowing the network to learn the significance of different input features. In the field of agricultural detection, multi-scale detection has been a popular research topic. For example, Li et al. [21] accomplished multi-scale cucumber disease detection by adding a set of anchors matching small instances. Cui et al. [40] used a squeeze-and-excitation feature pyramid network to fuse multi-scale information, retaining only the 26 × 26 detection head for pinecone detection. However, the current study still faces the challenge of significantly degraded detection accuracy for very large- or very small-scale targets. Future studies will focus on exploring how models can be applied to different scales of disease spots.

Attentional Mechanisms
The attention mechanism assigns weights to the image features extracted by the model, enabling the network to focus on target regions with important information while suppressing irrelevant information and reducing the interference of irrelevant backgrounds on detection results. Introducing an attention mechanism can effectively enhance the detection model's feature learning ability, and many researchers have incorporated it to improve model performance. For example, Liu et al. [41] added the SE attention module to YOLOX to enhance the extraction of cotton boll feature details. Bao et al. [42] added a dual-dimensional mixed attention (DDMA) module to the neck of the detection model, which parallelizes coordinate attention with channel and spatial attention to reduce missed and false detections caused by dense blade distribution. This study used the CBAM attention mechanism to enhance the BTC-YOLOv5s model's feature extraction ability. CBAM comprises two submodules, the spatial attention module (SAM) and the channel attention module (CAM); using either submodule alone yielded an accuracy of 83.2% and 83.1%, respectively, both inferior to the performance of the model using CBAM. SAM and CAM each attend to only one dimension (spatial or channel), whereas CBAM combines both and therefore considers useful information from both the feature channels and the spatial dimensions, making it more beneficial for locating and identifying lesions.
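To make the combination of channel and spatial attention concrete, the following is a minimal NumPy sketch of the usual CBAM formulation, F' = M_c(F) ⊗ F followed by F'' = M_s(F') ⊗ F'. It is not the implementation used in this study: the random weights stand in for learned parameters, and the reduction ratio of 4 and the 7 × 7 spatial kernel are assumed values chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CAM: shared two-layer MLP over average- and max-pooled channel descriptors."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)
    avg = feat.mean(axis=(1, 2))  # shape (C,)
    mx = feat.max(axis=(1, 2))    # shape (C,)
    return sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(feat, kernel):
    """SAM: k x k convolution over the channel-wise average and max maps."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # shape (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = feat.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)

def cbam(feat, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    refined = channel_attention(feat, w1, w2)[:, None, None] * feat    # F'
    return spatial_attention(refined, kernel)[None, :, :] * refined    # F''

# Placeholder input and weights (hypothetical sizes: C=8, reduction ratio 4).
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
refined_map = cbam(feature_map, w1, w2, kernel)
```

Because both attention maps lie in (0, 1), the output keeps the input's shape and can only attenuate activations; setting one of the two maps to all ones recovers the CAM-only or SAM-only variant compared in the text.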

Outlook
Although the proposed model can accurately identify apple leaf diseases, some issues still deserve attention and further study. Firstly, the dataset used in this study contains images of only four disease types, whereas there are approximately 200 apple diseases in total. Therefore, future research will include images of more species and different disease stages. Secondly, the model's accuracy is lower when lesions are densely distributed, decreasing significantly compared with the sparse case. The detection results showed that scab had the highest error rate, mainly because its irregular lesion shape and indistinct borders interfered with detection. In the future, scab will be treated as a separate research topic to improve the model's detection accuracy.

Conclusions
This study proposed BTC-YOLOv5s, an improved detection model based on YOLOv5s, to address the missed and false detections caused by the varied shapes, multiple scales, and dense distribution of apple leaf lesions. To enhance the overall detection performance of the original YOLOv5s model, the study introduced the BiFPN module, which increases the fusion of multi-scale features and provides more semantic information. Additionally, the transformer and CBAM attention modules were added to improve the ability to extract disease features. Results indicated that the BTC-YOLOv5s model achieved an mAP@0.5 of 84.3% on the ALDD test set, with a model size of 15.8 M and a detection speed of 8.7 FPS on an octa-core CPU device. It also maintained good performance and robustness under extreme conditions. The improved model has high detection accuracy, fast detection speed, and low computational requirements, making it suitable for deployment on mobile devices for real-time monitoring and intelligent control of apple diseases.

Figure 5. BiFPN network structure diagram, where (a) FPN introduces a top-down path to fuse multi-scale features from P3 to P6; (b) PANet adds an additional bottom-up path on top of the FPN; (c) BiFPN removes redundant nodes and adds additional connections on top of PANet.
$$F' = M_c(F) \otimes F$$
$$F'' = M_s(F') \otimes F'$$
where $F$ denotes the input feature map, $M_c$ denotes the one-dimensional convolution operation of CAM, $M_s$ denotes the two-dimensional convolution operation of SAM, and $\otimes$ denotes element-wise multiplication.

Convolutional block attention module (CBAM).

Figure 9. Evaluation metrics of different models: (a) comparison of precision curves before and after model improvement; (b) comparison of recall curves before and after model improvement; (c) comparison of mAP@0.5 curves before and after model improvement; (d) comparison of mAP@0.5:0.95 curves before and after model improvement.

Figure 10. Comparison of detection effect of lesions (sparse and dense) before and after model improvement. (a) Sparse distribution; (b) Dense distribution. Yellow circles represent missed detections and red circles represent false detections. Rows 1 and 3 show the YOLOv5s baseline model, and rows 2 and 4 show the improved BTC-YOLOv5s model. Numbers 1 and 2 are frogeye leaf spot, numbers 3 and 4 are rust, numbers 5 and 6 are scab, and numbers 7 and 8 are powdery mildew.

Figure 11. Performance comparison of different detection algorithms.

Figure 12. Robustness test results under three extreme conditions. (a) Original; (b) Bright light; (c) Dim light; (d) Blurry. The first to fifth rows show results for apple frogeye leaf spot, rust, scab, powdery mildew, and multiple diseases, respectively.

Table 2. Model training parameters.

Table 3. Comparison of detection results of YOLOv5s and BTC-YOLOv5s.

Table 4. Results of ablation experiments.

Table 5. Performance comparison of different attention mechanisms.

Table 6. Performance comparison of mainstream detection models.